Results¶

Parallel Performance Evaluation (Process-based Matrix Multiplication)

This page reports the execution-time results for matrix multiplication under process-based parallelism using fork() and wait().

Across the tested configurations, increasing PROCS from 1 → 4 did not reduce execution time. In most cases, runtime increased due to process and synchronization overhead.

Executive Summary¶

Main Finding

Using 4 processes generally increased execution time compared to 1 process.
Cause

Overhead from process creation (fork()), scheduling, and synchronization (wait()) dominated at these problem sizes.
Data Quality

Low variability across runs (standard deviation ≤ 0.00094s) indicates consistent behavior in the VM environment.

1) Complete Experimental Results¶

1.1 Raw Execution Time Data (3 runs)¶

Matrix Size (N)	PROCS	Run 1 (s)	Run 2 (s)	Run 3 (s)	Avg (s)	Std Dev
1200	1	0.001	0.001	0.003	0.0016	0.00094
1200	4	0.002	0.002	0.001	0.0016	0.00047
1800	1	0.001	0.001	0.001	0.0010	0.00000
1800	4	0.003	0.003	0.003	0.0030	0.00000
2400	1	0.002	0.001	0.001	0.0013	0.00047
2400	4	0.004	0.003	0.003	0.0033	0.00047

1.2 Derived Performance Metrics (per N)¶

Definitions (per fixed N):

Speedup = ( T_{1} / T_{p} )
Efficiency = ( \text{Speedup} / p )
Overhead Factor = ( (T_{p} - T_{1}) / T_{1} )

Matrix Size (N)	Speedup (P=4 vs P=1)	Efficiency (P=4)	Overhead Factor (P=4 vs P=1)
1200	1.00×	0.25	0%
1800	0.33×	0.083	200%
2400	0.39×	0.098	154%

Interpretation: for N=1800 and N=2400, using 4 processes is slower than using 1 process (speedup < 1), indicating negative scaling.

2) Key Performance Trends¶

Effect of Increasing PROCS
- N=1200: equal performance (0.0016s vs 0.0016s)
- N=1800: 3× slower with 4 processes (0.0030s vs 0.0010s)
- N=2400: ~2.5× slower with 4 processes (0.0033s vs 0.0013s)
Effect of Increasing N
- For P=1, results are small and not strictly monotonic at this timing scale
- For P=4, runtime increases notably from N=1200 → 1800 → 2400
Variability

Standard deviations are small relative to the differences between P=1 and P=4 for N=1800 and N=2400, suggesting the slowdown is consistent.

3) Visual Analysis (Text-Based)¶

3.1 Execution Time vs. Number of Processes¶

Execution Time (s)
0.004 |                         ● (2400,4)
0.003 |              ● (1800,4)  ● (2400,4)
0.002 | ● (1200,4)
0.001 | ● (1200,1)  ● (1800,1)  ● (2400,1)
+-------------------------------------
1 Process            4 Processes

3.2 Efficiency Summary¶

Efficiency (P=4)
0.25 | ● N=1200
0.10 |                 ● N=2400
0.08 |         ● N=1800
+-----------------------------
1200       1800      2400

4) Breakdown: Why PROCS=4 Can Be Slower¶

Process Creation

fork() has non-trivial overhead (process metadata + scheduling + memory mapping behavior), especially noticeable when total runtime is very small.
Synchronization

The parent waits for all children (wait()), so total runtime includes coordination overhead.
Scheduling / Context Switching

Multiple runnable processes introduce context-switching costs, which can outweigh benefits for fine-grained workloads.
Memory Behavior

With large arrays, memory bandwidth and cache behavior can dominate. Multiple processes may increase contention rather than reduce time.

5) Anomalies & Measurement Notes¶

5.1 Observed Anomalies¶

N=1800, P=1 appears faster than N=1200, P=1
P=1 timings are very small and may be affected by timing resolution and VM noise

5.2 Measurement Considerations¶

clock() measures CPU time, not wall-clock time
At near-millisecond values, small effects (scheduler, VM activity, cache) can produce non-monotonic behavior
Despite this, the P=4 slowdowns at N=1800 and N=2400 are large and consistent

6) Conclusion¶

The results show that naive process-based parallelism using fork() and wait() is inefficient for the tested matrix sizes (N ≤ 2400). Overhead dominates computation, producing negative scaling when using 4 processes.

These results support the OS-level principle that parallelism is only beneficial when the workload is large enough to amortize process management overhead.

Data collected on: 2026-02-09
Environment: Ubuntu 24.04.3 LTS (ARM64), 4 CPU cores, 4 GB RAM
Compiler: GCC 11.4.0 (-Wall)
Timing: clock() with CLOCKS_PER_SEC scaling