Results

Parallel Performance Evaluation (Process-based Matrix Multiplication)

This page reports the execution-time results for matrix multiplication under process-based parallelism using fork() and wait().

Across the tested configurations, increasing PROCS from 1 → 4 did not reduce execution time. In most cases, runtime increased due to process and synchronization overhead.


Executive Summary

  • Main Finding


    Using 4 processes generally increased execution time compared to 1 process.

  • Cause


    Overhead from process creation (fork()), scheduling, and synchronization (wait()) dominated at these problem sizes.

  • Data Quality


    Low variability across runs (standard deviation ≤ 0.00094s) indicates consistent behavior in the VM environment.


1) Complete Experimental Results

1.1 Raw Execution Time Data (3 runs)

Matrix Size (N)   PROCS   Run 1 (s)   Run 2 (s)   Run 3 (s)   Avg (s)   Std Dev (s)
1200              1       0.001       0.001       0.003       0.0016    0.00094
1200              4       0.002       0.002       0.001       0.0016    0.00047
1800              1       0.001       0.001       0.001       0.0010    0.00000
1800              4       0.003       0.003       0.003       0.0030    0.00000
2400              1       0.002       0.001       0.001       0.0013    0.00047
2400              4       0.004       0.003       0.003       0.0033    0.00047

1.2 Derived Performance Metrics (per N)

Definitions (per fixed N), where T1 is the average time with 1 process and Tp the average time with p processes:

  • Speedup = T1 / Tp
  • Efficiency = Speedup / p
  • Overhead Factor = (Tp - T1) / T1

Matrix Size (N)   Speedup (P=4 vs P=1)   Efficiency (P=4)   Overhead Factor (P=4 vs P=1)
1200              1.00×                  0.25               0%
1800              0.33×                  0.083              200%
2400              0.39×                  0.098              154%

Interpretation: for N=1800 and N=2400, using 4 processes is slower than using 1 process (speedup < 1), indicating negative scaling.


2) Key Observations

  • Effect of Increasing PROCS


    • N=1200: equal performance (0.0016s vs 0.0016s)
    • N=1800: 3× slower with 4 processes (0.0030s vs 0.0010s)
    • N=2400: ~2.5× slower with 4 processes (0.0033s vs 0.0013s)
  • Effect of Increasing N


    • For P=1, timings are near the timer's resolution and are not strictly monotonic in N
    • For P=4, runtime increases notably from N=1200 → 1800 → 2400
  • Variability


    Standard deviations are small relative to the differences between P=1 and P=4 for N=1800 and N=2400, suggesting the slowdown is consistent.


3) Visual Analysis (Text-Based)

3.1 Execution Time vs. Number of Processes

Execution Time (s)  (averages, rounded to 0.001)
0.004 |
0.003 |                  ● (1800, P=4)   ● (2400, P=4)
0.002 |  ● (1200, P=1)
      |  ● (1200, P=4)
0.001 |                  ● (1800, P=1)   ● (2400, P=1)
      +-----------------------------------------------
             N=1200          N=1800          N=2400

3.2 Efficiency Summary

Efficiency (P=4)
0.25 |  ● N=1200
0.10 |                          ● N=2400
0.08 |              ● N=1800
     +----------------------------------
        N=1200      N=1800      N=2400

4) Breakdown: Why PROCS=4 Can Be Slower

  • Process Creation


    fork() has non-trivial overhead (process metadata + scheduling + memory mapping behavior), especially noticeable when total runtime is very small.

  • Synchronization


    The parent waits for all children (wait()), so total runtime includes coordination overhead.

  • Scheduling / Context Switching


    Multiple runnable processes introduce context-switching costs, which can outweigh benefits for fine-grained workloads.

  • Memory Behavior


    With large arrays, memory bandwidth and cache behavior can dominate. Multiple processes may increase contention rather than reduce time.


5) Anomalies & Measurement Notes

5.1 Observed Anomalies

  • N=1800, P=1 appears faster than N=1200, P=1
  • P=1 timings are very small and may be affected by timing resolution and VM noise

5.2 Measurement Considerations

  • clock() measures CPU time, not wall-clock time, and on many platforms it excludes CPU time consumed by child processes, which can distort timings for a forking program
  • At near-millisecond values, small effects (scheduler, VM activity, cache) can produce non-monotonic behavior
  • Despite this, the P=4 slowdowns at N=1800 and N=2400 are large and consistent

6) Conclusion

The results show that naive process-based parallelism using fork() and wait() is inefficient for the tested matrix sizes (N ≤ 2400). Overhead dominates computation, producing negative scaling when using 4 processes.

These results support the OS-level principle that parallelism is only beneficial when the workload is large enough to amortize process management overhead.


Data collected on: 2026-02-09
Environment: Ubuntu 24.04.3 LTS (ARM64), 4 CPU cores, 4 GB RAM
Compiler: GCC 11.4.0 (-Wall)
Timing: clock() with CLOCKS_PER_SEC scaling