![]() |
|
| code in cache | local data in cache | seq. access | non-shared data | no memory alloc | |
|---|---|---|---|---|---|
| compfib | O | O | N/A | O | O |
| intfib | X | X | N/A | O | O |
| inner-prod | O | O | O | X | O |
| list-replace | O | O | X | X | O |
| copy-seq | O | O | X | X | XX |
| body int-check | X | X | X | X | X |
As can be seen, memory hungry programs cannot run faster even with many processors are assigned. The reason is the mutual exclusion in the memory allocation function `alloc'. Since the free list database is global in the process, every memory reques must be mutex'ed, which cost much when many threads are running concurrently.
Here, even the two memory hungry benchmarks are running faster with more than two processors. The parallel performance is improved. But it is still obvious that the parallel gain never goes up higher than two. This is not because our thread local memory management is not effective. Though the scheme is working well as we expected, there is the large serial fraction, the Garbage Collection. The GC is running sequentially after suspending all mutator threads. The runtime ratio of GC against total execution time is shown in the next graph.
When there is only one thread, approximately 30-35% of CPU time is spent for the GC. The ratio rises with the concurrency, and 60-70% of the elapsed time is spent for the GC when there are 16 threads. With this hard serial fraction, the parallel gain never goes higher than two. Namely, parallel garbage collection is needed.