Zak Singh's Blog

On Implicit Parallelism

September 23rd, 2024

Parallelizing code requires specifying not only what the program should compute, but also how to organize and distribute the computation across threads, SIMD lanes, or GPUs. Managing this bookkeeping on top of writing a correct and efficient algorithm is difficult, and oftentimes an elegant sequential algorithm cannot be parallelized simply: it demands ugly wrangling of thread creation, data movement, and synchronization.

Writing parallelizable code is an art which practitioners can spend their entire career studying: the field of High Performance Computing (HPC) is a kind of cottage industry existing in relative isolation from software engineering as a whole, trading knowledge of esoteric APIs and hardware eccentricities in order to maximize performance.

For decades it was acceptable to relegate parallelization to special-purpose work and specialist programmers, as single-thread performance increased at a rate which far overshadowed the speedup to be gleaned from the extra one to three cores that the best consumer hardware offered through the 90s and 2000s.

However, as single-thread performance has leveled off with the ending of Moore's Law, we now face the difficult challenge of parallelizing our code in order to see continued progress in performance. The number of cores on consumer hardware has roughly doubled since 2010, with even smartphones now having up to twelve. GPUs are now commonplace, offering thousands of cores which sit unused by general-purpose programs. The software we write has yet to adjust to the multicore future we're already in.

It seems that asking software engineers to learn to parallelize their code using the APIs available today is untenable, as:

  1. Current parallelization methods require rewriting code in a way that distracts from its purpose, requiring more work from a reader to understand (the sketch after this list makes the contrast concrete). Oftentimes poorly documented hardware-specific APIs need to be woven into the codebase, worsening the situation further.
  2. Incorrect concurrent code has serious implications. For example, deadlock is unacceptable for consumer-facing applications.
  3. Debugging is notoriously difficult in the presence of parallelism.
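
To make the first point concrete, here is a small sketch (the helper `score`, the records, and the worker count are all invented for illustration): the sequential version states the algorithm directly, while the explicitly parallel version entangles it with scheduling machinery.

```python
from concurrent.futures import ThreadPoolExecutor

def score(record):
    # Stand-in for some pure, per-record computation.
    return sum(ord(c) for c in record) % 97

records = ["alpha", "beta", "gamma", "delta"]

# Sequential version: the intent is immediate.
scores = [score(r) for r in records]

# Explicitly parallel version: the same algorithm, now entangled with
# pool sizing, work submission, and result collection.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_scores = list(pool.map(score, records))

assert scores == parallel_scores
```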

How can we make our programs parallelizable without making them even more bug-ridden and complex to understand? One answer: make it the compiler's problem.

Relying on the compiler and runtime to manage parallelism for you is an example of implicit parallelism, as opposed to the explicit parallelism we're accustomed to today. In an ideal implicitly parallel language, you would write your program in your usual fashion and style, and the compiler would take care of distributing work across cores to maximize performance. Roughly speaking, your program would no longer execute from top to bottom – the only remaining sequentiality would come from data dependencies, where one instruction relies on the output of another.
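
As a minimal sketch (the functions and inputs here are hypothetical stand-ins), consider three statements where only the last has a data dependency on the others; an implicitly parallel compiler would be free to evaluate the first two concurrently:

```python
def expensive_f(x): return x * x      # stand-ins for real work
def expensive_g(y): return y + 1
def combine(a, b):  return a - b

x, y = 3, 4
a = expensive_f(x)   # does not read b: could run in parallel...
b = expensive_g(y)   # ...with the line above
c = combine(a, b)    # reads both a and b, so it must run last
```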

Unfortunately, fully implicitly parallel languages have made only sparse appearances in academic papers, and have yet to bear fruit for real-world computation. Some examples you have likely never heard of are Id, pH, and SISAL. Dating from the 1980s and 90s, these languages were remarkably ahead of their time – perhaps too far, given the limitations of the hardware they ran on.

Why hasn't Implicit Parallelism become mainstream?

First off, it is easy to see that obtaining a high degree of implicit parallelism is essentially impossible in imperative, mutable languages. Static analysis in this context can only get us so far. However, functional programming languages provide a much more promising basis for implicit parallelism – in fact, you have likely heard it touted that a benefit of immutability is that it makes concurrency much simpler. If you're familiar with the Lambda Calculus, parallelism can be as simple as evaluating different branches of the expression tree concurrently.
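
To illustrate the idea (this is a toy evaluator, not how any real runtime works), here is a sketch that evaluates the two sides of each pure expression node concurrently; a production implementation would use work stealing rather than naive task submission:

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor()

def eval_tree(expr):
    # Expressions are literals or nested tuples: (op, left, right).
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    # The subtrees are pure and share no state, so one branch can be
    # shipped to the pool while we evaluate the other in place.
    future = pool.submit(eval_tree, left)
    r = eval_tree(right)
    l = future.result()
    return l + r if op == "+" else l * r

print(eval_tree(("+", ("*", 2, 3), ("*", 4, 5))))  # 26
```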

However, in practice, functional programming languages rarely take advantage of this parallelization opportunity by default, instead relying on explicit parallelization annotations from the programmer. This disappointing truth can be attributed to a number of causes:

  1. Real-world programs involve side-effects such as I/O, which make safe automatic parallelization complicated or outright impossible.
  2. As per Amdahl's Law, programs vary in how much of their work can be parallelized. This means that implicitly parallelizing code can have little to no effect (or even introduce overhead) compared to sequential execution.
  3. Abundant parallelization can worsen performance if the granularity of the parallelizable work is too fine, as the overheads of thread creation and communication outweigh the speedup from parallel computation. Any compiler which implicitly parallelizes code must therefore carefully consider this granularity problem (see the sketch after this list).
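
A common mitigation for the third point is a sequential cutoff: work below some size threshold is done in place rather than spawned. Here is a rough sketch of the idea an implicitly parallel compiler would have to automate (the threshold value and the choice of summation as the workload are invented for illustration):

```python
from concurrent.futures import ProcessPoolExecutor

THRESHOLD = 100_000  # below this, spawning costs more than it saves

def chunked_sum(xs, pool):
    # Granularity control: inputs too small to amortize the cost of
    # shipping work to another process are summed right here.
    if len(xs) <= THRESHOLD:
        return sum(xs)
    chunks = [xs[i:i + THRESHOLD] for i in range(0, len(xs), THRESHOLD)]
    return sum(pool.map(sum, chunks))

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        print(chunked_sum(list(range(2_000_000)), pool))
```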

In light of these challenges, it appears that compiler designers have (nearly) universally decided that implicit parallelization at this level is not worth the effort and complexity for general purpose programming languages. But that doesn't mean all is lost!

Paradigms of Implicit Parallelism

While implicit parallelism failed to gain traction at the language and runtime level, it has thrived in library internals and domain-specific languages.

Array Languages Reborn

The success of Python for data analysis and machine learning can be partially attributed to the implicit parallelism offered by popular libraries: NumPy, pandas, SciPy, TensorFlow, and PyTorch all offer implicit parallelism at the SIMD, thread, or GPU level.

For example, when you call numpy.dot(A,B), the matrix dot product is computed using BLAS (Basic Linear Algebra Subprograms), a low-level library tuned to your hardware which will automatically utilize multiple CPU cores if it is faster to do so. If you're using PyTorch, a call to torch.matmul(A, B) can distribute work over your GPU's cores implicitly.
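
In code, the contrast with explicit parallelism is stark: nothing below mentions threads, yet with a multithreaded BLAS build (OpenBLAS or MKL, say) a multiplication this size will typically fan out across all available cores:

```python
import numpy as np

A = np.random.rand(2000, 2000)
B = np.random.rand(2000, 2000)

# No threads in sight: numpy hands this off to the BLAS library it was
# built against, which decides for itself how many cores to use.
C = np.dot(A, B)
```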

These APIs are reminiscent of a much older family of languages with similar capabilities – APL and its descendant array languages can leverage SIMD, or even GPU parallelism in the case of Co-dfns. Programming in them is like programming in pure NumPy or PyTorch – fun when you’re dealing with matrices, rough when dealing with less regular structures.
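
The array style amounts to replacing loops with whole-array expressions, which leaves the library free to vectorize underneath. A small sketch of the contrast (the temperature data is made up):

```python
import numpy as np

temps_c = np.array([18.2, 21.5, 19.8, 24.1, 22.7])

# Array style: one whole-array expression; the loop is implicit, so
# numpy can use SIMD (and, for other operations, threads) underneath.
temps_f = temps_c * 9 / 5 + 32

# Scalar style: the same result, but the explicit Python loop forces
# element-at-a-time evaluation and hides the parallelism.
temps_f_loop = np.array([t * 9 / 5 + 32 for t in temps_c])
```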

Implicit Parallelism in Dataflow

The relationship between dataflow languages and implicit parallelism is natural. Dataflow languages describe computation as a graph where nodes represent computation (such as a matrix multiplication) and edges are the transfer of data (such as tensors) between them. Because a dataflow program is described in terms of data dependency, we can intuitively parallelize portions of the computation which do not rely on one another.
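
A toy scheduler makes the intuition concrete (this is a sketch, not how any real dataflow runtime is implemented): at every step, each node whose inputs have all arrived is submitted to a pool, so independent nodes run in parallel automatically.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# A toy dataflow graph: each node lists its input edges and a function.
graph = {
    "a": ([], lambda: 2),
    "b": ([], lambda: 3),                   # a and b are independent
    "c": (["a", "b"], lambda a, b: a * b),  # c waits on both edges
}

def run(graph):
    done, running = {}, {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(graph):
            # Submit every node whose inputs have all arrived.
            for name, (deps, fn) in graph.items():
                if (name not in done and name not in running
                        and all(d in done for d in deps)):
                    running[name] = pool.submit(fn, *(done[d] for d in deps))
            # Wait for at least one node to finish, then record it.
            finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
            for name in [n for n, f in running.items() if f in finished]:
                done[name] = running.pop(name).result()
    return done

print(run(graph))  # {'a': 2, 'b': 3, 'c': 6}
```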

Dataflow has its roots in Kahn Process Networks, a deterministic, concurrent model of computation. While it has found success in the field of Digital Signal Processing, dataflow has more recently entered the mainstream in the form of the computation graphs which underlie deep learning frameworks such as PyTorch and TensorFlow. Large-scale data processing tools such as Apache Spark can be similarly viewed under a dataflow lens.
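
Spark's RDD API is a good example of the dataflow lens in practice: the pipeline below only declares data dependencies between stages, and Spark distributes the independent partitions across workers on its own (a minimal sketch, assuming a local PySpark installation):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataflow-sketch").getOrCreate()

# Each stage depends only on the previous one's output; how partitions
# are spread across cores or machines is Spark's problem, not ours.
numbers = spark.sparkContext.parallelize(range(1_000_000))
total = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0).sum()

print(total)
spark.stop()
```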

The Future of Implicit Parallelism

There are many other places where implicit parallelism is utilized: relational languages, logic languages, and graphical languages like LabVIEW. Notably, though, these implicitly parallel languages and libraries are domain-specific rather than general purpose. You can't write your website backend in NumPy (or at least you don’t want to).

Perhaps this is fine, and implicit parallelism should be isolated to domain-specific libraries where it can aid in matrix munging and the like.

Will we ever see implicit parallelism deployed in a general-purpose functional language, automatically taking advantage of SIMD, thread-level, and GPU parallelism?

One potential avenue: Interaction Nets.
