Correct Optimized GPU Programs

Mohsen Safari

promotor: prof.dr. M. Huisman (UT)
University of Twente
Date: 6 July 2022
Thesis: PDF

Summary

Software is widely incorporated in our daily lives. Almost all electronic devices around us, from smartphones and coffee machines to cars and planes are run by computer software.
Thus, our lives depend heavily on software, while software is created by humans, who naturally make mistakes. Developing correct software therefore becomes a major challenge, where by correctness we mean guaranteeing that software behaves as we expect. Software correctness is particularly crucial in critical systems (e.g., nuclear power plants and medical devices), where software failure would threaten human lives and the environment.
In addition to reliability and correctness, users also demand performance from software. Performance is particularly important in critical systems, where a delay in a software response could cause serious damage. To increase performance, new hardware and architectures keep appearing. Moore’s law, which states that “the number of transistors on a CPU chip will double every 18 months”, has now ended. Multi-core CPUs were introduced to compensate, but the number of CPU cores on a machine is limited, so multi-core CPUs alone cannot sustain the performance growth that Moore’s law used to provide.
Graphics Processing Units (GPUs) have emerged as co-processors to address this issue and to reduce the performance limitations of data-parallel programs. In contrast to multi-core CPUs, GPUs have hundreds or thousands of cores, which are simpler but more tuned to data parallelism. CUDA and OpenCL are the two dominant programming languages for GPUs. Their programming model supports parallel execution in which many threads cooperate, executing the same instructions but on different data (known as Single Instruction Multiple Data (SIMD)).
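As a minimal illustration of this model (not taken from the thesis), the following CUDA kernel adds two vectors: every thread executes the same instructions, but selects its own element using its block and thread indices. The host code, names and sizes are chosen only for this example.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread adds one pair of elements: the same instruction,
    // applied to different data, selected via the thread's global index.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            c[i] = a[i] + b[i];
        }
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *a, *b, *c;
        cudaMallocManaged(&a, bytes);
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        vecAdd<<<blocks, threads>>>(a, b, c, n);
        cudaDeviceSynchronize();

        printf("c[0] = %f\n", c[0]);  // expected 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }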
GPUs are widely used in industry and can drastically improve the performance of (sequential and multi-threaded) programs. However, errors occur more frequently in GPU programs, because the GPU programming model gives programmers more flexibility to control hardware resources. Moreover, the hierarchical structure of threads and memory makes GPU programming more challenging and error-prone. Therefore, proving the correctness of GPU programs is of the utmost importance. Concretely, in this thesis we investigate how to prove that a GPU program is data-race free and functionally correct.
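As a small illustrative example of the kind of error this concerns (again not taken from the thesis), the first kernel below contains a data race: neighbouring threads read and write overlapping array elements without synchronisation, so the result depends on the thread schedule. The second kernel sketches one possible fix for a single thread block; the names and the single-block assumption are ours.

    // Racy: thread i reads data[i+1] while thread i+1 may be overwriting it.
    __global__ void shiftLeftRacy(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n - 1) {
            data[i] = data[i + 1];
        }
    }

    // Race-free sketch for a single block (assumes n <= blockDim.x):
    // all reads happen before the barrier, all writes after it.
    __global__ void shiftLeftSafe(int *data, int n) {
        int i = threadIdx.x;
        int tmp = 0;
        if (i < n) {
            tmp = (i < n - 1) ? data[i + 1] : data[i];  // read phase
        }
        __syncthreads();                                // all reads finish first
        if (i < n) {
            data[i] = tmp;                              // write phase
        }
    }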
Deductive program verification can be used to reason about GPU programs. In a deductive approach, programs are annotated with pre- and postconditions, invariants and intermediate properties written in first-order logic. A proof system is then used to verify that the implementation adheres to the specification. However, proving GPU programs correct with deductive techniques is challenging, because of the many possible interleavings of arbitrary threads and the specific programming model of GPUs. Moreover, program development on GPUs is an iterative process. Programmers first implement an unoptimized parallel program on a GPU. Then, to get the most performance out of the GPU, they typically apply incremental optimizations tailored to the GPU architecture, either manually or semi-automatically, at the level of the GPU program and before compiling the code. Even though there has been some work on automating this optimization process, little effort has gone into guaranteeing that correctness is preserved by it. Automatically preserving the correctness of GPU programs is important, as it avoids having to manually annotate each new version of the optimized program in a deductive approach.
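To give a flavour of what such annotations might look like, the sketch below attaches a pseudo-specification, written as comments, to a trivial kernel: a per-thread permission to read and write specific array elements (used to argue data-race freedom) and a per-thread postcondition (used to argue functional correctness). The annotation syntax here is hypothetical and only loosely inspired by permission-based specification languages; it is not the exact input syntax of any particular verifier.

    // Pseudo-annotations (illustrative only, not real verifier syntax):
    //@ requires tid < n ==> perm(a[tid], read) ** perm(out[tid], write);
    //@ ensures  tid < n ==> out[tid] == 2 * a[tid];
    __global__ void doubleElements(const int *a, int *out, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) {
            out[tid] = 2 * a[tid];
        }
    }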
Therefore, the questions this thesis addresses rest on two pillars. First, how can deductive verification techniques be used in practice to prove the functional correctness of complicated GPU programs? Second, how can GPU optimizations be applied automatically to an unoptimized verified GPU program such that the final optimized GPU program is guaranteed to still be correct?
This thesis consists of two parts, addressing these two questions. The first part shows how to leverage deductive verification techniques to reason about complicated GPU programs. In particular, it proves data-race freedom and functional correctness of two implementations of parallel prefix sum algorithms, parallel stream compaction and summed-area table algorithms, and a parallel Bellman-Ford shortest path algorithm. Moreover, this part presents a generic pattern to prove the permutation property of any swap-based sorting algorithm, whether (GPU-based) parallel or (CPU-based) sequential. This technique is applied to prove the permutation property of parallel odd-even transposition sort, as well as sequential bubble sort, selection sort, insertion sort, quicksort, two in-place merge sorts and TimSort.
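The intuition behind the permutation argument can be seen in a minimal sketch of odd-even transposition sort for a single thread block: the only operation that modifies the array is a swap, and a swap never adds or removes elements, so the array always remains a permutation of its input. The kernel below is our own simplified, unverified sketch, not the implementation verified in the thesis, and it assumes the block has at least n/2 threads.

    // Swapping is the only way the array is modified, so the multiset of
    // elements is preserved: this is the core of the permutation argument.
    __device__ void swapElems(int *a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }

    // Single-block sketch of odd-even transposition sort
    // (assumes blockDim.x >= n / 2).
    __global__ void oddEvenSort(int *a, int n) {
        int tid = threadIdx.x;
        for (int phase = 0; phase < n; ++phase) {
            // Even phases compare pairs (0,1),(2,3),...; odd phases (1,2),(3,4),...
            int i = 2 * tid + (phase % 2);
            if (i + 1 < n && a[i] > a[i + 1]) {
                swapElems(a, i, i + 1);
            }
            __syncthreads();  // finish all comparisons of this phase first
        }
    }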
The second part of the thesis introduces an annotation-aware (source-to-source) transformation technique: during a program transformation, not only the code but also the corresponding annotations are changed automatically. The goal is to apply optimization techniques automatically, step by step, to a verified unoptimized GPU program, with the annotations being transformed along with the code, allowing for efficient reverification of the optimized program(s). Concretely, we develop this approach for six optimization techniques, namely loop unrolling, iteration merging, matrix linearization, data prefetching, tiling and kernel fusion, and apply them to verified programs while preserving their provability. As a consequence, the approach yields correct optimized GPU programs. We evaluate it on the same example programs that are verified in the first part of this thesis.
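As a simplified, hypothetical illustration of what annotation-aware loop unrolling could look like, the two kernels below show a loop before and after unrolling by a factor of two, together with comment-style loop invariants that are transformed along with the code (the extra conjunct records that the counter now advances in steps of two). Both the annotation syntax and the transformation shown are hand-written for this example, not output of the thesis tooling, and the unrolled version assumes n is even.

    // Original kernel: each thread sums n copies of c
    // (assumes one output element per launched thread).
    __global__ void sumOriginal(int *out, int n, int c) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int sum = 0;
        //@ loop_invariant 0 <= k && k <= n;
        //@ loop_invariant sum == k * c;
        for (int k = 0; k < n; ++k) {
            sum += c;
        }
        out[tid] = sum;
    }

    // Unrolled by a factor of 2: the loop invariant is transformed
    // together with the code, so the optimized kernel can be re-verified
    // without manual re-annotation.
    __global__ void sumUnrolled(int *out, int n, int c) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int sum = 0;
        //@ loop_invariant 0 <= k && k <= n && k % 2 == 0;
        //@ loop_invariant sum == k * c;
        for (int k = 0; k < n; k += 2) {
            sum += c;
            sum += c;
        }
        out[tid] = sum;
    }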