Master's thesis presentation. Ruilin is advised by Hayden Liu Weng.
Ruilin Qi: Comparative analysis of existing numerical linear algebra frameworks on modern architectures
This thesis examines the performance of two linear algebra frameworks, PETSc and Ginkgo, on modern heterogeneous architectures, with a focus on solving large, sparse linear systems. Both libraries provide Krylov subspace solvers and preconditioners with support for sequential CPU, parallel CPU, GPU, and distributed multi-GPU execution, but they differ in design, age, and supported methods. Comparable solvers are implemented in C for PETSc and in C++ for Ginkgo; testing and benchmarking are automated with bash scripts. Library performance is evaluated on a 24-core AMD EPYC CPU and four NVIDIA GeForce RTX 3080 GPUs.
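To make the setup concrete, here is a minimal sketch of the PETSc side (not the thesis code; the 1D Poisson test matrix, tolerances, and solver choice are illustrative assumptions). It assembles a small system and solves it with BiCGStab preconditioned by ILU through PETSc's KSP interface:

  /* Minimal sketch, not the thesis code: assemble a 1D Poisson matrix and
     solve it with BiCGStab + ILU via PETSc's KSP interface (recent PETSc;
     run on a single process, since plain PCILU is sequential). */
  #include <petscksp.h>

  int main(int argc, char **argv)
  {
    Mat      A;
    Vec      x, b;
    KSP      ksp;
    PC       pc;
    PetscInt n = 1000, i, Istart, Iend;

    PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

    /* Tridiagonal test matrix (illustrative stand-in for a SuiteSparse problem). */
    PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
    PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
    PetscCall(MatSetFromOptions(A));
    PetscCall(MatSetUp(A));
    PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
    for (i = Istart; i < Iend; ++i) {
      if (i > 0)     PetscCall(MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES));
      if (i < n - 1) PetscCall(MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES));
      PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
    }
    PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
    PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
    PetscCall(MatCreateVecs(A, &x, &b));
    PetscCall(VecSet(b, 1.0));

    /* Krylov solver + preconditioner: BiCGStab with ILU(0). */
    PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
    PetscCall(KSPSetOperators(ksp, A, A));
    PetscCall(KSPSetType(ksp, KSPBCGS));
    PetscCall(KSPGetPC(ksp, &pc));
    PetscCall(PCSetType(pc, PCILU));
    PetscCall(KSPSetTolerances(ksp, 1e-8, PETSC_DEFAULT, PETSC_DEFAULT, 1000));
    PetscCall(KSPSetFromOptions(ksp));  /* allow -ksp_type/-pc_type overrides */
    PetscCall(KSPSolve(ksp, b, x));

    PetscCall(KSPDestroy(&ksp));
    PetscCall(MatDestroy(&A));
    PetscCall(VecDestroy(&x));
    PetscCall(VecDestroy(&b));
    PetscCall(PetscFinalize());
    return 0;
  }

Ginkgo's C++ interface expresses the same steps through executor and solver-factory objects, which is one of the design differences the thesis compares.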
Four methods are evaluated per library: CGS, BiCGStab, and GMRES are shared, TFQMR is unique to PETSc, and IDR(s) is unique to Ginkgo. These are combined with four preconditioners per library: Block-Jacobi, ICC, and ILU are shared, BoomerAMG is unique to PETSc, and ISAI is unique to Ginkgo. Test problems are taken from the SuiteSparse Matrix Collection, with sizes ranging from fewer than 1,000 rows/columns to more than 10^6 rows/columns. Problems are further filtered by structural properties, such as symmetry, pattern symmetry, and positive definiteness, as well as by application domain.
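On the PETSc side, one way to picture the sweep over these solver/preconditioner pairings is a small helper that maps the method names above to PETSc's type constants (a hypothetical sketch under the assumption of an already-created KSP; the function name is mine, the type constants are PETSc's):

  /* Hypothetical helper, not the thesis code: apply one solver/preconditioner
     pairing from the PETSc test matrix to an existing KSP object. */
  #include <string.h>
  #include <petscksp.h>

  static PetscErrorCode ConfigureCombination(KSP ksp, KSPType solver, PCType prec)
  {
    PC pc;

    PetscFunctionBeginUser;
    PetscCall(KSPSetType(ksp, solver));  /* KSPCGS, KSPBCGS, KSPGMRES, or KSPTFQMR */
    PetscCall(KSPGetPC(ksp, &pc));
    PetscCall(PCSetType(pc, prec));      /* PCBJACOBI, PCICC, PCILU, or PCHYPRE */
    if (strcmp(prec, PCHYPRE) == 0)
      PetscCall(PCHYPRESetType(pc, "boomeramg"));  /* BoomerAMG is provided by hypre */
    PetscFunctionReturn(PETSC_SUCCESS);
  }

Equivalently, when KSPSetFromOptions is used, bash benchmark scripts can drive the same sweep purely through -ksp_type and -pc_type command-line options, without recompiling.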
Performance is analyzed in terms of end-to-end runtime, solve success under common convergence criteria, strong scaling for CPU and GPU configurations, and runtime comparisons between CPU and GPU. Finally, device load, memory throughput, and memory footprint are examined through profiling. Results differ by library and problem size, with larger systems generally favoring GPU solves. PETSc speedups depend strongly on the selected combination of preconditioner and solver, while Ginkgo achieves a GPU speedup in almost every configuration. Preconditioner selection significantly influences runtime and convergence behavior in general: ILU and BoomerAMG are very effective preconditioners, while ICC and Block-Jacobi generally improve convergence less. Profiling BiCGStab with ILU on the GPU reveals that sparse triangular solves are poorly optimized, utilizing only a fraction of the available compute power.
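For reference, the standard definitions behind two of these metrics (notation mine, not taken from the thesis): a solve is typically counted as successful when the relative residual falls below a tolerance, and strong scaling is reported as speedup over the single-device baseline,

  \[ \|b - Ax_k\|_2 \le \mathrm{rtol} \cdot \|b\|_2, \qquad S(p) = \frac{T(1)}{T(p)}, \]

where x_k is the iterate at step k and T(p) is the end-to-end runtime on p cores or GPUs.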