Previous talks at the SCCS Colloquium

Manuel Schnaus: Towards Resilience Methods for Simulation Applications based on Actor Replication

SCCS Colloquium


High-Performance Computing is an important field of Scientific Computing, and many of its problems can be sped up through high levels of parallelization. One framework for writing such parallel programs is the actor model. This approach follows the SPMD principle: actors advance the program and communicate with each other through specified channels. Especially in exascale computing, undetected data corruption in an actor can have devastating effects on program execution. To detect possible data corruption, I propose double redundancy through full replication of actors: redundantly computed results are checked against each other to find errors. Another important task in High-Performance Computing is balancing the workload evenly between cores. While other approaches achieve promising results in scenarios where imbalances are predictable, they cannot protect the program against non-static and unpredictable imbalances. For these applications, the possibility of load balancing through redundancy is explored. Here, when an actor is slowed down due to imbalances, its replica can take over and complete the computations, reducing the waiting times of neighbouring actors. Using replication, errors within the actor model were detected with particularly high accuracy, at the cost of increased runtime. Additionally, the idle time of actors in unbalanced scenarios was reduced dramatically using load balancing through redundancy.
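The idea of checking redundantly computed results against each other can be illustrated with a minimal Python sketch. It is not the implementation from the thesis: the function `run_replicated`, the `tol` parameter, and the example steps `double` and `corrupted` are hypothetical names chosen for illustration, and an "actor" is reduced here to a plain function applied to a list of numbers.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Step = Callable[[Sequence[float]], List[float]]


def run_replicated(
    step: Step,
    inputs: Sequence[float],
    replica_step: Optional[Step] = None,
    tol: float = 0.0,
) -> Tuple[List[float], List[int]]:
    """Run one computation twice and report indices where the results differ.

    `replica_step` defaults to `step`; passing a different function is only
    a way to simulate a corrupted replica in this sketch.
    """
    primary = step(inputs)
    replica = (replica_step or step)(inputs)
    if len(primary) != len(replica):
        raise ValueError("primary and replica produced results of different length")
    # An entry counts as a mismatch if the two copies differ by more than `tol`.
    mismatches = [
        i for i, (a, b) in enumerate(zip(primary, replica)) if abs(a - b) > tol
    ]
    return primary, mismatches


def double(xs: Sequence[float]) -> List[float]:
    """Hypothetical actor step: doubles every input value."""
    return [2.0 * x for x in xs]


def corrupted(xs: Sequence[float]) -> List[float]:
    """Same step, with a simulated data corruption in the second entry."""
    out = double(xs)
    out[1] += 1.0
    return out


if __name__ == "__main__":
    print(run_replicated(double, [1.0, 2.0, 3.0]))
    print(run_replicated(double, [1.0, 2.0, 3.0], replica_step=corrupted))
```

With only two copies, a mismatch shows that an error occurred but not which copy is wrong, so a scheme like this detects corruption rather than correcting it.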

Bachelor’s thesis talk (Informatics: Games Engineering). Manuel is advised by Philipp Samfass and Mario Wille.