Adaptive Redundancy for Manycore Architectures
Start date: 01.07.2017
End date: 30.06.2020
Funded by: DFG (Deutsche Forschungsgemeinschaft)
Local head of project: Prof. Dr. Theo Ungerer, Prof. Dr. Sebastian Altmeyer
Local scientists: Christoph Kühbacher, Christian Mellwig, Dr. Florian Haas
External scientists/cooperations: Prof. Dr.-Ing. Dr. h. c. Jürgen Becker
Adaptive Redundancy for Manycore Architectures is a DFG-funded research project by the research teams of Prof. Dr.-Ing. Dr. h. c. Jürgen Becker (KIT) and Prof. Dr. Theo Ungerer (UniA). The project's goal is to explore and evaluate dynamic redundancy techniques in hardware and software.
The continuous shrinking of feature sizes in CMOS fabrication processes is accompanied by an increasing number of transient, intermittent, and permanent faults. As a consequence, statically coupling redundant modules becomes increasingly inflexible; instead, the system must adapt dynamically to provide reliable, fault-tolerant execution even when several units fail.
The goal of this project is to explore how redundancy mechanisms in hardware and software can be combined to flexibly meet different redundancy requirements at run time, depending on external factors. We call this dynamic redundancy. We investigate dynamic switching between hardware redundancy modes (no redundancy, dual modular redundancy (DMR), and triple modular redundancy (TMR)), between the corresponding software modes, and between combinations of both.
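Switching between redundancy modes can be illustrated with a minimal Python sketch. It assumes a deterministic task with a hashable result, executes the task once per replica, and then either accepts the single result, compares two results (DMR, which can detect but not correct a mismatch), or takes a majority vote over three (TMR, which masks one faulty result). The names `Mode`, `select_mode`, and `run_redundant`, as well as the fault-rate thresholds in `select_mode`, are hypothetical illustrations and not part of the project's hardware or software. In a real system the replicas would typically run on separate cores or hardware modules rather than sequentially.

```python
from collections import Counter
from enum import Enum
from typing import Callable, Hashable, TypeVar

T = TypeVar("T", bound=Hashable)


class Mode(Enum):
    """Redundancy modes; the value is the number of replicated executions."""
    NONE = 1  # single execution, faults go undetected
    DMR = 2   # dual modular redundancy: detects a mismatch, cannot correct it
    TMR = 3   # triple modular redundancy: masks one faulty result by majority vote


class RedundancyError(RuntimeError):
    """Raised when replicas disagree and the current mode cannot resolve it."""


def select_mode(fault_rate: float) -> Mode:
    # Hypothetical run-time policy: the thresholds are illustrative assumptions.
    if fault_rate < 1e-6:
        return Mode.NONE
    if fault_rate < 1e-3:
        return Mode.DMR
    return Mode.TMR


def run_redundant(task: Callable[[], T], mode: Mode) -> T:
    # Execute the task once per replica (sequentially here, for simplicity).
    results = [task() for _ in range(mode.value)]
    if mode is Mode.NONE:
        return results[0]
    if mode is Mode.DMR:
        if results[0] != results[1]:
            raise RedundancyError("DMR: replicas disagree")
        return results[0]
    # TMR: accept the value produced by at least two of the three replicas.
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RedundancyError("TMR: no majority")
    return value


# Usage: the second replica returns a corrupted result, and TMR masks it.
replica_outputs = iter([42, 7, 42])
print(run_redundant(lambda: next(replica_outputs), select_mode(0.01)))  # 42
```

In this sketch DMR only reports the mismatch; how to react (for example, re-executing the task or escalating to TMR) is left to the caller.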
Furthermore, we have developed a functional-style programming model called RAPID (Resilient Analyzable Partitioned DIstributed Data Structures). It is based on the concept of Resilient Distributed Datasets (RDDs), which eventually led to Apache Spark. Our approach leverages the resilience of these datasets for both fault tolerance and high performance by exploiting data locality in a way that serves both purposes.
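The RDD idea of recovering lost data by recomputing it from its lineage, rather than by restoring a checkpoint, can be sketched as follows. This is a simplified illustration in the style of RDDs, not RAPID's or Spark's API: `LineageDataset`, `map`, `collect`, and `lose` are hypothetical names, only element-wise `map` is supported, and the input partitions are assumed to survive faults. Because each partition depends only on its own input partition, it can be (re)computed where its data resides, which hints at how locality could serve both performance and fault tolerance.

```python
from typing import Callable, List, Optional

Partition = List[int]


class LineageDataset:
    """Hypothetical RDD-style dataset: immutable input partitions plus a
    recorded chain (lineage) of pure per-element transformations."""

    def __init__(self, source: List[Partition],
                 ops: Optional[List[Callable[[int], int]]] = None):
        self.source = source                 # input partitions, assumed to survive faults
        self.ops = list(ops or [])           # lineage: how each partition is derived
        self.cache: List[Optional[Partition]] = [None] * len(source)
        self.recomputed = [0] * len(source)  # how often each partition was computed

    def map(self, f: Callable[[int], int]) -> "LineageDataset":
        # Transformations are recorded lazily rather than executed immediately.
        return LineageDataset(self.source, self.ops + [f])

    def _compute(self, i: int) -> Partition:
        # A partition depends only on its own input partition, so it can be
        # computed (and recomputed) wherever that input data resides.
        part = self.source[i]
        for f in self.ops:
            part = [f(x) for x in part]
        self.recomputed[i] += 1
        return part

    def collect(self) -> List[Optional[Partition]]:
        # Materialize missing partitions; cached ones are reused.
        for i, part in enumerate(self.cache):
            if part is None:
                self.cache[i] = self._compute(i)
        return list(self.cache)

    def lose(self, i: int) -> None:
        # Simulate a fault that destroys one materialized partition.
        self.cache[i] = None


ds = LineageDataset([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
print(ds.collect())   # [[11, 21], [31, 41]]
ds.lose(1)
print(ds.collect())   # [[11, 21], [31, 41]]
print(ds.recomputed)  # [1, 2]: only partition 1 was recomputed
```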
Additionally, an actor-based data flow execution model, ARoMA (Adaptive Redundancy for Manycore Architecture), will support the scalable execution of timing-analyzable parallel applications. Because data flow actors are free of side effects, they can easily be re-executed in case of failure. With space and automotive applications in mind, this work could be an important step towards dependable multi-core execution of safety-critical applications.
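The re-execution argument can be made concrete with a small sketch: a scheduler keeps an actor's input tokens until a firing succeeds and simply fires the actor again after a detected fault. The sketch assumes that faults are detected and reported as an exception (the hypothetical `TransientFault`) and that an actor is a pure function from a tuple of input tokens to an output token; `fire_with_reexecution` and `max_attempts` are illustrative names, not ARoMA's interface.

```python
from typing import Callable, List, Optional, Sequence, Tuple


class TransientFault(Exception):
    """Hypothetical signal that a fault was detected during an actor firing."""


def fire_with_reexecution(actor: Callable[[Tuple[int, ...]], int],
                          inputs: Sequence[int],
                          max_attempts: int = 3) -> int:
    # Input tokens are retained until a firing succeeds. Because the actor has
    # no side effects, a failed attempt leaves no state behind to roll back,
    # so re-executing it on the same tokens is sufficient for recovery.
    tokens = tuple(inputs)
    last_fault: Optional[TransientFault] = None
    for _ in range(max_attempts):
        try:
            return actor(tokens)
        except TransientFault as fault:
            last_fault = fault
    raise RuntimeError(f"actor failed {max_attempts} times") from last_fault


# Usage: the first firing is hit by a simulated fault, the second succeeds.
attempts: List[Tuple[int, ...]] = []

def flaky_sum(tokens: Tuple[int, ...]) -> int:
    attempts.append(tokens)  # fault-injection bookkeeping only; the result is pure
    if len(attempts) == 1:
        raise TransientFault("simulated fault in first firing")
    return sum(tokens)

print(fire_with_reexecution(flaky_sum, [1, 2, 3]))  # 6
```

Both attempts receive identical tokens, which is what makes the retry safe; an actor that updated shared state would instead need that state to be rolled back before re-execution.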
The ARoMA execution model itself does not depend on a specific hardware platform. At the moment, RAPID programs can be executed on a standard x86 shared-memory architecture. However, the execution model was designed with a clustered many-core architecture in mind, to take advantage of the parallelism of independent data flow actors. Therefore, an implementation for the Kalray massively parallel processor array (MPPA) is currently under development.