The high parallelism of future Teradevices, which are going to contain more than 1,000 complex cores on a single die, requests new execution paradigms. Coarse-grained dataflow execution models are able to exploit such parallelism, since they combine side-effect free execution and reduced synchronization overhead. However, the terascale transistor integration of such future chips make them orders of magnitude more vulnerable to voltage fluctuation, radiation, and process variations. This means dynamic fault-tolerance mechanisms have to be an essential part of such future system. In this paper, we present a fault tolerant architecture for a coarse-grained dataflow system, leveraging the inherent features of the dataflow execution model. In detail, we provide methods to dynamically detect and manage permanent, intermittent, and transient faults during runtime. Furthermore, we exploit the dataflow execution model for a thread-level recovery scheme. Our results showed that redundant execution of dataflow threads can efficiently make use of underutilized resources in a multi-core, while the overhead in a fully utilized system stays reasonable. Moreover, thread-level recovery suffered from moderate overhead, even in the case of high fault rates.
Scheda prodotto non validato
Scheda prodotto in fase di analisi da parte dello staff di validazione
|Titolo:||Architectural Support for Fault Tolerance in a Teradevice Dataflow System|
|Citazione:||S., W., A., G., B., F., A., M., Giorgi, R., & T., U. (2016). Architectural Support for Fault Tolerance in a Teradevice Dataflow System. INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 1-25.|
|Appare nelle tipologie:||1.1 Articolo in rivista|
File in questo prodotto: