Future computing systems (Teradevices) will probably contain more than 1000 cores on a single die. Threaded dataflow execution models are a promising way to exploit this parallelism, since they provide side-effect-free execution and reduced synchronization overhead. However, the terascale transistor integration of such chips makes them orders of magnitude more vulnerable to voltage fluctuation, radiation, and process variations. Reliability techniques must therefore be an essential part of such future systems. In this paper, we conceptualize a fault-tolerant architecture for a scalable threaded dataflow system. We provide methods to detect permanent, intermittent, and transient faults during execution. Furthermore, we propose a recovery technique for dataflow threads.

Weis, S., Garbade, A., Wolf, J., Fechner, B., Mendelson, A., Giorgi, R., et al. (2011). A fault detection and recovery architecture for a teradevice dataflow system. In DFM-2011: Data-Flow Execution Models for Extreme Scale Computing (pp. 39-45). IEEE [10.1109/DFM.2011.9].

A fault detection and recovery architecture for a teradevice dataflow system

GIORGI, ROBERTO (Writing – Review & Editing)
2011-01-01

Year: 2011
ISBN: 978-0-7695-4646-9

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/16933