Architectural Support for Fault Tolerance in a Teradevice Dataflow System

IRIS

The high parallelism of future Teradevices, which are going to contain more than 1,000 complex cores on a single die, calls for new execution paradigms. Coarse-grained dataflow execution models are able to exploit such parallelism, since they combine side-effect-free execution with reduced synchronization overhead. However, the terascale transistor integration of such future chips makes them orders of magnitude more vulnerable to voltage fluctuation, radiation, and process variations. This means dynamic fault-tolerance mechanisms have to be an essential part of such future systems. In this paper, we present a fault-tolerant architecture for a coarse-grained dataflow system, leveraging the inherent features of the dataflow execution model. In detail, we provide methods to dynamically detect and manage permanent, intermittent, and transient faults at runtime. Furthermore, we exploit the dataflow execution model for a thread-level recovery scheme. Our results showed that redundant execution of dataflow threads can efficiently make use of underutilized resources in a multi-core system, while the overhead in a fully utilized system stays reasonable. Moreover, thread-level recovery incurred only moderate overhead, even in the case of high fault rates.
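The idea of redundant execution combined with thread-level recovery can be illustrated with a minimal Python sketch. It is not the architecture described in the paper: the names `run_redundant`, `make_flaky_adder`, and `UnrecoverableFault` are hypothetical, and the sketch only assumes that a dataflow thread is a side-effect-free function of its inputs, so that it can be run twice, compared, and restarted from the same inputs when the two results disagree.

```python
from typing import Callable, Sequence, Set, TypeVar

T = TypeVar("T")


class UnrecoverableFault(Exception):
    """Raised when the redundant copies never agree within the retry budget."""


def run_redundant(thread_body: Callable[..., T], inputs: Sequence,
                  max_retries: int = 3) -> T:
    """Run a side-effect-free thread body twice and commit only matching results."""
    for _attempt in range(max_retries + 1):
        # Both copies consume the same unmodified inputs.
        leading = thread_body(*inputs)
        trailing = thread_body(*inputs)
        if leading == trailing:
            return leading  # results agree: safe to commit
        # Mismatch: discard both results and re-execute from the saved inputs.
    raise UnrecoverableFault("redundant copies disagreed on every attempt")


def make_flaky_adder(faulty_calls: Set[int]) -> Callable[[int, int], int]:
    """Hypothetical thread body that returns a corrupted sum on selected calls.

    The call counter belongs to the fault injector used for this demo,
    not to the thread body's own logic.
    """
    state = {"calls": 0}

    def add(a: int, b: int) -> int:
        state["calls"] += 1
        result = a + b
        if state["calls"] in faulty_calls:
            result ^= 1  # flip the lowest bit to emulate a transient fault
        return result

    return add
```

Calling `run_redundant(make_flaky_adder({2}), (2, 3))` corrupts the second copy of the first attempt, so the sketch discards that attempt and commits the result of the second one. Recovery is this simple only because the thread body is assumed to have no side effects before its result is committed.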

Weis, S., Garbade, A., Fechner, B., Mendelson, A., Giorgi, R., Ungerer, T. (2016). Architectural Support for Fault Tolerance in a Teradevice Dataflow System. INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 44(2), 208-232 [10.1007/s10766-014-0312-y].

Weis, S.; Garbade, A.; Fechner, B.; Mendelson, A.; Giorgi, R.; Ungerer, T.

2016-01-01

Year: 2016

Journal in which the work is published: INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING
		
Appears in types: 1.1 Journal article

Files in this item:

File	Type	License	Size	Format
Weis2016_Article_ArchitecturalSupportForFaultTo.pdf (not available)	Publisher's PDF	NOT PUBLIC - Private/restricted access	1.01 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11365/46816