CLARIN Resource Families for Oral History

Lenardič, Jakob; Calamai, Silvia; Scagliola, Stefania; van den Heuvel, Henk

doi:10.5281/zenodo.7990535

The CLARIN Resource Families (CRF) initiative provides manually curated overviews of prominent language resources and technologies deposited across the distributed CLARIN infrastructure (Lenardič and Fišer 2022). The main aim of CRF is to support other core services of CLARIN from the perspective of the FAIR principles (Wilkinson et al. 2016). CRF enhances the findability and accessibility of CLARIN resources by collating them under their most common typological characteristic, and it facilitates re-use by providing comprehensive descriptions tailored to the technical features of each family as well as to its qualitative characteristics. Furthermore, CRF provides a funding instrument for external projects to contribute new overviews. Though originally focused on written corpora (e.g., corpora of parliamentary proceedings, corpora of academic texts), CRF was expanded in 2022 to include corpora of oral history (see https://www.clarin.eu/resource-families/oral-history-corpora). At present, one corpus family is featured, the Ravensbrück corpora (Calamai et al. 2022a), whose creation was supported by the aforementioned CRF funding instrument. This family contains 8 collections of recorded interviews with survivors of the Ravensbrück women’s concentration camp, conducted in different languages, such as English, German, Hebrew, and French. One collection is available for download (Collection Bruzzone; see Bruzzone and Beccaria Rolfi 1976), while the others can be streamed online. The inclusion of the Ravensbrück corpora in CRF illustrates how the CLARIN infrastructure incorporates and documents complex objects like oral history sources, whose provenance and metadata documentation differ widely from those of standard written corpora and even from those of contemporary born-digital interviews. The team working on the Ravensbrück resource family (see Calamai et al. 2022b) used CLARIN’s Component Metadata Infrastructure (CMDI), a framework for metadata description that “supports flexible definitions of metadata structure and semantics” by allowing researchers to “create and use their own [metadata] schema tailored specifically towards the requirements of [their] project” (Windhouwer and Goosen 2022: 194 and 199). All 8 collections within the Ravensbrück family are accompanied by extensive CMDI metadata, prepared by Calamai et al. (2022a,b). Because the interviews in the Ravensbrück family were mostly recorded on an analogue carrier (i.e., audio cassettes), a new CMDI metadata profile was created that is tailored to such legacy interviews not born digitally. This metadata profile has additional components describing “information about the context in which the interviews were conducted” as well as “information about the process of digitisation” (Calamai et al. 2022a: 3). Being thus digitised, comprehensively described, and carefully curated, the Ravensbrück corpora present a unique opportunity to study and compare these historical interviews. To facilitate their use in research, CLARIN offers through its Speech data and Technology network (Draxler et al. 2020) an open-source web application called TranscriptionPortal (https://speechandtech.eu/transcription-portal), where certain audio recordings (e.g., Collection Bruzzone, United States Holocaust Memorial Museum) can be uploaded and then orthographically transcribed on the fly, with manual phonetic and word alignment for a variety of languages.

Lenardič, J., Calamai, S., Scagliola, S., and van den Heuvel, H. (2023). CLARIN Resource Families for Oral History. doi:10.5281/zenodo.7990535.