Corpora
CLARIN-CH aims to increase findability and facilitate access to existing corpora created at its member institutions. This list thus serves as a compilation of corpora that are available to other researchers from the community, aligning with Open Research Data principles.
Name | Description | Availability | More information | CLARIN-CH institution(s) |
---|---|---|---|---|
SwissSLi | SwissSLi is the first sign language corpus that contains parallel data of all three Swiss sign languages, namely Swiss German Sign Language (DSGS), French Sign Language of Switzerland (LSF-CH), and Italian Sign Language of Switzerland (LIS-CH). The data underlying this corpus originates from television programs in three spoken languages: German, French, and Italian. The programs have for the most part been translated into sign language by deaf translators, resulting in a unique, up to six-way multi-parallel dataset between spoken and sign languages. | Open | Openly available on SWISSUbase: https://www.swissubase.ch/en/catalogue/studies/20709/19983/overview | University of Zurich |
What's New, Switzerland? | "What's New, Switzerland?" is an academic research project funded by the SNF in the framework of the "Evolving Language" NCCR. The project seeks to advance knowledge on how humans communicate their emotions through digital media, in particular the WhatsApp instant messaging platform, and how this communication evolves over time in relation to technological changes and emerging practices. To that end, a corpus of WhatsApp chats between French-speaking users in Switzerland was collected (between August and October 2022) and preprocessed (de-identified), with a view to making it available to the scientific community. | Institutional login required | The dataset is available on demand for research purposes, under a restricted license contract, from the SWISSUbase repository: https://www.swissubase.ch/en/catalogue/studies/20713/19924/overview | University of Lausanne |
Spoken Hebrew interview corpus | This data collection consists of interviews in Hebrew that were recorded by Philipp Striedl between summer 2018 and early 2020 for the dissertation "Representations of Variation in Modern Hebrew in Israel: Cognitive Processes of Social and Linguistic Categorization" (Striedl 2022). | Institutional login required | The dataset is available on demand for research purposes, under a restricted license contract, from the SWISSUbase repository: https://www.swissubase.ch/en/catalogue/studies/20513/19360/overview | University of Zurich (in collaboration with LMU) |
ParTree | The corpus contains movie subtitles in various languages with a raw text version and, if training data was available, automatically parsed versions with morphosyntactic annotation in Universal Dependencies (UD) style. Building on Levshina's (2016) ParTy corpus of movie subtitles, the authors extended her collection to cover more languages and added movies with a particularly broad linguistic range of available subtitles. Since the original corpus only contained raw text, they trained language models with the Python package spaCy (v2.2.3) on UD treebanks (v2.5 and later) to add morphosyntactic annotation. The annotated data was saved in CoNLL-U format. | Open | Openly available on SWISSUbase: https://www.swissubase.ch/en/catalogue/studies/20295/19511/overview | University of Zurich |
ArchiMob | The ArchiMob corpus represents German linguistic varieties spoken within the territory of Switzerland. This corpus is the first electronic resource containing long samples of transcribed text in Swiss German, intended for studying the spatial distribution of morphosyntactic features and for natural language processing. Two releases of the ArchiMob corpus are provided. The first release, initially published in 2016, includes 34 transcribed interviews. The second release, initially published in 2019, includes 43 transcribed interviews. | Open | Transcriptions and a sample of audio recordings are openly available on SWISSUbase: https://www.swissubase.ch/en/catalogue/studies/20154/19410/overview. Full audio is only available for research purposes on request via SWISSUbase. | University of Zurich |
ACCOMOJI | ACCOMOJI: Emoji Accommodation in Swiss Multilingual Computer-Mediated Conversations is a collaboration between researchers from the University of Lausanne (UNIL) and the Swiss Federal Institute of Technology Lausanne (EPFL) at the intersection of data science and linguistics. It was funded by the UNIL-EPFL Collaborative Research on Science and Society (CROSS) 2021, a programme that supports interdisciplinary projects dealing with current issues in society and technology. ACCOMOJI seeks to examine the ways in which people conversing in the Swiss national languages converge or diverge over time with regard to emoji usage, thereby managing social and emotional distance. The authors limited themselves to studying Swiss German and French. The primary source of data is a corpus of Swiss WhatsApp conversations: "What's up Switzerland?" (WUS, cf. https://www.whatsup-switzerland.ch/index.php/en/). The released data contains function- and emotion-based annotations obtained via a citizen science campaign for a number of emojis in the context of the chat where they occurred, as well as demographic information about the citizen science annotators. | Open | Openly available on SWISSUbase: https://www.swissubase.ch/en/catalogue/studies/20120/19894/overview | University of Lausanne |
Online Edition of the Paippalāda Recension of the Atharvaveda | Online Edition of the Paippalāda Recension of the Atharvaveda | Open | Openly available on SWISSUbase: https://www.swissubase.ch/en/catalogue/studies/20701/19845/overview | University of Zurich |
Text+Berg digital | Text+Berg digital was a project to digitise the yearbooks of the Swiss Alpine Club (SAC) published since 1864. The resulting electronic data are available as an annotated linguistic corpus. | Online only | Available for queries in LiRI Corpus Platform (LCP): https://catchphrase.linguistik.uzh.ch/query/44/TextBerg-Korpus-Alpine-Journal | University of Zurich |
Collection of Swiss Law Sources online | Since 1898, the Law Sources Foundation of the Swiss Lawyers Society has edited the Collection of Swiss Law Sources, a collection of law sources created on Swiss territory. The Collection contains materials from the early Middle Ages until early modern times (1798). | Open | Available for download at The Swiss Lawyers Society website: https://ssrq-sds-fds.ch/en/digital/online/ | External: The Swiss Lawyers Society |
ACQDIV Database | The ACQDIV Database brings together 17 corpora of first language acquisition, representing 15 maximally diverse languages, in a formally and semantically standardized format. It contains video and audio recordings, transcribed speech, and linguistic annotations from these corpora. The database is created and maintained by the TTF DataScience of the NCCR and the UZH ACQDIV Lab, led by Prof. Sabine Stoll. | Open | Part of the database is publicly available for download from Zenodo: https://zenodo.org/records/3558643 | University of Zurich |
JuBe | The Jugendsprache Project collects and studies ethnolectal youth language in the Canton of Bern. The project has created the first corpus of youth language in Switzerland through the cooperation of various Bernese linguists. The corpus will be made available to researchers at all academic levels for use in their own research. The goals of the project are (i) to demonstrate the relevance of youth language in sociological, ethnographic and sociolinguistic research, (ii) to examine what conclusions can be drawn about linguistic innovations and change in youth language, and (iii) to highlight that Switzerland, as a quadrilingual state and a country of migration, is particularly well suited for research on ethnolects and on youth languages. | Open | The dataset is available for download from Zenodo: https://doi.org/10.5281/zenodo.5648157 | University of Bern |
LETRINT | The LETRINT corpora are four sets of trilingual textual datasets, including one comparable and three parallel corpora. Their scope and features are determined by the goals of the project. They comprise documents published in English, French and Spanish by the four main European Union institutions (the Commission, the Council, the Parliament and the Court of Justice), the United Nations and its International Court of Justice, and the World Trade Organization in 2005, 2010 and 2015. | Online only | The dataset is available for queries on the website of the University of Geneva after filling in a short survey: https://transius.unige.ch/en/research/letrint/corpora | University of Geneva |
If you have suggestions for resources to be added, please do not hesitate to contact us.