CLARIN-CH

The University of Geneva is represented in the CLARIN-CH Consortium by Prof. Eric Haeberli, from the English Department and the Linguistics Department.

The University of Geneva community provides CLARIN-CH with language resources and expertise in the language sciences, and takes an active part in research projects involving language resources.

Language resources

1. The Incremental Sigmoid Belief Network Dependency Parser (idp) is an NLP tool for synchronous syntactic dependency parsing and semantic role labeling in multiple languages. It was developed by Andrea Gesmundo and can be found here.

2. The Temporal Restricted Boltzmann Machines based model Parser is an NLP tool for dependency parsing of natural language sentences. It was developed by Andrea Gesmundo and can be found here.

3. The HadoopPerceptron annotated dataset is useful for training, prediction and evaluation for Hadoop reference. It was developed by Andrea Gesmundo and can be found here.

4. The SIWIS database comprises speech recordings of bilingual and trilingual speakers recorded at the University of Geneva. Each speaker utters about 170 prompts in two or three of the following languages: French, English, German and Italian. The database was developed during the SIWIS (“Spoken Interaction with Interpretation in Switzerland”) project on speech-to-speech translation, a technology that allows a person to speak to a machine in their native language and have the speech automatically recognised, translated and spoken in a different language. One characteristic of recent technology for this task is that the synthetic voice can sound like the original speaker rather than a generic speaker or a robot. The release of 27.11.2015 included 40 speakers. Access is available upon request.

5. The CHEU-lex corpus is a parallel and comparable corpus of Swiss and European Union (EU) legislation published in the three official languages of the Swiss Confederation (French, German and Italian). It comprises: 1) bilateral agreements concluded between Switzerland and the EU from 1972 to 2017; and 2) Swiss federal legislation representing the reception of these agreements. The corpus aims to provide a richly annotated multilingual resource for investigating the influence of EU drafting and translation practices on Swiss legislation. Its development is led by Prof. Annarita Felici as part of a project funded by a grant from the Ernest Boninchi Foundation. Owing to its structure, the CHEU-lex datasets can be explored from a monolingual (e.g. bilateral agreements in a single language), parallel (e.g. bilateral agreements in the three languages), cross-textual (e.g. bilateral agreements and Swiss legislation in the same language), intratextual (e.g. by text subsections) or diachronic perspective to obtain information on frequency, concordance, parts of speech (POS) or syntactic features. The corpus is hosted on NoSketchEngine and can be browsed here.

6. The LETRINT corpora are four sets of trilingual textual datasets, including one comparable and three parallel corpora. Their scope and features are determined by the goals of the eponymous project LETRINT, “Legal Translation in International Institutional Settings: Scope, Strategies and Quality Markers” (Prof. Fernando Prieto Ramos, Faculty of Translation and Interpretation). The LETRINT project was funded by an ERC Consolidator Grant (2014-2022). It was conducted in cooperation with the translation services of the institutions selected for this research, and with the support of IAMLADP through its Universities Contact Group (UCG). The corpora comprise documents published in English, French and Spanish in 2005, 2010 and 2015 by the four main European Union institutions (the Commission, the Council, the Parliament and the Court of Justice), the United Nations and its International Court of Justice, and the World Trade Organization. This infographic presents the composition and methodological details of each corpus.

7. LETRINT-Q is an open-source corpus query interface that enables users to explore the LETRINT 1 and LETRINT 1+ corpora (for further details, see Prieto Ramos, Cerutti & Guzmán 2019) through monolingual and parallel queries in English, French and Spanish. It was developed for the project on the basis of the corpus-querying application ParaVoz. Users can perform “basic” queries (i.e., by token, lexeme or grammatical tag) or use the CQP query language, and can filter by the following parameters: organization, main legal function and functional sub-category of the text, year, textual genre, and document code (assigned during compilation). The platform renders results in several formats (e.g., lists or charts) and allows data to be downloaded as xlsx or tsv files. Access credentials may be requested here.

Faculties and Departments involved in CLARIN-CH

Faculty of Humanities