The University of Zurich is represented in the CLARIN-CH Consortium by Prof. Noah Bubenhofer, from the Department of German and the Zurich Center for Linguistics.
The community from the University of Zurich provides CLARIN-CH with numerous language resources and expertise in the language sciences:
1. The BNC Dependency Bank corpus was created at the English Department. It contains 1'200'000'000 words.
2. The GLBCC (Giessen-Long Beach Chaplin Corpus) was created at the English Department. It is a corpus of spoken language with approx. 155,000 words, in the format of audio files and transcriptions.
3. The ZEN (Zurich English Newspaper) corpus was created at the English Department. It is a diachronic (1661-1791) corpus of the first English newspapers with about 1.6 million words, searchable with Corpus Navigator.
4. The International Corpus of English (ICE) began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Twenty-six research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team is following a common corpus design, as well as a common scheme for grammatical annotation. Until 2016, the ICE corpora were distributed by Prof. Gerry Nelson at the Department of English, The Chinese University of Hong Kong. They are now coordinated by Prof. Marianne Hundt and hosted at the English Department of the University of Zurich. Access upon request.
5. The PADLF (Les plus anciens documents linguistiques de la France) corpus was created at the Romance Department. The corpus is an electronic edition of 13th century documents from Lorraine dating back to 1133. The corpus is bibliographically annotated (especially places of writing) and partially lemmatized. Available here.
6. The Tzéro database was created at the Romance Department (Meisner, 2016). It is a database on negation in French, which contains approx. 2500 entries with French language data, transcribed in IPA from approx. 40 hours of conversation recordings of approx. 100 speakers from France and Switzerland. Access upon request.
7. The AIS reloaded corpus provides the materials to analyze the diachronic evolution of Italo-Romance dialects by comparing the material from the Atlante linguistico ed etnografico dell’Italia e della Svizzera meridionale (AIS) with new data collected in the same locations almost a hundred years later. It can be accessed online.
7. The SenS-Korpus was created at the German Department. It consists of 34 hours of recordings of conversations of discussion groups that met as part of the Sensory Semantics project, of which 14 are transcribed (180’000 words). Access upon request.
8. The Archimob (Archives de la mobilisation) database was created at the German Department. In the oral history project Archimob, 555 video interviews were conducted with contemporary witnesses of the Second World War in Switzerland. Of these interviews, about 50 were selected for dialectological studies and 17 of them were transcribed. Available here.
9. The Picture postcard corpus was created at the German Department. It is a corpus of currently approx. 6000 scanned postcards and 200’000 German words.
10. The DAVADS (Digital Audio/Video Archive) database was created at the German Department. It is a collection of approx. 700 broadcasts of Swiss television DRS from the period 1975-1999. The individual broadcasts are annotated (short description of the topics, broadcast dates, studio guests present, linguistic features, etc.). 375 broadcasts are digitized.
11. The Swissenker corpus was created at the German Department. It contains transcriptions of 700 Swiss Wenker questionnaires, collected in 1933/34. The Wenker questionnaire is a traditional dialectological questionnaire that requires the translation of 40 sentences from standard German into dialect.
12. The Zurich summer 1968 corpus was created at the German Department. It contains a total of 958 transcribed documents. Available here.
13. The Temporal entity extraction from historical texts project, carried out at the Department of Computational Linguistics, resulted in a gold standard of temporal annotations. The corpus contains 50 historical legal articles in Early New High German. It was annotated with a subset of the TimeML markup language for temporal annotation. The corpus contains about 34,000 tokens and is available here.
14. The JAKOB lexicon was created at the Department of Computational Linguistics in collaboration with the Institute of Psychology. It represents a lexicon of psychological dimensions of German words. Access upon request here.
15. The SADS (Syntactic Atlas of German-speaking Switzerland) database was created at the German Department. It is a digital atlas describing the syntactic landscape of German-speaking Switzerland. Surveys on 54 syntactic variables were conducted in 383 places with altogether 3187 informants. Access upon request.
16. The Swissdox@LiRI database with press articles was created in collaboration with the Schweizer Mediendatenbank AG. Swissdox@LiRI consists of 29 million media articles (press, online) from a wide range of Swiss media sources covering many decades. The database is updated daily with about 5'000 to 6'000 new articles from the German- and French-speaking parts of Switzerland. It is designed for big data analyses. Data may optionally be enriched, automatically processed and analyzed. Access upon institutional subscription.
17. The bulletin4corpus corpus was created at the Department of Computational Linguistics. The corpus contains the Credit Suisse Bulletin, which has been published since 1895, partly in four languages: German, French, Italian and English. The magazine contains articles on economic and socially relevant topics and is therefore neither a banking magazine nor a traditional corporate magazine. This makes the Bulletin interesting as a training corpus for applications such as machine translation, since it covers a different genre that is suitable, for instance, for newspapers and magazines. Available here.
18. The sms4science corpus was created at the Romance Department. It consists of approx. 26,000 SMS messages in all four Swiss national languages. In addition to the original texts and a normalized version with general annotations, various sub-corpora are available which are annotated with specific annotations by doctoral students. Access upon request here.
19. The SMULTRON (Stockholm Multilingual Treebank) corpus was created at the Department of Computational Linguistics. It is a parallel treebank with subcorpora, each containing texts of different genres (mainly non-fiction texts) in two or more languages; five languages in total: Swiss German, German, French, Italian, Rhaeto-Romanic (Romansh). Available here.
20. The eSSRQ (Electronic Collection of Swiss Legal Sources) corpus was created by the Legal Source Foundation of the Swiss Lawyers' Association. It is a collection of Swiss legal texts from the period 501 – 1882, in German, French, Italian, Rhaeto-Romanic (Romansh) and Latin. Available here.
21. The Phonogram Archives were created at the Department of Computational Linguistics. They are a collection of approx. 3500 sound recordings or carriers from all four Swiss national languages, corresponding to approx. 120 hours of processed sound material. This includes varieties of all major language areas in Switzerland, such as Swiss German dialects, Franco-Provençal “Patois”, the Lombard dialects of Ticino and parts of the Canton of Grisons and also the Rhaeto-Romance idioms. Currently, all sound carriers are being digitized. Digital versions and transcriptions are already available for many sound carriers. Access upon request.
22. The Text+Berg corpus was created at the Department of Computational Linguistics. It consists of the digitized volumes of “The yearbooks of the Swiss Alpine Club” from 1864 to 1923, the “Echo des Alpes” from 1872 to 1924, and the ALPEN from 1925 to 2011. The corpus currently comprises nearly 45 million words from more than 100,000 book pages and is variously annotated (text structure, part-of-speech, lemmas, toponyms, etc.). The following languages are represented: German, French, Italian, Rhaeto-Romanic (Romansh), Swiss German, English. Available here.
23. The Bullinger Digital corpus is being created at the Department of Computational Linguistics. It is based on the correspondence of Heinrich Bullinger: 2000 letters that he wrote and 10,000 letters that he received have been preserved. The originals are kept in the Zurich State Archives and the Zurich Central Library. 80% of the letters are in Latin, most of the others in Early New High German. About 2900 letters have already been manually transcribed and edited. They can be searched online. Another 5000 letters have been transcribed and are available as electronic texts.
24. The CoNTra_corpora: the Federal Gazette was created at the Department of Computational Linguistics. The Federal Gazette is a journal published by the Swiss Government. The journal is a political newsletter concerned with resolutions and laws of the Swiss Confederation. This corpus contains the German-French and French-German parallel sentences mined with Laser from the digitized Federal Gazette. The heavily filtered corpus contains 1.3 million parallel sentence pairs in both directions. Available here.
25. The PHOIBLE database was created at the Department of Comparative Linguistics. It is a repository of cross-linguistic phonological inventory data (more than 1000 languages), which have been extracted from source documents and tertiary databases and compiled into a single searchable convenience sample. Release 2.0 from 2019 includes 3020 inventories that contain 3183 segment types found in 2186 distinct languages. Available here.
26. The European Language Grid resource collection for the languages in Switzerland was created by the Department of Comparative Linguistics on the occasion of its participation in the European Language Equality (ELE) project. It consists of over 100 resources: corpora (<60), applications (<40) and lexica. Many of the resources are multilingual: between French, German and Italian, as well as Romansh. Access can be obtained by writing to Dr. Annette Rios or to the CLARIN-CH Coordination Office.
27. The What's up, Switzerland? corpus was created in a project funded by the SNSF and thanks to a collaboration among the Universities of Zurich, Bern, Neuchâtel and the University of Leipzig. The Swiss WhatsApp corpus is now available as an open access resource with more than 5 million tokens in all four national languages of Switzerland. The documentation and access to the corpus are available here.
28. The CLC (Chintang Language Corpus) corpus was created at the Department of Comparative Language Science. It is a multimedia corpus of Chintang (Tibeto-Burman, Nepal, ca. 5000 speakers); ca. 300 hours (1.2 million words) are transcribed, most of them translated into Nepali and English, and morphologically annotated (segments, functions, POS). It includes data from adults (responsibility Seminar for ASW) and longitudinal data on language acquisition (responsibility Psycholinguistics Unit).
29. The NNC (Nepali National Corpus) corpus was created at the Department of Comparative Language Science. It consists of Nepali texts from various genres, with 14’000’000 words. The majority of the texts are primarily written, with a small portion transcribed from recordings.
30. The SEAlang corpus was created at the Department of Comparative Language Science. It consists of audio recordings (conversations, elicitation, stories) and texts of Southeast Asian languages for linguistic purposes (language description, areal typology), as well as some texts of literary interest (Southeast Asian traditions and beliefs). The audio is transcribed partly in indigenous scripts and partly already in IPA, some with glosses and translation. Audio recordings in Mon amount to some 10 hours, Burmese about 8 hours, Karen (Pwo) and Nyahkur about 1 hour each. Transcripts of Mon texts (indigenous script and/or transcription) are estimated at 60,000 words (including literary texts), Burmese at about 30,000 words (including e-mail communication), Karen (indigenous script, handwritten) and Nyahkur (indigenous script, handwritten) at about 5000 words each. A total of 100,000 words are transcribed (a total of 50 hours). Available here.
31. The Corporum corpus was created at the Medieval Latin Seminar. It is a collection of medieval Latin texts. Available here.
32. The GeLaTo (Genes and Languages Together) database was created at the Department of Comparative Language Science in collaboration with the Department of Evolutionary Biology and Environmental Studies. It is a new resource developed to link genomic data to cultural and linguistic identifiers and promote multidisciplinary research. Access upon request.
33. The AUTOTYP database was created at the Department of Comparative Language Science in collaboration with the University of California. It represents an international network of typological linguistic databases. AUTOTYP is a large-scale research program with goals in both quantitative and qualitative typology. Quantitative typology is interested in detecting and explaining geographical distributions of typological features and in producing statistical estimates of universal preferences as well as of genealogical inheritance and areal diffusion potentials. Qualitative typology aims at a systematic analysis of the kinds of variation found in various typological domains. Available here.
34. The Zurich Corpora of Slavic Varieties (ZuCoSlaV) was created at the Department of Slavonic Studies. It consists of four corpora. 1. Macedonian Spoken Corpus, which comprises transcriptions of audio files collected in a series of field research trips in the Prespa, Bitola and Debar regions in 2012, 2014, 2016 and 2019. 2. Pre-Standardized Balkan Slavic Literature corpus, which includes various Balkan Slavic texts from the 15th-19th century. The annotated section includes 20 shorter texts with full morphological and syntactic annotation (48k tokens). The raw section contains 14 sources digitized manually or automatically as a whole (ca. 1M tokens). 3. Torlak corpus, which contains transcripts of interviews about traditional culture and history with speakers of Torlak from the Timok area. It comprises 500,697 tokens representing 80 h of recording. 4. Serbian Forms of Address corpus, which contains transcripts of interviews about forms of address that Serbian speakers use in colloquial and formal settings. It consists of 171,552 tokens, corresponding to about 19 h of recording. Available here.
Areas of expertise in the field of Linguistics:
Institute of Asian and Oriental Studies
Areas of expertise in the field of Linguistics:
Department of Comparative Language Studies
Areas of expertise in the field of Linguistics:
Department of Computational Linguistics
Areas of expertise in the field of Linguistics:
Department of Slavonic Studies
Areas of expertise in the field of Linguistics: