The University of Zurich is represented in the CLARIN-CH Consortium by Prof. Marianne Hundt, from the Department of English and the Zurich Center for Linguistics. Prof. Marianne Hundt is the National Coordinator of CLARIN-CH.

The community at the University of Zurich provides CLARIN-CH with numerous language resources and expertise in language sciences, and it is actively involved in research projects involving language resources.

Language resources

Corpora and databases: English

1. The BNC Dependency Bank corpus was created at the English Department. It contains 1'200'000'000 words. Access upon request.

2. The GLBCC (Giessen-Long Beach Chaplin Corpus) was created at the English Department. It is a corpus of spoken language with approx. 155,000 words, in the form of audio files and transcriptions. Access upon request.

3. The ZEN (Zurich English Newspaper) corpus was created at the English Department. It is a diachronic (1661-1791) corpus of the first English newspapers with about 1.6 million words, searchable with Corpus Navigator. Access upon request.

4. The International Corpus of English (ICE) began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Twenty-six research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team is following a common corpus design, as well as a common scheme for grammatical annotation. Until 2016, the ICE corpora were distributed by Prof. Gerry Nelson at the Department of English, The Chinese University of Hong Kong. They are now coordinated by Prof. Marianne Hundt and hosted at the English Department of the University of Zurich. Access upon request.

Corpora and databases: French

5. The PADLF (Les plus anciens documents linguistiques de la France) corpus was created at the Romance Department. The corpus is an electronic edition of 13th century documents from Lorraine dating back to 1133. The corpus is bibliographically annotated (especially places of writing) and partially lemmatized. Available here.

6. The Tzéro database was created at the Romance Department (Meisner, 2016). It is a database on negation in French, which contains approx. 2500 entries with French language data, transcribed in IPA from approx. 40 hours of conversation recordings of approx. 100 speakers from France and Switzerland. Access upon request.

Corpora and databases: German and Swiss German

7. The SenS-Korpus was created at the German Department. It consists of 34 hours of recordings of conversations of discussion groups that met as part of the Sensory Semantics project; 14 transcribed (180’000 words). Access upon request.

8. The Archimob (Archives de la mobilisation) database was created at the German Department. In the oral history project Archimob, 555 video interviews were conducted with contemporary witnesses of the Second World War in Switzerland. Of these interviews, about 50 were selected for dialectological studies and 17 of them were transcribed. Available here.

9. The Picture postcard corpus was created at the German Department. It is a corpus of currently approx. 6000 scanned postcards and 200’000 German words. Access upon request.

10. The DAVADS (Digital Audio/Video Archive) database was created at the German Department. It is a collection of approx. 700 broadcasts of Swiss television DRS in the period 1975-1999. The individual broadcasts are annotated (short description of the topics, broadcast dates, studio guests present, etc., linguistic features). 375 broadcasts are digitized. Access upon request.

11. The Metalanguage Discourses corpus was created at the German Department. It is a collection of about 1800 media documents on meta-linguistic topics (mainly Anglicisms) in the period 1990-2001. About 1400 documents of the corpus are discourse-analytically annotated via a separate database. Access upon request.

12. The Swissenker corpus was created at the German Department. It contains transcriptions of 700 Swiss Wenker questionnaires, collected in 1933/34. The Wenker questionnaire is a traditional dialectological questionnaire that requires the translation of 40 sentences from standard German into dialect. Access upon request.

13. The Zurich summer 1968 corpus was created at the German Department. It contains a total of 958 transcribed documents. Available here.

14. The project Temporal entity extraction from historical texts, carried out at the Department of Computational Linguistics, resulted in a gold standard of temporal annotations. The corpus contains 50 historical legal articles in Early New High German, annotated with a subset of the TimeML markup language for temporal annotation. It comprises about 34,000 tokens and is available here.

15. The JAKOB lexicon was created at the Department of Computational Linguistics in collaboration with the Institute of Psychology. It represents a lexicon of psychological dimensions of German words. Access upon request here.

16. The SADS (Syntactic Atlas of German-speaking Switzerland) database was created at the German Department. It is a digital atlas describing the syntactic landscape of German-speaking Switzerland. Surveys on 54 syntactic variables were conducted in 383 places with altogether 3187 informants. Access upon request.

Corpora and databases: multilingual

17. The Swissdox@LiRI database with press articles was created in collaboration with the Schweizer Mediendatenbank AG. Swissdox@LiRI consists of 29 million media articles (press, online) from a wide range of Swiss media sources covering many decades. The database is updated daily with about 5'000 to 6'000 new articles from the German and French speaking parts of Switzerland. It is designed for big data analyses. Data may be enriched optionally, automatically processed and analyzed. Access upon institutional subscription.

18. The bulletin4corpus corpus was created at the Department of Computational Linguistics. The corpus contains the Credit Suisse Bulletin, which has been published since 1895, partially in four languages: German, French, Italian and English. The magazine contains articles on economic and socially relevant topics and is therefore neither a banking magazine nor a traditional corporate magazine. This makes the Bulletin interesting as a training corpus for applications such as machine translation, since it provides access to a different genre, one that is suitable for newspapers and magazines, for instance. Available here.

19. The sms4science corpus was created at the Romance Department. It consists of approx. 26,000 SMS in all four Swiss national languages. In addition to the original texts and a normalized version with general annotations, various sub-corpora are available which are annotated with specific annotations by doctoral students. Access upon request.

20. The SMULTRON (Stockholm Multilingual Treebank) corpus was created at the Department of Computational Linguistics. It is a parallel treebank with subcorpora, each containing texts of different genres (mainly non-fiction texts) in two or more languages; five languages in total: Swiss German, German, French, Italian, Rhaeto-Romanic (Romansh). Available here.

21. The eSSRQ (Electronic Collection of Swiss Legal Sources) corpus was created by the Legal Source Foundation of the Swiss Lawyers' Association. It is a collection of Swiss legal texts from the period 501 – 1882, in German, French, Italian, Rhaeto-Romanic (Romansh) and Latin. Available here.

22. The Phonogram Archives corpus was created at the Department of Computational Linguistics. It is a collection of approx. 3500 sound recordings or carriers from all four Swiss national languages, corresponding to approx. 120 hours of processed sound material. This includes varieties of all major language areas in Switzerland, such as Swiss German dialects, franco-provençal “Patois”, the Lombard dialects of Ticino and parts of the Canton of Grisons and also the Rhaeto-Romance idioms. Currently, all sound carriers are being digitized. Digital versions and transcriptions are already available for many sound carriers. Access upon request.

23. The Text+Berg corpus was created at the Department of Computational Linguistics. It consists of the digitized volumes of “The yearbooks of the Swiss Alpine Club” from 1864 to 1923, the “Echo des Alpes” from 1872 to 1924, and the ALPEN from 1925 to 2011. The corpus currently comprises nearly 45 million words from more than 100,000 book pages and is variously annotated (text structure, part-of-speech, lemmas, toponyms, etc.). The following languages are represented: German, French, Italian, Rhaeto-Romanic (Romansh), Swiss German, English. Available here.

24. The Bullinger Digital corpus is being created at the Department of Computational Linguistics. It is based on the preserved correspondence of Heinrich Bullinger: 2000 letters that he wrote and 10,000 letters that he received. The originals are kept in the Zurich State Archives and the Zurich Central Library. 80% of the letters are in Latin, most of the others in Early New High German. About 2900 letters have already been manually transcribed and edited. They can be searched online. Another 5000 letters have been transcribed and are available as electronic texts.

25. The CoNTra_corpora: the Federal Gazette was created at the Department of Computational Linguistics. The Federal Gazette is a journal published by the Swiss Government. The journal is a political newsletter concerned with resolutions and laws of the Swiss Confederation. This corpus contains the German-French and French-German parallel sentences mined with Laser from the digitized Federal Gazette. The heavily filtered corpus contains 1.3 million parallel sentence pairs in both directions. Available here.

26. The PHOIBLE database was created at the Department of Comparative Linguistics. It is a repository of cross-linguistic phonological inventory data (more than 1000 languages), which have been extracted from source documents and tertiary databases and compiled into a single searchable convenience sample. Release 2.0 from 2019 includes 3020 inventories that contain 3183 segment types found in 2186 distinct languages. Available here.

27. The European Language Grid resource collection for the languages in Switzerland was created by the Department of Comparative Linguistics on the occasion of its participation in the European Language Equality (ELE) project. It consists of over 100 resources: corpora (<60), applications (<40) and lexica. Many of the resources are multilingual, covering French, German and Italian, as well as Romansh. Access can be acquired by writing to Dr. Annette Rios or to the CLARIN-CH Scientific Coordinator Dr. Cristina Grisot.

Corpora and databases: Other languages

28. The CLC (Chintang Language Corpus) corpus was created at the Department of Comparative Language Science. It is a multimedia corpus of Chintang (Tibeto-Burman, Nepal, ca. 5000 speakers); ca. 300 hours (1.2 million words) transcribed, most translated into Nepali and English, and morphologically annotated (segments, functions, POS). It includes data from adults (responsibility Seminar for ASW) and longitudinal data on language acquisition (responsibility Psycholinguistics Unit). Access upon request.

29. The NNC (Nepali National Corpus) corpus was created at the Department of Comparative Language Science. It consists of Nepali texts from various genres, with 14’000’000 words. The majority of the texts are primarily written, with a small portion transcribed from recordings. Access upon request.

30. The SEAlang corpus was created at the Department of Comparative Language Science. It consists of audio recordings (conversations, elicitation, stories), texts of Southeast Asian languages for linguistic purposes (language description, areal typology); some texts of literary interest (Southeast Asian traditions and beliefs); transcript of audio partly in indigenous scripts, partly already IPA, some with glosses and translation. Audio recordings in Mon amount to some 10 hours, Burmese about 8 hours, Karen (Pwo) and Nyahkur about 1 hour each. Transcripts of Mon texts (indigenous script and/or transcription) estimated 60,000 words (including literary texts), Burmese about 30,000 words (including e-mail communication), Karen (indigenous script, hand written) and Nyahkur (indigenous script, hand written) about 5000 words each. A total of 100,000 words are transcribed (a total of 50 hours). Available here.

31. The Corporum corpus was created at the Medieval Latin Seminar. It is a collection of medieval Latin texts. Available here.

32. The GeLaTo (Genes and Languages Together) database was created at the Department of Comparative Language Science in collaboration with the Department of Evolutionary Biology and Environmental Studies. It is a new resource developed to link genomic data to cultural and linguistic identifiers and promote multidisciplinary research. Access upon request.

33. The AUTOTYP database was created at the Department of Comparative Language Science in collaboration with the University of California. It represents an international network of typological linguistic databases. AUTOTYP is a large-scale research program with goals in both quantitative and qualitative typology. Quantitative typology is interested in detecting and explaining geographical distributions of typological features and in producing statistical estimates of universal preferences as well as of genealogical inheritance and areal diffusion potentials. Qualitative typology aims at a systematic analysis of the kinds of variation found in various typological domains. Available here.

34. The Zurich Corpora of Slavic Varieties (ZuCoSlaV) was created at the Department of Slavonic Studies. It consists of four corpora. 1. Macedonian Spoken Corpus, which comprises transcriptions of audio files collected in a series of field research trips in the Prespa, Bitola and Debar regions in 2012, 2014, 2016 and 2019. 2. Pre-Standardized Balkan Slavic Literature corpus, which includes various Balkan Slavic texts from the 15th-19th century. The annotated section includes 20 shorter texts with full morphological and syntactic annotation (48k tokens). The raw section contains 14 sources digitized manually or automatically as a whole (ca. 1M tokens). 3. Torlak corpus, which contains transcripts of interviews about traditional culture and history with speakers of Torlak from the Timok area. It comprises 500,697 tokens representing 80 h of recording. 4. Serbian Forms of Address corpus, which contains transcripts of interviews about forms of address that Serbian speakers use in colloquial and formal settings. It consists of 171,552 tokens, corresponding to about 19 h of recording. Available here.

Faculties and Departments involved in CLARIN-CH

Faculty of Arts and Social Sciences

1. Institute of German Studies

Areas of expertise in the field of Linguistics:

  • Conversation analysis
  • Corpus linguistics
  • Dialectology
  • Grammatical structure of Germanic languages and varieties
  • Historical linguistics
  • Language contact
  • Language in the new media
  • Language variation and change
  • Linguistic discourse and cultural analysis
  • Multimodal communication
  • Theoretical linguistics

2. English Department

Areas of expertise in the field of Linguistics:

  • Contact linguistics
  • Corpus Linguistics
  • English and Latin in Medieval England
  • English dialectology
  • Historical Pragmatics
  • Historical Syntax
  • Language variation and (ongoing) language change
  • Second language acquisition in multi-lingual contexts
  • Sociolinguistics (historical and contemporary)
  • World Englishes

3. Department of Romance Studies

Areas of expertise in the field of Linguistics:

  • Bilingualism and second language acquisition
  • Descriptive and contrastive linguistics
  • Historical linguistics
  • Lexicology (focus on loan words)
  • Morphology
  • Morphosyntax of nominal expressions and its typological implications in Romance languages
  • Phonetics and Phonology
  • Sociolinguistics and language contact
  • Spoken language and description problems in variety linguistics; language use in new media
  • Theoretical linguistics (phonology, morphology, syntax)

4. Institute of Asian and Oriental Studies

Areas of expertise in the field of Linguistics:

  • Phonology, etymology and palaeography of Old Chinese
  • Sino-Tibetan comparative linguistics
  • External contacts of Old and Middle Chinese
  • Chinese dialectology
  • History of indigenous Chinese philology and grammatology
  • Epistemological foundations of historical linguistics beyond the “standard average European”
  • Philosophy of language
  • Sanskrit-Chinese translation and lexicography
  • Grammatology of complex non-alphabetic writing systems

5. Department of Comparative Language Science

Areas of expertise in the field of Linguistics:

  • Auditory neurocognition
  • Celtic languages
  • Comparative linguistics
  • Diachronic dynamics of languages
  • Evolutionary linguistics
  • Indo-European linguistics
  • Language change and reconstructions worldwide
  • Language functions over the life span
  • Psycholinguistics (focus on processing and learning mechanisms of languages)
  • Southeast Asian languages

6. Department of Computational Linguistics

Areas of expertise in the field of Linguistics:

  • Computational Psycholinguistics
  • Cross-language information retrieval
  • Data-driven methods for NLP (corpora and treebanks)
  • Digital Humanities
  • Experimental and forensic phonetics
  • Information Extraction and Ontologies
  • Low Resource NLP
  • Neural Machine Translation (NMT)
  • Speech sciences and speech therapy
  • Translating sign languages

7. Department of Slavonic Studies

Areas of expertise in the field of Linguistics:

  • Corpus Linguistics
  • Diachronic Linguistics
  • Diatopic and diaphasic variation
  • Interfaces: semantics-pragmatics, syntax-discourse
  • Morphosyntax: monolingual, comparative & areal
  • Syntactic structures, synchronic & diachronic

Current research projects

1. The project Sino-Indo-Iranica rediviva - Early Eurasian migratory terms in Chinese and their cultural implications (Prof. Wolfgang Behr, Asia-Orient Institute) is funded by the SNSF and will carry out a comprehensive analysis of the existing linguistic and archaeological evidence to create a plausible model of the early relations between China and Central Asia, with a focus on Ancient India and Iran. The earliest Iranian and Indian loanwords from and into Chinese, as they are found in transmitted and excavated texts, will be collected, carefully examined and compared with the available archaeological and paleobotanical data. The overall aim of the project is to shed light on the earliest relationships between Inner Asia and the Ancient Near East with China, as reflected in loanwords and archaeological data. An open access database of the terms examined, which will be created in the course of the project, may also serve as the basis for future projects within a similar geotemporal framework. Thus, results from this project will be useful to enhance linguistic reconstructions of Central Asian languages and to calibrate their linguistic genealogies. Historically, the project will provide new perspectives on the relations between China and Central Asia and the history of the migration of individual ethnic groups. Given the current revival of economic and strategic interests in this area, the results may also be fundamental to a more fine-grained understanding of the historical preconditions for the current geopolitical situation in Central Asia and adjacent areas. [Ongoing]

2. The project Mourning practices on the Internet (Prof. Christa Dürscheid, German Department) examines from a linguistic perspective how people express their grief digitally on the internet, how they verbalise expressions of condolence after the loss of a loved one or another tragic event, and how public discourse about this kind of mourning is shaped. Two corpora will be constructed to investigate these questions: Corpus 1 contains data from different web sources (e.g. online memorial sites, social networks); Corpus 2 consists of media reports about online mourning practices. The data is analysed using a combination of quantitative and qualitative methods. This approach makes it possible to comprehensively examine the effects of digital mourning practices on the social discourse of mourning and to pursue the question of how new forms of farewell and condolence communities are constituted on the internet. [Ongoing]

3. The project Amish Shwitzer as a mixed language with closely related parents (Prof. Guido Seiler, German Department) is funded by the SNSF (2020-2022). The aim of the project is to describe the grammatical system of Shwitzer, a language with roots in 19th century Bernese German that has been in longstanding close contact with a number of Germanic varieties. Today, Shwitzer is spoken by a growing community of speakers in the Berne, Indiana, area.

4. The project Historical pragmatics of lawmaking (Dr. Kevin Müller, German Department) is funded by the SNSF and examines how the linguistic realisation of legal forms of action (e.g. prohibitions, authorisations) has developed diachronically. To this end, the formulations used to express such forms of action in current Swiss legal texts are compared with the formulations found for them in older Swiss legal texts. The focus of interest is the extent to which the way the state communicates, through legislation, with those subject to the law has changed over time. [2019-2022]

5. The project Glarus Dialect Dictionary Association (Dr. Kevin Müller, German Department) aims to develop a dialect dictionary for the canton of Glarus. Dialect dictionaries are usually aimed at the local population and written to be generally comprehensible, but they are also noted and used in linguistic research. Mostly they have been compiled by volunteers from the respective regions, more rarely partly or entirely by university-trained Germanists. Many of these dialect dictionaries have been included in the series “Grammar and Dictionaries of Swiss German in Generally Comprehensible Presentation”. [2021-2023]

6. The project The Field Names of the Canton of St. Gallen (Prof. Ludwig Rübekeil, German Department) is funded by the SNSF (2019-2022) and aims to compile the approx. 4,500 toponyms on the National Map 1:25000 of unsettled or cross-settlement places (such as field names, water names, terrain names, room names) of the Canton of St. Gallen and to analyse and interpret them according to linguistic criteria. The project is onomastically oriented, i.e. the linguistic analysis of the genesis, structure and function of the names is in the foreground. The historically documented collection of names is an important basis for local, regional and supra-regional research in the field of language and cultural history. However, the task of the project is also to provide basic data that various disciplines can use as a source for further research (language history, dialectology, settlement history, natural history, economic history, cultural history, religious history). [Ongoing]

7. The project gi-interactive conversation analysis (Prof. Heiko Hausendorf, German Department) aims to establish an innovative form of teaching that is tailored to the specific challenges of teaching conversation analysis (and other qualitative research fields). According to the principle of collaborative research-based learning, students are taught all the essential components of a conversation analysis study in the form of guided group work and internet-supported self-learning. [Ongoing]

8. The project Linguistic Variation in the Middle Ages as a System. A new foundation of the scriptological description of the Gallo-Romance language area (Prof. Martin-Dietrich Glessgen, Romance Department) is funded by the SNSF. The project aims to provide a new empirical and methodological basis for the scriptological description of the French and Gallo-Romance linguistic area, and thus to develop a new approach to medieval linguistic data in the Romania - and in other language families. The aim is to span the arc from the “writing site”, which is responsible for the expression of a concrete linguistic form, to the diasystematic space, in which the individual forms correspond to systematic patterns and can be evaluated as such. The very characteristic variance of the medieval written languages is thus directly located in their context of origin and thus ultimately in speech, but at the same time it can be grasped as a constitutive moment of the linguistic system. On the one hand, the project implies an adequate preparation of the comprehensive, currently available database for empirical scriptological analysis, on the other hand, the further or new development of methodologies in the processing of these data and in their electronic as well as visual representation. The database system to be designed here will then enable a new, very precise access to the phenomenon of medieval variance and the linguistic change inherent in it, and can thus become a reference tool for the entire field of research, including text philology, comparable to the ALF for the linguistic geography of modernity and the FEW for historical lexicography.

9. The project Experimental Morphosyntax of Romance Languages. Studies on object marking in Portuguese and Romansh (Prof. Johannes Kabatek, Romance Department) is funded by the SNSF. Language has traditionally been studied with two types of data: you look at how people write or speak or you ask them questions about certain forms. In the project, on the other hand, we will work with acceptability and production experiments, which allow both for a targeted investigation and for comparability of the data. Our topic is the marking of objects in Romance languages; based on our expertise in Spanish, we will now experimentally investigate Portuguese and Romansh (especially Engadin). [Ongoing]

10. The project Indefinite-definite articles in Romance and English (IDaRoE) (Prof. Marianne Hundt, English Department; Prof. Elisabeth Stark, Romance Department; Prof. Artemis Alexiadou, Humboldt University of Berlin) aims to investigate an otherwise overlooked usage of the definite article (non-specific uses of all kinds), namely definite noun phrases in contexts where only indefinite readings are allowed. This type of definite noun phrase is attested quite early in the history of Iberoromance and German and has been discussed in the semantics literature on English. All instances of ‘indefinite definite articles’ represent a puzzle for referential semantics, nominal morphosyntax, cross-linguistic comparison/typology and variationist linguistics. This raises the question of what drives the alternation between the presence vs. absence of an article in certain contexts and why such definite nouns are more widespread in some languages/varieties than in others. The overall scientific objective of the planned collaboration on ‘indefinite definite articles’ is to learn more about these structures from a cross-linguistic perspective (descriptive information from corpora and experiments). We aim to delimit the existing types of these occurrences in Romance and English, their potential text genre sensitivity and their semantics. [Ongoing]

11. The project AIS, reloaded (Prof. Michele Loporcaro, Prof. Stephan Schmid, Romance Department; Prof. Bruno Moretti, University of Bern) is funded by the SNSF and aims to analyse the diachronic evolution of Italian-Romance dialects by comparing material from the Sprach- und Sachatlas Italiens und der Südschweiz (AIS) with new data collected in the same localities almost one hundred years later. The 1705 AIS maps provide a detailed picture of 407 dialects as they were spoken between 1919 and 1928. By 2019, we intend to digitise 50% of the corpus of transcriptions contained in the AIS and collect new data from the 36 localities in southern Switzerland. [Ongoing]

12. The project Rural sociolinguistics in the Canary Islands (Rurican) (Prof. Carlota de Benito Moreno, Romance Department) is funded by the SNSF and is carried out by researchers from the University of Zurich in collaboration with other European universities. The aim is to get closer to the linguistic, social and cultural reality of the rural, semi-urban and urban environments of the islands of La Palma and La Gomera, in order to better understand the changes they are undergoing. To do this, interviews with adults (aged 20 and over) are conducted in at least five locations on each island. In these interviews, the researchers will talk about the past and present way of life in the different localities and their customs, with the aim of providing a representative sample that will allow them to study not only the language spoken in the Canary Islands in rural environments, but also the culture and tradition of these places. [Ongoing]

13. The project Prepositions in English Argument Structure (PEAS) (Prof. Marianne Hundt, English Department) is funded by the SNSF (2018-2022). Prepositions and prepositional constructions constitute an integral part of Present Day English and play a fundamental role in its system of verbal complementation: not only are they used to express adverbials of instrument, location or time (John wrote the letter with a pencil in Rome on Monday), but PPs also frequently mark the direct or indirect objects of verbs (John relied on his mother to give the book to Mary). Accordingly, their semantic and syntactic features have received much attention in the literature. Nevertheless, there is still considerable disagreement as to the precise analysis and classification of the range of prepositional patterns available today. In addition to being much broader in scope than earlier work, the project overcomes the limitations of previous studies which were based on very small datasets by using data from large historical and syntactically annotated corpora. The corpus-based approach is furthermore supplemented by evidence from mathematical modelling, specifically agent-based models. On a theoretical level, our approach is grounded in the state-of-the-art frameworks of (diachronic) construction grammar and evolutionary linguistics. By readdressing long-standing questions in English diachronic and synchronic syntax from a novel and original perspective, the project will contribute greatly to the field of language variation and change in and beyond English. [Ongoing]

14. The project Impact of Automated Generated Text on the Linguistic Intuition and the Language Evolution (IMAGINE) (Prof. Martin Volk, Department of Computational Linguistics) is part of the NCCR project Evolving Language: it is the UZH side of the CompuLang Work Package within the Digitisation Project under the Theme 3: Social Cognition of Language. The Digitisation project focuses on the future of language. It explores the impact of language technology and artificial languages on human linguistic intuitions, mental representations and speech output. In IMAGINE, the aim is to measure the impact of machine-translated output and other automatically generated texts on human language. [Ongoing]

15. The project LITHME: Language in the human-machine era is a COST Action network. At UZH, the project is represented by Prof. Martin Volk from the Department of Computational Linguistics, and it has also received funds from the SNSF (2021-2024). We live in a ‘human-machine era’, a time when our senses are not just supplemented by handheld mobile devices, but thoroughly augmented. The language we see, hear and produce will be mediated in real time by technology. This has major implications for language use, and ultimately language itself. Are linguists ready for this? Can our theory, methods, and epistemology handle it? The aims of the project are to prepare linguistics and its subdisciplines for what is coming, and to facilitate longer term dialogue between linguists and technology developers. [Ongoing]

16. The project Rich Contexts in Neural Machine Translation (Prof. Martin Volk, Department of Computational Linguistics) is funded by the SNSF (2017-2021). In the project, the team conducts research on neural machine translation systems. This family of systems is called “neural” because models are built with neural networks, for instance recurrent neural nets. Neural machine translation is currently the best performing and most widely used method for automatic translation. In this project, the team explores the potential of additional context that a neural translation system can be conditioned on, such as coreference annotations that help disambiguate and translate phenomena like pronouns, syntactic structures, multiple input languages, and document-level information. They also train MT systems for the three major languages of Switzerland (French, German and Italian) using state-of-the-art methods. The systems will be available for free. [Completed]

17. The project Automatic Translation of German Train Announcements into Swiss German Sign Language (Prof. Martin Volk, Department of Computational Linguistics) is funded by the Federal Bureau for the Equality of People with Disabilities (EBGB) and the Max-Bircher-Stiftung. In the Trainslate (= train + translate) project, we are developing a system that automatically translates German train announcements of the Swiss Federal Railways (Schweizerische Bundesbahnen, SBB) into Swiss German Sign Language. The idea for such a system was suggested to us by Deaf signers in Switzerland. [Ongoing]

18. The project Sentiment Inference (Prof. Martin Volk, Department of Computational Linguistics) is funded by the SNSF (until December 2022). It focuses on German sentiment inference based on connotation frames, identifying writer, reader and text perspective, factuality determination for inference validation, role framing, and attitude prediction. Online demo: Stancer (sentiment inference for German). [Ongoing]

19. The project Multitask Learning with Multilingual Resources for Better Natural Language Understanding (MUTAMUR) (Prof. Rico Sennrich, Department of Computational Linguistics) is funded by the SNSF Professorship Program (2019-2023). It investigates methods for knowledge sharing and transfer between machine learning models in natural language processing. Modern machine learning models in natural language processing require large amounts of training data to reach high quality. While this training data is task-specific, various tasks are related both in terms of machine learning algorithms and language representations. MUTAMUR investigates new machine learning methods to exploit this relationship and to develop better natural language processing systems for tasks and languages with small amounts of task-specific training data. [Ongoing]

20. The project Bullinger Digital (Prof. Martin Volk, Department of Computational Linguistics; Prof. Andreas Fischer, University of Applied Sciences Friborg; Prof. Tobias Hodel, University of Bern) is funded by the UZH Foundation and the Hasler Foundation (2021-2022). The current project aims at an efficient scan-to-text alignment of the transcribed letters, which will allow researchers to read the transcribed texts in sync with the scan images. On the basis of these scan-aligned transcriptions, we will train systems for handwritten text recognition (HTR) to efficiently convert the remaining 4000 letters into electronic text. Moreover, we will experiment with machine translation to translate the letters written in medieval Latin or Early New High German into modern German or English. The scans enriched in this way will be ingested into a search engine and thus made accessible to researchers and the public. [Ongoing]

21. The project Typology of Vowel and Consonant Quantity in Southern German varieties: acoustic, perception, and articulatory analyses of adult and child speakers (Prof. Stephan Schmid, Department of Computational Linguistics) is funded by the SNSF (2016-2023). Languages change over time due to variability that is already present at a certain point in time, such as occurs in language acquisition in children. This research project investigates possible factors of sound change using the durations of short and long vowels and consonants as an example. The project conducts basic research on current questions of modern linguistics, especially on the relationship between language change, language variation and language acquisition. At the same time, it contributes to a deeper knowledge of the varieties of German and its dialects spoken in southern Germany, Austria and Switzerland. [Ongoing]

22. The project Stories from the 'Little Paris' of Bulgaria. A Digital Edition of the Sbornik of Pop Punčo (1796) (Dr. Ivan Šimko, Institute of Slavonic Studies). The project will produce an online edition of the 'Sbornik of Pop Punčo' (NBKM 693). It is a manuscript from the end of the 18th century, from a region of intense cultural exchange. The document is unique on several levels: the language corresponds to an ancient state of transitional dialect between Serbian and Bulgarian; the content shows a curious mixture of sacred and secular themes; the author, quite contrary to previous manuscript tradition, even emerges with a self-portrait. The aim of the project is to process this source for modern linguistics. This involves the production of a searchable text corpus with morphological and syntactic annotation. In the process, the text will also be made available as a browser-enabled edition for philological research. The project began in September 2020 and is funded by the Empiris Foundation, Jakob Wüest Fund, via the Foundation for Scientific Research at the University of Zurich. [Ongoing]

23. The project Albanian in Contact. Horizontal Transfer and Identity Creation in Multilingualism Practice (Prof. Barbara Sonnenhauser, Department of Slavonic Languages; Prof. Claudia Maria Riehl, University of Munich). Language diversity and multilingualism are central concepts of Swiss language policy and linguistic landscape and also concern languages of origin of migrant groups. Although Albanian-speaking communities (mostly from Kosovo and from Macedonia) have been among the largest migrant groups in the German-speaking area and especially in Switzerland since the 1980s, little is known about the language and linguistic behaviour of this speaker community, which now spans several generations. In this project, a comprehensive picture of the linguistic practice of speakers of Albanian as a language of origin and of the language(s) used will be developed over time and in diverse contact situations by combining approaches and methods from linguistics and didactics of language of origin with those from contact, socio- and variation linguistics. Empirical data for contact linguistics will be made available for the estimation of contact influence for phylogenetic models, and teaching materials for the teaching of languages of origin will be created. The results will be disseminated in the form of qualification papers, scientific publications, information brochures and websites, and the collected data will be deposited and made available in suitable repositories (DaSCH). [Ongoing]

24. The project Ill-bred Sons, Family and Friends. Multiple Affiliations in Balkan Slavic (Prof. Barbara Sonnenhauser, Department of Slavonic Languages). The South Slavic dialect continuum is characterised by an intricate encounter of affiliations: genealogically, it is intersected by an old bundle of isoglosses differentiating West and East South Slavic; areally, parts of it – the ‘ill-bred’ sons, as Schleicher (1850) called them – share a number of morpho-syntactic innovations with their neighbouring non-Slavic languages. The project focuses on a set of morpho-syntactic Balkan Slavic (BS) innovations and their diffusion and integration into the South Slavic system from a diatopic and diachronic perspective by contrasting Torlak with the surrounding varieties and by drawing on evidence from pre-standardised vernacular sources. The analysis will be based on annotated corpora for each of the varieties to be compared. With more fine-grained data, it becomes possible to establish correlations between features and structures and hence to reveal usage conditions, to illustrate converging and diverging developments, in particular as concerns their functions, and to map the data in time and space by geo-referencing and visualising them with GIScience technology. To this end, existing processing tools will be improved in such a manner that they can be applied to these still under-resourced languages. This methodological aspect adds a further dimension to the project, beyond its contribution to (Slavic) dialect syntax and the linking of dialectology and areal typology. [Ongoing]

