Workshop of the CLARIN-CH working group on learner corpora and SLA

The working group is pleased to announce that on Friday, 21 November 2025, a workshop will be held at the University of Fribourg.

The workshop will focus on presenting and discussing the results of the learner corpora survey. The aim of the survey was to identify existing learner corpora in Switzerland, determine which are publicly accessible, understand the barriers to sharing them and explore ways to facilitate their availability.

The workshop will feature contributions from the members of the working group in the form of presentations.

Programme of the workshop at UNIFR

Time	Programme	Speaker	Title
09:45 - 09:50	Welcome
09:50 - 10:00	Introduction	The convenors of the Working Group
10:00 - 11:00	Keynote speech	Katrin Wisniewski	Challenges in creating a large learner corpus database: insights from the DAKODA project
11:00 - 11:30	Coffee Break
11:30 - 12:10	Talk 2	Simona Pekarek Doehler and Melissa Juillet	CLA Corpus (UniNE) for longitudinal, multimodal and mixed analyses
12:10 - 13:30	Lunch
13:30 - 14:10	Talk 3	Thomas Studer	Annotation, analysis, and alternative facts: Challenges for sharing corpora taking SWIKO as an example
14:10 - 15:00	Contributions	Regula Schmidlin and Samuel Felder	QuaTexD – a corpus linguistic research project on the writing skills of L1-German-speaking Swiss high school students
		Elsa Liste Lamas	Looking back and forward: lessons learnt from the compilation of two German learner corpora
		Anita Thomas	Les interminables erreurs de transcriptions
15:00 - 15:30	Coffee Break
15:30 - 16:10	Presentation of the produced documentation	Melissa Juillet	Toward a national inventory of learner corpora in Switzerland: results of our survey, analysis of the results and next steps
16:10 - 16:30	Discussion	Cristina Grisot
16:30	End of the workshop

Workshop abstracts

Challenges in creating a large learner corpus database: insights from the DAKODA project – Katrin Wisniewski:

The DAKODA project set out to make as large a number of German learner corpora as possible available in a common format and with harmonised, comprehensive metadata. Its main aim, however, was to explore the extent to which automatic annotations of verb position can lead to reliable results. The presentation focuses on linguistic and methodological challenges and findings in this context.

CLA Corpus (UniNE) for longitudinal, multimodal and mixed analyses – Simona Pekarek Doehler and Melissa Juillet:

In this presentation, we begin by listing the corpora of authentic second language interactions (in the classroom and “in the wild”), recorded on video and audio, that we have compiled at the CLA (Centre for Applied Linguistics) at the University of Neuchâtel. We will then illustrate the different types of use to which these corpora lend themselves through two examples of studies on the development of interactional competence: a multimodal longitudinal study and a study combining conversation analysis methods with corpus linguistics. Finally, we will open a discussion on the challenges and opportunities of these corpora.

Annotation, analysis, and alternative facts: Challenges for sharing corpora taking SWIKO as an example – Thomas Studer:

SWIKO (Swiss Learner Corpus) is a German-English-French parallel corpus consisting of task-based, oral, and written productions by young learners at the entry levels of the CEFR. The annotation and analysis of these productions, which are generally short, often loosely targeted and sometimes multilingual, leave room for manoeuvre that can have considerable consequences for the results. Using written SWIKO texts in German as a foreign language as an example, this contribution discusses how multilingual tokens can be annotated, how including or excluding such tokens affects the results of lexical analyses, and what challenges this poses for sharing corpora.

Looking back and forward: lessons learnt from the compilation of two German learner corpora – Elsa Liste Lamas:

This presentation reflects on the compilation of the LeKoBe and LeKoDe-CH corpora, two error-annotated German learner corpora comprising texts from different groups of learners. It discusses the key challenges in data collection and error annotation, highlighting both cross-corpus and corpus-specific issues. Building on these challenges, the presentation then outlines how the developed procedures and chosen infrastructure can contribute to the sustainability of learner corpora and lay the groundwork for future projects.

QuaTexD – a corpus linguistic research project on the writing skills of L1-German-speaking Swiss high school students – Samuel Felder and Regula Schmidlin:

How well do German-speaking Swiss high school students write shortly before graduating? The writing skills of Swiss German high school students are increasingly viewed critically, but there is a lack of empirical research on this topic. The project QuaTexD (Qualität von Deutschschweizer Lernertexten) focuses on various linguistic dimensions in texts written by secondary school students. The texts written by the Swiss students are compared with texts written by bachelor’s degree students and with texts written by students from other regions of the German-speaking world. The project thus focuses on both developmental and regional aspects of text competence. In our talk, we will present our dataset and how we intend the corpus to be used in the future. We will show the tools we are using to build and annotate our corpus. Finally, we will explain how we are planning the publication of our corpus and what needs to be considered when anonymising the data.

Les interminables erreurs de transcriptions – Anita Thomas:

In this contribution I will discuss some of the challenges involved in publishing a two-year longitudinal corpus of oral interactions, taking the DiCoi corpus as an example. In total, the corpus comprises eight ten-minute audio recordings of 29 learners of L2 French, most of whom have Tigrinya as their L1. The corpus is due to be added to the SWIKO corpus shortly… that is, as soon as the transcriptions are presentable.

Toward a national inventory of learner corpora in Switzerland: results of our survey, analysis of the results and next steps – Melissa Juillet:

In this presentation, I will briefly retrace the history of our working group on learner corpora in Switzerland and outline the first steps in developing our survey. The survey has two aims: to establish an inventory of existing learner corpora in Switzerland in terms of their metadata, and to examine their accessibility. Who can use them? Why are they, or are they not, publicly available? The results of the survey allow us both to gain a clearer understanding of the needs of SLA researchers in Switzerland and to better grasp why (many) resources are not publicly available.