Tools

Research based on language data often requires specialized software and tools for data processing and analysis. A number of tools have been developed within CLARIN-CH institutions and are open for use by other researchers; we recommend consulting their documentation before use. If you are a tool owner and would like to share your tool with the community, please do not hesitate to contact us.

Additionally, CLARIN centres across Europe offer a wide variety of tools that help researchers explore and analyse language data:

Tool name: Nematus
Tool type: Machine Translation Tools
Functionality: Attention-based encoder-decoder model for neural machine translation, built in TensorFlow.
URL: https://github.com/EdinburghNLP/nematus
CLARIN-CH institution: University of Zurich

Tool name: SwissBERT
Tool type: Language Model
Functionality: SwissBERT is a masked language model for processing Switzerland-related text. It has been trained on more than 21 million Swiss news articles retrieved from Swissdox@LiRI.
URL: https://github.com/ZurichNLP/swissbert
CLARIN-CH institution: University of Zurich

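To give a sense of how SwissBERT might be used in practice, here is a minimal sketch based on the Hugging Face transformers library; the set_default_language() call and the de_CH adapter name are assumptions derived from the X-MOD architecture that SwissBERT builds on, so consult the repository for the authoritative usage.

```python
# Minimal sketch (assumptions: transformers with X-MOD support is installed,
# and the model exposes per-language adapters selected via set_default_language).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ZurichNLP/swissbert")
model = AutoModelForMaskedLM.from_pretrained("ZurichNLP/swissbert")
model.set_default_language("de_CH")  # assumed adapter name for Swiss Standard German

text = f"Der Schnee fiel über Nacht auf die {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the most probable token at the masked position.
mask_position = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
print(tokenizer.decode(logits[0, mask_position].argmax(-1)))
```
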
Tool name: Subword-NMT
Tool type: Word Segmentation
Functionality: Unsupervised word segmentation for neural machine translation and text generation.
URL: https://github.com/rsennrich/subword-nmt
CLARIN-CH institution: University of Zurich

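As an illustration of the typical learn-then-apply workflow, the package can also be called from Python; this is a sketch, with train.txt and codes.bpe as placeholder file names.

```python
# Minimal sketch, assuming the subword_nmt pip package; file names are placeholders.
import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn 10,000 BPE merge operations from a whitespace-tokenized corpus.
with codecs.open("train.txt", encoding="utf-8") as infile, \
        codecs.open("codes.bpe", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=10000)

# Apply the learned merges to segment new text into subword units.
with codecs.open("codes.bpe", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("ein unerwartetes Beispielwort"))
```
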
Tool name: NMTScore
Tool type: Text Similarity
Functionality: NMTScore is a library of translation-based text similarity measures, providing reference-free evaluation by scoring translations based on neural machine translation models.
URL: https://github.com/ZurichNLP/nmtscore
CLARIN-CH institution: University of Zurich

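A rough sketch of how the library might be called (assuming the nmtscore pip package; the sentences are invented, and the default similarity measure and translation model are whatever the library ships with):

```python
# Minimal sketch, assuming the nmtscore pip package; NMTScorer() downloads a
# default multilingual NMT model on first use, and the sentences are illustrative.
from nmtscore import NMTScorer

scorer = NMTScorer()

# Translation-based similarity between two paraphrases (higher means more similar).
similarity = scorer.score(
    "Ein Mann isst eine Pizza.",
    "Eine Person verspeist eine Pizza.",
)
print(similarity)
```
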
Tool name: Zmorge
Tool type: Morphological Analysis
Functionality: Zmorge is a morphology tool that combines a lexicon automatically extracted from Wiktionary with a modified version of the finite-state morphological grammar SMOR. The extraction script is open source, so new versions of the lexicon can be extracted from future, expanded versions of Wiktionary.
URL: https://pub.cl.uzh.ch/users/sennrich/zmorge/
CLARIN-CH institution: University of Zurich

Tool name: ParZu
Tool type: Dependency Parsing Tools
Functionality: ParZu is a dependency parser for German. This means that it analyses the linguistic structure of sentences and, among other things, identifies the subject and object(s) of a verb.
URL: https://github.com/rsennrich/ParZu
CLARIN-CH institution: University of Zurich

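One possible way to call the parser from Python is sketched below; it assumes a local ParZu installation with its dependencies set up as described in the repository, and the parzu_class interface shown here should be checked against the README.

```python
# Minimal sketch, assuming ParZu and its dependencies (SFST, a POS tagger) are
# installed locally and the repository's parzu_class module is on the path.
from parzu_class import process_arguments, Parser

options = process_arguments(commandline=False)
parser = Parser(options)

# main() returns one CoNLL-style dependency analysis per input sentence.
for sentence in parser.main("Der Mann isst eine Pizza. Die Sonne scheint."):
    print(sentence)
```
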
Tool name: clevertagger
Tool type: Part-of-Speech Tagging and Lemmatisation
Functionality: clevertagger is a German part-of-speech tagger based on a CRF tool and SMOR. Its main component is a module that extracts features from SMOR's morphological analysis.
URL: https://github.com/rsennrich/clevertagger
CLARIN-CH institution: University of Zurich

Tool name: Bleualign
Tool type: Sentence Alignment
Functionality: Bleualign is a tool to align parallel texts (i.e. a text and its translation) at the sentence level.
URL: https://github.com/rsennrich/bleualign
CLARIN-CH institution: University of Zurich

Tool name: Swiss German POS model
Tool type: Part-of-Speech Tagging/Dependency Parsing and Lemmatisation
Functionality: The swiss_german_pos_model is a part-of-speech tagging model for Swiss German. The model is trained on Universal POS tags (UPOS).
URL: https://huggingface.co/noeminaepli/swiss_german_pos_model
CLARIN-CH institution: University of Zurich

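A minimal sketch of querying the model through the Hugging Face transformers token-classification pipeline (the example sentence is invented):

```python
# Minimal sketch, assuming the transformers library; the sentence is illustrative.
from transformers import pipeline

pos_tagger = pipeline(
    "token-classification",
    model="noeminaepli/swiss_german_pos_model",
    aggregation_strategy="simple",
)

for token in pos_tagger("Wir gönd am Samschtig mit em Zug uf Züri."):
    print(token["word"], token["entity_group"])
```
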
Tool name: Swiss German STTS POS Tagging Model
Tool type: Part-of-Speech Tagging/Dependency Parsing and Lemmatisation
Functionality: The swiss_german_stts_pos_model is a part-of-speech tagging model for Swiss German. The model is trained on STTS POS tags. Note that there is also a model trained on Universal POS tags (UPOS): swiss_german_pos_model.
URL: https://huggingface.co/noeminaepli/swiss_german_stts_pos_model
CLARIN-CH institution: University of Zurich

Tool name: Swiss German XLM-RoBERTa
Tool type: Machine learning models
Functionality: The xlm-roberta-base model (Conneau et al., ACL 2020) trained on Swiss German text data via continued pre-training.
URL: https://huggingface.co/ZurichNLP/swiss-german-xlm-roberta-base
CLARIN-CH institution: University of Zurich

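Since this is a masked language model, one straightforward way to probe it is the fill-mask pipeline from the transformers library (a sketch with an invented example sentence):

```python
# Minimal sketch, assuming the transformers library; <mask> is the XLM-RoBERTa
# mask token, and the Swiss German sentence is illustrative.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="ZurichNLP/swiss-german-xlm-roberta-base")

for prediction in fill_mask("Am Sunntig gömmer <mask> go laufe."):
    print(prediction["token_str"], round(prediction["score"], 3))
```
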
Tool name: Swiss German CANINE-s model
Tool type: Machine learning models
Functionality: A pretrained CANINE model for Swiss German, trained with a masked language modelling (MLM) objective. The CANINE architecture was introduced by Google in the paper "CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation".
URL: https://huggingface.co/ZurichNLP/swiss-german-canine
CLARIN-CH institution: University of Zurich

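Because CANINE is tokenization-free and operates directly on Unicode characters, a simple way to use the model is to extract character-level representations (a sketch assuming the transformers library):

```python
# Minimal sketch, assuming the transformers library; the encoder yields one
# hidden vector per input character, since CANINE needs no subword tokenizer.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZurichNLP/swiss-german-canine")
model = AutoModel.from_pretrained("ZurichNLP/swiss-german-canine")

inputs = tokenizer("Grüezi mitenand, wie gaht's?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (1, sequence length, hidden size)
```
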
Tool name: Okra: Mobile App for Conducting Reading Comprehension Experiments
Tool type: Application
Functionality: A mobile (Android/iOS) app enabling participation in cloze tests, lexical decision tasks, multiple-choice reading comprehension, n-back working memory tasks, picture naming, and reaction time tests in English, German, French, and Italian.
URL: The app, its documentation, and the corresponding paper are all available on GitHub: https://github.com/saeub/okra
CLARIN-CH institution: University of Zurich

Tool name: Sign Language Processing Demo
Tool type: Notebook
Functionality: This Jupyter notebook demonstrates a comprehensive pipeline for Sign Language Processing (SLP), focusing on converting sign language videos into more compact and machine-readable representations. It walks through steps such as pose estimation from video, segmentation into individual signs, and gloss recognition using pretrained models. The notebook also outlines future directions in sign language assessment, embedding via SignCLIP, and translation between glosses, text, and phonetic systems such as SignWriting. Overall, it offers a practical guide to building SLP systems using current tools and datasets.
URL: The Jupyter notebook is shared via Google Colab and can be accessed via the following link: https://colab.research.google.com/drive/1VaUIdrLRWiaNGb_4z8kSl6B1ZH4HdNic#scrollTo=yw7P4wTBSA6g
CLARIN-CH institution: University of Zurich

Tool name: SwissADT – An Audio Description Translation System for Swiss Languages
Tool type: Machine learning models
Functionality: SwissADT is the first audio description translation (ADT) system implemented for three main Swiss languages and English. By collecting well-crafted AD data augmented with video clips in German, French, Italian, and English, and leveraging the power of Large Language Models (LLMs), it aims to enhance information accessibility for diverse language populations in Switzerland by automatically translating AD scripts into the desired Swiss language. Combining human expertise with the generation power of LLMs can further enhance the performance of ADT systems, ultimately benefiting a larger multilingual target population.
URL: The code of the model and the demo are available on GitHub: https://github.com/fischerl92/swissADT/. An online demo is also available: https://www.youtube.com/watch?v=5PQs8DscubU
CLARIN-CH institution: University of Zurich

Tool name: SignCLIP: Connecting Text and Sign Language by Contrastive Learning
Tool type: Machine learning models
Functionality: SignCLIP re-purposes CLIP (Contrastive Language-Image Pretraining) to project spoken language text and sign language videos, two classes of natural languages of distinct modalities, into the same space. SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs.
URL: The code, its documentation, and the demo are available on GitHub: https://github.com/J22Melody/fairseq/tree/main/examples/MMPT
CLARIN-CH institution: University of Zurich