Tools

Research based on language data often requires specialized software and tools for data processing and analysis. A number of tools have been developed within CLARIN-CH institutions and are available to other researchers; we recommend consulting their documentation before use. If you own a tool and are willing to share it with the community, please do not hesitate to contact us.

Additionally, CLARIN centers all over Europe offer a wide variety of tools that help researchers explore and analyse language data:

Tool name	Tool type	Functionality	URL	CLARIN-CH institution
Nematus	Machine Translation Tools	Attention-based encoder-decoder model for neural machine translation built in TensorFlow.	https://github.com/EdinburghNLP/nematus	University of Zurich
SwissBERT	Language Model	SwissBERT is a masked language model for processing Switzerland-related text. It has been trained on more than 21 million Swiss news articles retrieved from Swissdox@LiRI.	https://github.com/ZurichNLP/swissbert	University of Zurich
Subword-NMT	Word Segmentation	Unsupervised Word Segmentation for Neural Machine Translation and Text Generation	https://github.com/rsennrich/subword-nmt	University of Zurich
NMTScore	Text Similarity	NMTScore is a library of translation-based text similarity measures, providing reference-free evaluation by scoring translations based on neural machine translation models.	https://github.com/ZurichNLP/nmtscore	University of Zurich
Zmorge	Morphological Analysis	Zmorge is a morphology tool that combines a lexicon that is automatically extracted from Wiktionary, and a modified version of the finite-state morphological grammar SMOR. The extraction script is open source, so that new versions of the lexicon can be extracted from future, expanded versions of Wiktionary.	https://pub.cl.uzh.ch/users/sennrich/zmorge/	University of Zurich
ParZu	Dependency Parsing Tools	ParZu is a dependency parser for German. This means that it analyzes the linguistic structure of sentences and, among other things, identifies the subject and object(s) of a verb.	https://github.com/rsennrich/ParZu	University of Zurich
clevertagger	Part-of-Speech Tagging and Lemmatisation	clevertagger is a German part-of-speech tagger based on a CRF tool and SMOR. Its main component is a module that extracts features from SMOR's morphological analysis.	https://github.com/rsennrich/clevertagger	University of Zurich
Bleualign	Sentence Alignment	Bleualign is a tool to align parallel texts (i.e. a text and its translation) on a sentence level.	https://github.com/rsennrich/bleualign	University of Zurich
Swiss German POS model	Part-of-Speech Tagging/Dependency Parsing and Lemmatisation	The swiss_german_pos_model is a part-of-speech tagging model for Swiss German. The model is trained on Universal POS tags (upos).	https://huggingface.co/noeminaepli/swiss_german_pos_model	University of Zurich
Swiss German STTS POS Tagging Model	Part-of-Speech Tagging/Dependency Parsing and Lemmatisation	The swiss_german_stts_pos_model is a part-of-speech tagging model for Swiss German. The model is trained on STTS POS tags. Note that there is also a model trained on Universal POS tags (upos): swiss_german_pos_model.	https://huggingface.co/noeminaepli/swiss_german_stts_pos_model	University of Zurich
Swiss German XLM-RoBERTa	Machine learning models	The xlm-roberta-base model (Conneau et al., ACL 2020) trained on Swiss German text data via continued pre-training.	https://huggingface.co/ZurichNLP/swiss-german-xlm-roberta-base	University of Zurich
Swiss German CANINE-s model	Machine learning models	CANINE model pretrained on Swiss German using a masked language modeling (MLM) objective. The CANINE architecture was introduced in the paper CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation by Google.	https://huggingface.co/ZurichNLP/swiss-german-canine	University of Zurich
Okra: Mobile App for Conducting Reading Comprehension Experiments	Application	Mobile (Android/iOS) app enabling participation in cloze tests, lexical decision tasks, multiple-choice reading comprehension, n-back working memory tasks, picture naming, and reaction time tests in English, German, French, and Italian.	The app, its documentation and the corresponding paper are all available on GitHub: https://github.com/saeub/okra	University of Zurich
Sign Language Processing Demo	Notebook	This Jupyter notebook demonstrates a comprehensive pipeline for Sign Language Processing (SLP), focusing on converting sign language videos into more compact and machine-readable representations. It walks through steps such as pose estimation from video, segmentation into individual signs, and gloss recognition using pretrained models. The notebook also outlines future directions in sign language assessment, embedding via SignCLIP, and translation between glosses, text, and phonetic systems like SignWriting. Overall, it offers a practical guide to building SLP systems using current tools and datasets.	The Jupyter notebook is shared via Google Colab and can be accessed at the following link: https://colab.research.google.com/drive/1VaUIdrLRWiaNGb_4z8kSl6B1ZH4HdNic#scrollTo=yw7P4wTBSA6g	University of Zurich
SwissADT – An Audio Description Translation System for Swiss Languages	Machine learning models	SwissADT is the first audio description translation system implemented for three main Swiss languages and English. By collecting well-crafted AD data augmented with video clips in German, French, Italian, and English, and leveraging the power of Large Language Models (LLMs), it aims to enhance information accessibility for diverse language populations in Switzerland by automatically translating AD scripts to the desired Swiss language. Combining human expertise with the generation power of LLMs can further enhance the performance of ADT systems, ultimately benefiting a larger multilingual target population.	The code of the model and the demo are available on GitHub: https://github.com/fischerl92/swissADT/. Additionally, there is an online demo for it: https://www.youtube.com/watch?v=5PQs8DscubU	University of Zurich
SignCLIP: Connecting Text and Sign Language by Contrastive Learning	Machine learning models	SignCLIP re-purposes CLIP (Contrastive Language-Image Pretraining) to project spoken language text and sign language videos, two classes of natural languages of distinct modalities, into the same space. SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs.	The code, its documentation and the demo are available on GitHub: https://github.com/J22Melody/fairseq/tree/main/examples/MMPT	University of Zurich
Romansh Lemmatizer	Part-of-Speech Tagging and Lemmatisation	This Python package presents a basic dictionary-based lemmatizer for the Romansh language. Provided a Romansh text, the lemmatizer splits it into words and looks up each word in the Pledari Grond dictionaries for the five standard Romansh idioms: Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader, as well as the dictionary for Rumantsch Grischun. The GitHub repository also contains a link to a demo published on HuggingFace that can be used for quick experiments and qualitative assessment of the lemmatizer.	The code, its documentation and the demo are available on GitHub: https://github.com/ZurichNLP/romansh_lemmatizer	University of Zurich
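
To illustrate the dictionary-lookup approach described for the Romansh Lemmatizer, here is a minimal sketch in Python. It is not the package's actual code or API: the function names, the idiom labels "idiom_a" and "idiom_b", and the dictionary entries are all hypothetical placeholders invented for this example, not data from the Pledari Grond. It only shows the general idea of splitting a text into words and looking each word up in one dictionary per idiom.

```python
import re

# Hypothetical toy dictionaries: idiom label -> {word form -> lemma}.
# These entries are placeholders for illustration, not Pledari Grond data.
TOY_DICTIONARIES = {
    "idiom_a": {"chasas": "chasa", "chasa": "chasa"},
    "idiom_b": {"casas": "casa", "casa": "casa"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def lemmatize(text, dictionaries=TOY_DICTIONARIES):
    """Look up every token in every idiom dictionary.

    Returns a list of (token, matches) pairs, where matches maps each
    idiom in which the token was found to the corresponding lemma.
    Tokens found in no dictionary get an empty mapping.
    """
    results = []
    for token in tokenize(text):
        matches = {
            idiom: entries[token]
            for idiom, entries in dictionaries.items()
            if token in entries
        }
        results.append((token, matches))
    return results


if __name__ == "__main__":
    for token, matches in lemmatize("Chasas, casa!"):
        print(token, matches)
```

Keeping the per-idiom matches separate, rather than returning a single lemma, reflects that a word form may appear in more than one idiom's dictionary; how the real package resolves such cases is documented in its repository.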