Tools

Research based on language data often requires specialized software for data processing and analysis. A number of tools have been developed within CLARIN-CH institutions and are available to other researchers; we recommend consulting each tool's documentation before use. If you own a tool and are willing to share it with the community, please do not hesitate to contact us.

| Tool name | Tool type | Functionality | URL | CLARIN-CH institution |
| --- | --- | --- | --- | --- |
| Nematus | Machine Translation Tools | Attention-based encoder-decoder model for neural machine translation built in TensorFlow. | https://github.com/EdinburghNLP/nematus | University of Zurich |
| SwissBERT | Language Model | SwissBERT is a masked language model for processing Switzerland-related text. It has been trained on more than 21 million Swiss news articles retrieved from Swissdox@LiRI. | https://github.com/ZurichNLP/swissbert | University of Zurich |
| Subword-NMT | Word Segmentation | Unsupervised word segmentation for neural machine translation and text generation. | https://github.com/rsennrich/subword-nmt | University of Zurich |
| NMTScore | Text Similarity | NMTScore is a library of translation-based text similarity measures, providing reference-free evaluation by scoring translations based on neural machine translation models. | https://github.com/ZurichNLP/nmtscore | University of Zurich |
| Zmorge | Morphological Analysis | Zmorge is a morphology tool that combines a lexicon automatically extracted from Wiktionary with a modified version of the finite-state morphological grammar SMOR. The extraction script is open source, so new versions of the lexicon can be extracted from future, expanded versions of Wiktionary. | https://pub.cl.uzh.ch/users/sennrich/zmorge/ | University of Zurich |
| ParZu | Dependency Parsing Tools | ParZu is a dependency parser for German. It analyzes the linguistic structure of sentences and, among other things, identifies the subject and object(s) of a verb. | https://github.com/rsennrich/ParZu | University of Zurich |
| clevertagger | Part-of-Speech Tagging and Lemmatisation | clevertagger is a German part-of-speech tagger based on a CRF tool and SMOR. Its main component is a module that extracts features from SMOR's morphological analysis. | https://github.com/rsennrich/clevertagger | University of Zurich |
| Bleualign | Sentence Alignment | Bleualign is a tool to align parallel texts (i.e. a text and its translation) at the sentence level. | https://github.com/rsennrich/bleualign | University of Zurich |
| Swiss German POS model | Part-of-Speech Tagging/Dependency Parsing and Lemmatisation | The swiss_german_pos_model is a part-of-speech tagging model for Swiss German. The model is trained on Universal POS tags (upos). | https://huggingface.co/noeminaepli/swiss_german_pos_model | University of Zurich |
| Swiss German STTS POS Tagging Model | Part-of-Speech Tagging/Dependency Parsing and Lemmatisation | The swiss_german_stts_pos_model is a part-of-speech tagging model for Swiss German. The model is trained on STTS POS tags. Note that there is also a model trained on Universal POS tags (upos): swiss_german_pos_model. | https://huggingface.co/noeminaepli/swiss_german_stts_pos_model | University of Zurich |
| Swiss German XLM-RoBERTa | Machine learning models | The xlm-roberta-base model (Conneau et al., ACL 2020) trained on Swiss German text data via continued pre-training. | https://huggingface.co/ZurichNLP/swiss-german-xlm-roberta-base | University of Zurich |
| Swiss German CANINE-s model | Machine learning models | CANINE model pretrained on Swiss German using a masked language modeling (MLM) objective. CANINE was introduced in the paper "CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation" by Google. | https://huggingface.co/ZurichNLP/swiss-german-canine | University of Zurich |
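
Several of the listed models are hosted on Hugging Face, so they can typically be loaded through the third-party `transformers` library. The following is a minimal sketch, not official usage: the helper names `hub_id`, `load_fill_mask`, and `demo` are hypothetical, the example sentence is purely illustrative, and it is an assumption that the Swiss German XLM-RoBERTa checkpoint works with the generic `fill-mask` pipeline. Please check the model's own documentation first.

```python
from urllib.parse import urlparse

# URL copied from the tool list above.
XLMR_URL = "https://huggingface.co/ZurichNLP/swiss-german-xlm-roberta-base"


def hub_id(url: str) -> str:
    """Turn a Hugging Face model URL into its 'owner/name' identifier."""
    parsed = urlparse(url)
    if parsed.netloc != "huggingface.co":
        raise ValueError(f"not a Hugging Face URL: {url}")
    return parsed.path.strip("/")


def load_fill_mask(model_id: str):
    """Build a fill-mask pipeline for the given model identifier.

    Assumption: the checkpoint is compatible with the generic
    `fill-mask` pipeline of the `transformers` library.
    """
    # Imported lazily so the helpers above work without the dependency.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def demo(template: str = "Mir gönd hüt i d {mask}.") -> None:
    """Print the candidates the model proposes for the masked position.

    The default sentence is an illustrative placeholder only.
    """
    fill = load_fill_mask(hub_id(XLMR_URL))
    # Use the tokenizer's own mask token rather than hard-coding it.
    text = template.format(mask=fill.tokenizer.mask_token)
    for candidate in fill(text):
        print(candidate["token_str"], round(candidate["score"], 3))
```

Calling `demo()` requires `transformers` (with a backend such as PyTorch) to be installed and downloads the checkpoint on first use; `hub_id` itself only needs the standard library.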

Additionally, CLARIN centers all over Europe offer a wide variety of tools that help researchers explore and analyse language data: