Tools
Research based on language data often requires specialized software and tools for data processing and analysis. A number of tools have been developed within CLARIN-CH institutions and are available to other researchers. We recommend consulting their documentation before use. If you are a tool owner and would like to share your tool with the community, please contact us.
Tool name | Tool type | Functionality | URL | CLARIN-CH institution |
---|---|---|---|---|
Nematus | Machine Translation Tools | Attention-based encoder-decoder model for neural machine translation built in TensorFlow. | https://github.com/EdinburghNLP/nematus | University of Zurich |
SwissBERT | Language Model | SwissBERT is a masked language model for processing Switzerland-related text. It has been trained on more than 21 million Swiss news articles retrieved from Swissdox@LiRI. | https://github.com/ZurichNLP/swissbert | University of Zurich |
Subword-NMT | Word Segmentation | Unsupervised word segmentation for neural machine translation and text generation. | https://github.com/rsennrich/subword-nmt | University of Zurich |
NMTScore | Text Similarity | NMTScore is a library of translation-based text similarity measures, providing reference-free evaluation by scoring translations based on neural machine translation models. | https://github.com/ZurichNLP/nmtscore | University of Zurich |
Zmorge | Morphological Analysis | Zmorge is a morphology tool that combines a lexicon automatically extracted from Wiktionary with a modified version of the finite-state morphological grammar SMOR. The extraction script is open source, so new versions of the lexicon can be extracted from future, expanded versions of Wiktionary. | https://pub.cl.uzh.ch/users/sennrich/zmorge/ | University of Zurich |
ParZu | Dependency Parsing Tools | ParZu is a dependency parser for German. This means that it analyzes the linguistic structure of sentences and, among other things, identifies the subject and object(s) of a verb. | https://github.com/rsennrich/ParZu | University of Zurich |
clevertagger | Part-of-Speech Tagging and Lemmatisation | clevertagger is a German part-of-speech tagger based on a CRF tool and SMOR. Its main component is a module that extracts features from SMOR's morphological analysis. | https://github.com/rsennrich/clevertagger | University of Zurich |
Bleualign | Sentence Alignment | Bleualign is a tool for aligning parallel texts (i.e., a text and its translation) at the sentence level. | https://github.com/rsennrich/bleualign | University of Zurich |
Swiss German POS model | Part-of-Speech Tagging/Dependency Parsing and Lemmatisation | The swiss_german_pos_model is a part-of-speech tagging model for Swiss German. The model is trained on Universal POS tags (upos). | https://huggingface.co/noeminaepli/swiss_german_pos_model | University of Zurich |
Swiss German STTS POS Tagging Model | Part-of-Speech Tagging/Dependency Parsing and Lemmatisation | The swiss_german_stts_pos_model is a part-of-speech tagging model for Swiss German. The model is trained on STTS POS tags. Note that there is also a model trained on Universal POS tags (upos): swiss_german_pos_model. | https://huggingface.co/noeminaepli/swiss_german_stts_pos_model | University of Zurich |
Swiss German XLM-RoBERTa | Machine learning models | The xlm-roberta-base model (Conneau et al., ACL 2020) trained on Swiss German text data via continued pre-training. | https://huggingface.co/ZurichNLP/swiss-german-xlm-roberta-base | University of Zurich |
Swiss German CANINE-s model | Machine learning models | CANINE model pretrained on Swiss German using a masked language modeling (MLM) objective. CANINE was introduced in the paper "CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation" by Google. | https://huggingface.co/ZurichNLP/swiss-german-canine | University of Zurich |
Additionally, CLARIN centers all over Europe offer a wide variety of tools that help researchers explore and analyse language data: