CLARIN-CH Documentation Platform

Welcome to the CLARIN-CH Documentation Platform. Here you will find information relevant to the different steps of your data lifecycle, which are usually covered by the Data Management Plan (DMP). The Platform offers best practices and resources that help researchers engage in FAIR-compliant data management in the context of CLARIN-CH.

open-access.network CC BY 4.0 International


Planning

Data Management Planning

Effective data management starts with the research design and involves planning ahead for the entire data lifecycle. Done well, it is not only a necessary step in the research process but also a real benefit for you and the research community. Aligning data management practices with the FAIR principles and open data standards ensures transparency, accessibility, and the potential for broader collaboration within the research community.

It is equally valuable to address ethical and legal aspects early on to ensure compliance with relevant regulations. In principle, all CLARIN-CH institutions provide legal services and ethics committees, offering expertise to navigate complexities and establish sound consent mechanisms. In addition, the CLARIN-CH working group on sensitive and personal data in the context of linguistics and social sciences addresses this topic to help researchers deal with questions that might arise when working with language data.

Creating a DMP

When receiving funding for a research project, researchers are required to submit a plan for research data management that is compliant with the FAIR principles. According to the SNSF, a data management plan should discuss the following topics:

  1. Data collection and documentation
  2. Ethics, legal and security issues
  3. Data storage and preservation
  4. Data sharing and reuse

It is crucial to include the resources and funding required to implement the planned data management practices. The SNSF supports the preparation of data and their online storage with up to 10’000 CHF, on the condition that non-commercial service providers are used. SWISSUbase offers its data storage services free of charge. Beyond funding, other resources should also be taken into account, such as staff and the extra time required for careful data management.

The DMP is meant to be used as a living document, i.e. it is good practice to periodically revise the data management plan throughout the research process. This can be done for example during project meetings held every six months, so that the plan becomes a documentation tool and a statement of quality assurance for your research data over the long term.

When dealing with sensitive data, special attention needs to be given to its organization and management. Several steps can be taken to ensure safe use of the data, for example a carefully drafted consent form and anonymization or de-identification of the data. More information on data protection, sensitive data and informed consent can be found on the following page: Data protection
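
As an illustration of the de-identification step mentioned above, the following minimal sketch replaces known personal names in a transcript with stable placeholder codes. The function name `pseudonymize`, the `PERSON_nn` code format and the sample text are assumptions made for this example. Replacing listed names is only one small part of de-identification and is not sufficient on its own to anonymize data.

```python
import re

def pseudonymize(text, names):
    """Replace each listed name with a stable placeholder code.

    Hypothetical helper for illustration: it only catches exact matches of
    the names it is given and does not detect other identifying details.
    """
    mapping = {name: f"PERSON_{i:02d}" for i, name in enumerate(names, start=1)}
    if not mapping:
        return text, mapping
    # Try longer names first so "Anna Meier" is matched before a shorter "Anna".
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], text), mapping

sample = "Anna Meier met Luca in Bern. Luca asked Anna Meier about the study."
cleaned, key = pseudonymize(sample, ["Anna Meier", "Luca"])
print(cleaned)
```

The returned mapping links codes back to real names, so it is itself personal data and would need to be stored separately from the de-identified files.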

Copyright issues should be addressed from the start: Who owns the copyright in the data? How can the data be shared, which kind of license is appropriate and are there any restrictions that might apply? If there are restrictions, how can we still make the data as FAIR as possible? To give an example, the data from the Swissdox@LiRI database cannot be shared directly due to copyright reasons. However, it is possible to share the queries that were used to download the data, and thus make it accessible for reuse. More information on copyright and Intellectual Property Rights issues can be found here: Copyright


Resources

The following resources may be helpful for researchers creating a data management plan:

  • The Swiss National Science Foundation (SNSF) provides a document with a list of questions that should be addressed in the DMP including good practices for each of them:

    Content of the mySNF form

    Further information from the SNSF, including a checklist to identify FAIR data repositories, can be found in the DMP Guidelines for researchers.
  • The FORS Center has published a guide on “How to draft a DMP from the perspective of the social sciences, using the SNSF template”. It discusses each part of the DMP giving explanations and practical tips for SSH researchers applying for research funding in Switzerland.

    FORS DMP Guide
  • A comprehensive guide on data management is offered by CESSDA, the Consortium of European Social Science Data Archives:

    Data Management Expert Guide (DMEG)

    It was designed to help researchers make their research data FAIR and includes well-structured, detailed information for each step of the data lifecycle.
  • Several CLARIN-CH institutions offer information and services that help with creating a DMP:

    UNIL UNIGE UNIBE UNINE UZH

    Let us know if your institution has a similar platform that could be useful for other CLARIN-CH members!

Data collection


Data form the foundation of research, so carefully planning and carrying out the data collection process is crucial for your research project. This helps ensure that you obtain high-quality, reliable data aligned with your research objectives.


CC BY-SA 4.0 Heinz-Vale via Wikimedia Commons

Thoughtful planning involves selecting appropriate methods, considering ethical implications and legal issues, as well as anticipating potential challenges. A well-executed data collection phase thus not only streamlines subsequent analysis but also lays the groundwork for drawing meaningful conclusions. Before collecting new data, you should always check if there are existing data sets that are suitable for your research endeavour. Find out more on how to discover and re-use data on the following page: Data reuse.

Data types

Research data can take various forms, and the data collection process differs for each of them. Data can be classified by several characteristics, one of them being the data type:

  • Textual Data: Text corpora, annotated texts, parallel corpora, learner corpora, etc.
  • Phonetic Data: Audio recordings and phonetic transcriptions
  • Lexical Data: Dictionaries, language models, lexica, glossaries, wordlists, conceptual resources (e.g. WordNet/FrameNet), etc.
  • Syntactic Data: Treebanks, dependency parsed texts, etc.
  • Semantic Data: Semantic annotations, ontologies, word embeddings, etc.
  • Discourse Data: Conversational transcripts, digital discourse, etc.
  • Experimental Data: Psycholinguistic experiments, eye-tracking, brain activity recordings, etc.
  • Language Variation Data: Dialectal data, sociolinguistic surveys, etc.
  • Multimodal Data: Video recordings, gesture annotations, etc.

The methods and instruments used for these data types each entail different challenges and questions that need to be addressed. What they all have in common, however, is the need for good documentation.

Documenting data collection

Just as important as the creation of the research data itself is the metadata that contextualizes it, supporting the interpretation of the research data and thus fostering transparency as well as reproducibility in the research field.

A number of questions about data collection should be answered by the documentation of your data:

  • For what purpose was the data created?
    Describe the project history, its aims, objectives, concepts and hypotheses.
  • What does the dataset contain?
    Give information on the type of data (interviews, images, questionnaires, etc.), file size, file formats used and relationships between files.
  • How was the data collected?
    Describe the data collection method and all sources the data come from.
  • Who collected the data and when?
    Indicate the name(s) of the data collector(s), date of data collection and geographical coverage of the data.

Source: CESSDA Data Management Expert Guide
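
The questions above can be captured in a small machine-readable record stored next to the data. The following is a minimal sketch: the field names, the helper `missing_fields` and all example values are assumptions made for illustration and do not represent a CLARIN-CH or CESSDA metadata schema.

```python
import json

# Hypothetical project-level documentation fields, one or more per question
# above. These names are illustrative, not a prescribed metadata standard.
REQUIRED_FIELDS = [
    "purpose",              # For what purpose was the data created?
    "contents",             # What does the dataset contain?
    "collection_method",    # How was the data collected?
    "collectors",           # Who collected the data ...
    "collection_dates",     # ... and when?
    "geographic_coverage",
]

def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]

# Invented example record; the empty coverage field is left blank on purpose.
record = {
    "purpose": "Example study of code-switching in informal interviews",
    "contents": {"data_type": "interviews", "file_formats": ["wav", "txt"]},
    "collection_method": "Semi-structured interviews, audio recorded",
    "collectors": ["A. Example"],
    "collection_dates": "2023-03 to 2023-06",
    "geographic_coverage": "",
}

print(missing_fields(record))
print(json.dumps(record, indent=2, ensure_ascii=False))
```

A check like this can be run before depositing data to spot documentation gaps early; the actual fields to record depend on the metadata scheme of the chosen repository.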

While these questions address general information at the project level, it is also important to be specific about the data objects themselves. Depending on whether you are dealing with qualitative or quantitative data, different requirements may apply.

  • Documentation of qualitative data should give background information and contextualize how it was created. ➡️ Check the DMEG for details and useful tips on documenting different types of qualitative data.
  • Documentation of quantitative data should describe the data file (e.g. file format, size, processing scripts, etc.) and the variables that are used in it. ➡️ Discover best practices and examples in the DMEG.

Data processing

Data processing and analysis

As every field has its own ways of analysing data, the best practices for data processing heavily depend on the methods you choose for your research. However, some things are relevant for all kinds of research:

  • Keep several copies of your data:
    It is important to have both physical and virtual copies of your research data as backups. It is also advisable to work with a systematic versioning system.
  • Ensure the integrity of your data:
    Take measures to make sure your data is accurate, consistent and complete, e.g. using automation to prevent mistakes arising from manually entered data. Chapter 3 in the CESSDA Data Management Expert Guide contains a detailed guide on this topic: Data entry and integrity
  • Choose interoperable file formats:
    When processing data, you may have to decide on file formats for the output of your analysis. Make sure to use file formats that have high compatibility and are widely used (see Standard data formats).
  • Be careful with personal/sensitive data:
    If your data contains personal information, use anonymization / de-identification procedures before carrying out data analysis (see Data protection).
  • Implement data security measures:
    Make sure your data is stored securely and can only be accessed by authorized users (see Data access and security).
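
One way to combine the first two points above is to verify that a backup copy is identical to the working data by comparing checksums. The sketch below is one possible approach using SHA-256 from the Python standard library; the helper names (`sha256_of`, `build_manifest`, `changed_files`) and the temporary demo folders are assumptions made for this example.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(folder):
    """Map each file's path (relative to the folder) to its checksum."""
    root = Path(folder)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }

def changed_files(original, copy):
    """List files that are missing from the copy or differ from the original."""
    return sorted(name for name, checksum in original.items()
                  if copy.get(name) != checksum)

# Demonstration: two temporary folders stand in for working data and backup.
with tempfile.TemporaryDirectory() as data_dir, tempfile.TemporaryDirectory() as backup_dir:
    Path(data_dir, "transcript.txt").write_text("speaker A: hello", encoding="utf-8")
    Path(backup_dir, "transcript.txt").write_text("speaker A: hallo", encoding="utf-8")
    Path(data_dir, "notes.txt").write_text("same content", encoding="utf-8")
    Path(backup_dir, "notes.txt").write_text("same content", encoding="utf-8")
    differences = changed_files(build_manifest(data_dir), build_manifest(backup_dir))

print(differences)  # only the transcript differs between the two folders
```

Saving the manifest alongside the data also gives later users a way to confirm that the files they received are complete and unaltered.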

Resources

We recommend familiarizing yourself with the tools that could be useful for processing your research data:

SSH Open Marketplace
The SSH Open Marketplace is a European discovery platform for resources from the Social Sciences and Humanities (SSH) field. It offers not only language resources but also workflows that are described step by step. For example, you can find a workflow on linguistic annotation of corpora here.


CLARIN Tools

CLARIN centers offer a wide variety of tools that help researchers explore and analyse language data. An interface has been created that combines all these tools:

The CLARIN Language Resource Switchboard is a tool that helps you to find a matching language processing web application for your data. After uploading a file or entering a URL, you can select which task to perform. The Switchboard will then provide you with a list of available CLARIN tools to analyse the input.

Have you developed your own tool which could be useful for other researchers? You can add it to the Switchboard Tool Registry. Find out more about sharing your tools here.


Forschungsdaten.info
This website, designed for researchers from the DACH countries, covers many research data management topics in great detail. You might find specific information that is relevant for your research project, for example here:

Useful tools for research data management
Working with large amounts of data
Visualizing data
Data transfer when working with sensitive data
(see also the CLARIN-CH working group on this topic)

Data sharing


How to share your language resources?

When sharing language data, the FAIR principles can guide you in making your resource available to other researchers in a useful way, thereby facilitating knowledge discovery.

You have several options to increase the FAIR-ness of your data:

For corpora

1️⃣ Publish and archive the data with LaRS@SWISSUbase

2️⃣ Include the corpus on the Linguistic Corpus Platform (LCP)

3️⃣ Add the corpus to the SSH Open Marketplace

4️⃣ Add your corpus to the webpage of the CLARIN Resource Families


For tools

1️⃣ Add your tool to the CLARIN Switchboard

2️⃣ Add your tool to the SSH Open Marketplace

3️⃣ Add your tool to the webpage of the CLARIN Resource Families


For lexical resources

1️⃣ Add your lexical resource to the SSH Open Marketplace

2️⃣ Add your lexical resource to the webpage of the CLARIN Resource Families


What are the recommended standard data formats?

Using standardized formats ensures that the data can be read and processed with widely used software. This makes your data easier to integrate into existing linguistic analysis tools and workflows, enhancing its accessibility and utility.

Additionally, standardized data formats facilitate collaboration among researchers and institutions by reducing compatibility issues and promoting interoperability. This seamless exchange of linguistic data in a common format fosters a more open and collaborative research environment, accelerating the progress of linguistic studies and advancing our understanding of language in diverse contexts.

➡️ Researchers are encouraged to prioritize the use of standardized formats to maximize the impact of their work and contribute to the advancement of their field.

You can consult this CLARIN page on format recommendations to check whether you are using one of the standardized formats. More information can be found here: Standard data formats. To convert data or file formats, the SSH Conversion Hub can help you find a suitable tool.


I want to share my data. How can I find a suitable repository?

While there are innumerable options for sharing research data, it makes sense to follow recommendations for repositories that ensure the FAIRness of your data and support open research data practices, such as this list given by the Swiss National Science Foundation (SNSF).

CLARIN-CH recommends the Language Repository of Switzerland (LaRS@SWISSUbase) and the Linguistic Corpus Platform (LCP), which are specifically tailored to linguistic data and free for members of CLARIN-CH institutions.

➡️ More options can be found here: How to find a suitable repository

Data archiving


What happens with research data after the completion of a project?

This question should already be addressed in the data management plan. Archiving is a crucial step of the research process: it ensures the preservation and accessibility of valuable data in the long term, facilitates transparency in methodology, and enables the reproducibility of findings by allowing other researchers to scrutinize and build upon previous work.

However, these benefits are only realized if the data are archived according to the FAIR and CARE principles. In some cases it can be difficult to adhere to these principles, as they may conflict with copyright or data protection. More information on how to handle sensitive data and copyright issues can be found on the following pages: Data protection & Copyright.

I want to archive my research data. How can I find a suitable repository?

As a researcher at a CLARIN-CH institution you have several options to deposit your data:

  • SWISSUbase is a national repository for research data providing researchers with a solution for long-term storage of their data. The linguistic data service unit LaRS (Language Repository of Switzerland) is an important part of CLARIN-CH, as it is a reliable way to store your research data in Switzerland and is tailored to language resources thanks to a discipline-specific metadata scheme which can easily be applied to your data.

    Go to SWISSUbase
  • DaSCH is the Swiss National Data and Service Center for the Humanities, providing expertise in research data management and long-term preservation. It was established by the Digital Humanities Lab at the University of Basel and the Swiss Academy of Humanities and Social Sciences (SAGW) in 2017 and has operated as a national research infrastructure promoting Open Data since 2021.

    Go to DaSCH
  • Many CLARIN-CH institutions offer their own data repository, such as BORIS (Bern Open Repository and Information System). University libraries usually provide archiving services as well. More recently, various institutions have established data steward services, whose staff can help you choose a repository.

    Data stewardship services in Switzerland
  • Most research data repositories are also listed on re3data.org, a global registry of research data repositories that aims at promoting a culture of sharing, increased access and better visibility of research data. It can be used as a discovery platform for data repositories, offering a short description and comprehensive metadata on each of the listed repositories, including database access and the standards that it uses.

    re3data.org


How to publish sensitive and personal data?

When your research data contains sensitive and personal data, several measures need to be taken for publication. From informed consent to de-identification and controlled access, there are many options to ensure your data is published securely. Learn more on the following pages:

Data protection
Data access and security

Data reuse


Not every research question requires the creation of new data. In fact, reusing data is an effective way to leverage what has already been done in the research community. The R in FAIR (Reusable) is thus important to keep in mind when planning your research endeavour. When suitable data exist, you can skip the time-consuming step of data collection and move on to data analysis, thereby saving resources and acknowledging other researchers’ work.

CC BY-SA 4.0 SangyaPundir via Wikimedia Commons
(altered: coloring)

Discovering data

To find language data that may be suitable for your research, CLARIN-CH recommends the following platforms:

  • SSH Open Marketplace

    The SSH Open Marketplace is a European discovery platform for resources from the Social Sciences and Humanities (SSH) field. It offers a wide range of language resources that can be browsed by category and keywords (e.g. L2 learner corpora, spoken corpora). It also includes the CLARIN Resource Families (see below).
  • CLARIN VLO

    The Virtual Language Observatory is a search tool that provides over a million records of language resources that are hosted at CLARIN institutions. It harvests metadata from CLARIN centers and makes it accessible in a simple online-tool for searching and filtering language resources.
  • LaRS@SWISSUbase

    SWISSUbase has recently been established as a national repository and data sharing platform, which is operated in partnership between FORS and the Universities of Zurich and Lausanne. It hosts the Language Repository of Switzerland (LaRS), where researchers from CLARIN-CH institutions publish and archive their language data.

These platforms all provide a metadata description of the datasets and language resources including information on accessibility and licensing. If you prefer searching by category, the CLARIN Resource Families (categorized into corpora / language tools / lexical resources) may be useful. Have a look at the resources provided by CLARIN-CH institutions to find examples from the Swiss research ecosystem (grouped by institution).

LiRI, the technical center of CLARIN-CH, also provides resources enabling data reuse. One good example is the Swissdox database: Swissdox@LiRI is an easy-to-use interface which allows researchers and students to download media data from all major and a number of minor media outlets in Switzerland. It is updated on a daily basis with 5000-6000 articles and can be used by all members of the CLARIN-CH institutions.

To learn more about the process of data discovery and how to decide which data to use, have a look at Chapter 7 in the CESSDA Data Management Expert Guide. Further information on Open Data and Data reuse can also be found on forschungsdaten.info.

Know your data

When re-using data, it is important to know precisely what the data is describing (and what it isn’t describing). The documentation of data is therefore crucial: The better the documentation, the more you will be able to understand the data. Metadata should be comprehensive and give you context on how the data has been created (see Metadata standards). In addition, to make sure that you are allowed to use the data, inform yourself about the terms and conditions (see Copyright).

The next step is familiarizing yourself with the data by looking at the individual data points and carrying out first analyses that could be relevant for your research project, e.g. descriptive statistics of a (sub)corpus you are using. It takes time to get to know the data, but the longer you work with the data, the more you will learn about it. It can be useful to keep a journal / logbook to write down insights you might have during the familiarization process.
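
A first pass at such descriptive statistics can be very simple. The sketch below counts documents, tokens and types and derives a type-token ratio for a list of texts. It assumes a naive tokenization (lowercased runs of word characters), and the function name `corpus_statistics` and the two sample sentences are invented for this example; real corpora usually call for a language-specific tokenizer.

```python
import re
from collections import Counter

def corpus_statistics(texts):
    """Return simple descriptive statistics for a list of text strings."""
    # Naive tokenization (an assumption of this sketch): lowercase the text
    # and treat every run of word characters as one token.
    tokens = [tok for text in texts for tok in re.findall(r"\w+", text.lower())]
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    return {
        "documents": len(texts),
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "most_common": counts.most_common(3),
    }

sample = ["The cat sat on the mat.", "The dog sat too."]
stats = corpus_statistics(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Figures like these are worth noting in your logbook, since they make it easy to notice later if a subcorpus was filtered or preprocessed differently than you assumed.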

Citing data

When you reuse data, attributing the data to its authors is critical. FORS, the Swiss Center of Expertise in the Social Sciences, has published a guide on Data Citation. Its most important recommendations for researchers reusing data are:

  • Cite all sources used in your publication or assignment (primary and secondary data), both in-text and in the reference list.
  • Cite the data directly, that is, where the data has been published and is accessible (for published data).
  • While citing the data, make sure that all the core components are included in the citation: data authors, publication year, title, version, data publisher, and persistent identifier. These are the minimal elements to be included.

Bornatici, C. & Fedrigo, N. (2023). Data Citation: How and Why Citing (Your Own) Data. FORS Guide No. 19, Version 1.0. Lausanne: Swiss Centre of Expertise in the Social Sciences FORS. doi:10.24449/FG-2023-00019
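
The core components listed above can be checked and assembled programmatically. The following is a minimal sketch: the function `format_data_citation`, the author-date layout of the output and the example values (including the placeholder DOI) are assumptions made for illustration. The citation style required by your publisher or repository takes precedence.

```python
def format_data_citation(authors, year, title, version, publisher, identifier):
    """Assemble a data citation and fail if a core component is missing."""
    components = {
        "authors": authors,
        "year": year,
        "title": title,
        "version": version,
        "publisher": publisher,
        "identifier": identifier,
    }
    missing = [name for name, value in components.items() if not value]
    if missing:
        raise ValueError("Missing citation components: " + ", ".join(missing))
    # Layout is an illustrative author-date pattern, not an official style.
    return (f"{' & '.join(authors)} ({year}). {title} (Version {version}). "
            f"{publisher}. {identifier}")

# Invented example values; the DOI below is a placeholder, not a real dataset.
citation = format_data_citation(
    authors=["Muster, A.", "Beispiel, B."],
    year=2024,
    title="Example Interview Corpus",
    version="1.0",
    publisher="Example Repository",
    identifier="https://doi.org/10.0000/example",
)
print(citation)
```

Raising an error for a missing component, rather than silently omitting it, keeps incomplete citations from slipping into a reference list.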