Welcome to the CLARIN-CH Documentation Platform. Here you will find useful information relevant at the different steps of your data life-cycle, which are usually covered by the Data Management Plan. The Platform offers best practices and resources that may be helpful for researchers to engage in FAIR-compliant data management in the context of CLARIN-CH.
open-access.network CC BY 4.0 International
Effective data management starts with the research design and involves planning ahead to encompass the entire data lifecycle. It is not only a necessary step in the research process, it is also a great benefit for you and the research community, if done right. Aligning data management practices with the FAIR principles and open data standards ensures transparency, accessibility, and the potential for broader collaboration within the research community.
It is also valuable to address ethical and legal aspects early on to ensure compliance with relevant regulations. In principle, all CLARIN-CH institutions provide legal services and ethics committees, offering expertise to navigate complexities and establish sound consent mechanisms. In addition, the topic is being addressed by the CLARIN-CH working group on sensitive and personal data in the context of linguistics and social sciences to help researchers deal with questions that might come up when working with language data.
When receiving funding for a research project, researchers are required to submit a plan for research data management that is compliant with the FAIR principles. According to the SNSF, a data management plan should discuss the following topics:
It is crucial to include the resources and funding required for the implementation of the planned data management practices. The SNSF supports the preparation of data and their online storage with up to 10’000 CHF, on the condition that non-commercial service providers are used. SWISSUbase offers its data storage services for free. Besides funding, other resources such as staff and the extra time required for careful data management should also be taken into account.
The DMP is meant to be used as a living document, i.e. it is good practice to periodically revise the data management plan throughout the research process. This can be done for example during project meetings held every six months, so that the plan becomes a documentation tool and a statement of quality assurance for your research data over the long term.
When dealing with sensitive data, special attention needs to be given to its organization and management. Several steps can be taken to ensure safe use of the data, e.g. a carefully drafted consent form and anonymization or de-identification of the data. More information on data protection, sensitive data and informed consent can be found on the following page: Data protection
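As an illustration of what a first de-identification pass over transcribed text might look like, here is a minimal Python sketch. It assumes that the names to be masked are already known and that e-mail addresses are the only other identifier of interest. The function name `pseudonymize` and the placeholder labels are hypothetical, and a pass like this does not replace legal advice or a proper anonymization procedure.

```python
import re

# Simplified e-mail pattern; an assumption for this sketch, not a complete validator.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(text, known_names):
    """Replace e-mail addresses and listed names with neutral placeholders."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    # Number the names in sorted order so that the mapping is reproducible.
    for index, name in enumerate(sorted(known_names), start=1):
        pattern = re.compile(r"\b" + re.escape(name) + r"\b")
        text = pattern.sub(f"[SPEAKER_{index}]", text)
    return text


sample = "Anna wrote to Beat at beat@example.org. Anna got no reply."
print(pseudonymize(sample, {"Anna", "Beat"}))
```

Indirect identifiers such as places, professions or rare events are not caught by this kind of pattern matching and usually require manual review.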
Copyright issues should be addressed from the start: Who owns the copyright in the data? How can the data be shared, which kind of license is appropriate and are there any restrictions that might apply? If there are restrictions, how can we still make the data as FAIR as possible? To give an example, the data from the Swissdox@LiRI database cannot be shared directly due to copyright reasons. However, it is possible to share the queries that were used to download the data, and thus make it accessible for reuse. More information on copyright and Intellectual Property Rights issues can be found here: Copyright
The following resources may be helpful for researchers creating a data management plan:
Data build the foundation of research – carefully planning and carrying out the data collection process is therefore crucial for your research project. This will ensure obtaining high-quality, reliable data that aligns with your research objectives.
CC BY-SA 4.0 Heinz-Vale via Wikimedia Commons
Thoughtful planning involves selecting appropriate methods, considering ethical implications and legal issues, as well as anticipating potential challenges. A well-executed data collection phase thus not only streamlines subsequent analysis but also lays the groundwork for drawing meaningful conclusions. Before collecting new data, you should always check if there are existing data sets that are suitable for your research endeavour. Find out more on how to discover and re-use data on the following page: Data reuse.
Research data can take on various forms and for each of them, the data collection process is different. Classifications can be made according to various characteristics, one of them being the data type:
The methods and instruments used for these data types each entail different challenges and questions that need to be addressed. What is common to all of them, however, is the need for good documentation.
Just as important as the creation of the research data itself is the metadata that contextualizes it, supporting the interpretation of the research data and thus fostering transparency as well as reproducibility in the research field.
A number of questions about data collection should be answered by the documentation of your data:
Source: CESSDA Data Management Expert Guide
While these questions address general information at the project level, it is also important to be specific about the data objects themselves. Depending on whether you are dealing with qualitative or quantitative data, different requirements may apply.
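One way to keep object-level documentation close to the data is a small machine-readable metadata file stored next to each data object. The following Python sketch shows the idea. The class `RecordingMetadata` and all of its field names are hypothetical examples chosen for illustration, not a CLARIN, CESSDA or SWISSUbase schema, so they should be adapted to the metadata standard required by your repository.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class RecordingMetadata:
    # All field names are hypothetical examples, not an official schema.
    identifier: str
    title: str
    language: str           # e.g. an ISO 639-3 code
    collection_date: str    # ISO 8601 date
    collection_method: str
    licence: str
    contributors: list = field(default_factory=list)


def to_json(metadata):
    """Serialize the metadata record as human-readable JSON."""
    return json.dumps(asdict(metadata), indent=2, ensure_ascii=False)


record = RecordingMetadata(
    identifier="interview-001",
    title="Sample interview",
    language="gsw",
    collection_date="2024-01-15",
    collection_method="semi-structured interview, audio recording",
    licence="CC BY 4.0",
    contributors=["Interviewer A"],
)
print(to_json(record))
```

Writing such a record at collection time, rather than reconstructing it before publication, tends to make the later deposit considerably easier.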
As every field has its own ways of analysing data, the best practices for data processing heavily depend on the methods you choose for your research. However, some things are relevant for all kinds of research:
We recommend familiarizing yourself with the tools that could be useful for processing your research data:
SSH Open Marketplace
The SSH Open Marketplace is a European discovery platform for resources from the Social Sciences and Humanities (SSH) field. It offers not only language resources but also workflows that are carefully described in step-by-step guides. For example, you can find a workflow on linguistic annotation of corpora here.
CLARIN centers offer a wide variety of tools that help researchers explore and analyse language data. An interface has been created that combines all these tools:
The CLARIN Language Resource Switchboard is a tool that helps you to find a matching language processing web application for your data. After uploading a file or entering a URL, you can select which task to perform. The Switchboard will then provide you with a list of available CLARIN tools to analyse the input.
Have you developed your own tool which could be useful for other researchers? You can add it to the Switchboard Tool Registry. Find out more about sharing your tools here.
Forschungsdaten.info
This website, designed for researchers from DACH countries, discusses many research data management topics in great detail. You might find specific information that is relevant for your research project, for example here:
Useful tools for research data management
Working with large amounts of data
Visualizing data
Data transfer when working with sensitive data
(see also the CLARIN-CH working group on this topic)
When sharing language data, the FAIR principles can serve as a guide for making your resource available to other researchers in a useful way, thereby facilitating knowledge discovery.
You have several options to increase the FAIR-ness of your data:
1️⃣ Publish and archive the data with LaRS@SWISSUbase
LaRS@SWISSUbase offers an easy-to-use and reliable platform for sharing your data. It was established as a cross-disciplinary and FAIR-compliant national research data service in 2022. It includes a searchable catalogue with a growing number of studies and research data sets, for which SWISSUbase provides a solution for long-term storage.
➡️ Go to the SWISSUbase website here.
In September 2022, the Language Repository of Switzerland (LaRS) was introduced as a discipline-specific data service unit (DSU) of SWISSUbase. Researchers from CLARIN-CH institutions are invited to share their publications and datasets on the platform to benefit from this new infrastructure and make their research accessible to the community.
Data can be published at various degrees of openness to account for data including sensitive information. If your corpus cannot be shared openly, you can either publish only the metadata of your corpus, or publish the data itself and choose a closed type of licence.
Learn more on the SWISSUbase page for linguistic resources, where you can read the User Guide and find out more about the process as well as metadata requirements. Additionally, the FORS Center provides a detailed practical guide on sharing SSH data on SWISSUbase.
2️⃣ Include the corpus on the Linguistic Corpus Platform (LCP)
The Linguistic Corpus Platform (LCP) is being developed at LiRI as a tool to make corpora searchable through a web interface:
➡️ Check out the pre-released version of the LCP here.
The LCP can be accessed by all CLARIN-CH institutions and will offer the option to upload your own corpus for data exploration and analysis. The LCP uses its own query language which allows for powerful, complex queries on text data and time-aligned multimodal data, such as video recordings of sign language and interactional data.
If you want to find out more about how to use the LCP, have a look at the LCP documentation page.
3️⃣ Add the corpus to the SSH Open Marketplace
The SSH Open Marketplace is a European discovery platform for resources from the Social Sciences and Humanities (SSH) field.
➡️ Discover the SSH Open Marketplace here.
In order to register your corpus, you can follow these steps (choose the dataset item category).
4️⃣ Add your corpus on the webpage of the CLARIN Resource Families
The CLARIN Resource Families website provides an overview of the available language resources in the CLARIN infrastructure per data type. The following types of corpora are listed:
Discover the CLARIN Resource Families
Contact us if you want to list your corpus in one of these categories!
1️⃣ Add your tool to the CLARIN Switchboard
The CLARIN Language Resource Switchboard is a tool that helps researchers to find a matching language processing web application for their data. After uploading a file or entering a URL, the Switchboard provides a list of available CLARIN tools to perform the task indicated by the researcher (e.g. Named Entity Recognition, lemmatization, POS-tagging).
➡️ Discover the CLARIN Switchboard here.
Information on how to add your tool to the Switchboard Tool Registry is available on the GitHub page. See the CLARIN Switchboard website for a list of the currently available tools.
2️⃣ Add your tool to the SSH Open Marketplace
The SSH Open Marketplace is a European discovery platform for resources from the Social Sciences and Humanities (SSH) field.
➡️ Discover the SSH Open Marketplace here.
In order to register your tool, you can follow these steps (choose the Tools & services item category).
3️⃣ Add your tool on the webpage of the CLARIN Resource Families
The CLARIN Resource Families website provides an overview of the available language resources in the CLARIN infrastructure per data type. The following types of tools are listed:
Discover the CLARIN Resource Families
Contact us if you want to list your tool in one of these categories!
1️⃣ Add your lexical resource on the SSH Open Marketplace
The SSH Open Marketplace is a European discovery platform for resources from the Social Sciences and Humanities (SSH) field.
➡️ Discover the SSH Open Marketplace here.
In order to register your lexical resource, you can follow these steps (choose the Dataset item category).
2️⃣ Add your lexical resource on the webpage of the CLARIN Resource Families
The CLARIN Resource Families website provides a user-friendly overview per data type of the available language resources in the CLARIN infrastructure. The following types of lexical resources are listed:
Discover the CLARIN Resource Families
Contact us if you want to add your lexical resource in one of these categories!
Using standardized formats ensures that the data can be read and processed with widely used software. This makes your data easier to integrate into existing linguistic analysis tools or workflows, enhancing the accessibility and utility of your data.
Additionally, standardized data formats facilitate collaboration among researchers and institutions by reducing compatibility issues and promoting interoperability. This seamless exchange of linguistic data in a common format fosters a more open and collaborative research environment, accelerating the progress of linguistic studies and advancing our understanding of language in diverse contexts.
➡️ Researchers are encouraged to prioritize the use of standardized formats to maximize the impact of their work and contribute to the advancement of their field.
You can consult this CLARIN page on format recommendations to check whether you are using one of the standardized formats. More information can be found here: Standard data formats. For converting data or file formats, consider the SSH Conversion Hub in order to find a suitable tool.
While there are innumerable options for sharing research data, it makes sense to follow recommendations for repositories that ensure the FAIRness of your data and support open research data practices, such as this list given by the Swiss National Science Foundation (SNSF).
CLARIN-CH recommends the Language Repository of Switzerland (LaRS@SWISSUbase) and the Linguistic Corpus Platform (LCP), which are specifically tailored to linguistic data and free for members of CLARIN-CH institutions.
➡️ More options can be found here: How to find a suitable repository
This question should already be addressed in the data management plan. It is a crucial step of the research process: Archiving ensures the preservation and accessibility of valuable data in the long term, facilitates transparency in methodology, and enables the reproducibility of findings by allowing other researchers to scrutinize and build upon previous work.
However, these benefits only play out if the data are archived according to the FAIR and CARE principles. In some cases it can be difficult to stick to these principles, as they may be in conflict with copyright or data protection. More information on how to handle sensitive data and copyright issues can be found on the following pages: Data protection & Copyright.
As a researcher at a CLARIN-CH institution you have several options to deposit your data:
When your research data contains sensitive and personal data, several measures need to be taken for publication. From informed consent to de-identification and controlled access, there are many options to ensure your data is published securely. Learn more on the following pages:
Not every research endeavour needs to be addressed by creating new data. In fact, reusing data is an effective way to leverage what has already been done in the research community. The R in FAIR is thus important to keep in mind when planning your research endeavour. You can skip the time-consuming step of data collection and move on to data analysis, thereby saving resources and acknowledging other researchers’ work.
CC BY-SA 4.0 SangyaPundir via Wikimedia Commons
(altered: coloring)
To find language data that may be suitable for your research, CLARIN-CH recommends the following platforms:
These platforms all provide a metadata description of the datasets and language resources including information on accessibility and licensing. If you prefer searching by category, the CLARIN Resource Families (categorized into corpora / language tools / lexical resources) may be useful. Have a look at the resources provided by CLARIN-CH institutions to find examples from the Swiss research ecosystem (grouped by institution).
LiRI, the technical center of CLARIN-CH, also provides resources enabling data reuse. One good example is the Swissdox database: Swissdox@LiRI is an easy-to-use interface which allows researchers and students to download media data from all major and a number of minor media outlets in Switzerland. It is updated on a daily basis with 5000-6000 articles and can be used by all members of the CLARIN-CH institutions.
To learn more about the process of data discovery and how to decide which data to use, have a look at Chapter 7 in the CESSDA Data Management Expert Guide. Further information on Open Data and Data reuse can also be found on forschungsdaten.info.
When re-using data, it is important to know precisely what the data is describing (and what it isn’t describing). The documentation of data is therefore crucial: The better the documentation, the more you will be able to understand the data. Metadata should be comprehensive and give you context on how the data has been created (see Metadata standards). In addition, to make sure that you are allowed to use the data, inform yourself about the terms and conditions (see Copyright).
The next step is familiarizing yourself with the data by looking at the individual data points and carrying out first analyses that could be relevant for your research project, e.g. descriptive statistics of a (sub)corpus you are using. It takes time to get to know the data, but the longer you work with the data, the more you will learn about it. It can be useful to keep a journal / logbook to write down insights you might have during the familiarization process.
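A first look at a (sub)corpus can be as simple as counting documents, tokens and types. The Python sketch below illustrates this under a strong simplifying assumption: tokens are lowercased runs of word characters, which is not an adequate tokenization for many languages or research questions. The function name `describe_corpus` and the sample sentences are made up for illustration.

```python
import re
from collections import Counter


def describe_corpus(texts):
    """Return simple descriptive statistics for a list of text strings."""
    # Naive tokenization: lowercased runs of word characters (an assumption).
    tokens = [tok for text in texts for tok in re.findall(r"\w+", text.lower())]
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    return {
        "documents": len(texts),
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "most_common": counts.most_common(3),
    }


subcorpus = ["The cat sat on the mat.", "The dog sat too."]
stats = describe_corpus(subcorpus)
for key, value in stats.items():
    print(f"{key}: {value}")
```

Figures like these are a natural entry for the journal or logbook mentioned above, together with the date and the exact subset of the data they were computed on.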
When you reuse data, attributing the data to its authors is critical. FORS, the Swiss Center of Expertise in the Social Sciences, has published a guide on Data Citation. The most important recommendations for researchers reusing data:
Bornatici, C. & Fedrigo, N. (2023). Data Citation: How and Why Citing (Your Own) Data. FORS Guide No. 19, Version 1.0. Lausanne: Swiss Centre of Expertise in the Social Sciences FORS. doi:10.24449/FG-2023-00019