Data sharing
How to share your language resources?
When sharing language data, the FAIR principles can guide you in making your resource available to other researchers in a useful way, which in turn facilitates knowledge discovery.
You have several options to increase the FAIRness of your data:
For corpora
1️⃣ Publish and archive the data with LaRS@SWISSUbase
LaRS@SWISSUbase offers an easy-to-use and reliable platform for sharing your data. It was established in 2022 as a cross-disciplinary and FAIR-compliant national research data service. It includes a searchable catalogue with a growing number of studies and research datasets, for which SWISSUbase provides long-term storage.
In September 2022, the Language Repository of Switzerland (LaRS) was introduced as a discipline-specific data service unit (DSU) of SWISSUbase. Researchers from CLARIN-CH institutions are invited to share their publications and datasets on the platform to benefit from this new infrastructure and make their research accessible to the community.
Data can be published at various degrees of openness to account for data that include sensitive information. If your corpus cannot be shared openly, you can either publish only the metadata of your corpus, or publish the data itself under a closed type of licence.
Learn more on the SWISSUbase page for linguistic resources, where you can read the User Guide and find out more about the process as well as metadata requirements. Additionally, the FORS Center provides a detailed practical guide on sharing SSH data on SWISSUbase.
2️⃣ Include the corpus on the Linguistic Corpus Platform (LCP)
The Linguistic Corpus Platform (LCP) is being developed at LiRI as a tool to make corpora searchable through a web interface:
➡️ Check out the pre-released version of the LCP here.
The LCP can be accessed by all CLARIN-CH institutions and will offer the option to upload your own corpus for data exploration and analysis. The LCP uses its own query language, which allows for powerful, complex queries on text data and on time-aligned multimodal data, such as video recordings of sign language and interactional data.
If you want to find out more about how to use the LCP, have a look at the LCP documentation page.
3️⃣ Add the corpus to the SSH Open Marketplace
The SSH Open Marketplace is a European discovery platform for resources from the Social Sciences and Humanities (SSH) field.
In order to register your corpus, you can follow these steps (choose the Dataset item category).
4️⃣ Add your corpus on the webpage of the CLARIN Resource Families
The CLARIN Resource Families website provides an overview of the available language resources in the CLARIN infrastructure per data type. The following types of corpora are listed:
Computer-Mediated Communication Corpora
Corpora of Academic Texts
Historical Corpora
L2 Learner Corpora
Legal Corpora
Literary Corpora
Manually Annotated Corpora
Multimodal Corpora
Newspaper Corpora
Oral History Corpora
Parallel Corpora
Parliamentary Corpora
Reference Corpora
Sign Language Resources
Spoken Corpora
Discover the CLARIN Resource Families
Contact us if you want to list your corpus in one of these categories!
For tools
1️⃣ Add your tool to the CLARIN Switchboard
The CLARIN Language Resource Switchboard is a tool that helps researchers to find a matching language processing web application for their data. After uploading a file or entering a URL, the Switchboard provides a list of available CLARIN tools to perform the task indicated by the researcher (e.g. Named Entity Recognition, lemmatization, POS-tagging).
Information on how to add your tool to the Switchboard Tool Registry is available on the GitHub page. See the CLARIN Switchboard website for a list of the currently available tools.
2️⃣ Add your tool to the SSH Open Marketplace
The SSH Open Marketplace is a European discovery platform for resources from the Social Sciences and Humanities (SSH) field.
In order to register your tool, you can follow these steps (choose the Tools & services item category).
3️⃣ Add your tool on the webpage of the CLARIN Resource Families
The CLARIN Resource Families website provides an overview of the available language resources in the CLARIN infrastructure per data type. The following types of tools are listed:
Discover the CLARIN Resource Families
Contact us if you want to list your tool in one of these categories!
For lexical resources
1️⃣ Add your lexical resource on the SSH Open Marketplace
The SSH Open Marketplace is a European discovery platform for resources from the Social Sciences and Humanities (SSH) field.
In order to register your lexical resource, you can follow these steps (choose the Dataset item category).
2️⃣ Add your lexical resource on the webpage of the CLARIN Resource Families
The CLARIN Resource Families website provides a user-friendly overview per data type of the available language resources in the CLARIN infrastructure. The following types of lexical resources are listed:
Language Models
Lexica
Dictionaries
Conceptual Resources
Glossaries
Wordlists
Discover the CLARIN Resource Families
Contact us if you want to list your lexical resource in one of these categories!
Using standardized formats ensures that the data can be read and processed with widely used software. This makes your data easier to integrate into existing linguistic analysis tools and workflows, enhancing its accessibility and utility.
Additionally, standardized data formats facilitate collaboration among researchers and institutions by reducing compatibility issues and promoting interoperability. This seamless exchange of linguistic data in a common format fosters a more open and collaborative research environment, accelerating the progress of linguistic studies and advancing our understanding of language in diverse contexts.
➡️ Researchers are encouraged to prioritize the use of standardized formats to maximize the impact of their work and contribute to the advancement of their field.
You can consult this CLARIN page on format recommendations to check whether you are using one of the standardized formats. More information can be found here: Standard data formats. For converting data or file formats, consider the SSH Conversion Hub in order to find a suitable tool.
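As a first check before consulting the recommendations in detail, you could scan your corpus files for extensions that fall outside the formats you intend to use. The following Python snippet is only a minimal sketch: the set of extensions and the file names are illustrative assumptions, not the official CLARIN list, and should be replaced with the formats recommended for your data type.

```python
from pathlib import Path

# Assumption: an illustrative set of extensions, NOT the official CLARIN list.
# Replace it with the formats recommended for your own data type.
RECOMMENDED_EXTENSIONS = {".xml", ".txt", ".csv", ".tsv", ".json", ".wav", ".eaf"}


def find_nonstandard_files(filenames, recommended=RECOMMENDED_EXTENSIONS):
    """Return the file names whose extension is not in the recommended set."""
    flagged = []
    for name in filenames:
        # Compare extensions case-insensitively (".TSV" counts as ".tsv").
        suffix = Path(name).suffix.lower()
        if suffix not in recommended:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Hypothetical file names from a corpus folder.
    files = ["interview_01.wav", "interview_01.eaf", "notes.docx", "tokens.TSV"]
    for name in find_nonstandard_files(files):
        print(f"Consider converting: {name}")
```

Files flagged in this way are candidates for conversion. A file extension is only a rough indicator, so the content of each file should still be checked against the format recommendations.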
I want to share my data. How can I find a suitable repository?
While there are innumerable options for sharing research data, it makes sense to follow recommendations for repositories that ensure the FAIRness of your data and support open research data practices, such as this list given by the Swiss National Science Foundation (SNSF).
CLARIN-CH recommends the Language Repository of Switzerland (LaRS@SWISSUbase) and the Linguistic Corpus Platform (LCP), which are specifically tailored to linguistic data and free for members of CLARIN-CH institutions.
➡️ More options can be found here: How to find a suitable repository