Standard data formats

Using standardized formats ensures that your data can be processed with widely used software. This makes the data easier to integrate into existing linguistic analysis tools and workflows, which increases its accessibility and utility.

Additionally, standardized data formats facilitate collaboration among researchers and institutions by reducing compatibility issues and promoting interoperability. This seamless exchange of linguistic data in a common format fosters a more open and collaborative research environment, accelerating the progress of linguistic studies and advancing our understanding of language in diverse contexts.

➡️ Researchers are encouraged to prioritize the use of standardized formats to maximize the impact of their work and contribute to the advancement of their field.

CLARIN Standards Information System

The CLARIN Standards Information System (SIS) provides format recommendations for researchers, categorizing formats into three levels: recommended, accepted, or discouraged. The data formats can be explored by functional domain, file extension, or media type. Additionally, the SIS gives an overview of the data deposition formats supported by CLARIN centers.

CLARIN has defined the following principles on standard formats:

  • Open standards are preferred over proprietary standards
  • Formats and protocols should be:
    • Well-documented
    • Verifiable
    • Proven (being used in practice)
  • Text-based formats are (where possible) preferred over binary formats
  • In the case of digitisation of an analogue signal, using no or lossless compression is recommended.
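The preference for text-based over binary formats can be checked roughly in code. The following is a minimal sketch, not a CLARIN tool: the function name `looks_text_based` and the sample size are assumptions chosen for illustration, and the test (no NUL bytes, valid UTF-8 in the first few kilobytes) is only a heuristic.

```python
import codecs


def looks_text_based(path, sample_size=8192):
    """Heuristic: True if the start of the file contains no NUL byte
    and decodes as UTF-8. Hypothetical helper, for illustration only."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    # NUL bytes are very rare in text files and common in binary ones.
    if b"\x00" in sample:
        return False
    # An incremental decoder tolerates a multi-byte character that is
    # cut off at the end of the sample instead of reporting an error.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

A file that passes this check is not necessarily in a recommended format, and a text file in another encoding (for example UTF-16) would fail it, so the result is only a first indication.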

Read more about standards in the CLARIN environment on this page. Further information about standards in the CLARIN infrastructure can also be found in the FAQ section.

Format recommendations within CLARIN-CH

We are currently working on a survey to collect the standard formats that are used within the CLARIN-CH community. Based on the results and on requirements from our technical center LiRI, we will recommend which standards to use and indicate which are discouraged.

The EPFL Library offers a guide for research data management including format recommendations by type of data, which has proven to be useful for language data:

EPFL Fast Guide #4: Formats

Remember that standards matter not only for the research data itself but also for the metadata describing it. For more information on metadata, see: Metadata standards.

Standards are only useful if they are widely used, so familiarize yourself with the data formats used in your domain to ensure interoperability. Note which standards other researchers have used and follow the recommendations given by your technical center. The CESSDA Data Management Expert Guide also provides an informative section on File formats and data conversion, which may help you choose the right file format. If your data does not correspond to a standard format, check whether you can convert it using existing software. The SSHOC Conversion Hub provides a comprehensive inventory of solutions for data conversions:

SSHOC Conversion Hub
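For simple tabular data, a conversion can sometimes be done with a few lines of code instead of a dedicated tool. The following is a minimal sketch using only the Python standard library. It assumes a hypothetical source file that is semicolon-delimited and encoded in Latin-1, and rewrites it as comma-separated, UTF-8 encoded CSV. The function name `convert_table` and its default parameters are illustrative assumptions, not part of any CLARIN or SSHOC tool.

```python
import csv


def convert_table(src, dst, src_encoding="latin-1", src_delimiter=";"):
    """Rewrite a delimited text table as comma-separated UTF-8 CSV.

    Hypothetical helper: the default encoding and delimiter are
    assumptions about the source file and must be adapted to your data.
    Returns the number of rows written.
    """
    rows = 0
    with open(src, newline="", encoding=src_encoding) as fin, \
            open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin, delimiter=src_delimiter)
        writer = csv.writer(fout)
        for row in reader:
            # csv.writer quotes fields that contain commas or quotes.
            writer.writerow(row)
            rows += 1
    return rows
```

Always compare the converted file against the original, since a wrongly assumed source encoding produces garbled characters without raising an error.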

If you encounter problems with standardization, don’t hesitate to contact your data steward.

