Standard data formats
Why should I use standard data formats?
Using standardized formats ensures that the data can be processed with widely used software. This makes your data easier to integrate into existing linguistic analysis tools and workflows, enhancing its accessibility and utility.
Additionally, standardized data formats facilitate collaboration among researchers and institutions by reducing compatibility issues and promoting interoperability. This seamless exchange of linguistic data in a common format fosters a more open and collaborative research environment, accelerating the progress of linguistic studies and advancing our understanding of language in diverse contexts.
Researchers are encouraged to prioritize the use of standardized formats to maximize the impact of their work and contribute to the advancement of their field.
The CLARIN Standards Information System (SIS) provides format recommendations for researchers, categorizing formats into three levels: recommended, accepted, or discouraged. The data formats can be explored by functional domain, file extension, or media type. Additionally, the SIS gives an overview of the data deposition formats supported by CLARIN centers.
CLARIN has defined the following principles on standard formats:
- Open standards are preferred over proprietary standards
- Formats and protocols should be:
  - Well-documented
  - Verifiable
  - Proven (being used in practice)
- Text-based formats are (where possible) preferred over binary formats
- When digitising an analogue signal, using either no compression or lossless compression is recommended
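To make the term "lossless" in the last principle concrete, here is a minimal Python sketch. It is an illustration only, not a CLARIN recommendation: the sine tone is synthetic stand-in data, the helper names `make_pcm_samples` and `roundtrip_is_lossless` are hypothetical, and `zlib` is used purely as a general-purpose lossless compressor from the standard library. Audio archives would typically rely on a dedicated lossless format such as FLAC instead.

```python
import math
import struct
import zlib


def make_pcm_samples(n_samples: int = 8000, rate: int = 8000, freq: float = 440.0) -> bytes:
    """Synthesize a 16-bit mono sine tone as a stand-in for a digitised signal."""
    values = [
        int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / rate))
        for i in range(n_samples)
    ]
    # "<h" = little-endian signed 16-bit integers, one per sample.
    return struct.pack(f"<{n_samples}h", *values)


def roundtrip_is_lossless(data: bytes) -> bool:
    """Compress and decompress; lossless means the bytes come back unchanged."""
    compressed = zlib.compress(data, 9)
    return zlib.decompress(compressed) == data


pcm = make_pcm_samples()
# Prints the raw size, the compressed size, and whether the round trip was exact.
print(len(pcm), len(zlib.compress(pcm, 9)), roundtrip_is_lossless(pcm))
```

The point of the check is that every original sample can be recovered bit for bit. A lossy codec would fail this comparison, which is why lossy compression is discouraged for archival copies.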
Read more about standards in the CLARIN environment on this page. Further information about standards in the CLARIN infrastructure can also be found in the FAQ section.
One good example of a standard is the Text Encoding Initiative (TEI) format, which has become widely accepted as the XML standard for representing various kinds of literary and linguistic texts in digital form. TEI is based on a machine-readable encoding scheme that is maximally expressive and minimally obsolescent. The TEI Consortium collectively develops and maintains the standard, offering guidelines as well as tools and other resources for implementing it.
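As a rough illustration of why an XML-based, text-based standard is easy to process with widely used software, the following Python sketch reads a very small TEI document using only the standard library. The sample document, its title and paragraph text, and the helper name `read_tei` are invented for this example. Real TEI files are usually much richer and may call for a dedicated XML toolchain.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; the "tei" prefix is just a local alias for queries.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A deliberately tiny, made-up TEI document (header plus one paragraph).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample text</title></titleStmt>
      <publicationStmt><p>Unpublished example.</p></publicationStmt>
      <sourceDesc><p>Created for illustration.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>A short <hi rend="italic">example</hi> paragraph.</p>
    </body>
  </text>
</TEI>"""


def read_tei(xml_text: str):
    """Return the title and the plain text of each body paragraph."""
    root = ET.fromstring(xml_text)
    title = root.findtext(
        "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title",
        namespaces=TEI_NS,
    )
    paragraphs = [
        # itertext() flattens inline markup such as <hi> into plain text.
        "".join(p.itertext()).strip()
        for p in root.findall("tei:text/tei:body/tei:p", TEI_NS)
    ]
    return title, paragraphs


title, paragraphs = read_tei(SAMPLE_TEI)
print(title)
print(paragraphs)
```

Because the header and the text body follow a documented structure, the same few lines work on any TEI file that uses these elements, with no format-specific software required.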
Format recommendations within CLARIN-CH
We have carried out a survey among the CLARIN-CH community to examine which data formats are used among researchers from our member institutions. The results can be viewed here. Based on these results we have submitted our recommendations to the CLARIN Standards Information System:
Further resources
The EPFL Library offers a guide for research data management including format recommendations by type of data, which has proven to be useful for language data:
Remember that standards matter not only for the research data itself but also for the metadata describing it. For more information on metadata, see: Metadata standards
Standards only make sense if they are widely used, so familiarize yourself with the data formats used in your domain to ensure interoperability. Note which standards other researchers have used, and follow the recommendations given by your technical center.
The CESSDA Data Management Expert Guide also provides an informative section on File formats and data conversion, which may help you choose the right file format. If your data does not correspond to a standard format, check whether you can convert it using existing software. The SSHOC Conversion Hub provides a comprehensive inventory of solutions for data conversions:
If you encounter problems with standardization issues, you can contact your data steward: