Metadata standards

Metadata plays an integral role in creating FAIR (findable, accessible, interoperable, reusable) research data. It enables meaningful (re-)use by making the data easier to interpret, so that other researchers can reproduce findings and potentially answer new research questions.

There are several different types of metadata (as proposed by the ANDS, descriptions from the UNIGE website):

Metadata should be as complete as possible, follow the standards and conventions of the discipline in question, and be machine-readable. This is especially important for making your research data findable: CLARIN offers several platforms that harvest metadata to make datasets, tools and other language resources accessible to the research community. One example is the Virtual Language Observatory (VLO), which gathers over a million records of language resources hosted at CLARIN centers.


Controlled vocabularies

Controlled vocabularies are used for organizing knowledge and data in a structured way to make it retrievable for future use. Within a controlled vocabulary, concepts and terms are defined as data descriptors, which can be related to each other hierarchically or associatively. Thus, knowledge is stored in a machine-readable information system, through which data can be queried, retrieved, analyzed and linked to each other.
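To make the idea of hierarchical and associative relations more concrete, here is a minimal sketch of a controlled vocabulary as a data structure. The terms, the Concept class and the helper functions are all hypothetical and chosen only for illustration; they are not taken from any published vocabulary or from a CLARIN service.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    """One entry in a hypothetical controlled vocabulary."""
    label: str
    broader: Optional[str] = None                   # hierarchical relation (parent concept)
    related: set[str] = field(default_factory=set)  # associative relations


# Illustrative terms only (assumption): not an official vocabulary.
VOCABULARY = {
    "language resource": Concept("language resource"),
    "corpus": Concept("corpus", broader="language resource", related={"annotation"}),
    "spoken corpus": Concept("spoken corpus", broader="corpus"),
    "annotation": Concept("annotation", broader="language resource", related={"corpus"}),
}


def ancestors(term: str) -> list[str]:
    """Walk the hierarchy upward from a term to its top-level concept."""
    chain = []
    current = VOCABULARY[term].broader
    while current is not None:
        chain.append(current)
        current = VOCABULARY[current].broader
    return chain


def narrower(term: str) -> list[str]:
    """Return the terms whose direct parent is the given term."""
    return sorted(t for t, c in VOCABULARY.items() if c.broader == term)
```

Because every descriptor is drawn from the same fixed set of terms, a query for "corpus" can also retrieve records tagged with the narrower term "spoken corpus", which is one way such a structure supports retrieval and linking.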

Different types of controlled vocabularies include (according to DARIAH-CAMPUS):

By providing shared definitions of data descriptors and their relations, controlled vocabularies reduce the ambiguity of terms used in research and thus contribute to a common understanding within the field. They also play a significant part in implementing the FAIR principles, as they facilitate not only data organization but also data retrieval, exploration and interoperability. Controlled vocabularies are also used in the Semantic Web to give data a machine-readable meaning, for example based on the Resource Description Framework (RDF), which allows metadata to be shared and retrieved across different applications. If you want to learn more about controlled vocabularies, read the tutorial provided by DARIAH.

CLARIN and SSHOC have teamed up to explore the possibilities for collecting, registering and harmonizing domain-specific controlled vocabularies in the SSH sector and managing them on a single platform. An overview of the gathered insights can be found online. CLARIN has also set up a vocabulary and alignment service, CLAVAS, which currently offers ISO 639-3, a list of language codes based on the Ethnologue.


Metadata standards in the CLARIN ecosystem

CLARIN uses the Component MetaData Infrastructure (CMDI), a metadata infrastructure designed to allow CLARIN centers and researchers to re-use existing data description formats and adapt the metadata schemes to their needs. This is achieved through the use of Metadata Components, which are listed in the Component Registry. Metadata records are expressed in XML files with a link to the corresponding metadata profile.
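As a rough illustration of how an XML record can point to its metadata profile, the following sketch parses a heavily simplified record and reads out the profile identifier. It assumes that the record uses the namespace http://www.clarin.eu/cmd/1 and carries the identifier in a header element named MdProfile; the record itself, the profile ID and the creator name are placeholders, so check real records from your repository before relying on this layout.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from typing import Optional

# Assumed namespace for the record envelope; verify against your own records.
CMD_NS = {"cmd": "http://www.clarin.eu/cmd/1"}

# Simplified, hypothetical record: the profile ID and creator are placeholders.
RECORD = """<cmd:CMD xmlns:cmd="http://www.clarin.eu/cmd/1" CMDVersion="1.2">
  <cmd:Header>
    <cmd:MdCreator>example-creator</cmd:MdCreator>
    <cmd:MdProfile>clarin.eu:cr1:p_0000000000000</cmd:MdProfile>
  </cmd:Header>
</cmd:CMD>"""


def profile_id(xml_text: str) -> Optional[str]:
    """Return the profile identifier named in the record header, if any."""
    root = ET.fromstring(xml_text)
    node = root.find("cmd:Header/cmd:MdProfile", CMD_NS)
    if node is None or not node.text:
        return None
    return node.text.strip()
```

Calling profile_id(RECORD) returns the placeholder identifier, which a tool could then use to look up the matching profile and decide how to interpret the rest of the record.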

SWISSUbase, the national repository of CLARIN-CH, has its own metadata scheme based on Meta-Share, which is nevertheless intended to be easily convertible to CMDI. For information on the metadata scheme used for the LaRS (Language Repository of Switzerland), see the Linguistics Metadata Guide.

The Data Documentation Initiative (DDI) is a free international standard for describing the data produced by surveys and other observational methods in the social, behavioral, economic, and health sciences. It was adopted as a recommended standard by CESSDA and many other institutions.

You might also encounter other metadata standards used in research. Dublin Core, for example, is another widely used standard for machine-readable metadata descriptions, and many tools have been developed for its implementation, including an online generator. Similarly, a record following the DataCite Metadata Schema can be compiled online using the DataCite generator.
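For readers who prefer to see what such a machine-readable description looks like, here is a minimal sketch that serializes a few Dublin Core elements to XML using only the Python standard library. The wrapper element name "metadata", the function name and all field values are assumptions made for this example; the online generators mentioned above may produce a different layout.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Namespace of the 15 elements in the Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_record(fields: dict[str, str]) -> str:
    """Serialize a flat mapping of Dublin Core element names to an XML string."""
    unknown = set(fields) - DC_ELEMENTS
    if unknown:
        raise ValueError(f"Not Dublin Core elements: {sorted(unknown)}")
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for name, value in fields.items():
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values for illustration only.
EXAMPLE = dublin_core_record({
    "title": "Example corpus (placeholder title)",
    "creator": "Example, Researcher",
    "language": "deu",
    "type": "Dataset",
})
```

Rejecting unknown element names up front is a deliberate choice in this sketch: it keeps typos from silently producing fields that harvesters would not recognize.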


Resources

Further information on the topic can be found here:

UNIGE: Creating Metadata

UNIL: Organisation and description of research data

UZH UB: Data documentation

CESSDA: Data Management Expert Guide

DCC: Metadata in Social Science & Humanities

Let us know if there are other resources that CLARIN-CH members should know about.