Metadata standards

Metadata plays an integral role in the creation of FAIR research data. It ensures meaningful (re-)use of the data by facilitating the interpretation of the data, allowing other researchers to reproduce findings and potentially also answer new research questions.

There are several different types of metadata (as proposed by the ANDS, descriptions from the UNIGE website):

  • Descriptive Metadata
    Provides a description of the data which serves to facilitate discovery, evaluation and understanding of its content, e.g. Title, Author, Location, Language, DOI, etc.
  • Technical Metadata
    Gives information for ensuring the interoperability of data, e.g. Which data formats are used? What are the configurations of the database?
  • Provenance Metadata
    Describes the origin of the data, e.g. Why was it collected? Who collected it, where and when? What instruments and technologies were used to collect the data? How was the data processed?
  • Rights and Access Metadata
    Gives information on the copyright status, licence conditions and rights holders, and describes how the data can be accessed and used.
  • Citation Metadata
    Contains the information necessary to cite the data appropriately.

Metadata should be as complete as possible, using the standards and conventions of the discipline in question, and it should be machine readable. This is especially important to make your research data findable: CLARIN offers several platforms that harvest metadata to make datasets, tools and other language resources accessible to the research community. One example is the Virtual Language Observatory (VLO), where over a million records of language resources hosted at CLARIN centers are gathered.
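
As a minimal sketch of what "machine readable" can mean in practice, the metadata types listed above can be grouped into one structured record and serialized as JSON. All field names and values below are illustrative assumptions and do not follow any CLARIN or SWISSUbase schema; the helper missing_types is likewise hypothetical.

```python
import json

# Hypothetical record: every field name and value is an illustrative
# assumption, not part of an official metadata schema.
record = {
    "descriptive": {
        "title": "Example Corpus of Spoken Interviews",
        "author": "Jane Doe",
        "language": "deu",
    },
    "technical": {"format": "text/xml", "encoding": "UTF-8"},
    "provenance": {
        "collected_by": "Jane Doe",
        "collected_in": "2023",
        "instrument": "audio recorder",
    },
    "rights_and_access": {"licence": "CC BY 4.0", "access": "open"},
    "citation": {"publisher": "Example Repository", "year": "2024"},
}

# The five metadata types named in the list above.
REQUIRED_TYPES = (
    "descriptive",
    "technical",
    "provenance",
    "rights_and_access",
    "citation",
)


def missing_types(rec):
    """Return the metadata types that are absent or empty in rec."""
    return [t for t in REQUIRED_TYPES if not rec.get(t)]


# Serialize to JSON so that other tools can read the record.
serialized = json.dumps(record, indent=2, ensure_ascii=False)
print(serialized)
print("Missing types:", missing_types(record))
```

A completeness check of this kind is only a rough aid: it shows whether a type of metadata is present, not whether its content follows the conventions of the discipline.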

Controlled vocabularies

Controlled vocabularies are used for organizing knowledge and data in a structured way to make it retrievable for future use. Within a controlled vocabulary, concepts and terms are defined as data descriptors, which can be related to each other hierarchically or associatively. Thus, knowledge is stored in a machine-readable information system, through which data can be queried, retrieved, analyzed and linked to each other.

Different types of controlled vocabularies include (according to DARIAH-CAMPUS):

  • Thesaurus - a type of controlled vocabulary used in information systems that organizes concepts in hierarchical and/or associative relationships and provides their semantic definitions
  • Classification schema - a system based primarily on classifying things or concepts into groups or classes, with a detailed explanation of the classification methods
  • Subject heading list - a list of terms describing subjects in an information system
  • Taxonomy - a system that organizes things and concepts in groups based on their common characteristics and/or differences
  • Terminology - a list of terms used to describe concepts in a certain domain
  • Glossary - an alphabetical list of terms with their explanation used in a specific context

Using a shared definition of data descriptors and their relations, controlled vocabularies can reduce the ambiguity of terms used in research and thus contribute to a common understanding within the field. Additionally, they play a significant part in the implementation of the FAIR principles as they facilitate not only data organization, but also data retrieval, exploration and interoperability. Controlled vocabularies are also used in the Semantic Web to give data a machine-readable meaning, for example based on the Resource Description Framework (RDF), which allows metadata to be shared and retrieved across different applications. If you want to learn more about controlled vocabularies, read the tutorial provided by DARIAH.
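
The hierarchical and associative relations described above can be illustrated with a small sketch. It uses a plain Python dictionary rather than RDF, and every term and relation in it is invented for illustration. The function names (narrower, ancestors, expand_query) are assumptions as well.

```python
# Hypothetical mini-thesaurus: all terms and relations are invented.
# "broader" models the hierarchical relation, "related" the associative one.
thesaurus = {
    "language resource": {"broader": None, "related": []},
    "corpus": {"broader": "language resource", "related": ["lexicon"]},
    "lexicon": {"broader": "language resource", "related": ["corpus"]},
    "spoken corpus": {"broader": "corpus", "related": []},
    "written corpus": {"broader": "corpus", "related": []},
}


def narrower(term):
    """Return the direct narrower terms of term, sorted alphabetically."""
    return sorted(t for t, entry in thesaurus.items() if entry["broader"] == term)


def ancestors(term):
    """Return the chain of broader terms from term up to the root."""
    chain = []
    parent = thesaurus[term]["broader"]
    while parent is not None:
        chain.append(parent)
        parent = thesaurus[parent]["broader"]
    return chain


def expand_query(term):
    """Return term plus all of its narrower terms, found recursively."""
    result = [term]
    for child in narrower(term):
        result.extend(expand_query(child))
    return result


print(expand_query("corpus"))
print(ancestors("spoken corpus"))
```

Expanding a query in this way means that a search for a broad term can also retrieve data described with more specific terms, which is one way a vocabulary supports retrieval.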

CLARIN and SSHOC have teamed up to explore the possibilities for collecting, registering and harmonizing domain-specific controlled vocabularies in the SSH sector and managing them on a single platform. An overview of the gathered insights can be found online. CLARIN has set up a vocabulary and alignment service, CLAVAS, which currently offers ISO 639-3, a list of language codes based on the Ethnologue.
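
To show how a controlled list of language codes reduces ambiguity, the following sketch checks free-text language values against a small hard-coded subset of ISO 639-3 codes. The subset, the function names and the sample values are assumptions for illustration. The sketch does not query CLAVAS, and a real workflow would load the full code list from an authoritative source.

```python
# Illustrative subset of ISO 639-3 codes, hard-coded for this sketch only.
ISO_639_3_SUBSET = {
    "deu": "German",
    "fra": "French",
    "ita": "Italian",
    "roh": "Romansh",
    "gsw": "Swiss German",
}


def validate_language(value):
    """Return the normalized code, or raise ValueError if it is unknown."""
    code = value.strip().lower()
    if code not in ISO_639_3_SUBSET:
        raise ValueError(f"{value!r} is not in the vocabulary")
    return code


def check_values(values):
    """Split values into accepted codes and rejected original strings."""
    valid, rejected = [], []
    for value in values:
        try:
            valid.append(validate_language(value))
        except ValueError:
            rejected.append(value)
    return valid, rejected


valid, rejected = check_values(["deu", " FRA ", "German", "ger"])
print("Accepted:", valid)
print("Rejected:", rejected)
```

Values such as "German" or "ger" are rejected here because they are not ISO 639-3 codes, even though a human reader would understand them. A shared vocabulary is meant to remove exactly this kind of variation.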

Metadata standards in the CLARIN ecosystem

CLARIN uses the Component MetaData Infrastructure (CMDI), which is designed to allow CLARIN centers and researchers to re-use existing data description formats and adapt the metadata schemes to their needs. This is achieved through the use of Metadata Components, which are listed in the Component Registry. Metadata records are expressed in XML files with a link to the corresponding metadata profile.
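
The idea of an XML record that points to its profile and groups its content into components can be sketched as follows. The XML below is a simplified, hypothetical structure and not a valid CMDI file: the element names, the profile identifier and the function read_record are assumptions chosen for readability. Refer to the CMDI documentation and the Component Registry for the actual schema.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical record. It only mimics the idea of a header
# that links to a profile plus a section of components; it is NOT valid CMDI.
RECORD = """\
<Record>
  <Header>
    <Profile>example-registry:profile-0001</Profile>
  </Header>
  <Components>
    <CorpusDescription>
      <Title>Example Corpus</Title>
      <Language>deu</Language>
    </CorpusDescription>
  </Components>
</Record>
"""


def read_record(xml_text):
    """Return the profile link and a flat dict of component fields."""
    root = ET.fromstring(xml_text)
    profile = root.findtext("Header/Profile")
    fields = {}
    # Each child of <Components> is treated as one reusable component.
    for component in root.find("Components"):
        for child in component:
            fields[f"{component.tag}.{child.tag}"] = (child.text or "").strip()
    return profile, fields


profile, fields = read_record(RECORD)
print("Profile:", profile)
print("Fields:", fields)
```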

SWISSUbase, the national repository of CLARIN-CH, has its own metadata scheme based on Meta-Share, which will, however, be easily convertible to CMDI. For information on the metadata scheme used for the LaRS (Language Repository of Switzerland), see the Linguistics Metadata Guide.

The Data Documentation Initiative (DDI) is a free international standard for describing the data produced by surveys and other observational methods in the social, behavioral, economic, and health sciences. It was adopted as a recommended standard by CESSDA and many other institutions.

You might also encounter other metadata standards used in research: Dublin Core, for example, is another widely used standard for machine-readable metadata descriptions. Many tools have been developed for its implementation, including an online generator. Similarly, the DataCite Metadata Scheme can be compiled online using the DataCite generator.
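
As an alternative to an online generator, a simple Dublin Core description can also be produced with a few lines of code. The sketch below uses the Dublin Core element namespace (http://purl.org/dc/elements/1.1/) and Python's standard library. The wrapper element "metadata", the function name and all field values are assumptions for illustration, and the result is not validated against any application profile.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_xml(fields):
    """Serialize a dict of Dublin Core element names and values as XML."""
    # "metadata" is a hypothetical wrapper element chosen for this sketch.
    root = ET.Element("metadata")
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative values only; the keys are Dublin Core element names.
xml_out = dublin_core_xml(
    {
        "title": "Example Corpus",
        "creator": "Jane Doe",
        "language": "deu",
        "rights": "CC BY 4.0",
    }
)
print(xml_out)
```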


Further information on the topic can be found here:

UNIGE: Creating Metadata

UNIL: Organisation and description of research data

UZH UB: Data documentation

CESSDA: Data Management Expert Guide

DCC: Metadata in Social Science & Humanities

Let us know if there are other resources that CLARIN-CH members should know about.

documentation-platform/metadata-standards.txt · Last modified: 2024/02/02 11:49 (external edit)