Data collection

Image by Heinz-Vale via Wikimedia Commons, CC BY-SA 4.0

Data form the foundation of research, so carefully planning and carrying out the data collection process is crucial for your research project. Doing so helps you obtain high-quality, reliable data that align with your research objectives.

Thoughtful planning involves selecting appropriate methods, considering ethical implications and legal issues, and anticipating potential challenges. A well-executed data collection phase not only streamlines subsequent analysis but also lays the groundwork for drawing meaningful conclusions. Before collecting new data, you should always check whether existing data sets are suitable for your research endeavour. Find out more on how to discover and re-use data here.

Data Types

Research data can take various forms, and the data collection process differs for each of them. Data can be classified according to a number of characteristics, one of them being the data type:

Textual Data: Text corpora, annotated texts, parallel corpora, learner corpora, etc.
Phonetic Data: Audio recordings and phonetic transcriptions
Lexical Data: Dictionaries, language models, lexica, glossaries, wordlists, conceptual resources (e.g. WordNet/FrameNet), etc.
Syntactic Data: Treebanks, dependency-parsed texts, etc.
Semantic Data: Semantic annotations, ontologies, word embeddings, etc.
Discourse Data: Conversational transcripts, digital discourse, etc.
Experimental Data: Psycholinguistic experiments, eye-tracking, brain activity recordings, etc.
Language Variation Data: Dialectal data, sociolinguistic surveys, etc.
Multimodal Data: Video recordings, gesture annotations, etc.

The methods and instruments used for these data types each entail different challenges and questions that need to be addressed. What they all have in common, however, is the need for good documentation.

Documenting data collection

The metadata that contextualize the research data are just as important as the creation of the data themselves. Metadata support the interpretation of the research data and thus foster transparency and reproducibility in the research field.

The documentation of your data should answer a number of questions about data collection:

For what purpose was the data created?
Describe the project history, its aims, objectives, concepts and hypotheses.
What does the dataset contain?
Give information on the type of data (interviews, images, questionnaires, etc.), file size, file formats used and relationships between files.
How was the data collected?
Describe the data collection method and all sources the data come from.
Who collected the data and when?
Indicate the name(s) of the data collector(s), date of data collection and geographical coverage of the data.

Source: CESSDA Data Management Expert Guide
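
As a rough illustration, the answers to the questions above could be kept in a small machine-readable record stored next to the data. The following is a minimal sketch in Python, not a format prescribed by CESSDA: the class name `CollectionRecord`, its field names and all values are hypothetical placeholders chosen for this example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CollectionRecord:
    """Hypothetical project-level documentation record (illustrative field names)."""
    purpose: str                 # For what purpose was the data created?
    contents: str                # What does the dataset contain?
    file_formats: list[str]      # File formats used
    collection_method: str       # How was the data collected?
    sources: list[str]           # All sources the data come from
    collectors: list[str]        # Who collected the data?
    collection_period: str       # When was the data collected?
    geographical_coverage: str   # Where was the data collected?

    def missing_fields(self) -> list[str]:
        """Return the names of fields that were left empty."""
        return [name for name, value in asdict(self).items() if not value]


# All values below are placeholders, not real project data.
record = CollectionRecord(
    purpose="PLACEHOLDER: project aims, objectives and hypotheses",
    contents="PLACEHOLDER: e.g. interviews and their transcripts",
    file_formats=["PLACEHOLDER: audio format", "PLACEHOLDER: text format"],
    collection_method="PLACEHOLDER: description of the collection method",
    sources=["PLACEHOLDER: source of the data"],
    collectors=[],  # left empty on purpose to show the completeness check
    collection_period="PLACEHOLDER: start date to end date",
    geographical_coverage="PLACEHOLDER: region",
)

print(record.missing_fields())
print(json.dumps(asdict(record), indent=2, ensure_ascii=False))
```

A check such as `missing_fields()` is one simple way to notice unanswered questions before the data are shared. Which fields are actually required depends on your repository and discipline.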

While these questions address general information at the project level, it is also important to be specific about the data objects themselves. Depending on whether you are dealing with qualitative or quantitative data, different requirements may apply.

Documentation of qualitative data should give background information and contextualize how it was created.
Documentation of quantitative data should describe the data file (e.g. file format, size, processing scripts, etc.) and the variables that are used in it.
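
For quantitative data, a first draft of the variable documentation can be generated from the data file itself and then completed by hand. The sketch below assumes a CSV file with a header row. The function name `describe_variables`, the column names and the sample values are invented for illustration and do not come from any real data set.

```python
import csv
import io


def describe_variables(csv_text: str) -> list[dict]:
    """Build a minimal variable list: name, count of non-empty values, one example."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    summary = []
    for name in reader.fieldnames or []:
        # Skip empty cells so that missing values are visible in the count.
        values = [row[name] for row in rows if row[name] not in ("", None)]
        summary.append({
            "variable": name,
            "non_empty": len(values),
            "example": values[0] if values else None,
            "description": "",  # to be written by the researcher
        })
    return summary


# Made-up sample data; in practice the text would be read from the data file.
sample = "participant_id,age,reaction_time_ms\np01,34,512\np02,,498\n"
codebook = describe_variables(sample)
for entry in codebook:
    print(entry)
```

The generated entries only cover what can be read from the file. The meaning of each variable, its unit and its coding scheme still have to be described manually.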

Check the CESSDA Data Management Expert Guide (DMEG) for details, useful tips and best practices on documenting different types of data.