Fundamentals of Metadata
Metadata are, simply put, “data about data”. They must contain all the information that makes the research data understandable, interoperable and reusable, except for the content of the research data itself. This means at least:
- information about the research project and authors,
- information about the origin (provenance) of data (time and place of creation, measuring instruments and their settings, methods of data processing...),
- accompanying documents, such as protocols and software required for data reuse.
Metadata are typically licensed with the CC0 license (“No Rights Reserved”) because they mostly represent non-copyrighted factual information. Even if it is necessary to delete research data after a certain period of time (e.g., in accordance with the "right to be forgotten" as defined by the GDPR), the metadata must remain permanently and publicly available as proof of the existence and properties of the data. Deviations from this principle are possible only in exceptional cases, e.g., when the metadata themselves would threaten the protection of intellectual property, personal data, statutory confidentiality, etc. In this case, the eligible exemptions from openness apply.
Metadata can be recorded in various formats, from plain text to ReadMe files, from data papers to extensive, standardized, machine-readable metadata schemas. Individual disciplines or repositories may direct or even specify the content or format of metadata, preferably based on a formal standard.
Metadata can be categorized in different ways. Here we will look at two that are most practical from a research perspective.
Metadata by Content
The Oregon State University Library distinguishes between project-level metadata and dataset-level metadata, depending on their content.
Project-level metadata are, for example:
- project title,
- project description,
- dataset title,
- dataset summary,
- date of the dataset publication,
- time and place of data collection or creation,
- principal investigator and project team members,
- contact information,
- persistent identifier or link to the dataset (DOI, Handle, persistent URL (PURL), etc.),
- instructions or rights for data reuse (including licenses).
Dataset-level metadata are:
- information about the data origin (whether they were obtained experimentally, by observation, derived or combined on the basis of other data sets, by modelling, from reference or official databases),
- information about the type of data (numbers, text, photos, video, sound...),
- information about file formats,
- information about instruments or observers,
- details of the data acquisition process (e.g., experiment design and implementation, instrument calibration data, sensor location data, etc.),
- information about data processing, including software, scripts and/or code used,
- labels of entries in the data set:
  - names of variables,
  - descriptions of variables (especially if the names or abbreviations are not widely used),
  - units of measurement.
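The variable-level labels above can be kept in a small machine-readable data dictionary that travels with the data files. The following is a minimal sketch in Python, not a prescribed format: the variable names, descriptions and units are invented for illustration, and the helper `undocumented_columns` is hypothetical.

```python
import csv
import io

# Hypothetical data dictionary: one entry per variable in the dataset.
DATA_DICTIONARY = {
    "site_id": {"description": "Identifier of the sampling site", "unit": None},
    "temp_c": {"description": "Water temperature", "unit": "degree Celsius"},
    "depth_m": {"description": "Sampling depth below surface", "unit": "metre"},
}


def undocumented_columns(csv_text, dictionary):
    """Return the column names in the CSV header that the dictionary does not describe."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in header if name not in dictionary]


# Invented sample data: the "ph" column has no entry in the dictionary.
sample = "site_id,temp_c,depth_m,ph\nA1,12.4,3.0,7.9\n"
missing = undocumented_columns(sample, DATA_DICTIONARY)
print(missing)
```

A check like this can be run before depositing the data, so that every column a reuser encounters has a name, a description and a unit.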
Metadata by Purpose
The Cornell University Library, however, divides metadata into descriptive, structural and administrative, depending on the purpose.
1. Descriptive Metadata
Descriptive metadata enable findability of data at:
- the local or system level, e.g., by searching the contents of a hard disk, a repository or a local bibliographic database,
- the level of the World Wide Web, e.g., through searches with general or specialized search engines (Google, Bing, Yahoo!, DuckDuckGo, Google Scholar, etc.).
The information elements that enable findability are:
- persistent identifiers (e.g., PURL, DOI, Handle),
- information about the properties of data files (e.g., file format, file size, creation date),
- bibliographic information about data files (title, author, language, keywords).
Descriptive metadata are created using the standard metadata schemas described below, together with domain-specific vocabularies and ontologies.
Descriptive metadata, including appropriate file names and keywords, will become increasingly important with the development of dedicated data search engines. In addition to Google Scholar, which is intended for scientific publications, Google also launched Google Dataset Search in 2018, which is designed to discover open datasets. Instructions for its use can be found on Google's blog The Keyword.
For Google Dataset Search and general search engines to show your datasets among the results, you need to give them an informative title that includes suitable keywords. You can use Google's tools Google Trends and Google Ngram Viewer to help you select the best keywords for your purpose.
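Dataset search engines typically read descriptive metadata from structured markup embedded in a dataset's landing page, and Google Dataset Search documents support for schema.org `Dataset` markup. The sketch below builds such a record as JSON-LD in Python. All values (title, DOI, creator, keywords) are placeholders, and the choice of properties is an assumption made for illustration, not a complete or required set.

```python
import json

# Placeholder values for illustration only; replace them with your dataset's details.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Water temperature measurements in alpine lakes, 2019-2021",
    "description": "Hourly water temperature readings from three alpine lakes.",
    "identifier": "https://doi.org/10.1234/example",  # hypothetical DOI
    "keywords": ["water temperature", "alpine lakes", "limnology"],
    "inLanguage": "en",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": [{"@type": "Person", "name": "Jane Doe"}],
}

# Serialize the record; the resulting text would be embedded in the landing page.
json_ld = json.dumps(dataset, indent=2)
print(json_ld)
```

In practice most repositories generate this markup automatically from the metadata you enter at deposit, so the main task for the researcher is to supply an informative title, description and keywords.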
2. Structural Metadata
Structural metadata provide insight into the structure of electronic resources and enable navigation through them. They act as a kind of index, namely by:
- Providing information on the internal structure of sources, which in the context of research data refers mainly to file folder hierarchy (number of folders, number of hierarchical levels, arrangement of content by folders);
- Describing the relationship between the data and scientific publications (e.g., image B was included in the original scientific article A);
- Associating related files and code (e.g., photo D is a processed version of raw photo C; result F was created using code E).
In the context of research data, structural metadata are mainly recorded in the form of rich metadata schemas in XML or JSON format. They can also be part of a repository's digital infrastructure.
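The relationships listed above (image B in article A, photo D derived from raw photo C, result F produced by code E) can be recorded as simple subject-relation-object statements. The following Python sketch is one possible representation under stated assumptions: the file paths are hypothetical, and the relation labels are illustrative rather than taken from the schema of any particular repository, which will define its own controlled list.

```python
import json

# Hypothetical file names; the letters mirror the examples in the list above.
relations = [
    {"subject": "figures/image_B.png", "relation": "isPartOf", "object": "article_A"},
    {"subject": "photos/photo_D.tif", "relation": "isDerivedFrom", "object": "photos/raw/photo_C.raw"},
    {"subject": "results/result_F.csv", "relation": "isCompiledBy", "object": "code/script_E.py"},
]


def related_to(target, records):
    """Return (subject, relation) pairs whose object is the given target."""
    return [(r["subject"], r["relation"]) for r in records if r["object"] == target]


# Write the statements out as JSON, then look up what depends on script E.
print(json.dumps(relations, indent=2))
print(related_to("code/script_E.py", relations))
```

Storing the links explicitly lets a reuser trace any processed file or result back to its raw source and to the code that produced it.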
3. Administrative Metadata
Administrative metadata enable short-term data processing and long-term management of databases. They include:
- technical information about the creation of data files, versioning and quality control,
- reuse rights, access control and user requirements,
- information about permanent storage.
In the context of research data, administrative metadata include information about open licenses, eligible exemptions from openness and the properties and operation of repositories, which must be trustworthy.
Metadata Schemas and Standards
When preparing metadata, it is best to follow the metadata schemas prescribed by the repository where you intend to deposit your data. According to the University of California - San Diego Library, metadata schemas define general concepts about the structure of data (i.e., its building blocks and properties) for the purpose of describing data. When a schema is formally summarised or implemented by a (preferably international) standards organisation, it becomes a metadata standard.
With the exception of general-purpose standards such as Dublin Core and schema.org, metadata standards mostly apply only within a specific domain or specialised field. For example, the Data Documentation Initiative (DDI) standard is mostly used to describe social science data, Geospatial Metadata (ISO 19115) to describe geographic data and related services, and Simple Darwin Core to describe biodiversity by recording the spatial distribution of species.
The documentation that accompanies each metadata schema defines element repeatability and cardinality rules in more detail and provides instructions for entering and formatting values. Many schemas use vocabularies and ontologies, i.e., terms agreed upon by the scientific community, which ensure that metadata use the same names for the same things and concepts. By choosing an appropriate metadata standard, you ensure that the description of your data is detailed enough to follow the scientific community's established practices and be useful to other users of your data.
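To illustrate what repeatability and cardinality rules mean in practice, the following Python sketch checks a metadata record against a small rule table. The rules are invented for this example and are not those of Dublin Core or any other real schema; `check_cardinality` is a hypothetical helper, and the record values are placeholders.

```python
# Hypothetical rules: element -> (minimum occurrences, maximum occurrences).
# A maximum of None means the element may repeat without limit.
RULES = {
    "title": (1, 1),
    "creator": (1, None),
    "subject": (0, None),
    "date": (1, 1),
}


def check_cardinality(record, rules):
    """Return a list of human-readable problems; an empty list means the record conforms."""
    problems = []
    for element, (minimum, maximum) in rules.items():
        count = len(record.get(element, []))
        if count < minimum:
            problems.append(f"{element}: expected at least {minimum}, found {count}")
        if maximum is not None and count > maximum:
            problems.append(f"{element}: expected at most {maximum}, found {count}")
    return problems


# Placeholder record: "date" appears twice although the rules allow it only once.
record = {
    "title": ["Alpine lake temperatures"],
    "creator": ["Doe, Jane", "Novak, Ana"],
    "date": ["2021-06-01", "2021-07-01"],
}
problems = check_cardinality(record, RULES)
print(problems)
```

Repositories usually enforce such rules in their deposit forms; the schema documentation is where you find which elements are mandatory and which may be repeated.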
Last update: 12 May 2022