Provenance (Origin) of Research Data

Minimum Description of Provenance
Research data are very diverse (e.g. obtained experimentally, by observation, derived or compiled from other data sets, produced by simulations, or drawn from reference or official databases), so they require a correspondingly tailored level of detail in the description of their provenance. In a review article titled Data Provenance Standards and Recommendations for FAIR Data, Jauer and Deserno (2020) compared the requirements of four metadata systems: the general DataCite schema, the ECRIN recommendations for clinical research, the recommendations of the Research Data Alliance Working Group on Data Citation (WGDC) for experimental research, and the Data Quality Collaborative (DQC) schema. According to their findings, each metadata record should contain at least the six basic pieces of information listed in the table below.
Metadata | DataCite | ECRIN | WGDC | DQC |
--- | --- | --- | --- | --- |
Persistent identifier | ✔ | ✔ | ✔ | ✖ |
Project or experiment | ✔ | ✔ | ✖ | ✔ |
Author of the data | ✔ | ✔ | ✖ | ✔ |
Time of data creation | ✔ | ✔ | ✔ | ✖ |
Versioning | ✔ | ✔ | ✔ | ✔ |
Subsets of data are assigned a new persistent identifier of their own | ✖ | ✖ | ✔ | ✔ |
Some repositories prescribe a specific metadata schema for the description of provenance, whereas others leave the choice of metadata format to the authors. Provide at least the minimal metadata listed above; better still, follow one of the schemas presented below. If no structured metadata schema is prescribed, you can record the metadata in a ReadMe file or a data paper.
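If you record provenance in a ReadMe file, the six minimum fields from the table above can serve as its skeleton. The following is a minimal sketch in Python, not a prescribed format; the file name and all values are illustrative placeholders.

```python
# A minimal sketch (not a prescribed format): the six minimum provenance
# fields from the table above, captured as a structured record and written
# to a ReadMe-style file. The file name and all values are placeholders.
minimum_provenance = {
    "persistent_identifier": "https://doi.org/10.1234/example",  # placeholder DOI
    "project_or_experiment": "Example project, grant no. 000000",
    "data_author": "Jane Doe (ORCID: 0000-0000-0000-0000)",
    "time_of_creation": "2022-09-02",
    "version": "1.0.0",
    "subset_identifiers": [],  # new PIDs for data subsets, where supported
}

with open("README_provenance.txt", "w", encoding="utf-8") as readme:
    for field, value in minimum_provenance.items():
        readme.write(f"{field.replace('_', ' ').capitalize()}: {value}\n")
```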
For projects within Horizon Europe, the EU Grants - AGA: Annotated Model Grant Agreement specifies in Annex 5: Communication, dissemination, open science and visibility that a minimum description of provenance must contain the following metadata:
- information about the scientific publication (author(s), title, date of publication, publication venue - journal or publishing house);
- funding source (Horizon Europe or Euratom), including project name, acronym and number;
- licensing terms;
- persistent identifiers for the publication, authors, and, if possible, for their organisations and the grant (see also the article on persistent identifiers of non-digital objects).
Where possible, the metadata must also include persistent identifiers for any research output or any other tools and instruments needed to validate the conclusions of the publication.
General Metadata Schemas
A good example of the basic information that should accompany research data is the DataCite metadata schema, which is suitable for a wide range of research fields. Its properties are divided into mandatory, recommended and optional (a small sketch of such a record follows the table). It is used by repositories such as Zenodo, Dryad and Figshare.
ID | Metadata | Degree of necessity |
--- | --- | --- |
1 | Identifier (with mandatory type sub-property) | Mandatory |
2 | Creator (with optional given name, family name, name identifier - e.g., ORCiD - and affiliation sub-properties) | Mandatory |
3 | Title (with optional type sub-property) | Mandatory |
4 | Publisher | Mandatory |
5 | Publication year | Mandatory |
6 | Subject (with scheme sub-property) | Recommended |
7 | Contributor (with optional given name, family name, name identifier - e.g., ORCiD - and affiliation sub-properties) | Recommended |
8 | Date (with type sub-property) | Recommended |
9 | Language | Optional |
10 | Resource type (with mandatory general type description sub-property) | Mandatory |
11 | Alternate identifier (with type sub-property) | Optional |
12 | Related identifier (with type and relation type sub-properties) | Recommended |
13 | Size | Optional |
14 | Format | Optional |
15 | Version | Optional |
16 | Rights | Optional |
17 | Description (with type sub-property) | Recommended |
18 | Geolocation (with point, box and polygon sub-properties) | Recommended |
19 | Funding reference (with name, identifier, and award-related sub-properties) | Optional |
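As a quick illustration, the sketch below builds a record limited to the six mandatory DataCite properties (IDs 1-5 and 10 in the table) and serializes it to JSON. It is a sketch only: the property names follow the table above, all values are placeholders, and the exact field layout accepted by DataCite or a given repository's API may differ.

```python
import json

# A sketch of a record limited to the six mandatory DataCite properties
# (IDs 1-5 and 10 in the table above). Property names follow the table;
# the exact layout accepted by DataCite or a repository may differ, and
# all values are placeholders.
datacite_record = {
    "identifier": {"identifier": "10.1234/example", "identifierType": "DOI"},
    "creators": [{
        "name": "Doe, Jane",
        "givenName": "Jane",
        "familyName": "Doe",
        "nameIdentifier": "https://orcid.org/0000-0000-0000-0000",
        "affiliation": "Example University",
    }],
    "titles": [{"title": "Example data set"}],
    "publisher": "Example Repository",
    "publicationYear": "2022",
    "resourceType": {"resourceTypeGeneral": "Dataset"},
}

print(json.dumps(datacite_record, indent=2, ensure_ascii=False))
```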
The Dublin Core metadata schema is similar to DataCite: it contains all of the DataCite categories (some under different names) as well as more specific ones, such as the repository entry date, license and audience (a rough mapping between the two is sketched below). Zenodo, Dryad and Figshare are compatible with both Dublin Core and DataCite. Dryad is also compatible with the RDF Data Cube Vocabulary and OAI-ORE (Open Archives Initiative Object Reuse and Exchange) schemas.
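To make the overlap concrete, here is a rough, illustrative mapping between DataCite properties from the table above and Dublin Core elements. The official crosswalks differ in detail, so treat this as an assumption-laden sketch rather than an authoritative mapping.

```python
# A rough, illustrative mapping between DataCite properties (as named in
# the table above) and Dublin Core elements. Official crosswalks differ
# in detail; this is a sketch, not an authoritative mapping.
DATACITE_TO_DUBLIN_CORE = {
    "Identifier": "dc:identifier",
    "Creator": "dc:creator",
    "Title": "dc:title",
    "Publisher": "dc:publisher",
    "Publication year": "dc:date",
    "Subject": "dc:subject",
    "Contributor": "dc:contributor",
    "Language": "dc:language",
    "Resource type": "dc:type",
    "Related identifier": "dc:relation",
    "Size": "dc:format",  # extent is commonly expressed via dc:format
    "Format": "dc:format",
    "Rights": "dc:rights",
    "Description": "dc:description",
    "Geolocation": "dc:coverage",
}
```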
Domain-specific Description of Provenance
For many research fields, especially in the natural sciences and engineering, where many people participate in research projects and the acquisition and processing of data are very complex, simple metadata schemas are not sufficient to describe data provenance. The UK Digital Curation Centre and the Research Data Alliance maintain lists of over 40 domain-specific metadata schemas, covering all major research fields. Of these, the following are worth mentioning due to their wide adoption:
- Data Documentation Initiative (DDI) for the description of data from the fields of social sciences, humanities and economics;
- Crystallographic Information Framework (CIF), which is used to report crystal structures in the Acta Crystallographica journal and related publications;
- NeXus for storage and exchange of data on experiments with X-rays, neutrons and muons;
- Minimum Information for Biological and Biomedical Investigations (MIBBI), which brings together more than 40 minimum standards for various research subfields within biology and biomedicine;
- Darwin Core for preservation and exchange of biodiversity data;
- Genome Metadata, which defines 61 criteria for describing genomic data;
- Geographic Information Metadata (ISO 19115-1:2014) for geographic data and services;
- etc.
In addition to standardized metadata schemas, there is also a series of minimum requirements for reporting experiments, created at the initiative of scientific communities within particular research fields and still under development. Most of them exist for the life sciences. Examples include:
- Minimum information about a microarray experiment (MIAME),
- Minimum information about a proteomics experiment (MIAPE),
- Minimum information about a spinal cord injury experiment (MIASCI),
- Minimum information about a simulation experiment (MIASE),
- Standard for exchange of nonclinical data (SEND),
- Standardized battery reporting guidelines,
- CONSORT (for randomized controlled clinical trials),
- QUOROM (for meta-analyses of controlled clinical trials),
- STARD (for reporting diagnostic tests),
- etc.
In most cases, domain-specific repositories prescribe an appropriate, field-specific metadata schema that is best suited to the type of data they store. A good example is the metadata schema of the Environmental Information Data Centre, the UK's national repository for land and water sciences, which covers all aspects of the data, the data collection process and quality control.
If a suitable domain-specific repository does not yet exist for your research field and you will deposit the data in a general or institutional repository, CTK UL advises you not to settle for the basic metadata schema used by these repositories. Instead, we recommend that you follow the domain-specific metadata schema or minimum research reporting requirements that apply to your field. If neither of these options exists, you can use the Core Scientific Metadata Model (CSMD), a generic metadata schema for experimental data. In either case, a simple completeness check before depositing, as sketched below, can catch missing fields.
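The following is a minimal completeness check in Python, assuming you maintain the checklist yourself. The field names reuse the minimum set from the first table; substitute the fields of your domain schema (DDI, MIAME, CSMD, ...).

```python
# A minimal completeness check, assuming you maintain the checklist
# yourself: before depositing, verify that a metadata record covers the
# minimum fields of the standard that applies to your field. The names
# below reuse the minimum set from the first table; substitute those of
# your domain schema (DDI, MIAME, CSMD, ...).
REQUIRED_FIELDS = {
    "persistent_identifier",
    "project_or_experiment",
    "data_author",
    "time_of_creation",
    "version",
}

def missing_fields(record: dict) -> set:
    """Return required provenance fields that are absent or empty."""
    return {f for f in REQUIRED_FIELDS if not record.get(f)}

incomplete = {"persistent_identifier": "10.1234/example", "version": "1.0.0"}
print(missing_fields(incomplete))  # the three fields still to be supplied
```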
Example of Provenance Description (PCCS Standard)
As a representative example of a detailed provenance description for complex research, we will look at the Provenance and Context Content Standard (PCCS). This standard has been developed since 2011 by the Earth Science Information Partners (ESIP), which was founded by NASA in 1998 to improve methods for preserving, discovering, accessing and reusing Earth data. PCCS provides technical standards for identifying, capturing and tracking all relevant details that ensure the validity and repeatability of Earth observations. NASA has also adopted PCCS as a mandatory reporting format for new missions. The source table can be viewed on the ESIP website.
For the purpose of presentation on this website, we have summarized only part of the table; a small sketch of a granule-level provenance record follows it.
Category | Title | Description | Source | Capture | Data format |
--- | --- | --- | --- | --- | --- |
1.1 Preflight/Pre-operational | Instrument description | Documentation of instrument/sensor characteristics including pre-flight or pre-operational performance measurements (e.g., spectral response, instrument geometric calibration (geo-location offsets), noise characteristics, etc.). Components include: instrument specifications, vendor calibration reports, user guides, spectral and radiometric calibration. | Instrument teams and vendors | Preflight/Pre-operational | Word documents |
1.2 Preflight/Pre-operational | Preflight/Pre-operational calibration data | Numeric (digital) files of instrument/sensor characteristics including pre-flight or pre-operational performance measurements (e.g., spectral response, instrument geometric calibration (geo-location offsets), noise characteristics, etc.). Components include: instrument specifications, vendor calibration reports, user guides, spectral and radiometric calibration. | Instrument developer | Pre-mission (during early part of mission, at the latest; certainly before development contracts expire) | Numerical files containing such information |
2.1 Products (Data) | Raw data or Level 0 data | Raw data = Data as measured by a spaceborne, airborne or in situ instrument. Level 0 data = Reconstructed, unprocessed instrument and payload data at full resolution, with any and all communications artifacts (e.g., synchronization frames, communications headers, duplicate data) removed. | Data gathering projects | During mission | Original measurements or equivalent (i.e., only reversible operations are applied such that recovery is guaranteed); one of accepted standard formats |
2.2 Products (Data) | Level 1A data products | Reconstructed, unprocessed instrument data at full resolution, time-referenced, and annotated with ancillary information, including radiometric and geometric calibration coefficients and georeferencing parameters (e.g., platform ephemeris) computed and appended but not applied to Level 0 data. | Instrument teams associated with data gathering projects | During mission and post-mission reprocessing | One of accepted standard formats |
2.3 Products (Data) | Level 1B data products | Level 1A data that have been processed to sensor units (not all instruments have Level 1B source data). | Instrument teams associated with data gathering projects | During mission and post-mission reprocessing | One of accepted standard formats |
2.4 Products (Data) | Level 2 data products | Derived geophysical variables at the same resolution and location as Level 1 source data. | Instrument teams associated with data gathering projects | During mission and post-mission reprocessing | One of accepted standard formats |
2.5 Products (Data) | Level 3 data products | Variables mapped on uniform space-time grid scales, usually with some completeness and consistency. | Instrument teams associated with data gathering projects | During mission and post-mission reprocessing | One of accepted standard formats |
2.6 Products (Data) | Level 4 data products | Model output or results from analyses of lower-level data (e.g., variables derived from multiple measurements). | Interdisciplinary modeling community | During mission and post-mission reprocessing; also as new products are developed post mission | One of accepted standard formats |
2.7 Metadata | Metadata | Information about data to facilitate discovery, search, access, understanding and usage associated with each of the data products. (Ensure that granule level metadata indicates which version of software was used for producing a given granule.) | Same as for corresponding data products | Same as for corresponding data products | Follow ISO 19115 standard (adaptations thereof) |
3.01 Product documentation | Product team | State the product team members (development, help desk and operations), roles, and contact information. As responsibility changes hands over time, the names of individuals and the periods during which they were responsible for various aspects of the product should be documented. | Science teams responsible for products; product generation support teams | During mission; needs to be kept updated as changes occur | Word documents |
3.02 Product documentation | Product requirements | Project's requirements for each product, either explicitly or by reference to the project's requirements document, if available. Product requirements should include content, format, latency, accuracy and quality. | Sponsoring program/project | During mission; needs to be kept updated as changes occur | Word documents |
3.03 Product documentation | Product development history | Major product development steps and milestones, with links to other relevant items that are part of the preserved provenance and context contents. | Product generation support teams | During mission; needs to be kept updated as major events occur | Word documents |
3.04 Product documentation | Processing history | Documentation of processing history and production version history, indicating which versions were used when, why different versions came about, and what the improvements were from version to version. | Science teams responsible for products; product generation support teams | During mission; post-mission for final versions | Word documents; pointers to documents in metadata |
3.05 Product documentation | Algorithm version history | Processing history including versions of processing source code corresponding to versions of the data set or derived product held in the archive. Granule level metadata should indicate which version of software was used for producing a given granule. | Science teams responsible for products; product generation support teams | When products are defined; updates as needed | Collection and granule-level metadata |
3.06 Product documentation | Processing history (Maintenance history) | Excerpts and/or references to maintenance documentation deemed of value to product users (e.g., relevant sections of maintenance reports). | Science teams responsible for products; product generation support teams | During mission; post-mission for final versions | Word documents; log files |
3.07 Product documentation | Processing history (Operations documentation) | Excerpts and/or references to operations documentation deemed of value to product users (e.g., relevant sections of operations event logs). | Science teams responsible for products; product generation support teams | During mission; post-mission for final versions | Log files |
3.08 Product documentation | Product generation algorithms | Processing algorithms and their scientific and mathematical basis, including a complete description of any sampling or mapping algorithm used in creation of the product (e.g., contained in peer-reviewed papers, in some cases supplemented by thematic information introducing the data set or derived product): geo-location, radiometric calibration, geophysical parameters, sampling or mapping algorithms used in creation of the product, algorithm software documentation, ATBDs and high-level data flow diagrams | Science teams responsible for products | Pre-mission for ATBDs; updates during and after mission as new versions are created | Word documents |
3.09 Product documentation | Product generation algorithms/ATBD (Algorithm output) | Describe the output data products (not their format) at a level of detail sufficient to determine whether the product meets user requirements. | Science teams responsible for products | Pre-mission for ATBDs; updates during and after mission as new versions are created | Word documents |
3.10 Product documentation | Product generation algorithms/ATBD (Algorithm performance assumptions) | Describe all assumptions that have been made concerning the algorithm performance estimates. Note any limitations that apply to the algorithms (e.g., conditions where retrievals cannot be made or where performance may be significantly degraded). To the extent possible, the potential for degraded performance should be explored, along with mitigating strategies. | Science teams responsible for products | Pre-mission for ATBDs; updates during and after mission as new versions are created | Word documents |
3.11 Product documentation | Product generation algorithms/ATBD (Error budget) | Organize the various error estimates into an error budget. Error budget limitations should be explained. Describe prospects for overcoming error budget limitations with future maturation of the algorithm, test data, and error analysis methodology. | Science teams responsible for products | Pre-mission for ATBDs; updates during and after mission as new versions are created | Word documents |
3.12 Product documentation | Product generation algorithms/ATBD (Numerical computation considerations) | Describe how the algorithm is numerically implemented, including possible issues with computationally intensive operations (e.g., large matrix inversions, truncation and rounding). | Science teams responsible for products | Pre-mission for ATBDs; updates during and after mission as new versions are created | Word documents |
3.13 Product documentation | Quality | Documentation of product quality assessment (methods used, assessment summaries for each version of the datasets); description of embedded data at the granule level, including quality flags, product data uncertainty fields, data issues logs, etc. | Science teams responsible for products; product generation support teams | During mission; post-mission for final versions | Word documents |
3.14 Product documentation | Quality (Product accuracy) | Accuracy of products, as measured by validation testing, and compared to accuracy requirements. References to relevant test reports. | Science teams responsible for products; product generation support teams | During mission; post-mission for final versions | Word documents |
3.15 Product documentation | Quality (Sensor effects) | Flowed-through effects of sensor noise, calibration errors, spatial and spectral errors, and/or un-modeled or neglected geophysical phenomena on the quality of products | Science teams responsible for products; product generation support teams | During mission; post-mission for final versions | Word documents |
3.16 Product documentation | Quality assessment and potential algorithm improvements | Describe potential future enhancements to the algorithm, the limitations they will mitigate, and provide all possible and useful related information and links. | Science teams responsible for products | Towards end of mission | Word documents |
3.17 Product documentation | References | A bibliography of pertinent technical notes and articles, including refereed publications reporting on research using the data set | Science teams; active archive data centers' user service groups | During and post mission | Word documents |
3.18 Product documentation | User feedback | Information received back from users of the data set or product | Science teams; active archive data centers' user service groups | During and post mission | Word documents or web pages |
4.1 Calibration | Instrument/ Sensor calibration during mission | Instrument/sensor calibration method - Radiometric calibration; Spectral response/ calibration; Noise characteristics; Geo-location | Instrument teams (calibration support teams) | During mission operations | Word documents; pointers to documents in collection-level metadata |
4.2 Calibration | In situ measurement environment | In the case of Earth-based data, station location and any changes in location, instrumentation, controlling agency, surrounding land use and other factors which could influence the long-term record | PIs of in situ measurement projects | During data collection | Word documents; pointers in metadata to documents |
4.3 Calibration | Mission/Platform history | Instrument events and maneuvers; attitude and ephemeris; aircraft position; event logs | Mission operations teams | During mission | Word documents; pointers to documents in collection-level metadata |
4.4 Calibration | Mission calibration data | Instrument/sensor calibration data - Radiometric calibration; Spectral response/ calibration; Noise characteristics; Geo-location | Science teams responsible for products; product generation support teams | During mission; post-mission for final versions | Numerical files containing such information |
4.5 Calibration | Calibration software | Source code used in applying calibration to generate look-up tables and/or parameters needed for producing calibrated products | Instrument teams (calibration support teams) | During mission operations; final versions may be captured post-mission | Source code |
5.1 Product software | Product generation algorithms | Source code used to generate products at all levels. | Instrument teams (calibration support teams) | During mission operations; final versions may be captured post-mission | Source code |
5.2 Product software | Output dataset description | For each output data file, details on the data product's structure, format/type, range of values and special error values. Include data volume and file size. All information needed to verify that the required output data are created by a run, i.e. to verify that all expected datasets are produced in the expected format. | Science teams responsible for products; product generation support teams | When product generation software is developed and tested; updated as needed | Word or XML documents |
5.3 Product software | Programming and procedural considerations | Describe any important programming and procedural aspects related to implementing the algorithm into operating code. | Science teams responsible for products; product generation support teams | When product generation software is developed and tested | Word documents |
5.4 Product software | Exception handling | List the complete set of expected exceptions and describe how they are identified, trapped, and handled. | Science teams responsible for products; product generation support teams | When product generation software is developed and tested | Word documents |
5.5 Product software | Test data description | Description of data sets used for software verification and validation, including unit tests and system test, either explicitly or by reference to the developer's test plans, if available. This will be updated during operations to describe test data for maintenance. | Science teams responsible for products; product generation support teams | When product generation software is developed and tested | Word documents |
5.6 Product software | Unit Test Plans | Description of all test plans that were produced during development, including links or references to the artifacts. | Science teams responsible for products; product generation support teams | When product generation software is developed and tested | Word documents |
5.7 Product software | Test results | Description of testing and test results performed during development, either explicitly or by references to test reports. If test reports are not available to external users, provide a summary of the test results in sufficient detail to give external users a good sense of how the test results indicate that the products meet requirements. | Science teams responsible for products; product generation support teams | When product generation software is developed and tested | Word documents |
6.1 Algorithm input | Algorithm input documentation | Complete information on any ancillary data or other data sets used in generation or calibration of the data set or derived product, either explicitly or by reference to appropriate documents. Information should include full description of the input data and their attributes covering all input data used by the algorithm, including primary sensor data, ancillary data, forward models (e.g. radiative transfer models, optical models, or other model that relates sensor observables to geophysical phenomena) and look-up tables. | Science teams responsible for products | Initial version pre-mission; updates during mission | Word documents; pointers in metadata to documents |
6.2 Algorithm input | Algorithm input data | At the granule level, include information on all inputs (including ancillary or other data granules, calibration files, look-up tables, etc.) that were used to generate the product. At the appropriate level (granule or dataset), include calibration parameters; precision orbit and attitude data; climatological norms; geophysical masks or first-guess fields; spectrum and transmittance information; numerical weather or climate model inputs. | Science teams responsible for products; product generation support teams | During mission; post-mission for final versions | Pointers in metadata to files of appropriate input data |
7.1 Validation | Validation record | Description of validation process, including identification of validation data sets; Cal/Val plans & status; detailed history of validation activities | Science teams responsible for products; Validation teams; product generation support teams | During mission; post-mission for final versions | Word documents |
7.2 Validation | Validation datasets | Validation data sets along with metadata | Science teams responsible for products; Validation teams; product generation support teams | During mission; post-mission for final versions | Numerical files containing such information |
8.1 Software tools | Software tools for users | Readers & data tools | Science teams; product generation support teams; active archive data centers' user service groups | During and post mission | Source code (and executable code if maintained) |
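To make the table concrete, the sketch below shows what a granule-level provenance record covering a few of the PCCS items might look like: the processing software version (items 2.7 and 3.05), pointers to all inputs used to generate the product (item 6.2) and quality flags (item 3.13). This is an illustrative Python sketch, not the serialization defined by ESIP; all names and values are placeholders.

```python
import json

# An illustrative granule-level provenance record covering a few PCCS
# items: processing software version (2.7, 3.05), pointers to all inputs
# (6.2) and quality flags (3.13). This is not the serialization defined
# by ESIP; all names and values are placeholders.
granule_provenance = {
    "product": "Example Level 2 product",
    "granule_id": "EX_L2_20220902_0001",
    "processing_software_version": "v3.2.1",  # PCCS 2.7 / 3.05
    "algorithm_version": "ATBD rev. C",       # PCCS 3.05
    "inputs": [                               # PCCS 6.2: pointers to inputs
        "EX_L1B_20220902_0001.nc",
        "calibration_lut_v12.h5",
        "ancillary_nwp_20220902.grib2",
    ],
    "quality_flags": {"overall": "good"},     # PCCS 3.13
}

print(json.dumps(granule_provenance, indent=2))
```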
Last update: 2 September 2022