Provenance (Origin) of Research Data

The concept of provenance, or the origin of research data, refers to all information about the circumstances of data creation: its authors, the time of creation, the research equipment used and its calibration, and so on. Provenance is one of the most important aspects of metadata. Provenance information not only enables interoperability and reusability of the data but also helps maintain research integrity and combat irreproducibility. The European Commission recognized this as well, which is why it classified provenance as a key part of research data management in the Horizon Europe programme. Even if you follow the requirements of other funders or institutions, provenance information remains central to managing your data.

Minimum Description of Provenance

Since research data are very diverse (e.g., obtained experimentally or by observation, derived or compiled from other data sets, produced by simulations, or taken from reference or official databases), they also require a tailored level of detail in the description of their provenance. In a review article titled Data Provenance Standards and Recommendations for FAIR Data, Jauer and Deserno (2020) compared the requirements of four metadata systems: the general DataCite schema, the ECRIN recommendations for clinical research, the recommendations of the Working Group on Data Citation (WGDC) of the Research Data Alliance for experimental research, and the Data Quality Collaborative (DQC) schema. According to their findings, each metadata record should contain at least the six basic pieces of information listed below.

  • persistent identifier;
  • project or experiment;
  • author of the data;
  • time of data creation;
  • versioning;
  • subsets of the data are assigned a new persistent identifier of their own.

Some repositories prescribe a specific metadata schema for describing provenance, whereas others leave the choice of metadata format to the authors. At a minimum, you should provide the basic metadata listed above; better still, follow one of the schemas presented below. If no structured metadata schema is prescribed, you can record the metadata in a ReadMe file or a data paper.
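Even without a prescribed schema, recording the six basic fields in a machine-readable form makes them far easier to reuse than free text. A minimal sketch in Python; the field names are illustrative and the values are invented placeholders, not taken from any formal schema:

```python
import json

# Minimal provenance record covering the six basic fields identified by
# Jauer and Deserno (2020). Field names are illustrative, not a formal schema.
record = {
    "identifier": "https://doi.org/10.1234/example",   # persistent identifier (hypothetical DOI)
    "project": "Example experiment on soil moisture",  # project or experiment
    "creator": "Jane Researcher",                      # author of the data
    "created": "2022-09-02",                           # time of data creation (ISO 8601)
    "version": "1.0",                                  # versioning
    "subsets": [                                       # subsets carry their own persistent identifiers
        {"identifier": "https://doi.org/10.1234/example.site-a"}
    ],
}

# Serialize the record so it can sit alongside the data, e.g. as provenance.json
# next to a ReadMe file.
print(json.dumps(record, indent=2))
```

Such a file can later be parsed by harvesters or converted into a repository's native schema without re-keying the information.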

For projects within Horizon Europe, the EU Grants - AGA: Annotated Model Grant Agreement specifies in Annex 5: Communication, dissemination, open science and visibility that a minimum description of provenance must contain the following metadata:

  • information about the scientific publication (author(s), title, date of publication, publication venue - journal or publishing house);
  • funding source (Horizon Europe or Euratom), including project name, acronym and number;
  • licensing terms;
  • persistent identifiers for the publication, authors, and, if possible, for their organisations and the grant (see also the article on persistent identifiers of non-digital objects).

If possible, the metadata must include persistent identifiers for any research output or any other tools and instruments needed to validate the conclusions of the publication.
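The required elements above can be captured in a small machine-readable record. A sketch in Python, assuming invented placeholder values throughout (the names, DOI, ORCID, acronym and grant number are not real), with a simple completeness check:

```python
import json

# Sketch of the minimum provenance metadata listed in Annex 5 of the AGA.
# All concrete values are invented placeholders, not real project data.
record = {
    "publication": {
        "authors": ["Jane Researcher"],
        "title": "An Example Study",
        "date": "2022-09-02",
        "venue": "Journal of Examples",                   # journal or publishing house
        "identifier": "https://doi.org/10.1234/example",  # PID of the publication
    },
    "funding": {
        "source": "Horizon Europe",
        "project_name": "Example Project",
        "acronym": "EXPROJ",          # placeholder acronym
        "grant_number": "101000000",  # placeholder grant number
    },
    "license": "CC BY 4.0",
    "author_identifiers": ["https://orcid.org/0000-0000-0000-0000"],  # placeholder ORCID
}

def missing_fields(rec):
    """Report which of the required top-level fields are absent."""
    required = {"publication", "funding", "license", "author_identifiers"}
    return sorted(required - rec.keys())

print(missing_fields(record))  # an empty list means the minimum is covered
```

A check like `missing_fields` can run as part of a deposit workflow, flagging incomplete records before they are submitted.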

General Metadata Schemas

A good example of the basic information that should accompany research data is the DataCite metadata schema, which is suitable for a wide range of research fields. Its properties are divided into mandatory, recommended and optional. It is used by repositories such as Zenodo, Dryad and Figshare.

ID | Metadata | Degree of necessity
1 | Identifier (with mandatory type sub-property) | Mandatory
2 | Creator (with optional given name, family name, name identifier - e.g., ORCiD - and affiliation sub-properties) | Mandatory
3 | Title (with optional type sub-property) | Mandatory
4 | Publisher | Mandatory
5 | Publication year | Mandatory
6 | Subject (with scheme sub-property) | Recommended
7 | Contributor (with optional given name, family name, name identifier - e.g., ORCiD - and affiliation sub-properties) | Recommended
8 | Date (with type sub-property) | Recommended
9 | Language | Optional
10 | Resource type (with mandatory general type description sub-property) | Mandatory
11 | Alternate identifier (with type sub-property) | Optional
12 | Related identifier (with type and relation type sub-properties) | Recommended
13 | Size | Optional
14 | Format | Optional
15 | Version | Optional
16 | Rights | Optional
17 | Description (with type sub-property) | Recommended
18 | Geolocation (with point, box and polygon sub-properties) | Recommended
19 | Funding reference (with name, identifier, and award-related sub-properties) | Optional
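A record covering the mandatory DataCite properties from the table above might be sketched as follows. The attribute names mirror DataCite's JSON representation, but verify them against the current schema version before relying on them; all values are invented placeholders:

```python
import json

# Sketch of a DataCite-style metadata record covering the mandatory
# properties (Identifier, Creator, Title, Publisher, Publication year,
# Resource type). Attribute names follow DataCite's JSON conventions;
# check the current schema version, and note that all values here are
# placeholders.
metadata = {
    "identifiers": [{"identifier": "10.1234/example", "identifierType": "DOI"}],
    "creators": [{
        "givenName": "Jane",
        "familyName": "Researcher",
        "nameIdentifiers": [{
            "nameIdentifier": "https://orcid.org/0000-0000-0000-0000",
            "nameIdentifierScheme": "ORCID",
        }],
        "affiliation": ["Example University"],
    }],
    "titles": [{"title": "Example Dataset"}],
    "publisher": "Example Repository",
    "publicationYear": 2022,
    "types": {"resourceTypeGeneral": "Dataset"},
}

# Confirm all six mandatory properties are present before submission.
mandatory = {"identifiers", "creators", "titles", "publisher", "publicationYear", "types"}
assert mandatory <= metadata.keys()
print(json.dumps(metadata, indent=2))
```

Repositories that accept DataCite metadata typically validate a record of this shape on deposit, so running the same check locally catches omissions early.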

Similar to the DataCite schema is the Dublin Core metadata schema, which contains all of the DataCite categories (some under different names), as well as more specific ones, such as repository entry date, license and audience. Zenodo, Dryad and Figshare are compatible with both Dublin Core and DataCite. Dryad is also compatible with the RDF Data Cube Vocabulary and the OAI-ORE (Open Archives Initiative Object Reuse and Exchange) schemas.

Domain-specific Description of Provenance

For many research fields, especially in the natural sciences and engineering, where many people participate in research projects and the acquisition and processing of data are very complex, simple metadata schemas are not sufficient to describe data provenance. The UK Digital Curation Centre and the Research Data Alliance maintain lists of over 40 domain-specific metadata schemas, covering all major research fields. Of these, the following are worth mentioning due to their wide adoption:

In addition to standardized metadata schemas, there is also a series of minimum requirements for reporting experiments, created at the initiative of scientific communities within particular research fields and still under development. Most of them are available for the life sciences. Some examples include:

In most cases, domain-specific repositories prescribe a field-specific metadata schema best suited to the type of data they store. A good example is the metadata schema of the Environmental Information Data Centre, the UK's national repository for land and water sciences, which covers all aspects of the data, the data collection process and quality control.

If a suitable domain-specific repository does not yet exist for your research field and you plan to deposit the data in a general or institutional repository, CTK UL advises you not to settle for the basic metadata schema used by these repositories. Instead, we recommend that you follow the domain-specific metadata schema or minimum research reporting requirements that apply to your field. If neither of these two options exists, you can use the Core Scientific Metadata Model (CSMD), a generic metadata schema for experimental data.

Example of Provenance Description (PCCS Standard)

As a representative example of a detailed provenance description for complex research, consider the Provenance and Context Content Standard (PCCS). This standard has been developed since 2011 by the Earth Science Information Partners (ESIP), a federation founded by NASA in 1998 to improve methods for preserving, discovering, accessing and reusing Earth data. PCCS provides technical standards for identifying, capturing and tracking all relevant details that ensure the validity and repeatability of Earth observations. NASA has also designated PCCS as a mandatory reporting format for new missions. The source table can be viewed on the ESIP website.

For the purpose of presentation on this website, we have summarized only part of the table.

Category | Title | Description | Source | Capture | Data format
1.1 Preflight/Pre-operational | Instrument description | Documentation of instrument/sensor characteristics, including pre-flight or pre-operational performance measurements (e.g., spectral response, instrument geometric calibration (geo-location offsets), noise characteristics, etc.). Components include: instrument specifications, vendor calibration reports, user guides, spectral and radiometric calibration. | Instrument teams and vendors | Preflight/Pre-operational | Word documents
1.2 Preflight/Pre-operational | Preflight/Pre-operational calibration data | Numeric (digital) files of instrument/sensor characteristics, including pre-flight or pre-operational performance measurements (e.g., spectral response, instrument geometric calibration (geo-location offsets), noise characteristics, etc.). Components include: instrument specifications, vendor calibration reports, user guides, spectral and radiometric calibration. | Instrument developer | Pre-mission (during early part of mission, at the latest; certainly before development contracts expire) | Numerical files containing such information
2.1 Products (Data) | Raw data or Level 0 data | Raw data = data as measured by a spaceborne, airborne or in situ instrument. Level 0 data = reconstructed, unprocessed instrument and payload data at full resolution, with any and all communications artifacts (e.g., synchronization frames, communications headers, duplicate data) removed. | Data gathering projects | During mission | Original measurements or equivalent (i.e., only reversible operations are applied such that recovery is guaranteed); one of accepted standard formats
2.2 Products (Data) | Level 1A data products | Reconstructed, unprocessed instrument data at full resolution, time-referenced, and annotated with ancillary information, including radiometric and geometric calibration coefficients and georeferencing parameters (e.g., platform ephemeris) computed and appended but not applied to Level 0 data. | Instrument teams associated with data gathering projects | During mission and post-mission reprocessing | One of accepted standard formats
2.3 Products (Data) | Level 1B data products | Level 1A data that have been processed to sensor units (not all instruments have Level 1B source data). | Instrument teams associated with data gathering projects | During mission and post-mission reprocessing | One of accepted standard formats
2.4 Products (Data) | Level 2 data products | Derived geophysical variables at the same resolution and location as Level 1 source data. | Instrument teams associated with data gathering projects | During mission and post-mission reprocessing | One of accepted standard formats
2.5 Products (Data) | Level 3 data products | Variables mapped on uniform space-time grid scales, usually with some completeness and consistency. | Instrument teams associated with data gathering projects | During mission and post-mission reprocessing | One of accepted standard formats
2.6 Products (Data) | Level 4 data products | Model output or results from analyses of lower-level data (e.g., variables derived from multiple measurements). | Interdisciplinary modeling community | During mission and post-mission reprocessing; also as new products are developed post-mission | One of accepted standard formats
2.7 Metadata | Metadata | Information about data to facilitate discovery, search, access, understanding and usage, associated with each of the data products. (Ensure that granule-level metadata indicates which version of software was used for producing a given granule.) | Same as for corresponding data products | Same as for corresponding data products | Follow ISO 19115 standard (adaptations thereof)
3.01 Product documentation | Product team | State the product team members (development, help desk and operations), roles, and contact information. As responsibility changes hands over time, the names of individuals and the periods during which they were responsible for various aspects of the product should be documented. | Science teams responsible for products; product generation support teams | During mission; needs to be kept updated as changes occur | Word documents
3.02 Product documentation | Product requirements | Project's requirements for each product, either explicitly or by reference to the project's requirements document, if available. Product requirements should include content, format, latency, accuracy and quality. | Sponsoring program/project | During mission; needs to be kept updated as changes occur | Word documents
3.03 Product documentation | Product development history | Major product development steps and milestones, with links to other relevant items that are part of the preserved provenance and context contents. | Product generation support teams | During mission; needs to be kept updated as major events occur | Word documents
3.04 Product documentation | Processing history | Documentation of processing history and production version history, indicating which versions were used when, why different versions came about, and what the improvements were from version to version. | Science teams responsible for products; product generation support teams | During mission; post-mission for final versions | Word documents; pointers to documents in metadata
3.05 Product documentation | Algorithm version history | Processing history including versions of processing source code corresponding to versions of the data set or derived product held in the archive. Granule-level metadata should indicate which version of software was used for producing a given granule. | Science teams responsible for products; product generation support teams | When products are defined; updates as needed | Collection- and granule-level metadata
3.06 Product documentation | Processing history (Maintenance history) | Excerpts and/or references to maintenance documentation deemed of value to product users (e.g., relevant sections of maintenance reports). | Science teams responsible for products; product generation support teams | During mission; post-mission for final versions | Word documents; log files
3.07 Product documentation | Processing history (Operations documentation) | Excerpts and/or references to operations documentation deemed of value to product users (e.g., relevant sections of operations event logs). | Science teams responsible for products; product generation support teams | During mission; post-mission for final versions | Log files
3.08 Product documentation | Product generation algorithms | Processing algorithms and their scientific and mathematical basis, including a complete description of any sampling or mapping algorithm used in creation of the product (e.g., contained in peer-reviewed papers, in some cases supplemented by thematic information introducing the data set or derived product): geo-location, radiometric calibration, geophysical parameters, sampling or mapping algorithms used in creation of the product, algorithm software documentation, ATBD and high-level data flow diagrams. | Science teams responsible for products | Pre-mission for ATBDs; updates during and after mission as new versions are created | Word documents
3.09 Product documentation | Product generation algorithms/ATBD (Algorithm output) | Describe the output data products (not format) at a level of detail sufficient to determine whether the product meets user requirements. | Science teams responsible for products | Pre-mission for ATBDs; updates during and after mission as new versions are created | Word documents
3.10 Product documentation | Product generation algorithms/ATBD (Algorithm performance assumptions) | Describe all assumptions that have been made concerning the algorithm performance estimates. Note any limitations that apply to the algorithms (e.g., conditions where retrievals cannot be made or where performance may be significantly degraded). To the extent possible, the potential for degraded performance should be explored, along with mitigating strategies. | Science teams responsible for products | Pre-mission for ATBDs; updates during and after mission as new versions are created | Word documents
3.11 Product documentation | Product generation algorithms/ATBD (Error budget) | Organize the various error estimates into an error budget. Error budget limitations should be explained. Describe prospects for overcoming error budget limitations with future maturation of the algorithm, test data, and error analysis methodology. | Science teams responsible for products | Pre-mission for ATBDs; updates during and after mission as new versions are created | Word documents
3.12 Product documentation | Product generation algorithms/ATBD (Numerical computation considerations) | Describe how the algorithm is numerically implemented, including possible issues with computationally intensive operations (e.g., large matrix inversions, truncation and rounding). | Science teams responsible for products | Pre-mission for ATBDs; updates during and after mission as new versions are created | Word documents
3.13 Product documentation | Quality | Documentation of product quality assessment (methods used, assessment summaries for each version of the data sets); description of embedded data at the granule level, including quality flags, product data uncertainty fields, data issues logs, etc. | Science teams responsible for products; product generation support teams | During mission; post-mission for final versions | Word documents
3.14 Product documentation | Quality (Product accuracy) | Accuracy of products, as measured by validation testing and compared to accuracy requirements. References to relevant test reports. | Science teams responsible for products; product generation support teams | During mission; post-mission for final versions | Word documents
3.15 Product documentation | Quality (Sensor effects) | Flowed-through effects of sensor noise, calibration errors, spatial and spectral errors, and/or un-modeled or neglected geophysical phenomena on the quality of products. | Science teams responsible for products; product generation support teams | During mission; post-mission for final versions | Word documents
3.16 Product documentation | Quality assessment and potential algorithm improvements | Describe potential future enhancements to the algorithm, the limitations they will mitigate, and provide all possible and useful related information and links. | Science teams responsible for products | Towards end of mission | Word documents
3.17 Product documentation | References | A bibliography of pertinent technical notes and articles, including refereed publications reporting on research using the data set. | Science teams; active archive data centers' user service groups | During and post mission | Word documents
3.18 Product documentation | User feedback | Information received back from users of the data set or product. | Science teams; active archive data centers' user service groups | During and post mission | Word documents or web pages
4.1 Calibration | Instrument/sensor calibration during mission | Instrument/sensor calibration method: radiometric calibration; spectral response/calibration; noise characteristics; geo-location. | Instrument teams (calibration support teams) | During mission operations | Word documents; pointers to documents in collection-level metadata
4.2 Calibration | In situ measurement environment | In the case of Earth-based data, station location and any changes in location, instrumentation, controlling agency, surrounding land use and other factors which could influence the long-term record. | PIs of in situ measurement projects | During data collection | Word documents; pointers in metadata to documents
4.3 Calibration | Mission/platform history | Instrument events and maneuvers; attitude and ephemeris; aircraft position; event logs. | Mission operations teams | During mission | Word documents; pointers to documents in collection-level metadata
4.4 Calibration | Mission calibration data | Instrument/sensor calibration data: radiometric calibration; spectral response/calibration; noise characteristics; geo-location. | Science teams responsible for products; product generation support teams | During mission; post-mission for final versions | Numerical files containing such information
4.5 Calibration | Calibration software | Source code used in applying calibration to generate look-up tables and/or parameters needed for producing calibrated products. | Instrument teams (calibration support teams) | During mission operations; final versions may be captured post-mission | Source code
5.1 Product software | Product generation algorithms | Source code used to generate products at all levels. | Instrument teams (calibration support teams) | During mission operations; final versions may be captured post-mission | Source code
5.2 Product software | Output dataset description | For each output data file, details on the data product's structure, format/type, range of values and special error values. Include data volume and file size. All information needed to verify that the required output data are created by a run, i.e., to verify that all expected data sets are produced in the expected format. | Science teams responsible for products; product generation support teams | When product generation software is developed and tested; updated as needed | Word or XML documents
5.3 Product software | Programming and procedural considerations | Describe any important programming and procedural aspects related to implementing the algorithm in operating code. | Science teams responsible for products; product generation support teams | When product generation software is developed and tested | Word documents
5.4 Product software | Exception handling | List the complete set of expected exceptions and describe how they are identified, trapped, and handled. | Science teams responsible for products; product generation support teams | When product generation software is developed and tested | Word documents
5.5 Product software | Test data description | Description of data sets used for software verification and validation, including unit tests and system tests, either explicitly or by reference to the developer's test plans, if available. This will be updated during operations to describe test data for maintenance. | Science teams responsible for products; product generation support teams | When product generation software is developed and tested | Word documents
5.6 Product software | Unit test plans | Description of all test plans that were produced during development, including links or references to the artifacts. | Science teams responsible for products; product generation support teams | When product generation software is developed and tested | Word documents
5.7 Product software | Test results | Description of testing and test results performed during development, either explicitly or by reference to test reports. If test reports are not available to external users, provide a summary of the test results in sufficient detail to give external users a good sense of how the test results indicate that the products meet requirements. | Science teams responsible for products; product generation support teams | When product generation software is developed and tested | Word documents
6.1 Algorithm input | Algorithm input documentation | Complete information on any ancillary data or other data sets used in generation or calibration of the data set or derived product, either explicitly or by reference to appropriate documents. Information should include a full description of the input data and their attributes, covering all input data used by the algorithm, including primary sensor data, ancillary data, forward models (e.g., radiative transfer models, optical models, or other models that relate sensor observables to geophysical phenomena) and look-up tables. | Science teams responsible for products | Initial version pre-mission; updates during mission | Word documents; pointers in metadata to documents
6.2 Algorithm input | Algorithm input data | At the granule level, include information on all inputs (including ancillary or other data granules, calibration files, look-up tables, etc.) that were used to generate the product. At the appropriate level (granule or data set), include calibration parameters, precision orbit and attitude data, climatological norms, geophysical masks or first-guess fields, spectrum and transmittance information, and numerical weather or climate model inputs. | Science teams responsible for products; product generation support teams | During mission; post-mission for final versions | Pointers in metadata to files of appropriate input data
7.1 Validation | Validation record | Description of the validation process, including identification of validation data sets; Cal/Val plans and status; detailed history of validation activities. | Science teams responsible for products; validation teams; product generation support teams | During mission; post-mission for final versions | Word documents
7.2 Validation | Validation datasets | Validation data sets along with metadata. | Science teams responsible for products; validation teams; product generation support teams | During mission; post-mission for final versions | Numerical files containing such information
8.1 Software tools | Software tools for users | Readers and data tools. | Science teams; product generation support teams; active archive data centers' user service groups | During and post mission | Source code (and executable code if maintained)
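Two requirements recur throughout the PCCS table: record which software version produced each granule (items 2.7 and 3.05) and record every input used to generate the product (item 6.2). A minimal sketch of such a granule-level record in Python, with hypothetical file names and a hypothetical version string:

```python
import json
import platform
from datetime import datetime, timezone

# Sketch of a granule-level provenance record in the spirit of PCCS items
# 2.7 and 3.05 (record the software version per granule) and 6.2 (record
# all inputs used). File names and the version string are placeholders.
def granule_provenance(output_path, input_paths, software_version):
    """Build a provenance record for one processed data granule."""
    return {
        "product": output_path,
        "software_version": software_version,  # e.g. a git tag of the processing pipeline
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "platform": platform.platform(),
        # In practice, also record a checksum of each input file's contents
        # so later users can verify they hold identical inputs.
        "inputs": [{"path": p} for p in input_paths],
    }

prov = granule_provenance(
    "L2/soil_moisture_20220902.nc",  # hypothetical Level 2 product
    ["L1B/radiances_20220902.nc", "aux/calibration_v3.lut"],
    software_version="proc-pipeline v2.4.1",
)
print(json.dumps(prov, indent=2))
```

Writing such a record automatically at the end of each processing run, rather than documenting it by hand afterwards, is what makes post-mission reprocessing traceable.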


Last update: 2 September 2022
