Fundamentals of Research Data

Open science means that all results of research work, including research data, are made openly available. One of the main arguments for openly sharing research results is that a large proportion of scientific research is publicly funded, so the public has a right to access results produced with "taxpayers' money". Moreover, research data are high-value commodities in the digital economy, as they are indispensable for the operation of production processes, online services and logistics chains, as well as for newer concepts such as the Internet of Things, smart cities and artificial intelligence. The importance of public and early sharing of research results was further reinforced by the SARS-CoV-2 pandemic, as rapid access to the latest research findings was key to understanding the disease and to developing vaccines and effective treatments.
In the European Union, public sharing of research data became mandatory in 2021 with the start of the Horizon Europe funding programme. Independently of this, certain scientific publishers are already introducing requirements for public sharing of the research data that underlie scientific articles. Such sharing will initially be voluntary, but later it will become mandatory and the data will also be reviewed, as can be seen, for example, from the ACS Publications Policy. The public administration has also started to open its data through the Open Data of Slovenia website.
In this article, we look at the general concepts related to research data and their public sharing.
What are Research Data?
Currently, there is no universally accepted definition of research data, and the definitions that do exist vary in precision. Let us take a look at some of them:
Definition by Springer Nature:
Research data refers to the collection of files that support your research project, study or publication such as spreadsheets, documents, images, videos or audio. (www.springernature.com)
Definition by OECD:
Research data are defined as factual records (numerical scores, textual records, images and sounds) used as primary sources for scientific research, and that are commonly accepted in the scientific community as necessary to validate research findings. A research data set constitutes a systematic, partial representation of the subject being investigated. This term does not cover the following: laboratory notebooks, preliminary analyses, and drafts of scientific papers, plans for future research, peer reviews, or personal communications with colleagues or physical objects (e.g. laboratory samples, strains of bacteria and test animals such as mice). (OECD Principles and Guidelines for Access to Research Data from Public Funding)
Definition by CODATA (Committee on Data of the International Science Council):
Data that are used as primary sources to support technical or scientific enquiry, research, scholarship, or artistic activity, and that are used as evidence in the research process and/or are commonly accepted in the research community as necessary to validate research findings and results. All other digital and non-digital content have the potential of becoming research data. Research data may be experimental data, observational data, operational data, third party data, public sector data, monitoring data, processed data, or repurposed data. (www.codata.org)
By some definitions, any number or file you create in your work could be considered research data. However, CTK UL believes that such an all-encompassing definition of research data is not sustainable from a practical point of view. Therefore, we propose below our view of what should be treated as research data for the purposes of implementing European Research Area projects.
Before starting, however, you should of course check your funder's policy on research data. If it does not contain specific provisions, the decision on the scope of open research data is left to you. You define it in the research data management plan.
CTK UL Recommendations Regarding the Scope of Open Research Data
CTK UL information specialists recommend that in research where data can be relatively easily recovered by repeating measurements, you open to the public at least the data that is absolutely necessary for other interested researchers to repeat your experiments. This means, at a minimum, the data that formed the basis of your research publication. But you can definitely share more than that if you want.
Such a dataset may, but need not, include raw data. The decision always rests with the authors. If you choose to share processed numerical data, add measurement uncertainties where possible. If it makes sense, and if you know the distribution of the data, you can include in the open dataset the raw data from the area around the mean and from both extremes of the distribution, or some other meaningful cross-section of the data.
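The selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample readings and the choice of `k` values per region are hypothetical, and the uncertainty shown is the standard uncertainty of the mean (sample standard deviation divided by the square root of the number of readings), which may not be the appropriate measure for every kind of data.

```python
import statistics


def summarize(values):
    """Return the mean and the standard uncertainty of the mean (s / sqrt(n))."""
    mean = statistics.mean(values)
    uncertainty = statistics.stdev(values) / len(values) ** 0.5
    return mean, uncertainty


def representative_subset(values, k=2):
    """Pick the k lowest, the k values closest to the mean, and the k highest."""
    if len(values) < 3 * k:
        raise ValueError("need at least 3 * k values to pick three distinct groups")
    ordered = sorted(values)
    mean = statistics.mean(ordered)
    lowest, middle, highest = ordered[:k], ordered[k:-k], ordered[-k:]
    # From the remaining values, keep the k that lie closest to the mean.
    closest = sorted(sorted(middle, key=lambda v: abs(v - mean))[:k])
    return lowest + closest + highest


# Hypothetical repeated readings of one quantity (illustrative values only).
measurements = [9.8, 10.1, 10.0, 9.9, 10.3, 9.7, 10.2, 10.0]

mean, uncertainty = summarize(measurements)
subset = representative_subset(measurements, k=2)
print(f"mean = {mean:.2f} +/- {uncertainty:.2f}")
print("raw values to share as examples:", subset)
```

In practice, each selected raw value would stand for a raw data file (e.g., a spectrum or chromatogram) rather than a single number, and the authors decide which cross-section is meaningful.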
In any case, the set of research data should include accompanying documents (e.g., research protocols), source computer code, etc., if the data cannot be read or understood without them.
In the case of unique data that cannot be obtained again because they are tied to a specific time and place, we recommend that you share all data, including raw data, unless eligible exemptions apply for reasons of intellectual property rights or security.
Types and Formats of Research Data
The Oregon State University Library has prepared a useful classification of research data. According to this classification, research data can be divided into five groups:
1. Data obtained through observation
- Data obtained in the field (in situ).
- They cannot be recaptured, recreated or replaced.
- Examples: environmental monitoring (physico-chemical, biological), field observations (natural sciences, social sciences and humanities), surveys.
- CTK UL recommendation: all data should be open, including raw data, except for eligible exemptions. They must be accompanied by all documents (e.g., research protocols) describing how the data were obtained.
2. Data obtained experimentally
- Data obtained under controlled field conditions (in situ) or in a laboratory.
- They should be reproducible, but reproducing them is expensive.
- Examples: microscopy, gene sequencing, chromatography, spectroscopy, chemical synthesis.
- CTK UL recommendation: all analyzed data (together with measurement uncertainties if the data are numerical) and a few example subsets of the raw data should be open, except for eligible exemptions. They must be accompanied by all documents (e.g., research protocols) describing how the data were obtained.
3. Derived or compiled data
- They can be reproduced, but doing so is expensive.
- Examples: data obtained from text or number mining, derived variables, compiled datasets and databases, 3D models.
- CTK UL recommendation: all analyzed data should be open (together with measurement uncertainties if the data is numerical) except for eligible exemptions. They must be accompanied by links to the primary data (if open) and any documents (e.g., research protocols, computer code) describing how the derived data were obtained.
4. Data obtained through simulations
- Results of models designed to study the operation or performance of an actual or theoretical system.
- The models and their metadata, including the input data, may be more important than the output data.
- Examples: climate models, economic models, biogeochemical models.
- CTK UL recommendation: simulation results should be open (together with measurement uncertainties if the data is numerical) except for eligible exemptions. They must be accompanied by links to the source data (if open), computer code that allows the simulation to be repeated, and any documents that allow the process to be understood.
5. Reference or canonical data
- Static or dynamic data collections, most often professionally peer-reviewed and curated or even published.
- Examples: gene sequence databanks, chemical structures, national censuses, public spatial information (geographical, geodetic, geological).
- CTK UL recommendation: when using this type of data, ensure proper citation. You can find instructions on this on the website of the UK's Digital Curation Centre. Sometimes specific citation instructions are provided by the authors of the collection, but you can also use the Cite This for Me website for help.
Research data can often be exported in a variety of formats, but not all of them are suitable for open sharing. Data formats recommended to maximize data interoperability and reusability can be found on the UK Data Service website. We also write more about this on the page about the formatting of open data.
What Counts as Open Research Data?
Open, openly accessible or publicly available data are data that meet the FAIR principles. In short, this means that they are:
- deposited in a trusted repository, where they are equipped with a permanent identifier and rich metadata (fulfilling the demands of findability and accessibility),
- described in a formal, generally accessible and widely used language for the dissemination of knowledge (fulfilling the demand of interoperability),
- licensed with an open license and equipped with all information (e.g., methods, protocols, software) that enable other researchers to understand and reuse them (fulfilling the demand of reusability).
Part of the so-called FAIR-ification of data is provided by a trustworthy repository, especially permanent storage and assignment of a permanent identifier. However, the second part of the process, i.e. proper structuring of data and equipping them with metadata, must be taken care of by the researchers themselves. You can read more about how to prepare data for public sharing and what metadata is on the pages about data formatting and metadata.
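As a rough illustration of the researcher's share of this work, the following Python sketch checks whether a dataset description contains the pieces of information mentioned in the list above. The field names and the example record are hypothetical and do not follow any formal metadata schema; a real repository will prescribe its own required fields.

```python
# Hypothetical checklist: field name -> which FAIR aspect it supports.
REQUIRED_FIELDS = {
    "identifier": "findable: permanent identifier assigned by the repository",
    "title": "findable: part of the descriptive metadata",
    "creators": "findable: part of the descriptive metadata",
    "repository": "accessible: trusted repository holding the data",
    "file_formats": "interoperable: widely used, openly documented formats",
    "license": "reusable: open license",
    "documentation": "reusable: methods, protocols, software needed to read the data",
}


def missing_fields(record):
    """Return the names of checklist fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Illustrative record; every value here is a placeholder, not a real dataset.
record = {
    "identifier": "https://doi.org/10.1234/example",
    "title": "Example spectroscopy measurements",
    "creators": ["Researcher, A."],
    "repository": "Example Repository",
    "file_formats": ["csv", "txt"],
    "license": "CC BY 4.0",
    "documentation": ["README.txt", "protocol.pdf"],
}

draft = dict(record, license="", documentation=[])

print(missing_fields(record))  # nothing missing
for name in missing_fields(draft):
    print(f"{name}: {REQUIRED_FIELDS[name]}")
```

A check like this only confirms that the information is present; whether the metadata are rich enough and the formats suitable still has to be judged by the researchers.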
Researchers are not required to disclose their data immediately, fully or unconditionally. If your data are of a sensitive nature, you can open them after a certain time has passed (i.e., you can place an embargo on public sharing). You can also open them only partially or restrict physical access (e.g., enable access only through a secure connection or from a secure room). You can read more about methods of partial or conditional disclosure of data on the page on eligible exemptions from openness.
How to Cite Research Data?
Most repositories already provide default citation formats for datasets in one or more citation styles. The DiRROS repository, for example, allows you to format citations according to ABNT, ACM, AMA, APA, Chicago, Harvard, IEEE, ISO 690, MLA, and Vancouver citation styles. In larger repositories, such as Zenodo, you can even choose between the styles of individual scientific journals. If you are going to deposit your data in a repository that does not offer the citation style you need, you can use the Cite This for Me website to help you cite it.
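If you ever need to assemble a dataset citation by hand, the elements involved can be sketched as follows. The pattern used here (creators, year, title, version, resource type, repository, identifier) is a generic illustration and not an exact implementation of any of the styles named above; the function name and all example values are hypothetical.

```python
def format_dataset_citation(creators, year, title, repository, identifier, version=None):
    """Build a generic dataset citation string from its basic elements."""
    authors = "; ".join(creators)
    version_part = f" (Version {version})" if version else ""
    return (
        f"{authors} ({year}). {title}{version_part} [Data set]. "
        f"{repository}. {identifier}"
    )


# Placeholder values for illustration only.
citation = format_dataset_citation(
    creators=["Researcher, A.", "Researcher, B."],
    year=2022,
    title="Example spectroscopy measurements",
    repository="Example Repository",
    identifier="https://doi.org/10.1234/example",
    version="1.0",
)
print(citation)
```

Whatever style you use, the permanent identifier should always be part of the citation, since it is what lets readers locate the exact dataset.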
Are Research Data Entered Into COBISS?
Yes, research data can be entered into the COBISS system in accordance with the Typology of Documents/Works for Bibliography Management under section 2.20 Complete Scientific Database of Research Data. This section includes:
An electronic research data collection, the scientific relevance of which is demonstrated by the use for the purpose of researching a wide range of theoretical and applied problems. The data collection must be the outcome of an accomplished research and comply with high quality standards. The quality is assessed on the basis of the detailed accompanying documentation. The data collection must be publicly available in the national or international scientific data archives (repository). The research data collection must be documented and available in a form that allows the repetition of published scientific findings made on its basis.
A corpus is a special collection; it is a uniform collection of authentic texts, internally structured and labelled in a standard manner, created in accordance with predefined criteria and with a specific aim, accessible electronically and equipped with tools for multi-layer search and statistical data processing.
Are Research Data Included in the Quantitative Assessment of Scientific Performance?
According to the Bibliographic Criteria of Scientific and Professional Performance defined by the Slovenian Research Agency, a complete scientific database or corpus (2.20) from the agency's BIBLIO-D list is awarded 30 points. For now, the BIBLIO-D list includes only complete scientific databases that are deposited in the Slovenian Social Science Data Archives.
Last update: 14 October 2022