Formatting Research Data for Open Sharing
Research data must be properly formatted before sharing so that other researchers can understand and reuse them. In this way, we satisfy the FAIR principles of interoperability and reusability. In some cases, the formatting of research data for open sharing is not much different from its formatting for scientific publications, but a few details should be noted. In repositories, research data will stand on their own without an accompanying context, which is why it is necessary to pay attention to the appropriate naming of files, the hierarchy of file folders, and metadata (which can be described in ReadMe files or data articles). We also need to pay attention to file formats, as only some are interoperable enough to be suitable for sharing.
You can choose how to name the files yourself, but it is useful to follow some general recommendations. Above all, file names must be understandable to people who will reuse the data, so explain the naming method in the research data management plan. Ideally, file names should also be machine-readable.
Good File Naming Practices
(Adapted from Princeton University Library, Brown University Library and UK Data Service)
- File naming should be consistent;
- File names should be short (ideally <25 characters, but definitely <40);
- Avoid using spaces, periods, slashes and special characters (e.g., & and %);
- To improve readability and separate individual elements of the name, use capital letters, underscores and hyphens;
- Write the dates in ISO 8601 format: YYYYMMDD (Y = year, M = month, D = day);
- Include the file version in the name;
- The order of the name elements should be such that the files can be sorted by creation date, serial number, or version.
Recommended Name Elements
(Adapted from Princeton University Library)
- The date the file was created (writing the date at the beginning of the name will make it easier to sort the files),
- Project name or number,
- The name of the author,
- A brief description of the file contents,
- Sample number,
- Type of analysis,
- File version.
An example of good practice: 20190523_H2020MatChem_GL_exp5_c2_XRF1
This name contains information that is important to the author, the research group and other users of the data:
- The date the file was created, i.e. 23 May 2019 in YYYYMMDD format,
- The name of the hypothetical project titled "Materials Chemistry" (abbreviation MatChem), which was financed within the framework of the Horizon 2020 (H2020) programme,
- Initials of the hypothetical author, i.e. G. L.,
- The title of the experiment, i.e. exp5 (“Experiment 5”),
- Designation of the compound, i.e. c2 (“Compound 2”),
- Designation of the analysis, i.e. XRF1 (first measurement with X-ray fluorescence).
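The naming recommendations above can be sketched in code. The following is a minimal Python illustration, not a prescribed tool: the helper name build_file_name and its checks are assumptions made for this example. It puts the date first in YYYYMMDD format, joins the elements with underscores, rejects spaces and special characters, and enforces the 40-character limit.

```python
import re
from datetime import date


def build_file_name(created: date, project: str, author: str, *parts: str) -> str:
    """Join name elements with underscores, date first (hypothetical helper)."""
    elements = [created.strftime("%Y%m%d"), project, author, *parts]
    for element in elements:
        # Allow only letters, digits and hyphens inside a single element.
        if not re.fullmatch(r"[A-Za-z0-9-]+", element):
            raise ValueError(f"unsupported characters in {element!r}")
    name = "_".join(elements)
    if len(name) >= 40:
        raise ValueError(f"name is {len(name)} characters long; keep it under 40")
    return name


name = build_file_name(date(2019, 5, 23), "H2020MatChem", "GL", "exp5", "c2", "XRF1")
print(name)  # 20190523_H2020MatChem_GL_exp5_c2_XRF1
```

Because the date comes first and is written without separators, sorting such names alphabetically also sorts them chronologically.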
Sorting files by folders and organising folders into a hierarchical or tree structure helps to make the content more transparent. Consider which hierarchy is more appropriate - the one with more levels or the one with fewer. The UK Data Service recommends that the hierarchy should have no more than four levels and that each folder should contain no more than 10 files. Individual levels should reflect the most meaningful file classification, e.g., by experiments, dates, locations, analysis types, file types...
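A folder tree can be checked against these limits automatically. The sketch below is only one possible approach: the function name check_hierarchy is hypothetical, and counting the top folder as level 1 is an assumption of this example rather than part of the UK Data Service recommendation.

```python
from pathlib import Path


def check_hierarchy(root, max_levels=4, max_files=10):
    """Return warnings for folders that are nested too deeply or hold too many files."""
    root = Path(root)
    warnings = []
    folders = [root, *sorted(p for p in root.rglob("*") if p.is_dir())]
    for folder in folders:
        # The top folder counts as level 1 (an assumption of this sketch).
        level = len(folder.relative_to(root).parts) + 1
        n_files = sum(1 for p in folder.iterdir() if p.is_file())
        if level > max_levels:
            warnings.append(f"{folder}: level {level} exceeds the limit of {max_levels}")
        if n_files > max_files:
            warnings.append(f"{folder}: {n_files} files exceed the limit of {max_files}")
    return warnings
```

Calling check_hierarchy on the top folder of a dataset returns an empty list when the structure stays within both limits.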
Data Formatting and File Formats
Instrumental Measurement Files in Proprietary Formats
Some data from experiments or field observations are unique because they cannot be reproduced at another place or time. CTK UL recommends open sharing of all data, including raw data (except for eligible exemptions), when they are invaluable and irreplaceable. In some of these types of research, the raw data are obtained using instruments manufactured by private companies, and the files are in proprietary formats.
We recommend that you convert such data from proprietary formats to generic file sharing formats if possible. For example, chromatograms or spectra can often be converted into tables since they are in principle a numerical relationship between the dependent and independent variables. The tables can then be saved in a recommended format, such as .csv or .tab (commonly used proprietary formats such as .xls/.xlsx are also acceptable), and the curves re-visualized with data processing tools.
The disadvantage of this approach is that certain metadata contained in files in proprietary formats may be lost (e.g., information about the instrument, timestamps, operator...). In this case, you must manually add the lost metadata to the rest of the experiment metadata.
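As an illustration of such a conversion, the sketch below writes a spectrum as a two-column .csv table and records the metadata that the conversion would otherwise drop in a separate JSON text. It assumes the numerical values have already been exported from the vendor software. The function name, the column names and all values are made up for this example, and the metadata entries are placeholders to be filled in by hand.

```python
import csv
import io
import json


def write_spectrum_csv(x_values, y_values, out, x_name, y_name):
    """Write paired x/y values as a two-column CSV table with a header row."""
    if len(x_values) != len(y_values):
        raise ValueError("x and y must have the same number of points")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([x_name, y_name])
    writer.writerows(zip(x_values, y_values))


# Made-up values standing in for a spectrum exported from vendor software.
wavelength_nm = [400, 401, 402]
absorbance = [0.12, 0.15, 0.11]

# An in-memory buffer keeps the example self-contained; to write a real file,
# pass open("spectrum.csv", "w", newline="") instead.
buffer = io.StringIO()
write_spectrum_csv(wavelength_nm, absorbance, buffer, "wavelength_nm", "absorbance")

# Metadata lost in the conversion, added back manually (placeholders only).
metadata = {
    "instrument": "PLACEHOLDER",
    "operator": "PLACEHOLDER",
    "measured_at": "PLACEHOLDER",
    "source_software": "PLACEHOLDER",
}
sidecar = json.dumps(metadata, indent=2)
```

Putting the unit into the column name (here, wavelength_nm) keeps the table understandable even when it is separated from its metadata.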
If converting data from proprietary formats to open formats is not possible or would be too time-consuming, you can also deposit raw data in proprietary formats into the repositories, but you must specify in the metadata the software with which the files can be opened. Please also specify the version of the software. When instruments are upgraded, newer versions of the software are often no longer compatible with the file types of older versions of the instruments and vice versa. If possible, add a link to the manufacturer's website where the appropriate software can be obtained.
Large amounts of raw numerical data are best stored in the form of tables. Such a format is not only transparent, but also enables other users to easily import data into various programs for data processing and reanalysis. In the case of contingency tables, i.e., of tables showing the relationship between an independent and dependent variable (or several of them), it is worthwhile to mention that, by convention, independent variables are listed in columns and dependent variables in rows. Percentages are always given in the direction of the independent variable, as we are testing the hypothesis of how the independent variable affects the distribution of the dependent variable across certain categories.
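The percentage convention can be shown with a short calculation. The sketch below uses made-up observations and a hypothetical helper, column_percentages. It counts each combination of categories and divides by the total for the corresponding category of the independent variable, so every column of the resulting table sums to 100%.

```python
from collections import Counter


def column_percentages(observations):
    """Map (independent, dependent) pairs to percentages within each independent category."""
    counts = Counter(observations)
    totals = Counter()
    for (independent, _dependent), n in counts.items():
        totals[independent] += n
    return {pair: 100 * n / totals[pair[0]] for pair, n in counts.items()}


# Made-up observations: (independent variable, dependent variable).
observations = (
    [("treated", "improved")] * 30
    + [("treated", "unchanged")] * 10
    + [("control", "improved")] * 10
    + [("control", "unchanged")] * 30
)
percentages = column_percentages(observations)

# Independent variable in columns, dependent variable in rows.
columns = ["treated", "control"]
rows = ["improved", "unchanged"]
print(f"{'':<10}" + "".join(f"{c:>10}" for c in columns))
for r in rows:
    print(f"{r:<10}" + "".join(f"{percentages[(c, r)]:>9.1f}%" for c in columns))
```

Reading across a row then compares how the categories of the independent variable differ in the distribution of the dependent variable.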
Table columns and rows must be clearly and comprehensibly labeled. If you use abbreviations that are not generally accepted in your research field to save space, define them at an easily noticeable location. In the programs Microsoft Excel, LibreOffice Calc and OpenOffice Calc, it is recommended to collect the numerical values on one tab and give the description of the table on another. In this way, you create a mini ReadMe description right inside the file, which will make it easier for other users to understand the content.
According to the UK Data Service, recommended formats for tabular files are:
- .csv (comma-separated values),
- .tab (tab-delimited file),
- delimited text of given character set with SQL data definition statements where appropriate.
Acceptable formats are:
- .txt (delimited text of given character set where only characters not present in the data are used as delimiters),
- commonly used formats: Microsoft Excel (.xls/.xlsx), Microsoft Access (.mdb/.accdb), dBase (.dbf), spreadsheets in OpenDocument format (.ods).
When reporting numeric values and exporting data in .csv format, pay attention to decimal separators. The rules for the use of decimal and thousands separators are very different around the world. We suggest that you indicate in the metadata which decimal separator you used so that other users of your data do not have problems understanding and importing the data.
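The import problem can be illustrated with a small example. The helper parse_number below is hypothetical. It shows why the separators have to be stated explicitly: the same number is written differently under different conventions and cannot be read reliably without that information.

```python
def parse_number(text, decimal=".", thousands=""):
    """Convert a numeric string to float using the separators named in the metadata."""
    cleaned = text.strip()
    if thousands:
        # Remove the thousands separator before handling the decimal separator.
        cleaned = cleaned.replace(thousands, "")
    return float(cleaned.replace(decimal, "."))


# The same value written under two different conventions.
value_a = parse_number("1.234,5", decimal=",", thousands=".")
value_b = parse_number("1,234.5", decimal=".", thousands=",")
print(value_a == value_b)  # True
```

If the decimal separator is a comma, the .csv file also needs a different field delimiter (for example, a semicolon), which should be stated in the metadata as well.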
Statistical Analyses and Graphs
It is often practical to visualise the processed numerical data since visualisation facilitates the understanding of the relationships between the variables. However, it is even more useful to share the files in which the visualisation and/or statistical analysis was performed, rather than mere graphical representations without accompanying context. In this way, you not only share richer data but also enable other researchers to check what kind of analysis you have performed on the data and adapt it to their needs.
The UK Data Service recommends using the following formats for data that has been significantly processed (i.e. tabular data with a lot of metadata):
- proprietary formats of statistical packages, e.g., .sav (SPSS), .dta (Stata), .sas7bdat (SAS), etc.,
- delimited text and command (‘setup’) file (SPSS, Stata, SAS, etc.) containing metadata,
- structured text or mark-up file containing metadata, e.g., DDI XML format.
Also acceptable formats are .por (SPSS portable format) and .mdb/.accdb (Microsoft Access).
Graphs exported as images are subject to the same file format rules as photos. Also, make sure you label all axes, add units of measurement, label the meaning of all curves, and use a sufficient font size. Helpful guidance on visualizing your data can be found in the University of Queensland educational material.
Photographs and Other Pictorial Material
For open sharing, it is recommended to prepare photos and other pictorial material according to the same principles as for scientific publications, which ensure maximum transparency and research integrity. You can follow the recommendations of Springer Nature, especially those of the journal Nature:
- Conscious manipulation of images to change or improve your results is never acceptable. To avoid unintentional misrepresentation, process your images only minimally. The processed images should accurately reflect the originals.
- Changing brightness or contrast (e.g., in fluorescence microscopy) is acceptable only if all images, including controls, are processed in the same way. The contrast must not be changed to such an extent that part of the data disappears. Over-processing to emphasize one part of the image at the expense of another (e.g., by biasing threshold settings), or to emphasize experimental data relative to the control, is not acceptable.
- Using editing tools, e.g., Photoshop's cloning and healing tools and any features that blur processing traces, is not acceptable.
- Image cropping should be avoided unless it significantly improves the clarity or conciseness of the content. When cropping, no information that is necessary to understand the images should be lost, e.g., molecular markers in electrophoresis gels.
- Combining the images or photos that were taken at different times or in different locations into a single image is not acceptable, unless the data is averaged over time or a time-lapse sequence is involved. If combining images is necessary, clearly mark the boundaries between different parts in the final image and describe its properties in metadata.
- Any use of image processing software must be clearly indicated in the metadata along with a description of the corrections.
Additional instructions for electrophoresis and microscopy images are available on the Nature journal website. CTK UL has also prepared some tips on how you can check whether other researchers' images have been manipulated.
The UK Data Service recommends the following formats for photos and raster images:
- .tif (uncompressed TIFF 6.0),
- .dcm, .dcm30 (Digital Imaging and Communications in Medicine - DICOM) for computed tomography (CT) and magnetic resonance imaging (MRI) data.
Acceptable formats are:
- JPEG (.jpeg, .jpg) if the original was created in this format,
- BMP (.bmp) if the original was created in this format,
- PNG (.png) if the original was created in this format,
- other types of TIFF (.tif, .tiff) format,
- RAW (.raw) image format,
- Photoshop (.psd) files,
- Adobe Portable Document Format – PDF/A, PDF (.pdf).
For vector drawings, the UK Data Service recommends the .dwg format of the CAD software, but acceptable formats are also .dxf, .svg (CAD), .ai (Adobe Illustrator) and binary formats of CAD packages.
Public data sharing has a long tradition in fields such as geography, geology and urban planning. Examples include various geographic information systems (GIS), related open-source programs, and public databases of geospatial data. In Slovenia, such databases include the e-Space Portal, the ARSO Geoportal, the GIS of the Statistical Office of the Republic of Slovenia and the Slovenian Cultural Heritage Register. As with other types of data, rich metadata, which enable both the understanding of the data and data mining with computer algorithms, are extremely important here.
The UK Data Service recommends the following formats for geospatial data:
- ESRI Shapefile (.shp, .shx, .dbf, .prj, .sbx, .sbn),
- georeferenced TIFF (.tif, .tfw),
- tabular GIS attribute data,
- Geography Markup Language (.gml).
Acceptable formats are:
- ESRI Geodatabase format (.mdb),
- MapInfo Interchange Format (.mif) for vector data,
- Keyhole Mark-up Language (.kml),
- binary formats of GIS packages.
In many research fields, video files are the exception rather than the norm among research results, so recommendations for the formatting of video files are less developed than recommendations for the formatting of image material. When creating video files, you can rely on the instructions of publishers who have already devoted more attention to this topic, e.g., Cell Press. Some of the general requirements listed by several publishers are:
- file size: maximum 150 MB,
- frame rate: at least 15 frames per second,
- frame size: at least 320 x 240 px,
- frame aspect ratio: desirable 4 : 3, also acceptable 16 : 9,
- bit rate: at least 265 kbps,
- codec: H.264 recommended.
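The requirements listed above can be turned into a simple pre-submission check. The following sketch is an illustration only: the function name check_video is hypothetical, and the properties of the video are assumed to have been read beforehand with a media inspection tool of your choice. Individual publishers may apply different limits.

```python
def check_video(size_mb, fps, width_px, height_px, bitrate_kbps, codec):
    """Return the list of general requirements that a video file does not meet."""
    problems = []
    if size_mb > 150:
        problems.append("file larger than 150 MB")
    if fps < 15:
        problems.append("frame rate below 15 frames per second")
    if width_px < 320 or height_px < 240:
        problems.append("frame smaller than 320 x 240 px")
    # Cross-multiplication avoids comparing floating-point ratios.
    if width_px * 3 != height_px * 4 and width_px * 9 != height_px * 16:
        problems.append("aspect ratio is neither 4:3 nor 16:9")
    if bitrate_kbps < 265:
        problems.append("bit rate below 265 kbps")
    if codec != "H.264":
        problems.append("codec is not the recommended H.264")
    return problems


# Made-up properties of a hypothetical recording.
print(check_video(120, 25, 1920, 1080, 2000, "H.264"))  # []
```

An empty list means that the file meets all of the general requirements listed above.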
CTK UL recommends that, from the point of view of research integrity, you follow those recommendations for the preparation of photographs and other pictorial material that can be extrapolated to video files. Since the manipulation of video material is even more difficult to detect than the manipulation of images, we have prepared some guidelines for anyone who would like to verify the authenticity of other researchers' recordings.
The UK Data Service recommends the following formats for video files:
- .mp4 (MPEG-4),
- .ogv, .ogg (OGG video),
- .mj2 (motion JPEG 2000).
Acceptable formats are:
- .mov (MOV),
- .wmv (Windows Media Video),
- .webm (WebM).
As with video files, there are relatively few general guidelines for formatting audio files. One of the recommendations of several scientific publishers is that the bit rate of the audio should be at least 128 kbps, and the size of each file should not exceed 30 MB. We write about the manipulation of audio files on the page with instructions for the detection of data manipulation.
The UK Data Service recommends that you share the audio in .flac (Free Lossless Audio Codec) format. Acceptable formats are:
- .mp3 (MPEG-1 Audio Layer 3) if the original was created in this format,
- .aif (Audio Interchange File Format),
- .wav (Waveform Audio Format).
Science funders typically treat computer code as data necessary to validate scientific findings. Computer code is therefore a valuable research output that contributes to a more transparent and verifiable research process and should be preserved even after the end of the research project. The University of Reading has produced detailed guidelines for publishing computer code, and additional guidance can be found in the document “Five recommendations for FAIR software” by the Netherlands eScience Center.
General recommendations state that computer code is best uploaded to a dedicated online repository that will provide version control, code review, bug detection, documentation, user support, and other capabilities. Among the most popular repositories are GitHub, Bitbucket and GitLab. Versions of code supporting research results should be exported from the repository and archived in a trusted public data repository. This will give the specific version of the code that was used to generate or analyse the research data a DOI by which it can be cited. GitHub, for example, already provides an easy feature to archive computer code in the Zenodo repository. It is recommended to equip the archived code with open licenses, with which you set the conditions of reuse.
In addition to the general principles of clear expression in writing, the most important aspect of sharing text files is their format. Although Microsoft Word is one of the most widely used text editors, the .doc/.docx format is merely acceptable for sharing rather than recommended. The UK Data Service states that text files are best shared in the following formats:
- .rtf (Rich Text Format),
- .pdf (PDF/UA, PDF/A or PDF),
- .htm (HTML),
- .odt (OpenDocument Text),
- .rmd (R Markdown files, including the HTML version).
Acceptable formats are:
- .txt (unformatted text),
- commonly used formats: .doc/.docx (Microsoft Word), .xls/.xlsx (Microsoft Excel),
- .xml (XML marked-up text according to an appropriate document type definition, DTD, or schema, e.g., XHTML 1.0).
Last update: 24 August 2022