Metadata Ontologies

Dictionary by PDPics from Pixabay

If we want to ensure that our data are interoperable and reusable, we must first describe them in a uniform and understandable way. We accomplish this by using metadata vocabularies and ontologies that define terms commonly accepted in the research community and the relationships between them. An important aspect of metadata is machine readability, which is also achieved using an agreed-upon, unambiguous encoding of concepts in the form of standardised metadata schemas. Humans can understand the meaning of non-standardly described data from their context to some degree, but computers can correctly interpret data only on the basis of precise, unambiguous and structured metadata tags. An appropriately defined and structured network of data and metadata is the foundation of the so-called Semantic Web or Web 3.0.

Semantic Web

The term "Semantic Web" was coined in 1999 by Tim Berners-Lee, the inventor of the World Wide Web. "Semantic" refers to the branches of linguistics and formal logic that deal with meaning. The Semantic Web builds on Web 1.0 ("Read-Only") and Web 2.0 ("Social Web"), which consist of a network of documents and web pages (digital objects) interconnected by hyperlinks. Such objects are machine-findable, but their content is understandable only to humans. The goal of the Semantic Web is to equip existing digital objects with metadata that describe their properties and the relationships between them, so that machines can find and read not only their location but also their meaning. This would accelerate and simplify the exchange, analysis and use of digitised information, including research data.

The Semantic Web is being developed by the World Wide Web Consortium (W3C) on the basis of a standard model for online data exchange called the Resource Description Framework (RDF). RDF expresses the meaning of data and the connections between them with the help of uniform resource identifiers (URIs). Its basic building block is the so-called "semantic triple" or "RDF triple", which encodes a relationship between data in the grammatical form subject → predicate → object. The sentence "John Doe is writing a book" could be written in RDF format as:

http://example.name#JohnDoe1083 http://xmlns.com/foaf/0.1/write http://example.object#book
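The same statement can be sketched in a few lines of Python using only the standard library. The `Triple` class and the `to_ntriples` helper are hypothetical names introduced here for illustration; the printed line follows the N-Triples convention of wrapping each URI in angle brackets and ending the statement with a full stop. In practice, a dedicated RDF library would normally be used instead.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject -> predicate -> object (all URIs here)."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(triple: Triple) -> str:
    """Serialise a URI-only triple as a single N-Triples line."""
    return " ".join(f"<{term}>" for term in triple) + " ."


# The example statement from the text: "John Doe is writing a book".
statement = Triple(
    subject="http://example.name#JohnDoe1083",
    predicate="http://xmlns.com/foaf/0.1/write",
    obj="http://example.object#book",
)

print(to_ntriples(statement))
```

Keeping the three parts as separate fields, rather than as one string, is what lets software ask questions such as "which statements have this subject?".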

Semantic triples are in turn linked into networks called "semantic graphs" or "RDF graphs", which can describe an entire set of related concepts. The diagram below shows a simple example of such a graph.

A simple semantic graph
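To make the idea of a graph more concrete, the sketch below stores a handful of statements in a Python set and answers simple pattern queries over them. All URIs under `http://example.org/` and the `match` helper are invented for illustration and do not come from any real vocabulary.

```python
from typing import Optional

EX = "http://example.org/"  # hypothetical namespace

# A tiny graph: each tuple is (subject, predicate, object).
graph = {
    (EX + "JohnDoe", EX + "writes", EX + "Book1"),
    (EX + "JohnDoe", EX + "worksAt", EX + "UniversityX"),
    (EX + "Book1", EX + "hasTopic", EX + "Batteries"),
    (EX + "JaneRoe", EX + "worksAt", EX + "UniversityX"),
}


def match(graph, s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None):
    """Return the statements that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Who works at UniversityX?
colleagues = [s for s, _, _ in match(graph, p=EX + "worksAt", o=EX + "UniversityX")]
print(colleagues)
```

Because the two people share the same object URI for `worksAt`, the query connects them even though no statement links them directly; this is the kind of connection a graph of statements makes machine-findable.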

Domain-Specific Vocabularies

A fundamental prerequisite for the efficient functioning of the Semantic Web is a unified and unambiguous terminology. A standardised terminology results from agreement among experts in a particular research or professional field. Clark et al. (2021) illustrated its importance in the article Toward a Unified Description of Battery Data, using as an example the term "electrode", whose definition varies widely among authors. Of the definitions listed below, only that of the International Electrotechnical Commission (IEC), which reflects the nature of electrodes and not only their properties, is suitable for use in metadata ontologies:

Concept | Definition | Source
Electrode | An electron conductor in an electrochemical cell connected to the external circuit. | IUPAC
Electrode | A conductive part in electric contact with a medium of lower conductivity and intended to perform one or more of the functions of emitting charge carriers to or receiving charge carriers from that medium or to establish an electric field in that medium. | IEC 60 050
Electrode | A material in which electrons are the mobile species and therefore can be used to sense the potential of electrons. | Electrochemical Systems
Electrode | The site, area, or location at which electrochemical processes take place. | Linden’s Handbook of Batteries

Some professional communities have already agreed upon a unified terminology. Two examples of good practice are the International Classification of Diseases, established by the World Health Organization, and the controlled vocabularies of the CESSDA consortium for the social sciences and humanities. Elsewhere, the process is still ongoing; one of the most active organisations in this field is the Research Data Alliance (RDA). A noteworthy result of the work of RDA members is the vocabulary and metadata schema for the discovery of data in the field of materials science.
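How a controlled vocabulary is applied in practice can be sketched as follows. The vocabulary below is a made-up three-term example, not an excerpt from any of the vocabularies mentioned above, and `normalise` is a hypothetical helper: it maps free-text keywords to a preferred term and flags those that the vocabulary does not cover.

```python
# Hypothetical controlled vocabulary: preferred term -> accepted synonyms.
VOCABULARY = {
    "electrode": {"electrodes"},
    "electrolyte": {"electrolytes"},
    "separator": {"separators", "battery separator"},
}


def normalise(term: str):
    """Map a free-text keyword to its preferred term, or None if unknown."""
    key = term.strip().lower()
    for preferred, synonyms in VOCABULARY.items():
        if key == preferred or key in synonyms:
            return preferred
    return None


# Keywords as a researcher might type them into a metadata form.
keywords = ["Electrodes", " separator ", "anode stuff"]
accepted = {k: normalise(k) for k in keywords}
rejected = [k for k, v in accepted.items() if v is None]

print(accepted)
print(rejected)
```

Rejecting or flagging unknown terms at the point of entry is one simple way to keep metadata consistent across datasets.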

Metadata Ontologies

Based on an agreed-upon, standardized, domain-specific vocabulary, a metadata ontology can be built. From the viewpoint of computer or information science, an ontology is a data model, formalized in the form of machine-readable code, which represents knowledge as a network of concepts from a certain field and the relationships between them. An ontology defines a set of concepts and their categories, attributes (annotates) meaning to data, provides connections between data and enables machine inference based on their meaning. Ontologies are useful in areas such as software development, data interoperability and process automation.
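The notion of machine inference can be illustrated with a deliberately small sketch. The class names and the individual `cell_42` are hypothetical, and the "ontology" here is only a subclass hierarchy held in a Python dictionary; real ontologies are usually written in a dedicated language such as OWL and processed by a reasoner. The sketch assumes the hierarchy contains no cycles.

```python
# Hypothetical mini-ontology: class -> direct superclass.
SUBCLASS_OF = {
    "LithiumIonBattery": "Battery",
    "Battery": "ElectrochemicalCell",
    "ElectrochemicalCell": "Device",
}

# Instance data: individual -> asserted class.
INSTANCE_OF = {"cell_42": "LithiumIonBattery"}


def superclasses(cls: str) -> list:
    """Walk the subclass chain upwards and collect every ancestor.

    Assumes the hierarchy is acyclic; a cycle would loop forever.
    """
    ancestors = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        ancestors.append(cls)
    return ancestors


def inferred_types(individual: str) -> list:
    """Asserted class plus all classes implied by the hierarchy."""
    asserted = INSTANCE_OF[individual]
    return [asserted] + superclasses(asserted)


print(inferred_types("cell_42"))
```

Only one fact about `cell_42` was stated explicitly; the remaining types follow from the relationships between concepts, so a search for all devices would also find this dataset.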

The use of ontologies has the longest tradition in the life sciences, or, more precisely, in bioinformatics. The best-known example is the Gene Ontology, which represents structured, machine-readable knowledge about the functions of genes and their products. Some other well-known ontologies are, for example:

An extensive collection of ontologies from the field of biomedicine can be found in the BioPortal repository.

CTK UL recommends that you describe your research data with metadata using already created ontologies. If these do not exist, we recommend that you build them yourself. The guide Ontology Development 101: A Guide to Creating Your First Ontology, prepared by Stanford University, can help you achieve this. To create ontologies, you can use open source software such as Protégé or Owlready2.
 
Last update: 2 September 2022
