January 29, 2021

Linking classes and properties with RDF

At its core, any RDF document is just a collection of semantic triples, recognized by any basic RDF tool. However, the call for standardization and mutual intelligibility has led to several ways to describe the schema that the data should follow, with two use cases in mind: reasoning and validation.

Reasoning

Reasoning means that we can infer additional knowledge beyond the triples actually read from a store. Some triples are “implicit”, and it is possible to express rules that a reasoner can follow to arrive at them. This is compatible with the open-world principle: the absence of a particular piece of information in an RDF store does not mean that it does not exist at all, only that our ability to infer certain facts is limited.

A reasoner may be used to partially validate the data, for example if it arrives at a contradiction. However, it is weaker than actual validation tools, since the absence of information can never lead to a contradiction. A reasoner may be used to “materialize” inferrable triples, or to ensure that these triples are already materialized (explicitly contained in the store).
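As a minimal sketch of what materialization means, consider the following snippet. The example.org namespace and the names :Student, :Person and :alice are hypothetical and chosen only for illustration.

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://example.org/> .   # hypothetical namespace

# Explicit triples, as they might be stored
:Student rdfs:subClassOf :Person .
:alice a :Student .

# Implicit triple that an RDFS reasoner could materialize
:alice a :Person .

If the last triple is already present in the store, it is materialized; if not, a reasoner can still derive it from the first two.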

Standard vocabularies used for reasoning are RDF Schema and OWL.

Validation

When an application should only accept documents that conform to a specific “shape”, we need a schema for a different purpose: to ensure that correct properties and datatypes are used, mandatory attributes are not missing, and only a specific set of allowed nodes is used. This conforms to the closed-world principle: omission implies non-existence.

The standard vocabulary intended for validation is SHACL. However, it may be possible to use it for inference as well, in some cases, like obtaining the default value of an attribute.

Classes and properties

In this article, we shall take a look at several vocabularies that allow various types of links between classes and properties. Typically, this expresses the correspondence between the classes of resources used as subjects or objects in triples and the properties used in the predicate position. Commonly, the terms “domain” and “range” are used.

RDF Schema

The more or less “default” schema for RDF vocabularies is fairly small, but covers the majority of use-cases well. It also has one interesting property: it is not possible to infer contradictions from it, meaning that anything that uses it is always consistent.

There are only two properties in RDFS that match the topic of this article: rdfs:domain and rdfs:range.

These property-centric properties have the intended meaning of specifying the class of one of the resources used in a triple (subject or object). Note that this is their only meaning: it is not possible to express whether an instance of the class must have the property, or whether it can have multiple properties of this kind (i.e. there is no arity information).

It might not be immediately obvious how a knowledge base could still be consistent if it uses a property incorrectly with respect to its range or domain. To illustrate, see this example:

:prop a rdf:Property ;
  rdfs:domain :Person ;
  rdfs:range xsd:boolean .

:place a :Place ;
  :prop 7 .

While to us humans this is nonsensical, it is perfectly valid according to RDFS. All facts in RDFS are additive, so if we specify that the domain of :prop is :Person, then it follows that :place must be a person as well. Likewise, it follows that 7 must be a boolean. Note that RDFS doesn't allow any restrictive statements at all, so it is not possible to state that the classes :Person, :Place, xsd:integer and xsd:boolean are (pairwise) disjoint. Thus the only way to satisfy the schema is to accept that the corresponding resources are simply instances of additional classes, since that is what the data says.
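Following that line of reasoning, the conclusions drawn from the example could be written down roughly as follows. This is only a sketch: the example.org namespace is an assumption, and the conclusion about the literal is noted in a comment because Turtle does not allow a literal in the subject position.

@prefix : <http://example.org/> .   # hypothetical namespace

# Explicit in the data
:place a :Place .

# Derived from rdfs:domain: the subject of :prop is a :Person
:place a :Person .

# Derived from rdfs:range: the value 7 is treated as an xsd:boolean as well.
# This cannot be written as a triple here, since 7 would be the subject.

The resource :place ends up with both types, and nothing in RDFS says that this combination is forbidden.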

It is a common mistake to treat the values of rdfs:domain or rdfs:range as a list of alternatives, any one of which is acceptable. That is simply not true. Each stated class applies on its own, and all of them apply at once:

:prop2 a rdf:Property ;
  rdfs:range xsd:boolean , xsd:integer , xsd:float .

:thing :prop2 true .

There are 3 individual rdfs:range triples, and we must consider all of them individually. Since every one of these triples can be used to infer a fact about the class of a resource it is used with, this means that all of them are simultaneously true: true is a boolean, an integer and a float at the same time. To rectify this mistake, we have to bring in a new class:

:SimpleValue a rdfs:Class .
xsd:boolean rdfs:subClassOf :SimpleValue .
xsd:integer rdfs:subClassOf :SimpleValue .
xsd:float rdfs:subClassOf :SimpleValue .

:prop2 a rdf:Property ;
  rdfs:range :SimpleValue .

Usually, this makes sense, since the choice of the allowed classes is not arbitrary. :SimpleValue can be thought of as an interface in common programming languages.

Schema.org

Schema.org is not particularly known as a schema vocabulary (ironically), but there are two properties that are of interest to us there: schema:domainIncludes and schema:rangeIncludes. The basic (meta) resources in Schema.org in some ways mirror RDFS, but due to its universal nature, the domain and range properties are meant to be extensible. In a way, this is similar to the example above:

:prop schema:domainIncludes :Person , :Place .

# Equivalent to

:prop rdfs:domain _:domain .
:Person rdfs:subClassOf _:domain .
:Place rdfs:subClassOf _:domain .

Note that the presence of this property doesn't actually allow us to infer anything meaningful from the data itself, it just states that any resource it is used with is an instance of some class that contains the specified classes (which is actually redundant, since it is always satisfied at least by rdfs:Resource). It does not mean that other classes cannot be used, nor does it mean that any instance of the specified class can be used (that is actually meta-reasoning). Rather, it tells us something about the tools that recognize the property: instances of these classes are commonly used here and should be supported. It is also a signal to the data provider that not sticking to the recommended classes might decrease the usability in some applications (but is by no means an error).

OWL

OWL is a standard vocabulary for creating ontologies and for reasoning. It is a superset of RDFS and uses most of its vocabulary, so it is not so surprising that there are actually no new properties aside from rdfs:domain and rdfs:range to link arbitrary classes and properties. However, much of OWL's power comes from owl:Restriction, used for describing “anonymous” classes based on values of specific properties.

Even with restrictions, OWL still tries hard to make the knowledge base consistent, so even a strict cardinality restriction isn't so easily violated:

:Person rdfs:subClassOf [
  a owl:Restriction ;
  owl:onProperty :prop ;
  owl:cardinality "1"^^xsd:nonNegativeInteger
] .

If you omit the property in your triples, an OWL reasoner will still conclude that such a triple exists (somewhere), but that its value is unknown. Conversely, if there are more triples with the same property, the reasoner infers that all the values must actually be descriptions of the same individual (as if linked with owl:sameAs).
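The two situations can be sketched in Turtle as follows. The example.org namespace and the individuals :alice, :bob, :x and :y are hypothetical; the restriction is repeated so that the snippet stands on its own.

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <http://example.org/> .   # hypothetical namespace

:Person rdfs:subClassOf [
  a owl:Restriction ;
  owl:onProperty :prop ;
  owl:cardinality "1"^^xsd:nonNegativeInteger
] .

# No :prop triple at all: still consistent, the value is merely unknown
:bob a :Person .

# Two :prop triples: also consistent
:alice a :Person ;
  :prop :x , :y .

# What a reasoner would be expected to infer from the two values
:x owl:sameAs :y .

Only an additional statement such as :x owl:differentFrom :y would turn the second case into a contradiction.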

The primary way to make a database inconsistent is to use owl:Nothing. This is the class that cannot contain any individual, so for example specifying that the intersection of xsd:integer and xsd:boolean is a subclass of owl:Nothing means that no individual can be in both of them (they are disjoint).
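A sketch of this pattern follows. As an assumption made for illustration, it uses the classes :Person and :Place from the earlier RDFS example instead of the datatypes, and the example.org namespace is hypothetical.

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://example.org/> .   # hypothetical namespace

# Nothing can be a person and a place at the same time
[ a owl:Class ;
  owl:intersectionOf ( :Person :Place )
] rdfs:subClassOf owl:Nothing .

# The same intent is usually written with the shorthand:
# :Person owl:disjointWith :Place .

# With the axiom above, this resource makes the knowledge base inconsistent
:place a :Place , :Person .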

Restrictions can be used for limiting some or all values to a certain class, with an allowed cardinality interval, or for specifying reflexivity. Again these are mostly used for inference and it is not that easy to create contradictory data.
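For orientation, these kinds of restrictions might look roughly like this. The properties :knows, :livesIn, :parent and :trusts, the class :SelfTrusting and the example.org namespace are all hypothetical.

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <http://example.org/> .   # hypothetical namespace

# All values of :knows are persons
:Person rdfs:subClassOf [
  a owl:Restriction ;
  owl:onProperty :knows ;
  owl:allValuesFrom :Person
] .

# Some value of :livesIn is a place
:Person rdfs:subClassOf [
  a owl:Restriction ;
  owl:onProperty :livesIn ;
  owl:someValuesFrom :Place
] .

# Between one and two :parent values that are persons
:Person rdfs:subClassOf [
  a owl:Restriction ;
  owl:onProperty :parent ;
  owl:onClass :Person ;
  owl:minQualifiedCardinality "1"^^xsd:nonNegativeInteger
] , [
  a owl:Restriction ;
  owl:onProperty :parent ;
  owl:onClass :Person ;
  owl:maxQualifiedCardinality "2"^^xsd:nonNegativeInteger
] .

# Individuals that are related to themselves via :trusts
:SelfTrusting owl:equivalentClass [
  a owl:Restriction ;
  owl:onProperty :trusts ;
  owl:hasSelf true
] .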

SHACL

SHACL is the youngest of the vocabularies presented here. It is oriented more toward examining existing data and relies on SPARQL in some of its constructs.

The vocabulary doesn't link classes and properties directly, rather it allows the creation of rules that link classes and matching properties via a shape. Unlike RDFS, it is class-centric, so the shapes are constructed from the point of view of a target class (or list of nodes or other constraints) that have to conform.

Shapes in SHACL are essentially divided into node shapes and property shapes. Both of them are applied to specific nodes, and a node shape can link to property shapes via sh:property (like you would create a subclass of a restriction in OWL). Property shapes apply to targets of paths followed from the source (focus) node, like you could specify in a SPARQL path. It is possible both to restrict a global use of a property (on instances of rdfs:Resource) and to flip the subject and object of a triple (via sh:inversePath).
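A minimal sketch of a node shape with two property shapes could look like this. The shape name :PersonShape, the property :knows and the example.org namespace are hypothetical; :Person and :prop are reused from the earlier examples.

@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <http://example.org/> .   # hypothetical namespace

:PersonShape a sh:NodeShape ;
  sh:targetClass :Person ;            # focus nodes: instances of :Person
  sh:property [
    sh:path :prop ;                   # values reached by following :prop
    sh:datatype xsd:boolean ;
    sh:minCount 1 ;
    sh:maxCount 1
  ] ;
  sh:property [
    sh:path [ sh:inversePath :knows ] ;   # nodes that point to the focus node
    sh:class :Person
  ] .

The second property shape looks at triples in which the focus node is the object of :knows and requires their subjects to be persons.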

Unlike OWL, SHACL does not go through any interpretation-like mapping of nodes to individuals or sets; instead, it operates directly on the actual nodes in RDF. This allows some interesting conditions, like sh:nodeKind, which makes it possible, for example, to prohibit the use of blank nodes in a graph.
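As a rough illustration, the SHACL constraint on the kind of a node is sh:nodeKind, and a shape using it could be sketched as below. The shape name and the example.org namespace are hypothetical, and the choice of target is an assumption: it only covers subjects that have an rdf:type, not every node in the graph.

@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <http://example.org/> .   # hypothetical namespace

:NoBlankSubjectsShape a sh:NodeShape ;
  sh:targetSubjectsOf rdf:type ;   # every subject of an rdf:type triple
  sh:nodeKind sh:IRI .             # must be an IRI, so a blank node fails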

SHACL is used with the closed-world assumption. There are conditions that assure properties have specific values or cardinalities, but unlike their counterparts in OWL, they refer to the actual triples that we have at hand. In other words, SHACL assumes the document it is used to validate is always the complete description of the resources it refers to. Moreover, there is a notion of a maximally described node, as sh:closed true
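For illustration only, a closed shape might be sketched like this. The shape name and the example.org namespace are hypothetical, and ignoring rdf:type is an assumption made so that typed nodes can pass.

@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <http://example.org/> .   # hypothetical namespace

:ClosedPersonShape a sh:NodeShape ;
  sh:targetClass :Person ;
  sh:closed true ;                        # no properties beyond those listed
  sh:ignoredProperties ( rdf:type ) ;     # except rdf:type
  sh:property [
    sh:path :prop ;
    sh:maxCount 1                         # counts the triples actually present
  ] .

With this shape, a person with two :prop triples fails validation, as does a person with any property other than :prop and rdf:type.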
