March 14, 2021

URI scheme for identifying linked data entities by their identifier

It is good practice in RDF to identify entities with a URI scheme that allows one to eventually arrive at a description of the identified entity. However, the individual URI patterns carry no semantics of their own, and there is no uniform standard for identifying an entity within a particular dataset by its identifier. This is an attempt to devise one. The resulting URI should behave like a "link" that connects to the dataset and finds the actual URIs used for the entity. In theory, it could even be used directly to identify the entity itself.

Identifying resources works by traversing inverse functional properties. A property of this kind (an instance of owl:InverseFunctionalProperty) behaves like a function from its range to its domain. A particular value of the property therefore serves as a primary key which, together with the property, uniquely identifies the resource.

Overview

For an inverse functional property ex:id and a dataset located at https://example.org/, the URI identifying an entity with an identifier of 1000 would be:

lid://example.org/ex:id/1000

Resolving this URI sends a SPARQL query (to an endpoint expected at https://example.org/sparql) of this form:

PREFIX owl: <http://www.w3.org/2002/07/owl#>

DESCRIBE ?a
WHERE {
  ex:id a owl:InverseFunctionalProperty .
  ?a ex:id ?b .
  FILTER (isLiteral(?b) && str(?b) = "1000")
}

The WHERE clause first ensures that ex:id is an inverse functional property, and then finds all entities whose ex:id value is a literal with the string value "1000". All entities identified this way are returned, but they should all describe the same thing. Notice that the datatype of the literal is not checked, so it is possible to obtain descriptions of distinct entities, one identified by the string "1000" and the other by the integer 1000. This could be fixed by placing more restrictions on the range of ex:id, but it is not needed in most cases.
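The resolution step for this simplest case can be sketched in Python. This is only an illustration under stated assumptions: the function name simple_lid_to_sparql is made up here, the endpoint is assumed to live at /sparql on the same host over HTTPS, the path is assumed to hold exactly one prefixed property and one plain identifier, and the property name is not validated.

```python
from urllib.parse import urlsplit, unquote


def simple_lid_to_sparql(lid: str) -> tuple[str, str]:
    """Return (endpoint URL, query text) for a one-property lid: URI."""
    parts = urlsplit(lid)
    if parts.scheme != "lid":
        raise ValueError("expected a lid: URI")
    # Assumption: the endpoint is /sparql on the same host, reached over HTTPS.
    endpoint = f"https://{parts.netloc}/sparql"
    # Split on "/" first, then unescape each component exactly once.
    prop, value = (unquote(c) for c in parts.path.lstrip("/").split("/"))
    # Escape backslashes and quotes so the value stays inside the string literal.
    literal = value.replace("\\", "\\\\").replace('"', '\\"')
    query = (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        "\n"
        "DESCRIBE ?a\n"
        "WHERE {\n"
        f"  {prop} a owl:InverseFunctionalProperty .\n"
        f"  ?a {prop} ?b .\n"
        f'  FILTER (isLiteral(?b) && str(?b) = "{literal}")\n'
        "}\n"
    )
    return endpoint, query


endpoint, query = simple_lid_to_sparql("lid://example.org/ex:id/1000")
print(endpoint)
print(query)
```

A real resolver would also have to reject property names that are not valid SPARQL terms before placing them in the query.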

In this case, the prefix ex: is expected to be understood by the target SPARQL service, as its definition is not provided anywhere. This is not guaranteed, however, so there needs to be a way to specify the full URI of the property:

lid://example.org/http%3A%2F%2Fexample.org%2Fid/1000

The presence of a literal : in the property name determines whether it is interpreted as a prefixed name or as a URI. Since slashes delimit the components of the path, slashes inside a URI also have to be escaped. All names and URIs are unescaped exactly once before being used in the query.
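As a rough Python illustration of this rule (the helper name property_term is hypothetical), the literal colon has to be checked on the raw component, before unescaping, because an escaped colon (%3A) belongs to a full URI:

```python
from urllib.parse import unquote


def property_term(component: str) -> str:
    """Turn one raw (still escaped) path component into a SPARQL term."""
    # Check for a literal colon before unescaping: %3A must not count.
    is_prefixed = ":" in component
    name = unquote(component)  # unescape exactly once
    return name if is_prefixed else f"<{name}>"


print(property_term("ex:id"))                          # ex:id
print(property_term("http%3A%2F%2Fexample.org%2Fid"))  # <http://example.org/id>
```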

Prefixes

Writing out the full URIs of properties is cumbersome, however, so a prefix can instead be defined in the query portion of the URI:

lid://example.org/ex:id/1000?ex=http%3A//example.org/

The slashes don't need to be escaped anymore, as they aren't a part of the path. The colon still has to be escaped (for now), however, because otherwise http would be interpreted as a prefix.

Having to specify prefixes is in many cases tedious, but thankfully, a lot of them don't have to be specified at all. These are contained in the initial context published by the W3C, which is also available for download as JSON. This list should be sanitized first, so that URIs that don't end in / or # are removed (as of the date of writing, this removes xml, describedby, license and role, the first of which is a badly imported XML namespace, while the rest are properties).
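The sanitization step amounts to a simple filter. The sketch below assumes the context has already been loaded into a plain prefix-to-URI dictionary; the function name sanitize_context and the four-entry sample are illustrative stand-ins, not the actual downloaded file.

```python
def sanitize_context(context: dict[str, str]) -> dict[str, str]:
    """Keep only prefixes whose URI ends in / or #."""
    return {p: uri for p, uri in context.items() if uri.endswith(("/", "#"))}


# Hand-written sample standing in for the downloaded initial context.
sample = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "xml": "http://www.w3.org/XML/1998/namespace",
    "license": "http://www.w3.org/1999/xhtml/vocab#license",
}
print(sorted(sanitize_context(sample)))  # ['foaf', 'xsd']
```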

The initial context is also modified to include several additional prefixes, regardless of their prior definition. The first is base: defined to be the empty relative URI <> left as is in the query. Next there are several "imported" URI schemes that are defined as prefixes in order to fix cases where a colon is not properly escaped in a URI. These are http:, https:, shttp:, file:, ftp:, sftp:, tftp:, jar:, dns:, data:, blob:, view-source:, javascript:, cid:, mid:, mailto:, payto:, info:, magnet:, ni:, nih:, git:, svn:, ssh:, telnet:, oid:, doi:, sip:, sips:, tel:, vnc:, udp:, tag:, urn: and, of course, lid:. Any such prefix x: is simply defined to be x:.

Any prefix can be redefined in the query portion of the URI. The definition of a prefix is resolved in the same way as a property identifier: if it contains a colon, the part after the colon is appended to the URI of the prefix before it. The query parameters are processed in order, and they may redefine any prefix defined in a previous step. Unknown prefixes are allowed in any of these steps, and they will be preserved when the SPARQL query is constructed. Conversely, a prefix may be set to an empty value, in which case it is undefined and treated as unknown. Additionally, all explicitly included URIs must be absolute; relative URIs can only be constructed from base: (to avoid common mistakes).
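One possible reading of these rules, sketched in Python. The names resolve and build_prefixes are hypothetical, the scheme list is abbreviated, and the handling of base: and the check for absolute URIs are left out, so this is a simplification rather than a complete resolver.

```python
from urllib.parse import unquote

# Abbreviated stand-in for the full list of imported URI schemes.
SCHEME_PREFIXES = ("http", "https", "urn", "mailto", "lid")


def resolve(raw: str, prefixes: dict[str, str]) -> str:
    """Resolve a raw (still escaped) name against the known prefixes."""
    if ":" not in raw:
        return unquote(raw)          # no literal colon: already a full URI
    prefix, rest = raw.split(":", 1)
    if prefix not in prefixes:
        return unquote(raw)          # unknown prefix: keep the prefixed name
    return prefixes[prefix] + unquote(rest)


def build_prefixes(query: str, initial: dict[str, str]) -> dict[str, str]:
    """Apply the query parameters, in order, on top of the initial context."""
    prefixes = dict(initial)
    prefixes.update({s: s + ":" for s in SCHEME_PREFIXES})  # x: is defined as x:
    for pair in filter(None, query.split("&")):
        name, _, raw = pair.partition("=")
        name = unquote(name)
        if raw == "":
            prefixes.pop(name, None)  # an empty value undefines the prefix
        else:
            prefixes[name] = resolve(raw, prefixes)
    return prefixes


# The second parameter builds on the prefix defined by the first one.
print(build_prefixes("ex=http://example.org/&id=ex:id%2F", {})["id"])
```

The query string is split by hand instead of with a ready-made parser, because the literal colon has to be detected before any unescaping takes place.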

It should be added that prefix definitions don't have to be written into the SPARQL query and, in some cases, that might not even be possible. However, including the prefixes may result in better-formatted output.

Datatypes

As illustrated above, a "plain" identifier matches any literal with the specified string value. Using the facilities already shown here, it is very easy to add the type of the identifier now:

lid://example.org/ex:id/1000@xsd:integer

The constructed SPARQL query is very simple now:

PREFIX owl: <http://www.w3.org/2002/07/owl#>

DESCRIBE ?a
WHERE {
  ex:id a owl:InverseFunctionalProperty .
  ?a ex:id "1000"^^<http://www.w3.org/2001/XMLSchema#integer> .
}

The section after the @ can also indicate a language tag, or be empty (all these forms are disjoint). If the section is empty, the preceding string is interpreted as denoting a URI node (with the same semantics as all other URIs):

lid://example.org/foaf:mbox/mailto:user%40example.org@

DESCRIBE ?a
WHERE {
  foaf:mbox a owl:InverseFunctionalProperty .
  ?a foaf:mbox <mailto:user@example.org> .
}
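The different forms of the final component can be told apart roughly as follows. This Python sketch rests on assumptions not stated in the text: the split happens at the last literal @ (an @ inside the value being escaped as %40), and a suffix counts as a language tag when it consists only of letters, digits and hyphens, which a prefixed name or an escaped URI never does. The name classify_value is hypothetical.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Assumption: a simplified shape for language tags such as "en" or "en-GB".
LANG_TAG = re.compile(r"[A-Za-z]+(-[A-Za-z0-9]+)*")


def classify_value(component: str) -> tuple[str, str, Optional[str]]:
    """Return (kind, unescaped value, raw suffix) for the last path component."""
    if "@" not in component:
        return ("plain", unquote(component), None)
    raw_value, _, suffix = component.rpartition("@")
    if suffix == "":
        return ("uri", unquote(raw_value), None)       # empty suffix: URI node
    if LANG_TAG.fullmatch(suffix):
        return ("lang", unquote(raw_value), suffix)    # language-tagged string
    return ("typed", unquote(raw_value), suffix)       # suffix names a datatype


print(classify_value("1000@xsd:integer"))
print(classify_value("mailto:user%40example.org@"))
```

The datatype suffix is returned unresolved, since it is subject to the same prefix rules as any other name.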

Property paths

It is also possible to specify more than one property in the path. This is translated to SPARQL in a straightforward way:

lid://example.org/ex:a/ex:b/1@xsd:integer

DESCRIBE ?a
WHERE {
  ex:a a owl:InverseFunctionalProperty .
  ex:b a owl:InverseFunctionalProperty .
  ?a ex:a/ex:b 1 .
}

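Lastly, a property component may start with a single ', turning it into its inverse property. This is translated to ^ in SPARQL. Since the inverse of a functional property is inverse functional, the check for such a component looks for owl:FunctionalProperty instead: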

lid://example.org/'ex:a/ex:b/1@xsd:integer

DESCRIBE ?a
WHERE {
  ex:a a owl:FunctionalProperty .
  ex:b a owl:InverseFunctionalProperty .
  ?a ^ex:a/ex:b 1 .
}
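The translation of the property components, including the ' marker, can be sketched in Python. The function name path_to_sparql is hypothetical, and the components are assumed to be already split on / but not yet unescaped.

```python
from urllib.parse import unquote


def path_to_sparql(components: list[str]) -> tuple[list[str], str]:
    """Return (type-check triples, SPARQL property path) for raw components."""
    checks, steps = [], []
    for raw in components:
        inverse = raw.startswith("'")
        if inverse:
            raw = raw[1:]
        # A literal colon marks a prefixed name; otherwise it is a full URI.
        term = unquote(raw) if ":" in raw else f"<{unquote(raw)}>"
        # An inverted step is only unique if the property itself is functional.
        cls = "owl:FunctionalProperty" if inverse else "owl:InverseFunctionalProperty"
        checks.append(f"{term} a {cls} .")
        steps.append(("^" if inverse else "") + term)
    return checks, "/".join(steps)


checks, path = path_to_sparql(["'ex:a", "ex:b"])
print(checks)
print(path)  # ^ex:a/ex:b
```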

Navigation

The result of navigating to such a URI depends on the content types accepted by the navigator. If an RDF format is accepted, the results of the SPARQL query can fulfill the request directly. Otherwise, one of the URIs described by the query can be picked and redirected to. The fragment part of the original lid: URI survives and is used when processing the document.

Inference

Some ontologies are not explicit enough when describing properties, in which case the inverse functional property check may fail even though the fact can be inferred. The query can be extended to simulate basic reasoning:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

DESCRIBE ?a
WHERE {
  {
    ex:id (rdfs:subPropertyOf|owl:equivalentProperty|^owl:equivalentProperty)*/a/(rdfs:subClassOf|owl:equivalentClass|^owl:equivalentClass)* owl:InverseFunctionalProperty .
  } UNION {
    ex:id (rdfs:subPropertyOf|owl:equivalentProperty|^owl:equivalentProperty)*/owl:inverseOf/(rdfs:subPropertyOf|owl:equivalentProperty|^owl:equivalentProperty)*/a/(rdfs:subClassOf|owl:equivalentClass|^owl:equivalentClass)* owl:FunctionalProperty .
  }
  ?subj ex:id 1000 .
  ?subj (owl:sameAs|^owl:sameAs)* ?a .
}

This ensures that ex:id is either an owl:InverseFunctionalProperty or the inverse of an owl:FunctionalProperty, possibly via a chain of subproperties, equivalent properties and subclasses. In addition to the primary result, all resources that are owl:sameAs it are returned as well. An additional entailment regime may also be specified if the endpoint supports it.
