February 27, 2021

Recommendations for producing linked data

While working with RDF and linked data, I have collected some tips that, in my opinion, improve published RDF data in a number of ways. I shall call them IS4 Recommendations and list them in this article (which will be updated as new ones come up):

IS4 Recommendation #1: Look for existing identifiers

This follows one of the fundamental principles of linked data: link. When you want to use a new concept (individual, property or class), try to find an existing identifier or identifier scheme that makes it possible to identify what you are using. If the meaning matches your intention exactly, use the identifier, and if not, use one of rdfs:subClassOf, rdfs:subPropertyOf, owl:equivalentClass, owl:equivalentProperty, rdfs:isDefinedBy, or a subproperty of skos:mappingRelation to express the proper relation. You don't need to write a whole ontology, but doing this makes it likely that other data producers will arrive at the same identifier.

URI schemes and URN namespaces

There is a large variety of schemes usable for identifying not only resources on the web, but practically anything. Even though some of them don't have a common resolution mechanism, you can still use them as secondary identifiers and link them via owl:sameAs to a primary identifier.

As an illustration, geo: can be used for identifying locations on the Earth, jar: for locating files in an archive, dns: for talking about a particular domain, tv: for identifying TV stations, mailto: for e-mail addresses, payto: for bank accounts, tel: for telephone numbers, or data: for embedding (preferably small) files into your data.

URNs, on the other hand, form a plain hierarchy of names, which are usually owned by a registering authority. They do not store any additional data, but some namespaces can be used to convert existing naming schemes, such as urn:publicid: for DTD public identifiers (used in XML), urn:isbn: for books, urn:uuid: for UUIDs or urn:oid: for ISO Object Identifiers. A notable hierarchy is urn:ietf:, used by the Internet Engineering Task Force, with urn:ietf:rfc: for individual RFC documents, and most notably urn:ietf:params:. This sub-namespace contains identifiers for protocol and format-specific parameters or entities (some of them are suitable for use as RDF properties), such as urn:ietf:params:xml:.
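
As a minimal sketch of converting existing names into URNs, the following Python snippet builds identifiers in three of the namespaces mentioned above. The helper names rfc_urn and oid_urn and the sample values are only illustrative assumptions; the urn property of uuid.UUID is part of the standard library.

```python
import uuid

def rfc_urn(number: int) -> str:
    """Name an RFC document in the urn:ietf:rfc: sub-namespace."""
    return f"urn:ietf:rfc:{number}"

def oid_urn(arcs: list) -> str:
    """Name an ISO Object Identifier given as a list of integer arcs."""
    return "urn:oid:" + ".".join(str(arc) for arc in arcs)

# The standard library already knows the urn:uuid: form of a UUID.
example_uuid = uuid.UUID("12345678-1234-5678-1234-567812345678")

print(rfc_urn(2648))
print(oid_urn([1, 3, 6, 1]))
print(example_uuid.urn)
```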

An interesting subset of URIs are tags, in the tag: scheme. They are essentially names, but outside the URN hierarchy, as they also store an identifier of a particular tagging entity. They look similar to the conventions for HTTP identifiers used by W3C (as shown below), but without an associated protocol, and with the possibility of using an e-mail address to specify the tagging entity. For example, tag:yaml.org,2002:null (taken from YAML) denotes the null datatype.

XML namespaces

Much like RDF, XML has seen wide use of URIs as identifiers in practice. While it is possible to use URIs as identifiers of document types, notations or external entities, by far the largest use of them is as identifiers of XML namespaces. Namespaces in XML, XQuery, XPath and related technologies are common and standardized, but there is a caveat – XML qualified names are pairs, not URIs. A circle in SVG is really {http://www.w3.org/2000/svg}circle, a pair of a URI and a local name. Arguably the best, though imperfect, mapping is to concatenate the two strings with #, producing http://www.w3.org/2000/svg#circle.
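
The mapping described above can be sketched in a few lines of Python. The function name clark_to_uri is hypothetical; the {namespace}local form is what the standard xml.etree.ElementTree module reports as an element's tag.

```python
import xml.etree.ElementTree as ET

def clark_to_uri(name: str) -> str:
    """Turn a '{namespace}local' pair into a single 'namespace#local' URI."""
    if not name.startswith("{") or "}" not in name:
        raise ValueError(f"not a namespaced name: {name!r}")
    namespace, local = name[1:].split("}", 1)
    return f"{namespace}#{local}"

# ElementTree exposes qualified names as (namespace, local name) pairs.
element = ET.fromstring('<circle xmlns="http://www.w3.org/2000/svg" r="1"/>')
print(element.tag)
print(clark_to_uri(element.tag))
```

The mapping is imperfect because it is not reversible in general: a namespace URI that already contains # gives a result that cannot be split back unambiguously.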

Some XML namespaces explicitly use this convention to form URI identifiers, such as http://www.w3.org/2001/XMLSchema (xsd:), http://www.w3.org/2005/xpath-functions (fn:) or http://www.w3.org/2005/xqt-errors (err:). This is perfect for identifying datatypes, functions and errors in any programming language or environment. SKOS is perfect for linking these, as for example a "division by zero" error in one language isn't identical to err:FOAR0001 in XQuery, but skos:closeMatch or skos:exactMatch gives enough information to determine how related they are. In general, if you have something that identifies the sine function in your language, you should link it to http://www.w3.org/2005/xpath-functions/math#sin.

IS4 Recommendation #2: Use specific datatypes

Every meaningful literal must have a datatype, since a datatype is just a simple triple: the lexical space (the set of syntactically valid strings), the value space (what the strings actually denote, essentially their equivalence classes) and a mapping between the two. This covers everything from simple tokens used as enums to sentences in a human language. This datatype might be implicit (coming from the property) but it should be explicit whenever possible, so that any application can find something it understands.
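
To make the triple concrete, here is a small Python sketch that models a datatype as a lexical-space test plus a lexical-to-value mapping, with the value space being whatever the mapping can produce. The Datatype class and its field names are assumptions made for this illustration, not part of any RDF library; the four lexical forms of xsd:boolean come from XML Schema.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Datatype:
    uri: str
    in_lexical_space: Callable[[str], bool]   # is the string syntactically valid?
    to_value: Callable[[str], Any]            # lexical-to-value mapping

    def value_of(self, lexical: str) -> Any:
        """Return the denoted value, or fail for an ill-typed literal."""
        if not self.in_lexical_space(lexical):
            raise ValueError(f"{lexical!r} is not in the lexical space of {self.uri}")
        return self.to_value(lexical)

XSD_BOOLEAN = Datatype(
    uri="http://www.w3.org/2001/XMLSchema#boolean",
    in_lexical_space=lambda s: s in {"true", "false", "1", "0"},
    to_value=lambda s: s in {"true", "1"},
)

# Two different lexical forms, one value: they fall into the same equivalence class.
print(XSD_BOOLEAN.value_of("1") == XSD_BOOLEAN.value_of("true"))
```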

Since RDF 1.1, xsd:string is the default datatype for plain untagged and untyped literals, but it should be used only when the literal is semantically just a string, or if the datatype is unknown (e.g. when it is imported from a dataset that does not preserve datatypes, such as contextless JSON or schemaless XML). There is also rdf:PlainLiteral which may be used in situations where another datatype is not applicable.

Note that when an instance of rdfs:Datatype is used as a class rather than as the datatype of a literal, it refers solely to its value space. Thus the value of "str@"^^rdf:PlainLiteral is simply the string str, and the original literal is forgotten.

XML Schema has a rich hierarchy of datatypes which are suitable for most primitives. Unfortunately, the choice outside of it is limited: YAML offers only tag:yaml.org,2002:null as a usable datatype, and OWL brings in owl:rational and owl:real. These two numeric types are used in theory and in analysis, but not much in actual data, as owl:real doesn't even have its own lexical space (which makes sense, as its value space is much larger than that of xsd:string).

Unfortunately, there is still no standardized method of assigning URIs to MIME content types. Thus the choices for storing code in JavaScript would be ContentType:text/javascript (proposed but not published), urn:mime:text/javascript (not a registered namespace, but used sometimes in practice), urn:mime:text:javascript (same but follows the URN naming scheme, not used in practice), or urn:publicid:text%2Fjavascript (analogous to an XML notation public identifier, not used in practice but valid). It's also possible to use the data: URI scheme, which might be better in this case.
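
For the data: option, a minimal Python sketch could look as follows. The function name data_uri and the sample script are hypothetical, and base64 with an explicit charset parameter is just one of the encodings the scheme allows.

```python
import base64

def data_uri(content: str, media_type: str = "text/javascript") -> str:
    """Embed a small piece of text in a base64-encoded data: URI."""
    encoded = base64.b64encode(content.encode("utf-8")).decode("ascii")
    return f"data:{media_type};charset=utf-8;base64,{encoded}"

uri = data_uri("alert(1)")
print(uri)

# The payload after the comma decodes back to the original bytes.
payload = uri.split(",", 1)[1]
print(base64.b64decode(payload))
```

The content type travels inside the identifier itself, so no separate URI for the MIME type is needed.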

Language codes

BCP 47 language codes offer great flexibility when it comes to describing languages, especially due to their use of other standards for working with languages and scripts. They remain usable even when no concrete language can be named, as they include two special subtags: und and zxx. und specifies an undetermined language, so something like "erőmű"@und can be used to indicate that a language might be specified, but is unknown. The second subtag, zxx, indicates no linguistic content. It is debatable whether programming languages have linguistic content, but if not, something like zxx-Latn (no language, Latin script) could be suitable for a piece of code (or ASCII art). Private-use subtags can be used for storing additional information, such as zxx-Latn-x-js storing information for a syntax highlighter. Artificial (but still "human") languages are covered by art.
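
An application such as the syntax highlighter mentioned above could pick tags like these apart roughly as follows. This is a deliberately simplified sketch: the function describe_tag is hypothetical and only recognizes the primary language, an optional four-letter script and private-use subtags, ignoring the extended language, region, variant and extension subtags that full BCP 47 allows.

```python
def describe_tag(tag: str) -> dict:
    """Split a simple language tag into language, script and private-use parts."""
    subtags = tag.split("-")
    result = {"language": subtags[0].lower(), "script": None, "private": []}
    rest = subtags[1:]
    # A script subtag is four letters and, in this simplified form, comes second.
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        result["script"] = rest[0].title()
        rest = rest[1:]
    # Everything after the singleton "x" is private use.
    lowered = [subtag.lower() for subtag in rest]
    if "x" in lowered:
        result["private"] = lowered[lowered.index("x") + 1:]
    return result

print(describe_tag("zxx-Latn-x-js"))
print(describe_tag("und"))
```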

Note that there is a significant difference between the "target audience" for datatypes and language tags. A datatype has a formal definition and strict validity rules, thus a literal "<text>something<>"^^rdf:XMLLiteral is semantically invalid, as it does not denote anything in the value space of rdf:XMLLiteral. Language tags are intended for humans, thus a (hypothetical) literal "<text>something<>"@zxx-Latn-x-xml is a valid way of describing something that looks like XML and is intended to be displayed, formatted and highlighted this way, even though it is not valid. To a human, the piece of code may still be decipherable.

Like content types, languages also have no standard way of being identified via a URI, nor any non-standard way that I am aware of. Language codes used as data should have the datatype xsd:language, but note that this does not identify any particular language, only its identifier (it is derived from xsd:string, unlike xsd:anyURI for example). You should use skos:notation to link a language to its code.

IS4 Recommendation #3: Use emoji

While it may not seem so at first, emoji in Unicode offer a multitude of possibilities for semantic applications, both for humans and computers. These characters can be used to denote both concrete and abstract concepts, shapes, colors (general, skin or hair), locations, animals and so on. The language code for emoji is und-Zsye (I believe that zxx is not applicable here, as there are legitimate studies on the linguistics of emoji) or und-Zsym for plain symbols (not requiring emoji support).

Staying true to linked data principles, you should not use these characters in identifiers directly, rather you should use them as labels for identifiers. Thus a URI that denotes a cat (a type, class or concept) should have the label "🐈"@und-Zsye, the URI for the concept of a red color should have "🟥"@und-Zsye, and the Earth can have a label "🌎"@und-Zsye, "🌍"@und-Zsye or "🌏"@und-Zsye (preferably all of them). ZWJ sequences can be used in any form to describe a combination of concepts, thus "🐈‍🟩‍🦰‍🗨️"@und-Zsye could represent a green cat with red hair in a speech bubble, whatever that means. Some concepts can be described without the use of emoji, such as "♂"@und-Zsym.
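
A label of this kind can be assembled programmatically. The following Python sketch joins individual emoji with the zero-width joiner (U+200D) and formats the result as a language-tagged literal in Turtle-like syntax; the function name emoji_label and the constant names are assumptions for illustration, and the characters are written as escapes so that the source survives any encoding.

```python
ZWJ = "\u200d"                # zero-width joiner
CAT = "\U0001F408"            # cat
GREEN_SQUARE = "\U0001F7E9"   # green square
RED_HAIR = "\U0001F9B0"       # red hair component

def emoji_label(*emoji: str) -> str:
    """Join emoji into one ZWJ sequence and tag it as und-Zsye."""
    return '"' + ZWJ.join(emoji) + '"@und-Zsye'

single = emoji_label(CAT)
combined = emoji_label(CAT, GREEN_SQUARE, RED_HAIR)
print(single)
print(combined)
```

Whether a renderer shows such an ad hoc sequence as one glyph or as several adjacent ones depends on the font, so the label should be treated as a hint rather than a guaranteed picture.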

Of course this does not mean that emoji is sufficient as a language for human-readable descriptions, and you should still include labels in whatever language you like. There are devices that may fail to render it properly, but it works well as a fallback.

IS4 Recommendation #4: Do not use volatile or context-dependent identifiers or data

The meaning of URIs or literals is not inherent in RDF and depends on interpretation. However, there are some schemes and datatypes that are not intended to be used universally, as they may be ambiguous.

Most URI schemes are universal, but there are exceptions. The cid: scheme is used to identify individual files inside the current MIME message, so it makes no sense to use it outside one (unlike mid: which specifies the message). There are also ranges of URIs in specific schemes, such as file: URIs without a host (denoting the local computer), or the blank tv: URI that denotes the current TV station.

Using these URIs individually or as parts of other URIs that do not sufficiently provide their context increases the risk of "collisions", that is, using the same URI for different things. The only exception is when describing tautologies, i.e. facts that are true for any such entity. Stating that file:///dir/file is inside file:///dir is most likely always true, but useless nonetheless.
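
A data producer could guard against the cases named above with a check along these lines. This is only a sketch: the function is_context_dependent is hypothetical and covers just the three examples from this article (cid:, host-less file: and the blank tv: URI), not every context-dependent scheme.

```python
from urllib.parse import urlsplit

def is_context_dependent(uri: str) -> bool:
    """Flag URIs whose meaning depends on where they are used."""
    parts = urlsplit(uri)
    if parts.scheme == "cid":
        return True                      # only meaningful inside one MIME message
    if parts.scheme == "file" and not parts.netloc:
        return True                      # no host: refers to the local computer
    if parts.scheme == "tv" and not parts.path:
        return True                      # blank tv: URI, the current station
    return False

for candidate in ["cid:part1@example.org", "mid:message1@example.org",
                  "file:///dir/file", "file://host.example/dir/file", "tv:"]:
    print(candidate, is_context_dependent(candidate))
```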

The prime example of an ambiguous datatype is xsd:QName (and all types that use it). Its lexical-to-value space mapping depends on declared XML namespaces present in the current context, but there are no namespace declarations in RDF. The only prefixes that a reasoning scheme supporting xsd:QName could be expected to resolve correctly are xml: and xmlns:, but anything else could be wildly inconsistent, thus it is better to stick to a datatype that supports explicit namespaces, such as http://www.w3.org/1999/XSL/Transform#EQName (with the same value space). Note that for all purposes other than as a datatype, xsd:QName is fine (if used as a class for example, in which case it is equivalent to xsl:EQName).
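
The ambiguity is easy to demonstrate. In the Python sketch below, the same QName string resolves to two different names depending on which prefix declarations happen to be in scope, while the Q{uri}local form of an EQName carries its namespace with it. The function qname_to_eqname and the second, made-up namespace URI are assumptions for this example.

```python
from typing import Mapping

def qname_to_eqname(qname: str, namespaces: Mapping[str, str]) -> str:
    """Resolve 'prefix:local' against in-scope namespaces into 'Q{uri}local'."""
    prefix, separator, local = qname.partition(":")
    if not separator:
        # Unprefixed name: use the default namespace, or none at all.
        uri, local = namespaces.get("", ""), qname
    elif prefix in namespaces:
        uri = namespaces[prefix]
    else:
        raise KeyError(f"undeclared prefix: {prefix!r}")
    return "Q{" + uri + "}" + local

context_a = {"svg": "http://www.w3.org/2000/svg"}
context_b = {"svg": "http://example.org/not-svg"}  # hypothetical clashing declaration

print(qname_to_eqname("svg:circle", context_a))
print(qname_to_eqname("svg:circle", context_b))
```

Converting at the point where the XML context is still known, and storing only the explicit form, avoids having to carry the declarations along with the data.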

It is possible to embed RDF data in XML, send it in a MIME message or store it in a local file, in which case an application can infer the meaning from the context, but the nodes are not portable and the application must remember to transform them when the context changes, for example to blank nodes.
