March 13, 2021

More about datatypes in RDF

Datatypes are arguably the most confusing and also the most underutilized feature of RDF. To understand datatypes, we first need to understand literals.

Literals

While RDF and XML are very different in many aspects, they share some core concepts. XML at its core doesn't really have standard datatypes like you'd find in many programming languages and other data languages like JSON. Instead, you have text (or character data if you will) which might get a specific meaning via other facilities, but to an external observer, everything is only text (or whitespace if you want to go into details).

RDF is very similar to XML in that a literal is simply a piece of (character) data. Unlike in XML, however, it is also possible to assign a datatype to the literal. The notion of a plain (untyped) literal was changed somewhat in RDF 1.1, making xsd:string the implicit datatype (more on that later). Specific serialization formats may define syntax for other common typed literals, such as numbers or booleans, but all of them are still backed by text. Thus a literal is simply a piece of text, optionally with something that identifies its datatype (we will set language-tagged literals aside for now).
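
To make this concrete, here is a minimal Python sketch of a literal as "text plus a datatype identifier". The Literal class and its field names are my own illustration and not part of any RDF library; the only thing taken from RDF 1.1 is the default of xsd:string.

```python
from dataclasses import dataclass

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


@dataclass(frozen=True)
class Literal:
    """Hypothetical model of an RDF literal: always text, plus a datatype IRI."""
    lexical_form: str
    # RDF 1.1: a literal written without a datatype is implicitly xsd:string.
    datatype: str = XSD_STRING


plain = Literal("42")               # same as "42"^^xsd:string
typed = Literal("42", XSD_INTEGER)  # "42"^^xsd:integer, still backed by text

print(plain.datatype)
print(plain == typed)  # same text, different datatype, so different terms
```

Both literals carry the same text; only the attached identifier differs, which is all that "typed" means at this level.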

What is a datatype?

Asking what an RDF datatype is is like asking what an XML namespace is. It's just a concept that is used in some places, at least in core RDF, and it is up to the implementation of a specific data processor how to treat (or resolve) them. So far, there is no universal way to define a datatype, but implementations could recognize definitions in XML Schema or other formats.

Formally, a datatype is defined as a mapping from a set of strings (its lexical space) to an arbitrary set of entities (its value space). Like most identifiers in RDF, dereferencing the URI of a datatype should eventually lead to its description, even if that description is only human-readable. As most of the datatypes are taken from XML Schema, they are also defined there, but there are some outliers, as well as some special types.
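
The formal definition can be sketched in a few lines of Python, using xsd:boolean as the example because its lexical space is tiny (true, false, 1 and 0). The Datatype tuple and the helper names are hypothetical; a real processor would organize this differently.

```python
from typing import Callable, NamedTuple


class Datatype(NamedTuple):
    """Hypothetical model: an IRI plus a lexical-to-value mapping."""
    iri: str
    lexical_to_value: Callable[[str], object]


def parse_boolean(lexical: str) -> bool:
    # The lexical space of xsd:boolean has exactly four members.
    mapping = {"true": True, "1": True, "false": False, "0": False}
    if lexical not in mapping:
        raise ValueError(f"{lexical!r} is not in the lexical space of xsd:boolean")
    return mapping[lexical]


XSD_BOOLEAN = Datatype("http://www.w3.org/2001/XMLSchema#boolean", parse_boolean)


def in_lexical_space(datatype: Datatype, lexical: str) -> bool:
    """A string is in the lexical space if the mapping accepts it."""
    try:
        datatype.lexical_to_value(lexical)
    except ValueError:
        return False
    return True


print(XSD_BOOLEAN.lexical_to_value("1"))
print(in_lexical_space(XSD_BOOLEAN, "yes"))
```

Note that "true" and "1" are different strings that map to the same value, so the mapping does not have to be one-to-one.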

Like other parts of the RDF vocabulary, it is not prescribed how datatypes are supposed to be used. There are restrictions on standard RDF and XSD datatypes, which are to be interpreted equivalently in all conforming processors, but otherwise you can use them for any purpose. Formatting, input validation and reasoning are common use cases.

In RDFS, rdfs:Literal is a superclass of all datatypes (identified as rdfs:Datatype). This makes rdfs:Literal a kind of datatype, but without a lexical space. Another special datatype is rdf:langString for all language-tagged literals; this one is not very useful on its own, as language-tagged strings are constructed directly, but it is fine for use as a class. Then there is rdf:PlainLiteral that, contrary to its name, is just a union of both xsd:string and rdf:langString. Another interesting type is owl:real for all real numbers, again without a lexical space, as it is impossible to cover all real numbers by finite strings.

It is important to be aware that a specific kind of "type erasure" occurs when reasoning about data. A typed literal denotes a member of the value space of its datatype, and thus becomes "disconnected" from its original syntactic form. Therefore "en"^^xsd:language and "en"^^xsd:string denote the same entity, simply a string and not a language identifier. However, XML Schema defines specific primitive types that are pairwise disjoint, so something like "en"^^xsd:anyURI does denote a different entity. This only affects reasoners and not SPARQL, so if someone uses xsd:language to denote a language tag (and they don't have to), you will still be able to find all such literals in a dataset. If an ontology restricts a property to xsd:language, you will still be able to use "en" without creating an inconsistency.
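
The difference between the two views can be sketched in Python. The assumptions here are mine: a value is modelled as a pair of (primitive type, string), the table only covers the three types mentioned above, and the lexical-to-value mapping is treated as the identity, which is good enough for "en" but not in general.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Each datatype and the primitive type its value space belongs to.
# xsd:language is derived (via xsd:token) from xsd:string.
PRIMITIVE_OF = {
    XSD + "string": XSD + "string",
    XSD + "language": XSD + "string",
    XSD + "anyURI": XSD + "anyURI",
}


def denoted_value(lexical: str, datatype: str) -> tuple:
    """Reasoner's view: only the value counts, tagged with its primitive type."""
    return (PRIMITIVE_OF[datatype], lexical)


def term(lexical: str, datatype: str) -> tuple:
    """Syntactic view (as with SPARQL's sameTerm): the datatype IRI is kept."""
    return (lexical, datatype)


print(denoted_value("en", XSD + "language") == denoted_value("en", XSD + "string"))
print(denoted_value("en", XSD + "anyURI") == denoted_value("en", XSD + "string"))
print(term("en", XSD + "language") == term("en", XSD + "string"))
```

The first comparison is the "erasure": the derived type leaves no trace in the value. The last one shows why a query that looks at the terms themselves can still tell the two literals apart.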

Where to look for datatypes?

The first place to look is the XML Schema specification for descriptions of standard datatypes, both primitive and derived. Only a subset of those is used in RDF, however, as some of them depend on an XML context. For markup, RDFS informally identifies two more types, rdf:XMLLiteral and rdf:HTML, as (syntactically valid) fragments of XML and HTML, respectively. OWL then defines owl:real and owl:rational. These are all the types that are most likely to be correctly recognized.

Outside of standard datatypes, I encourage you to pick a dataset and look for datatype declarations via SPARQL: SELECT DISTINCT ?dt WHERE { ?dt a rdfs:Datatype }. I was able to find a list of datatypes used by DBpedia, and other datasets may come up with their own.
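
If the query syntax is unfamiliar, this is roughly what that pattern matches, sketched in plain Python over an in-memory list of triples. The example.org IRIs are made up for illustration, and a real dataset would of course be queried through a SPARQL endpoint rather than like this.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_DATATYPE = "http://www.w3.org/2000/01/rdf-schema#Datatype"

# Hypothetical triples as (subject, predicate, object) tuples.
triples = [
    ("http://example.org/dt/celsius", RDF_TYPE, RDFS_DATATYPE),
    ("http://example.org/dt/celsius", RDF_TYPE, RDFS_DATATYPE),  # duplicate
    ("http://example.org/dt/kilogram", RDF_TYPE, RDFS_DATATYPE),
    ("http://example.org/thing", RDF_TYPE, "http://example.org/Thing"),
]


def declared_datatypes(triples):
    """Equivalent of: SELECT DISTINCT ?dt WHERE { ?dt a rdfs:Datatype }"""
    return {s for s, p, o in triples if p == RDF_TYPE and o == RDFS_DATATYPE}


for dt in sorted(declared_datatypes(triples)):
    print(dt)
```

Keep in mind that this only finds datatypes that are explicitly declared; a datatype that is merely used on some literal will not show up.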

I have also found Extra Types! containing a list of datatypes for pieces of code in various languages. These are defined in a somewhat peculiar manner, however, since the language of the fragment is essentially defined as a part of its datatype identifier. Many languages like Lua are also missing, and it is not clear what to do in that case. Personally, I find language-tagged literals a better way of storing code, especially if the language is user-defined, like "local a = 0"@zxx-Latn-x-lua, where the name may eventually end up as an input to a syntax highlighter supporting the particular grammar.
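
Getting the name out of such a tag is simple, since in a language tag everything after the singleton "x" is private use. A minimal sketch (the function name is my own, and the tag is not validated in any way):

```python
from typing import List


def private_use_subtags(language_tag: str) -> List[str]:
    """Return the subtags following the private-use singleton "x", if any."""
    subtags = language_tag.lower().split("-")
    if "x" not in subtags:
        return []
    return subtags[subtags.index("x") + 1:]


print(private_use_subtags("zxx-Latn-x-lua"))
print(private_use_subtags("en-US"))
```

For "zxx-Latn-x-lua" this yields the single subtag "lua", which is what one would pass on to the highlighter.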

It is also possible to use types defined for other languages or systems that can be resolved even without any prior knowledge. DTD notations use public identifiers, so any URI that starts with urn:publicid: can be treated like a notation identifier if used as a datatype (for the limited number of cases notations are good for). XML Schema allows one to define richer datatypes that are compatible with RDF semantics, so one could arrive at something like http://www.w3.org/2001/10/synthesis#duration, which would encourage the processor to look for an XSD file for the namespace and find the datatype there. Other things that can be identified by a URI are YAML types and, interestingly, Google Protocol Buffers types, where one could come up with something like https://type.googleapis.com/google.protobuf.Duration, for example. CBOR tags are also applicable, but I wasn't able to find any suitable URI scheme for them. A lot of these types can be recognized and hooked up with a specific processor or language-specific types; for example, something like urn:schemas-microsoft-com:System.DateTime might be recognized and used to parse the literal as a type in .NET.
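
As a rough idea of how a processor might tell these apart, here is a Python sketch that classifies a datatype URI by its shape alone. The category names are invented for this example, and treating any URI with a fragment as a candidate for an XSD lookup is purely my assumption about how such a processor could behave.

```python
def classify_datatype(iri: str) -> str:
    """Guess where the definition of a datatype might be found (hypothetical labels)."""
    if iri.startswith("urn:publicid:"):
        return "dtd-notation"
    if iri.startswith("https://type.googleapis.com/"):
        return "protobuf"
    if iri.startswith("http://www.w3.org/2001/XMLSchema#"):
        return "xsd-builtin"
    if "#" in iri:
        # Assumption: namespace before "#", type name after it,
        # so the processor could try to locate a schema for the namespace.
        return "xsd-candidate"
    return "unknown"


print(classify_datatype("https://type.googleapis.com/google.protobuf.Duration"))
print(classify_datatype("http://www.w3.org/2001/10/synthesis#duration"))
```

Each category would then be hooked up to whatever parser or native type the processor has available for it.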

Functions as datatypes

In XPath (and XQuery and SPARQL by extension), it is possible to use a datatype to construct its value from a string like you'd do in other languages, for example xsd:integer("1"). This is the same syntax as for a function call, thus it can be argued that any unary function can be technically used as a datatype. The lexical space of such a datatype is simply the space of all strings that don't cause an error when used as the argument, and the value space is simply its return type.

Once again, XPath can supply us with a number of interesting functions. One may use "http://example.org/file.xml"^^fn:doc to refer to the actual XML document object at that address, "%/"^^fn:encode-for-uri to refer to the string value "%25%2F", or "abc"^^fn:upper-case to produce "ABC". And finally, JSON data can be represented as well in the form of '{"x":1}'^^fn:parse-json.
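
The idea can be sketched in Python by mapping function identifiers to unary callables. The Python functions are only stand-ins that approximate the XPath ones (json.loads returns a Python dict rather than an XPath map, for instance), and the registry and helper names are hypothetical.

```python
import json
from urllib.parse import quote

FN = "http://www.w3.org/2005/xpath-functions#"

# Approximate Python stand-ins for a few unary XPath functions.
FUNCTION_DATATYPES = {
    FN + "upper-case": str.upper,
    FN + "encode-for-uri": lambda s: quote(s, safe=""),
    FN + "parse-json": json.loads,
}


def literal_value(lexical: str, datatype: str):
    """The value of the literal is the result of calling the function on its text."""
    return FUNCTION_DATATYPES[datatype](lexical)


def in_lexical_space(lexical: str, datatype: str) -> bool:
    """The lexical space is every string the function accepts without an error."""
    try:
        literal_value(lexical, datatype)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


print(literal_value("abc", FN + "upper-case"))
print(literal_value("%/", FN + "encode-for-uri"))
print(literal_value('{"x":1}', FN + "parse-json"))
print(in_lexical_space("{", FN + "parse-json"))
```

The last call shows the lexical space at work: "{" is not valid JSON, so it is simply not a valid lexical form for this "datatype".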

This convention is somewhat extensible, e.g. nullary functions can be "called" by ignoring the literal string, and functions with a non-string argument may first convert the argument to that type. However, this gets a bit complicated with overloaded functions and functions with multiple parameters, so I'd recommend sticking to unary functions.
