IS4 Code Blog: JSON-LD is the new XML

It's been a pastime of mine to compare XML and JSON, pitting a mature, expressive and self-descriptive format against one that is ambiguous and unextensible. As many of you should realize by now, this comparison is actually quite meaningless ‒ the two formats have vastly different focuses and primary areas of use. XML is focused on documents (apparent, for example, in situations where formatting may be relevant), while JSON is focused on representing commonly used structures in programming languages (well, only those in JavaScript), nothing more and nothing less. There are, however, areas of use where the two formats overlap: describing entities or objects of various kinds, linked together using properties. In other words, linked data.

JSON-LD is a format that gives JSON better-defined semantics, as well as an extension mechanism, thanks to the underlying use of RDF. This brings the possibility to reuse standard objects, compose different JSON-LD documents together while retaining their semantics, and so on. What is interesting to realize, however, is that JSON-LD effectively turns JSON from dumb to smart in much the same way that this XML snippet:

<doc>
  <title>My document</title>
</doc>

is turned into:

<!DOCTYPE doc SYSTEM "http://example/doc.dtd">
<doc xmlns="http://example/doc.dtd">
  <title xml:lang="en">My document</title>
</doc>
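
For comparison, the JSON counterpart of this upgrade might look like the sketch below, written as Python dictionaries so that it can be run. The http://example/vocab# vocabulary URI is a made-up placeholder, not something defined elsewhere in this post.

```python
import json

# Plain JSON: "title" is just a string key with no agreed meaning.
plain = {"title": "My document"}

# The same data as JSON-LD: the context maps "title" into a vocabulary
# and sets a default language, much like the namespace and xml:lang above.
# The vocabulary URI is hypothetical.
linked = {
    "@context": {
        "@vocab": "http://example/vocab#",
        "@language": "en",
    },
    **plain,
}

print(json.dumps(linked, indent=2))
```

Note that the original key and value are untouched; only the context is new.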

JSON-LD has several features that, in essence, were just XML in disguise all along. Let's take a look at them!

The Context and Processing

The almighty import mechanism of JSON-LD is the @context property that allows you to upgrade your existing JSON documents while keeping most of the generation or processing routines intact. In fact, you don't even have to physically add it to the document; you may use the Link HTTP header to add it silently, just for the applications that require it.
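
As a rough sketch of that header: the JSON-LD specification defines the link relation http://www.w3.org/ns/json-ld#context for this purpose. The context URL below is a placeholder, and the helper name is made up for illustration.

```python
# Link relation defined by the JSON-LD specification for attaching
# a context to a response served as plain application/json.
JSONLD_CONTEXT_REL = "http://www.w3.org/ns/json-ld#context"


def context_link_header(context_url: str) -> str:
    """Build a Link header value pointing at a remote context (hypothetical helper)."""
    return f'<{context_url}>; rel="{JSONLD_CONTEXT_REL}"; type="application/ld+json"'


# Placeholder URL; a server would send this header alongside
# "Content-Type: application/json", leaving the body unchanged.
header = context_link_header("http://example/context.jsonld")
print("Link:", header)
```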

Using the context allows you primarily to pre-define the interpretation of other properties, such as:

Specifying that some properties are just aliases to other properties, even keywords.
Setting the type or language of property values.
Defining prefixes to be used in identifiers, expanding to URIs.
Setting the vocabulary for all unaliased and unprefixed properties through @vocab.
Setting the base URI of the document through @base.
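
The features listed above can be sketched in a single context. Everything here is illustrative: the example.* URIs are placeholders, and expand_term is a greatly simplified stand-in for the term expansion a real JSON-LD processor performs, not the algorithm from the specification.

```python
# A hypothetical context exercising each feature from the list above.
context = {
    "@base": "http://example/docs/",     # base URI of the document
    "@vocab": "http://example/vocab#",   # default vocabulary
    "dc": "http://purl.org/dc/terms/",   # prefix
    "id": "@id",                         # alias of a keyword
    "name": "dc:title",                  # alias of another property
    "created": {                         # property with typed values
        "@id": "dc:created",
        "@type": "http://www.w3.org/2001/XMLSchema#date",
    },
}


def expand_term(term, ctx):
    """Resolve a property name to a keyword or URI (greatly simplified)."""
    if term.startswith("@"):
        return term  # keywords stay as they are
    definition = ctx.get(term)
    if isinstance(definition, dict):
        definition = definition["@id"]
    if definition is not None and definition != term:
        return expand_term(definition, ctx)  # follow the alias
    prefix, sep, local = term.partition(":")
    if sep and not local.startswith("//") and isinstance(ctx.get(prefix), str):
        return ctx[prefix] + local  # prefix:suffix shorthand
    if sep:
        return term  # already an absolute URI
    return ctx["@vocab"] + term  # unaliased, unprefixed property


for term in ("id", "name", "created", "author"):
    print(term, "->", expand_term(term, context))
```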

The context may be remote, accessed through its URL, or defined inline, so a stand-alone JSON-LD document may always be created. Applying the context to the document creates a compacted JSON-LD document with a possibly different structure, but the same semantics.

Interestingly, this is very similar to the DOCTYPE mechanism in XML:

The Document Type Definition may be imported from an external resource, or included locally (inlined).
It can add attributes to any element with default values, even special attributes like xmlns, xml:base, and xml:lang.
It can specify attributes to refer to other elements or entities through their ID/name.
It can define entities to be used to include character or XML data in their place.
It can include and process other DTD documents, even conditionally.
And much more, such as validation.
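
The attribute defaulting can be observed with the parser in Python's standard library. This is only a sketch: the DTD is inlined as an internal subset instead of being fetched from http://example/doc.dtd, since that parser does not load external DTDs, and the declarations are my illustrative guess at what such a DTD could contain.

```python
import xml.etree.ElementTree as ET

# The DTD is inlined as an internal subset; the two ATTLIST declarations
# supply xmlns and xml:lang so the document body can stay plain.
document = """<!DOCTYPE doc [
  <!ATTLIST doc xmlns CDATA #FIXED "http://example/doc.dtd">
  <!ATTLIST title xml:lang CDATA "en">
]>
<doc>
  <title>My document</title>
</doc>"""

root = ET.fromstring(document)
title = root[0]

print(root.tag)      # element name, qualified by the defaulted namespace
print(title.attrib)  # attributes, including the defaulted xml:lang
```

The document body is identical to the plain snippet from the introduction, yet the parser reports it as if the namespace and language had been written out.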

As such, both may be used for roughly the same situations:

A document may be written in a very concise style, but expanded later by linking to the context/DTD or processing it directly through it, resulting in less verbosity with the same expressivity.
The context/DTD may be hosted externally, derived from others, or updated over time, sharing all the positive and negative implications of doing that.
Commonly written expressions may be reused from a single definition. Only prefixes in JSON-LD allow this, while XML supports reuse of all kinds of data, in text nodes and attributes alike (but not in element/attribute names).
Additional semantics can be provided for processing down the line, such as specifying which properties store identifiers or which have values of a particular datatype (could be represented as xsi:type on an XML element), or the language and base URI of the document or individual section, through the relevant attributes.

All in all, both mechanisms affect the processing of the document, providing additional context or metadata to other consumers, or shortening otherwise longer syntax. Compared to a DTD, however, the context does not validate, i.e. it does not restrict the structure of the JSON in the way DTDs do. Such a thing is not necessary, however, for reasons we will see very soon.

RDF and Semantics

What gives XML semantics and a level of self-descriptiveness are namespaces ‒ URI-identified abstract collections of terms for elements and attributes. Placing the elements in a particular namespace, applied using a prefix or affecting the whole document, has the advantage of disambiguation between other elements with the same name but from different namespaces. Thanks to the preprocessing phase through the DTD, namespaces may be defined implicitly, so you can still keep your document plain and simple.

JSON-LD uses RDF as its underlying semantic framework. In RDF, there are no individual namespaces; instead, all terms are identified using a URI, while namespace-like URIs are more or less just a loose way to link and express them together. In order for JSON-LD terms to be semantically meaningful, they have to map to a URI that identifies the referenced resource.

Let's compare the two approaches:

Both approaches use URIs, making it possible to take advantage of the various URI schemes there are and ever will be.
XML namespaces are optional ‒ elements do not have to use them and they are still perfectly parseable. While properties in RDF may theoretically be identified using relative URIs, JSON-LD does not support this ‒ all properties must map to an absolute URI, or to a blank node, but the latter is usually not representable in other RDF serialization formats and would not be very useful anyway. Unmapped properties are not visible to JSON-LD processors, and are either ignored or reported.
Both approaches use colon-separated prefixes as a shorthand syntax. XML itself does not allow specifying the namespace URI directly, only via an xmlns attribute, and the local name has some restrictions on its characters, complicating formats like RDF/XML which treat XML namespaces as RDF namespaces.
Qualified names in XML (namespace + local name) are not resolvable directly; only the namespace itself is. If you want to obtain more information about the elements/attributes, you have to locate a schema for that namespace. In RDF, every vocabulary item is resolvable as a URI, and people may easily find its description, expressed in RDF again. If you want to create your own vocabulary however, you don't have to publish anything at the URIs you choose, but it's definitely a good practice to use RDFS or OWL to model it.
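
The difference between the two kinds of names can be shown side by side. This is a minimal sketch using the Dublin Core elements namespace as an example; the prefix table is a stand-in for a JSON-LD context, not a real processor.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# XML: the parser reports a (namespace, local name) pair for dc:title.
element = ET.fromstring(f'<dc:title xmlns:dc="{DC}">My document</dc:title>')
namespace, local_name = element.tag[1:].split("}")

# JSON-LD/RDF: the prefix is only shorthand; the property is the single
# URI produced by concatenating the prefix URI and the suffix.
prefixes = {"dc": DC}
prefix, suffix = "dc:title".split(":", 1)
property_uri = prefixes[prefix] + suffix

print((namespace, local_name))
print(property_uri)
```

Only the second form is something you can paste into a browser to look up the term itself.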

Conclusion

All in all, a comparison between JSON-LD and XML is definitely fairer than one using plain JSON, even though the two sit at slightly different levels (as far as I am aware, I might be the first to treat these two technologies as comparable). I do think, however, that the step from JSON to JSON-LD is similar to that from SGML to XML, and that it is a good step in both cases. Just as I think that all public XML documents should use namespaces and be semantic, all public JSON documents should expose a context and be fully loadable as JSON-LD (note the word documents here; API payloads are generally not documents). This makes it possible to embed such documents within one another, and so forth.

To be honest, I also think the nature of JSON aligns better with RDF than XML does, due to its natural unorderedness and its focus on properties rather than on embedding elements within other elements. Nevertheless, I also wish for something that could add context to XML documents, or a standardized approach to treating them as RDF in a semantic way (I attempted such a thing in my tool xml2rdf, and will try the same in SFI). To derive full information from text nodes or attributes, XML also needs a schema or DTD, something most existing documents hardly provide. I sure wish there had been a better format.

There is also RDF/XML, which is a reasonable RDF serialization format, but not a step from XML to RDF, and is too strict for people to want to upgrade existing XML documents. It is also not perfect for encoding RDF, since properties using URIs that cannot be converted to qualified names cannot be expressed. It could however serve as a good starting point for a potential XML annotation mechanism, such as by using rdf:datatype instead of xsi:type to preserve type information for those who don't want to deal with namespaces.
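
To illustrate the qualified-name limitation, here is a sketch of how a serializer might try to split a property URI into a namespace and a local name. The NCName pattern is deliberately simplified to ASCII, the function name is made up, and the example.* URIs are placeholders.

```python
import re

# Simplified NCName: ASCII letters, digits, "_", "-" and "." only, and not
# starting with a digit, "-" or ".". Real XML allows many more characters.
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def split_property_uri(uri):
    """Return (namespace, local name) usable as an XML element name, or None."""
    # Try the longest suffix first, keeping the namespace part non-empty.
    for i in range(1, len(uri)):
        if NCNAME.fullmatch(uri[i:]):
            return uri[:i], uri[i:]
    return None


print(split_property_uri("http://example/vocab#title"))
print(split_property_uri("http://example/vocab/123"))
```

The second URI has no suffix that may serve as a local name, so a property identified by it cannot be written as an RDF/XML element at all.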

As always, the choice between JSON and XML depends on your particular use case, but if you love JSON so much that you want to use it even in situations where XML would have been fine, use JSON-LD!

June 12, 2023
