February 15, 2021

A completely meaningless comparison of XML and JSON

I have never had anything against XML. Despite its many quirks and burdens, I think it is still a perfectly reasonable format for providing or accepting structured data, and I think it is meaningless (albeit trendy) to compare it with JSON in order to determine which one is better. It is obvious that the two languages were designed for different purposes, and the real virtue in data engineering (or software engineering in general) is to know and use the proper tools for completing a task.

In that regard, neither language should come out as the winner in this comparison, but it should give you an idea of when to use one or the other.

Focus

The two languages have vastly different views of the world of data. XML is "essentially" text with a bit of well-formed markup in between, giving rise to its hierarchy, while JSON already is the hierarchy, storing text or other literals inside.

An application that doesn't understand XML can (theoretically) strip the markup and still retrieve a (somewhat) useful textual representation of the content (depending on how the content is distributed between attributes and elements). With JSON, there has to be some parsing involved (to differentiate between string keys and values), and there is no single canonical text representation due to the unordered nature of JSON objects, unless one uses a special key for a "default" value.
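A minimal sketch of that difference, using only Python's standard library. The two sample documents and the "text" key are my own assumptions, picked purely for illustration; the regular expression stands in for an application that knows nothing about XML.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical sample documents carrying roughly the same information.
xml_doc = "<note>Meet <who>Alice</who> at <when>noon</when>.</note>"
json_doc = '{"text": "Meet at", "who": "Alice", "when": "noon"}'

# An application that knows nothing about XML can simply drop the tags.
naive_text = re.sub(r"<[^>]*>", "", xml_doc)

# A real parser gives the same result by joining text nodes in document order.
xml_text = "".join(ET.fromstring(xml_doc).itertext())

# JSON has to be parsed first, and which key holds "the text" is a
# convention the application must know ("text" is an assumption here).
json_text = json.loads(json_doc)["text"]

print(naive_text)  # Meet Alice at noon.
print(xml_text)    # Meet Alice at noon.
print(json_text)   # Meet at
```

Note that the naive approach only works this well because all the content here lives in elements; anything stored in attributes would be lost.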

Structure

An XML (fragment) is (again "essentially") a sequence of nodes, either simple nodes (text, whitespace or CDATA sections), complex elements, or metadata nodes (comments and processing instructions). An element has a map of its attributes, and a content which is again an XML fragment. This forms the hierarchical structure, with elements as containers of other elements.

JSON, on the other hand, is not a sequence, but a value. This value can be either simple (string, number, boolean or null) or complex (objects and arrays). In any case, both formats have a highly hierarchical structure, with focus on nested data rather than complex graphs.
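To make the two shapes concrete, here is a small sketch using Python's standard library. The sample fragment and the JSON encoding of it are assumptions of mine; there are many other ways to map one onto the other.

```python
import json
from xml.dom import minidom

# Hypothetical fragment: text, an element, more text, and a comment.
dom = minidom.parseString("<p>Hello <b>big</b> world<!-- note --></p>")

# The content of <p> is a sequence of nodes of different kinds.
node_names = [node.nodeName for node in dom.documentElement.childNodes]
print(node_names)  # ['#text', 'b', '#text', '#comment']

# One possible JSON encoding of the same data: a single value (an object)
# that contains an array, which in turn contains strings and an object.
value = json.loads('{"p": ["Hello ", {"b": "big"}, " world"]}')
kinds = [type(item).__name__ for item in value["p"]]
print(type(value).__name__, kinds)  # dict ['str', 'dict', 'str']
```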

Power

With a data analogue of computational power in mind, JSON is, in a sense, slightly more powerful than XML. The sole reason is that JSON natively supports nested unordered maps, unlike XML. In XML, the only way to store an unordered map natively is to use the attribute collection of an element, but attributes can only have simple (textual) values. This is easily worked around, either by having the application treat the nodes inside an element as unordered, or by storing references to elements defined elsewhere in attributes, but it is not direct.

Note however that some implementations of JSON may treat two objects that differ only in the order of their properties as different (especially when they are compared in textual form).
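The following sketch shows the effect with Python's json module. The canonical() helper is a hypothetical name of mine, and sorting the keys is only a simple normalization, not a complete canonicalization scheme.

```python
import json

# The same object written with its properties in a different order.
a = '{"name": "Ada", "born": 1815}'
b = '{"born": 1815, "name": "Ada"}'

# Compared as text, the two documents differ.
same_text = a == b

# Compared as parsed values, they are equal (Python dicts ignore order
# when tested for equality).
same_value = json.loads(a) == json.loads(b)


def canonical(text: str) -> str:
    """Re-serialize with sorted keys and no extra whitespace (hypothetical helper)."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))


print(same_text, same_value)         # False True
print(canonical(a) == canonical(b))  # True
```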

Datatypes

Unlike in JSON, there is a significant difference between text and other nodes in XML. XML recognizes that essentially, there can only be text in a text file, and so there are no primitive datatypes like numbers or booleans, just text. XML DTDs can, however, use some special text types like ID or IDREF which affect their processing and semantics but don't really use any other base datatype than text. Elements are the complex objects, with attributes acting as their (simple) properties and one primary content, again (parsed) text.

JSON, on the other hand, doesn't invent any specific types and instead uses concepts that are analogous to common programming languages: strings, numbers, booleans, null, arrays and maps. The distinction of simple types does however have its downsides: the way numbers are parsed is implementation-defined, and some implementations may sacrifice precision when converting a number to a common numeric datatype, while others will simply store JSON numbers as strings internally. Anything outside the range of simple types has to be encoded via one of the other datatypes in JSON (dates are arguably the most wanted datatype missing from JSON).
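Python's json module happens to make the number problem easy to demonstrate, because it lets the caller choose how numbers are converted. The document below is a made-up example; passing parse_int=float is only my way of imitating an implementation that stores every number as a double.

```python
import json
from datetime import date
from decimal import Decimal

# Hypothetical document with more digits than a double can hold.
doc = '{"price": 0.1000000000000000055, "id": 12345678901234567890}'

# Default parsing: the fraction becomes a binary double and loses digits.
as_float = json.loads(doc)

# Asking for Decimal keeps the digits exactly as they were written.
as_decimal = json.loads(doc, parse_float=Decimal)

# Imitating a double-only implementation: the large integer is rounded.
as_double = json.loads(doc, parse_int=float)

print(as_float["price"] == 0.1)                 # True, extra digits are gone
print(as_decimal["price"])                      # 0.1000000000000000055
print(int(as_double["id"]) == as_float["id"])   # False

# Dates have no JSON type at all, so they are usually encoded as strings.
encoded = json.dumps({"published": date(2021, 2, 15).isoformat()})
print(encoded)  # {"published": "2021-02-15"}
```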

It should be noted that XML Schema, a technology based on XML, does offer a rich hierarchy of data types, even with ones such as double or float. This not only allows the document to state the type intention more precisely than in JSON, but in case of floats, it also supports infinities and NaN. This information is however (usually) not stated in the document itself, so one has to query the XML Schema or guess to find out the intended datatype.
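As a rough illustration of the float case: XML Schema writes the special values as INF, -INF and NaN, which Python's float() happens to accept as well, while strict JSON has no literal for them. This is only a sketch with the standard library, not an XML Schema processor.

```python
import json
import math

# The lexical forms XML Schema uses for special floating-point values.
xsd_values = [float(text) for text in ("INF", "-INF", "NaN")]
print(xsd_values)  # [inf, -inf, nan]

# Strict JSON output refuses such values.
try:
    json.dumps(float("inf"), allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False

# By default Python emits a non-standard literal instead, which other
# JSON parsers are free to reject.
lenient = json.dumps(float("inf"))

print(strict_ok)  # False
print(lenient)    # Infinity
```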

Difficulty to use

This is the one category where JSON clearly wins, due to its simplicity. It is very easy to express concepts understandable by both humans and computers in a portable and standardized way. XML, on the other hand, has a cloud of associated standard technologies that, while together making it more expressive than JSON, might make common tasks much harder. Specifying the type of an element's text content in XML is easy, but specifying the type of an attribute requires a whole schema and cannot be done for a single attribute in isolation.

Difficulty to write

Personally, I find the wordiness of XML better for writing and navigating, especially with very large elements. I am actually glad that </> did not become a way to end an element in XML, as otherwise there would be the same issue as in JSON, where I often have no idea what a particular } actually ends (even with proper indentation).

Expressiveness

The higher complexity of XML allows expressing more things natively than in JSON. The word "natively" is again important here, because anything can be expressed in JSON as well, as long as the application supports it. With XML, this is contained within parsers, so there is no need to clutter the application with additional logic.

Examples include comments and processing instructions which should be ignored by normal parsers altogether (and thus are not subject to validation), using a DTD to specify default values of attributes or the processing of strings (trimming whitespace), standardized interlinking of elements via ID and IDREF attributes, and the whole concept of entities which makes it possible to reuse the same text in multiple places, import it from a file or a web resource, or even to link to resources of arbitrary formats from attributes. Theoretically, when combined with HTTP and MIME, it is possible to craft an XML file that contains multiple elements with infinite content or elements whose content changes over time, all of that without ever defining any custom instructions (but most likely with extending parsers beyond their usual capabilities).
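Two of these features, internal entities and attribute defaults, can be sketched with Python's standard ElementTree parser, which reads declarations from the internal DTD subset. The memo document and all names in it are invented for the example, and other parsers may be configured to refuse DTDs entirely for security reasons.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the DTD declares a reusable entity and a
# default value for the "lang" attribute of <memo>.
doc = """<!DOCTYPE memo [
  <!ENTITY company "Example Corp">
  <!ATTLIST memo lang CDATA "en">
]>
<memo>
  <from>&company;</from>
  <footer>Copyright &company;</footer>
</memo>"""

root = ET.fromstring(doc)

# The parser expands the entity in both places; the application never
# sees the reference itself.
sender = root.find("from").text
footer = root.find("footer").text

# The attribute was never written on <memo>, but the parser supplies
# the default from the DTD.
lang = root.get("lang")

print(sender)  # Example Corp
print(footer)  # Copyright Example Corp
print(lang)    # en
```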

Metadata and linking

Namespaces and qualified names are among the most useful facilities of XML. They make it very easy to embed an element with a specific meaning to one application in a more general wrapper (such as a SOAP envelope) without the risk of confusing any processor. Using a namespace (either directly or with a prefix) turns the names of all elements into pairs, with the first component being the URI of the namespace (which may or may not provide something useful about the elements) and the second component being the local name of the element. When this is used properly, something can be understood from any XML document, as long as it uses common namespaces.
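Here is a small sketch of how those pairs look to an application, using Python's ElementTree, which reports each element name as {namespace URI}local name. The order payload and its urn:example:shop namespace are made up; split_name() is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# A SOAP 1.2 envelope wrapping a payload from a made-up namespace.
doc = """<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
  <soap:Body>
    <order xmlns="urn:example:shop"><item>book</item></order>
  </soap:Body>
</soap:Envelope>"""


def split_name(tag: str) -> tuple[str, str]:
    """Split ElementTree's '{uri}local' form into a (uri, local) pair."""
    uri, _, local = tag[1:].partition("}")
    return uri, local


root = ET.fromstring(doc)
names = [split_name(element.tag) for element in root.iter()]

for uri, local in names:
    print(f"{local:10} {uri}")

# A processor can pick out the payload without knowing the wrapper.
payload = root.find(".//{urn:example:shop}order")
print(payload.find("{urn:example:shop}item").text)  # book
```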

JSON documents, on the other hand, are plain and without any distinction between data and metadata in syntax. However, the use of JSON-LD combines the simplicity of JSON with the power of RDF, making it possible to produce documents that are somewhat self-describing. This is not as good as in XML however, since applications that do not understand JSON-LD have to ignore any property name that starts with @, and it is not a core feature of the format so programmers are less likely to notice it.
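A rough sketch of what this looks like to an application that only knows plain JSON. The document is a made-up example, and looking up a term in @context by hand is merely an illustration, not a substitute for a real JSON-LD processor.

```python
import json

# Hypothetical JSON-LD document: @context maps the short property name
# to a full IRI, @id names the thing being described.
doc = json.loads("""{
  "@context": {"name": "http://schema.org/name"},
  "@id": "http://example.org/people/ada",
  "name": "Ada Lovelace"
}""")

# A plain JSON consumer sees the metadata as ordinary properties and
# has to skip the ones starting with "@" itself.
plain = {key: value for key, value in doc.items() if not key.startswith("@")}

# Resolving a term by hand shows what the metadata adds.
name_iri = doc["@context"]["name"]

print(plain)     # {'name': 'Ada Lovelace'}
print(name_iri)  # http://schema.org/name
```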
