December 12, 2021

XML Lite (definitely not XML 2.0)

For years after XML's conception, it was customary for programmers to point out its perceived flaws and suggest fixes to the language. This trend has declined somewhat in recent years, as developers have either learnt to live with XML or discovered that they can use something else.

I do not share these feelings towards XML. I find it suitable for many purposes, and I think much of the criticism stemmed from misunderstanding. Yet even I would like to add some features to the language, based on my experience. Let's take a look at them.

The format itself

XML Lite should be considered a binary format by transport media, similarly to JSON. This has these two consequences:

  • There is no external indicator of the encoding of individual documents (no charset in Content-Type). The parser has to start in a Unicode encoding by default (indicated by a BOM or defaulting to UTF-8), then switch to whichever encoding is indicated in the XML declaration once it is encountered.
  • Line ending and other conversions are not performed when an XML Lite document is transported over network.

As a result, every XML Lite document is treated the same by every viewer: its interpretation can no longer depend on the native encoding or line-ending convention of whichever computer happens to open it.
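The encoding rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name sniff_encoding, the 512-byte look-ahead, and the simplified declaration pattern are all hypothetical and not part of any specification.

```python
import codecs
import re

# BOMs are checked longest-first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Simplified pattern for the encoding pseudo-attribute of the XML declaration.
_DECL = re.compile(
    r'^<\?xml\s[^>]*?encoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1'
)

def sniff_encoding(data: bytes) -> str:
    """Return the encoding a parser should use for the rest of the document."""
    initial, offset = "utf-8", 0          # default when no BOM is present
    for bom, name in _BOMS:
        if data.startswith(bom):
            initial, offset = name, len(bom)
            break
    # Decode only a short prefix in the initial Unicode encoding (assumed 512 bytes).
    head = data[offset:offset + 512].decode(initial, errors="replace")
    match = _DECL.match(head)
    # Switch to the declared encoding if there is one, otherwise keep the initial one.
    return match.group(2).lower() if match else initial
```

Nothing outside the document itself (such as a Content-Type charset) is consulted, which is the point of treating the format as binary.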

Document Type Definition

As namespaces are going to be incorporated more tightly (see below), there is no reason to differentiate documents primarily based on DTDs. XML was a step from SGML towards uniformity, and the same could be done here: DTDs have been useful for defining custom entities or default values for attributes, but while their use may be worthwhile in specific cases, external entities have also been a vector for attacks.

It should not be the parser's responsibility to understand the schema of a document. That is not to say that DTD as a language should be completely eliminated, but it should be linked to the document via other means, on the same level as an XML Schema or Schematron description. Using an attribute or a processing instruction to link (or embed) a DTD will make it possible to embed an XML Lite document completely into another one without a loss of any information.

Integrated namespaces

XML follows SGML in that a colon may be specified as part of a name, but while the use of namespaces has been standardized in all technologies based on XML, the specification itself does not incorporate them.

In XML Lite, namespaces are an integral part of the document. All XML Lite parsers should understand how namespaces are declared and assigned, and reject documents that use undefined namespaces (most parsers do that for XML 1.0 anyway, but they technically do not have to).

Every XML Lite document has a default namespace that is the same as its base URI. Namespace declarations could be relative URIs, in which case they are interpreted according to the base URI of the particular element they are placed on. This means that the syntax for undeclaring a namespace (xmlns:x="") cannot be a special case anymore. Namespaces can still be undeclared, but the syntax will require further additions.
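To see why the empty string stops being special, here is a small Python sketch that resolves a declared namespace against an element's base URI using the standard urljoin function. The helper name resolve_namespace and the example URIs are assumptions made for illustration.

```python
from urllib.parse import urljoin

def resolve_namespace(base_uri: str, declared: str) -> str:
    """Interpret a (possibly relative) namespace declaration against the base URI."""
    return urljoin(base_uri, declared)

base = "http://example.org/docs/people.xml"

# A relative declaration is resolved like any other relative reference.
relative = resolve_namespace(base, "colours")

# An absolute declaration is left untouched.
absolute = resolve_namespace(base, "http://www.w3.org/2000/svg")

# The empty string is a valid relative reference to the base URI itself,
# so xmlns:x="" would name the document's own namespace instead of undeclaring x.
empty = resolve_namespace(base, "")
```

Here relative becomes http://example.org/docs/colours, while empty is simply the base URI, which is also the document's default namespace.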

Unlike in normal XML, a prefix should never be given any meaning besides making data more human-readable. As a consequence, the prefixes xml and xmlns do not have any special behaviour – it's the namespace URI that determines how they should be interpreted.

Namespaced names are also recognized in new places: processing instruction targets and entity references (e.g. &math:integral;), with the same rules as elements and attributes. There is a slight issue with top-level processing instructions, in that there is no parent element they could get the namespace prefix from. This is something that will also be resolved, however.

Older standards of XML Namespaces contained a portion about the different "subnamespaces" of a namespace: one for its elements, one for its global attributes, and one added for every element to store its attributes. It is clear that processing instructions and entities should also get their own subnamespaces, but I don't think this is something that would require much elaboration on – a simple note that names used within different contexts do not necessarily have to point to the same entities (but may, depending on the namespace) is sufficient.

Default namespaces and namespace contents

There are a number of commonly used prefixes that might warrant inclusion by default, forming a sort of "initial context" (like in RDFa). In addition to xml and xmlns, these prefixes are xsi, xs, xi, xlink, xsl, xhtml, svg, soap, wsdl, rdf, and possibly more. While this list should not be extended by parsers on their own, there could still be room for facilities that import additional namespaces from external locations (as is already possible for XML by using a DTD to add xmlns attributes).

Conversely, for compatibility reasons, there are several local names that must be valid in any namespace, to accommodate the namespacelessness of entity references and processing instructions in standard XML, such as amp, lt, gt, quot, apos, or xml-stylesheet.

Expanded QNames

Qualified names (QNames) take the form of a prefix and a name, but what if the namespace could not be declared for some reason? A syntax used in some other XML-related technologies is a so-called expanded qualified name (EQName) which does not require prior declaration, with a notation like {http://www.w3.org/2000/svg}circle. Such syntax is therefore possible in any place where a normal name is expected, even in entity references (&{http://example.org}entity;).
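A minimal Python sketch of how a parser might map the three name forms to a (namespace URI, local name) pair follows. The function resolve_name and its arguments are hypothetical, and it assumes that a bare name falls into the default namespace, as it does for elements.

```python
from typing import Dict, Tuple

def resolve_name(name: str, prefixes: Dict[str, str], default_ns: str) -> Tuple[str, str]:
    """Map a bare, prefixed, or expanded name to (namespace URI, local name)."""
    if name.startswith("{"):
        # Expanded form: {uri}local needs no prior declaration.
        uri, closing, local = name[1:].partition("}")
        if not closing:
            raise ValueError(f"unterminated expanded name: {name!r}")
        return uri, local
    prefix, colon, local = name.partition(":")
    if not colon:
        # Bare name: assumed to live in the default namespace.
        return default_ns, name
    if prefix not in prefixes:
        # Undefined namespaces make the document ill-formed.
        raise ValueError(f"undeclared namespace prefix: {prefix!r}")
    return prefixes[prefix], local
```

With svg bound to http://www.w3.org/2000/svg, both svg:circle and {http://www.w3.org/2000/svg}circle resolve to the same pair, which is what makes dropping prefixes a lossless operation.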

This way, namespace prefixes could easily be removed from a document in order to convert it to some sort of "canonical form", since they should not add any information to it. There is an issue with some XML formats however, since QNames are occasionally also used inside attribute values. This is however something that could be immediately fixed with another addition.

Non-string attributes

To accommodate natural conversion between XML Lite and other data formats (like YAML or JSON), attributes should provide some way to indicate the type of the value. This is usually indicated by a schema, or explicitly for elements (xsi:type), but it could be useful when merging XML fragments and is necessary when dynamically-typed values are expected.

This is fortunately something that is easily done – without the quotes around an attribute value, it becomes "simple" and accepts one of several allowed forms, in this order:

  • A sequence of characters without spaces or markup, matching one of the xsd:boolean, xsd:dateTime, xsd:date, xsd:time, xsd:duration, xsd:gYearMonth, xsd:gMonthDay, xsd:gYear, xsd:gMonth datatypes. These datatypes have disjoint lexical spaces.
  • A numeric value, matching the xsd:integer, xsd:decimal, or xsd:double datatype, in this particular order.
  • A special value of nil (with the same semantics as xsi:nil="true" on an empty element).
  • A name (bare, prefixed, or expanded).

This type information is something that should be reported in the same way as existing parsers report it for elements (if supported). Each simple attribute value also has a corresponding textual form that is reported for compatibility. In the first two cases (booleans, dates and times, and numbers), the textual form is the same as the lexical form of the literal. In the case of nil, the textual form is the empty string. In the last case, the textual form is an entity reference (i.e. attr=ns:name is textually equivalent to attr="&ns:name;").
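The ordered matching described above might look roughly like the following Python sketch. It is deliberately incomplete and rests on several assumptions: only part of the first group is covered (the gYear-style types are left out), xsd:boolean is restricted to true and false so that 0 and 1 stay numeric, the regular expressions are simplified versions of the XML Schema lexical forms, and the name classify is made up for this example.

```python
import re
from typing import Tuple

_TZ = r"(?:Z|[+-]\d{2}:\d{2})?"  # optional timezone suffix

# Tried in order; the first full match wins.
_RULES = [
    ("xsd:boolean", r"true|false"),
    ("xsd:dateTime", r"-?\d{4,}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?" + _TZ),
    ("xsd:date", r"-?\d{4,}-\d{2}-\d{2}" + _TZ),
    ("xsd:time", r"\d{2}:\d{2}:\d{2}(?:\.\d+)?" + _TZ),
    ("xsd:duration",
     r"-?P(?=[\dT])(?:\d+Y)?(?:\d+M)?(?:\d+D)?"
     r"(?:T(?=\d)(?:\d+H)?(?:\d+M)?(?:\d+(?:\.\d+)?S)?)?"),
    ("xsd:integer", r"[+-]?\d+"),
    ("xsd:decimal", r"[+-]?(?:\d+\.\d*|\.\d+)"),
    ("xsd:double", r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)[eE][+-]?\d+|-?INF|NaN"),
]

# Bare, prefixed, or expanded name (simplified name characters).
_NAME = r"(?:\{[^{}\s]*\}|[A-Za-z_][\w.-]*:)?[A-Za-z_][\w.-]*"

def classify(value: str) -> Tuple[str, str]:
    """Return (kind, textual form) for an unquoted attribute value."""
    for datatype, pattern in _RULES:
        if re.fullmatch(pattern, value):
            return datatype, value          # literals keep their lexical form
    if value == "nil":
        return "nil", ""                    # nil is reported as the empty string
    if re.fullmatch(_NAME, value):
        return "name", f"&{value};"         # names are reported as entity references
    raise ValueError(f"not a valid simple attribute value: {value!r}")
```

For instance, classify("P20Y") yields ("xsd:duration", "P20Y") and classify("ns:name") yields ("name", "&ns:name;"). The order matters: nil and INF would otherwise also be valid names.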

A namespace can be easily unregistered by setting it to nil. The default namespace can be unregistered as well (xmlns=nil), effectively prohibiting its use anywhere in the element.

A possibility to add even more types to attribute value could be enabled by using Turtle-like syntax, such as attr="about:example"^^xs:anyURI, or perhaps language-tagged strings in the form of attr="hallo"@de.

Empty prefix and empty name

A minor addition, to accommodate RDF and linked data, is the possibility to use namespaces without a corresponding local name, like &ns:; or &{http://example.org}; in an entity reference (but this is usable everywhere). Additionally, the empty prefix is valid and initially assigned to the document's default namespace (but changeable via xmlns:="URI").

This makes it possible to distinguish a string attribute value containing a URI from a reference to a resource denoted by a URI. Some application-specific mechanism may be used for retrieving all sorts of data this way, making it embeddable into documents.

Empty end tag

As a rule, a syntax of </> should probably be a valid way to end an element. I'd argue it was not really necessary before, but with expanded QNames, things would get too wordy without it.

One could argue that parsers could be built in a way that requires the redundancy (in a switch-on/switch-off fashion), but proper XML parsers should already have access to the name of the start tag to have something to compare the end tag to.
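The stack that a parser already keeps for checking end tags is all that is needed to support </>. The Python sketch below works on a pre-tokenized list of tags; the function name expand_end_tags and that input shape are assumptions chosen to keep the example short.

```python
from typing import List

def expand_end_tags(tags: List[str]) -> List[str]:
    """Replace every empty end tag '</>' with the name of the element it closes."""
    open_names: List[str] = []   # names of the currently open elements
    result: List[str] = []
    for tag in tags:
        if tag.startswith("</"):
            if not open_names:
                raise ValueError("end tag without a matching start tag")
            expected = open_names.pop()
            given = tag[2:-1]
            # A named end tag must still match; an empty one matches anything.
            if given and given != expected:
                raise ValueError(f"expected </{expected}>, found {tag}")
            result.append(f"</{expected}>")
        elif tag.endswith("/>"):
            result.append(tag)   # empty element, nothing to remember
        else:
            open_names.append(tag[1:-1].split()[0])
            result.append(tag)
    return result
```

Given ["<people>", "<person>", "<note>", "</>", "</person>", "</>"], the function returns the same list with the two empty end tags expanded to </note> and </people>.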

Short element value syntax

A common sight in data serialized as XML are lots of properties like <Name>value</Name> in other elements. This is something XML was not designed for, but lots of programming languages do not provide a distinction that would tell an XML serializer to use an attribute instead of an element for the property.

This is something that cannot be completely eliminated, thus it is better to provide a short syntax in a case like this. In SGML, this had already been possible (<Name/value/) but that looks unnatural. The element's value could be provided in an attribute-like syntax, like <Name="value"/> which makes it possible to easily indicate the datatype as a part of the value.
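As a rough illustration, the quoted form of this short syntax can be rewritten into ordinary XML with a single regular expression. The Python sketch below assumes plain (non-expanded) element names and quoted string values only, and the name expand_short_elements is hypothetical.

```python
import re

# <Name="value"/> : a name immediately followed by '=' and a quoted value.
_SHORT = re.compile(r'<([^\s=<>/"]+)="([^"<]*)"\s*/>')

def expand_short_elements(markup: str) -> str:
    """Rewrite <Name="value"/> as <Name>value</Name>, leaving other tags alone."""
    return _SHORT.sub(r"<\1>\2</\1>", markup)
```

An element with ordinary attributes, such as <a b="c"/>, is not affected, because whitespace separates the element name from the attribute.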

Comments

Due to its other differences from SGML, a -- occurrence in a comment no longer has to be invalid for the sake of compatibility. A nice consequence is that something like <!-- --<element/><!-- --> could be turned from a comment to an element with an addition of just one > character, a "feature" already present in other programming languages.

Whitespace rules

Whitespace in XML has been somewhat fine in the latest versions, but it is clear that xml:space should also be incorporated to control which whitespace is significant and which isn't. The values of the attribute should however be extended with ways to specify that whitespace around text nodes should be trimmed (made insignificant), or even collapsed inside the text node.

Parsers usually already have a way of indicating whether a piece of whitespace is significant or not, and this would only mean splitting a text node into several text-like nodes.
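Splitting a text node this way could be sketched as follows in Python. The node representation (a list of kind/text pairs), the function name split_text_node, and the collapse flag are all assumptions for the sake of the example.

```python
import re
from typing import List, Tuple

_WS = " \t\r\n"  # the four XML whitespace characters

def split_text_node(text: str, collapse: bool = False) -> List[Tuple[str, str]]:
    """Split a text node into insignificant edge whitespace and significant text."""
    core = text.strip(_WS)
    if not core:
        # Whitespace-only node: nothing significant in it.
        return [("whitespace", text)] if text else []
    lead = text[: len(text) - len(text.lstrip(_WS))]
    trail = text[len(lead) + len(core):]
    if collapse:
        # Optionally collapse inner runs of whitespace to a single space.
        core = re.sub(r"[ \t\r\n]+", " ", core)
    parts = [("whitespace", lead), ("text", core), ("whitespace", trail)]
    return [part for part in parts if part[1]]
```

The surrounding whitespace is still reported, only marked as insignificant, so a consumer that wants the original text can concatenate the parts again (as long as collapsing is off).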

Inline special attributes

There are a few "special" attributes in XML, namely xml:space, xml:lang, and all xmlns attributes, which affect the processing of the element they are on. However, these attributes should not require an extra element just to apply them, instead they could be used as processing instructions to affect all nodes that follow them, in the current element:

Hello
<?xml:lang de?>
Hallo 
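One way a processor could apply such an instruction is to carry the current language along while walking the children of an element. The following Python sketch assumes a made-up node representation, a list of (kind, value) pairs, and a hypothetical function name assign_languages.

```python
from typing import List, Optional, Tuple

def assign_languages(
    children: List[Tuple[str, str]], inherited: Optional[str] = None
) -> List[Tuple[str, Optional[str]]]:
    """Pair each text node with the language in effect at that point."""
    current = inherited              # language inherited from the parent element
    result = []
    for kind, value in children:
        if kind == "pi":
            target, _, data = value.partition(" ")
            if target == "xml:lang":
                # Affects all following siblings within the current element.
                current = data.strip() or None
        elif kind == "text":
            result.append((value, current))
    return result

nodes = [("text", "Hello"), ("pi", "xml:lang de"), ("text", "Hallo")]
tagged = assign_languages(nodes)
```

Here "Hello" keeps the inherited language (none in this case), while "Hallo" is tagged as de.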

CDATA sections

CDATA sections in XML use syntax that one might legitimately describe as weird, particularly because this syntax is not used much outside DTDs. Since DTD is not a part of XML Lite, it makes sense to tweak and shorten this syntax a bit: <![[inner text]]> is, I feel, sufficient without the CDATA part.

A possibility is also to be able to specify what sequence of characters is expected to end the section, similar to Lua's long brackets – a sequence like <![==[ could be only ended with ]==]>, for any number of equal signs.
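Reading such a section is straightforward, because the opening delimiter fully determines the closing one. A minimal Python sketch, with the function name read_section being an assumption:

```python
import re
from typing import Tuple

_OPEN = re.compile(r"<!\[(=*)\[")   # <![[ , <![=[ , <![==[ , ...

def read_section(text: str, start: int = 0) -> Tuple[str, int]:
    """Read one literal section at `start`; return (content, index after it)."""
    opening = _OPEN.match(text, start)
    if opening is None:
        raise ValueError("no literal section at this position")
    # The closer repeats the same number of equal signs as the opener.
    closer = "]" + opening.group(1) + "]>"
    end = text.find(closer, opening.end())
    if end < 0:
        raise ValueError(f"section is not terminated by {closer}")
    return text[opening.end():end], end + len(closer)
```

With two equal signs, the content of <![==[a ]]> b]==]> is the string "a ]]> b": the shorter ]]> inside no longer ends the section.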

Conversions between XML and XML Lite

In most cases, an XML document should not require significant changes when converting to XML Lite. The encoding has to be provided if non-Unicode, the DTD must be removed or incorporated in a compatible way, and any of its features the document relies on (entities or default attribute values) should be incorporated into the document itself, if compatibility is required.

Existing entity references and processing instructions have different semantics in XML Lite: their names are to be interpreted according to the namespace of the enclosing element (or the default namespace). While they could in theory contain the colon character in XML, most parsers reject those, so any other error is unlikely. When there is a risk that the namespace could redefine some of the standard entities, they can be prefixed like so: &:nbsp;.

Use of plain QNames inside textual values is not forbidden in XML Lite, but they should be converted to entity references to prevent a potential loss of information due to namespace renaming.

Converting from XML Lite to XML is a bit more complicated, due to the changes in its underlying data model. Again, the DTD could be readded properly, but other things like namespaced processing instructions and entity references might be an issue and could not be represented accurately.

Similarly, simple attributes can only be represented in their textual form, without the datatype information. Default element values could, however, be turned into xsi:type or xsi:nil="true" if required. Unregistering namespaces has to be performed via nil (but note that even when not translated correctly, a valid XML document could not be misinterpreted, as using an unregistered namespace is not allowed).

Example

<?xml version="lite" encoding="utf-8"?>
<?xml:dtd people SYSTEM "http://example.org/people"?>
<people xmlns="http://example.org/people">
  <person>
    <name="Peter"@en/>
    <age=P20Y/>
    <favourite-colour={http://example.org/colours}green/>
    <note><![[literal text <3]]></>
  </>
</>

This document could be easily interpreted as (standard) XML:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE people SYSTEM "http://example.org/people">
<people xmlns="http://example.org/people"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:colours="http://example.org/colours">
  <person>
    <name xml:lang="en">Peter</name>
    <age xsi:type="xs:duration">P20Y</age>
    <favourite-colour>colours:green</favourite-colour>
    <note><![CDATA[literal text <3]]></note>
  </person>
</people>

Extensions and options

The last point of interest is the XML declaration itself. I was intrigued by the standalone component of the standard XML declaration, which signifies whether there are externally declared entities or default attribute values used by the document. This literal meaning of the component is not so necessary in XML Lite, as the presence of an "unresolved" entity is no worse than a presence of an "unresolved" element, so to speak.

The XML declaration can however be used as a place for specifying other features or requirements of the document, or the capabilities of the parser. I choose the required/optional approach to specifying the features: when a "required" feature is requested by the document, the parser or processor must be able to understand and use it, otherwise the document may be reported as ill-formed or worse, valid but misinterpreted. On the other hand, an "optional" feature indicates that the document is best processed or viewed with that specific feature turned on, but its unavailability does not impact the meaning of the document in any critical way.

For example:

<?xml version="lite" required="dtd valid" optional="external"?>

This declaration could indicate that it is essential for the processor to understand and process an xml:dtd processing instruction, and also that it must be able to validate the document (according to the schema inferred from the DTD). However, any external declarations found in the DTD do not have to be resolved, which indicates that even though they may be present, they are not necessary for the correct interpretation of the DTD or the document itself (but the user may be queried whether to allow loading such references before the processing starts).
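A processor's side of this negotiation could be sketched in Python as follows. The function names, the simplified pseudo-attribute pattern, and the idea of passing the processor's capabilities as a set are all assumptions; only the required and optional pseudo-attributes come from the example above.

```python
import re
from typing import Dict, List, Set

def parse_declaration(decl: str) -> Dict[str, str]:
    """Extract the pseudo-attributes of an XML declaration (double-quoted only)."""
    return dict(re.findall(r'([\w-]+)\s*=\s*"([^"]*)"', decl))

def unavailable_optional(decl: str, supported: Set[str]) -> List[str]:
    """Reject the document if a required feature is missing;
    otherwise list the optional features the processor cannot provide."""
    attrs = parse_declaration(decl)
    missing = [f for f in attrs.get("required", "").split() if f not in supported]
    if missing:
        raise ValueError("unsupported required features: " + ", ".join(missing))
    return [f for f in attrs.get("optional", "").split() if f not in supported]

decl = '<?xml version="lite" required="dtd valid" optional="external"?>'
degraded = unavailable_optional(decl, {"dtd", "valid"})
```

A processor that supports dtd and valid but not external accepts the document and gets ["external"] back, which it may use to warn the user. A processor lacking valid refuses the document outright.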

This mechanism may in theory be abused to extend the syntax itself with new constructs (or change the meaning of the old ones) introduced by various parsers, but even that is necessary to ensure that the format could organically change based on whatever requirements the future brings. In any case, it is however preferred to keep the set of required features at a minimum, and if possible, to publish an alternate version of the document without these requirements. For example, an XML Lite document that requires a schema to provide default values, or one that specifies an XSL transformation, could be preprocessed and published after those transformations take place, alongside the original.
