February 14, 2022

The Models of CONCURrent SGML

Nowadays, the old glory of SGML is almost forgotten, as it has been replaced by XML in all but a handful of places. The language itself still has some very impressive capabilities, however, which I dare say might yet make it a viable option for some very specific cases in the future. One of those features is "CONCUR", a.k.a. the original namespaces. This comparison is inaccurate of course, so I will first take a few paragraphs to describe what this feature is, and how it relates to the whole SGML ecosystem.

Concurrent SGML

Most people who know XML assume that SGML is pretty similar to it, apart from some syntax shortcuts, omitted tags and so on. What made SGML so powerful (and so rarely fully understood or implemented), however, was also the ability to configure the parser and toggle specific capabilities, beyond what the DOCTYPE controls, which could turn it into a completely different language.

One of those switches is CONCUR, turned off by default. This feature makes it possible to create two or more completely different hierarchies over the same data (text), for example:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<!DOCTYPE ORDER SYSTEM "order.dtd">
<(HTML)html><(ORDER)order>
<(HTML)head>
<(HTML)title>Order #<(ORDER)number>123</(ORDER)number></(HTML)title>
</(HTML)head>
<(HTML)body>
<(HTML)p><(ORDER)item>12 kg of cement</(ORDER)item> ordered by <(ORDER)buyer>Building Company</(ORDER)buyer>.</(HTML)p>
</(HTML)body>
</(ORDER)order></(HTML)html>

Here we see a document that is concurrently an HTML page but also an order with some proprietary markup. A browser only sees one "face" of the document, while a specialized application that understands orders only sees the other one.
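To make the idea of two "faces" concrete, here is a minimal Python sketch that projects a CONCUR-style source onto a single document type by dropping every tag that belongs to another type. It is not an SGML parser: the `project` function and the `TAG` pattern are hypothetical names, and the sketch assumes that every tag carries a document type qualifier and has no attributes.

```python
import re

# Matches qualified tags such as <(HTML)p> or </(ORDER)item>.
# Assumption: no attributes, comments, or unqualified tags in the input.
TAG = re.compile(r"<(/?)\((\w+)\)(\w+)>")

def project(source: str, doctype: str) -> str:
    """Keep only the tags of `doctype` (without the qualifier), drop the rest."""
    def keep_or_drop(match: re.Match) -> str:
        slash, owner, name = match.groups()
        return f"<{slash}{name}>" if owner == doctype else ""
    return TAG.sub(keep_or_drop, source)

doc = ("<(HTML)p><(ORDER)item>12 kg of cement</(ORDER)item> ordered by "
       "<(ORDER)buyer>Building Company</(ORDER)buyer>.</(HTML)p>")

print(project(doc, "HTML"))
# <p>12 kg of cement ordered by Building Company.</p>
print(project(doc, "ORDER"))
# <item>12 kg of cement</item> ordered by <buyer>Building Company</buyer>.
```

Both views share exactly the same character data, and only the markup differs.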

What might not be clear here is why this feature would be useful in the first place, when you could just as easily produce two separate documents. It is important to understand that when it was conceived, many technologies we know today hadn't been invented yet, or were not widely used. Things like content negotiation or XSL transformation would help in time, but at that point, stuffing separate hierarchies into a single document could make things easier in some cases. As an example, a person who produces or edits orders could see only the presentation (HTML or simpler) part, while any characters they modify would also become part of the hidden order markup.

It is also useful to recall how identifiers work in SGML, something that was carried over to XML but is rarely put to use nowadays. SGML uses two types of identifiers for document types, external entities, notations etc.; basically any "thing" (what we would call a resource in RDF) is identifiable using this pair. The PUBLIC identifier is the thing's "name", something that distinguishes it from other entities (we can think of it as a URN), while the SYSTEM identifier is a (surprise) system-specific location of the resource or its representation (we can think of it as a possibly relative URL). In both cases, there has to be a processor-specific way to resolve the identifiers in order to load the resource (for notations and external non-parsed entities, this is not strictly necessary to load the document).

CONCUR versus namespaces

XML has its own way of using elements from different sources: namespaces (now identified with just a single URI), but most similarities end there. Namespaces essentially only extend the local names used in documents with prefixes that have a better chance of being unique; XML is still a single hierarchy with a single root. Specifically, the following is not what the example above would be equivalent to in XML:

<html:html><order:order>
<html:head>
<html:title>Order #<order:number>123</order:number></html:title>
</html:head>
<html:body>
<html:p><order:item>12 kg of cement</order:item> ordered by <order:buyer>Building Company</order:buyer>.</html:p>
</html:body>
</order:order></html:html>

In theory, you could erase all the markup from the other namespaces and retrieve the intended structure that way while keeping the source merged. It is still a single hierarchy, however, and your intentions will be confusing to every other tool that wants to work with it. An additional consequence is that the order of the tags now matters.

To sum it up, XML namespaces are a means to uniquely assign meaning to names within documents. As a result, you can use elements and attributes defined in one namespace inside elements from another namespace without confusing the two. On the other hand, CONCUR SGML documents cannot really provide any interaction between their DTDs, as they interleave at the syntactic level only. There is a different feature in SGML, however, which could make this possible: SUBDOC, which allows you to include a document with one DTD in another document with a different DTD, as an entity. This is a better way to emulate namespaces.

A result of the way CONCUR works is the possibility to interleave the markups with almost no limits:

<(A)a>...<(B)b>...</(A)a>...</(B)b>

It should also be mentioned that document types in SGML and namespaces in XML are not really the same kind of thing. A DOCTYPE specifies both the "namespace" of the elements (i.e. their intended location in an interpretation system) and a schema against which such a document should validate. An XML namespace may be described by a schema (which could be found at its URL), but an XML schema could easily use elements from different namespaces and define validity for such "mixed" documents. A CONCUR document cannot be validated in any way that combines all its document types.

The CONCUR SGML data model

Say you really wanted to infer as much information from a CONCUR SGML document as possible. This means you simply cannot use the easiest approach and just separate the individual documents, as that would destroy any potential links between the two. What else can you do?

If you don't want to sacrifice the traditional single-root structure, you could accept a subset of valid SGML documents, where there is no (global) tag interleaving, but still satisfy the (correct) expectation that the following lines have the same meaning:

<(A)a><(B)b>...</(A)a></(B)b>
<(A)a><(B)b>...</(B)b></(A)a>
<(B)b><(A)a>...</(A)a></(B)b>
<(B)b><(A)a>...</(B)b></(A)a> 

In this model, the traditional element node is replaced with a "multielement" node that combines all the start tags at a particular place in the character data. We can therefore observe that the content is really enclosed in a single node with two "interpretations", A:a and B:b. This retains at least some concurrency in the markup, but some confusion may arise. For example, a simple change would increase the number of nodes by 1:

<(A)a><(B)b>...</(A)a>...</(B)b>

While the markup may not indicate that A:a is in fact contained within B:b (from the viewpoint of SGML, they simply share the start position), the model has to create an additional node if the elements do not completely overlap, and it must correctly reorder the start tags. A parser must be aware that this may happen after all the start tags and an unlimited portion of the content were processed.
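The grouping described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (every tag is qualified, there are no attributes, and end tags are never omitted); `multielements` is a hypothetical helper, not part of any SGML toolkit. It records each element as a range of text offsets and merges elements that cover exactly the same range into one node.

```python
import re
from collections import defaultdict

# Assumption: only qualified tags without attributes, e.g. <(A)a> and </(A)a>.
TAG = re.compile(r"<(/?)\((\w+)\)(\w+)>")

def multielements(source):
    """Map each (start, end) text range to the set of elements covering exactly it."""
    nodes = defaultdict(set)
    open_elements = defaultdict(list)  # doctype -> stack of (name, start offset)
    offset = 0  # number of text characters seen so far
    pos = 0     # position in the raw source
    for match in TAG.finditer(source):
        offset += match.start() - pos
        pos = match.end()
        slash, doctype, name = match.groups()
        if not slash:
            open_elements[doctype].append((name, offset))
            continue
        stack = open_elements[doctype]
        if not stack or stack[-1][0] != name:
            raise ValueError(f"unexpected end tag {match.group(0)}")
        _, start = stack.pop()
        nodes[(start, offset)].add(f"{doctype}:{name}")
    return dict(nodes)

variants = [
    "<(A)a><(B)b>...</(A)a></(B)b>",
    "<(A)a><(B)b>...</(B)b></(A)a>",
    "<(B)b><(A)a>...</(A)a></(B)b>",
    "<(B)b><(A)a>...</(B)b></(A)a>",
]
for variant in variants:
    print(multielements(variant))  # one node covering (0, 3) every time

# Partial overlap: A:a covers (0, 3) and B:b covers (0, 6), so two nodes.
print(multielements("<(A)a><(B)b>...</(A)a>...</(B)b>"))
```

Note that the range of an element is only known once its end tag arrives, which is exactly why the nesting of A:a inside B:b cannot be decided while reading the start tags.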

Interleaved tags can also be represented, at the expense of using additional node types. In a case like this:

<(A)a>...<(B)b>...</(A)a>...</(B)b>

we could separate B:b into two nodes, one inside A:a and the other one right after it, but we also want to mark the second node in a way that makes it clear it is a continuation of a previous node.
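A rough sketch of this splitting step, assuming elements are already known as ranges of text offsets: the `Fragment` type, its `continuation` flag, and `split_at` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fragment:
    name: str
    start: int
    end: int
    continuation: bool  # True if this piece continues an earlier fragment

def split_at(name, start, end, boundaries):
    """Cut the range [start, end) at every boundary that falls strictly inside it."""
    cuts = sorted(b for b in set(boundaries) if start < b < end)
    edges = [start, *cuts, end]
    return [
        Fragment(name, lo, hi, continuation=index > 0)
        for index, (lo, hi) in enumerate(zip(edges, edges[1:]))
    ]

# Text offsets for <(A)a>...<(B)b>...</(A)a>...</(B)b>:
# A:a covers (0, 6) and B:b covers (3, 9).
for fragment in split_at("B:b", 3, 9, boundaries=[0, 6]):
    print(fragment)
# Fragment(name='B:b', start=3, end=6, continuation=False)
# Fragment(name='B:b', start=6, end=9, continuation=True)
```

The first fragment nests cleanly inside A:a, and the second follows it as a sibling flagged as a continuation, so a consumer can stitch the two back together.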

The last thing to consider is entities. As far as I know, entity references can also be restricted to specific document types, and as such must also appear in the character data. If there is an inclusion mechanism in SGML that might restrict certain text nodes to specific document types (I am genuinely unsure about this), it must also be handled in some way.

Then there is also the option of using a model that matches the underlying structure better; this model, however, has multiple "roots" and not just one. First, we have to give "identity" to the textual data of the document, or rather to the ranges of it that are delimited by markup. This forms the first "root" of the model: just a list of all text nodes in the document, in the sequence in which they appear. This is the "intersection" of all the concrete hierarchies that are part of the document, something they all must agree on.

Next, for every distinct document type used in the document, an individual hierarchy is formed with its own root, precisely in a way standard SGML parsers should understand CONCUR. All hierarchies eventually meet at their shared text nodes.
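As a minimal sketch of such a multi-root model (again assuming only qualified tags without attributes, and using the hypothetical names `Element` and `parse`), the following Python code builds one shared list of text nodes plus one tree per document type. The trees refer to text nodes by their index in the shared list, which is where the hierarchies meet.

```python
import re
from dataclasses import dataclass, field

# Assumption: only qualified tags without attributes, e.g. <(A)a> and </(A)a>.
TAG = re.compile(r"<(/?)\((\w+)\)(\w+)>")

@dataclass
class Element:
    name: str
    children: list = field(default_factory=list)  # Elements or text node indexes

def parse(source):
    """Return (text_nodes, roots): the shared text list and one tree per doctype."""
    text_nodes = []
    doctypes = {match.group(2) for match in TAG.finditer(source)}
    roots = {doctype: Element("#root") for doctype in doctypes}
    stacks = {doctype: [root] for doctype, root in roots.items()}

    def add_text(chunk):
        if not chunk:
            return
        text_nodes.append(chunk)
        index = len(text_nodes) - 1
        for stack in stacks.values():  # every hierarchy sees every text node
            stack[-1].children.append(index)

    pos = 0
    for match in TAG.finditer(source):
        add_text(source[pos:match.start()])
        pos = match.end()
        slash, doctype, name = match.groups()
        stack = stacks[doctype]
        if slash:
            if len(stack) < 2 or stack[-1].name != name:
                raise ValueError(f"unexpected end tag {match.group(0)}")
            stack.pop()
        else:
            element = Element(name)
            stack[-1].children.append(element)
            stack.append(element)
    add_text(source[pos:])
    return text_nodes, roots

text_nodes, roots = parse("<(A)a>x<(B)b>y</(A)a>z</(B)b>")
print(text_nodes)  # ['x', 'y', 'z']
print(roots["A"])  # a contains text nodes 0 and 1; text node 2 follows it
print(roots["B"])  # text node 0 precedes b, which contains text nodes 1 and 2
```

Storing indexes rather than copies of the text keeps a single source of truth for the character data, so an edit to one text node is visible from every hierarchy.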

While being able to join the hierarchies via text nodes is useful, a technique used for example in MuLaX could be adopted at the cost of adding new types of nodes: foreign start and end tags. These are references to elements in a different hierarchy than the current one, and they simplify queries on the model. Their presence, however, mixes the semantic and syntactic layers of the document and could lead to misinterpretations, similar to how these two pieces of XML could be misinterpreted as being different: <a/> vs. <a></a>. A parser could report this difference in order to preserve the form of the document, but it should not make any difference to its interpretation. With foreign tags, one has to remember that a tag placed right at the beginning of an element should be interpreted the same way as a tag immediately preceding the element.

As can be seen, it is not easy to produce a model that is both useful and true to the meaning of such a document. However, even the multiroot system (without foreign tags) can be used to make assumptions about the document and even to query it, for example via XPath. It is reasonable to devise XPath axes like enclosing-elements(), enclosed-elements(), intersecting-elements() etc. based on the relation between the elements' text contents.
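The relations behind such axes can be sketched as plain Python functions over ranges of text offsets. These are not real XPath axes, only an illustration of the comparisons they would rely on: the `Span` type and the function names are hypothetical, and "intersecting" is assumed here to mean a partial overlap where neither element contains the other.

```python
from typing import NamedTuple

class Span(NamedTuple):
    name: str
    start: int  # offset of the first covered text character
    end: int    # offset just past the last covered text character

def encloses(outer: Span, inner: Span) -> bool:
    return outer.start <= inner.start and inner.end <= outer.end

def enclosing_elements(target, spans):
    return [s for s in spans if s != target and encloses(s, target)]

def enclosed_elements(target, spans):
    return [s for s in spans if s != target and encloses(target, s)]

def intersecting_elements(target, spans):
    # Assumption: "intersecting" means overlapping without containment.
    return [
        s for s in spans
        if s != target
        and s.start < target.end and target.start < s.end
        and not encloses(s, target) and not encloses(target, s)
    ]

a = Span("A:a", 0, 6)
b = Span("B:b", 3, 9)
c = Span("C:c", 4, 5)
spans = [a, b, c]

print(enclosing_elements(c, spans))     # A:a and B:b both enclose C:c
print(enclosed_elements(a, spans))      # only C:c lies inside A:a
print(intersecting_elements(a, spans))  # B:b partially overlaps A:a
```

Because the comparisons use only text offsets, they work across hierarchies without any foreign tag nodes.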
