November 29, 2021

Finding the UUID for almost anything

Universally Unique Identifiers are a nifty way of obtaining identifiers for resources, objects, or concepts, without the need for a central assigning authority. Arguably the largest public use of UUIDs is in Microsoft's products, where they (known as GUIDs) identify classes or interfaces within COM. For example, 450d8fba-ad25-11d0-98a8-0800361b1103 identifies the My Documents folder, accessible via the shell.

What's less known is the fact that UUIDs have a specific structure and are not necessarily composed of random numbers. The earliest generated UUIDs used time as one guarantee of uniqueness (the UUID in the example above was created in 1997); these are version 1 UUIDs. Nowadays, you can still use them if you want to preserve the creation time in the identifier, but the most common are version 4 UUIDs that consist almost entirely of (pseudo-)random bytes.
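The creation time of a version 1 UUID can be read back out of the identifier. The following is a minimal sketch using Python's standard uuid module; the helper name v1_creation_time is made up for this illustration, and it relies on the version 1 timestamp counting 100-nanosecond intervals since 15 October 1582.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 timestamps count 100-nanosecond intervals since 1582-10-15 (UTC).
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def v1_creation_time(u: uuid.UUID) -> datetime:
    """Return the timestamp embedded in a version 1 UUID (hypothetical helper)."""
    if u.version != 1:
        raise ValueError("not a version 1 UUID")
    # UUID.time is the 60-bit timestamp; dividing by 10 converts it to microseconds.
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


my_documents = uuid.UUID("450d8fba-ad25-11d0-98a8-0800361b1103")
print(my_documents.version, v1_creation_time(my_documents))

# A freshly generated random UUID reports version 4 instead.
print(uuid.uuid4().version)
```

Applying the same helper to a version 4 UUID raises an error, since there is no timestamp to recover from random bytes.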

Usually, one associates the generation of UUIDs with some random process that produces a different identifier each time it is invoked. A lesser-known kind of UUID, however, makes it possible to produce identifiers for certain resources deterministically, that is, based solely on some input data and producing the same result each time.

This process is described in detail in RFC 4122, so there is no need to repeat that here. The input to this algorithm consists of some "name space" (itself a UUID) and a "name" within it, but this nomenclature does not restrict it to textual names, or to anything that would commonly be thought of as a namespace. Both components are stored as bytes, concatenated, and hashed (only MD5 and SHA-1 are supported, giving version 3 and version 5 UUIDs respectively), and the result is turned into a UUID itself.
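As a rough sketch of those steps, the function below does the concatenation and hashing by hand and can be compared against the standard library's uuid.uuid3 and uuid.uuid5. The name name_based_uuid is an assumption made for this example, not something defined by the RFC.

```python
import hashlib
import uuid


def name_based_uuid(namespace: uuid.UUID, name: bytes, version: int = 5) -> uuid.UUID:
    """Hash namespace bytes + name bytes and shape the digest into a UUID."""
    if version == 3:
        hasher = hashlib.md5
    elif version == 5:
        hasher = hashlib.sha1
    else:
        raise ValueError("only versions 3 (MD5) and 5 (SHA-1) are name-based")
    digest = hasher(namespace.bytes + name).digest()
    # Keep the first 16 bytes; the UUID constructor overwrites the
    # version and variant bits when a version is passed.
    return uuid.UUID(bytes=digest[:16], version=version)


manual = name_based_uuid(uuid.NAMESPACE_DNS, "example.org".encode("utf-8"))
library = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")
print(manual, manual == library)
```

Because the name is passed as bytes, nothing forces it to be text; any byte sequence that is meaningful within the chosen name space works.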

A powerful consequence of this approach is that you can repeat this process to any depth, taking the output of one step as the "name space" of the next, i.e. forming a hierarchy. By virtue of the hash function, following a different path should always (barring an astronomically unlikely collision) lead to a different identifier.
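The chaining can be sketched in a few lines. The helper uuid_for_path and the path components used here are hypothetical, chosen only to show that the order of the steps matters.

```python
import uuid


def uuid_for_path(root: uuid.UUID, *names: str) -> uuid.UUID:
    """Walk a hierarchy: each result becomes the name space of the next step."""
    current = root
    for name in names:
        current = uuid.uuid5(current, name)
    return current


# Two paths with the same components in a different order.
a_then_b = uuid_for_path(uuid.NAMESPACE_URL, "http://example.org/", "a", "b")
b_then_a = uuid_for_path(uuid.NAMESPACE_URL, "http://example.org/", "b", "a")

print(a_then_b)
print(b_then_a)
print(a_then_b != b_then_a)
```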

There are 4 "standard" name space identifiers: URL, OID, DNS, and X.500 DN. This is where the fun begins: you can take any identifier in these identifier spaces and effectively generate a UUID that is, in a sense, equivalent to the original identifier. The inclusion of URLs (or URIs) makes this particularly interesting, as it brings the opportunity to create UUIDs from websites, e-mail addresses, files, or other UUIDs (if you would want that for some reason).
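Python's uuid module happens to ship all four of these name space identifiers as constants, so a quick experiment is easy. The names below (the e-mail address, the OID, the distinguished name) are placeholders picked for illustration, and the textual forms used for the OID and the X.500 DN are an assumption about how one might choose to encode them.

```python
import uuid

examples = {
    "url": uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/"),
    "dns": uuid.uuid5(uuid.NAMESPACE_DNS, "example.org"),
    "oid": uuid.uuid5(uuid.NAMESPACE_OID, "1.3.6.1"),
    "x500": uuid.uuid5(uuid.NAMESPACE_X500, "CN=Example,O=Example Org,C=US"),
    # Anything with a URI can go through the URL name space.
    "email": uuid.uuid5(uuid.NAMESPACE_URL, "mailto:someone@example.org"),
    "uuid": uuid.uuid5(uuid.NAMESPACE_URL, "urn:uuid:" + str(uuid.NAMESPACE_DNS)),
}

for kind, value in examples.items():
    print(f"{kind:6} {value}")
```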

For linked data purposes, it is reasonable to conclude that such a UUID should, essentially, denote the same entity as the original identifier, so that something like http://example.org/ would be identical to urn:uuid:715d503b-5f95-5532-9100-00663eb45ade. There are many uses for this transformation:

  • If a URI is determined to be "secret" in a way, it is possible to produce a new, universally-recognized one that can be used to verify, but not reveal, the original information (as is the case with hashes).
  • URIs that are too long can be shortened, in a universal way. URIs that already have a fragment portion can be equipped with a new one, in a sense.
  • Websites can "protect" their URIs with a lookup table, only revealing the UUID. A way to "query" for such a URI is also kind of standardized, like http://example.org/.well-known/genid/715d503b5f955532910000663eb45ade.
  • Systems that are designed to work with UUIDs can be used with other forms of identifiers (albeit losing the original information). The resulting UUID can in turn be expressed as an OID, for example.
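A few of the forms mentioned in the list can be derived mechanically from the UUID. This is only a sketch: the function names are invented for this example, the genid path follows the pattern shown above, and the OID form assumes the 2.25 arc, under which a UUID is written as a single decimal integer.

```python
import uuid


def uri_to_uuid(uri: str) -> uuid.UUID:
    """Version 5 UUID of a URI in the standard URL name space."""
    return uuid.uuid5(uuid.NAMESPACE_URL, uri)


def as_genid(origin: str, u: uuid.UUID) -> str:
    """Lookup URL on the given origin, using the UUID's 32 hex digits."""
    return f"{origin.rstrip('/')}/.well-known/genid/{u.hex}"


def as_oid(u: uuid.UUID) -> str:
    """OID form of a UUID: its 128-bit value in decimal under the 2.25 arc."""
    return f"2.25.{u.int}"


u = uri_to_uuid("http://example.org/")
print(u.urn)
print(as_genid("http://example.org/", u))
print(as_oid(u))
```

Note that none of these forms reveals the original URI; recovering it requires a lookup table on the side of whoever generated the identifier.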

Of course it is possible to devise other schemes or methods that facilitate these points, but having a standardized one readily available is always nice. The only limitation, at the moment, is the restriction on hash functions, as SHA-1 is likely to be replaced at some point in the future, but hopefully a new approach (e.g. multihash) will become available by then.

What about other name spaces? I am not aware of many other UUIDs that have been standardized this way, but the existing 4 are enough to reach anything that, on its own, makes sense to treat as a name space.

One such example could be XML namespaces: such a namespace logically consists of element names, and each element also has its allowed attributes, forming another namespace of sorts. As an example, one may turn http://www.w3.org/2000/svg into its UUID ae9da800-6281-59d5-aedf-2e0dae53d41c, then treat that as a name space and find the UUID for the svg element (717b271d-9ee9-55a6-af20-21edc5f28e40), and then its viewBox attribute (856fd2ed-0560-5982-a43e-1003e37906f1).

XML namespaces are hardly enough to constitute "anything", but XML datatypes are a better candidate. Using the lexical space of the xsd:integer type as a name space (94277b3a-2e5b-5b33-942f-6b57e6bd7ea7), with the lexical form encoded in UTF-8 as the name, makes a5a934e0-e338-59fe-9446-a11b1c29cbdb the "official" UUID of the integer 1 (but not the only one of course, as xsd:decimal can also denote the same value).
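A small sketch of the datatype idea follows. It assumes the datatype is named by its usual URI (http://www.w3.org/2001/XMLSchema#integer for xsd:integer) and hashed in the URL name space first; that choice of URI, and the helper literal_uuid, are assumptions of this example. Python's uuid5 encodes string names as UTF-8, which matches the encoding described above.

```python
import uuid

# Assumed base URI for the XML Schema datatypes.
XSD = "http://www.w3.org/2001/XMLSchema#"


def literal_uuid(datatype_uri: str, lexical_form: str) -> uuid.UUID:
    """UUID of a typed literal: the datatype's UUID serves as the name space."""
    datatype_ns = uuid.uuid5(uuid.NAMESPACE_URL, datatype_uri)
    return uuid.uuid5(datatype_ns, lexical_form)


one_as_integer = literal_uuid(XSD + "integer", "1")
one_as_decimal = literal_uuid(XSD + "decimal", "1")

# The same value reached through two datatypes gets two identifiers.
print(one_as_integer)
print(one_as_decimal)
```

The same goes for different lexical forms of one value within a single datatype, such as "1" and "01": the hash sees only the bytes, so normalizing to a canonical form first is up to the caller.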

Anything that is representable as an XML datatype has an identifier generated this way, including normal strings, numbers, dates, durations, and so on. The use of rdf:XMLLiteral, rdf:JSON or rdf:HTML is also possible to identify individual pieces of XML, JSON or HTML. Language-tagged strings in RDF can be identified in theory, if we treat the language tag as a namespace of its own, but using rdf:PlainLiteral (or other, more recent, additions) is probably better.

But why would anyone want to generate identifiers for these things in the first place? Because you can! UUIDs are by definition meant to be universal, and of course there are lots of temporary objects where finding a permanent identifier makes no sense, but there are also some that have been used for decades. Hashes are already used in large, distributed databases as keys, but when you want to have a database that can truly describe anything, this is the thing to choose.

Side note: The only occurrence of these UUIDs "in the wild" (known to me) is Microsoft's algorithm to generate a GUID for a .NET type (although those are version 3, and thus MD5-based, UUIDs). These UUIDs are in the name space 69f9cbc9-da05-11d1-9408-0000f8083460 (the COM+ runtime uniqualifier name space, created in 1998) and are formed from the fully qualified name of the class (plus the signatures of its members in the case of an interface).
