IS4 Code Blog: Having no standard is better than having some standard

"You can't mean that!"

I don't.

But like with other overemphasized statements, it is possible to find some truth in it. Recently, I have been thinking about various standards of the Internet, and especially their adherence to registries. IANA is a prime example, hosting hundreds of tables for all sorts of Internet standards that absolutely definitely cover every possible situation and case... except when they don't.

The prime example is the Named Information Hash Algorithm Registry, an amazingly-named registry of hash algorithms for a quite nice URI scheme, ni:, for identifying pieces of content by their hash. Since there are so many diverse hash algorithms, it stands to reason to have a grand unified table that covers the majority of them, so that people can browse it and decide which one to choose. Yet (as of the time of writing) there are only 17 entries in there, the majority of which are SHA-256, SHA-3, and their truncated versions. That is not many for something supposedly usable in all of the ever-modern content addressing systems.

Why aren't there more hash algorithms? Surely there were many more existing algorithms in 2012 when the corresponding RFC was published... Well, the RFC explicitly advises the Designated Expert not to accept additions that are cryptographically weak or otherwise obsolete. While I can see the good intentions behind this rule, I strongly disagree with the notion that making something impossible to express in a certain system makes such a system secure, for the following reasons:

You cannot be sure that hash algorithms not present in the table are insecure ‒ there are many more secure hash algorithms in existence that nobody thought of including in the table, so it is incomplete.
Conversely, you cannot be sure that hash algorithms present in the table are secure for your particular implementation ‒ new attacks can be found and individual entries can be marked as "deprecated", but the table does not distinguish between situations where an algorithm is still fine and those where it is not, and it may take a while before new information is properly reflected in the table, which undermines confidence in it.
You can bypass the restrictions by using the proposed mh hash name, pointing instead into the multiformats registry. Congratulations ‒ now you can form ni: URIs from CRC32 checksums and nobody can tell unless they analyze the individual bytes. Like with any system, when obstacles to prevent simple forms of abuse are added, the abuse simply becomes more complex and harder to discern, to the detriment of common users who simply become less aware of any abuse due to it.
You can ignore the table entirely. Of course you shouldn't, but in essence nothing stops you from producing ni: URIs with any arbitrary hash algorithm names, and there are far too few systems actually using it that care.
You can disregard ni: completely and use another scheme, such as magnet:, or worse.
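
To make the point about ignoring the table concrete, here is a minimal Python sketch of how a ni: URI is put together: the algorithm name, a semicolon, and the digest in unpadded base64url. The function name ni_uri and the "crc32" label are illustrative assumptions; the label is deliberately not taken from the registry, and nothing in the syntax objects to it.

```python
import base64
import hashlib
import zlib


def ni_uri(name: str, digest: bytes) -> str:
    """Format a digest as a ni: URI with an empty authority (hypothetical helper)."""
    # base64url encoding with the trailing "=" padding removed
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"ni:///{name};{value}"


data = b"Hello World!"

# An algorithm name that is present in the registry.
registered = ni_uri("sha-256", hashlib.sha256(data).digest())

# A made-up name for a 4-byte checksum: syntactically just as acceptable.
unregistered = ni_uri("crc32", zlib.crc32(data).to_bytes(4, "big"))

print(registered)
print(unregistered)
```

Both strings look equally legitimate to anything that only parses the URI; telling them apart requires consulting the registry, which is exactly the step most software skips.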

The key point to take from this is that a table is not a security measure, and since ni: is not a very widespread scheme anyway, people will simply go elsewhere. The designers could have actually offered improved security by modifying the ni: scheme itself, for example by requiring the URI to start with a security score of the hash algorithm used (with MD5 at 0 and SHA-256 at 9, for example), which would have to match the table (not be greater) in order to be accepted by conforming software. Then people could verify the security from the URI itself without having to remember which algorithm is secure, and any attempts at falsification would be blocked.
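
A rough sketch of how conforming software might check such a score, in Python. Everything about the syntax here is an assumption made for illustration: the score is placed as a single digit before the algorithm name, and the table holds only the two values mentioned above.

```python
import re

# Hypothetical score table; only the MD5 and SHA-256 values come from the text.
SCORES = {"md5": 0, "sha-256": 9}

# Assumed syntax: ni:///<score>;<algorithm>;<value>
_PATTERN = re.compile(r"ni:///([0-9]);([^;]+);(.+)")


def accept(uri: str) -> bool:
    """Accept the URI only if its claimed score does not exceed the table's."""
    match = _PATTERN.fullmatch(uri)
    if match is None:
        return False
    claimed = int(match.group(1))
    known = SCORES.get(match.group(2))
    # Unknown algorithms are rejected; known ones may under-claim but not over-claim.
    return known is not None and claimed <= known


print(accept("ni:///9;sha-256;abc"))  # honest claim
print(accept("ni:///9;md5;abc"))      # inflated claim
```

The design choice is that the reader sees the score without knowing the algorithm, while the software refuses any URI that claims more than the table grants.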

While this issue was minor and specific to one registry, I believe there are far more widespread problems affecting many more standards, caused by the conflict between the mindset of those who design them and those who use them. Let's illustrate them, again on the same registry:

There is no reason why a hash algorithm registry has to be kept for the purpose of a single URI scheme. There are at least two other (or replaced) registries, again for use in other concrete areas (HTTP or certificates), and with many more hash algorithms. The only additional information is the Suite ID and truncation length, which could very easily simply reference the common hash algorithm table and add more details to it. In addition, the ID is optional and restricted to 64 values anyway (and the binary format was outcompeted by multihash). Standards should not mandate a particular registry when another one does not have any downsides for the common cases. Nothing stops you in theory from ignoring the NI registry and picking another one, but then there are disagreements, for example on the name of SHA-1 ("sha" or "sha-1"), which is where a standard should have helped but failed to do so.
The table started mostly empty (only with SHA-256 variants). While it may have served the creators of the proposal, an Internet standard like this should start with a far wider range of options, and the maintainers themselves should seek to update them over time. Sure, you can always request more to be added, but that needs some effort and waiting, and not everyone is aware of this option (and the emptiness of the table might convince people of the opposite). Standards exist to remove ambiguity in already existing systems, to unify and guide individuals to make their products compatible, yet this one fails when someone wants to refer to something as (historically) common as SHA-1.
It is not obvious the hash algorithm name has to come from the table at all. To be clear, the specification does indicate it should (but does not explicitly prohibit other names or make them invalid), but the usage does not ‒ if someone unfamiliar sees a ni: URI, there is no way for them to arrive at the registry just from the URI. Additionally, someone could simply start with examples of such URIs and work their way towards the format without reading the whole RFC, and could then produce something like "ni:///sha-1;" because it was "obvious". There is not a perfect solution to this problem ‒ you could require the ID instead of hash name (which is what multihash does) which makes it impossible for people to come up with their own names, but that also conflicts with the other requirements.
There is no clear space for extensions or experimentation. This touches the previous issue, in that when a particular source of hash algorithms is not stated, it also cannot be changed or added. There are not that many serious algorithms, but mind you ‒ you do not always make the choice of hash algorithm and ni: URIs at the same time ‒ you may need to describe an existing system without having the opportunity to re-hash everything.
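
The naming disagreement mentioned above tends to end up in code like the following Python sketch, where each consumer maintains its own alias table. The table and function names are hypothetical, and the "sha1" spelling is an extra assumption added for illustration.

```python
# Hypothetical alias table mapping competing spellings to one canonical name.
ALIASES = {
    "sha": "sha-1",
    "sha1": "sha-1",   # assumed additional spelling
    "sha-1": "sha-1",
}


def canonical_name(name: str) -> str:
    """Normalize a hash algorithm name; unknown names pass through lower-cased."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def same_algorithm(a: str, b: str) -> bool:
    """Compare two algorithm names after normalization."""
    return canonical_name(a) == canonical_name(b)


print(same_algorithm("SHA", "sha-1"))
print(same_algorithm("sha-1", "sha-256"))
```

Every such table is a small private registry of its own, which is the kind of duplication a shared standard was supposed to make unnecessary.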

While these issues may seem superficial considering the tiny usage of the ni: scheme, I can give examples of other, far wider problems affecting things that are commonly in use every day, stemming from the misuse of registries:

Moving from IANA to ICANN, there was an issue with one of the newly registered TLDs for Google, .dev. This domain had long been in private use before 2015, when Google acquired it, by various developers as a testing environment through local DNS records. Yet when Google finally started allowing registrations under this TLD in 2019, they also included it in their big HSTS preload list, which made a lot of developers suddenly use HTTPS by default for their HTTP-only .dev websites when using a browser, without easy means to switch back. Now you could easily blame all these developers for using something they had no rights for (especially considering the .test TLD is the one supposed to be used for this purpose and was protected from being owned by any particular entity), but ICANN could also have democratically recognized the informal usage of the domain and protected it, similarly to .onion. Of course, the actual systemic cause of this issue was that the standard permitted the DNS mechanism to cover both public and private domains, even temporally overlapping, leading to a situation where a private-use domain could be replaced by a public (purchased) one.
Media types (colloquially known as MIME types) are an even bigger mess. Something as rapidly evolving as file formats is difficult to synchronize with a centrally maintained table, and so there are many formats in use today with an unregistered media type or multiple competing types. The x- prefix was also historically used for private-use types, so obviously people started using it for anything they did not or could not register themselves, to the point of many common media types beginning with x-. This was somewhat remedied by loosening the rules for new media type registration, and changing the private use prefix to x., but registrations could still take months before a new type is added, and there are still hundreds of unregistered media types shaped by the Internet's consensus. In this situation, I quite like Apple's UTI system which adopts the reverse domain name notation for structure and maintaining ownership. It also incorporates narrower/wider relations between types, which is very nice for taxonomy nerds.
The exact same situation has happened for URNs as well, although not in such big numbers. Thanks to their brevity and widespread use in some areas, URNs were regarded as a quick way to get your own URI without having to host anything anywhere. The issue is that URN namespaces are once again managed in a registry, so even without any server, you still have to register your namespace, prove that it is for a good purpose, and then start producing names in it. Of course you may use urn:uuid: or urn:oid: for pre-existing identifiers that are easier to obtain, but that did not prevent people from using things like urn:md5:, urn:sha1:, urn:btih:, urn:tree:tiger:, urn:bitprint:, urn:ed2k:, urn:aich:, and urn:kzhash: for hashes, or even better, urn:uvci: for vaccination certificates, all unregistered (there was a proposal for hashes though; expired for some reason). In some way, URNs are the worst of URIs, since when combined with the general negligence to register identifiers, they become just disembodied tokens, shouts of people who at one point in the past needed a URI but shunned any responsibility of actually maintaining one.
Lastly, the aforementioned issues also affect URIs themselves as a whole, specifically URI schemes. While some developers diligently register all their proprietary URI schemes, there is still a bunch of applications where the developers considered the URI scheme space free real estate and just picked something they liked. Just take a stroll through HKCR and look for "URL Protocol" attributes. 😉
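
As an illustration of the URN case, the following Python sketch extracts the namespace identifier from a URN and compares it against a registry. The REGISTERED set is a deliberately tiny, hand-picked subset used as an assumption here (the real list is maintained by IANA), and the pattern is a simplified version of the namespace identifier syntax.

```python
import re
from typing import Optional

# Illustrative subset only; not a copy of the actual IANA registry.
REGISTERED = {"uuid", "oid", "publicid"}

# Simplified: "urn:", then a namespace identifier of letters, digits and hyphens.
_URN = re.compile(r"urn:([A-Za-z0-9][A-Za-z0-9-]{0,31}):", re.IGNORECASE)


def urn_namespace(urn: str) -> Optional[str]:
    """Return the lower-cased namespace identifier, or None if not a URN."""
    match = _URN.match(urn)
    return match.group(1).lower() if match else None


def is_registered(urn: str) -> bool:
    """Check the namespace against the (illustrative) registry subset."""
    return urn_namespace(urn) in REGISTERED


for example in ("urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66",
                "urn:sha1:EXAMPLE",
                "urn:tree:tiger:EXAMPLE"):
    print(example, urn_namespace(example), is_registered(example))
```

Note that the check needs the registry as external input; the URN itself carries no hint of whether its namespace was ever registered.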

It is my opinion that standards are perfect for people to follow when a thing being standardized is already complete (or mostly complete), but when they define something that needs to grow or expand naturally, they inhibit the growth and cause conflict, like in the cases above, where the intended and standardized usage of various registries often did not match the public usage by developers worldwide.

Let's take a look at systems that, in my opinion, handle these issues better, or are noteworthy in that regard:

The urn:publicid: namespace. Where other standards battle misuse by imposing strict conformity to registries, this one embraces it. This namespace was created to import PUBLIC identifiers from SGML and XML into the URI ecosystem, but as a byproduct it also created a completely free space of possibilities, since PUBLIC identifiers are themselves almost unrestricted ‒ they might conform to a Formal Public Identifier (FPI) syntax, in which case they should not start with a registered prefix if not assigned by the authorized entity, but otherwise pretty much anything goes (within the set of allowed characters). Since educated users understand there is no uniqueness guarantee, they have to try their best to devise identifiers that are self-descriptive enough not to be potentially confused with others, and uneducated users simply see a long identifier and try to mimic it, to a comparable success. This is what URNs have been misused for, with the crucial difference that the notion that anything goes is allowed here. Use it whenever applicable instead of unregistered plain urn:.
The tag: URI scheme is somewhat confusingly named, but it is essentially the same thing, only with ownership information, provided as a date and usually a domain. It is sad that this URI scheme is not used more, because it checks all the boxes: you can't misuse it easily because the information that needs to be valid is clearly given, all you need is a domain (or even an e-mail address) to be entitled to use it, you are given an absolutely free space within what you own, without any resolution responsibility, and the date component ensures that there is never any temporal overlap of the same identifiers for different resources. Use this instead of urn:publicid: whenever you can, if there are no better options.
If there is no readability requirement, there is also the OID, essentially a sequence of unbounded natural numbers navigating a potentially infinite tree of concepts. Within this tree, every branch has an owner who may delegate its sub-branches to other entities, who then control them. Thanks to this, it is fairly straightforward (and free) to get an OID prefix yourself, and then you are entitled to assign meaning to everything under it. This is a perfect example of a system that cannot be misused, because the numbers themselves are essentially meaningless and you need an external mechanism to find out the meaning of the whole identifier.
Finally, the best possible set of identifiers for any situation is the set of URIs themselves, for example as used by Linked Data and RDF or other web standards. This gives you the most flexibility in choosing whichever scheme is applicable to your situation, and the need to provide a description of the identified resource is present only for http(s): URIs. Using a URI as an identifier provides not only the possibility of using standardized "vocabulary" for the common cases (with URNs or RDF vocabularies), but also limitless extensibility through the other URI schemes. The only issue is that it is not that nice to wrap URIs in other URIs, but it is nonetheless possible ‒ schemes such as magnet:, jar:, or lid: do that regularly (and through varying mechanisms ‒ magnet: wraps it fully in a query parameter, lid: also stores it properly encoded, in both the query and the path, while jar: does not require any encoding, using a sentinel character instead, which causes issues when such URIs are nested). It is also possible to refer to common vocabularies through a prefix taken from registries like the RDFa Core Initial Context (which is what lid: does) to prevent the final URI from growing too much while still being able to expand to its full meaning. While identifiers produced in this system are usually long (they are URIs of course), they are usually not general-sounding enough to confuse people about their meaning, ownership and allowed usage.
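
To show how little is needed to mint an identifier in one of these systems, here is a minimal Python sketch that builds a tag: URI from a domain (or e-mail address), a date, and a free-form specific part. The function name, the simplified authority check, and the example.com values are assumptions for illustration, not a full implementation of the tag: syntax.

```python
import datetime
import re
from urllib.parse import quote

# Simplified check: an optional "user@" part followed by a dotted DNS name.
_AUTHORITY = re.compile(r"(?:[A-Za-z0-9._-]+@)?[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*")


def tag_uri(authority: str, date: datetime.date, specific: str) -> str:
    """Build 'tag:<authority>,<date>:<specific>' (hypothetical helper)."""
    if not _AUTHORITY.fullmatch(authority):
        raise ValueError(f"not a domain or e-mail address: {authority!r}")
    # Percent-encode anything outside a conservative safe set, keeping "/".
    return f"tag:{authority},{date.isoformat()}:{quote(specific, safe='/')}"


# The date states when the authority was held by whoever mints the tag.
print(tag_uri("example.com", datetime.date(2023, 12, 22), "blog/no standard"))
```

No registry is consulted and nothing has to be hosted; the only things that need to be true are the ownership of the domain and the date.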

An honorable mention also goes to the reverse domain name notation, which can be viewed as a particularly restricted case of URIs, using the DNS hierarchy to form identifiers of individual resources. While these domain names are usually not resolvable through DNS, they can be, and it is possible to devise mechanisms that allow them to point to usable resources.

In the end, the best advice is to be mindful of the technologies that are available and to use them accordingly, since if you don't, it will always lead to issues one day.

December 22, 2023
