December 31, 2023

RDF and HTTPS

HTTPS is becoming the new norm on the Internet, but in many places HTTP is still the "default". One of them is RDF ‒ vocabulary terms are identified by fixed URIs, so even a minor change of one "s" in the URI scheme produces a URI that identifies a completely distinct resource. For this reason, existing RDF URIs (and similar things like XML namespaces) will never change to https. However, this is not such a big deal, for at least two reasons:

  • URIs in RDF are not "meant" to be loaded as a data source. They should be dereferenceable, so that more information about the resource can be found from the URI itself, but there is no obligation. Especially when talking about vocabulary terms, there are not that many automatic, security-critical processes that would request descriptions of every encountered vocabulary term, and arguably there shouldn't be ‒ if it is so important to load the vocabulary from a trustworthy source, you should not download it from the URI in the first place, since it should already be on your disk and verified manually. HTTPS is not going to save you from network outages, random errors, or hacked web servers, and the worst thing a potential attacker could do through HTTP anyway is break your inference.
  • Browsers have already started treating http in URIs as less binding: they may switch to HTTPS under various conditions, and they send the Upgrade-Insecure-Requests header so that the server can direct you to HTTPS as quickly as possible. With HSTS (HTTP Strict Transport Security), a server asks the browser to use only HTTPS for the given domain; even if an http URL is entered manually, the browser will switch to HTTPS (with no bypassing). It is also possible to preload a domain, that is, to have it stored in a list used by all modern browsers, so that the first HTTP-only request never has to take place.
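
To make the HSTS part more concrete: the policy arrives in the Strict-Transport-Security response header as a semicolon-separated list of directives, of which max-age is required. The following is only a minimal sketch of reading such a header value in Python; the names HstsPolicy and parse_hsts are made up for illustration, and the parser is deliberately more lenient than a strict implementation would be.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HstsPolicy:
    """Hypothetical container for the directives of one HSTS header."""
    max_age: int               # seconds the client should remember the policy
    include_subdomains: bool   # policy also applies to subdomains
    preload: bool              # site asks to be put on the preload list


def parse_hsts(header_value: str) -> Optional[HstsPolicy]:
    """Parse a Strict-Transport-Security value; None if max-age is missing or invalid."""
    max_age = None
    include_subdomains = False
    preload = False
    for part in header_value.split(";"):
        name, _, value = part.strip().partition("=")
        name = name.strip().lower()          # directive names are case-insensitive
        value = value.strip().strip('"')     # the value may be a quoted string
        if name == "max-age":
            if not (value.isascii() and value.isdigit()):
                return None
            max_age = int(value)
        elif name == "includesubdomains":
            include_subdomains = True
        elif name == "preload":
            preload = True
        # Unknown directives are ignored in this sketch.
    if max_age is None:
        return None
    return HstsPolicy(max_age, include_subdomains, preload)
```

A client that remembers the returned max_age for a host can use HTTPS for that host until the policy expires, which is what browsers do.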

In case you actually are making an automated RDF vocabulary loader, how to interpret http in URIs becomes a somewhat important question. There are a few possibilities:

  • Do nothing special. If there are no particular requirements, I don't think there is anything wrong with upholding the protocol indicated by the URI.
  • Many web servers always auto-redirect to HTTPS, disregarding Upgrade-Insecure-Requests. Arguably, if security is so important for these servers that they ignore the user's preference, they should also send Strict-Transport-Security to make this redirect automatic. If you cache this information, you can safely treat http as https the next time you make a request (but only for the connection, don't rewrite it in RDF!).
  • Because most websites force the redirect, it may be a worthwhile optimization to try HTTPS first and fall back to HTTP only in case of issues (cannot connect, or no actual RDF description found).
  • The HSTS preload list is available to download, and you can use it to skip the first round of HTTPS redirecting altogether. There are a few nuisances though ‒ the list is huge yet stored as JSON (and the download is base64-encoded), which is not such a great streaming format, and it also contains comments. Its location has also been needlessly changed at least once, so better be on guard for surprises.
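
The last two options can be combined. Below is a rough Python sketch, not a definitive implementation, that decodes an already downloaded preload list and decides which URLs to try for a given term URI. It rests on several assumptions: that the decoded file is a JSON object with an "entries" array whose items have "name", "mode" and "include_subdomains" fields, that comments only appear on lines of their own starting with //, and that URIs with an explicit port are best left alone. All function names are hypothetical.

```python
import base64
import json
from urllib.parse import urlsplit, urlunsplit


def load_preload_entries(b64_payload: bytes) -> dict:
    """Decode the base64 download, drop whole-line // comments, and
    map each force-https host name to its include_subdomains flag."""
    text = base64.b64decode(b64_payload).decode("utf-8")
    # Assumption: comments occupy whole lines, so dropping them is safe.
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith("//")]
    data = json.loads("\n".join(lines))
    return {
        entry["name"].lower(): bool(entry.get("include_subdomains", False))
        for entry in data.get("entries", [])
        if entry.get("mode") == "force-https"
    }


def is_preloaded(host: str, entries: dict) -> bool:
    """True if the host itself is listed, or a parent domain is listed
    with include_subdomains set."""
    host = host.lower().rstrip(".")
    if host in entries:
        return True
    labels = host.split(".")
    return any(entries.get(".".join(labels[i:])) for i in range(1, len(labels)))


def candidate_urls(term_uri: str, entries: dict) -> list:
    """URLs to try, in order, when dereferencing term_uri.
    Only the connection URL changes; the RDF identifier stays as it is."""
    parts = urlsplit(term_uri)
    if parts.scheme != "http" or not parts.hostname or parts.port is not None:
        return [term_uri]
    upgraded = urlunsplit(parts._replace(scheme="https"))
    if is_preloaded(parts.hostname, entries):
        return [upgraded]            # preloaded: never fall back to HTTP
    return [upgraded, term_uri]      # try HTTPS first, then plain HTTP
```

A loader would walk through the returned list and stop at the first URL that yields an actual RDF description, while continuing to use the original http URI as the identifier in the data.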

I decided to explore the last point and assess how exactly this would help the vocabularies found on the web. The results were not great...