HTTPS is becoming the new norm on the Internet, but in most places, HTTP is still the "default". One of them is RDF ‒ being stuck with a particular URI to identify vocabulary terms means that even a minor change of one "s" in the URI scheme makes it identify a completely distinct resource. For this reason, RDF URIs (and similar things like XML namespaces) will never change to https. However, this is not such a big deal, for at least two reasons:
- URIs in RDF are not "meant" to be loaded from. They should be dereferenceable, so that more information about a resource can be found from its URI alone, but there is no obligation. Especially with vocabulary terms, there are not that many automatic and security-critical processes that would request a description of every encountered term, and arguably there shouldn't be ‒ if loading the vocabulary from a trustworthy source is that important, you should not be downloading it from the URI in the first place; it should already be on your disk, verified manually. HTTPS is not going to save you from network outages, random errors, or hacked webservers, and the worst thing a potential attacker could accomplish over HTTP anyway is to break your inference.
- Browsers have already started treating http in URIs as less than binding: they may switch to HTTPS under various conditions, and they send the Upgrade-Insecure-Requests header so that the server can direct them to HTTPS as quickly as possible. With HSTS (HTTP Strict Transport Security), a server instructs the browser to never use plain HTTP for the given domain, and even if an http URL is entered manually, the browser will switch to HTTPS (with no bypassing). It is also possible to have a domain preloaded into a list shipped with all modern browsers, so that even the first HTTP-only request never has to take place. A small sketch of both mechanisms follows below.
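To make this concrete, here is a minimal sketch (in Python, with the requests library) of how a client can observe both mechanisms; example.com is just a placeholder host:

```python
import requests

# Ask an HTTP site to upgrade us, the same way browsers do.
# example.com is a placeholder; substitute any HTTP-only vocabulary host.
resp = requests.get(
    "http://example.com/",
    headers={"Upgrade-Insecure-Requests": "1"},
    allow_redirects=True,
)

# If the server cooperated, the final URL after redirects is HTTPS...
print("Final URL:", resp.url)

# ...and the response may carry an HSTS policy telling us never to
# use plain HTTP for this domain again (None if the server sends none).
print("HSTS:", resp.headers.get("Strict-Transport-Security"))
```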
If you actually are building an automated RDF vocabulary loader, how to interpret http in its URIs becomes a fairly important question. There are a few possibilities:
- Do nothing special. If there are no particular requirements, I don't think there is anything wrong with upholding the protocol indicated by the URI.
- Many web servers auto-redirect to HTTPS unconditionally, disregarding Upgrade-Insecure-Requests. Arguably, if security is important enough for these servers to ignore the user's preference, they should also send Content-Security-Policy to make the upgrade automatic. If you cache this information, you can feel safe treating http as https the next time you make a request (but only for the connection ‒ don't rewrite it in the RDF!).
- Due to the forced redirect most websites perform, it may be a worthwhile optimization to use HTTPS automatically and revert to HTTP only in case of issues (cannot connect, or no actual RDF description found); see the sketch after this list.
- The HSTS preload list is available for download, and you can use it to skip the first round of HTTPS redirecting altogether. There are a few nuisances though ‒ the list is huge but stored as JSON (and the download is base64-encoded), which is not a great streaming format, and it also contains comments. Its location has also been needlessly changed at least once, so be on guard for surprises.
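The second and third points can be combined into a simple loader. Here is a minimal sketch, assuming an in-memory cache and an Accept header of my own choosing:

```python
from urllib.parse import urlparse

import requests

ACCEPT = "text/turtle, application/rdf+xml, application/ld+json"

# Hosts we have learned to contact over HTTPS directly (e.g. because the
# upgraded request succeeded last time); a real loader would persist this.
https_hosts = set()

def fetch_rdf(uri, timeout=10.0):
    """Fetch an http:// vocabulary URI, trying HTTPS first.

    Only the connection is upgraded; the URI inside the RDF data
    itself must stay untouched.
    """
    assert uri.startswith("http://")
    try:
        resp = requests.get("https://" + uri[len("http://"):],
                            headers={"Accept": ACCEPT}, timeout=timeout)
        resp.raise_for_status()
        https_hosts.add(urlparse(uri).hostname)
        return resp
    except requests.RequestException:
        # No TLS, a bad certificate, or an error status on the HTTPS
        # side - fall back to the plain HTTP the URI asked for.
        return requests.get(uri, headers={"Accept": ACCEPT}, timeout=timeout)
```

(Checking whether the response actually contains an RDF description is left out for brevity.)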
I decided to explore the last point and assess how exactly this would help the vocabularies found on the web. The results were not great...
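For reference, fetching and parsing the preload list can look roughly like this; the URL below is where the list currently lives in the Chromium tree and, as noted, may well change again:

```python
import base64
import json
import re
from urllib.request import urlopen

# Current location in the Chromium source; gitiles serves the raw file
# base64-encoded when ?format=TEXT is requested.
PRELOAD_URL = ("https://chromium.googlesource.com/chromium/src/+/main/"
               "net/http/transport_security_state_static.json?format=TEXT")

def load_hsts_preload():
    """Return a mapping of preloaded hostname -> include_subdomains flag."""
    text = base64.b64decode(urlopen(PRELOAD_URL).read()).decode("utf-8")
    # The file is not strict JSON: its full-line // comments must go first.
    text = re.sub(r"^\s*//.*$", "", text, flags=re.MULTILINE)
    data = json.loads(text)
    return {entry["name"]: entry.get("include_subdomains", False)
            for entry in data["entries"]}
```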
I used prefix.cc as a source of all vocabularies that someone found important enough to store there. After some deduplication, I was left with 2818 reasonably unique URIs, leading to 1030 unique domains (not surprising, considering many are found on w3.org, purl.org, w3id.org and so on). 895 are referenced through http.
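Checking a URI against the list needs a little care, since an entry can cover a whole domain subtree ‒ which, as we will see, is how some of the matches below qualify. A sketch, reusing the hypothetical load_hsts_preload() helper from above:

```python
from urllib.parse import urlparse

def is_preloaded(uri, preload):
    """True if the URI's host is covered by the HSTS preload list.

    `preload` maps hostnames to their include_subdomains flag, as
    built by the load_hsts_preload() sketch above.
    """
    host = urlparse(uri).hostname or ""
    labels = host.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        # An exact entry always applies; a parent entry (including a
        # whole TLD such as "dev") applies only with include_subdomains.
        if candidate in preload and (i == 0 or preload[candidate]):
            return True
    return False
```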
Of all these URIs, only 63 could make use of the HSTS preload list, and they come from just 29 unique domains. Yet when I checked them manually, a lot of the sites turned out to be dead or simply irrelevant, so in the end, all I was left with was 9 domains. Let's take a look at them!
- A few vocabularies come from Wikidata, a valuable repository of collective information on almost everything. Since the same domain is used for logging in, HSTS makes sense there even without any intent to assist RDF retrieval. A great start, though!
- There are a few SourceForge-based vocabularies: many under https://eulersharp.sourceforge.net/2006/02swap/ and https://eulersharp.sourceforge.net/2003/03swap/ (for a reasoner called EYE), http://eulergui.sourceforge.net/contacts.owl.n3, and http://mged.sourceforge.net/ontologies/MGEDOntology.owl. Again, this does not really count, since the HSTS comes courtesy of SourceForge.
- One personal ontology is also there, at http://paul.staroch.name/thesis/SmartHomeWeather.owl. I have to commend the creator for keeping his ontology available even 10 years after the thesis; that is not something one sees often.
- http://schema.googleapis.com/ is dead, but at least it redirects to schema.org (in addition to HTTPS).
- http://www.ft.com/ontology/content/ is an interesting one, since it tells you that you are in an ontology, but not what it is that you are accessing. In their own words: "The FT maintains an ontology which provides a structure for all the concepts we need to write or reason about." Doubt.
- http://linked.opendata.cz/ontology/ldvm/ does not lead to a useful description, but at least there is a Virtuoso store behind it.
- Lastly, there is http://olis.dev/, a very recent addition to the list, but its HSTS is again a byproduct, this time of being under .dev. Sadly, their ontology actually uses https in the prefix (and they use https://schema.org as well, which is not correct).
Well, while this endeavour did not show what I hoped for (instead, it showed that RDF vocabularies are in a horrible state in this regard), at least I found some nice ontologies!