IS4 Code Blog: 2021

December 25, 2021

Identifiers for some file formats

I managed to find a gold mine when it comes to URIs and identifiers: the DocBook notations module. DocBook is an older SGML/XML-based standard developed by OASIS for writing documentation, and its DTD comes with a module that “imports” several then-common notations for use with unparsed entities.

What Flavour is Your Function?

One programming article definitely worth reading is What Color is Your Function? by Bob Nystrom. It is a good old rant about the design of several programming languages, one that separates them into good languages and bad languages depending on the presence of a certain feature.

The article describes a feature called “colored functions” and then analyzes languages depending on whether they have some mechanism akin to it. A callback mechanism for asynchronous programming is basically the only application of colored functions that the article covers, but I believe pointing out other similar mechanisms is worth doing.
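To make the idea concrete, here is a minimal sketch of function “color” in Python's async/await model. The function names are hypothetical and chosen only for illustration; the point is that an async function can be awaited only from another async function, while ordinary code has to go through an event loop to get its result.

```python
import asyncio


async def fetch_value() -> int:
    """An 'async-colored' function: calling it only creates a coroutine."""
    await asyncio.sleep(0)  # stand-in for real asynchronous work
    return 42


async def async_caller() -> int:
    # Async code can await other async code directly.
    return await fetch_value() + 1


def sync_caller() -> int:
    # Ordinary code cannot use 'await'; it has to hand the coroutine
    # to an event loop, here via asyncio.run().
    return asyncio.run(fetch_value())
```

Calling fetch_value() from ordinary code without the event loop yields a coroutine object rather than 42, which is the kind of split between two groups of functions that the article is concerned with.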

XML Lite (definitely not XML 2.0)

For years after the conception of XML, it was customary for programmers to point out its perceived flaws and suggest fixes to the language. This trend has somewhat declined in recent years, as developers have either learnt to live with XML or discovered that they can use something else.

I do not share these feelings towards XML. I find it suitable for many purposes, and I think much of the criticism stems from misunderstanding. Yet even I would like to add some features to the language, based on my experience. Let's take a look at them.

Finding the UUID for almost anything

Universally Unique Identifiers are a nifty way of obtaining identifiers for resources, objects, or concepts, without the need for a central assigning authority. Arguably the largest public use of UUIDs is in Microsoft's products, where they (known as GUIDs) identify classes or interfaces within COM; for example, 450d8fba-ad25-11d0-98a8-0800361b1103 identifies the My Documents folder, accessible via the shell.

What's less known is the fact that UUIDs have a specific structure and are not necessarily composed of random numbers. The earliest generated UUIDs used time as one guarantee of uniqueness (the UUID in the example above was created in 1997); these are version 1 UUIDs. Nowadays, you can still use them if you want to preserve the creation time in the identifier, but the most common are version 4 UUIDs that consist almost entirely of (pseudo-)random bytes.
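The creation time can be read back out of a version 1 UUID. The following is a minimal sketch using Python's standard uuid module on the example identifier above; the variable names are only illustrative. A version 1 UUID stores a 60-bit count of 100-nanosecond intervals since 15 October 1582, the start of the Gregorian calendar.

```python
import uuid
from datetime import datetime, timedelta

# The My Documents GUID quoted above.
u = uuid.UUID("450d8fba-ad25-11d0-98a8-0800361b1103")

# Version 1 timestamps count 100 ns intervals since 1582-10-15.
GREGORIAN_EPOCH = datetime(1582, 10, 15)

# u.time is the 60-bit timestamp; dividing by 10 converts it to microseconds.
created = GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)

print(u.version, created.year)
```

The printed version is 1 and the decoded year is 1997, matching the claim in the text.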

Usually, one associates the generation of UUIDs with some random process that produces a different identifier each time it is invoked. A lesser-known version of UUIDs, however, makes it possible to produce identifiers for certain resources deterministically, that is, based solely on some input data, so that the same input produces the same result each time.
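The excerpt does not say which version is meant; the sketch below assumes the name-based versions (3 and 5), which hash a namespace UUID together with a name. It uses uuid.uuid5 from Python's standard library with the predefined URL namespace, and the example URLs are placeholders.

```python
import uuid

# Hypothetical resource names, used only for illustration.
first = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/resource")
again = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/resource")
other = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/another")

# The same namespace and name always give the same UUID;
# a different name gives a different one.
print(first == again, first == other, first.version)
```

Because nothing random or time-dependent goes into the result, two parties who agree on the namespace and the name arrive at the same identifier without talking to each other.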

HTTPS redirection in Apache HTTP Server, properly

HTTP is the standard text-based protocol for retrieving documents from web servers, but, like most network protocols, it has the unfortunate issue of using a transport method that is insecure. The packets that make up an HTTP message may be observed by routers before they reach their destination, and the whole message may be reassembled and inspected by malicious devices. Thus plain HTTP is commonly wrapped in TLS, resulting in HTTPS, wherein messages are encrypted and can be decrypted only by the actual endpoints.

HTTPS is not absolutely necessary, however. As a user, you don't need it when browsing static content, unless you care about hiding which browser you use and what content you view from anyone along the way, or you want to verify the identity of the server. HTTPS is good to have, as there will always be users who need it, but it shouldn't be forced upon the other users, because it does bring some inconveniences. The server has to present a valid certificate, the certificate has to be signed and must not have expired, and it is always possible that the version of TLS in use becomes obsolete. When any of these requirements fails, web clients start warning the user or stop the website from being accessed at all.

This is where Upgrade-Insecure-Requests steps in. It is a header sent by all modern browsers, informing the server that the client wishes to use HTTPS whenever possible. It can be turned off, however, and most other tools (command-line tools or libraries in programming languages) do not send it by default, so you remain free to choose the protocol yourself.
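The decision this implies for a server can be sketched in a few lines. The following is not Apache configuration but a plain Python illustration of the logic, with a hypothetical function name and signature; the 307 status and the Vary header are assumptions modelled on the example in the Upgrade Insecure Requests specification.

```python
from typing import Dict, Mapping, Optional, Tuple


def https_redirect(
    headers: Mapping[str, str], host: str, target: str
) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return (status, response headers) for a redirect, or None to serve as-is.

    Assumes the request arrived over plain HTTP.
    """
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    if normalized.get("upgrade-insecure-requests") != "1":
        return None  # the client did not ask for an upgrade; keep using HTTP
    return 307, {
        "Location": f"https://{host}{target}",
        # Caches must not reuse this response for clients without the header.
        "Vary": "Upgrade-Insecure-Requests",
    }
```

A browser that sends the header is redirected to the HTTPS URL, while a command-line tool that omits it keeps receiving the plain HTTP response.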

Why you should never unescape a URI

The street address of the Internet, the URI, is an interesting piece of syntax. Its specification is longer than those of some data languages, and together with all the existing schemes, it would take a couple of books to describe it all. The simplified form of an (absolute) URI is this:

scheme://host/path?query#fragment

Virtually all portions except the scheme can be omitted, along with their surrounding syntax, making the format very flexible and usable in a multitude of cases. Yet there are some cases where the design decisions cause trouble.
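As an illustration, Python's urllib.parse.urlsplit breaks a URI into exactly these five parts. The example URI is made up, and the percent-encoded slash in it shows one well-known way that unescaping loses information; it is not necessarily the case the article goes on to discuss.

```python
from urllib.parse import unquote, urlsplit

# Hypothetical URI whose last path segment contains an escaped slash.
parts = urlsplit("https://example.org/files/a%2Fb?q=x%26y#top")

# urlsplit leaves percent-escapes untouched in every component.
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Splitting the still-escaped path keeps "a%2Fb" as a single segment...
escaped_segments = parts.path.split("/")

# ...while unescaping first turns it into two segments, "a" and "b".
unescaped_segments = unquote(parts.path).split("/")
```

After unescaping, nothing distinguishes the slash that was data from the slashes that separate segments, so the original URI can no longer be reconstructed.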

URI scheme for identifying linked data entities by their identifier

It is good practice in RDF to identify entities with a URI scheme that may allow one to eventually arrive at a description of the identified entity. However, the individual URI patterns aren't semantic, and there isn't a uniform standard that would allow identifying an entity within a particular dataset. This is an attempt to devise one. The resulting URI should behave like a "link" that connects to the dataset and finds the actual URIs used for the entity. In theory, it could even be used directly to identify the entity itself.

Identifying resources works by traversing inverse functional properties. This kind of property (an instance of owl:InverseFunctionalProperty) behaves like a function from its range to its domain. Thus a particular value assigned to the property serves as a primary key which, when used with the property, uniquely identifies the resource.
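The lookup itself can be sketched without any RDF library. In the snippet below, triples are plain tuples, the data is invented, and the function name is hypothetical; foaf:mbox serves as the example property because FOAF declares it inverse functional.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, property, value)

FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox"

# Invented dataset for illustration only.
DATA: List[Triple] = [
    ("http://example.org/people/alice", FOAF_MBOX, "mailto:alice@example.org"),
    ("http://example.org/people/bob", FOAF_MBOX, "mailto:bob@example.org"),
    ("http://example.net/id/42", FOAF_MBOX, "mailto:alice@example.org"),
]


def identify(triples: Iterable[Triple], prop: str, value: str) -> List[str]:
    """Return every URI used for the resource that has the given key value.

    Because the property is inverse functional, all returned URIs
    denote one and the same resource.
    """
    return sorted({s for s, p, o in triples if p == prop and o == value})


print(identify(DATA, FOAF_MBOX, "mailto:alice@example.org"))
```

The property and value pair plays the role of the key, and the result lists the URIs the dataset actually uses for the entity.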

More about datatypes in RDF

Datatypes in RDF are arguably the most confusing and also underutilized tool in RDF. To understand datatypes, we first need to understand literals.

Literals

While RDF and XML are very different in many aspects, they share some core concepts. XML at its core doesn't really have standard datatypes like you'd find in many programming languages and other data languages like JSON. Instead, you have text (or character data if you will) which might get a specific meaning via other facilities, but to an external observer, everything is only text (or whitespace if you want to go into details).

RDF is very similar to XML in that a literal is simply a piece of (character) data. Unlike in XML, however, it is also possible to assign a datatype to the literal. The notion of a plain (untyped) literal was changed somewhat in RDF 1.1, which made xsd:string the implicit datatype (more on that later). Specific serialization formats may define syntax for other common typed literals, such as numbers or booleans, but all of them are still backed by text. Thus a literal is simply a piece of text, optionally with something that identifies its datatype (we will set language-tagged literals aside for now).
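This model of a literal fits in a few lines of Python. The class below is a hypothetical illustration rather than the API of any RDF library: it stores the text and the datatype IRI, defaults the datatype to xsd:string as RDF 1.1 does, and leaves out language tags.

```python
from dataclasses import dataclass

XSD = "http://www.w3.org/2001/XMLSchema#"


@dataclass(frozen=True)
class Literal:
    """A literal as a pair of text (lexical form) and datatype IRI."""

    lexical: str
    datatype: str = XSD + "string"  # implicit datatype since RDF 1.1


plain = Literal("42")
typed_string = Literal("42", XSD + "string")
integer = Literal("42", XSD + "integer")
padded_integer = Literal("042", XSD + "integer")

# A plain literal equals its explicitly typed xsd:string form,
# but not the integer literal that has the same text.
print(plain == typed_string, plain == integer)

# As terms, "42" and "042" differ, although both denote the number 42.
print(integer == padded_integer)
```

Comparing the two integer literals by value rather than by text would require interpreting the lexical form according to the datatype, which is a separate step from comparing the terms.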

Recommendations for producing linked data

While working with RDF and linked data, I have tried to come up with some tips that, in my opinion, make publishing RDF data better in a wide range of qualities. I shall call them IS4 Recommendations and list them in this article (which will be updated in case of new ones):

Dynamic Self-contained Infinite XML

Imagine a window showing some sort of a feed. There are multiple panels displaying various messages as they arrive, plus some boxes that show the statuses of other things, or the time, for example. This is a complex system usually thought of as calls to different APIs, various types of messages, a presentation layer and so on. What if I told you, however, that such a feat could be achieved with only standardized formats, no client-side scripting, and a single XML document and HTTP request?

Null in RDF

There is nothing like null in RDF, but sometimes it is necessary to express its meaning in RDF documents as well. The issue with null, however, is that its semantics can vary from use to use. One therefore has to think about the intended meaning before choosing one of the alternatives, because selecting an incorrect representation may harm the consistency of the data.

Let's look at some examples.

A completely meaningless comparison of XML and JSON

I have never had anything against XML. Despite its many quirks and burdens, I think it is even nowadays a perfectly reasonable format for providing or accepting structured data, and I think it is meaningless (albeit trendy) to compare it with JSON in order to determine which is better. It is obvious that the two languages were designed for different purposes, and the real virtue in data engineering (or software engineering in general) is to know and use the proper tools for completing a task.

In that regard, no language should come out as the winner in this comparison, but it should give you an idea when to use one or the other.

Linking classes and properties with RDF

At its core, any RDF document is just a collection of semantic triples, recognized by any basic RDF tool. However, the call for standardization and mutual intelligibility has led to several ways to describe the schema that the data should follow, with two use cases in mind: reasoning and validation.

IS4 Code Blog

December 25, 2021

Identifiers for some file formats

December 21, 2021

What Flavour is Your Function?

December 12, 2021

XML Lite (definitely not XML 2.0)

November 29, 2021

Finding the UUID for almost anything

July 10, 2021

HTTPS redirection in Apache HTTP Server, properly

April 29, 2021

Why you should never unescape a URI

March 14, 2021

URI scheme for identifying linked data entities by their identifier

March 13, 2021

More about datatypes in RDF

February 27, 2021

Recommendations for producing linked data

February 22, 2021

Dynamic Self-contained Infinite XML

February 20, 2021

Null in RDF

February 15, 2021

A completely meaningless comparison of XML and JSON

January 29, 2021

Linking classes and properties with RDF