IS4 Code Blog: The Glory of Semi-Structured Data

A not-so-recent trend in information storage has been the growing use of an interesting blend of structured and unstructured data. This style of expressing information is convenient for users and, with the advent of neural networks and prompt-based programming, also manageable for automatic processing.

What exactly does such data look like? For starters, structured data is something that adheres to a strict format, such as XML or JSON. The difficulty of producing this kind of data by hand varies with the editing software, but in every case the user has to be made aware of the limitations of the format or the editor in some way.

By unstructured data, we usually mean something that also contains the desired information, but does not use a medium specifically made for expressing that kind of information. As an example, music expressed in MIDI is structured, while an MP3 recording of a performance of the same music arguably contains as much information as the MIDI, but is unstructured (with respect to the music itself).

For text, unstructured data is natural language, in the form humans speak or write. So what does semi-structured mean? Simply put, it is something that is understood by both humans and computers: easy to write and read, yet still expressive enough to carry the information.

In some respects, it almost seems like we've come full circle when it comes to semi-structured data. Take HTML, for example. It was based on SGML, which, while infamous for its complexity, offered a remarkable degree of configurability, and in many cases an SGML document could be perfectly human-readable. Then people realized that HTML might be suitable for websites but not for forums, and created BBCode, the first markup language widely used on forums.

Now we have Markdown used in a lot of places, because it is much more similar to what humans would produce when trying to format plain text. No tags are needed to add emphasis, only punctuation, and if no Markdown processor is available, a document that uses it is still readable without issues.

For general data storage, YAML had the same goal as Markdown, but then people realized that letting the ambiguities that inevitably come with natural language into the storage of (potentially critical) data and configuration might not be the best idea, especially with parsers that have trouble understanding it all. Sometimes, having structure is still better.
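To make the ambiguity concrete, here is a toy sketch of the kind of type guessing a YAML 1.1 parser applies to unquoted values. It is not a real YAML parser: `resolve_scalar` and `YAML_1_1_BOOLEANS` are names made up for this example, and the rules are heavily simplified (real parsers match specific spellings such as `yes`, `Yes` and `YES` rather than lowercasing everything).

```python
# Toy illustration only; the names and the simplified rules are assumptions.
# YAML 1.1 treats a number of plain words as booleans.
YAML_1_1_BOOLEANS = {
    "y": True, "yes": True, "on": True, "true": True,
    "n": False, "no": False, "off": False, "false": False,
}


def resolve_scalar(raw: str):
    """Guess the type of an unquoted value, roughly like a YAML 1.1 parser."""
    text = raw.strip()
    if text.lower() in YAML_1_1_BOOLEANS:
        return YAML_1_1_BOOLEANS[text.lower()]
    # Fall through the numeric types before giving up and keeping a string.
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text


if __name__ == "__main__":
    # A list of country codes: the code for Norway silently becomes a boolean.
    print([resolve_scalar(code) for code in ["se", "no", "dk"]])
    # → ['se', False, 'dk']

    # A version number written as 1.10 loses its trailing zero.
    print(resolve_scalar("1.10"))
    # → 1.1
```

Quoting the values avoids both surprises, which is exactly the kind of structure the format was trying to spare the writer.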

Another interesting example, the one that actually made me write this article, is what YouTube has been doing with the description field. One does not usually think of the description as something that might have a structure, but it is now a place where both chapters and tags are recognized (curiously enough, since there was already a field for tags), controlling how the video is displayed.
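As a rough idea of how a processor might pull that structure out of otherwise free-form text, here is a minimal sketch. The patterns and the names `Chapter`, `CHAPTER_RE`, `HASHTAG_RE` and `parse_description` are my own assumptions for illustration; YouTube's actual recognition rules are not reproduced here and are certainly stricter.

```python
import re
from typing import NamedTuple


class Chapter(NamedTuple):
    seconds: int  # offset from the start of the video
    title: str


# Assumed chapter format: a line starting with a timestamp such as
# 0:00, 12:34 or 1:02:03, followed by whitespace and a title.
CHAPTER_RE = re.compile(r"^\s*(?:(\d+):)?(\d{1,2}):(\d{2})\s+(.+?)\s*$")
# Assumed hashtag format: '#' followed by word characters,
# not directly preceded by another word character.
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")


def parse_description(text: str) -> tuple[list[Chapter], list[str]]:
    """Extract chapters and hashtags; everything else stays plain prose."""
    chapters = []
    for line in text.splitlines():
        match = CHAPTER_RE.match(line)
        if match:
            hours, minutes, secs, title = match.groups()
            total = int(hours or 0) * 3600 + int(minutes) * 60 + int(secs)
            chapters.append(Chapter(total, title))
    return chapters, HASHTAG_RE.findall(text)


if __name__ == "__main__":
    description = (
        "My trip to the mountains.\n"
        "\n"
        "0:00 Intro\n"
        "1:30 The climb\n"
        "1:02:03 Summit\n"
        "\n"
        "#hiking #travel"
    )
    chapters, hashtags = parse_description(description)
    print(chapters[1])  # → Chapter(seconds=90, title='The climb')
    print(hashtags)     # → ['hiking', 'travel']
```

The first line of the description matches neither pattern, so it is simply left alone, which is the whole appeal: the writer only has to follow the conventions where they want the machine to notice.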

Sometimes it is also not clear where the evolution actually happens. The hashtags in YouTube's description field were originally popularized largely by Twitter, from where they entered the "natural" language of web users, to the point where they are used even when no text processor would understand them. Emoticons in some applications share a similar fate, making people write *smile* (or :smile:) in places where it doesn't really make sense. And ironically, the "punctuation" </sarcasm> was never actually used in any real markup. Sometimes, being natural is better as well.

April 1, 2022
