IS4 Code Blog: Why you should never unescape a URI

The street address of the Internet, the URI, is an interesting piece of syntax. Its specification is longer than those of some data languages, and together with all the existing schemes, it would take a couple of books to describe it all. The simplified form of an (absolute) URI is this:

scheme://host/path?query#fragment

Virtually all portions except the scheme can be omitted, along with their surrounding syntax, making the format very flexible and usable for a multitude of cases. Yet in some cases its design decisions cause trouble.

The components

The path portion of a URI is further subdivided into components, separated by slashes. The intention is clear – to standardize any hierarchical structure, so that when a new scheme is created, it gets the hierarchy for free and thus works well with relative references (looking at you, URN). However, some URIs have no use for any of this: for example, data:,6/2=3? and data:,6%2F2=3%3F are equivalent, yet a generic algorithm might find two path components in the former URI, 6 and 2=3, and an empty query.
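
To see what a scheme-agnostic parser makes of the two data: URIs above, here is a small sketch using Python's urllib.parse.urlsplit. The helper name generic_components is made up for illustration, and note that this particular parser keeps the leading comma as part of the first component.

```python
from urllib.parse import urlsplit

def generic_components(uri):
    """Split a URI the way a generic parser would (hypothetical helper)."""
    parts = urlsplit(uri)
    # Path components are whatever sits between slashes, still percent-encoded.
    return parts.path.split("/"), parts.query

# The unescaped form is cut into two path components and an empty query...
print(generic_components("data:,6/2=3?"))      # ([',6', '2=3'], '')
# ...while the escaped form survives as one opaque component.
print(generic_components("data:,6%2F2=3%3F"))  # ([',6%2F2=3%3F'], '')
```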

The characters

Common programming languages are composed of punctuation (special characters) and keywords that produce the syntax of the program, and identifiers and literals (usually in quotes) that make up the semantics. In contrast, URI syntax is more similar to XML in that it does not enclose actual text in any delimiters. It syntactically combines identifiers and literals (in the query portion for example), and relies on a set of special characters to make up the actual structure. When a special character is to be taken literally, it must be escaped with percent-encoding.

URI is in essence a binary format, not a textual format. All the characters are bytes (octets) and their textual representation is derived via US-ASCII. Percent-encoding is therefore simple – the actual byte value is encoded in 2 hexadecimal digits, and prepended with %. While it may seem simple, the semantics of such a thing are quite complicated.
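
The mechanical part really is that simple. A minimal sketch, with percent_encode as a made-up name that escapes every byte regardless of whether it needs escaping:

```python
from urllib.parse import unquote_to_bytes

def percent_encode(data: bytes) -> str:
    """Write every byte as % followed by two hexadecimal digits."""
    return "".join("%{:02X}".format(byte) for byte in data)

print(percent_encode(b"/?"))                 # %2F%3F
print(percent_encode("é".encode("utf-8")))   # %C3%A9

# The standard library can reverse it, returning bytes rather than text.
print(unquote_to_bytes("%2F%3F"))            # b'/?'
```

Decoding back to bytes, not to a string, matches the view of a URI as a binary format: what the bytes mean is a separate question.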

The standard defines a set of unreserved characters. These are letters, digits and the characters -._~. These characters should not have any special meaning in a URI, thus in theory it should be safe to encode or decode any of them without changing the meaning of the URI. In practice, some services are not prepared to handle these characters in encoded form (if they treat them as keywords), thus some browsers sometimes decode such characters transparently, which is arguably the safest option.
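
The transparent decoding of unreserved characters could look roughly like this. It is only a sketch: decode_unreserved is a hypothetical name, and every other escape sequence is deliberately left untouched.

```python
import re

# Letters, digits and -._~ as listed above.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def decode_unreserved(uri: str) -> str:
    """Decode %XX only when it stands for an unreserved character."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, uri)

print(decode_unreserved("/%7Euser/a%2Fb%41"))  # /~user/a%2FbA
print(decode_unreserved("%2541"))              # %2541 (no double decoding)
```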

Then there are the characters which are plainly disallowed. This set contains the characters <, >, " and the space. Angle brackets, quotes and spaces are disallowed to simplify embedding literal URIs in other languages or formats. The characters #% are disallowed in a specific sense: # can be used at most once, to separate the fragment from the rest of the URI, and % must always be followed by two hexadecimal digits. A URI can thus contain these two characters, but only in a very restricted way (so a URI validator must check their correct usage if nothing else). Other disallowed characters are {}|\^[]`. These were referred to as "unwise" in the past, meaning some ancient services treated them in a special way. Like # and %, [] can be present in a URI, but only as part of the IPv6 address syntax in the host portion. Control characters (%00 through %1F and %7F) are not allowed at all.
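
As an illustration of the bare minimum a validator has to check, here is a rough sketch. The function name and the messages are invented, and square brackets are not checked at all, since that would require locating the host first.

```python
import re
from typing import List

# Disallowed outright, plus the formerly "unwise" characters; [] are
# left out because they are legal inside an IPv6 host.
DISALLOWED = set('<>" {}|\\^`')

def basic_uri_checks(uri: str) -> List[str]:
    """Return a list of problems found (hypothetical helper, not a full validator)."""
    problems = []
    if any(ord(c) < 0x20 or ord(c) == 0x7F for c in uri):
        problems.append("control character")
    if any(c in DISALLOWED for c in uri):
        problems.append("disallowed character")
    # % must always be followed by two hexadecimal digits.
    if re.search(r"%(?![0-9A-Fa-f]{2})", uri):
        problems.append("stray percent sign")
    # # may appear at most once, in front of the fragment.
    if uri.count("#") > 1:
        problems.append("more than one #")
    return problems

print(basic_uri_checks("http://a/b%2Fc#f"))     # []
print(basic_uri_checks("http://a/b c%zz#x#y"))
# ['disallowed character', 'stray percent sign', 'more than one #']
```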

What follows now is a sort of a gray zone. There are the general delimiters :/?@ that separate different parts of the URI, but their special usage is limited to different portions of a URI. Then there are other common delimiters !$&'()*+,;= which are not important to the general URI syntax, but may be important to individual schemes or content types.

As you can see, this already complicates potential encoding or decoding algorithms, resulting in a large variety of configurations. Let's see some of the options, from the most benevolent:

Percent sanitizer: This algorithm doesn't care about the URI syntax, but ensures that percent-decoding does not fail – it only escapes % if not part of a valid escape sequence.
URI sanitizer: This assumes that the input is intended to be a URI, and tries its best to turn it into a valid one. The disallowed delimiters and control characters are always escaped, with some exceptions: the first # is left unchanged but subsequent occurrences are encoded, and [] are escaped only if not in the host. This is what browsers should do with the address bar in case of manual input, preferably also decoding unreserved characters.
Portion-specific sanitizer: This is a family of different configurations intended for individual portions of a URI, ensuring that when a full URI is built from its components, it can also be parsed back to the same form. It escapes # in all portions. It escapes @ in the user info in the authority portion. It escapes ? if used in the path, and also ensures that the path doesn't start with // if there is no authority. If the scheme is not present (thus the URI is relative), it also ensures that the URI does not begin with a path component that contains :.
Application-specific sanitizer: This is a family of sanitizers that know application-specific information beyond what the general URI syntax specifies. For example it is common for the query portion of a URI to contain key=value pairs, separated by &, or there may be some specific additional structure to the path based on the scheme, or the fragment could be in a specific format based on the type of the actual resource.
Data encoder: In any case, when the input is meant to be taken literally, escaping all reserved and invalid characters should be fine. This is what most supplied percent-encoding functions do.
Keyword encoder: This extends the previous encoder in a way that could circumvent any application that looks for specific strings in an unescaped URI. All characters are encoded, at the cost of making the URI confusing and perhaps incompatible.
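
Three of these configurations are short enough to sketch in a few lines. The function names are mine, and the data encoder simply leans on urllib.parse.quote with an empty safe set, which is one possible reading of "all reserved and invalid characters".

```python
import re
from urllib.parse import quote

def percent_sanitize(text: str) -> str:
    """Percent sanitizer: escape % unless it starts a valid escape sequence."""
    return re.sub(r"%(?![0-9A-Fa-f]{2})", "%25", text)

def data_encode(text: str) -> str:
    """Data encoder: escape everything except unreserved characters."""
    return quote(text, safe="")

def keyword_encode(text: str) -> str:
    """Keyword encoder: escape every single byte, unreserved or not."""
    return "".join("%{:02X}".format(b) for b in text.encode("utf-8"))

print(percent_sanitize("100%/50%25 %zz"))  # 100%25/50%25 %25zz
print(data_encode("a/b c?"))               # a%2Fb%20c%3F
print(keyword_encode("ab"))                # %61%62
```

Note that the percent sanitizer leaves an existing %25 alone, so running it twice gives the same result as running it once.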

The plus character

In queries, the + character has sometimes been used to encode the space character. This practice is falling out of use, but there are still services and libraries that are configured to do so by default. Encountering this character in a URI in any place other than the scheme can be ambiguous: in the path it should be treated as a plus and in the query as a space, but you can never know whether the producer of the URI used the proper encoding algorithm. This leaves the character in a state similar to the old ßβ conflation in CP437, which used the same glyph for two different but visually similar characters.

In my opinion, + should be treated as unwise when creating a URI, like the space character, always percent-encoded. When encountered, it should be assumed equivalent to both the plus and the space.
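
Python's urllib.parse happens to ship both behaviours side by side, which makes the ambiguity easy to demonstrate. Only standard library functions are used here.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

# The same input decodes to two different strings depending on the function.
print(unquote("a+b"))       # a+b
print(unquote_plus("a+b"))  # a b

# Form-style encoding produces + for a space...
print(quote_plus("a b+c"))           # a+b%2Bc
print(urlencode({"q": "a b+c"}))     # q=a+b%2Bc

# ...while percent-encoding both characters leaves nothing to guess.
print(quote("a b+c", safe=""))                    # a%20b%2Bc
print(urlencode({"q": "a b+c"}, quote_via=quote)) # q=a%20b%2Bc
```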

IRIs

So far, we have been dealing solely with US-ASCII as the encoding, but there is another can of worms: Unicode characters in URIs. If the input is a sequence of bytes interpretable in UTF-8, it is an IRI, an Internationalized Resource Identifier. Since the URI syntax supports encoding of arbitrary octets, the conversion between the two forms is simply performed by encoding the characters outside ASCII in UTF-8. The general syntax is also extended, with valid Unicode characters added to the set of unreserved characters.

It should be noted that it does not work this way for other encodings, and that it does not make Unicode characters miraculously supported in any application that uses URIs. For example, data:,¢ is simply equivalent to data:,%C2%A2, defaulting to the ASCII encoding (usually extended in browsers).
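
The IRI to URI direction can be sketched as follows. The name iri_to_uri is hypothetical, and the sketch treats the whole input uniformly, ignoring any special handling that individual portions such as host names may need in practice.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 bytes of every character outside ASCII."""
    out = []
    for char in iri:
        if ord(char) < 0x80:
            out.append(char)  # ASCII is passed through unchanged
        else:
            out.extend("%{:02X}".format(b) for b in char.encode("utf-8"))
    return "".join(out)

print(iri_to_uri("data:,¢"))  # data:,%C2%A2
print(iri_to_uri("/café"))    # /caf%C3%A9
```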

When to unescape

Unlike in Java, where Unicode escapes (\uXXXX) are processed before everything else, in URIs unescaping should be the last thing to do. Thus a process that uses them should:

Transform the IRI to URI (if necessary).
Validate or sanitize it with the algorithm above.
Parse the general structure of the URI.
Parse the path and query portion in an application-specific way, down to the deepest level.
Treat the individual smallest components of the URI as data.
Transform them from URI characters to IRI characters (if necessary).
Percent-decode the rest.
Validate or sanitize the data. At this point it should be either Unicode string or a byte sequence.
Process the data.

Performing all these steps globally and in order is sometimes impossible, however, as different components of the URI may be processed out of order, some before others are even decoded. Still, this procedure is something that should eventually happen at every level, and exactly once.
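
To make the ordering concrete, here is a sketch that parses a query both ways. The URI, the function names and the key=value convention are assumptions for the sake of the example. The first function decodes last, the second decodes first and pays for it.

```python
from urllib.parse import unquote, urlsplit

def query_pairs_decode_last(uri):
    """Split on the delimiters first, percent-decode the leaves last."""
    pairs = []
    for item in urlsplit(uri).query.split("&"):
        key, _, value = item.partition("=")
        pairs.append((unquote(key), unquote(value)))
    return pairs

def query_pairs_decode_first(uri):
    """Wrong order: decoding first turns escaped data into delimiters."""
    query = unquote(urlsplit(uri).query)
    return [tuple(item.partition("=")[::2]) for item in query.split("&")]

uri = "http://example.com/search?q=Tom%26Jerry%3D1&page=2"  # hypothetical URI
print(query_pairs_decode_last(uri))
# [('q', 'Tom&Jerry=1'), ('page', '2')]
print(query_pairs_decode_first(uri))
# [('q', 'Tom'), ('Jerry', '1'), ('page', '2')]
```

The escaped & and = were data, and only the first version keeps them that way.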

April 29, 2021
