What I Expect a Senior Engineer to Know About Text Processing

FLÜGGÅӘNҚб€ČHIŒβØLſÊN

To put it simply, Unicode is a character table in which every entry has a code point and, often, some extra properties and rules attached to it. Frequently there is more than one way to represent visually identical text as a sequence of code points, for example with a precomposed character or with a base letter followed by combining marks.

Two important outcomes here:

To compare two strings reliably, they need to be normalized first.
There are plenty of nonprintable characters that can massively change the binary representation of a string without changing how it looks. User input needs to be sanitized against them.
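Both outcomes can be sketched with Python's standard unicodedata module. The helper names normalize and strip_invisible are made up for this illustration, and the choice of NFC and of the Cc/Cf categories is an assumption, not a universal rule:

```python
import unicodedata

def normalize(text: str) -> str:
    """Return the NFC form so that canonically equivalent sequences compare equal."""
    return unicodedata.normalize("NFC", text)

def strip_invisible(text: str) -> str:
    """Drop control (Cc) and format (Cf) characters, keeping newline and tab."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )

precomposed = "caf\u00e9"   # é as a single code point
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)                        # False
print(normalize(precomposed) == normalize(decomposed))  # True

padded = "ad\u200bmin\u202e"  # ZERO WIDTH SPACE and RIGHT-TO-LEFT OVERRIDE
print(strip_invisible(padded) == "admin")               # True
```

Removing every format character is too blunt for some inputs: zero-width joiners, for example, are meaningful in emoji sequences and in several scripts, so a real filter probably needs an allow-list.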

Encodings define how sequences of code points are packed into bytes in computer memory and unpacked again. UTF-8, for example, is a variable-width encoding that needs 1 to 4 bytes to encode a single Unicode code point.

Outcomes:

It’s not trivial to count the symbols in a UTF-8 string, partly because of its variable-width nature and partly because it’s tricky to define what a symbol is.
It’s not trivial to chunk or stream UTF-8 either, for example when you want to shorten a long string and put an ellipsis sign at the end.
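A minimal Python sketch of both problems follows. The function truncate_utf8 is a hypothetical helper written for this example. It only protects against cutting inside a multi-byte sequence or between a base letter and its combining marks, and it does not handle full grapheme clusters such as emoji joined with zero-width joiners:

```python
import unicodedata

def truncate_utf8(text: str, max_bytes: int, ellipsis: str = "\u2026") -> str:
    """Shorten text so that its UTF-8 encoding, ellipsis included, fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    budget = max_bytes - len(ellipsis.encode("utf-8"))
    if budget < 0:
        raise ValueError("max_bytes is too small to hold the ellipsis")
    # errors="ignore" drops a trailing, incomplete multi-byte sequence.
    head = encoded[:budget].decode("utf-8", errors="ignore")
    cut = len(head)
    # Back up so that a base character is not separated from its combining marks.
    while cut > 0 and unicodedata.combining(text[cut]):
        cut -= 1
    return text[:cut] + ellipsis

text = "nai\u0308ve caf\u00e9"    # 10 visible symbols; ï is i + COMBINING DIAERESIS

print(len(text))                  # 11 code points
print(len(text.encode("utf-8")))  # 13 bytes
print(truncate_utf8(text, 7))     # na…
```

In the last call a naive byte slice would have ended in the middle of the combining diaeresis, so the sketch drops the whole "ï" instead of leaving a bare "i".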

On top of that, some string operations are language or culture-specific and can be unintuitive.

In particular, capitalizing a letter is not just subtracting 32 from its character code. Sometimes it is not even a reversible operation:

The uppercase word FIX can become fix in one locale and fıx in another. Similarly, Σ has two non-interchangeable lowercase equivalents: σ and ς.
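Here is how these cases look in Python, whose built-in str.lower() and str.upper() are locale-independent. The lower_tr helper is a deliberately simplified, hypothetical stand-in for Turkish casing rules; a real project would more likely rely on a locale-aware library such as ICU:

```python
def lower_tr(text: str) -> str:
    """Simplified Turkish lowercasing: I becomes dotless ı, dotted İ becomes i."""
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()

print("FIX".lower())    # fix
print(lower_tr("FIX"))  # fıx

# The round trip loses information: dotless ı uppercases to a plain I.
print(lower_tr("FIX").upper().lower() == lower_tr("FIX"))  # False

# Σ lowercases to ς at the end of a word and to σ elsewhere,
# while both lowercase forms uppercase to the same Σ.
print("ΟΔΟΣ".lower())                      # οδος
print("σ".upper() == "ς".upper() == "Σ")   # True
```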

Ligatures and title cases make it even more difficult. Just to name a few:

The ligature ĳ can sometimes be treated as a single letter and cased accordingly, while ǳ has three forms (Ǳ, ǲ and ǳ) but is treated as two letters.
ß in German traditionally has no capital form, so the uppercased word groß becomes GROSS, even though a capital eszett (ẞ) exists in Unicode. The same happens to the ﬗ ligature, which uppercases to two separate letters.
To make it worse, only a handful of scripts have upper- and lowercase letters at all.
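The behaviour described above can be observed with Python's built-in string methods, which apply the default, locale-independent Unicode case mappings. This is only a sketch of what one runtime does; other languages and libraries may differ:

```python
dz = "\u01f3"  # ǳ, LATIN SMALL LETTER DZ

print(dz.upper() == "\u01f1")   # True: uppercase form Ǳ
print(dz.title() == "\u01f2")   # True: titlecase form ǲ

# Written as two plain letters, the Dutch digraph ij gets only its first letter capitalized.
print("ijsselmeer".title())     # Ijsselmeer

print("groß".upper())           # GROSS
print("\u1e9e".lower() == "ß")  # True: capital eszett ẞ lowercases to ß

# The Armenian ligature ﬗ expands into two capital letters.
print("\ufb17".upper() == "\u0544\u053d")  # True

# Scripts without case are left untouched.
print("中文".upper() == "中文")  # True
```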

Accents can be tricky too.

Some are omissible in some languages, some are not. For example, schön and schon, or tache and tâche, are pairs of distinct words, while café can be written as cafe and мёд as мед.
There are also rules for spelling some letters without accents: for instance, Türen is the full equivalent of Tueren in German, and đak can be spelled as djak.
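A short Python sketch shows why blindly stripping accents is risky. The names strip_marks, transliterate, GERMAN_FALLBACK and DJ_FALLBACK are invented for this example, and the mapping tables are illustrative rather than complete:

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose to NFD and drop combining marks (a blunt, language-blind tool)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical per-language fallback tables.
GERMAN_FALLBACK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue",
                                 "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"})
DJ_FALLBACK = str.maketrans({"đ": "dj", "Đ": "Dj"})

def transliterate(text: str, table: dict) -> str:
    """Apply a fallback table to NFC text so precomposed letters match its keys."""
    return unicodedata.normalize("NFC", text).translate(table)

print(strip_marks("café"))    # cafe
print(strip_marks("мёд"))     # мед
print(strip_marks("schön"))   # schon, which is a different German word
print(strip_marks("đak"))     # đak: the stroke is not a combining mark

print(transliterate("Türen", GERMAN_FALLBACK))  # Tueren
print(transliterate("đak", DJ_FALLBACK))        # djak
```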

The bottom line: of course, I don’t expect a random Senior Engineer to know all of that, but I do want them to know about, and consistently prefer, Unicode-aware functions and libraries when dealing with text nowadays. That can be a locale-aware function in their programming language of choice, a Unicode modifier in a regular expression, or mandatory normalization and sanitization of user input.
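As one small illustration of the regular expression point, here is a sketch in Python, where patterns over str are Unicode-aware by default and re.ASCII switches that off. Other regex engines may need an explicit flag to get the Unicode behaviour:

```python
import re

text = "schön мёд"

# Unicode-aware matching: \w covers letters from any script.
print(re.findall(r"\w+", text))                  # ['schön', 'мёд']

# ASCII-only matching splits or drops anything outside A-Z, a-z, 0-9 and _.
print(re.findall(r"\w+", text, flags=re.ASCII))  # ['sch', 'n']
print(re.findall(r"[a-zA-Z]+", text))            # ['sch', 'n']
```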