Put simply, Unicode is a character table in which every entry has a corresponding code point and often some extra properties and rules associated with it. There is frequently more than one way to represent visually identical text as Unicode code points, for example with a single precomposed character or with a base letter followed by combining marks.
Two important outcomes here:
- To compare two strings reliably, they need to be normalized first.
- There are plenty of nonprintable characters that can massively change the binary representation of the string. User input needs to be sanitized against them.
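Both outcomes can be sketched with Python's standard `unicodedata` module. The helper names `normalize` and `strip_invisible` are made up for this illustration, and the choice to drop every control (Cc) and format (Cf) character is an assumption: it is a blunt policy that also removes joiners some scripts and emoji sequences legitimately need, so a real sanitizer would likely be more selective.

```python
import unicodedata

def normalize(text: str) -> str:
    """Return the NFC (canonically composed) form of text."""
    return unicodedata.normalize("NFC", text)

def strip_invisible(text: str) -> str:
    """Drop control (Cc) and format (Cf) characters, keeping newline and tab."""
    keep = {"\n", "\t"}
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) not in {"Cc", "Cf"}
    )

precomposed = "caf\u00e9"   # "é" as a single code point
combining = "cafe\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# The two strings render identically but compare unequal until normalized.
print(precomposed == combining)
print(normalize(precomposed) == normalize(combining))

# A ZERO WIDTH SPACE hidden inside a user name changes its length and bytes.
padded = "ad\u200bmin"
print(len(padded), len(strip_invisible(padded)))
```

Which normalization form to use (NFC, NFD, or the compatibility forms NFKC and NFKD) depends on the application; NFC is used here only as a common default.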
Encodings define how to pack character sequences into computer memory, and unpack them again, using code points. UTF-8, for example, is a variable-width encoding that needs 1 to 4 bytes to encode a single Unicode code point.
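To make the variable width concrete, here is a small Python sketch that encodes four sample characters and reports how many bytes each one takes. The helper name `utf8_width` and the sample characters are chosen for illustration only.

```python
def utf8_width(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single code point."""
    return len(ch.encode("utf-8"))

# One sample from each width class: ASCII, Latin-1 range, the rest of the
# Basic Multilingual Plane, and a supplementary-plane emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} takes {utf8_width(ch)} byte(s): {encoded.hex(' ')}")
```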
Outcomes:
- It’s not trivial to calculate the number of symbols in a UTF-8 string, partly because of its variable-width nature and partly because it’s tricky to define what a symbol is.
- It’s not trivial to chunk or stream UTF-8 either, for example when you want to shorten a long string and append an ellipsis at the end.
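A minimal Python sketch of both problems follows. The functions `truncate_utf8` and `shorten` are hypothetical helpers written for this example, not library APIs. They only guarantee that a code point is never split; a cut can still separate a base letter from its combining mark, because the standard library has no grapheme cluster segmentation.

```python
import codecs

def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut data to at most max_bytes without splitting a code point."""
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # Continuation bytes look like 0b10xxxxxx. If the first excluded byte is
    # one of them, back up until the cut sits in front of a lead byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]

def shorten(text: str, max_bytes: int) -> str:
    """Shorten text to fit max_bytes of UTF-8, ending with an ellipsis."""
    ellipsis = "\u2026"  # HORIZONTAL ELLIPSIS, 3 bytes in UTF-8
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    # Assumption: if the budget is too small, only the ellipsis is returned,
    # which may itself exceed max_bytes.
    budget = max(max_bytes - len(ellipsis.encode("utf-8")), 0)
    return truncate_utf8(data, budget).decode("utf-8") + ellipsis

# Three different "lengths" of the same visible character.
s = "e\u0301"  # "e" + COMBINING ACUTE ACCENT, rendered as one symbol
print(len(s), len(s.encode("utf-8")))  # code points, then bytes

print(shorten("h\u00e9llo w\u00f6rld", 8))

# Streaming: an incremental decoder buffers an incomplete sequence
# instead of raising an error at the chunk boundary.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(b"\xc3")   # lead byte only, nothing to emit yet
second = decoder.decode(b"\xa9")  # continuation byte completes the character
print(repr(first), repr(second))
```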
On top of that, some string operations are language or culture-specific and can be unintuitive.
In particular, capitalizing a letter is not just subtracting 32 from its character code. Sometimes it is not even a reversible operation:
- The capitalized word `FIX` can become `fix` in one locale, and `fıx` in another. Similarly, `Σ` has two noninterchangeable lowercase equivalents: `σ` and `ς`.
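These cases can be reproduced in Python, whose `str.lower()` and `str.upper()` apply Unicode's default, locale-independent mappings. The `lower_tr` helper below is a hand-rolled assumption that covers only the dotted and dotless i of Turkish; locale-aware casing in practice usually comes from a library such as ICU.

```python
def lower_tr(text: str) -> str:
    """Illustrative Turkish lowercasing: I -> dotless i, dotted capital I -> i."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()

# The default mapping and the Turkish mapping disagree on the same input.
print("FIX".lower())     # default mapping
print(lower_tr("FIX"))   # Turkish-style mapping, with U+0131 DOTLESS I

# Uppercasing the Turkish result and lowercasing it again with the default
# mapping does not round-trip.
turkish = "f\u0131x"
print(turkish.upper().lower() == turkish)

# Capital sigma lowercases to U+03C3 in isolation or mid-word,
# and to final sigma U+03C2 at the end of a word.
capital_sigma = "\u03a3"
print(capital_sigma.lower() == "\u03c3")
word = "\u03a3\u0391\u03a3"  # SIGMA ALPHA SIGMA
print(word.lower() == "\u03c3\u03b1\u03c2")

# Final sigma uppercases to the one capital sigma, so the distinction is lost.
print("\u03c2".upper().lower() == "\u03c2")
```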
Ligatures and title cases make it even more difficult. Just to name a few:
- Ligature `ij` can sometimes be treated as a single letter and cased accordingly; `dz` has three forms too but is treated as two letters.
- `ß` in German traditionally has no capital form, so the uppercased word `groß` becomes `GROSS`, even though a capital eszett exists in Unicode. The same expansion happens to the `ﬗ` ligature.
- To make it worse, only a handful of scripts have upper- and lowercase letters.
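A short Python sketch of these cases, again relying on the default Unicode mappings built into `str`. The code points are written as escapes so the example does not depend on how ligatures are rendered. Using U+01F3 for the `dz` digraph and U+4E2D as the caseless example are assumptions made for this illustration.

```python
# Uppercasing can change the length of a string.
print("gro\u00df".upper())                   # sharp s expands to two letters
print("gro\u00df".upper().lower())           # and does not come back
print(len("\ufb17"), len("\ufb17".upper()))  # Armenian ligature expands too

# Capital sharp s exists (U+1E9E) and lowercases to the ordinary sharp s.
print("\u1e9e".lower() == "\u00df")

# casefold() is intended for caseless comparison rather than display.
print("gro\u00df".casefold() == "GROSS".casefold())

# The dz digraph has separate upper, title and lower forms.
dz_lower = "\u01f3"
print(dz_lower.upper() == "\u01f1", dz_lower.title() == "\u01f2")

# Caseless scripts: the character is neither upper- nor lowercase,
# and case conversion leaves it unchanged.
han = "\u4e2d"
print(han.isupper(), han.islower(), han.upper() == han)
```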
Accents can be tricky too.
- Some are omissible in some languages, some are not. For example, `schön` and `schon` or `tache` and `tâche` are pairs of distinct words, while `café` can be written as `cafe` and `мёд` as `мед`.
- There are also rules on how some symbols can be represented without accents, for instance, `Türen` is the full equivalent of `Tueren` in German and `đak` can be spelled as `djak`.
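The sketch below shows why a generic "strip the accents" routine is not enough. `strip_accents` is a common decomposition trick, not a language-aware algorithm, and the `GERMAN_FALLBACK` table is a hypothetical, deliberately incomplete mapping written for this example.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition (language-blind)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

# Hypothetical German-only fallback spelling table.
GERMAN_FALLBACK = str.maketrans({
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
})

print(strip_accents("caf\u00e9"))     # acceptable: an accepted variant spelling
print(strip_accents("sch\u00f6n"))    # harmful: turns into a different word
print(strip_accents("T\u00fcren"))    # not the German fallback spelling
print("T\u00fcren".translate(GERMAN_FALLBACK))

# The stroke in U+0111 is part of the letter, not a combining mark,
# so decomposition leaves it untouched.
print(strip_accents("\u0111ak") == "\u0111ak")
```

The same input can therefore need different handling depending on the language of the text, which the code cannot infer from the characters alone.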
The bottom line. Of course, I don’t expect a random Senior Engineer to know all of that, but I do want them to know about, and consistently prefer, Unicode-aware functions and libraries when dealing with text. Be it a locale-aware function in their programming language of choice, a Unicode modifier in a regular expression, or mandatory normalization and sanitization of user input.
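As one illustration of what preferring Unicode-aware tools looks like, here is a Python sketch. In Python 3 the `re` module is Unicode-aware for `str` patterns by default and `re.ASCII` switches that off, which makes the difference easy to see. The `same_text` helper is a made-up name for a caseless comparison that combines normalization with `casefold()`.

```python
import re
import unicodedata

text = "sch\u00f6n caf\u00e9"

# Unicode-aware matching keeps accented letters inside words.
unicode_words = re.findall(r"\w+", text)
# ASCII-only matching breaks the words apart at every accented letter.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
print(unicode_words, ascii_words)

def same_text(a: str, b: str) -> bool:
    """Caseless comparison that ignores composed vs. decomposed forms."""
    def canonical(s: str) -> str:
        nfd = unicodedata.normalize("NFD", s)
        return unicodedata.normalize("NFD", nfd.casefold())
    return canonical(a) == canonical(b)

print(same_text("Stra\u00dfe", "STRASSE"))
print(same_text("caf\u00e9", "CAFE\u0301"))
```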