
To put it simply, Unicode is a character table in which every entry has a code point and, often, some extra properties and rules attached to it. The same visible text can frequently be represented by more than one sequence of code points, for example with a precomposed character or with a base letter followed by combining marks.
Two important outcomes here:
- To compare two strings reliably, they need to be normalized first.
- There are plenty of nonprintable characters that can massively change the binary representation of the string. User input needs to be sanitized against them.
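Both outcomes can be sketched with Python's standard `unicodedata` module. The helper names `normalize` and `strip_invisible` are made up for this illustration, and the choice of NFC and of the `Cc`/`Cf` categories is an assumption, not a universal rule: format characters such as the zero-width joiner are meaningful in emoji sequences and in some scripts, so a real sanitizer would likely need a more careful allow-list.

```python
import unicodedata

def normalize(text: str) -> str:
    """Return the NFC form so canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text)

def strip_invisible(text: str) -> str:
    """Drop control (Cc) and format (Cf) characters, keeping newline and tab."""
    keep = {"\n", "\t"}
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) not in {"Cc", "Cf"}
    )

precomposed = "caf\u00e9"     # é as a single code point
decomposed = "cafe\u0301"     # e followed by a combining acute accent

print(precomposed == decomposed)                        # False
print(normalize(precomposed) == normalize(decomposed))  # True

with_zwsp = "pass\u200bword"  # contains a zero-width space
print(len(with_zwsp), len(strip_invisible(with_zwsp)))  # 9 8
```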
Encodings define how sequences of code points are packed into bytes and unpacked again. UTF-8, for example, is a variable-width encoding that needs 1 to 4 bytes to encode a single Unicode code point.
Outcomes:
- It’s not trivial to calculate the number of symbols in a UTF-8 string, partly because of its variable-width nature and partly because it’s tricky to define what a symbol is.
- It’s not trivial to chunk or stream UTF-8 either. For example, shortening a long string and appending an ellipsis sign must not cut through the middle of a multi-byte sequence.
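A minimal sketch of both points in Python. The function `truncate_utf8` is a hypothetical helper, not a library call. It only guarantees that the result is valid UTF-8 within the byte budget; it can still separate a base letter from its combining mark, because that needs grapheme-cluster segmentation, which the standard library does not provide.

```python
def truncate_utf8(text: str, max_bytes: int, ellipsis: str = "…") -> str:
    """Shorten text so its UTF-8 form fits in max_bytes, ellipsis included."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    budget = max_bytes - len(ellipsis.encode("utf-8"))
    if budget < 0:
        raise ValueError("max_bytes is too small to hold the ellipsis")
    # errors="ignore" drops a trailing, incomplete multi-byte sequence
    return encoded[:budget].decode("utf-8", errors="ignore") + ellipsis

sample = "cafe\u0301"                          # "café" with a combining accent
print(len(sample.encode("utf-8")))             # 6 bytes
print(len(sample))                             # 5 code points, 4 visible symbols

# The cut lands inside the two-byte é, so the partial byte is discarded.
print(truncate_utf8("h\u00e9llo w\u00f6rld", 5))  # h…
```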
On top of that, some string operations are language or culture-specific and can be unintuitive.
In particular, capitalizing a letter is not just subtracting 32 from its character code. Sometimes it is not even a reversible operation:
- The capitalized word `FIX` can become `fix` in one locale, and `fıx` in another. Similarly, `Σ` has two noninterchangeable lowercase equivalents: `σ` and `ς`.
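To see the lost round trip in practice, here is a small Python sketch. Python's built-in `str.upper()` and `str.lower()` apply the default Unicode mappings and take no locale argument, so the Turkish-specific result is only described in a comment; getting it would need an ICU-backed library, which this example deliberately does not use. The helper `round_trips` is made up for the illustration.

```python
def round_trips(word: str) -> bool:
    """True if uppercasing and then lowercasing gives the word back."""
    return word.upper().lower() == word

print("FIX".lower())          # fix  (a Turkish-aware routine would give fıx)
print("f\u0131x".upper())     # FIX  (dotless ı maps to a plain I)
print(round_trips("fix"))     # True
print(round_trips("f\u0131x"))  # False, the ı came back as i

# Both lowercase sigmas share a single uppercase form.
print("σ".upper() == "ς".upper() == "Σ")  # True
# Python picks the final form ς at the end of a word when lowercasing.
print("ΣΟΦΟΣ".lower())        # σοφος
```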
Ligatures and title cases make it even more difficult. Just to name a few:
- Ligature `ij` sometimes can be treated as a single letter and cased accordingly; `dz` has three forms too but is treated as two letters.
- `ß` in German traditionally doesn’t have a capital form, thus the uppercased word `groß` would become `GROSS`. Mind that capital eszett exists in Unicode. The same happens to the `ﬗ` ligature.
- To make it worse, only a handful of scripts have upper and lower-case letters.
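The cases above can be poked at from Python. This is a sketch under a few assumptions: the `dz` digraph is taken to be the single code point U+01F3, the outputs reflect Python's default (locale-independent) case mappings, and `same_ignoring_case` is a hypothetical helper built on `str.casefold()`.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Caseless comparison via full case folding."""
    return a.casefold() == b.casefold()

print("gro\u00df".upper())                        # GROSS
print("gro\u00df".upper().lower())                # gross, not groß
print("\u1e9e".lower() == "\u00df")               # True, capital eszett lowers to ß
print("gro\u00df".lower() == "GROSS".lower())     # False
print(same_ignoring_case("gro\u00df", "GROSS"))   # True

dz = "\u01f3"  # the dz digraph as one code point
print([f"U+{ord(c):04X}" for c in (dz.upper(), dz.title(), dz.lower())])
# ['U+01F1', 'U+01F2', 'U+01F3']

# The Armenian ligature expands into two letters when uppercased.
print(len("\ufb17"), len("\ufb17".upper()))       # 1 2

# Caseless scripts pass through unchanged.
print("你好".upper() == "你好")                    # True
```

For comparisons, `casefold()` is usually a safer default than `lower()`, since it is designed for caseless matching rather than display.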
Accents can be tricky too.
- Some are omissible in some languages, some are not. For example, `schön` and `schon`, or `tache` and `tâche`, are pairs of distinct words, while `café` can be written as `cafe` and `мёд` as `мед`.
- There are also rules on how some symbols can be represented without accents: for instance, `Türen` is the full equivalent of `Tueren` in German, and `đak` can be spelled as `djak`.
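A rough Python sketch of why a blanket "strip the accents" step is risky. Both `strip_accents` and `to_german_ascii` are hypothetical helpers written for this example, and the German mapping table is a simplified assumption rather than a complete transliteration standard.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition (lossy)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

# Simplified, assumed mapping: ä ö ü Ä Ö Ü ß
GERMAN_ASCII = str.maketrans({
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
    "\u00df": "ss",
})

def to_german_ascii(text: str) -> str:
    """Apply the language-specific replacement instead of dropping marks."""
    return unicodedata.normalize("NFC", text).translate(GERMAN_ASCII)

print(strip_accents("café"))     # cafe
print(strip_accents("мёд"))      # мед
print(strip_accents("schön"))    # schon, which is a different German word
print(strip_accents("đak"))      # đak, the stroke is not a combining mark
print(to_german_ascii("Türen"))  # Tueren
```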
The bottom line: of course, I don’t expect a random Senior Engineer to know all of that, but I do want them to be aware of it and to consistently prefer Unicode-aware functions and libraries when dealing with text nowadays, be it a locale-aware function in their programming language of choice, a Unicode modifier in a regular expression, or mandatory normalization and sanitization of user input.
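As one small illustration of the regular-expression point, here is how it looks in Python, where `str` patterns are Unicode-aware by default and `re.ASCII` switches that off. The `words` helper is hypothetical, and normalizing to NFC first is an assumption made so that `\w` sees precomposed letters rather than stray combining marks.

```python
import re
import unicodedata

def words(text: str, ascii_only: bool = False) -> list[str]:
    """Split text into word tokens, optionally with ASCII-only matching."""
    flags = re.ASCII if ascii_only else 0
    return re.findall(r"\w+", unicodedata.normalize("NFC", text), flags)

text = "Türen schön мёд"
print(words(text))                   # ['Türen', 'schön', 'мёд']
print(words(text, ascii_only=True))  # ['T', 'ren', 'sch', 'n']
```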