What I Expect a Senior Engineer to Know About Text Processing
To put it simply, Unicode is a character table, where every item has a corresponding code point and often some extra flags and rules associated with it. Frequently there is more than one way to represent a visually identical text in Unicode code points using precomposed characters. Two important outcomes here: To compare two strings reliably, they need to be normalized first. There are plenty of nonprintable characters that can massively change the binary representation of the string....