Imagine surfing the web without text. No articles, no blog posts, no comments, no social media posts. Just images, videos, and audio. It would be a pretty boring internet, wouldn't it?
So how do browsers store all of that text? It's pretty simple. JavaScript strings in the browser use an encoding called UTF-16, which represents each character as a sequence of one or more code units. Each code unit is 16 bits long, which means that a single code unit can hold one of 65,536 distinct values.
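To make code units a bit more concrete, here is a minimal sketch that lists the 16-bit code units of a string. The helper name `codeUnits` is made up for this post; `charCodeAt` and `length` are standard string features.

```javascript
// Return the UTF-16 code units of a string as 4-digit uppercase hex strings.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    // charCodeAt reads the single 16-bit code unit stored at index i.
    units.push(str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
  }
  return units;
}

console.log(codeUnits("Hi")); // [ '0048', '0069' ]
console.log(codeUnits("€"));  // [ '20AC' ]
console.log("Hi".length);     // 2, because length counts code units
```

Each of these characters fits in one code unit, so the number of code units matches the number of characters you see.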
But what about emojis and other symbols? Many of them have numbers (code points) above 0xFFFF, which do not fit in a single 16-bit code unit. That's where surrogate pairs come in. A surrogate pair is two UTF-16 code units that together represent a single character.
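The arithmetic behind a surrogate pair is short enough to sketch. The function name `toSurrogatePair` is hypothetical, and the waving hand (code point 0x1F44B) is used as the example value.

```javascript
// Split a code point above 0xFFFF into its two UTF-16 code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("only code points from 0x10000 to 0x10FFFF need a pair");
  }
  const offset = codePoint - 0x10000;    // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits go in the first unit
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits go in the second unit
  return [high, low];
}

const pairForWave = toSurrogatePair(0x1F44B); // waving hand
console.log(pairForWave.map((u) => u.toString(16)));           // [ 'd83d', 'dc4b' ]
console.log(String.fromCharCode(...pairForWave) === "\u{1F44B}"); // true
```

Splitting 20 bits into two halves of 10 bits each is why each half of the pair gets its own range of 1,024 values.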
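There are a few additional rules for surrogate pairs. First, both parts of the pair must be between 0xD800 and 0xDFFF: the first half (the high surrogate) is between 0xD800 and 0xDBFF, and the second half (the low surrogate) is between 0xDC00 and 0xDFFF. Second, every character has a number called a code point, and each code point is encoded as one or two code units. In JavaScript string literals, a code point can be written in the form \u{xxxxxx}, where xxxxxx represents one to six hexadecimal digits. The two halves of a surrogate pair can also be written out individually, as in \uD83D\uDC4B, where each escape has exactly four hexadecimal digits.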
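Here is a small sketch showing that the code point escape and the code unit escapes produce the same string. The two helper names are invented for illustration; `Array.from` and `codePointAt` are standard and walk the string by code point.

```javascript
// Spell out a string as \uXXXX escapes, one per 16-bit code unit.
function toUnitEscapes(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    out += "\\u" + str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0");
  }
  return out;
}

// Spell out a string as \u{...} escapes, one per code point.
function toCodePointEscapes(str) {
  return Array.from(str, (ch) => "\\u{" + ch.codePointAt(0).toString(16).toUpperCase() + "}").join("");
}

console.log("\u{1F44B}" === "\uD83D\uDC4B");     // true
console.log(toUnitEscapes("\u{1F44B}"));         // \uD83D\uDC4B
console.log(toCodePointEscapes("\uD83D\uDC4B")); // \u{1F44B}
```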
You might be wondering how the browser decides whether to read one code unit at a time or two. As mentioned earlier, surrogate pairs are made up of two code units, each of which is between 0xD800 and 0xDFFF. These values are reserved: they never stand for a character on their own and must be used in pairs. If the browser sees a code unit between 0xD800 and 0xDBFF (a high surrogate), it knows that it is the first half of a surrogate pair and looks at the next code unit to complete it.
If the browser encounters a sequence of two code units where the first code unit is between 0xD800 and 0xDBFF and the second code unit is between 0xDC00 and 0xDFFF, it knows that it is a surrogate pair and will decode it as a single character.
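Going the other way, a valid pair can be turned back into its code point with a little arithmetic. This is a sketch with made-up function names, not a description of any particular browser's internals.

```javascript
// High surrogates occupy 0xD800..0xDBFF, low surrogates 0xDC00..0xDFFF.
function isHighSurrogate(unit) {
  return unit >= 0xD800 && unit <= 0xDBFF;
}

function isLowSurrogate(unit) {
  return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Combine a high and a low surrogate into the code point they encode.
function fromSurrogatePair(high, low) {
  if (!isHighSurrogate(high) || !isLowSurrogate(low)) {
    throw new RangeError("not a valid surrogate pair");
  }
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

console.log(fromSurrogatePair(0xD83D, 0xDC4B).toString(16)); // 1f44b
// The built-in codePointAt does the same job for real strings.
console.log("\uD83D\uDC4B".codePointAt(0) === fromSurrogatePair(0xD83D, 0xDC4B)); // true
```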
If the browser encounters a surrogate that is not part of a valid pair, that is, a code unit between 0xD800 and 0xDBFF that is not followed by one between 0xDC00 and 0xDFFF, or a code unit between 0xDC00 and 0xDFFF that is not preceded by one between 0xD800 and 0xDBFF, it will not be able to decode it and will typically display the replacement character �.
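Putting the rules together, a decoder can be sketched as a single loop over the code units. The function name `decodeUtf16` is hypothetical, and substituting 0xFFFD (the replacement character) for every lone surrogate is an assumption about the desired behavior; real browsers may handle the details differently.

```javascript
// Walk a string's code units and return the decoded code points,
// substituting U+FFFD for any surrogate that is not part of a valid pair.
function decodeUtf16(str) {
  const codePoints = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const next = i + 1 < str.length ? str.charCodeAt(i + 1) : -1;
    const unitIsHigh = unit >= 0xD800 && unit <= 0xDBFF;
    const nextIsLow = next >= 0xDC00 && next <= 0xDFFF;
    if (unitIsHigh && nextIsLow) {
      codePoints.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
      i++; // the low surrogate has been consumed as well
    } else if (unit >= 0xD800 && unit <= 0xDFFF) {
      codePoints.push(0xFFFD); // lone surrogate
    } else {
      codePoints.push(unit); // ordinary single-unit character
    }
  }
  return codePoints;
}

const toHex = (cps) => cps.map((cp) => cp.toString(16));
console.log(toHex(decodeUtf16("A\uD83D\uDC4B"))); // [ '41', '1f44b' ]
console.log(toHex(decodeUtf16("A\uD83D")));       // [ '41', 'fffd' ]
console.log(toHex(decodeUtf16("\uDC4B\uD83D")));  // [ 'fffd', 'fffd' ]
```

The last example has both halves in the wrong order, so neither one forms a pair and each is replaced separately.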
It can get even more complicated: some emojis are made up of two or more code points, such as the waving hand combined with a skin tone modifier in the sign-off below. But we'll discuss that another day.
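Without going into the details yet, here is a quick peek at how such an emoji counts up. The helper name `codePointsOf` is made up; the string is a waving hand (0x1F44B) followed by a skin tone modifier (0x1F3FE).

```javascript
// List the code points of a string as hex strings.
// Array.from iterates a string by code point, not by code unit.
function codePointsOf(str) {
  return Array.from(str, (ch) => ch.codePointAt(0).toString(16));
}

const wave = "\uD83D\uDC4B\uD83C\uDFFE";
console.log(wave.length);        // 4 code units
console.log(codePointsOf(wave)); // [ '1f44b', '1f3fe' ], so 2 code points
```

So one emoji on screen can be two code points and four code units.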
Thanks for reading \uD83D\uDC4B\uD83C\uDFFE.