Characters, Symbols and the Unicode Miracle – Computerphile

UTF-8 is perhaps the best hack, the best single thing that’s used that can be written down on the back of a napkin, and that’s how it was put together. The first draft of UTF-8 was written on the back of a napkin in a diner, and it’s just such an elegant hack that solved so many problems, and I absolutely love it.

Back in the 1960s, we had teleprinters, we had simple devices where you type a key and it sends some numbers and the same letter comes out on the other side, but there needs to be a standard, so in the mid-1960s America, at least, settled on ASCII, which is the American Standard Code for Information Interchange, and it’s a 7-bit binary system, so each letter you type in gets converted into 7 binary digits and sent over the wire. Now that means you can have numbers from 0 to 127. They sort of reserved the first 32 for control codes and less important stuff for writing, things like “go down a line” or backspace. And then they made the rest characters. They added some numbers, some punctuation marks.

They did a really clever thing, which is that they made ‘A’ 65, which in binary (1, 2, 4, 8, 16, 32, 64) is 1000001, which means that ‘B’ is 66, which means you’ve got 2 in binary just here. C, 67, 3 in binary. So you can look at a 7-bit binary character and just knock off the first two digits and know what its position in the alphabet is. Even cleverer than that, they started lowercase 32 later, which means that lowercase ‘a’ is 97: 1100001. Anything that doesn’t fit into that is probably a space, which conveniently will be all zeroes after those first two digits, or some kind of punctuation mark. Brilliant, clever, wonderful, great way of doing things, and that became the standard, at least in the English-speaking world.

As for the rest of the world, a few of them did versions of that, but you start getting into other alphabets, into languages that don’t really use alphabets at all. They all came up with their own encoding, which is fine. And then along come computers, and, over time, things change. We move to 8-bit computers, so we now have a whole extra digit at the start just to confuse matters, which means we now have 256 values! We can have twice as many characters! And, of course, everyone settled on the same standard for this, because that would make perfect s... No. None of them did. All the Nordic countries start putting Norwegian characters and Finnish characters in there. Japan just doesn’t use ASCII at all.
Japan goes and creates its own multibyte encoding with more letters and more characters and more binary digits going to each individual character. All of these things are massively incompatible. Japan actually has three or four different encodings, all of which are completely incompatible with each other. So you send a document from one old-school Japanese computer to another, it will come out so garbled that there is even a word in Japanese for “garbled characters,” which is (I’m probably mispronouncing this) “mojibake.” It’s a bit of a nightmare, but it’s not bad, because how often does someone in London have to send a document to a completely incompatible and unknown computer at another company in Japan? In those days, it’s rare. You printed it off and you faxed it.

And then the World Wide Web hit, and we have a problem, because suddenly documents are being sent from all around the world all the time. So a thing is set up called the Unicode Consortium. In what I can only describe as a miracle, over the last couple of decades, they have hammered out a standard. Unicode now have a list of more than a hundred thousand characters that covers everything you could possibly want to write in any language: English alphabet, Cyrillic alphabet, Arabic alphabet, Japanese, Chinese, and Korean characters. What you have at the end is the Unicode Consortium assigning 100,000+ characters to 100,000+ numbers. They have not chosen binary digits. They have not chosen what they should be represented as. All they have said is that THAT Arabic character there, that is number 5,700-something, and this linguistic symbol here, that’s 10,000-something.

I have to simplify massively here because there are, of course, about five or six incompatible ways to do this, but what the web has more or less settled on is something called “UTF-8.” There are a couple of problems with doing the obvious thing, which is saying, “OK. We’re going to 100,000. That’s gonna need, what… to be safe, that’s gonna need 32 binary digits to encode it.” They encoded the English alphabet in exactly the same way as ASCII did. ‘A’ is still 65. So if you have just a string of English text, and you’re encoding it at 32 bits per character, you’re gonna have about 20-something… 26? Yeah. 26, 27 zeroes and then a few ones for every single character. That is incredibly wasteful. Suddenly every English language text file takes four times the space on disk.

So problem 1: you have to get rid of all the zeroes in the English text. Problem 2: there are lots of old computer systems that interpret 8 zeroes in a row, a NULL, as “this is the end of the string of characters,” so if you ever send 8 zeroes in a row, they just stop listening. They assume the string has ended there, and it gets cut off, so you can’t have 8 zeroes in a row anywhere. OK. Problem number 3: it has to be backwards-compatible. You have to be able to take this Unicode text and chuck it into something that only understands basic ASCII, and have it more or less work for English text.

UTF-8 solves all of these problems and it’s just a wonderful hack. It starts by just taking ASCII. If you have something under 128, that can just be expressed as 7 digits, you put down a zero, and then you put the same numbers that you would otherwise, so let’s have that ‘A’ again. There we go! That’s still ‘A.’ That’s still 65. That’s still UTF-8-valid, and that’s still ASCII-valid. Brilliant.

OK. Now let’s say we’re going above that. Now you need something that’s gonna work more or less for ASCII, or at least not break things, but still be understood. So what you do is you start by writing down “110.” This means this is the start of a new character, and this character is going to be 2 bytes long. Two ones, two bytes, a byte being 8 bits. And you say on this one, we’re gonna start it with “10,” which means this is a continuation, and at all these blank spaces, of which you have 5 here and 6 here, you fill in the other numbers, and then when you calculate it, you just take off those headers, and it understands it as just being whatever number that turns out to be. That’s probably somewhere in the hundreds. That’ll do you for the first 2,048.

What about above that? Well, above that you go “1110,” meaning there are three bytes in this (three ones, three bytes) with two continuation bytes. So now you have 1, 2, 3, 4, 10, 16 spaces. You want to go above that? You can. This specification goes all the way to “1111110x” with this many continuation bytes after it.

It’s a neat hack that you can explain on the back of a napkin or a bit of paper. It’s backwards-compatible. It avoids waste. At no point will it ever, ever, ever send 8 zeroes in a row, and, really, really crucially, the one that made it win over every other system is that you can move backwards and forwards really easily. You do not have to have an index of where the character starts. If you are halfway through a string and you wanna go back one character, you just look for the previous header. And that’s it, and that works, and, as of a few years ago, UTF-8 beat out ASCII and everything else as, for the first time, the dominant character encoding on the web. We don’t have that mojibake that Japanese has. We have something that nearly works, and that is why it’s the most beautiful hack that I can think of that is used around the world every second of every day.

(BRADY HARAN)
-We’d like to thank Audible.com for their support of this Computerphile video, and, if you register with Audible and go to audible.com/computerphile, you can download a free audiobook. They’ve got a huge range of books at Audible. I’d like to recommend “The Last Man On the Moon,” which is by Eugene Cernan, who is the eleventh of twelve men to step onto the Moon, but he was the last man to step off the Moon, so I’m not sure whether or not he is “the last man on the Moon.” Sort of depends how you define it. But his book is really good, and what I really like about it is it’s read by Cernan himself, which I think is pretty cool. Again, thanks to Audible. Go to audible.com/computerphile and get a free audiobook.

(TOM SCOTT)
-“… an old system that hasn’t been programmed well will take those nice curly quotes that Microsoft Word has put into Unicode, and it will look at that and say, ‘That is three separate characters…’ ”
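
The header-and-continuation layout described in the transcript can be sketched in a few lines of Python. This is a rough illustration, not a reference implementation: the names `utf8_encode` and `bits` are made up for this example, and the sketch assumes the modern four-byte limit (code points up to U+10FFFF) rather than the longer forms mentioned in the talk. It cross-checks itself against Python's built-in `str.encode`, and then decodes a curly quote as Windows-1252 to show the "three separate characters" effect from the quote above.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand using the header/continuation layout."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:
        # 0xxxxxxx: plain ASCII, one byte, unchanged
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx: 5 + 6 = 11 payload bits
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 = 16 payload bits
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 3 + 6 + 6 + 6 = 21 payload bits
    return bytes([
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b111111),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])


def bits(data: bytes) -> str:
    """Show each byte as eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in data)


# 'A', e-acute, left curly quote, and an emoji: 1, 2, 3 and 4 bytes.
for ch in ["A", "\u00e9", "\u201c", "\U0001F600"]:
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")  # agrees with Python's own encoder
    print(f"U+{ord(ch):04X}  {bits(encoded)}")

# A system that assumes one byte per character sees the curly quote's
# three bytes as three unrelated characters.
misread = "\u201c".encode("utf-8").decode("cp1252")
print(len(misread))
```

Reading the printed bit patterns from left to right, the number of leading ones in the first byte matches the number of bytes in the character, and every following byte starts with 10.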

About the Author: Oren Garnes

55 Comments

  1. This is literally the first video I have seen with Tom Scott in and I absolutely love his passion. I think there should be a standard for a lot more things too. What side of the road we all drive on, for a start. Power sockets and the actual power supply itself. Phone chargers, etc.

  2. I'm still a little confused, now. Does the existence of the header bits mean that no character data can contain that exact pattern of bits? That seems really obtuse. And if not, then the computer has to be very careful to count 8 bits at a time forward or backward. Well then what's the point in having such long headers? It could be as simple as 0 = start of new character, and 1 = continuation.

  3. A note from someone studying Japanese: as far as I can tell, from my limited knowledge of Japanese shortenings, "mojibake" means "character monster." Rather prefer the one on Sesame Street, myself.

  4. thanx Computerphile for explaining utf8 , user tried to understand from wiki but could not do it, u make everything simple

  5. So all the weird symbols like NUL and REF in a picture when you open it in a text editor are the 64 characters before A? And it’s weird because it’s being read incorrectly. Cool!

  6. At time 6:46, the number is 49 not 65. Super interesting video, very informative. I was directed here from electroboom, and am excited to find another great educational YouTuber!

  7. Well, this sounds very nice and space-saving, but still CZECH written texts encoded in UTF-8 turn into complete gibberish upon transferring from one computer to another, or worse, from one app to another.

  8. If you ever had to follow along after a hacker and analyze why their code is broken, you would be much less enthralled with hacked software. I have, it ain't pretty.

  9. Does the Computerphile channel have some giant stash of green bar paper they carry around and hand to who ever is speaking to illustrate? Seriously, where do they get an endless supply of greenbar these days?

  10. I watched this video like 5 times over a long period now. Keep coming back to it, I so love the explanation and the storytelling!

  11. I thought there was a Klein bottle on the left side behind Tom. I got excited, and then I got sad when I realized it wasn't…. :'(

  12. So why isn't the header for everything that doesn't fit into two bytes (e.g. 110xxxxx 10xxxxxx) just 110 as well? Or why does the header need to specify how many more bytes there are? It could also just say: "there are more bytes to come" and the program reading it would just look for the next header (or the end of the data) and "use" all the bytes in between… Or am I missing something?

  13. A Miracle would have given one unified Encoding. Not the mess we have now! Video is misleading UTF-8 is NOT de facto standard, not even for the Internet.

  14. Additionally, UTF-8 does not have a "byte order". The "always store 32 bits for each character" encoding (a.k.a. UTF-32) has the problem that when a little-endian computer and a big-endian computer exchange data in this format, they have to add a prefix which tells the other computer "I'm sending the bytes of each character in ascending order" or "… in descending order". Then software needs logic to understand this prefix, to eliminate this prefix, to guess what to do when this prefix is missing, and so on. The UTF-16 encoding, which is used by Microsoft Windows internally, has the same problem. Whereas UTF-8 just gets away without it. Simple and beautiful!

  15. All zeroes is NOT a space. It is the null character, while 32 in decimal, 20 in hex, and 100000 in binary is a space.

  16. I am confused by a phrase at 2:13, "languages that don't use alphabets at all." If there is no written system, what are they typing?

  17. Very interesting as always Tom, but I couldn't watch this video; I listened to it, but the camera work made it unwatchable. The constant side to side swaying made me seriously nauseous and the random rapid zooms just jarred. Please, tell your cameraman to get a tripod and to use it and to stop playing with the zoom lever.

  18. You pronounced mojibake pretty well! For anyone who wants google translate or a Japanese English dictionary: もじばけ

  19. For the people wanting to know where this vid was taken: it's in a cafe called the Booking Office in St Pancras station. I know because I have been there once. It's pretty popular.

  20. Such an incredible enthusiasm just for UTF-8! I’d like to hear you speaking about quantum entanglement 🥴

  21. So does that mean non-English text takes more space to store? If I translate a document from English to .. say … Arabic, wouldn't that double or triple its file size? That sounds like a pretty big problem to me.
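
Several of the questions above (comments 2, 12 and 21) can be explored with a short Python sketch. It assumes well-formed UTF-8 input, and the function names `is_continuation`, `sequence_length` and `previous_char_start` are invented for illustration; none of this comes from the video itself.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000


def sequence_length(lead: int) -> int:
    """Read the character's total length from its first byte alone."""
    if lead < 0b10000000:
        return 1                      # 0xxxxxxx
    if lead >> 5 == 0b110:
        return 2                      # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3                      # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4                      # 11110xxx
    raise ValueError("not the first byte of a character")


def previous_char_start(data: bytes, index: int) -> int:
    """Step back from `index` to the first byte of the previous character."""
    index -= 1
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


# 'a', e-acute, left curly quote, 'z': 1 + 2 + 3 + 1 = 7 bytes.
data = "a\u00e9\u201cz".encode("utf-8")

# Walk backwards through the bytes without any index of character positions.
starts = []
i = len(data)
while i > 0:
    i = previous_char_start(data, i)
    starts.append(i)
print(starts)

# Bytes inside a multi-byte character never look like ASCII (top bit is set).
assert all(byte >= 0x80 for byte in "\u00e9\u201c".encode("utf-8"))

# Storage cost for five-letter greetings in three scripts.
english = "hello"
arabic = "\u0645\u0631\u062d\u0628\u0627"      # Arabic greeting, 5 letters
japanese = "\u3053\u3093\u306b\u3061\u306f"    # Japanese greeting, 5 kana
for word in (english, arabic, japanese):
    print(len(word), "characters ->", len(word.encode("utf-8")), "bytes")
```

Because continuation bytes always start with 10 and first bytes never do, a reader can land on any byte and find the character boundary, and a search for an ASCII byte cannot match the middle of another character. One practical effect of putting the length in the first byte is that a reader knows how many bytes to expect without scanning ahead for the next header. On size: in this sketch the Arabic letters take two bytes each and the Japanese kana three, so such text is larger than its ASCII equivalent in UTF-8, though still no larger than it would be at a fixed 32 bits per character.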
