View Full Version : UTF-8 with HTML5



techno_race
06-21-2012, 06:28 PM
I didn't find anything about this elsewhere, so I thought I'd post it here.

When encoding an HTML5 (possibly earlier versions, as well) document as UTF-8, ensure that it does not include a byte order mark. (In Notepad++, this is the difference on the "Encoding" menu between "Encode in UTF-8" and "Encode in UTF-8 without BOM.")

For some reason, BOMs upset the W3C validator and some web browsers (this may be server-specific); for example, the validator would not run on a document containing both a BOM and the raw byte \xA9 (the copyright sign in Latin-1, which by itself is not a valid UTF-8 sequence), as it had trouble mapping that byte to Unicode.

Also remember to include the <meta charset="UTF-8"> tag, which makes the validator happy.
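For anyone scripting this, here's a minimal Python sketch (the file content is a made-up example) showing how to detect and strip a UTF-8 BOM before a file is published or validated:

```python
import codecs

# codecs.BOM_UTF8 is the three-byte UTF-8 byte order mark: EF BB BF.
raw = codecs.BOM_UTF8 + "<!DOCTYPE html>".encode("utf-8")

# Detect a leading BOM and strip it, so validators and browsers
# receive plain BOM-less UTF-8.
if raw.startswith(codecs.BOM_UTF8):
    raw = raw[len(codecs.BOM_UTF8):]

print(raw.decode("utf-8"))  # <!DOCTYPE html>
```

Python's built-in "utf-8-sig" codec does the same stripping automatically when decoding.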

jscheuer1
06-22-2012, 12:46 AM
That's about right.

As far as I can tell, at least in files that a browser will interpret as HTML, CSS, XML, or JavaScript (possibly others), the BOM is an outdated, unnecessary prefix telling the interpreter that what follows is UTF-8, or whatever encoding the BOM indicates (there are other BOMs for other encodings).

Using the BOM in those situations can cause problems, and as far as I know it is never required or desired, at least not with UTF-8. If using UTF-16, a BOM (a different one for that encoding) might be required. I'm not sure.

There may be other situations, say when the file will be read by something other than a browser, where a BOM is required even with UTF-8, but I'm not aware of any. Since I deal mostly with browsers and what they need and do, that's not saying a lot about other applications.

I'm pretty sure NotePad++ gives you that option in case you need the BOM for some reason.

And you're right, you should never use it for HTML files.

Something else to be aware of here is how other editors handle it. Some just slap on the BOM, or do so by default unless configured otherwise. So when troubleshooting others' work, always keep it in the back of your mind that an unwanted BOM may be present.

If you view the file in an editor using ISO-8859-1 (windows-1252) encoding, you can see and delete the BOM; it looks like this (indented here for easier recognition):

    ï»¿

It will almost always be the very first thing in the file, and often shows up in some browsers if such a page is served as ISO-8859-1 (windows-1252) or another single-byte encoding.
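A quick Python check confirms why it looks that way: the three BOM bytes, reinterpreted as windows-1252, come out as exactly that character trio:

```python
import codecs

bom = codecs.BOM_UTF8          # b'\xef\xbb\xbf'
# Decoded as windows-1252, each byte maps to its own character:
# 0xEF -> ï, 0xBB -> », 0xBF -> ¿
print(bom.decode("cp1252"))    # ï»¿
```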

techno_race
06-22-2012, 04:57 AM
The W3C does currently recommend the use of a BOM with UTF-16, but only for HTML5 (not for prior versions). I don't pretend to understand why.

Off the top of my head, I think that the BOM might have something to do with East Asian texts...?

I thought it was worth noting that omitting the BOM matters for web publishing: when I was looking for UTF-8 in the encodings list, I just picked the entry that said "UTF-8," without paying attention to the other ones, which seemed irrelevant once I had already found UTF-8.

I'm thinking that, given my last point, it might be more useful for applications such as Notepad++ to label the options "UTF-8" and "UTF-8 with BOM," or "UTF-8 without BOM" and "UTF-8 with BOM," instead of "UTF-8 without BOM" and "UTF-8." Of course, knowing nothing about Unicode outside of web publishing, I can't rule out that this would cause problems for files used in other fields.

Update: I created this for myself and thought I should post it here, in case anyone else keeps making the same mistake of clicking "UTF-8" when they want BOM-less UTF-8 encoding :p. It is a localization file for Notepad++ (tested in Notepad++ 6.1.3) which uses the first "more useful" naming scheme from the edit above. It should be placed in the localization directory of your Notepad++ installation folder, and can be activated by choosing "English (customizable)" in the "Localization" section of the Preferences dialog.

Download (http://tortoisewrath.com/f4)

jscheuer1
06-22-2012, 08:08 AM
Neat. UTF-16 is for Asian and can also be used for Arabic, Cyrillic, Hebrew, probably others. In many cases though, depending upon the dialect/character set, UTF-8 is sufficient and should therefore be employed.

When UTF-16 is required, it comes in two flavors, little-endian and big-endian. The difference is the byte order within each 16-bit unit, and each order has its own BOM.
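A small Python sketch makes the two byte orders and their BOMs concrete:

```python
import codecs

# The same code point (U+0041) serialized in each byte order:
print("A".encode("utf-16-be").hex())  # 0041
print("A".encode("utf-16-le").hex())  # 4100

# Each order has its own BOM; the plain "utf-16" codec prepends
# the BOM for the platform's native byte order.
print(codecs.BOM_UTF16_BE.hex())      # feff
print(codecs.BOM_UTF16_LE.hex())      # fffe
```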

The requirement for the BOM in HTML5 for UTF-16 might just be wishful thinking on the part of the standards people. In my experience the standards generally fall into one of three categories:

1. What both works and is commonly accepted.
2. What 'should' be the standard, but most browsers are just fine without it and it doesn't hurt anything.
3. What most browsers already do, so it is being incorporated into the standard, even if there's an odd browser that doesn't follow it.

I'm not sure which category requiring the UTF-16 BOM falls into.

djr33
06-22-2012, 08:25 AM
UTF-16 is for Asian and can also be used for Arabic, Cyrillic, Hebrew, probably others.

UTF-8 can do Japanese, Chinese, Korean, plus others like Thai and Vietnamese.
There *might* be some limits to the extent of some of the characters, such as the many thousands of Chinese characters, but in general, UTF-8 is sufficient.

UTF-16 expands UTF-8 by adding more space for many more characters.

I really have no idea what UTF-16 does except to allow for further expansion.

In looking up more info (which didn't prove very helpful), there's also UTF-32. As needed, I guess...

jscheuer1
06-22-2012, 08:53 AM
UTF-16 is for Asian and can also be used for Arabic, Cyrillic, Hebrew, probably others.

UTF-8 can do Japanese, Chinese, Korean, plus others like Thai and Vietnamese.
There *might* be some limits to the extent of some of the characters, such as the many thousands of Chinese characters, but in general, UTF-8 is sufficient.

That's what I said more or less:


UTF-16 is for Asian and can also be used for Arabic, Cyrillic, Hebrew, probably others. In many cases though, depending upon the dialect/character set, UTF-8 is sufficient and should therefore be employed.

And there's no *might* about it. I have worked in this forum with certain Chinese and Hebrew dialects or versions that do require more character space. There are probably other languages in this category, basically those that cannot be truly rendered in an 8-bit-per-character system.

I haven't encountered any human language that required UTF-32, but there certainly could be some.

It's my understanding that this isn't necessarily about the number of characters, though that could be a factor. In my limited experience, it's about an individual character needing more bits to be represented. That's what I understand by character space: space for a given character.

So if 8 bits aren't enough for a specific character, you need UTF-16; and if 16 bits aren't enough, you would need a wider encoding.

techno_race
06-22-2012, 06:00 PM
UTF-8 can do Japanese, Chinese, Korean, plus others like Thai and Vietnamese.
There *might* be some limits to the extent of some of the characters, such as the many thousands of Chinese characters, but in general, UTF-8 is sufficient.

From my experience (I haven't taken time to study character encodings or anything), modern editors set to UTF-8 will express characters beyond the one-byte range as two (up to U+07FF), three (up to U+FFFF), or four (up to U+10FFFF) eight-bit bytes.

As to how the editor makes it clear that a character is two or three bytes, as opposed to one byte, in the file, I'm completely clueless.

Note: U+FF is ÿ, at the end of the Latin-1 Supplement block, so any characters past there (i.e., most of them) require more than eight bits to express; thus, given that encoding a character like ł (U+0142) in UTF-8 hasn't caused problems for me in the past, I can conclude that UTF-8 editors do somehow manage to express characters beyond eight bits.
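On the question of how the byte stream marks multibyte characters: UTF-8 is self-describing. A short Python sketch shows the pattern for a few sample characters:

```python
# The high bits of the first byte announce the sequence length:
# 0xxxxxxx = 1 byte, 110xxxxx = 2, 1110xxxx = 3, 11110xxx = 4;
# every continuation byte has the form 10xxxxxx.
for ch in ["y", "\u00ff", "\u0142", "\u20ac", "\U0002FA1D"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} byte(s))")
```

So a decoder (or editor) never has to guess where a character starts or how long it is; that information is in the first byte of each sequence.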

From this, I can conclude that UTF-16 and above are mostly pointless (as they would be less efficient than UTF-8 in most situations), unless there is some upper limit to the characters that can be expressed in UTF-8.

An upper limit seems not to apply, however; Notepad++ was able to convert U+2FA1D (a Chinese character pronounced pián, meaning tooth painting) as well as U+E0039 to ANSI (ó*€¹ð¯¨) and back without problems. The file containing just these two characters occupied eight eight-bit bytes, or 64 bits. The characters' decimal values (195101 and 917561) require 18 and 20 bits respectively, so expressing each as a single fixed-width unit would entail UTF-32. At two four-byte sequences, UTF-8 was thus no less compact, even toward the upper limit of non-PUA characters.

The highest character encoded in Unicode 6.2 is the private-use character U+10FFFD, which was expressed as four eight-bit bytes, 32 bits, and converted to ANSI (􏿽) and back without issue.
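Those byte counts are easy to verify in Python; note that for these two astral-plane characters UTF-8, UTF-16, and UTF-32 all come out at 8 bytes, while for plain ASCII text UTF-8 is the clear winner:

```python
samples = {
    "ASCII":  "hello",
    "astral": "\U0002FA1D\U000E0039",  # the two characters discussed above
}
for name, text in samples.items():
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        # UTF-16 uses surrogate pairs (4 bytes) for astral characters.
        print(name, enc, len(text.encode(enc)), "bytes")
# ASCII: 5 / 10 / 20 bytes; astral: 8 / 8 / 8 bytes
```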

Edit: An interesting development: apparently, the forum software couldn't express those characters as UTF-8.

djr33
06-22-2012, 08:38 PM
And there's no *might* about it. I have worked in this forum with certain Chinese and Hebrew dialects or versions that do require more character space. There probably are other languages that fall into this category - basically those that cannot be truly rendered in an 8-bit-per-character system.

I've never had any trouble with it, and I've used most of the languages you mention. Specific diacritics might be missing, but in general everything should be available. I can certainly imagine that many of the more obscure Chinese characters are missing, but Japanese has no trouble, and all of the other languages you mentioned (e.g., Arabic, Russian, etc.) have no trouble at all in UTF-8.
I'm not doubting UTF-16 has some applications, but I haven't run into them. I'll look into this; it's relevant for me as a linguist.

jscheuer1
06-22-2012, 10:59 PM
It's the difference between a generic or limited charset and a fuller charset that is truer to the actual written language.

You said it yourself:


Specific diacritics might be missing, but in general everything should be available. I can certainly imagine that many Chinese characters are missing (more obscure ones)

I think in some cases, or to any purist in a specific language, it would seem, or actually be, much more serious than that; even just that much might seem very serious to them.

A rough analogy in English might be if you suddenly couldn't print soft c's and silent p's. Those who know the language well would know what you meant and ignore it, especially if they're aware of the printing limitations you're laboring under; but if not, some of them might think you're stupid or uncultured. Others, less familiar with the language or the printing limitations, might be left scratching their heads as to what you really meant.

After looking back at what techno_race added, which I missed when first responding: some characters might be expressible in UTF-8 but, in practical usage, are more easily rendered in UTF-16. That would depend upon the charset and/or font. Or maybe they really do need UTF-16.

I stand by what I said though: I've worked with folks on two occasions in these forums where UTF-16 was required to render their pages correctly. One was in Chinese, the other in Hebrew. And yes djr33, at least in Hebrew these were characters with special marks, otherwise 'ordinary' Hebrew characters that take on a special meaning with an added mark. In Chinese, though, I think the characters were simply more complex than UTF-8 could support. My impression is that there are fewer limits on characters in Chinese, in that Chinese characters are more analogous to other languages' words than to their letters.

djr33
06-23-2012, 03:25 AM
Hm, perhaps it is Unicode/UTF-8 utilizing multibyte characters as techno_race noted above.


I think in some cases, or to any purist in a specific language, it would seem or actually be much more serious than that. And/Or that just that much would seem very serious to them.

So let's take the example of a Chinese newspaper. Would they really require UTF-16 for daily usage? I can't answer that. There is a much smaller set of daily-use characters than the 50,000+ technically in the language in the most comprehensive dictionaries, and most readers and writers don't know all of them. (Just as most speakers of English don't know the exact spelling of every single word in the language.)

Hebrew certainly works in UTF-8. But if you need to add special diacritics, that's fine; that's what I'd imagine UTF-16 is for. Exactly that: adding extra information.

Basically, the metaphor I'd choose for this is having accented characters unavailable for English. So you might need an extended character set for the é in fiancée if you choose to write it that way. (This is actually a good parallel example, from ASCII to UTF-8 and from UTF-8 to UTF-16.)

Not being able to spell certain common letters (e.g., silent c's) is, I think, a strong exaggeration.

However, most of my impression on this may be complicated by using multibyte UTF-8.
If that is effectively equivalent to UTF-16, then I may be overrepresenting what UTF-8 can do by itself.

jscheuer1
06-23-2012, 04:12 AM
So it's just shades of gray, or perhaps a bit more, as far as the argument goes. We basically agree.

I was under the impression though that the meaning of a word, unlike fiancée/fiancee, could change. My spellchecker BTW flags the latter as misspelled.

As to the technical side, I don't think I know any more than you do. I just know that the people I was dealing with couldn't seem to get their characters on the screen in the browser without UTF-16.

djr33
06-23-2012, 04:17 AM
I was under the impression though that the meaning of a word, unlike fiancée/fiancee, could change. My spellchecker BTW flags the latter as misspelled.

Mine too. But in the case of diacritics (not Chinese characters), that's what it would be like: a misspelling because a mark is missing. And in the case of fiancée, it's a bit pretentious to spell it that way, directly borrowing from French; it's debatable whether that's "English" or not. I'd usually just write fiancee, probably out of laziness more than anything. Regardless, even in French they could read the word without that diacritic; it would just look a little off.
I don't know much about Hebrew specifically, but Arabic has many, many special characters that are almost never used but represent very specific things, such as in the Qur'an; I'd suspect you might have been working on a site related to the Hebrew Bible for that reason. I don't think modern Hebrew requires much that is special. Like Arabic, it shouldn't be too hard to encode. (It's phonetic, nothing like Chinese characters.)


Yeah, I think we've reached a point of clarity.

This is interesting for me though. I'm very interested in encoding these languages. And I know a lot about the languages themselves, but I don't know too much about the details of encoding them except that Unicode (unlike basically everything else) just seems to work.
I'm building a language learning website intended to have text in any language potentially, so it's worth knowing if I need to move to UTF-16. I might.

techno_race
06-23-2012, 05:51 AM
it's worth knowing if I need to move to UTF-16.

This (http://www.babelstone.co.uk/unicode/babelmap.html), this (https://meta.wikimedia.org/wiki/List_of_Wikipedias), this (http://en.wikipedia.org/wiki/List_of_pangrams), and this (http://www.columbia.edu/~fdc/utf8/index.html) are all encoded in UTF-8, so I would assume that using UTF-8 wouldn't cause problems for Unicode support.

Relevant:
http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
http://en.wikipedia.org/wiki/UTF-8#Advantages_and_disadvantages

jscheuer1
06-23-2012, 06:10 AM
I'm seeing what look like garbage characters in the second 'this':

https://meta.wikimedia.org/wiki/List_of_Wikipedias


Словѣ́ньскъ / ⰔⰎⰑⰂⰡⰐⰠⰔⰍⰟ (cu) · Deutsch

BTW, what you were saying about this forum, it's encoded as ISO-8859-1 (windows-1252). So some chars are not supported.

djr33
06-23-2012, 06:32 AM
John, the characters you can't see probably don't display because you don't have any fonts that support them. Very few people have support for all Unicode characters (there's no need).
Over the past couple of years I've started accumulating fonts that cover many of them, but I'm far from having all. Oriya (an Indic language) was one of the hardest to find, I remember, and isn't included in any of the more general fonts; then again, it's not very useful for most of the world, though of course important for anyone who wants to use Oriya.

jscheuer1
06-23-2012, 03:49 PM
I think that might be part of the point. Perhaps in UTF-16 it could show the literal character; I've no idea whether that's true. But if it is, one could see the appeal from the page author's point of view.

techno_race
06-23-2012, 06:03 PM
I'm seeing what look like garbage characters in the second 'this'

This doesn't seem to be an encoding-related problem, but a browser issue.

Though at first glance it seems to me to be a font problem, I had three fonts installed that support the Glagolitic Unicode block (MPH 2B Damase, TITUS Cyberbit Basic, and Dilyana (http://www.obshtezhitie.net/texts/bgf/introd.html)), and it appeared the same way for me as well.

Browsers (I tested in Firefox alpha and Chrome) don't seem to be automatically selecting a font for that block, as they do for other blocks.

From a web development point of view, a solution would be embedding a font that supports that Unicode block via a CSS @font-face rule, and applying it to span tags around characters in that block.
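A sketch of that approach (the font file, family name, and class name here are hypothetical placeholders, not taken from the page in question):

```css
/* Hypothetical web font covering the Glagolitic block (U+2C00-U+2C5F). */
@font-face {
  font-family: "GlagoliticFallback";
  src: url("fonts/glagolitic.woff") format("woff");
}

/* Applied to spans wrapped around Glagolitic text. */
.glagolitic {
  font-family: "GlagoliticFallback", "TITUS Cyberbit Basic", serif;
}
```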

The Old Church Slavonic Wikipedia is also encoded in UTF-8, and I see this:
http://tortoisewrath.com/ob

The span containing that has a style="font-family: Vikidemia, TITUS Cyberbit Basic, Bukyvede, Ja, Unicode5;" on it*, explicitly telling the browser what fonts to look for that may support the Glagolitic block.

Not that supporting Old Church Slavonic is all that useful, anyway... :p

* Yes, I know that there should be single quotes around TITUS Cyberbit Basic, but that's how it was in the source.


BTW, what you were saying about this forum, it's encoded as ISO-8859-1 (windows-1252). So some chars are not supported.

I've come to the point where I automatically assume that everything supports Unicode.

Windows-1252 and ISO 8859-1 are different, BTW; the confusion between them arose from the ignorance of someone at the W3C.
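They differ in the 0x80-0x9F range: windows-1252 assigns printable characters there (curly quotes, the euro sign, etc.) where true ISO 8859-1 has only C1 control codes. A quick Python check:

```python
data = b"\x93quoted\x94"             # 0x93/0x94: curly quotes in cp1252
print(data.decode("cp1252"))         # “quoted”
# In strict ISO 8859-1 the same bytes are invisible C1 controls:
print(repr(data.decode("latin-1")))  # '\x93quoted\x94'
```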