Character Sets
The Question Mark
We have all seen it at least once:
Bj?rk Gu?mundsd?ttir, born in Reykjav?k, Iceland...
Or even:
So, what causes these problems? Well, let's take a closer look.
Take, for example, 𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘.
If this is written in the source as:
HTML Code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Something</title>
</head>
<body>
<p>𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘</p>
</body>
</html>
(Please don't mind the DOCTYPE. I know it's Transitional, but this is about character sets.)
it would display correctly, provided the file itself is saved as UTF-8. Notice:
HTML Code:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
That tag declares the document's character encoding as UTF-8, which is an encoding of Unicode.
As 𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘 is Unicode, this will display correctly.
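To make the Unicode/UTF-8 distinction concrete, here is a minimal Python sketch (Python is my choice for illustration, it is not part of the original example). It takes the first character of the sample text, U+228E7, and shows the bytes a file saved as UTF-8 would contain for it. The variable names are just placeholders.

```python
# First character of the sample text: U+228E7 (decimal 141543).
ch = "\U000228E7"

code_point = ord(ch)             # the abstract Unicode number
utf8_bytes = ch.encode("utf-8")  # the bytes a UTF-8 file would contain

print(f"U+{code_point:X}")       # U+228E7
print(utf8_bytes.hex(" "))       # f0 a2 a3 a7
```

So one character becomes four bytes on disk, and the meta tag is what tells the browser how to turn those bytes back into a character.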
However, the following is incorrect:
HTML Code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Something</title>
</head>
<body>
<p>𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘</p>
</body>
</html>
because in this example, the character set is set as Western European (ISO-8859-1).
𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘 is not in the ISO-8859-1 range, so this will not display correctly.
In Firefox, this will display as 𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘.
This may still display correctly in Internet Explorer, which does not always honor the declared character set.
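Here is a rough Python sketch of what goes wrong, assuming the file was saved as UTF-8 but is read as ISO-8859-1. It only illustrates the mechanism for the first character of the sample text; it is not a claim about what any particular browser renders. It also shows where the famous question mark can come from: a lossy conversion into a character set that lacks the character.

```python
ch = "\U000228E7"                # first character of the sample text
utf8_bytes = ch.encode("utf-8")  # four bytes: f0 a2 a3 a7

# A decoder told "iso-8859-1" turns every single byte into one character.
misread = utf8_bytes.decode("iso-8859-1")
print(len(misread))              # 4 characters instead of 1

# ISO-8859-1 has no slot for this character, so strict encoding fails.
try:
    ch.encode("iso-8859-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)                 # False

# A lossy conversion substitutes a question mark instead of failing.
fallback = ch.encode("iso-8859-1", errors="replace")
print(fallback)                  # b'?'
```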
HTML Code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Something</title>
</head>
<body>
<p>𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘</p>
</body>
</html>
and
HTML Code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Something</title>
</head>
<body>
<p>&#141543;&#141561;&#141541;&#141511;&#141545;&#141524;&#141512;&#141542;&#141573;&#141557;&#141541;&#141610;&#141580;&#141522;&#141507;&#141528;&#141561;&#141573;&#141539;&#141545;&#141574;&#141570;&#141540;&#141548;&#141593;&#141592;</p>
</body>
</html>
are correct, however, as numeric character references can point to any Unicode character, whatever character set the document declares.
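If you want to generate such references yourself, here is a small Python sketch using only the standard library (again, Python is just my pick for illustration). It builds the decimal reference for the first character of the sample text, shows that the reference is plain ASCII and therefore survives ISO-8859-1, and decodes it back with `html.unescape`.

```python
import html

ch = "\U000228E7"  # first character of the sample text

# Build a decimal numeric character reference from the code point.
reference = f"&#{ord(ch)};"
print(reference)   # &#141543;

# Python can also do this while encoding: unencodable characters
# are replaced by their numeric character references.
auto = ch.encode("ascii", errors="xmlcharrefreplace")
print(auto)        # b'&#141543;'

# The reference is pure ASCII, so an ISO-8859-1 round trip cannot damage it.
as_latin1 = reference.encode("iso-8859-1")
restored = html.unescape(as_latin1.decode("iso-8859-1"))
print(restored == ch)  # True
```

The number 141543 is the same one that starts the reference list in the example above, which is why that page works with any declared character set.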
Thanks for reading!