techno_race
07-21-2007, 06:03 PM
Character Sets
The Question Mark
We all have seen it at least once:
Bj?rk Gu?mundsd?ttir, born in Reykjav?k, Iceland...
Or even:
Bj?rk?s new idea for...
So, what causes these problems? Well, let's take a closer look.
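One common way those question marks come about is that some piece of software converts the text to a character set that doesn't contain the character, and substitutes "?" for whatever it can't represent. Here's a rough sketch of that in Python (not part of the original examples; `to_charset` is just a made-up helper name):

```python
def to_charset(text, charset):
    """Force text into a charset, replacing unrepresentable characters with '?'."""
    # errors="replace" makes the encoder emit "?" instead of raising an error.
    raw = text.encode(charset, errors="replace")
    return raw.decode(charset)

# ASCII has none of the accented letters, nor the curly apostrophe (U+2019).
print(to_charset("Björk Guðmundsdóttir", "ascii"))   # Bj?rk Gu?mundsd?ttir
print(to_charset("Björk\u2019s", "ascii"))           # Bj?rk?s

# ISO-8859-1 does contain "ö", but still lacks the curly apostrophe.
print(to_charset("Björk\u2019s", "iso-8859-1"))      # Björk?s
```

Once the "?" has been written out, the original character is gone for good, which is why fixing the charset afterwards doesn't bring it back.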
Take, for example, 𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘.
If this is written in the source as:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Something</title>
</head>
<body>
<p>𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘</p>
</body>
</html>
it would display correctly.
(Please don't mind the DOCTYPE; I know it's Transitional, but this is about character sets. :))
Notice:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
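That tag says that the character encoding is UTF-8, which is an encoding of Unicode.
As 𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘 is made of Unicode characters and the file is saved as UTF-8, matching the declaration, this will display correctly.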
However, the following is incorrect:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Something</title>
</head>
<body>
<p>𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘</p>
</body>
</html>
because in this example, the character set is declared as Western European (ISO-8859-1).
ISO-8859-1 only covers 256 characters, and 𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘 is not in the ISO-8859-1 range, so this will not display correctly.
In Firefox, this will display as garbled text instead: each character turns into four Western European ones, starting ð¢£§ð¢£¹ and so on.
This may still display correctly in Internet Explorer, as IE doesn't always go by the declared character set and can guess the encoding itself.
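To see where the garbage comes from, here's a small Python sketch, assuming the file was saved as UTF-8 while the page declares ISO-8859-1 (`misread` is a made-up name, and real browsers may apply their own rules on top of this):

```python
def misread(text, actual="utf-8", declared="iso-8859-1"):
    """Encode text with the real encoding, then decode it with the declared one."""
    return text.encode(actual).decode(declared)

first = "\U000228E7"            # the first character of the example string

# In UTF-8 this single character takes four bytes.
print(first.encode("utf-8"))    # b'\xf0\xa2\xa3\xa7'

# Read as ISO-8859-1, each of those bytes becomes its own character.
print(misread(first))           # ð¢£§
```

The bytes in the file are fine; it's only the label telling the browser how to read them that is wrong.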
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Something</title>
</head>
<body>
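<p>&#x228E7;&#x228F9;&#x228E5;&#x228C7;&#x228E9;&#x228D4;&#x228C8;&#x228E6;&#x22905;&#x228F5;&#x228E5;&#x2292A;&#x2290C;&#x228D2;&#x228C3;&#x228D8;&#x228F9;&#x22905;&#x228E3;&#x228E9;&#x22906;&#x22902;&#x228E4;&#x228EC;&#x22919;&#x22918;</p>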
</body>
</html>
and
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Something</title>
</head>
<body>
<p>&#141543;&#141561;&#141541;&#141511;&#141545;&#141524;&#141512;&#141542;&#141573;&#141557;&#141541;&#141610;&#141580;&#141522;&#141507;&#141528;&#141561;&#141573;&#141539;&#141545;&#141574;&#141570;&#141540;&#141548;&#141593;&#141592;</p>
</body>
</html>
are correct, however, as numeric character references (such as &#141543;) point to Unicode code points directly, no matter which character set the page declares.
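If you don't want to look up the numbers by hand, something like this Python sketch can generate the references for you (`to_ncr` is a made-up helper; `html.unescape` is from the standard library and is only used here to check the round trip):

```python
import html

def to_ncr(text, hex_form=False):
    """Replace every non-ASCII character with a numeric character reference."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                    # plain ASCII passes through
        elif hex_form:
            out.append(f"&#x{ord(ch):X};")    # hexadecimal form
        else:
            out.append(f"&#{ord(ch)};")       # decimal form
    return "".join(out)

first = "\U000228E7"                          # first character of the example
print(to_ncr(first))                          # &#141543;
print(to_ncr(first, hex_form=True))           # &#x228E7;

# The result is pure ASCII, so it survives any ASCII-compatible charset.
print(html.unescape(to_ncr(first)) == first)  # True
```

The catch is that the source becomes unreadable for humans, so saving and declaring the page as UTF-8 is usually the nicer fix.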
Thanks for reading! :D