Thread: Character Sets: The Question Mark

  1. #1

    Character Sets
    The Question Mark

    We all have seen it at least once:
    Bj?rk Gu?mundsd?ttir, born in Reykjav?k, Iceland...
    Or even:
    Bj?rk?s new idea for...
    So, what causes these problems? Well, let's take a closer look.

    Take, for example, 𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘.

    If this is written in the source as:

    HTML Code:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <title>Something</title>
    </head>
    
    <body>
    <p>𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘</p>
    </body>
    </html>
    (please don't mind the DOCTYPE, I know it's Transitional, but this is about character sets.)
    it would display correctly. Notice:
    HTML Code:
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    That tag says that the character set is UTF-8, or Unicode.
    As 𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘 is Unicode, this will display correctly.

    However, the following is incorrect:
    HTML Code:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    <title>Something</title>
    </head>
    
    <body>
    <p>𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘</p>
    </body>
    </html>
    because in this example, the character set is set as Western European (ISO-8859-1).
    𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘 is not in the ISO-8859-1 range, so this will not display correctly.
    In Firefox, this will display as 𢣧𢣹𢣥𢣇𢣩𢣔𢣈𢣦𢤅𢣵𢣥𢤪𢤌𢣒𢣃𢣘𢣹𢤅𢣣𢣩𢤆𢤂𢣤𢣬𢤙𢤘.
    This will display correctly in Internet Explorer, as IE doesn't pay any attention whatsoever to character sets.
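
The mismatch described above can be reproduced outside a browser. The following is a minimal Python sketch, not a model of how Firefox works internally: it takes the first character of the example, encodes it as UTF-8, and then decodes those bytes as ISO-8859-1, which is roughly what happens when the declared charset does not match the bytes actually in the file. The variable names are illustrative only.

```python
# First character of the example, written by code point so that this
# snippet does not depend on the encoding of the file it is saved in.
ch = "\U000228E7"

# The bytes a UTF-8 encoded page would actually contain for it.
utf8_bytes = ch.encode("utf-8")
print(utf8_bytes.hex())

# A decoder told "iso-8859-1" maps every byte to exactly one character,
# so one character turns into four unrelated ones.
misread = utf8_bytes.decode("iso-8859-1")
print(len(ch), len(misread))
print([hex(ord(c)) for c in misread])

# Decoding with the encoding that was really used restores the character.
print(utf8_bytes.decode("utf-8") == ch)
```

Note that no error is raised in the wrong case: ISO-8859-1 assigns a character to every byte value, so the mistake only shows up as wrong text on screen.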

    HTML Code:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    <title>Something</title>
    </head>
    
    <body>
    <p>&#x228E7;&#x228F9;&#x228E5;&#x228C7;&#x228E9;&#x228D4;&#x228C8;&#x228E6;&#x22905;&#x228F5;&#x228E5;&#x2292A;&#x2290C;&#x228D2;&#x228C3;&#x228D8;&#x228F9;&#x22905;&#x228E3;&#x228E9;&#x22906;&#x22902;&#x228E4;&#x228EC;&#x22919;&#x22918;</p>
    </body>
    </html>
    and
    HTML Code:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    <title>Something</title>
    </head>
    
    <body>
    <p>&#141543;&#141561;&#141541;&#141511;&#141545;&#141524;&#141512;&#141542;&#141573;&#141557;&#141541;&#141610;&#141580;&#141522;&#141507;&#141528;&#141561;&#141573;&#141539;&#141545;&#141574;&#141570;&#141540;&#141548;&#141593;&#141592;</p>
    </body>
    </html>
    are correct, however, as numeric character references are written in plain ASCII and can point to any Unicode code point, whatever encoding the document declares.
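
As a rough check of the two examples above, this Python sketch resolves the hexadecimal and decimal references with the standard library's `html.unescape` and confirms they name the same code point. `to_char_refs` is a hypothetical helper written for this illustration, not part of any library.

```python
import html

HEX_REF = "&#x228E7;"   # hexadecimal form, as in the first example
DEC_REF = "&#141543;"   # decimal form, as in the second example

# Both references resolve to the same code point, U+228E7.
resolved_hex = html.unescape(HEX_REF)
resolved_dec = html.unescape(DEC_REF)
print(resolved_hex == resolved_dec == "\U000228E7")

# The reference is pure ASCII, so it can be stored in an ISO-8859-1 file ...
ref_bytes = HEX_REF.encode("iso-8859-1")

# ... whereas the raw character cannot.
try:
    "\U000228E7".encode("iso-8859-1")
    raw_fits = True
except UnicodeEncodeError:
    raw_fits = False
print(raw_fits)


def to_char_refs(text):
    """Hypothetical helper: replace each non-ASCII character with a hex reference."""
    return "".join(c if ord(c) < 128 else f"&#x{ord(c):X};" for c in text)


print(to_char_refs("<p>\U000228E7\U000228F9</p>"))
```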

    Thanks for reading!
    ....(o_ Penguins
    .---/(o_- techno_racing
    +(---//\-' in
    .+(_)--(_)' The McMurdo 500

  2. #2

    Quote Originally Posted by techno_race
    That tag says that the character set is UTF-8, or Unicode.
    The character set is Unicode; the encoding (the way the characters from that set are represented as bytes) is UTF-8. They are not two ways of saying the same thing; in fact, most people, when talking about "Unicode" as an encoding, mean UTF-16. I lost a mark in a practice paper recently due to this inaccuracy by the examiners.
    However, the <meta> tag only applies if the encoding cannot be determined from any other source. In the majority of cases, the server will send a default encoding in the HTTP headers (usually, alas, ISO-8859-1), and this will override any <meta> tag present. Also, <meta> tags can only be used to specify encodings that are compatible with ASCII: UTF-8 and ISO-8859-1 are possible, but UTF-16, UTF-32, and many others aren't.
    Setting the character set using <meta> tags is thus a very inflexible and unreliable practice, and configuring your server to send the correct encoding in the Content-Type header is vastly preferable.
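
To make the character set / encoding distinction concrete, here is a small Python sketch (Python is only a convenient illustration; nothing here is specific to browsers). It encodes a single Unicode character, the "ö" from "Björk", in four encodings, and then checks which of them leave the ASCII text `<meta` byte-for-byte unchanged, which is the property a parser needs in order to find the tag before it knows the encoding.

```python
ch = "\u00F6"  # "ö": one character, one Unicode code point

# The same code point becomes different bytes under different encodings.
encoded = {
    name: ch.encode(name)
    for name in ("iso-8859-1", "utf-8", "utf-16-be", "utf-32-be")
}
for name, data in encoded.items():
    print(f"{name:11} {data.hex()}")

# "ASCII-compatible" means ASCII text keeps its ASCII bytes.
tag = "<meta"
ascii_compatible = {
    name: tag.encode(name) == tag.encode("ascii")
    for name in encoded
}
print(ascii_compatible)
```

In UTF-16 and UTF-32 the bytes of `<meta` are interleaved with zero bytes, so a parser scanning for the ASCII bytes of the tag will not find it.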
    Quote Originally Posted by techno_race
    This will display correctly in Internet Explorer, as IE doesn't pay any attention whatsoever to character sets.
    Not quite. It will often attempt to guess the entire content type of the page, including encoding. This isn't always the case, though, and it can't always be relied upon to do it correctly: it's prone to making mistakes.
    Quote Originally Posted by techno_race
    If this is written in the source as:
    Code:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <title>Something</title>
    </head>
    
    <body>
    <p>⣧⣹⣥⣇⣩⣔⣈⣦⤅⣵⣥⤪⤌⣒⣃⣘⣹⤅⣣⣩⤆⤂⣤⣬⤙⤘</p>
    </body>
    </html>
    [...] it would display correctly.
    Not necessarily. It would only display correctly if the text was indeed UTF-8. There are other encodings that can represent that Braille sequence; if the text was encoded in one of those and you told the browser to use UTF-8, it would not display correctly.
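
A minimal Python sketch of that failure, assuming for illustration that the file was saved as UTF-16 (little-endian, no byte order mark) while the page declares UTF-8. The three code points are the first three characters of the quoted sequence; the variable names are made up for this example.

```python
# First three characters of the quoted sequence, by code point.
braille = "\u28E7\u28F9\u28E5"

# Assumption: the file on disk was saved as UTF-16 little-endian.
saved_bytes = braille.encode("utf-16-le")

# Decoding with the declared (wrong) encoding. errors="replace" stands in
# for a browser's lenient handling: invalid byte sequences become U+FFFD.
as_declared = saved_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding that was really used.
as_saved = saved_bytes.decode("utf-16-le")

print(as_declared == braille)   # the declared encoding does not recover the text
print(as_saved == braille)      # the real encoding does
print("\ufffd" in as_declared)  # replacement characters appear instead
```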
    Quote Originally Posted by techno_race
    (please don't mind the DOCTYPE, I know it's Transitional, but this is about character sets.)
    But that document is perfectly valid Strict... if the DOCTYPE doesn't matter, why did you use Transitional?
    Twey | I understand English | 日本語が分かります | mi jimpe fi le jbobau | mi esperanton komprenas | je comprends français | entiendo español | tôi ít hiểu tiếng Việt | ich verstehe ein bisschen Deutsch | beware XHTML | common coding mistakes | tutorials | various stuff | argh PHP!

  3. #3

    Your presentation, techno_race, is confused. An incorrectly specified character encoding scheme will cause the wrong characters to be displayed. However, when question marks are rendered, this is usually a sign that the font in use does not contain the necessary glyphs for a given codepoint.

    Quote Originally Posted by Twey
    Quote Originally Posted by techno_race
    If [uncommon characters are] written in the source [with the correct encoding, those characters] would display correctly.
    Not necessarily. It would only display correctly if the text was indeed [encoded correctly].
    And not necessarily even then; it would be processed correctly, but that's not the same as being displayed correctly.
    Mike
