
Thread: storing and retrieving unicode data.

  1. #1
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    1,694
    Thanks
    82
    Thanked 90 Times in 88 Posts

    Storing and retrieving Unicode data.

    If I store a Unicode character in a database field with the utf8_unicode_ci collation, then retrieve it with AddDefaultCharset utf8 set in the server configuration and display it, the character displays correctly. However, if I display it in an input field, I get the character's numeric Unicode value instead.

    Is there a way to retrieve the unicode character and display it as such?

    In a separate but related problem, I notice that when a password contains a Unicode character, typing out the character's numeric code instead of the character itself also matches. Is there a way to change this so that only the actual Unicode character, used alongside letters and numbers, is valid, and writing out its numeric value does not work as a way around typing the character?
    To choose the lesser of two evils is still to choose evil. My personal site

  2. #2
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,162
    Thanks
    263
    Thanked 690 Times in 678 Posts


    I'm not sure exactly what you're describing with the numbers. Could you post an example? Do you mean that it shows the HTML entity (&#1234;) or that it actually just shows the numbers (1234)?

    The encoding of everything* needs to match. If everything is properly set to UTF-8 then it should be fine. From what you've said, my first guess is that your HTML page doesn't have a meta tag with the encoding set to UTF-8. Could you have overlooked that?

    *Absolutely everything: HTML page, file-format encoding of the HTML page, database settings, PHP processing settings, headers sent by the server, etc. Usually it's just an issue of the HTML page and the database encoding.

    By the way, I've had a working system before where the database is encoded in a random format but the HTML page is UTF8. It worked in a sort of encoding/decoding loop so that everything was fine. But I couldn't edit the database directly because the text as stored in the DB was wrong. Be careful you don't run into anything like that. It's much easier to "fix" it now rather than actually having messed up data.
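A minimal sketch of the mismatch and the "encoding/decoding loop" described above, written in Python only for illustration (the thread itself is about PHP and MySQL). The sample text and the choice of Latin-1 as the wrong encoding are assumptions; the point is that the stored form is wrong even though the round trip looks fine.

```python
# Hypothetical sample text; any non-ASCII string behaves the same way.
text = "espa\u00f1ol \u22b5"

# The bytes a UTF-8 page or connection would actually send.
utf8_bytes = text.encode("utf-8")

# A layer that wrongly assumes Latin-1 turns every byte into one character.
# This is the garbled form that would sit in the database.
stored_wrong = utf8_bytes.decode("latin-1")

# Reversing the same wrong assumption on the way out restores the text,
# which is why the page can look correct while the stored data is not.
round_tripped = stored_wrong.encode("latin-1").decode("utf-8")

print(repr(stored_wrong))
print(round_tripped == text)
```

The garbled intermediate string is longer than the original because each multi-byte UTF-8 character becomes several Latin-1 characters, which is what makes direct edits in the database awkward.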
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  3. #3
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    1,694
    Thanks
    82
    Thanked 90 Times in 88 Posts


    It could be due to inexperience with matching the encoding. I thought they matched. Right now I am using AddDefaultCharset UTF8 in conjunction with the MySQL collation setting of utf8_binary_ci. If this is incorrect what should I be using?

    Not sure if this example will work, but let's say I have a password: mypass⊵GG
    When I store it in the database and retrieve it, the password displays as mypass&#8885;88
    Now when I use mypass&#8885;88 or mypass⊵GG, both versions work. The problem is that the first one is much easier to brute force attack.
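A rough Python sketch of why both spellings could match, assuming that some layer decodes HTML entities before the comparison (that is a guess, not something the thread confirms). The sample values and the function names `loose_match` and `strict_match` are hypothetical. In PHP the analogous decoding call would be `html_entity_decode`.

```python
import hmac
import html

typed_literal = "mypass&#8885;GG"  # the characters & # 8 8 8 5 ; typed out
typed_char = "mypass\u22b5GG"      # the single character U+22B5

def loose_match(a, b):
    """Compare after decoding entities: both spellings collapse to one."""
    return html.unescape(a) == html.unescape(b)

def strict_match(a, b):
    """Compare the raw UTF-8 bytes, so only the exact input matches."""
    return hmac.compare_digest(a.encode("utf-8"), b.encode("utf-8"))

print(loose_match(typed_literal, typed_char))
print(strict_match(typed_literal, typed_char))
```

If the strict comparison is what is wanted, the input has to reach the comparison untouched, with no entity decoding or encoding on either the stored or the submitted value.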

  4. #4
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    1,694
    Thanks
    82
    Thanked 90 Times in 88 Posts


    I had to update my post to get it to display in the forums correctly. I forgot about that aspect of this forum.

  5. #5
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,162
    Thanks
    263
    Thanked 690 Times in 678 Posts


    Using [color=black] tags around the text can get around the automated interpretation of the entities (and the same works for links, bbcode, etc).

    Hm. It sounds like the browser may be interpreting your input as if you intended to enter the entity in the form. In other words, I don't believe that what PHP receives is actually the numeric entity as you typed it. Check that. I think the browser is changing your input and then submitting the changed version to PHP. A brute force attack would not change its input in the same way. And even if it did, the password would still be very difficult to crack: it's still 7 extra characters to add, and the attacker would also have to know to interpret them as UTF-8 where applicable. That's not what most brute force algorithms would do.

    Instead, what I think you've found may be a browser bug/limitation, that you can't have literal &#1234; in your password since the browser will replace it with the character rather than the literal input.*

    Does that sound right? I'm not sure either, but it makes sense to me.
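One possible mechanism, offered as a sketch and not as a diagnosis: when a form is submitted in an encoding that cannot represent a character, browsers typically substitute a numeric character reference for it. Python's `xmlcharrefreplace` error handler mimics that substitution. The sample password and the choice of Latin-1 as the page encoding are assumptions.

```python
# Hypothetical password containing U+22B5.
password = "mypass\u22b5GG"

# Submitted from a UTF-8 page: the character survives as three bytes.
as_utf8 = password.encode("utf-8")

# Submitted from a Latin-1 page: U+22B5 has no Latin-1 byte, so it is
# replaced by the text of its numeric character reference.
as_latin1 = password.encode("latin-1", errors="xmlcharrefreplace")

print(as_utf8)
print(as_latin1)
```

If this is what is happening, the server would see the same bytes whether the character or the typed-out entity was entered, which would explain both spellings matching.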


    (*As a tangent, that means that perhaps brute force algorithm makers should in fact always do the same thing and that would make cracking passwords marginally easier in that they wouldn't have to go in a longer loop for each character, just use more characters to make those; but the math is just as difficult-- it's still many many more iterations, so it's not any less secure. It's just a little different to program.
    Or perhaps this is the secret to a strong password-- use the literal HTML entity text rather than the UTF8 character; somehow trick the browser into submitting it literally, and you have something no one else would think of or even be able to type. It wouldn't be immune to brute force, but it would be creative...)

  6. #6
    Join Date
    Apr 2008
    Location
    So.Cal
    Posts
    3,643
    Thanks
    63
    Thanked 517 Times in 503 Posts
    Blog Entries
    5


    I would use utf8_general_ci instead of utf8_binary_ci. I'm not saying that's your problem, but it might be part of it. If you're not storing binary data, why specify a binary collation?

    Also, in addition to the <meta> tag, you should make sure your server is setting encoding correctly when it sends the HTTP headers.
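As a small illustration of checking the header side, here is a Python sketch that pulls the declared charset out of a Content-Type header value. The helper name `declared_charset` and the sample header strings are hypothetical; the actual header would come from the server's response.

```python
from email.message import Message

def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None
    # when no charset parameter is present.
    return msg.get_content_charset()

print(declared_charset("text/html; charset=UTF-8"))
print(declared_charset("text/html"))
```

A response whose header declares no charset, or a different one than the meta tag, is one of the mismatches worth ruling out.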

  7. #7
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    1,694
    Thanks
    82
    Thanked 90 Times in 88 Posts


    I think you may be right about the browser bug/limitation, or maybe this is an error with $_POST. I don't know how I would test for it. I tried entering my password with the Unicode character in Opera, Firefox, Chrome, and IE8, and noticed that IE8 and Firefox got the character wrong. Opera and Chrome generated the correct character when I typed ALT + the Unicode number.

    For example, ALT+1212 produces a different character depending on the browser used. Even so, I will ignore this minor limitation of my password, where typing in the HTML entity works just as well as the Unicode character, since both would be rather difficult to brute force crack.
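One way to test what each browser actually submitted is to log the code points of the received value instead of looking at how it renders. A minimal Python sketch, with the helper name `code_points` and the sample strings as assumptions:

```python
def code_points(s):
    """List each character of s as a U+XXXX code point."""
    return [f"U+{ord(ch):04X}" for ch in s]

# The real character is a single code point...
print(code_points("\u22b5"))

# ...while the typed-out entity is seven ordinary ASCII characters.
print(code_points("&#8885;"))
```

Logging the submitted password field this way from each browser would show whether the character, the entity text, or something else entirely arrived.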

    I still get the occasional hack attempt to my site, but I have not seen any actual brute force attack yet. Mostly people just try using my IP address or various sql injections. It's sort of interesting to watch what people try. My guess is that the spiders are the ones that don't try any password.

  8. #8
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    1,694
    Thanks
    82
    Thanked 90 Times in 88 Posts


    Sorry Traq, I misspoke earlier. I am using utf8_unicode_ci. I was using utf8_general_ci before and thought that was my problem, but I get the same results.
