Thread: Convert non-ASCII to HTML UNICODE entities

  1. #1
    Join Date
    Mar 2005
    Location
    SE PA USA
    Posts
    30,495
    Thanks
    82
    Thanked 3,449 Times in 3,410 Posts
    Blog Entries
    12

    Convert non-ASCII to HTML UNICODE entities

    I was recently working on a PHP related issue in the javascript forum:

    http://www.dynamicdrive.com/forums/s...ad.php?t=53535

    I got it working on WAMP, but the OP apparently either 'doesn't get it', or there are some other issues involved.

    Anyways, if you look at the thread you will see that the OP was setting a PHP variable to:

    Užimtas
    From my point of view it was primarily the ž in Užimtas that was messing things up.

    Regardless of whether my approach/solution there is the answer, it occurs to me that there are situations where one would like to convert characters like ž to their HTML UNICODE entities (in this case &#0382;) before sending the result to an AJAX request or wherever. I tried htmlentities, but that seems to do sort of the opposite of what I want: it preserves an entity as is without allowing it to be converted, while what I want is to detect any character that isn't ASCII, convert it to its HTML UNICODE entity, and send that in such a way that it can be converted back to the character it represents on the receiving end.

    I looked at:

    utf8_decode

    and:

    utf8_encode

    But I am unsure if either would be the best way, or even be applicable at all. It's fairly clear that if either of those is used for this, it would probably have to be as part of a moderately complex custom PHP function.

    If you understand what I'm after and have any suggestions, it would be appreciated. If you need more explanation, feel free to ask.
    - John
    ________________________

    Show Additional Thanks: International Rescue Committee - Donate or: The Ocean Conservancy - Donate or: PayPal - Donate

  2. #2
    Join Date
    May 2007
    Location
    Boston,ma
    Posts
    2,127
    Thanks
    173
    Thanked 207 Times in 205 Posts

    Not sure if I quite understand. Which of these is the case?

    The owner is trying to use the special character within the username Užimtas as a variable name, like $Užimtas.

    The owner is trying to assign that name to a variable: $user = "Užimtas";

    If the latter is the case, it should be converted. I think for other usage, such as HTML, it should work.

    As a php conversion script though
    Code:
        $input = $_POST['content?fields'];
        // Every named entity needs its trailing semicolon.
        $patterns[] = '/À/';
        $replacements[] = '&Agrave;';
        $patterns[] = '/à/';
        $replacements[] = '&agrave;';
        $patterns[] = '/Á/';
        $replacements[] = '&Aacute;';
        $patterns[] = '/á/';
        $replacements[] = '&aacute;';
        $patterns[] = '/Â/';
        $replacements[] = '&Acirc;';
        $patterns[] = '/â/';
        $replacements[] = '&acirc;';
        $patterns[] = '/Ã/';
        $replacements[] = '&Atilde;';
        $patterns[] = '/ã/';
        $replacements[] = '&atilde;';
        $patterns[] = '/Ä/';
        $replacements[] = '&Auml;';
        $patterns[] = '/ä/';
        $replacements[] = '&auml;';
        $patterns[] = '/Å/';
        $replacements[] = '&Aring;';
        $patterns[] = '/å/';
        $replacements[] = '&aring;';
        $patterns[] = '/È/';
        $replacements[] = '&Egrave;';
        $patterns[] = '/è/';
        $replacements[] = '&egrave;';
        $patterns[] = '/É/';
        $replacements[] = '&Eacute;';
        $patterns[] = '/é/';
        $replacements[] = '&eacute;';
        $patterns[] = '/Ê/';
        $replacements[] = '&Ecirc;';
        $patterns[] = '/ê/';
        $replacements[] = '&ecirc;';
        $patterns[] = '/Ë/';
        $replacements[] = '&Euml;';
        $patterns[] = '/ë/';
        $replacements[] = '&euml;';
        $patterns[] = '/Ì/';
        $replacements[] = '&Igrave;';
        $patterns[] = '/ì/';
        $replacements[] = '&igrave;';
        $patterns[] = '/Í/';
        $replacements[] = '&Iacute;';
        $patterns[] = '/í/';
        $replacements[] = '&iacute;';
        $patterns[] = '/Î/';
        $replacements[] = '&Icirc;';
        $patterns[] = '/î/';
        $replacements[] = '&icirc;';
        $patterns[] = '/Ï/';
        $replacements[] = '&Iuml;';
        $patterns[] = '/ï/';
        $replacements[] = '&iuml;';
        $patterns[] = '/Ñ/';
        $replacements[] = '&Ntilde;';
        $patterns[] = '/ñ/';
        $replacements[] = '&ntilde;';
        $patterns[] = '/Ò/';
        $replacements[] = '&Ograve;';
        $patterns[] = '/ò/';
        $replacements[] = '&ograve;';
        $patterns[] = '/Ó/';
        $replacements[] = '&Oacute;';
        $patterns[] = '/ó/';
        $replacements[] = '&oacute;';
        $patterns[] = '/Ô/';
        $replacements[] = '&Ocirc;';
        $patterns[] = '/ô/';
        $replacements[] = '&ocirc;';
        $patterns[] = '/Õ/';
        $replacements[] = '&Otilde;';
        $patterns[] = '/õ/';
        $replacements[] = '&otilde;';
        $patterns[] = '/Ö/';
        $replacements[] = '&Ouml;';
        $patterns[] = '/ö/';
        $replacements[] = '&ouml;';
        $patterns[] = '/Ø/';
        $replacements[] = '&Oslash;';
        $patterns[] = '/ø/';
        $replacements[] = '&oslash;';
        $patterns[] = '/Ù/';
        $replacements[] = '&Ugrave;';
        $patterns[] = '/ù/';
        $replacements[] = '&ugrave;';
        $patterns[] = '/Û/';
        $replacements[] = '&Ucirc;';
        $patterns[] = '/û/';
        $replacements[] = '&ucirc;';
        $patterns[] = '/Ü/';
        $replacements[] = '&Uuml;';
        $patterns[] = '/ü/';
        $replacements[] = '&uuml;';
        $patterns[] = '/Ý/';
        $replacements[] = '&Yacute;';
        $patterns[] = '/ÿ/';
        $replacements[] = '&yuml;';
        $contents = preg_replace($patterns, $replacements, $input);
        // For later use (varies by usage): echo htmlspecialchars($contents);
        // For use at the current location: echo $contents;
    Corrections to my coding/thoughts welcome.
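
A hand-built table like the one above can usually be replaced by PHP's built-in htmlentities(), which already knows the HTML named entities. This is only a minimal sketch, assuming the input is UTF-8; the sample string is made up for illustration:

```php
<?php
// Assumption: $input is UTF-8 text. The sample value is hypothetical.
$input = "À la carte, café, señor";

// htmlentities() maps every character that has a named entity in the
// chosen entity set to that entity, including & < > and both quote types.
$contents = htmlentities($input, ENT_QUOTES, 'UTF-8');

echo $contents, "\n";
```

Characters with no named entity in the default HTML 4.01 set, such as ž, pass through unchanged, so this does not by itself produce numeric entities.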

  3. #3
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    I'm also not sure if you mean that the variable name or the variable value should have a strange character in it. For PHP variables I'd suggest omitting unusual characters because the system won't like it and in general code must remain in English (such as function names).

    As for values, there should be no problems whatsoever with storing absolutely anything (unless it needs to be escaped, such as quotes) in any PHP variable.

    I'm working with lots of strange characters at the moment while building a language-learning website and I haven't had any issues within PHP.

    The issues always occur with MySQL or HTML. The problem is always the encoding and getting everything to cooperate.

    One other issue that can be very confusing that might be the problem here is the text encoding of the text document (php file) itself. This is *not* the HTML code at the top of the page, but the format of the .txt (or .php) file within the operating system. In a program like notepad you can select what format you want to use.

    Always use utf8 if there's any question about unusual characters.

    However, some programs will randomly change the encoding or not give a choice about format.

    One problem I had was using dreamweaver: it randomly switched the encoding to something else (not utf8) if there were no other characters present, just what was supported in the standard character set. So for this reason I put a comment at the end of a lot of files that read something like: //preserve encoding in DW あ.
    That Japanese character would then force DW to save the file encoding still as unicode and not revert back to whatever it felt like.

    Another problem I know about is a similar issue with FTP programs: if you use some FTP programs they will modify the contents of the files you upload, such as altering whitespace formatting (\r=>\n, etc) or changing the character encoding.


    The only real answer to all of this is to either find a workaround (which is usually a pain) or just use a better program that isn't annoying.


    Note: everything I have said is entirely about the variables typed directly into the .php files. Other issues like MySQL and HTML will still exist but won't be related to this specifically. In general, if you enter something and it is accepted by PHP another way (such as through a form), it will reliably remain in that same form when echoed, etc.

    For example, if you type a Japanese character into a form, then print $_POST on the next page, it will probably be in the same form that it was when you typed it into the text field. If you get it from *any* other source, that is not guaranteed unless you have carefully set all encodings to match. But since the source and destination are the same, it should not be a problem. If possible, set up a system where source and destination are the same and things will be easier.


    In general, my advice is to double check that all encodings are the same:
    1. text document encoding (see above)
    2. HTML encoding (in a meta tag)
    3. Database-- both the table and the default setting (or things can get confusing)
    4. Serve the page as that encoding (probably default, but might be weird on some servers).
    5. Your FTP program must upload without changing the encoding.
    6. Any location (ie, word document) where you might cut/paste from.
    7. The working encoding in all relevant programs such as editors and browsers, just to be sure that it's all compatible.


    Use UTF8 for everything and life should be easier (assuming you can actually get everything to use UTF8).
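
One way to enforce the "everything in UTF-8" advice at the PHP level is to check incoming strings and convert the ones that are not valid UTF-8. The sketch below is only an illustration: the helper name toUtf8 and the choice of Windows-1252 as the fallback encoding are assumptions, and it needs the mbstring extension.

```php
<?php
// Hypothetical helper: return $text as UTF-8, assuming that anything
// which is not already valid UTF-8 was saved as Windows-1252.
function toUtf8(string $text, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;               // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

$legacy = "U\x9Eimtas";             // "Užimtas" as Windows-1252 bytes
$utf8   = toUtf8($legacy);

echo bin2hex($utf8), "\n";          // hex dump of the converted bytes
```

Guessing the source encoding is inherently unreliable, so this is a fallback for legacy data rather than a substitute for matching the encodings in the checklist.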




    As for encoding with a function in PHP, trial and error is always the simplest way to figure it out for each case, but I believe that htmlentities() should be what you need:
    http://www.php.net/manual/en/function.htmlentities.php
    Based on your description and what it says on the page that should work.

    Note that HTML entities also have encodings so that gets even more complex: several entities (numerical) exist for some characters, so make sure that's all operating in the right way or you STILL might end up with messy conversions.
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  4. #4
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    2,385
    Thanks
    100
    Thanked 113 Times in 111 Posts

    Maybe you are thinking of the ord function?

    PHP Code:
    <?php
    $str = "i";
    $str = ord($str);
    echo "&#$str" . ";";
    ?>
    You could then use a bit of preg_match to convert the character only if it is not a letter or number. There is probably a better way I am sure, but this is where my mind goes to first.

    I found the info first from http://tokira.net/unicode/index.php and then looked at his source which he made available and saw that he used the ord function.
    To choose the lesser of two evils is still to choose evil. My personal site

  5. #5
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Yes, that's a very good way to go about it, but also complex and not very easy to work with. It's probably best to actually figure out what's going wrong, but if you can't, ord() will surely be a way to not deal with character encodings (at least at the level of PHP-- you still have to work out HTML, databases, etc).

  6. #6
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    2,385
    Thanks
    100
    Thanked 113 Times in 111 Posts

    True, except that jscheuer1 is saying that there may be situations where obtaining the unicode value is what he needs, but we don't know what that situation is or would be, and I have to agree. He is saying that (correct me if I am wrong here jscheuer1) he does not know what those situations may be where he would need such a program, but he wants to be prepared.

    Your answer was quite useful and informative and deals with a host of possible causes, but I went for the more direct approach in my answer. Apples and oranges, so to speak.

    If I remember correctly, the first time I looked into this issue was when I tried to post more than one space in a post on this forum and noticed that every method I tried resulted in a truncating of my consecutive spaces. Unicode was the only thing that worked.

  7. #7
    Join Date
    Mar 2005
    Location
    SE PA USA
    Posts
    30,495
    Thanks
    82
    Thanked 3,449 Times in 3,410 Posts
    Blog Entries
    12

    I think I tried that, setting the encoding to UTF-8 in the header for the PHP page, UTF-8 on the receiving HTML page. The problem can be simplified as:

    PHP Code:
    <?php 
    $myvar1 = "Plain"; 
    $myvar2 = "Užimtas"; 
    echo $myvar2 . '<br>'; //gives: Užimtas<br>
    $answer = array ($myvar1, $myvar2);
    echo $answer[1] . '<br>'; //gives: Užimtas<br>
    echo json_encode($answer); //gives: ["Plain",null] 
    ?>
    Looks like so on the page:

    Užimtas
    Užimtas
    ["Plain",null]
    Whereas:

    Code:
    <?php 
    $myvar1 = "Plain"; 
    $myvar2 = "U&#0382;imtas"; 
    echo $myvar2 . '<br>'; //gives: U&#0382;imtas<br>
    $answer = array ($myvar1, $myvar2);
    echo $answer[1] . '<br>'; //gives: U&#0382;imtas<br>
    echo json_encode($answer); //gives: ["Plain","U&#0382;imtas"] 
    ?>
    Looks like so on the page:

    Užimtas
    Užimtas
    ["Plain","Užimtas"]
    So it's pretty clear that if we could have converted Užimtas (in the first example) to U&#0382;imtas before (or during, but I don't think that's possible) json_encode, things would work out well. Or if we could get json_encode to not choke on Užimtas . . . That would be another approach, but less applicable in general.

    By extension, if we could scan all variables/array values prior to json_encode and convert any non-ASCII characters in them to valid UNICODE entities, that would make the process universally applicable.
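
The idea of scanning every value before json_encode could be sketched roughly as follows. The helper name entitiesDeep is hypothetical, the code assumes the strings are valid UTF-8 and that the mbstring extension is available, and it emits unpadded decimal entities (&#382; rather than &#0382;).

```php
<?php
// Hypothetical helper: replace every non-ASCII character in every string
// of $data (nested arrays included) with a decimal numeric entity.
function entitiesDeep(array $data): array
{
    // Conversion map: code points 0x80..0x10FFFF, offset 0, keep all bits.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    array_walk_recursive($data, function (&$value) use ($map) {
        if (is_string($value)) {
            $value = mb_encode_numericentity($value, $map, 'UTF-8');
        }
    });
    return $data;
}

// Assumes this file itself is saved as UTF-8.
$answer = ['Plain', 'Užimtas'];
echo json_encode(entitiesDeep($answer)), "\n";
```

Because the result is pure ASCII, json_encode has nothing left to choke on, whatever encoding the page is served in.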

    Now, I tried james438's link, it gives the hex entity, no good for a valid HTML page. But Googling "Convert Text to Unicode" (which is the main heading of the page james438 linked to) got me:

    http://www.pinyin.info/tools/convert...ninumbers.html

    Which employs a simple javascript that does almost exactly what I would want to do in PHP. All it needs is a little tweak to get it to output valid UNICODE entities (add preceding 0(s) for values of a length less than 4). Could this be easily translated to PHP? Here's my modified version of the javascript:

    Code:
    /* convertToEntities()
     * Convert non-ASCII characters to valid HTML UNICODE entities */
    
    function convertToEntities(astr){
    	var bstr = '', cstr, i = 0;
    	for(i; i < astr.length; ++i){
    		if(astr.charCodeAt(i) > 127){
    			cstr = astr.charCodeAt(i).toString(10);
    			while(cstr.length < 4){
    				cstr = '0' + cstr;
    			}
    			bstr += '&#' + cstr + ';';
    		} else {
    			bstr += astr.charAt(i);
    		}
    	}
    	return bstr;
    }
    - John

  8. #8
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    I'm not sure what charCodeAt() does. If this is the same as ord(), or another function, then sure that can be converted.

    Here's a rough example:
    PHP Code:
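    function convertToEntities($astr) {
        $bstr = '';
        for ($i = 0; $i < strlen($astr); $i++) {
            // Note: ord() returns the value of one byte, not a Unicode code point.
            if (ord($astr[$i]) > 127) {
                $cstr = (string) ord($astr[$i]);
                while (strlen($cstr) < 4) {
                    $cstr = '0' . $cstr; // prepend zeros, as in the javascript version
                }
                $bstr .= '&#' . $cstr . ';';
            }
            else {
                $bstr .= $astr[$i];
            }
        }
        return $bstr;
    }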
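
Because ord() in the function above looks at single bytes, a multibyte UTF-8 character would be split into several wrong entities. A code-point-aware variant might look like the sketch below. The name convertToEntitiesUtf8 is made up, and the code assumes valid UTF-8 input, the mbstring extension, and PHP 7.4 or later (for mb_str_split).

```php
<?php
// Sketch of a code-point-aware port of the javascript function.
// Assumes $astr is valid UTF-8 and mbstring is loaded.
function convertToEntitiesUtf8(string $astr): string
{
    $bstr = '';
    foreach (mb_str_split($astr, 1, 'UTF-8') as $char) {
        $code = mb_ord($char, 'UTF-8');      // Unicode code point, not a byte
        if ($code > 127) {
            // Left-pad to four digits, as the javascript version does.
            $bstr .= '&#' . str_pad((string) $code, 4, '0', STR_PAD_LEFT) . ';';
        } else {
            $bstr .= $char;
        }
    }
    return $bstr;
}

// Assumes this file itself is saved as UTF-8.
echo convertToEntitiesUtf8("Užimtas"), "\n";
```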


  9. #9
    Join Date
    Mar 2005
    Location
    SE PA USA
    Posts
    30,495
    Thanks
    82
    Thanked 3,449 Times in 3,410 Posts
    Blog Entries
    12

    Not exactly. The ord() of PHP returns the value of a single byte of the character, not its Unicode value. This can be converted easily in PHP to bin, oct, dec, or hex. However, in order for a numeric entity to get the full seal of approval from the validator, it must use the HTML UNICODE value.

    For example, the ord() of the character in question here (ž) is 9e in hex. And &#x9e; works. In decimal that's 158. And &#158; also works. However, the HTML UNICODE value as given by javascript's charCodeAt() is 382. And &#0382; also works. That's decimal, but converted to hex is 17e (which is also the value for the character if looked up at unicode.org). And &#x17e; also works. But only 0382 and 17e are considered fully acceptable by the validator. The value given by ord() receives a warning:

    Line 9, Column 7: reference to non-SGML character

    U&#x9e;imtas<br>U&#x9e;imtas<br>["Plain","U&#x9e;imtas"]



    You've included a character reference to a character that is not defined in the document type you've chosen. This is most commonly caused by numerical references to characters from vendor proprietary character repertoires. Often the culprit will be fancy or typographical quote marks from either the Windows or Macintosh character repertoires.

    The solution is to reference UNICODE characters instead. A list of common characters from the Windows character repertoire and their UNICODE equivalents can be found in the document "On the use of some MS Windows characters in HTML" maintained by Jukka Korpela <jkorpela@cs.tut.fi>.
    Now oddly, after playing around a bit more, I find that if I save my source code as UTF-8 and use the literal character, json_encode converts it to:

    \u017e

    which is a value I can work with (using preg_replace or similar on the server, or replace on the client) to get the HTML UNICODE entity. And interestingly enough json_encode does this regardless of whether or not the page is served as UTF-8 or as ISO-8859-1. Unfortunately json_encode isn't available until PHP 5.2 and requires a setting for PHP/the server and extra json code installed to be enabled.
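
Turning those \uXXXX escapes into HTML entities on the server could be sketched as below. The helper name jsonEscapesToEntities is hypothetical, and the sketch has known limits: characters outside the Basic Multilingual Plane arrive as two surrogate escapes that it does not recombine, and an escaped backslash followed by the text uXXXX would be misread.

```php
<?php
// Hypothetical helper: turn JSON \uXXXX escapes into decimal HTML entities.
function jsonEscapesToEntities(string $json): string
{
    return preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/',           // a literal backslash, u, four hex digits
        function (array $m): string {
            return '&#' . hexdec($m[1]) . ';';
        },
        $json
    );
}

// Assumes this file itself is saved as UTF-8.
echo jsonEscapesToEntities(json_encode("Užimtas")), "\n";
```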

    Disconcerting is that ord() now reads it as 197 (Å), understandable but wrong. This is also regardless of how the page is served.

    So now I'm wondering if this behavior can be relied upon for json_encode whenever it's present, or if it's dependent upon server settings, and/or PHP version, and/or json version/json settings. And if there might be a more common PHP function that would give this HTML UNICODE value.
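
For comparison, newer PHP versions (7.2 and later, with the mbstring extension) provide mb_ord(), which returns the Unicode code point rather than a byte. A small sketch of the difference, with the character written as an escape so the result does not depend on how the file is saved:

```php
<?php
// "\u{17E}" is ž; the escape always produces its UTF-8 bytes (C5 BE).
$char = "\u{17E}";

$firstByte = ord($char);              // first byte of the UTF-8 sequence only
$codePoint = mb_ord($char, 'UTF-8');  // Unicode code point

printf("ord: %d, mb_ord: %d, hex: %x\n", $firstByte, $codePoint, $codePoint);
```

This matches the numbers in the post: 197 is just the first UTF-8 byte, while 382 (hex 17e) is the value the validator accepts.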

    Edit: Appears PHP doesn't yet support UNICODE:

    http://us3.php.net/manual/en/intro.unicode.php

    There is some movement to do so though:

    http://www.linux.com/archive/feature/60386

    though it appears to be a bit behind. And apparently this json_encode, which is not AFAIK a direct part of PHP, does so in at least a limited way.
    Last edited by jscheuer1; 03-29-2010 at 11:10 AM. Reason: add info/link
    - John

  10. #10
    Join Date
    Mar 2010
    Location
    Central PA, US of A
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts

    php unicode html entities

    I believe my problems are similar to the ones you have described.

    I have an XML file that contains unicode entities in addition to normal strings. When I load the XML via PHP's simplexml_load_file function and echo it out, certain unicode entities are not printed properly.

    I have squelched some of the 'special characters' via utf8_decode().
    The only entity giving me problems is 1/3 or ⅓ or &#8531;

    I've also attached my XML as a text file since the forum is going to encode my entities!

    Code:
    <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
    <cookbook>
    
    	<recipe>
    		<num>3</num>
    		<name>Green Bean Casserole</name>
    		<ingredients>
    			<item>1 can (10¾ oz) Campbell's Cream of Mushroom Soup</item>
    			<item>½ cup milk</item>
    			<item>1 tsp soy sauce</item>
    			<item>Dash ground black pepper</item>
    			<item>4 cups frozen cut green beans</item>
    			<item>1⅓ cups French fried onions</item>
    		</ingredients>
    		<procedure>
    			<step>Stir soup, milk, soy sauce, black pepper, beans and ⅔ cups onions in 1½ qt casserole.</step>
    			<step>Bake at 350°F for 25 minutes or until hot. Stir.</step>
    			<step>Top with remaining onions. Bake for 5 minutes more.</step>
    		</procedure>
    	</recipe>
    	
    </cookbook>
    PHP Code:
    <?php
        // load XML file
        $xml = simplexml_load_file("ingred.xml");

        $recipe = $xml->recipe;

        echo $recipe->num."<br />";
        echo $recipe->name."<br />";

        echo "<ul>";

        // create ingredient list items
        foreach($recipe->ingredients->item as $item) {
            echo '<li>'.utf8_decode($item).'</li>';
        }

        echo "</ul><br /><ol>";

        // create step list items
        foreach($recipe->procedure->step as $step) {
            echo '<li>'.utf8_decode($step).'</li>';
        }

        echo "</ol>";
    ?>
    Last edited by silvertip257; 03-29-2010 at 08:42 PM.
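
A likely reason ⅓ fails is that utf8_decode() converts to ISO-8859-1, which contains ½ and ¾ but not ⅓. One possible alternative, sketched below, is to keep the UTF-8 string that SimpleXML returns and encode its non-ASCII characters as numeric entities. The inline XML is a hypothetical stand-in for the attached ingred.xml, and the mbstring extension is assumed.

```php
<?php
// Inline stand-in for the attached ingred.xml (hypothetical excerpt).
$xml = simplexml_load_string(
    '<cookbook><recipe><ingredients>'
    . '<item>1&#8531; cups French fried onions</item>'
    . '</ingredients></recipe></cookbook>'
);

// Conversion map: code points 0x80..0x10FFFF, offset 0, keep all bits.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

foreach ($xml->recipe->ingredients->item as $item) {
    // SimpleXML hands back UTF-8; encode instead of calling utf8_decode().
    $safe = mb_encode_numericentity((string) $item, $map, 'UTF-8');
    echo '<li>' . $safe . '</li>', "\n";
}
```

If the text may also contain & or <, running htmlspecialchars() on the string before the entity step would be a reasonable addition.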
