Convert non-ASCII to HTML UNICODE entities
jscheuer1
03-28-2010, 04:56 AM
I was recently working on a PHP related issue in the javascript forum:
http://www.dynamicdrive.com/forums/showthread.php?t=53535
I got it working on WAMP, but the OP apparently either 'doesn't get it', or there are some other issues involved.
Anyways, if you look at the thread you will see that the OP was setting a PHP variable to:
Užimtas
From my point of view it was primarily the ž in Užimtas that was messing things up.
Regardless of whether my approach/solution there is the answer, it occurs to me that there are situations where one would like to convert characters like ž to their HTML UNICODE entities (in this case &#0382;) before sending the result to an AJAX request or wherever. I tried htmlentities, but that seems to do sort of the opposite of what I want: it preserves an entity as is without allowing it to be converted. What I want is to detect a character that isn't ASCII, convert it to its HTML UNICODE entity, and send that in such a way that it can be converted to the character it represents on the receiving end.
I looked at:
utf8_decode
and:
utf8_encode
But I am unsure if either would be the best way, or even be applicable at all. It's fairly clear that if either of those is used for this, it would probably have to be as part of a moderately complex custom PHP function.
If you understand what I'm after and have any suggestions, it would be appreciated. If you need more explanation, feel free to ask.
bluewalrus
03-28-2010, 06:31 AM
Not sure if I quite understand. Is it that the:
Owner is trying to use the special character within the username Užimtas as a variable name, like $Užimtas
or that the:
Owner is trying to assign that name to a variable, $user = "Užimtas";
If the latter is the case it should be converted; I think for other usage (HTML) it should work.
As a PHP conversion script though:
$input = $_POST['content?fields'];
$patterns[] = '/À/';
$replacements[] = '&Agrave;';
$patterns[] = '/à/';
$replacements[] = '&agrave;';
$patterns[] = '/Á/';
$replacements[] = '&Aacute;';
$patterns[] = '/á/';
$replacements[] = '&aacute;';
$patterns[] = '/Â/';
$replacements[] = '&Acirc;';
$patterns[] = '/â/';
$replacements[] = '&acirc;';
$patterns[] = '/Ã/';
$replacements[] = '&Atilde;';
$patterns[] = '/ã/';
$replacements[] = '&atilde;';
$patterns[] = '/Ä/';
$replacements[] = '&Auml;';
$patterns[] = '/ä/';
$replacements[] = '&auml;';
$patterns[] = '/Å/';
$replacements[] = '&Aring;';
$patterns[] = '/å/';
$replacements[] = '&aring;';
$patterns[] = '/È/';
$replacements[] = '&Egrave;';
$patterns[] = '/è/';
$replacements[] = '&egrave;';
$patterns[] = '/É/';
$replacements[] = '&Eacute;';
$patterns[] = '/é/';
$replacements[] = '&eacute;';
$patterns[] = '/Ê/';
$replacements[] = '&Ecirc;';
$patterns[] = '/ê/';
$replacements[] = '&ecirc;';
$patterns[] = '/Ë/';
$replacements[] = '&Euml;';
$patterns[] = '/ë/';
$replacements[] = '&euml;';
$patterns[] = '/Ì/';
$replacements[] = '&Igrave;';
$patterns[] = '/ì/';
$replacements[] = '&igrave;';
$patterns[] = '/Í/';
$replacements[] = '&Iacute;';
$patterns[] = '/í/';
$replacements[] = '&iacute;';
$patterns[] = '/Î/';
$replacements[] = '&Icirc;';
$patterns[] = '/î/';
$replacements[] = '&icirc;';
$patterns[] = '/Ï/';
$replacements[] = '&Iuml;';
$patterns[] = '/ï/';
$replacements[] = '&iuml;';
$patterns[] = '/Ñ/';
$replacements[] = '&Ntilde;';
$patterns[] = '/ñ/';
$replacements[] = '&ntilde;';
$patterns[] = '/Ò/';
$replacements[] = '&Ograve;';
$patterns[] = '/ò/';
$replacements[] = '&ograve;';
$patterns[] = '/Ó/';
$replacements[] = '&Oacute;';
$patterns[] = '/ó/';
$replacements[] = '&oacute;';
$patterns[] = '/Ô/';
$replacements[] = '&Ocirc;';
$patterns[] = '/ô/';
$replacements[] = '&ocirc;';
$patterns[] = '/Õ/';
$replacements[] = '&Otilde;';
$patterns[] = '/õ/';
$replacements[] = '&otilde;';
$patterns[] = '/Ö/';
$replacements[] = '&Ouml;';
$patterns[] = '/ö/';
$replacements[] = '&ouml;';
$patterns[] = '/Ø/';
$replacements[] = '&Oslash;';
$patterns[] = '/ø/';
$replacements[] = '&oslash;';
$patterns[] = '/Ù/';
$replacements[] = '&Ugrave;';
$patterns[] = '/ù/';
$replacements[] = '&ugrave;';
$patterns[] = '/Û/';
$replacements[] = '&Ucirc;';
$patterns[] = '/û/';
$replacements[] = '&ucirc;';
$patterns[] = '/Ü/';
$replacements[] = '&Uuml;';
$patterns[] = '/ü/';
$replacements[] = '&uuml;';
$patterns[] = '/Ý/';
$replacements[] = '&Yacute;';
$patterns[] = '/ÿ/';
$replacements[] = '&yuml;';
$contents = preg_replace($patterns, $replacements, $input);
For using later (varies on usage) : echo htmlspecialchars($contents);
For using on current location usage: echo $contents;
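A shorter route to the same named entities, as a hedged sketch rather than a drop-in replacement: PHP's built-in htmlentities() already carries this character-to-entity table, provided it is told the input is UTF-8. The sample strings and the assumption that the file is saved as UTF-8 are mine, not from the thread. It also suggests why htmlentities() does not help with ž: the default HTML 4.01 table has no named entity for that character, so it passes through unchanged.

```php
<?php
// Sketch only: assumes this file and the input are UTF-8.
$input = 'café à la carte';

// Named entities for every character that has one in the HTML 4.01 table.
$contents = htmlentities($input, ENT_QUOTES, 'UTF-8');
echo $contents, "\n"; // caf&eacute; &agrave; la carte

// ž has no named entity in that table, so it is left as-is.
$unchanged = htmlentities('Užimtas', ENT_QUOTES, 'UTF-8');
echo $unchanged, "\n"; // Užimtas
```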
djr33
03-28-2010, 07:26 AM
I'm also not sure if you mean that the variable name or the variable value should have a strange character in it. For PHP variables I'd suggest omitting unusual characters because the system won't like it and in general code must remain in English (such as function names).
As for values, there should be no problems whatsoever with storing absolutely anything (unless it needs to be escaped, such as quotes) in any PHP variable.
I'm working with lots of strange characters at the moment while building a language-learning website and I haven't had any issues within PHP.
The issues always occur with MySQL or HTML. The problem is always the encoding and getting everything to cooperate.
One other issue that can be very confusing that might be the problem here is the text encoding of the text document (php file) itself. This is *not* the HTML code at the top of the page, but the format of the .txt (or .php) file within the operating system. In a program like notepad you can select what format you want to use.
Always use utf8 if there's any question about unusual characters.
However, some programs will randomly change the encoding or not give a choice about format.
One problem I had was using dreamweaver: it randomly switched the encoding to something else (not utf8) if there were no other characters present, just what was supported in the standard character set. So for this reason I put a comment at the end of a lot of files that read something like: //preserve encoding in DW あ.
That Japanese character would then force DW to save the file encoding still as unicode and not revert back to whatever it felt like.
Another problem I know about is a similar issue with FTP programs: if you use some FTP programs they will modify the contents of the files you upload, such as altering whitespace formatting (\r=>\n, etc) or changing the character encoding.
The only real answer to all of this is to either find a workaround (which is usually a pain) or just use a better program that isn't annoying.
Note: everything I have said is entirely about the variables typed directly into the .php files. Other issues like MySQL and HTML will still exist but won't be related to this specifically. In general, if you enter something and it is accepted by PHP another way (such as through a form), it will reliably remain in that same form when echoed, etc.
For example, if you type a Japanese character into a form, then print $_POST on the next page, it will probably be in the same form that it was when you typed it into the text field. If you get it from *any* other source, that is not guaranteed unless you have carefully set all encodings to match. But when the source and destination are the same, it should not be a problem. If possible, set up a system where source and destination are the same and things will be easier.
In general, my advice is to double check that all encodings are the same:
1. text document encoding (see above)
2. HTML encoding (in a <meta> tag)
3. Database-- both the table and the default setting (or things can get confusing)
4. Serve the page as that encoding (probably default, but might be weird on some servers).
5. Your FTP program must upload without changing the encoding.
6. Any location (ie, word document) where you might cut/paste from.
7. The working encoding in all relevant programs such as editors and browsers, just to be sure that it's all compatible.
Use UTF8 for everything and life should be easier (assuming you can actually get everything to use UTF8).
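One quick way to check whether a given string really is UTF-8 at some point in that chain, offered as a minimal sketch: the helper name isUtf8 is hypothetical, and it assumes the mbstring extension is available. The two byte strings are "Užimtas" written out by hand in UTF-8 and in Windows-1252.

```php
<?php
// Hypothetical helper (not from the thread): report whether a string is
// valid UTF-8. Requires the mbstring extension.
function isUtf8($text) {
    return mb_check_encoding($text, 'UTF-8');
}

$utf8Bytes    = "U\xC5\xBEimtas"; // "Užimtas" as UTF-8 (ž = C5 BE)
$win1252Bytes = "U\x9Eimtas";     // "Užimtas" as Windows-1252 (ž = 9E)

var_dump(isUtf8($utf8Bytes));    // bool(true)
var_dump(isUtf8($win1252Bytes)); // bool(false)
```

Running a check like this on a value typed into the .php file, on a value read from the database, and on a value from $_POST can help narrow down which link in the list is not using UTF-8.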
As for encoding with a function in PHP, trial and error is always the simplest way to figure it out for each case, but I believe that htmlentities() should be what you need:
http://www.php.net/manual/en/function.htmlentities.php
Based on your description and what it says on the page that should work.
Note that HTML entities also have encodings so that gets even more complex: several entities (numerical) exist for some characters, so make sure that's all operating in the right way or you STILL might end up with messy conversions.
james438
03-28-2010, 07:27 AM
Maybe you are thinking of the ord (http://php.net/manual/en/function.ord.php) function?
<?php
$str = "i";
$str = ord($str); // numeric value of the (single-byte) character
echo "&#" . $str . ";";
?>
You could then use a bit of preg_match to convert the character only if it is not a letter or number. There is probably a better way I am sure, but this is where my mind goes to first.
I found the info first from http://tokira.net/unicode/index.php and then looked at his source which he made available and saw that he used the ord (http://php.net/manual/en/function.ord.php) function.
djr33
03-28-2010, 07:42 AM
Yes, that's a very good way to go about it, but also complex and not very easy to work with. It's probably best to actually figure out what's going wrong, but if you can't, ord() will surely be a way to not deal with character encodings (at least at the level of PHP-- you still have to work out HTML, databases, etc).
james438
03-28-2010, 07:52 AM
true, except that jscheuer1 is saying that there may be situations where obtaining the unicode value is what he needs, but we don't know what that situation is or would be and I have to agree. He is saying that (correct me if I am wrong here jscheuer1) he does not know what those situations may be where he would need such a program, but he wants to be prepared.
Your answer was quite useful and informative and deals with a host of possible causes, but I went for the more direct approach in my answer. apples and oranges so to speak ;)
If I remember correctly, the first time I looked into this issue was when I tried to post more than one space in a post on this forum and noticed that every method I tried resulted in a truncating of my consecutive spaces. Unicode was the only thing that worked.
jscheuer1
03-28-2010, 01:09 PM
I think I tried that, setting the encoding to UTF-8 in the header for the PHP page, UTF-8 on the receiving HTML page. The problem can be simplified as:
<?php
$myvar1 = "Plain";
$myvar2 = "Užimtas";
echo $myvar2 . '<br>'; //gives: Užimtas<br>
$answer = array ($myvar1, $myvar2);
echo $answer[1] . '<br>'; //gives: Užimtas<br>
echo json_encode($answer); //gives: ["Plain",null]
?>
Looks like so on the page:
Užimtas
Užimtas
["Plain",null]
Whereas:
<?php
$myvar1 = "Plain";
$myvar2 = "U&#0382;imtas";
echo $myvar2 . '<br>'; //gives: U&#0382;imtas<br>
$answer = array ($myvar1, $myvar2);
echo $answer[1] . '<br>'; //gives: U&#0382;imtas<br>
echo json_encode($answer); //gives: ["Plain","U&#0382;imtas"]
?>
Looks like so on the page:
Užimtas
Užimtas
["Plain","Užimtas"]
So it's pretty clear that if we could have converted Užimtas (in the first example) to U&#0382;imtas before (or during, but I don't think that's possible) json_encode, things would work out well. Or if we could get json_encode to not choke on Užimtas . . . That would be another approach, but less applicable in general.
By extension, if we could scan all variables/array values prior to json_encode and convert any non-ASCII characters in them to valid UNICODE entities, that would make the process universally applicable.
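A possible explanation for the null, sketched under an assumption: json_encode only accepts UTF-8 strings, so if the source file was saved in a single-byte Windows encoding (my assumption, not confirmed in the first example), the ž arrives as a byte that is not valid UTF-8. The sketch below writes that byte out by hand, shows the failure, and converts the value to UTF-8 before encoding. It assumes the mbstring extension and PHP 5.5 or later, where the whole call returns false instead of a null element.

```php
<?php
// Sketch: assumes the mbstring extension and PHP 5.5+ behaviour.
$myvar1 = "Plain";
$myvar2 = "U\x9Eimtas"; // "Užimtas" as Windows-1252 bytes: not valid UTF-8

$bad = json_encode(array($myvar1, $myvar2));
$isUtf8Error = (json_last_error() === JSON_ERROR_UTF8);
var_dump($bad);         // bool(false) on PHP 5.5 and later
var_dump($isUtf8Error); // bool(true)

// Convert the value to UTF-8 first, then encode.
$fixed = mb_convert_encoding($myvar2, 'UTF-8', 'Windows-1252');
$good = json_encode(array($myvar1, $fixed));
echo $good, "\n"; // ["Plain","U\u017eimtas"]
```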
Now, I tried james438's link, it gives the hex entity, no good for a valid HTML page. But Googling "Convert Text to Unicode" (which is the main heading of the page james438 linked to) got me:
http://www.pinyin.info/tools/converter/chars2uninumbers.html
Which employs a simple javascript (http://www.pinyin.info/tools/converter/convertToEntities.js) that does almost exactly what I would want to do in PHP. All it needs is a little tweak to get it to output valid UNICODE entities (add preceding 0(s) for values of a length less than 4). Could this be easily translated to PHP? Here's my modified version of the javascript:
/* convertToEntities()
 * Convert non-ASCII characters to valid HTML UNICODE entities */
function convertToEntities(astr){
    var bstr = '', cstr, i = 0;
    for(i; i < astr.length; ++i){
        if(astr.charCodeAt(i) > 127){
            cstr = astr.charCodeAt(i).toString(10);
            while(cstr.length < 4){
                cstr = '0' + cstr;
            }
            bstr += '&#' + cstr + ';';
        } else {
            bstr += astr.charAt(i);
        }
    }
    return bstr;
}
djr33
03-28-2010, 10:24 PM
I'm not sure what charCodeAt() does. If this is the same as ord(), or another function, then sure that can be converted.
Here's a rough example:
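function convertToEntities($astr) {
    $bstr = '';
    // Note: ord() reads one byte, not one character, so this only
    // mirrors the javascript for single-byte encodings.
    for ($i = 0; $i < strlen($astr); $i++) {
        $code = ord($astr[$i]);
        if ($code > 127) {
            // pad on the left with zeros to at least 4 digits
            $cstr = str_pad((string) $code, 4, '0', STR_PAD_LEFT);
            $bstr .= '&#' . $cstr . ';';
        } else {
            $bstr .= $astr[$i];
        }
    }
    return $bstr;
}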
jscheuer1
03-29-2010, 04:06 AM
Not exactly. PHP's ord() returns the value of a single byte, not the UNICODE value of the character. That number can be shown easily in PHP as bin, oct, hex, or dec. However, in order for a numeric entity to get the full seal of approval from the validator, it must use the HTML UNICODE value.
For example, the ord() of the character in question here (ž) is 9e in hex. And &#x9e; works. In decimal that's 158. And &#158; also works. However, the HTML UNICODE value as given by javascript's charCodeAt() is 382. And &#382; also works. That's decimal, but converted to hex is 17e (which is also the value for the character if looked up at unicode.org). And &#x17e; also works. But only 0382 and 17e are considered fully acceptable by the validator. The value given by ord() receives a warning:
Line 9, Column 7: reference to non-SGML character
Užimtas<br>Užimtas<br>["Plain","Užimtas"]
You've included a character reference to a character that is not defined in the document type you've chosen. This is most commonly caused by numerical references to characters from vendor proprietary character repertoires. Often the culprit will be fancy or typographical quote marks from either the Windows or Macintosh character repertoires.
The solution is to reference UNICODE characters instead. A list of common characters from the Windows character repertoire and their UNICODE equivalents can be found in the document "On the use of some MS Windows characters in HTML" maintained by Jukka Korpela <jkorpela@cs.tut.fi>.
Now oddly, after playing around a bit more, I find that if I save my source code as UTF-8 and use the literal character, json_encode converts it to:
\u017e
which is a value I can work with (using preg_replace or similar on the server, or replace on the client) to get the HTML UNICODE entity. And interestingly enough json_encode does this regardless of whether or not the page is served as UTF-8 or as ISO-8859-1. Unfortunately json_encode isn't available until PHP 5.2 and requires a setting for PHP/the server and extra json code installed to be enabled.
Disconcerting is that ord() now reads it as 197 (Å), understandable but wrong. This is also regardless of how the page is served.
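The 197 can be accounted for with a small worked example, given here as a sketch: in UTF-8, ž is stored as the two bytes C5 BE, and ord() only looks at the first one. Combining the payload bits of both bytes by hand recovers the 382 that charCodeAt() reports. The variable names are illustrative only.

```php
<?php
// "ž" saved as UTF-8 is the two bytes C5 BE.
$char = "\xC5\xBE";

$first  = ord($char[0]); // 197 (0xC5), the lead byte ord() reports
$second = ord($char[1]); // 190 (0xBE), the continuation byte

// A two-byte UTF-8 sequence is 110xxxxx 10yyyyyy; the code point is xxxxxyyyyyy.
$codePoint = (($first & 0x1F) << 6) | ($second & 0x3F);

echo $first, "\n";             // 197
echo $codePoint, "\n";         // 382
echo dechex($codePoint), "\n"; // 17e
```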
So now I'm wondering if this behavior can be relied upon for json_encode whenever it's present, or if it's dependent upon server settings, and/or PHP version, and/or json version/json settings. And if there might be a more common PHP function that would give this HTML UNICODE value.
Appears PHP doesn't yet support UNICODE:
http://us3.php.net/manual/en/intro.unicode.php
There is some movement to do so though:
http://www.linux.com/archive/feature/60386
though it appears to be a bit behind. And apparently this json_encode, which is not AFAIK a direct part of PHP does so in at least a limited way.
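On the question of a more common PHP function for this: one candidate, offered as a sketch rather than a tested answer for every server, is mb_encode_numericentity() from the mbstring extension. The wrapper name convertToEntitiesMb is hypothetical, and the sketch assumes the input is valid UTF-8 and that mbstring is loaded. It emits decimal entities without zero padding (&#382; rather than &#0382;).

```php
<?php
// Hypothetical helper, modelled on the javascript convertToEntities() above.
// Assumes $astr is valid UTF-8 and the mbstring extension is loaded.
function convertToEntitiesMb($astr) {
    // Map: code points 0x80..0x10FFFF, offset 0, mask 0x1FFFFF (keep all bits).
    $map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($astr, $map, 'UTF-8');
}

echo convertToEntitiesMb("Užimtas"), "\n"; // U&#382;imtas
echo convertToEntitiesMb("1⅓ cups"), "\n"; // 1&#8531; cups
```

Because ASCII characters fall outside the map, they pass through untouched, so the result is plain ASCII and should be safe to hand to json_encode or send to the browser under any of the encodings discussed here.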
silvertip257
03-29-2010, 08:34 PM
I believe my problems are similar to the ones you have described.
I have an XML file that contains unicode entities in addition to normal strings. When I load the XML via PHP's simplexml_load_file function and echo it out, certain unicode entities are not printed properly.
I have squelched some of the 'special characters' via utf8_decode().
The only entity giving me problems is 1/3 or ⅓ or &#8531;
I've also attached my XML as a text file since the forum is going to encode my entities! :o
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<cookbook>
<recipe>
<num>3</num>
<name>Green Bean Casserole</name>
<ingredients>
<item>1 can (10¾ oz) Campbell's Cream of Mushroom Soup</item>
<item>½ cup milk</item>
<item>1 tsp soy sauce</item>
<item>Dash ground black pepper</item>
<item>4 cups frozen cut green beans</item>
<item>1⅓ cups French fried onions</item>
</ingredients>
<procedure>
<step>Stir soup, milk, soy sauce, black pepper, beans and ⅔ cups onions in 1½ qt casserole.</step>
<step>Bake at 350°F for 25 minutes or until hot. Stir.</step>
<step>Top with remaining onions. Bake for 5 minutes more.</step>
</procedure>
</recipe>
</cookbook>
<?php
// load XML file
$xml = simplexml_load_file("ingred.xml");
$recipe = $xml->recipe;
echo $recipe->num."<br />";
echo $recipe->name."<br />";
echo "<ul>";
// create ingredient list items
foreach($recipe->ingredients->item as $item) {
echo '<li>'.utf8_decode($item).'</li>';
}
echo "</ul><br /><ol>";
// create step list items
foreach($recipe->procedure->step as $step) {
echo '<li>'.utf8_decode($step).'</li>';
}
echo "</ol>";
?>
jscheuer1
03-29-2010, 09:41 PM
I'd try using the literal characters. As long as all the files are saved in your editor encoded as UTF-8 and served as UTF-8, it should work out. You may have to use a PHP header to set the encoding on the PHP page. Your host must allow UTF-8 encoding (most do). Alternatively, and I don't know much about this, you may be able to use straight xml, dispensing with PHP entirely.
silvertip257
03-29-2010, 10:28 PM
Unfortunately, the entity that does not display correctly (one third) does not have a literal value.
I've tried HTML headers, but setting it to utf-8 doesn't make it better - it makes it worse (box characters). Ironically setting the encoding to ascii removes the boxes, but the 1/3 still exhibits a question mark.
I've read and tried techniques in these articles. Nothing so far makes the 1/3 symbol display correctly though.
http://www.varslashlog.com/2009/02/09/how-to-use-unicodeutf-8-in-php-properly-part-1/
http://www.phpwact.org/php/i18n/utf-8
silvertip257
03-29-2010, 11:17 PM
I've found a fix for my problems.
Adding this line to the top of my PHP code
header ('Content-type: text/html; charset=utf-8');
and removing my utf8_decode() function call.
I didn't have to change a thing in my XML.
The whole problem was not using the right encoding type - ASCII does not have support for the Vulgar Fraction One Third entity ... but UTF-8 does.
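A small sketch of what was probably going on with utf8_decode(): it converts UTF-8 to ISO-8859-1, which has single bytes for ½, ¾ and ° but none for ⅓, so that one character becomes a question mark. The sketch uses mb_convert_encoding() to do the same conversion (an assumption of mine that mbstring is available; the sample strings are cut down from the recipe).

```php
<?php
// Sketch: assumes the mbstring extension; the strings are UTF-8.
// Converting UTF-8 to ISO-8859-1 is what utf8_decode() does.
$half  = mb_convert_encoding("½ cup", 'ISO-8859-1', 'UTF-8');
$third = mb_convert_encoding("1⅓ cups", 'ISO-8859-1', 'UTF-8');

echo bin2hex($half), "\n"; // bd20637570 (½ exists in ISO-8859-1 as byte BD)
echo $third, "\n";         // 1? cups    (⅓ has no ISO-8859-1 byte)
```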
Hopefully this information is helpful to others.
I'll attach my sources in a zip file just for good measure.
<?php
// load XML file
$xml = simplexml_load_file("ingred.xml");
$recipe = $xml->recipe;
header ('Content-type: text/html; charset=utf-8');
echo $recipe->num."<br />";
echo $recipe->name."<br />";
echo "<ul>";
// create ingredient list items
foreach($recipe->ingredients->item as $item) {
//$item = mb_convert_encoding($item, 'auto', 'UTF-8');
echo '<li>'.$item.'</li>';
//echo '<li>'.$item.'</li>';
}
echo "</ul><br /><ol>";
// create step list items
foreach($recipe->procedure->step as $step) {
echo '<li>'.$step.'</li>';
}
echo "</ol>";
?>
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<cookbook>
<recipe>
<num>3</num>
<name>Green Bean Casserole</name>
<ingredients>
<item>1 can (10¾ oz) Campbell's Cream of Mushroom Soup</item>
<item>½ cup milk</item>
<item>1 tsp soy sauce</item>
<item>Dash ground black pepper</item>
<item>4 cups frozen cut green beans</item>
<item>1⅓ cups French fried onions</item><!--⅓-->
</ingredients>
<procedure>
<step>Stir soup, milk, soy sauce, black pepper, beans and ⅔ cups onions in 1½ qt casserole.</step>
<step>Bake at 350°F for 25 minutes or until hot. Stir.</step>
<step>Top with remaining onions. Bake for 5 minutes more.</step>
</procedure>
</recipe>
</cookbook>