Resolved: htmlentities behaving differently after PHP 5.4 upgrade



james438
11-21-2014, 12:52 AM
After upgrading to php 5.4.19 from 5.3.24 I see that I am having trouble using htmlentities (http://php.net/htmlentities) and the like.

Below is the example code I am playing with, but with no luck so far. Most characters are output just fine, but when the input contains unusual characters such as non-standard single quotes, accented characters, or arrows, the output is empty. If I am understanding it correctly, PHP returns an empty string by design for security reasons.

What I want to do is express html entities as the actual character or as the html equivalent and express all other characters as is. Even partial solutions are fine as I can just play around with the code, but I am having a bit of trouble better understanding the flags and encoding used.


<?php
$title="á";
##$title=htmlspecialchars($title);
print $title;
?>
<textarea name="summary" cols=75 rows=25><?php print htmlentities($title) ; ?></textarea>

jscheuer1
11-21-2014, 03:52 AM
Have you checked the man (php.net) page for that function to see if there have been any changes as to its usage?

Are you familiar with basic usage of flags?

I don't have the later version of PHP. What are you getting? I'm getting the entity, but of course it looks exactly like the character unless I 'view source'.

james438
11-21-2014, 04:56 AM
Thanks for looking into it. I did look at the main php site with particular interest at the differences added in php 5.4. I am learning about this kinda slowly, but it seems to have to do with the character set.

I have discovered that the following works:


<?php
$title="á";
$title=htmlspecialchars($title, ENT_IGNORE, '');
print $title;
?>
<textarea name="summary" cols=75 rows=25><?php print htmlentities($title,ENT_IGNORE,'') ; ?></textarea>

where the character set is unspecified.

php.net has this to say about using an empty string, but I don't fully understand it.


An empty string activates detection from script encoding (Zend multibyte), default_charset and current locale (see nl_langinfo() and setlocale()), in this order. Not recommended.

I'm not fully sure what character set I am using, but I know it is either UTF-8 or ISO-8859-1. I think it matters, but I'm not sure how.

Sadly, getting information on this has been slow, but I am making some progress.

EDIT: I am definitely using the default which is UTF-8.
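
One way to check which encoding a string is really in is to look at its raw bytes. The following is only a minimal sketch: describe_bytes() is a hypothetical helper name, and it assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: show a string's raw bytes and whether they are valid UTF-8.
function describe_bytes($s) {
    return array(
        'hex'        => bin2hex($s),                    // raw bytes as hexadecimal
        'valid_utf8' => mb_check_encoding($s, 'UTF-8'), // true only for well-formed UTF-8
    );
}

$latin = "\xE1";     // "á" as a single ISO-8859-1 / ISO-8859-15 byte
$utf8  = "\xC3\xA1"; // "á" as two UTF-8 bytes

print_r(describe_bytes($latin));
print_r(describe_bytes($utf8));
```

If the script file itself is saved as ISO-8859-1 or ISO-8859-15, a literal "á" in the source is the single byte e1, which is not valid UTF-8.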

james438
11-21-2014, 05:34 AM
$title=htmlspecialchars($title, ENT_IGNORE, 'ISO-8859-15');

I seem to be somewhat wrong about the character set used. It registers as UTF-8 when I try to detect the character set, but in phpinfo() under exif.encode_unicode the one listed is ISO-8859-15.

Next up is to find out what that means, why it was used, and if I should change it to UTF-8, which is the default encoding for htmlentities from php 5.4 onwards (before that it was ISO-8859-1).

jscheuer1
11-21-2014, 07:10 AM
My reading of that cryptic quote is that an empty string makes PHP fall back to the default charset for the server. If that is the encoding the server uses when serving the page, or if the characters being converted are encoded the same way in both charsets, it will work out. Otherwise you must specify the encoding the page is served in to get the correct result.

I know that's not much clearer, but I hope it is clear enough to be of some use.

james438
11-21-2014, 02:34 PM
I'm glad it is not just me that found their quote somewhat cryptic. php.net has a lot of great documentation, but some of their pages are just not well written. Still, as far as documentation goes php.net is probably my favorite.


My reading of that cryptic quote is that an empty string makes PHP fall back to the default charset for the server. If that is the encoding the server uses when serving the page, or if the characters being converted are encoded the same way in both charsets, it will work out. Otherwise you must specify the encoding the page is served in to get the correct result.

That does help out a bit.

It looks like my hosting service decided to use ISO-8859-15 over ISO-8859-1 or UTF-8 because according to the description:

ISO-8859-1 Western European, Latin-1.
ISO-8859-15 Western European, Latin-9. Adds the Euro sign, French and Finnish letters missing in Latin-1 (ISO-8859-1).
UTF-8 ASCII compatible multi-byte 8-bit Unicode.

So I think I will keep using ISO-8859-15 and start specifying the encoding used. I'll try to remember that my hosting service may change this in the future, but I don't think that will happen again anytime soon if it ever does.
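
Specifying the encoding explicitly could look roughly like this. It is only a sketch, assuming the script and page really are ISO-8859-15; ENT_QUOTES is an illustrative choice of flag here, not something required by the fix.

```php
<?php
// Assumption: the data is ISO-8859-15, so "á" is the single byte 0xE1.
$title = "\xE1";

// Name the encoding explicitly instead of relying on the PHP 5.4+ default (UTF-8).
$encoded = htmlentities($title, ENT_QUOTES, 'ISO-8859-15');

echo $encoded; // prints &aacute;
```

The same third argument can be passed to htmlspecialchars() and html_entity_decode().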

molendijk
11-21-2014, 03:05 PM
Have you tried this:


<?php
$title="é";
##$title=htmlspecialchars($title);
print $title;
?>
<textarea name="summary" cols=75 rows=25><?php print html_entity_decode($title);?></textarea>

james438
11-21-2014, 07:13 PM
That will work, but only under limited circumstances. Encoding any character that is non standard such as é or the &mdash; (—) will return an empty string.

I wish php.net had more to say on exactly what is non standard. This is about all it has to say on it though.


If the input string contains an invalid code unit sequence within the given encoding an empty string will be returned, unless either the ENT_IGNORE or ENT_SUBSTITUTE flags are set.
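
The behaviour in that quote can be reproduced with a lone ISO-8859-1 byte passed off as UTF-8. This is a small sketch, with the sample string chosen purely for illustration:

```php
<?php
// "café" with é as the lone byte 0xE9: fine in ISO-8859-1, invalid in UTF-8.
$bad = "caf\xE9";

// No ENT_IGNORE / ENT_SUBSTITUTE: the whole result is an empty string.
$default = htmlentities($bad, ENT_QUOTES, 'UTF-8');

// ENT_IGNORE silently drops the invalid byte.
$ignored = htmlentities($bad, ENT_QUOTES | ENT_IGNORE, 'UTF-8');

// ENT_SUBSTITUTE replaces the invalid byte with a replacement character.
$substituted = htmlentities($bad, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($default, $ignored, bin2hex($substituted));
```

The php.net page discourages ENT_IGNORE because of possible security implications, so ENT_SUBSTITUTE (or passing the correct encoding) is usually the safer route.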

jscheuer1
11-22-2014, 05:17 AM
For maximum utility in all but the most demanding human languages (as opposed to coding languages), UTF-8 is the way to go.

That is to say, for English, most Arabic, all Romance languages, most Oriental ones, many others, UTF-8 will work as long as the page is encoded in and served as UTF-8. Once you have that, if you then also use UTF-8 as the encoding for the htmlentities command, everything should work out. The only exception I can think of is if you're pulling a string from a database that's encoded in something other than UTF-8. There could be other exceptions.

The bottom line is that if at all possible you should ensure that everything is encoded in, and told to use, the same charset. Where that's not possible, one can almost always convert, but it gets tricky because you might not always know which encoding you are converting from, and which you should convert to, at each specific point in your operations. Again, that's why it's best to use a single encoding for everything. If UTF-8 is not adequate, then use another, but use it for everything.
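
Where conversion can't be avoided, one possible approach is to convert to UTF-8 first and then encode entities as UTF-8. This is only a sketch: to_utf8_entities() is a hypothetical helper, it assumes the mbstring extension, and it assumes you already know the source encoding.

```php
<?php
// Hypothetical helper: convert from a known source encoding to UTF-8,
// then produce HTML entities with UTF-8 named explicitly.
function to_utf8_entities($s, $from = 'ISO-8859-15') {
    $utf8 = mb_convert_encoding($s, 'UTF-8', $from);
    return htmlentities($utf8, ENT_QUOTES, 'UTF-8');
}

// "á" as a single ISO-8859-15 byte, e.g. pulled from a legacy database column.
echo to_utf8_entities("\xE1"); // prints &aacute;
```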

I think you mentioned something about the euro and/or pound sign earlier. As far as I know, both of those are supported in UTF-8. However, and as an example of how confusing things can become if more than one encoding is employed, the bytes used to represent these two common currency symbols vary depending upon the encoding used.
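
To make that concrete, here is a rough sketch that prints the bytes for each sign in a few encodings. bytes_in() is a hypothetical helper and the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: hex bytes of a UTF-8 character after converting it to $enc.
function bytes_in($utf8_char, $enc) {
    return bin2hex(mb_convert_encoding($utf8_char, $enc, 'UTF-8'));
}

$euro  = "\xE2\x82\xAC"; // euro sign in UTF-8
$pound = "\xC2\xA3";     // pound sign in UTF-8

echo "euro  UTF-8:       ", bytes_in($euro, 'UTF-8'), "\n";       // e282ac
echo "euro  ISO-8859-15: ", bytes_in($euro, 'ISO-8859-15'), "\n"; // a4
echo "pound UTF-8:       ", bytes_in($pound, 'UTF-8'), "\n";      // c2a3
echo "pound ISO-8859-1:  ", bytes_in($pound, 'ISO-8859-1'), "\n"; // a3
```

ISO-8859-1 has no euro sign at all, which is one reason a host might prefer ISO-8859-15.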