JAN
HTML entity - character conversion in PHP
Posted by Joe under Programming
A quick quiz for PHP developers:
- What function in PHP’s standard library is used for converting characters in a string to HTML entities?
- What function does the opposite?
- By default, what character encoding is used by each of these functions?
If you can answer these questions without looking at the documentation, you’re doing better than I am. It’s symptomatic of the haphazard design of the standard library that the answer to the first question is htmlentities(), whereas the answer to the second is html_entity_decode(), which “converts HTML entities in a string to their applicable characters.”
As if this naming inconsistency weren’t inconvenient enough, the answer to the third question raises real problems: the two functions do not necessarily share a default character encoding, and the defaults depend on which version of PHP you’re running. htmlentities() defaults to using ISO-8859-1 encoding, but according to the documentation, “this default is very likely to change in future versions of PHP; the programmer is highly encouraged to specify a value.”
The situation with html_entity_decode() is even worse. Prior to PHP 5.4, this function gave its output in ISO-8859-1 by default, but as of that version the default has changed to UTF-8. Since I prefer to encode all text in UTF-8, under the old default “&nbsp;” and other entities are converted to invalid characters that show as a question mark or box unless the encoding is specified. The following code shows how this causes problems when changing PHP versions:
echo htmlentities(html_entity_decode('&nbsp;'));
Under 5.3, this will output “&nbsp;”, while under 5.4 it will give “&Acirc;&nbsp;”, showing how htmlentities() is still trying to interpret its input as ISO-8859-1, while html_entity_decode() is now defaulting to UTF-8. The UTF-8 encoding of a non-breaking space is the two bytes 0xC2 0xA0, which read as ISO-8859-1 are “Â” followed by a non-breaking space. You can simulate this result on an earlier version by setting the encoding for html_entity_decode() to UTF-8 and leaving the encoding for htmlentities() unspecified (i.e. ISO-8859-1).
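The mismatch can also be reproduced on any version by passing both encodings explicitly instead of relying on the defaults. The following is only a sketch of that simulation; the choice of ENT_QUOTES and the variable name $decoded are my own assumptions, not taken from the example above.

```php
<?php
// Decode the entity explicitly as UTF-8, so the result does not
// depend on the PHP version's default encoding.
$decoded = html_entity_decode('&nbsp;', ENT_QUOTES, 'UTF-8');

// A non-breaking space in UTF-8 is the two bytes 0xC2 0xA0.
echo bin2hex($decoded), "\n"; // c2a0

// Re-encode while treating those bytes as ISO-8859-1:
// 0xC2 is read as "Â" and 0xA0 as a non-breaking space.
echo htmlentities($decoded, ENT_QUOTES, 'ISO-8859-1'), "\n"; // &Acirc;&nbsp;

// Re-encode with the matching encoding: the round trip is clean.
echo htmlentities($decoded, ENT_QUOTES, 'UTF-8'), "\n"; // &nbsp;
```

Because every call names its encoding, the output should be the same regardless of which defaults the installed version uses.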
It seems that the only practical solution here is to define the character encoding explicitly; otherwise, the behaviour of the HTML entity encoding/decoding functions could vary from installation to installation. If you want to use UTF-8, even if you’re running a version <= 5.3 and don’t care about future compatibility, you’ll still have to specify the encoding every time.
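One way to avoid repeating the encoding at every call site is to wrap both functions once. This is a minimal sketch, assuming UTF-8 throughout; the names APP_CHARSET, encode_entities() and decode_entities() are hypothetical helpers, not part of PHP, and ENT_QUOTES is simply the flag chosen here.

```php
<?php
// Hypothetical constant: the single place where the encoding is chosen.
const APP_CHARSET = 'UTF-8';

// Hypothetical wrapper: convert characters to HTML entities,
// always with an explicit encoding.
function encode_entities($text)
{
    return htmlentities($text, ENT_QUOTES, APP_CHARSET);
}

// Hypothetical wrapper: convert HTML entities back to characters,
// always with the same explicit encoding.
function decode_entities($html)
{
    return html_entity_decode($html, ENT_QUOTES, APP_CHARSET);
}

// Round trip: decoding then encoding returns the original entities.
echo encode_entities(decode_entities('&nbsp;caf&eacute;')), "\n"; // &nbsp;caf&eacute;
```

The wrappers also sidestep the naming inconsistency between the two built-in functions, since the calling code only ever sees the paired helper names.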