PHP 8.1: HTML entity en/decode functions process single quotes and substitute by default

Version: 8.1
Type: Change

An "HTML Entity" is a text representation of a character that would be otherwise interpreted as HTML code.

For example, the < and > characters are used to define an HTML tag: <h1>. The HTML entity representations of < and > are &lt; and &gt;. These HTML entities can be used safely in an HTML document: browsers do not interpret them as HTML code, but display them as literal text in their original form. For instance, &lt;h1&gt; is not interpreted by the browser as an HTML tag, but is displayed as the literal text <h1>.
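The conversion described above can be tried with htmlspecialchars and htmlspecialchars_decode, two of the PHP functions this change affects. This is a minimal sketch; the variable name is illustrative, and the input contains no quotes, so the result should not depend on the PHP version.

```php
<?php
// Encode a tag so that a browser shows it as text instead of parsing it.
$encoded = htmlspecialchars('<h1>');
echo $encoded, PHP_EOL; // &lt;h1&gt;

// Decoding reverses the conversion.
echo htmlspecialchars_decode($encoded), PHP_EOL; // <h1>
```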

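PHP has built-in functions to convert certain characters to HTML entities and back (htmlspecialchars, htmlspecialchars_decode, htmlentities, html_entity_decode), and to inspect the translation table they use (get_html_translation_table).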


Prior to PHP 8.1, the default behavior of the htmlspecialchars and htmlentities functions was to convert the ", <, >, and & characters to their respective HTML entities, but not single quotes ('). Further, they returned an empty string if the given text contained an invalid character.


From PHP 8.1, the default value of the flags parameter of these functions is changed so that single quote characters are converted as well. In contrast to the prior behavior of returning an empty string for a string that contains an invalid character, the default behavior in PHP 8.1 and later is to substitute each invalid character with a � (U+FFFD) character.

The "�" character is known as the Unicode Replacement Character. It stands in for a character that cannot be represented, or for an invalid value.
The "�" character can also be written in PHP code as "\u{FFFD}", using PHP's Unicode codepoint escape syntax.
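As a small illustration of the escape sequence, the sketch below checks that "\u{FFFD}" expands to the three-byte UTF-8 encoding of U+FFFD. The variable name is illustrative only.

```php
<?php
// "\u{FFFD}" (available since PHP 7.0) expands to the UTF-8 bytes of U+FFFD.
$replacement = "\u{FFFD}";

var_dump($replacement === "\xEF\xBF\xBD"); // bool(true)
var_dump(strlen($replacement));             // int(3), because strlen() counts bytes
```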


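This essentially means the default value of the flags parameter of the functions mentioned above has changed in PHP 8.1. Both the old and the new defaults also include ENT_HTML401, which is omitted here for brevity: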

htmlspecialchars

- htmlspecialchars($string, ENT_COMPAT);
+ htmlspecialchars($string, ENT_QUOTES | ENT_SUBSTITUTE);

htmlspecialchars_decode

- htmlspecialchars_decode($string, ENT_COMPAT);
+ htmlspecialchars_decode($string, ENT_QUOTES | ENT_SUBSTITUTE);

htmlentities

- htmlentities($string, ENT_COMPAT);
+ htmlentities($string, ENT_QUOTES | ENT_SUBSTITUTE);

html_entity_decode

- html_entity_decode($string, ENT_COMPAT);
+ html_entity_decode($string, ENT_QUOTES | ENT_SUBSTITUTE);

get_html_translation_table

- get_html_translation_table(HTML_SPECIALCHARS, ENT_COMPAT);
+ get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES | ENT_SUBSTITUTE);
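One way to see what the two flag sets cover is to inspect the translation tables directly. The sketch below passes both flag sets explicitly, so it should behave the same before and after PHP 8.1; the variable names are illustrative.

```php
<?php
// Translation table under the pre-8.1 default flags.
$legacyTable = get_html_translation_table(HTML_SPECIALCHARS, ENT_COMPAT);

// Translation table under the PHP 8.1 default flags.
$currentTable = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES | ENT_SUBSTITUTE);

var_dump(array_key_exists("'", $legacyTable));  // bool(false)
var_dump(array_key_exists("'", $currentTable)); // bool(true)
echo $currentTable["'"], PHP_EOL;               // &#039;
```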

Impact of this change

With the default value of these functions changed from ENT_COMPAT to ENT_QUOTES | ENT_SUBSTITUTE, there are two major changes:

Single quote (') characters are converted to an HTML entity (&#039; with the default ENT_HTML401 document type; &apos; with ENT_XML1, ENT_XHTML, or ENT_HTML5).

With the default flag changed from ENT_COMPAT to ENT_QUOTES, this is a security improvement: prior to PHP 8.1, calling htmlspecialchars($user_input) with the default flags left single quotes unconverted, which is unsafe when the output is placed inside a single-quoted HTML attribute. Many frameworks and libraries mitigate this by explicitly using the htmlspecialchars($user_input, ENT_QUOTES) pattern, which converts both double and single quotes.
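The risk is easiest to see with a single-quoted attribute. In the sketch below, the input string and variable names are made up for illustration, and both flag sets are passed explicitly (with an explicit 'UTF-8' encoding) so the comparison does not depend on the PHP version.

```php
<?php
// Hypothetical user input that tries to break out of a single-quoted attribute.
$userInput = "x' onmouseover='alert(1)";

// Pre-8.1 default flags: single quotes pass through unchanged.
$legacy = htmlspecialchars($userInput, ENT_COMPAT, 'UTF-8');

// PHP 8.1 default flags: single quotes are converted to &#039;.
$current = htmlspecialchars($userInput, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

echo "<input value='{$legacy}'>", PHP_EOL;  // the attribute is closed early by the input
echo "<input value='{$current}'>", PHP_EOL; // the input stays inside the attribute value
```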

Invalid character substitution

With ENT_SUBSTITUTE as a default flag, if the text contains invalid characters, those characters are replaced with � characters, instead of the function returning a completely empty string.

$string = "Hello \x80";

The \x80 escape sequence above produces the single byte 0x80, which is not valid UTF-8 on its own.

$string = "Hello \x80, Good morning";
htmlspecialchars($string); // "" (prior to PHP 8.1)
htmlspecialchars($string, ENT_SUBSTITUTE); // "Hello �, Good morning"

In PHP 8.1, invalid characters are replaced with � characters because the ENT_SUBSTITUTE flag is part of the default value.

Note that the new default ENT_SUBSTITUTE option is different from the ENT_IGNORE option. ENT_IGNORE is potentially insecure, because it silently discards invalid characters and returns the remainder of the converted string. This allows an attacker to craft strings that may not be detected as malicious at first, but become malicious once the invalid characters are removed.
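The difference can be sketched with a naive blocklist check that runs before encoding. The payload and the check are hypothetical and only illustrate the mechanism; the flags and encoding are passed explicitly.

```php
<?php
// Hypothetical payload: an invalid byte splits the word a filter looks for.
$payload = "java\x80script:alert(1)";

// A naive (hypothetical) blocklist check does not find "javascript:".
$passesFilter = stripos($payload, 'javascript:') === false;

// ENT_IGNORE drops the invalid byte, which reassembles the blocked word.
$ignored = htmlspecialchars($payload, ENT_QUOTES | ENT_IGNORE, 'UTF-8');

// ENT_SUBSTITUTE keeps a visible U+FFFD marker where the invalid byte was.
$substituted = htmlspecialchars($payload, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($passesFilter);                                 // bool(true)
var_dump($ignored === 'javascript:alert(1)');            // bool(true)
var_dump($substituted === "java\u{FFFD}script:alert(1)"); // bool(true)
```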


Backwards Compatibility Impact

The default flags of the five functions are changed. This means that any function call that does not explicitly pass the flags parameter can return different results.

Many frameworks and libraries already pass ENT_QUOTES | ENT_SUBSTITUTE explicitly, and will not see any difference in functionality.

Other frameworks and libraries pass only ENT_QUOTES, because the previous default ENT_COMPAT does not convert single quotes. These calls are not affected by the change either: explicitly passed flags replace the default value entirely, so such calls continue to return an empty string for invalid UTF-8 input unless ENT_SUBSTITUTE is added to the flags.

It is possible to revert PHP 8.1's change by explicitly passing the previous default value (ENT_COMPAT) as the flags parameter. However, the new default value is highly recommended because it is the safer choice.
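A minimal sketch of pinning the flags so that behavior does not depend on the PHP version follows. The helper names legacy_escape and escape are hypothetical and not part of PHP; the 'UTF-8' encoding argument is also an assumption about the application.

```php
<?php
// Hypothetical helper that keeps the pre-8.1 behavior on every PHP version.
function legacy_escape(string $text): string
{
    return htmlspecialchars($text, ENT_COMPAT | ENT_HTML401, 'UTF-8');
}

// Hypothetical helper that applies the PHP 8.1 defaults on every PHP version.
function escape(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401, 'UTF-8');
}

echo legacy_escape("It's"), PHP_EOL; // It's
echo escape("It's"), PHP_EOL;        // It&#039;s
```

Passing the flags explicitly in one place also makes it easier to audit how the application escapes output.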


Implementation