PHP 8.4: MBString: Unicode Character Database updated to version 16

Version: 8.4
Type: Change

The MBString extension contains an extracted subset of data from the Unicode specification, which it uses for operations such as converting a string to upper or lowercase and determining the display width of a string (useful for certain East Asian scripts).

In PHP 8.3, the MBString extension included the data from the Unicode 14.0 standard, released in September 2021. In PHP 8.4, the Unicode Character Database (UCD) data source was updated from 14.0 to 16.0, released in September 2024. Unicode 16.0 was the latest UCD release available at the time.

Unicode 15.0, 15.1, and 16.0 add 4,489, 627, and 5,185 new characters, respectively. Further, the three Unicode versions combined add support for 9 additional scripts. For the MBString extension, however, the updates that matter are the character case folding rules, which affect functions such as mb_strtolower and mb_strtoupper, and the East Asian Width value assignments, which determine whether a given character is considered normal width or wide (mb_strwidth).
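As a minimal sketch of the functions mentioned above, the following snippet calls them on characters whose case mappings and widths have long been stable, so the results are not expected to differ between PHP versions. The sample strings are chosen only for illustration, and the encoding is passed explicitly to avoid depending on configuration:

```php
<?php
// Case conversion relies on the case mapping data bundled with MBString.
var_dump(mb_strtolower("ÀÉÎ", 'UTF-8')); // string(6) "àéî"
var_dump(mb_strtoupper("àéî", 'UTF-8')); // string(6) "ÀÉÎ"

// Width measurement relies on East Asian Width assignments:
// normal-width characters count as 1, wide characters count as 2.
var_dump(mb_strwidth("abc", 'UTF-8'));    // int(3)
var_dump(mb_strwidth("日本語", 'UTF-8')); // int(6)
```

Only characters whose case mapping or width assignment changed between Unicode 14.0 and 16.0 can produce different results on PHP 8.4.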


With the Unicode 16 support, the MBString extension can handle all of the latest Emoji characters, and has the most up-to-date case folding and character width information.

There are no direct changes to any of the MBString functions. The Unicode Character Database data is part of the MBString extension itself, and is used by all MBString functions in PHP 8.4.

Backward Compatibility Impact

UCD 16 is currently the latest version of the Unicode standard. Because there are no changes to any function parameters or return types, this update is unlikely to cause backward compatibility issues.

However, note that there is a slight chance that strings that were case-converted or width-measured on an older PHP version yield different results in PHP 8.4 and later.


Some examples:

echo "Emoji: \u{1F6DC}, width: " . mb_strwidth("\u{1F6DC}");
// Emoji: 🛜, width: 2

The WiFi Emoji (🛜) was added in UCD 15. Older PHP versions do not recognize the 1F6DC codepoint, and return width: 1 instead.


var_dump(mb_strtoupper("\u{019b}") === "\u{a7dc}");
// PHP 8.4 and later: bool(true)
// PHP 8.3 and older: bool(false)

In UCD 16, there is a new case mapping for the U+019B character (ƛ - Latin Small Letter Lambda with Stroke). PHP 8.3 and older versions do not support this case mapping, so mb_strtoupper("\u{019b}") returns "\u{019b}" itself. On PHP 8.4 and later with UCD 16, mb_strtoupper("\u{019b}") returns "\u{a7dc}".


The snippets above use the Unicode codepoint escape syntax (\u{...}) with hexadecimal codepoints, which is available in double-quoted strings.
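If an application needs to know which case mapping data the running MBString build carries, one option is to probe the U+019B mapping described above. This is only a sketch: the function name mbstringHasUnicode16CaseData is hypothetical and not part of PHP, and it checks a single mapping rather than the full data set.

```php
<?php
// Hypothetical helper, not part of PHP: probes the uppercase mapping that
// Unicode 16 introduced for U+019B to guess which UCD data MBString uses.
function mbstringHasUnicode16CaseData(): bool
{
    // With Unicode 16 data, U+019B uppercases to U+A7DC.
    // With older data, the character is returned unchanged.
    return mb_strtoupper("\u{019B}", 'UTF-8') === "\u{A7DC}";
}

var_dump(mbstringHasUnicode16CaseData());
```

Based on the behavior described above, this should report true on PHP 8.4 and later, and false on PHP 8.3 and older.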


Parity with other functionality

The Intl and PCRE extensions also use the Unicode Character Database in their operations.

In PHP 8.4, the bundled PCRE2 library is updated to 10.44, which is built with UCD 15.

The Intl extension relies on the ICU library for Unicode data tables. On a default build, the UCD version of the underlying ICU library is likely to be UCD 15.

Although unlikely, there can be edge cases where the MBString, Intl, and PCRE extensions handle the same characters differently. For example, "\u{a7dc}", the new Unicode 16 uppercase counterpart of the "\u{019b}" character, is not recognized as an uppercase character by the PCRE extension:

preg_match('/\p{Ll}/u', "\u{019b}"); // Unicode aware, match any lowercase character
// Matches in PHP 8.3 and 8.4.
preg_match('/\p{Lu}/u', mb_strtoupper("\u{019b}")); // Unicode aware, match any uppercase character
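// Does not match in either PHP 8.3 or 8.4. On PHP 8.3, mb_strtoupper returns the lowercase character unchanged.
// On PHP 8.4, it returns "\u{a7dc}", which PCRE does not yet consider an uppercase character.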

Note that the UCD version mismatch between these extensions is not new in PHP 8.4, and such a mismatch rarely causes issues in practice.
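To see which library versions a given PHP build uses, the details can be gathered at run time. The following is a rough sketch, assuming the Intl extension may or may not be loaded. The function name unicodeVersionReport and the array keys are hypothetical; PCRE_VERSION, INTL_ICU_VERSION, and IntlChar::getUnicodeVersion() are provided by PHP and the Intl extension.

```php
<?php
// Hypothetical helper, not part of PHP: collects version details of the
// libraries behind the PCRE, Intl, and MBString extensions.
function unicodeVersionReport(): array
{
    $report = [
        'php'                  => PHP_VERSION,
        // PCRE2 library version string, e.g. "10.44 2024-06-07".
        'pcre'                 => defined('PCRE_VERSION') ? PCRE_VERSION : null,
        // ICU library version; only defined when Intl is loaded.
        'icu'                  => defined('INTL_ICU_VERSION') ? INTL_ICU_VERSION : null,
        'icu_unicode'          => null,
        'mb_upper_is_pcre_lu'  => null,
    ];

    if (class_exists('IntlChar')) {
        // Returns the Unicode version of the ICU data as an array of integers.
        $report['icu_unicode'] = implode('.', IntlChar::getUnicodeVersion());
    }

    if (function_exists('mb_strtoupper')) {
        // Cross-check: does PCRE treat MBString's uppercase result as \p{Lu}?
        $upper = mb_strtoupper("\u{019B}", 'UTF-8');
        $report['mb_upper_is_pcre_lu'] = preg_match('/^\p{Lu}$/u', $upper) === 1;
    }

    return $report;
}

print_r(unicodeVersionReport());
```

The cross-check at the end reuses the U+019B example, so a false value there does not necessarily indicate a problem. It only shows that MBString and PCRE disagree about that one character.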


Implementation