PHP 8.4: New grapheme_str_split function

Version: 8.4
Type: New Feature

The Intl extension in PHP 8.4 adds a new function named grapheme_str_split that splits a given string into an array of graphemes.

A grapheme is the smallest meaningful and functional unit of a writing system. The mb_str_split function from the Mbstring extension has similar semantics, with one major difference: mb_str_split splits a string into individual Unicode code points, while grapheme_str_split splits it into the functional units of the writing system.

The distinction matters when handling complex scripts and Emojis with modifiers. Every Unicode code point is a valid character on its own, but in these cases a single visible character can be built from several code points. Splitting such a string with mb_str_split separates base characters from their modifiers, such as vowel signs or skin-tone modifiers.

For example, the Sinhalese language word අයේෂ් (pronounced "Ayesh" in English) comprises three units in the Sinhalese script: අ + යේ + ෂ්. අ is a stand-alone character, but the යේ and ෂ් characters use additional Unicode code points as vowel modifiers. grapheme_str_split splits the word correctly into individual characters that adhere to the Sinhalese writing system, while mb_str_split splits it into five individual Unicode code points: අ + ය + ේ + ෂ + ් (U+0D85, U+0DBA, U+0DDA, U+0DC2, U+0DCA).
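To make the difference concrete, the following sketch lists the code points that mb_str_split produces for the same word, then splits it by grapheme when grapheme_str_split is available. It assumes the Mbstring extension is loaded. The function_exists guard and the variable names are illustrative choices, not part of the original example.

```php
<?php
// The word අයේෂ් written as explicit code points so the input is unambiguous.
$word = "\u{0D85}\u{0DBA}\u{0DDA}\u{0DC2}\u{0DCA}";

// mb_str_split() yields one array element per Unicode code point.
$codePoints = array_map(
    static fn (string $char): string => sprintf('%04X', mb_ord($char, 'UTF-8')),
    mb_str_split($word, 1, 'UTF-8')
);
echo implode(' + ', $codePoints), PHP_EOL;
// 0D85 + 0DBA + 0DDA + 0DC2 + 0DCA

// grapheme_str_split() needs PHP 8.4+ with the Intl extension, hence the guard.
if (function_exists('grapheme_str_split')) {
    echo implode(' + ', grapheme_str_split($word)), PHP_EOL;
}
```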


Here are a few more examples in various languages and Emoji:

| String | Unicode representation | grapheme_str_split output | Unicode representation | mb_str_split output | Unicode representation |
| --- | --- | --- | --- | --- | --- |
| PHP | 0050 0048 0050 | P + H + P | 0050 + 0048 + 0050 | P + H + P | 0050 + 0048 + 0050 |
| 你好 | 4F60 597D | 你 + 好 | 4F60 + 597D | 你 + 好 | 4F60 + 597D |
| අයේෂ් | 0D85 0DBA 0DDA 0DC2 0DCA | අ + යේ + ෂ් | 0D85 + 0DBA 0DDA + 0DC2 0DCA | අ + ය + ේ + ෂ + ් | 0D85 + 0DBA + 0DDA + 0DC2 + 0DCA |
| สวัสดี | 0E2A 0E27 0E31 0E2A 0E14 0E35 | ส + วั + ส + ดี | 0E2A + 0E27 0E31 + 0E2A + 0E14 0E35 | ส + ว + ั + ส + ด + ี | 0E2A + 0E27 + 0E31 + 0E2A + 0E14 + 0E35 |
| 👭🏻👰🏿‍♂️ | 1F46D 1F3FB 1F470 1F3FF 200D 2642 FE0F | 👭🏻 + 👰🏿‍♂️ | 1F46D 1F3FB + 1F470 1F3FF 200D 2642 FE0F | 👭 + 🏻 + 👰 + 🏿 + U+200D + ♂ + U+FE0F | 1F46D + 1F3FB + 1F470 + 1F3FF + 200D + 2642 + FE0F |
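The Emoji row can be checked by counting the same string at three levels: bytes, code points, and graphemes. The sketch below assumes the Mbstring extension for the code point count and calls grapheme_strlen from the Intl extension only when it is available. The variable names are illustrative.

```php
<?php
// 👭🏻 followed by 👰🏿‍♂️, written as explicit code points.
$emoji = "\u{1F46D}\u{1F3FB}\u{1F470}\u{1F3FF}\u{200D}\u{2642}\u{FE0F}";

$byteCount      = strlen($emoji);             // number of UTF-8 bytes
$codePointCount = mb_strlen($emoji, 'UTF-8'); // number of Unicode code points

printf("bytes: %d, code points: %d\n", $byteCount, $codePointCount);
// bytes: 25, code points: 7

// grapheme_strlen() comes from the Intl extension; how Emoji sequences are
// grouped depends on the ICU version PHP is linked against.
if (function_exists('grapheme_strlen')) {
    printf("graphemes: %d\n", grapheme_strlen($emoji));
}
```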

grapheme_str_split Synopsis

The grapheme_str_split function is similar to the mb_str_split function, and accepts an optional int $length parameter that sets the number of graphemes in each chunk. If $length exceeds the number of graphemes remaining, either in the whole string or in the final chunk, the remaining graphemes are returned as a single element.

Passing an empty string returns an empty array.

/**
 * Splits a string into an array of individual or chunks of graphemes.
 *
 * @param string $string The string to split into individual graphemes
 *  or chunks of graphemes.
 * @param int $length If specified, each element of the returned array
 *  will be composed of multiple graphemes instead of a single
 *  grapheme.
 *
 * @return array|false
 */
function grapheme_str_split(string $string, int $length = 1): array|false {}

grapheme_str_split Usage Examples

grapheme_str_split("PHP");
// ["P", "H", "P"]
grapheme_str_split("你好");
// ["你", "好"]
grapheme_str_split("你好", length: 4);
// ["你好"]
grapheme_str_split("สวัสดี");
// ["ส", "วั", "ส", "ดี"]
grapheme_str_split("අයේෂ්");
// ["අ", "යේ", "ෂ්"]
grapheme_str_split("👭🏻👰🏿‍♂️");
// ["👭🏻", "👰🏿‍♂️"]
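The $length parameter and the edge cases can be exercised as in the following sketch. It assumes PHP 8.4 or later with the Intl extension and does nothing otherwise. The ValueError for a length of 0 is inferred from the error message reproduced in the polyfill later in this article.

```php
<?php
// Requires PHP 8.4+ with the Intl extension; skipped otherwise.
if (function_exists('grapheme_str_split')) {
    // สวัสดี has four graphemes: ส, วั, ส, ดี.
    $greeting = "\u{0E2A}\u{0E27}\u{0E31}\u{0E2A}\u{0E14}\u{0E35}";

    // Chunks of two graphemes each.
    var_dump(grapheme_str_split($greeting, 2));

    // Four graphemes do not divide evenly by three, so the last chunk is shorter.
    var_dump(grapheme_str_split($greeting, 3));

    // An empty string yields an empty array.
    var_dump(grapheme_str_split(''));

    // A length below 1 is expected to be rejected with a ValueError.
    try {
        grapheme_str_split($greeting, 0);
    } catch (ValueError $e) {
        echo $e->getMessage(), PHP_EOL;
    }
}
```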

Backward Compatibility Impact

The grapheme_str_split function is new in the Intl extension, and is declared in the global namespace. Unless an application already declares a function with the exact same name, this change should not introduce any backward-compatibility issues.

grapheme_str_split polyfill

It is possible to polyfill the grapheme_str_split function using Unicode regular expressions. The \X escape sequence matches a complete grapheme, and is used as the base for the following polyfill.

Note that \X does not correctly split complex Emojis, such as Emojis with skin modifiers, on PCRE2 library versions <= 10.43.

/**
 * Splits a string into an array of individual or chunks of graphemes.
 *
 * @param string $string The string to split into individual graphemes
 *  or chunks of graphemes.
 * @param int $length If specified, each element of the returned array
 *  will be composed of multiple graphemes instead of a single
 *  grapheme.
 *
 * @return array|false
 */
function grapheme_str_split(string $string, int $length = 1): array|false {
    // Reject 0 as well as negative lengths, matching the error message below.
    if ($length < 1 || $length > 1073741823) {
        throw new \ValueError('grapheme_str_split(): Argument #2 ($length) must be greater than 0 and less than or equal to 1073741823.');
    }
    if ($string === '') {
        return [];
    }

    preg_match_all('/\X/u', $string, $matches);

    if (empty($matches[0])) {
        return false;
    }

    if ($length === 1) {
        return $matches[0];
    }

    $chunks = array_chunk($matches[0], $length);

    array_walk($chunks, static function(&$value) {
        $value = implode('', $value);
    });

    return $chunks;
}
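Declaring a function named grapheme_str_split fails with a redeclaration error when the native function already exists. One way around this, sketched below, is a differently named wrapper that prefers the native function and falls back to \X. The split_graphemes name is hypothetical, and the fallback carries the same PCRE2 limitations for complex Emojis noted above.

```php
<?php
/**
 * Hypothetical helper: uses the native grapheme_str_split() when available,
 * and a \X based fallback otherwise.
 */
function split_graphemes(string $string, int $length = 1): array|false
{
    if (function_exists('grapheme_str_split')) {
        return grapheme_str_split($string, $length);
    }

    if ($length < 1) {
        throw new ValueError('Length must be greater than 0.');
    }
    if ($string === '') {
        return [];
    }

    // preg_match_all() returns false on failure, e.g. for invalid UTF-8 input.
    if (preg_match_all('/\X/u', $string, $matches) === false) {
        return false;
    }

    // Group the graphemes into chunks of $length and join each chunk.
    return array_map(
        static fn (array $chunk): string => implode('', $chunk),
        array_chunk($matches[0], $length)
    );
}

var_dump(split_graphemes('PHP'));    // three elements: "P", "H", "P"
var_dump(split_graphemes('PHP', 2)); // two elements: "PH", "P"
```

The wrapper keeps call sites unchanged once every supported environment runs PHP 8.4 with Intl, at which point it can be replaced by direct calls to grapheme_str_split.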
