PHP 8.4: New grapheme_str_split function
The Intl extension in PHP 8.4 adds a new function named grapheme_str_split that splits a given string into an array of graphemes.
A grapheme is the smallest functional unit of a writing system: what a reader perceives as a single character. The mb_str_split function from the Mbstring extension has similar semantics, with one major difference: mb_str_split splits a string into individual Unicode code points, while grapheme_str_split splits it into functional units of the writing system.
The difference between Unicode code points and graphemes matters when handling complex scripts and emoji with modifiers. Each Unicode code point is a valid character on its own, but in complex scripts and emoji sequences, splitting a string with mb_str_split can separate a base character from its modifiers, such as vowel signs.
For example, the Sinhalese word අයේෂ් (pronounced "Ayesh" in English) comprises three units in the Sinhalese script: අ + යේ + ෂ්. අ is a stand-alone character, but යේ and ෂ් each use an additional Unicode code point as a modifier. grapheme_str_split splits the word correctly into the individual characters of the Sinhalese writing system, while mb_str_split splits it into individual Unicode code points: අ + ය + ේ + ෂ + ්.
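The difference can be checked by running both functions on the same word. The following is a minimal sketch, assuming PHP 8.4 or later with both the Intl and Mbstring extensions loaded; the variable names are illustrative only, and the word is written with code point escapes to avoid copy-and-paste damage.

```php
<?php
// Assumes PHP >= 8.4 with the Intl and Mbstring extensions enabled.
// The Sinhalese word from the example above: 0D85 0DBA 0DDA 0DC2 0DCA.
$word = "\u{0D85}\u{0DBA}\u{0DDA}\u{0DC2}\u{0DCA}";

// Split into graphemes (Intl).
$graphemes = grapheme_str_split($word);

// Split into Unicode code points (Mbstring).
$codePoints = mb_str_split($word, 1, 'UTF-8');

echo count($graphemes), ' graphemes: ', implode(' + ', $graphemes), PHP_EOL;
echo count($codePoints), ' code points: ', implode(' + ', $codePoints), PHP_EOL;
```

The grapheme count should match the three units described above, while the code point count should be five.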
Here are a few more examples in various languages and Emoji:
| String (code points) | grapheme_str_split output (code points) | mb_str_split output (code points) |
|---|---|---|
| PHP (0050 0048 0050) | P + H + P (0050 + 0048 + 0050) | P + H + P (0050 + 0048 + 0050) |
| 你好 (4F60 597D) | 你 + 好 (4F60 + 597D) | 你 + 好 (4F60 + 597D) |
| අයේෂ් (0D85 0DBA 0DDA 0DC2 0DCA) | අ + යේ + ෂ් (0D85 + 0DBA 0DDA + 0DC2 0DCA) | අ + ය + ේ + ෂ + ් (0D85 + 0DBA + 0DDA + 0DC2 + 0DCA) |
| สวัสดี (0E2A 0E27 0E31 0E2A 0E14 0E35) | ส + วั + ส + ดี (0E2A + 0E27 0E31 + 0E2A + 0E14 0E35) | ส + ว + ั + ส + ด + ี (0E2A + 0E27 + 0E31 + 0E2A + 0E14 + 0E35) |
| 👭🏻👰🏿♂️ (1F46D 1F3FB 1F470 1F3FF 200D 2642 FE0F) | 👭🏻 + 👰🏿♂️ (1F46D 1F3FB + 1F470 1F3FF 200D 2642 FE0F) | 👭 + 🏻 + 👰 + 🏿 + (ZWJ) + ♂ + (VS16) (1F46D + 1F3FB + 1F470 + 1F3FF + 200D + 2642 + FE0F) |
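The code point values listed in the table can be reproduced for any string. The sketch below assumes the Mbstring extension is available; code_points is a hypothetical helper written for this illustration, not a PHP built-in.

```php
<?php
// Hypothetical helper (not part of PHP): formats a string as space-separated
// hexadecimal code points, zero-padded to at least four digits.
function code_points(string $text): string
{
    $hex = array_map(
        static fn (string $char): string => sprintf('%04X', mb_ord($char, 'UTF-8')),
        mb_str_split($text, 1, 'UTF-8')
    );

    return implode(' ', $hex);
}

echo code_points("PHP"), PHP_EOL;                  // 0050 0048 0050
echo code_points("\u{1F46D}\u{1F3FB}"), PHP_EOL;   // 1F46D 1F3FB
```

Applying the helper to each element returned by grapheme_str_split or mb_str_split gives the per-element values shown in the corresponding table column.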
grapheme_str_split Synopsis
The grapheme_str_split function is similar to the mb_str_split function, and accepts an optional int $length parameter that sets the number of graphemes in each chunk. If fewer than $length graphemes remain, the final chunk (or the whole string, if it is shorter than $length) is returned as-is.
Passing an empty string returns an empty array.
/**
 * Splits a string into an array of individual or chunks of graphemes.
 *
 * @param string $string The string to split into individual graphemes
 *    or chunks of graphemes.
 * @param int $length If specified, each element of the returned array
 *    will be composed of multiple graphemes instead of a single
 *    grapheme.
 *
 * @return array|false
 */
function grapheme_str_split(string $string, int $length = 1): array|false {}
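The effect of the $length parameter described above can be sketched as follows, assuming PHP 8.4 or later with the Intl extension loaded. The Thai word is the four-grapheme example from the table, written with code point escapes; the variable names are illustrative.

```php
<?php
// Assumes PHP >= 8.4 with the Intl extension enabled.
// Thai word from the table (4 graphemes): 0E2A 0E27 0E31 0E2A 0E14 0E35.
$word = "\u{0E2A}\u{0E27}\u{0E31}\u{0E2A}\u{0E14}\u{0E35}";

$pairs   = grapheme_str_split($word, 2);  // two chunks of two graphemes each
$triples = grapheme_str_split($word, 3);  // three graphemes, then the one left over
$whole   = grapheme_str_split($word, 10); // length exceeds the string: a single chunk
$empty   = grapheme_str_split('');        // empty input gives an empty array

var_dump(count($pairs), count($triples), count($whole), $empty);
```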
grapheme_str_split Usage Examples
grapheme_str_split("PHP");
// ["P", "H", "P"]
grapheme_str_split("你好");
// ["你", "好"]
grapheme_str_split("你好", length: 4);
// ["你好"]
grapheme_str_split("สวัสดี");
// ["ส", "วั", "ส", "ดี"]
grapheme_str_split("අයේෂ්");
// ["අ", "යේ", "ෂ්"]
grapheme_str_split("👭🏻👰🏿♂️");
// ["👭🏻", "👰🏿♂️"]
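One practical use of the function is reversing a string without separating base characters from their modifiers. The following is a minimal sketch, assuming PHP 8.4 or later with the Intl extension; grapheme_reverse is a hypothetical helper name introduced here, not a PHP built-in.

```php
<?php
// Assumes PHP >= 8.4 with the Intl extension enabled.
// Hypothetical helper (not part of PHP): reverses a string grapheme by grapheme.
function grapheme_reverse(string $text): string
{
    $graphemes = grapheme_str_split($text);
    if ($graphemes === false) {
        throw new InvalidArgumentException('Input could not be split into graphemes.');
    }

    return implode('', array_reverse($graphemes));
}

// Sinhalese example word: 0D85 | 0DBA 0DDA | 0DC2 0DCA.
$word = "\u{0D85}\u{0DBA}\u{0DDA}\u{0DC2}\u{0DCA}";
echo grapheme_reverse($word), PHP_EOL;
```

Each modifier stays attached to its base character, which a byte-wise or code point-wise reversal would not guarantee.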
Backward Compatibility Impact
The grapheme_str_split function is new in the Intl extension, and is declared in the global namespace. Unless an application already declares a function with the exact same name, this change should not introduce any backward-compatibility issues.
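Whether a function with this name is already declared, and where it comes from, can be checked at runtime. This is a small sketch using function_exists and ReflectionFunction; the $origin variable and its labels are illustrative.

```php
<?php
// Report whether grapheme_str_split is already defined, and by whom.
if (function_exists('grapheme_str_split')) {
    $reflection = new ReflectionFunction('grapheme_str_split');
    // isInternal() is true for functions provided by PHP itself or an extension.
    $origin = $reflection->isInternal() ? 'built-in' : 'user-defined';
} else {
    $origin = 'not defined';
}

echo "grapheme_str_split: {$origin}", PHP_EOL;
```

A result of user-defined on a code base being upgraded to PHP 8.4 with Intl enabled would indicate a name collision to resolve.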
grapheme_str_split polyfill
It is possible to polyfill the grapheme_str_split function using Unicode regular expressions. The \X escape sequence matches a complete grapheme, and can be used as the basis for the polyfill.
Note that on PCRE2 library versions <= 10.43, \X does not correctly split complex emoji, such as emoji with skin-tone modifiers, so the polyfill below inherits that limitation.
/**
 * Splits a string into an array of individual or chunks of graphemes.
 *
 * @param string $string The string to split into individual graphemes
 *    or chunks of graphemes.
 * @param int $length If specified, each element of the returned array
 *    will be composed of multiple graphemes instead of a single
 *    grapheme.
 *
 * @return array|false
 */
function grapheme_str_split(string $string, int $length = 1): array|false {
    // Reject zero as well as negative lengths, as the error message states.
    if ($length < 1 || $length > 1073741823) {
        throw new \ValueError('grapheme_str_split(): Argument #2 ($length) must be greater than 0 and less than or equal to 1073741823.');
    }

    if ($string === '') {
        return [];
    }

    // \X matches one grapheme; the u flag treats the subject as UTF-8.
    preg_match_all('/\X/u', $string, $matches);

    // No matches (for example, on invalid UTF-8 input) is reported as failure.
    if (empty($matches[0])) {
        return false;
    }

    if ($length === 1) {
        return $matches[0];
    }

    // Group the graphemes into chunks, then join each chunk back into a string.
    $chunks = array_chunk($matches[0], $length);
    array_walk($chunks, static function(&$value) {
        $value = implode('', $value);
    });

    return $chunks;
}
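The \X matching that the polyfill relies on can be checked in isolation. The sketch below assumes only the PCRE extension with Unicode support, which PHP builds normally include; the variable names are illustrative.

```php
<?php
// Thai word from the table: 0E2A 0E27 0E31 0E2A 0E14 0E35.
$word = "\u{0E2A}\u{0E27}\u{0E31}\u{0E2A}\u{0E14}\u{0E35}";

// \X matches one grapheme; the u flag enables UTF-8 mode.
$count = preg_match_all('/\X/u', $word, $matches);

echo $count, PHP_EOL;                        // 4
echo implode(' + ', $matches[0]), PHP_EOL;

// With the u flag, invalid UTF-8 input makes preg_match_all() return false,
// which is the case the polyfill reports by returning false.
$invalid = preg_match_all('/\X/u', "\xFF", $ignored);
var_dump($invalid);                          // bool(false)
```

Results for emoji sequences depend on the PCRE2 version PHP is linked against, so they are left out of this check.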