PHP 8.4: PCRE2 Upgrade and Regular Expression Changes

Version: 8.4
Type: New Feature

PHP's regular expression capabilities, available as preg_* functions, rely on the PCRE (Perl-Compatible Regular Expressions) library. In PHP 7.3, PHP started to use PCRE2.

PHP has kept up with minor PCRE2 updates, such as PCRE2 10.39 in 2021 and PCRE2 10.40 in 2022. However, PCRE2 10.43 and 10.44 bring significant changes, including changes to the regular expression syntax the library supports.

The PCRE2 library is included in the PHP source tree, so the upgrade does not change compile-time dependencies. However, some of the changes may not be compatible with existing regular expressions, or with other flavors of regular expression engines.

Regular Expression Syntax Changes

The following changes to the regular expression syntax are part of the PCRE2 10.44 update. They take effect in PHP 8.4 because that is the first PHP version to bundle PCRE2 10.44.

Quantifiers without minimum quantity

Prior to PHP 8.4, a quantifier without a minimum quantity was not recognized as a quantifier. In PHP 8.4, a quantifier without a minimum quantity (e.g. /a{,3}/) is treated as having a minimum of zero (i.e. /a{0,3}/).

The following snippet shows preg_match calls with regexps that match zero to three occurrences of the character a.

preg_match('/a{,3}/', 'aaa'); // Only valid in PHP 8.4 and later
preg_match('/a{0,3}/', 'aaa'); // Valid in PHP 8.4 and older

This syntax change is in line with Perl 5.34.0. Python also supports the {,3} syntax, but other languages such as JavaScript, Go, and Java do not.

Spaces allowed in curly braces

PHP 8.4 allows space and horizontal tab characters after the opening curly brace and before the closing curly brace of a quantifier, and around the comma separating the quantities. This follows Perl 5.34.0, which permits whitespace in the same positions.

preg_match('/a{ 5,10 }/',    'aaaaaaa'); // Only valid in PHP 8.4 and later
preg_match('/a{5 ,10}/',     'aaaaaaa'); // Only valid in PHP 8.4 and later
preg_match('/a{ 5, 10 }/',   'aaaaaaa'); // Only valid in PHP 8.4 and later
preg_match('/a{ 5, 10   }/', 'aaaaaaa'); // Only valid in PHP 8.4 and later

Prior to PHP 8.4/PCRE2 10.43, the braces in the patterns above are not recognized as quantifiers and are matched as literal text instead.
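Because the same pattern text changes meaning between versions without raising an error, it can be useful to probe how the running engine interprets it. The following is a minimal sketch rather than an established API: the helper name bracesAreQuantifier() is hypothetical, and it relies only on the behavior described above, namely that older versions match the braces literally.

```php
<?php
// Hypothetical helper: reports whether the running PCRE2 build treats the
// given brace expression (e.g. "{,3}" or "{ 1,2 }") as a quantifier.
function bracesAreQuantifier(string $braces): bool
{
    // When the braces are taken literally, the anchored pattern matches
    // the letter "a" followed by the brace text itself.
    $pattern = '/^a' . $braces . '$/';
    $matchesLiteral = preg_match($pattern, 'a' . $braces) === 1;

    return !$matchesLiteral;
}

var_dump(bracesAreQuantifier('{,3}'));    // true on PHP 8.4 and later, false before
var_dump(bracesAreQuantifier('{ 1,2 }')); // true on PHP 8.4 and later, false before
var_dump(bracesAreQuantifier('{1,2}'));   // bool(true): a quantifier on every version
```

A check like this can run once at startup to decide whether user-supplied or legacy patterns need reviewing before an upgrade.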

Unicode 15 Update

PHP 8.4's bundled PCRE2 now supports Unicode 15. Apart from the new Emoji and glyph updates in Unicode 15, this includes support for new Unicode character classes.

For example, scripts added in Unicode 15 can now be used as named character classes. Unicode 15 adds the Kawi (U+11F00 to U+11F5F) and Nag Mundari (U+1E4D0 to U+1E4FF) scripts, which means they can be used in regexps as well:

preg_match('/\p{Kawi}/u', 'abc');
preg_match('/\p{Nag_Mundari}/u', 'abc');

In PHP versions prior to PHP 8.4, this results in a warning because these character classes are unknown to PCRE2.

preg_match(): Compilation failed: unknown property after \P or \p at offset ...

Older PHP versions can continue to match these characters and emojis, but instead of using named character classes, the regexp has to define the code point range:

preg_match('/\p{Kawi}/u', 'abc');
// is equivalent to:
preg_match('/[\x{11F00}-\x{11F5F}]/u', 'abc');
preg_match('/\p{Nag_Mundari}/u', 'abc');
// is equivalent to:
preg_match('/[\x{1E4D0}-\x{1E4FF}]/u', 'abc');
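Code that has to run on both older and newer PHP versions could choose between the two forms at runtime. The sketch below assumes a hypothetical helper named scriptPattern(); it probes whether the script name compiles and otherwise falls back to the explicit range shown above.

```php
<?php
// Hypothetical helper: returns a pattern using \p{<script>} when the running
// PCRE2 build knows the script name, and the explicit range otherwise.
function scriptPattern(string $script, string $fallbackRange): string
{
    $named = '/\p{' . $script . '}/u';

    // preg_match() returns false when the pattern fails to compile;
    // @ suppresses the "unknown property" warning during the probe.
    if (@preg_match($named, '') !== false) {
        return $named;
    }

    return '/[' . $fallbackRange . ']/u';
}

$kawi = scriptPattern('Kawi', '\x{11F00}-\x{11F5F}');

var_dump(preg_match($kawi, "\u{11F00}")); // int(1): U+11F00 is a Kawi character
var_dump(preg_match($kawi, 'abc'));       // int(0): Latin letters are not
```

Note that the range fallback covers the whole block, including code points that are not assigned to the script, so it is an approximation of the named class.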

Further, the Unicode 15 update brings more changes to existing character classes, new Emojis, and a new ZWJ pattern for Emoji combinations.

Regex \w in Unicode mode

Prior to PHP 8.4, /\w/u is equivalent to /[\p{L}\p{N}_]/u. That is, \w is a shorthand for \p{L} (a letter in any script), \p{N} (a numeric character in any script), and the underscore (_).

In PHP 8.4 and later, \w additionally includes \p{Mn} (Non-spacing Mark) and \p{Pc} (Connector Punctuation). This makes \w equivalent to /[\p{L}\p{N}_\p{Mn}\p{Pc}]/u. The new behavior matches Perl.

As of Unicode 15, the Mn character category contains 1,839 entries and the Pc category contains 10 entries. Because the underscore is itself one of the Pc entries, /\w/u now matches 1,848 additional characters, which can have a significant impact on existing regexps.

preg_match('/\w/u', "\u{0300}"); // PHP  < 8.4: Does not match
preg_match('/\w/u', "\u{0300}"); // PHP >= 8.4: Does match
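Code that depends on one specific definition of \w can spell the character class out explicitly, which behaves the same on every PHP version. The variable names below are illustrative only; the two classes are the ones given in this section.

```php
<?php
// Explicit classes behave the same on every PHP version, unlike \w under /u.
$legacyWord  = '/^[\p{L}\p{N}_]+$/u';            // \w as described for PHP < 8.4
$currentWord = '/^[\p{L}\p{N}\p{Mn}\p{Pc}]+$/u'; // \w as described for PHP >= 8.4

$decomposed = "e\u{0301}"; // "e" followed by U+0301 COMBINING ACUTE ACCENT (Mn)

var_dump(preg_match($legacyWord, $decomposed));  // int(0): the mark is not in the class
var_dump(preg_match($currentWord, $decomposed)); // int(1): \p{Mn} covers the mark
```

This matters most for decomposed text, where an accented letter is stored as a base letter followed by a combining mark.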

Caseless Restrict Modifier Support

As part of the PCRE2 10.43 update, PHP 8.4 can use the "caseless restrict" (PCRE2_EXTRA_CASELESS_RESTRICT) modifier, both inside a regular expression and as an additional flag that applies to the whole expression.

When applied, it prevents caseless matching between ASCII and non-ASCII characters. Without it, both the Kelvin sign (K, "\u{212A}") and the uppercase Latin letter K are matched by k (the lowercase Latin letter k) in a caseless Unicode regex:

preg_match('/k/iu', "K"); // Matches
preg_match('/k/iu', "k"); // Matches
preg_match('/k/iu', "\u{212A}"); // Matches

In caseless restrict mode, caseless (/i) matching never pairs an ASCII character with a non-ASCII character. The mode is enabled by placing (?r) at the position where the restriction should start. Similarly, (?-r) disables the restriction again; it does not disable caseless matching itself.

preg_match('/(?r)k/iu', "K"); // Matches
preg_match('/(?r)k/iu', "k"); // Matches
preg_match('/(?r)k/iu', "\u{212A}"); // Does NOT Match

Caseless restrict only makes a difference for Unicode code points at U+0080 or above whose case folding rules map them to an ASCII character. These rules are defined in the Unicode Case Folding standard. As of Unicode 15, the only code points that meet this criterion are:

Character                                 ASCII base equivalent
U+212A - KELVIN SIGN (K)                  U+006B - LATIN SMALL LETTER K (k)
U+017F - LATIN SMALL LETTER LONG S (ſ)    U+0073 - LATIN SMALL LETTER S (s)

To apply caseless-restrict mode to the whole regexp, PHP 8.4 supports a new "r" flag:

preg_match('/\x{212A}/iu', "K"); // Matches
preg_match('/\x{212A}/iur', "K"); // Does NOT match

Using the caseless-restrict modifiers in PHP versions prior to PHP 8.4 emits a PHP warning:

Compilation failed: unrecognized character after (? or (?-

Similarly, because the "r" flag is only available on PHP 8.4 and later, using it results in a PHP warning as well:

Warning: preg_match(): Unknown modifier 'r'
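Since the "r" flag fails on older versions, code that must run on both can probe for it and fall back to an explicit ASCII character class. This is a minimal sketch under that assumption; the helper name containsAsciiK() is hypothetical.

```php
<?php
// Hypothetical helper: true when $text contains an ASCII "k" or "K",
// never counting U+212A KELVIN SIGN as a match.
function containsAsciiK(string $text): bool
{
    // Probe for the "r" flag; on PHP < 8.4 this call emits an
    // "Unknown modifier" warning (suppressed here) and returns false.
    $hasRestrictFlag = @preg_match('/k/iur', '') !== false;

    $pattern = $hasRestrictFlag
        ? '/k/iur'   // caseless, restricted to ASCII-to-ASCII case pairs
        : '/[kK]/u'; // explicit ASCII alternatives for older versions

    return preg_match($pattern, $text) === 1;
}

var_dump(containsAsciiK('Kilo'));     // bool(true)
var_dump(containsAsciiK("\u{212A}")); // bool(false)
```

For a single letter the explicit class is arguably the simpler choice on every version; the flag becomes more attractive for longer caseless patterns.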

Length-limited Variable-Length Lookbehind Support

PCRE2 10.43 supports variable-length lookbehind assertions as long as there is a maximum length. This means PHP 8.4 supports Regexps like the following:

preg_match('/(?<=Hello{1,5}) world/', 'Hello world'); // Matches
preg_match('/(?<=Hello{1,5}) world/', 'Hellooooo world'); // Matches
preg_match('/(?<=Hello{1,5}) world/', 'Helloooooo world'); // Does not match

Although PHP 8.3 and older versions supported lookbehind assertions (both positive and negative lookbehinds), the pattern inside the assertion was not allowed to have a variable-length quantifier (such as {x,y}, ?, or *). Using one results in a compilation failure in PHP 8.3 and older:

Warning: preg_match(): Compilation failed: lookbehind assertion is not fixed length

In PHP 8.4 and later, variable-length lookbehind assertions are allowed, with the restriction that the quantifier must have an upper limit. This means lookbehind assertions such as (?<=a?) and (?<=a{1,20}) are allowed.

However, the * and + quantifiers are not allowed because they do not define an upper limit. Using them results in a compilation error:

Warning: preg_match(): Compilation failed: length of lookbehind assertion is not limited

There is also a limitation that the upper limit of lookbehind quantifiers must be 255 or lower. A lookbehind assertion that exceeds this limit (e.g. (?<=a{1,256})) results in a compilation error:

Warning: preg_match(): Compilation failed: branch too long in variable-length lookbehind assertion
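On versions without variable-length lookbehind support, or when the prefix has no upper limit, a common workaround is the \K escape, which resets the start of the reported match. The sketch below is an alternative technique, not the lookbehind feature itself, and it is not an exact substitute: the prefix is consumed by the match, which changes how successive matches in preg_match_all() can overlap.

```php
<?php
// \K discards everything matched so far from the reported match, so the
// prefix before it may use any quantifier.
$pattern = '/Hello{1,5}\K world/';

preg_match($pattern, 'Hellooooo world', $m);
var_dump($m[0]); // string(6) " world"

var_dump(preg_match($pattern, 'Helloooooo world')); // int(0): six "o"s exceed {1,5}
```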

Named Capture Group Label length increased to 128

PCRE2 10.44 increases the maximum length of named capture group labels. Prior to PHP 8.4, the maximum length of a label is 32 characters.

preg_match('/(?<mylabel1234567890123456789012345>a+)/', 'aaa');

In PHP 8.4 and later, the label (in the example above, the name starting with mylabel123...) can be up to 128 characters long.

When a regexp contains a named capture group label longer than 32 characters in PHP < 8.4, or longer than 128 characters in PHP >= 8.4, the preg_* functions fail with a compilation error:

Compilation failed: subpattern name is too long (maximum 32 code units) at offset 36 ...

Backward Compatibility Impact

The PCRE2 library is part of the PHP source tree, and it is not possible to bring these changes to older PHP versions.

Some of the functionality (such as Kawi and Nag Mundari script-named character classes) can be used in older PHP versions by specifying their Unicode range, but the majority of the new functionality cannot be brought back to older PHP versions.

The PCRE_VERSION constant provides the bundled PCRE2 version, which can be useful for making preg_* calls conditional on the availability of a feature.
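A version gate built on this constant might look like the following sketch. The variable names are illustrative, and the thresholds are simply the PCRE2 versions named in this article; the code assumes PCRE_VERSION holds the version number followed by a space and a release date.

```php
<?php
// PCRE_VERSION looks like "10.44 2024-06-07"; keep only the version number.
$pcreVersion = explode(' ', PCRE_VERSION)[0];

// Thresholds taken from the PCRE2 versions named in this article.
$hasPcre1043Syntax = version_compare($pcreVersion, '10.43', '>=');
$maxLabelLength    = version_compare($pcreVersion, '10.44', '>=') ? 128 : 32;

// Use the new quantifier form only where it is understood as a quantifier.
$pattern = $hasPcre1043Syntax ? '/^a{,3}$/' : '/^a{0,3}$/';

var_dump(preg_match($pattern, 'aaa')); // int(1)
var_dump($maxLabelLength);
```

Checking the library version rather than PHP_VERSION keeps the gate tied to the component that actually implements the feature.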

