Writing better Regular Expressions in PHP

Published On 2021-05-26

PHP Regular Expressions - Improve readability and maintainability

Regular expressions are powerful, but they are not known for being readable, and more often than not, maintaining a regular expression is not a straightforward task.

PHP uses the PCRE (PCRE2 since PHP 7.3) regular expression flavor, which comes with several advanced features that can help write readable, self-explanatory, and easy-to-maintain regular expressions. PHP's filter and ctype functions provide validations for URLs, email addresses, and alphanumeric values, which can remove the need for a regular expression in the first place.
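
As a minimal sketch of that last point, the built-in filter and ctype functions can stand in for a hand-written pattern in common validations. The sample values and variable names below are illustrative only:

```php
<?php
// Validate common formats with built-in functions instead of regular expressions.
// filter_var() returns the filtered value on success, or false on failure.
$email = filter_var('user@example.com', FILTER_VALIDATE_EMAIL);
$url   = filter_var('https://example.com/path', FILTER_VALIDATE_URL);

// ctype_alnum() returns true only if every character is a letter or a digit.
$is_alnum     = ctype_alnum('abc123');
$is_not_alnum = ctype_alnum('abc 123'); // Contains a space.

var_dump($email !== false, $url !== false, $is_alnum, $is_not_alnum);
```

Whether these functions are strict enough depends on the use case; they are worth checking before reaching for a custom expression.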

IDEs can provide nicer syntax highlighting to make a given regular expression more readable and easier to grasp, and can even offer quick-fixes to improve it. However, writing self-explanatory and more readable regular expressions in the first place helps in the long run.

Here are some tips to improve and write better regular expressions in PHP. Note that they might not work in older PHP versions (older than PHP 7.3). Further, using these improvements also means the regular expressions might be less portable to other languages. For example, named captures are supported even in older PHP versions, but in JavaScript, the named captures feature was only added in ECMAScript 2018.



Choice of Delimiter

Every regular expression has two parts: the expression and the flags. The expression is enclosed between two delimiter characters, followed by optional flags.

Consider the regular expression below:

/(foo|bar)/i

In the example above, (foo|bar) is the expression itself, i is a flag (modifier), and the / character is the delimiter.

Forward slashes (/) are frequently used as the delimiter, but the delimiter can be any other character such as ~, !, @, #, $, etc. Alphanumeric characters (A-Z, a-z, and 0-9), white space, multi-byte characters (such as Emojis), and backslashes (\) are not allowed as delimiters.

Alternatively, bracket pairs can be used as delimiters. Regular expressions delimited with {}, (), [], or <> are also accepted, and might be more readable depending on the context.

The choice of the delimiter is important because all occurrences of the delimiter character within the expression must be escaped. The fewer escaped characters inside a regular expression, the more readable it will be. Not choosing meta characters (such as ^, $, braces, and other characters that carry special meaning in regular expressions) can reduce the number of characters escaped.

Although forward slashes are common as a regular expression delimiter, they are often not a good fit for regular expressions containing URIs.

preg_match('/^https:\/\/example\.com\/path/i', $uri);

Forward slashes (/) are a poor choice of delimiter in the example above because the expression itself also contains forward slashes, which must now be escaped, resulting in a rather unreadable snippet.

Simply switching the delimiter from / to # makes the expression more readable because the forward slashes no longer need to be escaped:

- /^https:\/\/example\.com\/path/i
+ #^https://example\.com/path#i
- preg_match('/^https:\/\/example\.com\/path/i', $uri);
+ preg_match('#^https://example\.com/path#i', $uri);
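
To illustrate, the following sketch runs the same expression with three different delimiters, including a bracket pair, and expects them to behave identically. The sample URI and variable names are made up for the example:

```php
<?php
$uri = 'https://example.com/path/to/page';

// The same expression written with three different delimiters.
$with_slash   = preg_match('/^https:\/\/example\.com\/path/i', $uri);
$with_hash    = preg_match('#^https://example\.com/path#i', $uri);
$with_bracket = preg_match('{^https://example\.com/path}i', $uri);

var_dump($with_slash, $with_hash, $with_bracket);
```

Only the first variant needs to escape the forward slashes; the dot is escaped in all three because it is a meta character regardless of the delimiter.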

Reducing escape characters

Going a step further than choosing a better delimiter, there are other approaches to reduce the number of escaped characters used in a regular expression.

In regular expressions, certain meta characters are not considered meta characters when they are used inside square brackets (a character class). For example, the ., *, +, and $ characters (among others) carry a special meaning in regular expressions, but not inside square brackets.

/Username: @[a-z\.0-9]/

In the expression above, the dot character (.) is escaped with a backslash (\.), but this is unnecessary because the . character is not a meta character when it is used inside square brackets.

Further, some characters do not need escaping if they are not part of a range.

For example, the dash character (-) denotes a character range if it is used between two characters, but it carries no special meaning if it is used elsewhere. In the regular expression /[A-Z]/, the dash character is used to create a range of matches from A to Z. If the dash character is escaped (/[A\-Z]/), the regular expression only matches the characters A, Z, and -. Instead of escaping the dash character (\-), simply moving it to the end of the square brackets reduces the number of characters that need escaping: the regular expression /[A\-Z]/ is equivalent to /[AZ-]/, but the latter is more readable.

Excessive use of escape characters does not make the regular expression fail, but it can greatly reduce readability.

- /Price: [0-9\-\$\.\+]+/
+ /Price: [0-9$.+-]+/ 
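
A quick way to gain confidence in such a clean-up is to run both versions against the same input. This is only a sketch with a made-up sample string, not an exhaustive equivalence check:

```php
<?php
$text = 'Price: 24.50$';

// Both character classes contain the same characters: digits, -, $, ., and +.
$escaped   = '/Price: [0-9\-\$\.\+]+/';
$unescaped = '/Price: [0-9$.+-]+/';

preg_match($escaped, $text, $matches_escaped);
preg_match($unescaped, $text, $matches_unescaped);

var_dump($matches_escaped[0], $matches_unescaped[0]);
```

Characters that keep a special meaning inside square brackets, such as ], \, a leading ^, and a dash between two characters, still need attention.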

There is a flag, X, that makes the regular expression fail to compile when a letter with no special meaning is escaped. It is not context-sensitive, however: it does not flag unnecessary escapes such as \. inside square brackets.

preg_match('/x\yz/X', ''); // "y" is not a special character, but escaped.
Warning: preg_match(): Compilation failed: unrecognized character follows \ at offset 2 in ... on line ...

Non-capture groups

In regular expressions, parentheses, ( and ), form a capturing group. The matched results are passed to the matches list.

Consider a regular expression that extracts the price from the text Price: €24.

$pattern = '/Price: (£|€)(\d+)/';
$text    = 'Price: €24';
preg_match($pattern, $text, $matches);

In the snippet above, there are two capturing groups: the first one is for the currency ((£|€)), followed by the numeric value.

The $matches variable will store the matched results from both capture groups:

var_dump($matches);
array(3) {
  [0]=> string(12) "Price: €24"
  [1]=> string(3) "€"
  [2]=> string(2) "24"
}

For regular expressions that do not need to capture at all, or to limit the number of matches passed to the $matches array, a non-capturing group can help.

A non-capturing group starts with (?: and ends with ). The regex engine matches the expression inside the group, but it is not returned as a match; i.e. it is not captured.

If the expression above is only interested in the numeric value, the (£|€) capturing group can be turned into a non-capturing group: (?:£|€).

$pattern = '/Price: (?:£|€)(\d+)/';
$text    = 'Price: €24';
preg_match($pattern, $text, $matches);
var_dump($matches);
array(2) {
  [0]=> string(12) "Price: €24"
  [1]=> string(2) "24"
}

On regular expressions with several groups, turning the unused ones into non-capturing groups can reduce the amount of data assigned to the $matches variable.
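
As a sketch of an expression with several groups, the hypothetical pattern below (the "Cost" label is an assumption added for illustration) groups the label and the currency without capturing them, so only the amount is captured:

```php
<?php
// Hypothetical pattern: the label and currency alternatives are grouped
// without capturing; only the amount ends up in $matches.
$pattern = '/(?:Price|Cost): (?:£|€)(\d+)/';

preg_match($pattern, 'Cost: £99', $matches);

var_dump($matches);
```

With plain capturing groups, the same expression would add two more entries to $matches that the calling code never reads.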

Named Captures

Similar in syntax to non-capturing groups, named captures make it possible to capture a specific group and give it a name. They not only name the returned values, but also label the parts of the regular expression itself.

Using the same price matching example above, named capture groups make it possible to give a name to each capture group:

/Price: (?<currency>£|€)(?<price>\d+)/

A named capture group starts with (?<, followed by the name of the group and a closing >, then the sub-pattern, and ends with ).

In the example above, (?<currency>£|€) is a named capture group with name currency, and (?<price>\d+) is named price. The names provide a little bit of context when reading the regular expression, but also provide a way to name the values in the matched values array.

$pattern = '/Price: (?<currency>£|€)(?<price>\d+)/';
$text    = 'Price: €24';
preg_match($pattern, $text, $matches);
var_dump($matches);
array(5) {
  [0]=> string(12) "Price: €24"
+ ["currency"]=> string(3) "€"
  [1]=> string(3) "€"
+ ["price"]=> string(2) "24"
  [2]=> string(2) "24"
}

The $matches array now contains the names and the positional values of the matched values.

Using named capture-groups makes it easy to consume the $matches values and easily change the regular expression later by preserving the name of the capture group.
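
The following sketch shows why this matters. The second pattern is a hypothetical later revision that adds a capturing group for the label, which shifts the positional indexes but leaves the named keys untouched:

```php
<?php
$text = 'Price: €24';

// Original expression.
preg_match('/Price: (?<currency>£|€)(?<price>\d+)/', $text, $before);

// Hypothetical revision: a new leading group shifts the positional indexes.
preg_match('/(Price|Cost): (?<currency>£|€)(?<price>\d+)/', $text, $after);

var_dump($before['price'], $after['price']); // Same key in both versions.
var_dump($before[2], $after[2]);             // Index 2 refers to different groups.
```

Code that reads $matches['price'] keeps working after the revision, while code that reads $matches[2] would silently receive the currency symbol instead.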

By default, capture-groups with duplicate names are not allowed, and result in an error: PHP Warning: preg_match(): Compilation failed: two named subpatterns have the same name (PCRE2_DUPNAMES not set) at offset ... in ... on line .... It is possible to explicitly allow duplicate named capture-groups with the J modifier:

/Price: (?<currency>£|€)?(?<price>\d+)(?<currency>£|€)?/J

With this regular expression, there are two capturing groups with the name currency, and it is explicitly allowed with the J flag. When it is matched against a string, it will only return the last match for the named capture value, but the positional values (0, 1, 2, ...) contain all matches.

$pattern = '/Price: (?<currency>£|€)?(?<price>\d+)(?<currency>£|€)?/J';
$text    = 'Price: €24£';
preg_match($pattern, $text, $matches);
var_dump($matches);
array(6) {
  [0]=> string(14) "Price: €24£"
  ["currency"]=> string(2) "£"
  [1]=> string(3) "€"
  ["price"]=> string(2) "24"
  [2]=> string(2) "24"
  [3]=> string(2) "£"
}

Using Comments

Some regular expressions are quite long, and extend to multiple lines.

Concatenating the regular expression while commenting individual sub-patterns or assertions can improve readability and provide smaller diff outputs when reviewing commits:

- $pattern  = '/Price: (?<currency>£|€)(?<price>\d+)/i';
+ $pattern  = '/Price: ';
+ $pattern .= '(?<currency>£|€)'; // Capture currency symbols £ or €
+ $pattern .= '(?<price>\d+)'; // Capture price without decimals.
+ $pattern .= '/i'; // Flags: Case-insensitive

Alternatively, comments can be added inside the regular expression itself.

There is a regular expression flag, x, that makes the engine ignore white-space characters in the pattern (except when they are escaped or inside a character class), allowing the expression to be spread out, aligned, or even split into multiple lines:

- /Price: (?<currency>£|€)(?<price>\d+)/i
+ /Price:  \s  (?<currency>£|€)  (?<price>\d+)  /ix

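In /Price: (?<currency>£|€)(?<price>\d+)/i, the engine matches against the white-space character right after the Price: string, but with the x flag, white space in the pattern is ignored. To match any white-space character, use the \s special character; to match only a literal space, escape it (\ ) or place it inside square brackets ([ ]).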
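
The effect can be observed with a small sketch. The patterns and sample strings below are simplified versions of the price example and are assumptions made for illustration:

```php
<?php
// With the x flag, the literal space after "Price:" is ignored by the engine.
$ignored_space = preg_match('/Price: \d+/x', 'Price: 42'); // Does not match.
$no_space_text = preg_match('/Price: \d+/x', 'Price:42');  // Matches.

// Ways to still require white space under the x flag.
$with_class    = preg_match('/Price: \s \d+/x', 'Price: 42'); // \s: any white space.
$with_escape   = preg_match('/Price:\ \d+/x', 'Price: 42');   // "\ ": a literal space.
$with_brackets = preg_match('/Price:[ ]\d+/x', 'Price: 42');  // Space inside a character class.

var_dump($ignored_space, $no_space_text, $with_class, $with_escape, $with_brackets);
```

When adding the x flag to an existing expression, every literal space in it needs to be reviewed, otherwise the expression quietly starts matching different input.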

Further, with the x flag, an unescaped # character outside a character class starts an inline comment that runs until the end of the line, similar to the // and # comment syntax in PHP.

With more spacing around logical groups of sub-patterns, the pattern can be made more readable. However, a better approach would be splitting the expression to multiple lines and adding comments:

- /Price: (?<currency>£|€)(?<price>\d+)/i
+ /Price:           # Check for the label "Price:"
+ \s                # Ensure a white-space after.
+ (?<currency>£|€)  # Capture currency symbols £ or €
+ (?<price>\d+)     # Capture price without decimals.
+ /ix
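
One side effect worth keeping in mind is that a literal # in the expression now needs to be escaped. The hashtag pattern below is a hypothetical example, not part of the price expression:

```php
<?php
// Under the x flag, an unescaped # starts a comment that runs to the end of the line.
$pattern = '/
    \#       # A literal hash sign must be escaped, or written as [#].
    (\w+)    # Capture the tag name that follows it.
/x';

preg_match($pattern, 'Tagged with #php', $matches);

var_dump($matches[1]);
```

PHP locates the closing delimiter before the engine processes the comments, so an unescaped delimiter character inside a comment would still end the expression early.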

When storing in a PHP variable, using Heredoc/Nowdoc can preserve formatting. Since PHP 7.3, the heredoc/nowdoc syntax is more relaxed too.

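// Only flags may follow the closing delimiter, so they are described here:
// i: case-insensitive, x: ignore white space and allow comments.
$pattern = <<<PATTERN
  /Price:           # Check for the label "Price:"
  \s                # Ensure a white-space after.
  (?<currency>£|€)  # Capture currency symbols £ or €
  (?<price>\d+)     # Capture price without decimals.
  /ix
PATTERN;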
preg_match($pattern, 'Price: £42', $matches);

Named Character Classes

Regular expressions support character classes, which can make an expression less demanding to read while keeping its meaning intact.

\d is probably the most frequently used character class. \d represents a single digit, and is equivalent to [0-9] (in non-Unicode mode). Further, \D is the inverse of \d, and is equivalent to [^0-9].

A regular expression that meticulously looks for a digit followed by a non-digit can be simplified without changing the functionality:

- /Number: [0-9][^0-9]/
+ /Number: \d\D/

Regular expressions support several more character classes that make the improvement even more noticeable:

  • \w is equivalent to [A-Za-z0-9_]:
    - /[A-Za-z0-9_]/
    + /\w/
  • [:xdigit:] is a named character class that matches all hexadecimal characters, and is equivalent to [a-fA-F0-9]:

    - /[a-fA-F0-9]/
    + /[[:xdigit:]]/
  • \s matches all white-space characters, and is equivalent to [ \t\r\n\v\f]:

    - /[ \t\r\n\v\f]/
    + /\s/
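
The equivalences listed above can be spot-checked with a short sketch. The sample characters are arbitrary ASCII values chosen for illustration, so this is a sanity check rather than a proof:

```php
<?php
// Pairs of a shorthand class and its spelled-out form (ASCII input, no u flag).
$pairs = [
    '\d'           => '[0-9]',
    '\D'           => '[^0-9]',
    '\w'           => '[A-Za-z0-9_]',
    '[[:xdigit:]]' => '[a-fA-F0-9]',
    '\s'           => '[ \t\r\n\v\f]',
];

$samples = ['7', 'f', 'G', '_', ' ', "\t", '-'];

$all_agree = true;
foreach ($pairs as $short => $long) {
    foreach ($samples as $char) {
        // \A and \z anchor the expression to the whole one-character sample.
        $a = preg_match('/\A' . $short . '\z/', $char);
        $b = preg_match('/\A' . $long . '\z/', $char);
        $all_agree = $all_agree && ($a === $b);
    }
}

var_dump($all_agree);
```

The comparison only holds for the samples tested here; with the u flag or a non-default locale, shorthand classes may match additional characters.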

Using regular expressions with Unicode support (the u flag) enables several more character classes. Unicode named character classes have the pattern \p{EXAMPLE}, where EXAMPLE is the name of the character class. Using the uppercase P (e.g. \P{FOO}) is the inverse of that character class.

For example, \p{Sc} is a named character class for all current and future Currency Symbols. There is also a longer form (e.g. \p{Currency_Symbol}), but PHP does not support it at the moment.

$pattern = '/Price: \p{Sc}\d+/u';
$text = 'Price: ¥42';

Character classes allow capturing/matching classes even without prior knowledge about the characters. New currency symbols introduced in the future will start to match, as soon as that information is included in the next Unicode database update.
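
As a sketch of how this plays out, the expression below combines \p{Sc} with the named captures from earlier and is tried against a few made-up inputs, without listing any currency symbol in the expression itself:

```php
<?php
$pattern = '/Price: (?<currency>\p{Sc})(?<price>\d+)/u';

$currencies = [];
foreach (['Price: ¥42', 'Price: €24', 'Price: ₹10', 'Price: X99'] as $text) {
    // Record the matched symbol, or null when no currency symbol is present.
    $currencies[] = preg_match($pattern, $text, $matches) ? $matches['currency'] : null;
}

var_dump($currencies);
```

The last input does not match because X is a letter, not a currency symbol.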

Unicode character classes also include a very helpful list of script classes for all Unicode scripts. For instance, \p{Sinhala} represents the characters of the Sinhala script, used to write the Sinhalese language, and is roughly equivalent to [\x{0D80}-\x{0DFF}].

- $pattern = '/[\x{0D80}-\x{0DFF}]/u';
+ $pattern = '/\p{Sinhala}/u';
$text = 'පීඑච්පී.වොච්';
$contains_sinhala = preg_match($pattern, $text);

A previous version of this article erroneously had a mix-up under the named character classes section, and had an example of the long form Unicode character classes, which PHP does not support. This is now fixed, thanks to Bruno Verley (@brnvrl). Thanks to Sergey Lebedev, and Taoshu, this article is also available in Russian and Chinese.
