
Introduction to regular expressions

A regular expression, or regex, is a way of describing a pattern to be matched within a string; the technique is also known as pattern matching. Regular expressions are a powerful tool available in many programming languages such as Perl, JavaScript, Java, PHP, Python, C++, …

Here we have an example of a text (string), a regex and the patterns being matched:

text → "The boy noted on the notebook what he noticed"

regex → "/not\w{1,5}/g"

patterns matched → "noted", "notebook" and "noticed"

Here we can see why people are scared of regular expressions. It looks like hieroglyphics, doesn't it? But don't worry: there is a Rosetta Stone for regular expressions, and in a short time we will be able to write regular expressions that we can use in our programs.

Regular expressions in PHP

In PHP the most often used regular expressions are PCRE, or "Perl Compatible Regular Expressions". PCRE is a regular expression C library based on the regular expression capabilities of the Perl programming language. PCRE syntax is more expressive than the original POSIX regular expression syntax, which is why we are going to work with it.

Let's start with some PHP code we can use to test regular expressions against texts:

<?php

$string = "the boy noted on the notebook what he noticed";

// PHP rejects the trailing "g" of "/not\w{1,5}/g" as an unknown modifier, so it is left out here.
$regex = "/not\w{1,5}/";

if (preg_match($regex, $string)) {
    echo "Match found";
} else {
    echo "No match found";
}

The $string variable contains the text in which we look for matches of the regular expression. The $regex variable contains a string that represents the pattern we are trying to match, in other words, the regex. The function preg_match takes two parameters and returns 1 if a match is found, 0 if no match is found, and FALSE if an error occurs. To capture the first match we can call it with a third parameter, "preg_match($regex, $string, $matches)"; when the function returns 1, the first match is in $matches[0].
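
The three-parameter call can be sketched as follows. This is only an illustration: the variable names $found, $first, $missing and $none are made up for the example, and the pattern is written without a trailing "g", which PHP does not accept as a modifier.

```php
<?php
$string = "the boy noted on the notebook what he noticed";
$regex  = '/not\w{1,5}/';

// preg_match() stops at the first match and stores it in $matches[0].
$found = preg_match($regex, $string, $matches);
$first = ($found === 1) ? $matches[0] : null;

// When nothing matches, the return value is 0 and the array stays empty.
$missing = preg_match('/xyz/', $string, $none);

echo "found=$found first=$first missing=$missing\n"; // found=1 first=noted missing=0
```

Comparing the return value with === 1 keeps the "no match" case (0) apart from the error case (FALSE).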

The basics of PCRE

Usually a regex starts and ends with the forward slash character /. The forward slashes are delimiters that hold our regex pattern, and the quotation marks wrap it all up as a PHP string. Other delimiters we can use are @, #, `, ~, %, &, ' and ", but as we mentioned before the most used is /. You have probably noticed that the regex shown at the beginning of the lesson, "/not\w{1,5}/g", ends with a "g". The "g" is a modifier: it is not part of the pattern but affects how the pattern is applied. All modifiers come after the closing delimiter. The modifier "g" means "global", that is, "find all matches"; without it only the first match is found. It is available in JavaScript, Perl and testers such as regex101, but PHP does not accept it: preg_match always stops at the first match, and the function preg_match_all is used to find all matches. In this lesson a trailing "g" is only a notation meaning "all matches"; drop it when you pass the pattern to a PHP function. Later we will talk more about modifiers. What we know so far is that in "/not\w{1,5}/g", "/not\w{1,5}/" is the pattern and "g" is the modifier.

To match the beginning of a string we use the caret ^, which comes right after the first forward slash. So the pattern "/^the/" matches the "the" at the beginning of $string. Matching is case sensitive by default; for case-insensitive matching, use the "i" modifier. For example, "/^the/i" matches "the", "The", "tHe", "thE", "THe", "tHE", "ThE" and "THE" at the beginning of the text.

What if we want to match something at the end of a string? We should use the $ before the last forward slash. For example if we want to match "ed" at the end of $string then the pattern to be used is "/ed$/".

If we use neither ^ nor $, then the pattern we seek is free to appear anywhere.
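
A small sketch of how the anchors and the "i" modifier behave, using the $string from the first example. The variable names are chosen only for this illustration.

```php
<?php
$string = "the boy noted on the notebook what he noticed";

$startsWithThe  = preg_match('/^the/', $string);   // 1: "the" is at the beginning
$upperSensitive = preg_match('/^THE/', $string);   // 0: matching is case sensitive
$upperIgnore    = preg_match('/^THE/i', $string);  // 1: "i" ignores case
$endsWithEd     = preg_match('/ed$/', $string);    // 1: "noticed" ends the string
$boyAtStart     = preg_match('/^boy/', $string);   // 0: "boy" is not at the beginning
$boyAnywhere    = preg_match('/boy/', $string);    // 1: no anchor, any position

echo "$startsWithThe $upperSensitive $upperIgnore $endsWithEd $boyAtStart $boyAnywhere\n";
```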

Exercise

$string = "abcdefghijklmnopqrstuvwxyz0123456789";

$regex = "?";

Find out the pattern for the following matches:

"abc" only at the beginning.
"fgh" anywhere and case insensitive.
"789" only at the end.

Meta characters

In PCRE there are many characters that have a special meaning, like the caret ^ and the dollar sign $; these are known as meta characters. The forward slash / is not a meta character itself, although it is also special while it serves as the delimiter. Let's take a look at the whole bunch of meta characters:

. (Full stop)
^ (Caret)
$ (Dollar sign)
* (Asterisk)
+ (Plus)
? (Question mark)
{ (Opening curly bracket)
} (Closing curly bracket)
[ (Opening bracket)
] (Closing bracket)
\ (Backslash)
| (Pipe)
( (Opening parenthesis)
) (Closing parenthesis)

Don't worry if you don't know how to use them yet; we will get to them very soon. For now we will introduce the concept of escaping. We need to escape a character when we want to use it as a regular character rather than as a meta character. To escape a character we prefix it with a backslash \.

For example:

$string = '$100 is a good price';

$regex = '/$100/';

If we want to match "$100", this $regex won't work because $ is used as a meta character (it means "end of the string"). To make it work we should use the following pattern:

$regex = '/\$100/';

Now $ is escaped and works as a regular character. Note the single quotes: inside a double-quoted PHP string, "\$" is itself a PHP escape that produces a plain $, so the backslash would never reach the regex engine. What if we want to escape the escape character? Common sense suggests "\\", but in PHP we should write "\\\\". This is because the backslash also has a special meaning inside PHP strings delimited by " or ': PHP turns each "\\" into a single backslash, and the regex engine then needs two of them to match one literal backslash.
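
The escaping rules can be tried out with a sketch like the one below. It uses single-quoted strings so that PHP passes the backslashes on to the regex engine, and it also shows preg_quote, a built-in PHP function that escapes the meta characters of a literal text. The sample text 'C:\temp' and the variable names are assumptions made only for this example.

```php
<?php
$string = '$100 is a good price';

$unescaped = preg_match('/$100/', $string);    // 0: $ means "end of string"
$escaped   = preg_match('/\$100/', $string);   // 1: \$ is a literal dollar sign

// preg_quote() adds the backslashes for us; the second argument is the delimiter.
$quoted   = preg_quote('$100', '/');           // \$100
$viaQuote = preg_match('/' . $quoted . '/', $string);

// A literal backslash: the regex needs \\, so the PHP string literal needs \\\\.
$path         = 'C:\temp';
$hasBackslash = preg_match('/\\\\/', $path);   // 1

echo "$unescaped $escaped $quoted $viaQuote $hasBackslash\n";
```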

Exercise

$string = "I don't know how to proceed with 2+w=y";

$regex = "?";

Try to match "2+w=y" at the end of the string.

Character classes

A character class is a set of characters we wish to match. We have to enclose these characters between square brackets. For example:

$string = "bag beg big bug";

$regex = "?";

If we wish to match "bag", "beg" and "big" we can use one of these regexes: "/b[aei]g/g" or "/b[a-i]g/g"

In the first character class we simply enumerate the characters we want to match "[aei]". In the second character class we use a range "[a-i]" that means "from a to i".

Most meta characters lose their special meaning inside a class, so there is no need to escape them: "[$€]100" is the same as "[\$€]100". But every rule has its exceptions. The backslash still escapes, a closing bracket ends the class, and a hyphen between two characters forms a range. The caret ^ is also special, but only in the first position of a class: "[^A]" means any character except "A". When the caret does not appear in the first position it has no special meaning. Example:

<?php

$string = "abcefghijklmnopqrstuvwxyz0123456789";

$regex = "?"; // placeholder: replace it with one of the patterns listed below

// preg_match_all finds every match (what the "g" notation stands for) and stores them in the array $matches.
// It returns the number of matches, which may be zero, or FALSE if an error occurred.
preg_match_all($regex, $string, $matches);

// $matches[0] holds the texts matched by the whole pattern.
foreach ($matches[0] as $match) {
    echo $match . "<br>";
}

$regex = "/a[bc]/" → $matches = array("ab");
$regex = "/a[^b]/" → $matches = array();
$regex = "/[0-9]/" → $matches = array("0", "1", "2", "3", "4", "5", "6", "7", "8", "9");
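
The class rules above (enumeration, range, negation and a caret that is not in first position) can be checked with a short sketch. It reuses the "bag beg big bug" text; the second sample text 'a^b' and the variable names are made up for the illustration.

```php
<?php
$string = "bag beg big bug";

preg_match_all('/b[aei]g/', $string, $enumerated);  // bag, beg, big
preg_match_all('/b[a-i]g/', $string, $ranged);      // bag, beg, big
preg_match_all('/b[^aei]g/', $string, $negated);    // bug

// A caret that is not in first position is an ordinary character.
preg_match_all('/[a^]/', 'a^b', $caretLiteral);     // a, ^

echo implode(', ', $enumerated[0]), "\n";
echo implode(', ', $negated[0]), "\n";
echo implode(', ', $caretLiteral[0]), "\n";
```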

Exercise

$regex = "/[^0-9]/" → $matches = ?
$regex = "/[^B-E]/i" → $matches = ?
$regex = "/[^B-E]/" → $matches = ?

To end with character classes there are some shortcuts you can use within the square brackets:

[:alnum:] → for letters and digits. Note that the shortcut is [:alnum:], to use it as a class: [[:alnum:]].
[:alpha:] → for letters
[:blank:] → space or tab only
[:digit:] → decimal digit
[:lower:] → lowercase letters
[:print:] → printable characters, including the space
[:punct:] → punctuation characters
[:space:] → whitespace characters
[:upper:] → uppercase letters
…
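
As a rough illustration of these shortcuts, the sketch below applies a few of them to a sample sentence. The text 'Room 42, Floor 3!' and the variable names are invented for the example.

```php
<?php
$string = 'Room 42, Floor 3!';

preg_match_all('/[[:digit:]]+/', $string, $digits);          // 42, 3
preg_match_all('/[[:upper:]]/', $string, $upper);            // R, F
preg_match_all('/[[:punct:]]/', $string, $punct);            // , and !
// Several shortcuts can share one pair of outer brackets.
preg_match_all('/[[:alpha:][:digit:]]+/', $string, $words);  // Room, 42, Floor, 3

echo implode(' ', $digits[0]), "\n";
echo implode(' ', $words[0]), "\n";
```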

The dot meta character "."

The dot meta character matches any single character except the newline character \n (with PCRE's default settings). To make the dot match every character, including newlines, use the "s" modifier. Here we have an example:

$string = "Brother bought a coconut, he bought it for a dime
His sister had another one, she paid it for a lime.
She put the lime in the coconut, she drank them both up"

$regex = "/[dl]ime./gs" matches "dime\n" at the end of the first line, "lime." at the end of the second line and "lime " at the third line.

$regex = "/[dl]ime./g" matches "lime." at the end of the second line and "lime " at the third line.

$regex = "/h.[ds]/gi" matches "His" and "had" at the second line.
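
The effect of the "s" modifier can be reproduced in PHP with a shortened version of the text. This is a sketch: the three-line sample string and the variable names are assumptions, and preg_match_all plays the role of the "g" notation.

```php
<?php
// Three short lines separated by newline characters.
$string = "for a dime\nfor a lime.\nthe lime in";

// Without "s" the dot refuses the newline after "dime".
preg_match_all('/[dl]ime./', $string, $plain);    // "lime.", "lime "

// With "s" the dot also matches the newline.
preg_match_all('/[dl]ime./s', $string, $dotAll);  // "dime\n", "lime.", "lime "

echo count($plain[0]), ' ', count($dotAll[0]), "\n"; // 2 3
```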

The asterisk, the plus and the question mark meta characters ("*", "+", "?")

The asterisk meta character matches zero or more occurrences of the previous character (or class, or parenthesized group). The plus meta character matches one or more occurrences of it. Finally, the question mark meta character matches zero or one occurrence. Let's take a look at this example:

$string = "He was born a fool for love
What he wouldn't do for love
He's a fool, a fool for love"

$regex = "/foo*[lr]/gi" matches all occurrences of "fool" and "for" in the text.

$regex = "/fo+./gi" matches all occurrences of "fool" and "for" in the text.

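$regex = "/.*/gi" matches each of the three lines separately. Because * also accepts zero characters, a tester additionally reports an empty match at the end of each line.

$regex = "/.*/gis" matches the complete text in one match (followed by an empty match at the very end).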

$regex = "/foo?[lr]/gi" matches all occurrences of "fool" and "for" in the text.
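
To compare the three quantifiers side by side, here is a minimal sketch on an invented text in which the number of "o" letters grows from zero to three. The text and the variable names exist only for this illustration.

```php
<?php
$string = 'fr for foor fooor';

preg_match_all('/fo*r/', $string, $star);      // zero or more "o": fr, for, foor, fooor
preg_match_all('/fo+r/', $string, $plus);      // one or more "o":  for, foor, fooor
preg_match_all('/fo?r/', $string, $optional);  // zero or one "o":  fr, for

echo implode(', ', $star[0]), "\n";
echo implode(', ', $plus[0]), "\n";
echo implode(', ', $optional[0]), "\n";
```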

Exercise

$string = "aaa bbb aaabbb aabb aacbb cbb ab";

$regex = "?";

Find out the pattern for the following matches:

"aaabbb", "aabb", "ab"
"aacbb", "cbb"
"bbb", "aaabbb", "aabb", "aacbb", "cbb", "ab"

The curly brackets meta characters ("{}")

Curly brackets match a specific number of instances of the preceding character, class or group. For example, "/a{3}/" means that exactly 3 instances of "a" have to be matched. The syntax also admits more cases: "/a{3,}/" means that at least 3 instances of "a" have to be matched. We can also set lower and upper limits: "/a{3,6}/" matches between three and six instances, inclusive.

Example:

$string = "we could be immortals, immortals,
Just not for long, for long,
We could be immooooooo- immortals,
Immooooooo- immortals,
Immooooooo- immortals,
Immooooooo- immortals,"

$regex = "/im{2}o{1,}(rtals)?/gi" → It has 10 matches: 2 x "immortals" on the first line, "immooooooo" and "immortals" on the third line, and "immooooooo" and "immortals" on each of the last three lines. The parentheses group "rtals" so that the question mark applies to the whole group.

$string = "My phone number is 96-435544233"

$regex = "/[0-9]{2}-?[0-9]{9}/g" → matches the phone number with or without "-".
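
The three forms of the curly brackets can be compared with a sketch like this one. The sample text of "a" runs, the shortened phone number and the variable names are assumptions for the example; the phone pattern is additionally anchored with ^ and $ so that the whole input has to fit.

```php
<?php
$string = 'aa aaa aaaaaaaa';   // runs of 2, 3 and 8 letters

preg_match_all('/a{3}/', $string, $exactly);    // aaa, aaa, aaa (the run of 8 gives two)
preg_match_all('/a{3,}/', $string, $atLeast);   // aaa, aaaaaaaa
preg_match_all('/a{3,6}/', $string, $between);  // aaa, aaaaaa

// Phone number with an optional "-" after the first two digits.
$phone      = '/^[0-9]{2}-?[0-9]{9}$/';
$withDash   = preg_match($phone, '96-435544233');  // 1
$withoutDash = preg_match($phone, '96435544233');  // 1
$tooShort   = preg_match($phone, '96-4355442');    // 0: only 7 digits after the dash

echo implode(', ', $between[0]), "\n";
echo "$withDash $withoutDash $tooShort\n";
```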

Special sequences

The backslash can be used to make special sequences that are equivalent to a character class. Here you can see the most used:

"\s" matches any whitespace character. Its equivalent class is "[ \t\n\r\f\v]" (note the space at the beginning).
"\S" matches any non-whitespace character. Its equivalent class is "[^ \t\n\r\f\v]".
"\d" matches any digit character. Its equivalent class is "[0-9]".
"\D" matches any non-digit character. Its equivalent class is "[^0-9]".
"\w" matches any word character. Its equivalent class is "[a-zA-Z0-9_]".
"\W" matches any non-word character. Its equivalent class is "[^a-zA-Z0-9_]".
…

Having this in mind the regex used in the last example "/[0-9]{2}-?[0-9]{9}/g" can be rewritten in a shorter way: "/\d{2}-?\d{9}/g".

There are more special sequences but we leave them for the advanced regular expressions lesson. Let us take a look at this example:

$string = "On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain. These cases are perfectly simple and easy to distinguish. In a free hour, when our power of choice is untrammeled and when nothing prevents our being able to do what we like best, every pleasure is to be welcomed and every pain avoided. But in certain circumstances and owing to the claims of duty or the obligations of business it will frequently occur that pleasures have to be repudiated and annoyances accepted. The wise man therefore always holds in these matters to this principle of selection: he rejects pleasures to secure other greater pleasures, or else he endures pains to avoid worse pains. (H. Rackham 1914)"

$regex = "/\w+/g" → matches every word, "On", "the", "other", … "1914", and skips white spaces, newlines, ",", ".", ";", ":", "(" and ")".

$regex = "/\d+/g" → matches "1914".

$regex = "/\w+[.,;]/g" → matches "hand,", "moment,", "desire,", "ensue;", "will,", "pain.", "distinguish.", "hour,", "best,", "avoided.", "accepted.", "pleasures,", "pains." and "H.". Note that "selection:" is not matched, because the colon is not in the class.

Now it is time to take a look at the regex we used at the beginning of the lesson: "/not\w{1,5}/g". It is easy to figure out that it matches every piece of text that begins with "not" and is followed by 1 to 5 word characters.
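
The special sequences can be tried in PHP as sketched below. The sample sentence and the variable names are invented for the illustration, and the patterns are single-quoted so that the backslashes reach the regex engine unchanged.

```php
<?php
$string = 'Call 96-435544233 before 18:30, room_7.';

preg_match_all('/\d+/', $string, $numbers);  // 96, 435544233, 18, 30, 7
preg_match_all('/\w+/', $string, $words);    // Call, 96, 435544233, before, 18, 30, room_7
$spaces = preg_match_all('/\s/', $string);   // 4 whitespace characters

// The shorter phone pattern from the text.
preg_match('/\d{2}-?\d{9}/', $string, $phone);

echo implode(' ', $numbers[0]), "\n";
echo $spaces, ' ', $phone[0], "\n"; // 4 96-435544233
```

The underscore counts as a word character, which is why "room_7" comes back as a single match.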

Exercise

$string = "First degree equations:
x + 2 = 5
x+2=5
x -3= 10
2x/3 = 4";

$regex = "?";

Find a regex to match: "x + 2 = 5", "x+2=5", "x -3= 10" and "2x/3 = 4"

NOTE: there is an excellent regex tester at https://regex101.com/. On this website you can check all the exercises and learn more about regular expressions. It is very intuitive and you can see the matches in real time. All you have to do is write a text and type the regular expression. Good luck!