Explained: regular expression (regex)

“Regular expression,” or “regex” for short, is a term from formal language theory for a notation that describes regular languages. In computing, regexes are used to search for patterns in files and databases, and their functionality is incorporated into many modern programming languages. Regex search patterns make wildcards look like clumsy clowns because they offer a whole slew of additional options.

Regex overview

The simplest and most common method of searching is to look for a specific string or character in a text file, for example, by using F3 on a website. This is basically what you use when you apply the “Search” or “Search and Replace” functions in Notepad.

Like we said, regex can do a lot more. But to achieve this, a few special characters have to be defined. It is good to know these so-called meta characters because syntax errors are the most common cause for failed searches.

The most used special characters are:

Square brackets []

Square brackets are used to specify a character set. Exactly one character from the set must match at that position, unless a quantifier says otherwise.

Example: Malwareb[yi]tes will be a match for Malwarebytes and Malwarebites, but not for Malwarebyites.

The minus sign -

Inside square brackets, the minus sign or hyphen is used to specify a range of characters.

Example: [0-9] will be a match for any single digit between 0 and 9.
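To make character sets and ranges concrete, here is a minimal sketch in Python's standard re module. The helper names is_brand and is_digit are made up for this illustration, and fullmatch is used so that the whole input has to fit the pattern.

```python
import re

# Character set: exactly one of "y" or "i" must follow "Malwareb".
BRAND = re.compile(r"Malwareb[yi]tes")
# Range inside square brackets: any single digit from 0 through 9.
DIGIT = re.compile(r"[0-9]")


def is_brand(word: str) -> bool:
    """True when the whole word fits the character-set pattern."""
    return BRAND.fullmatch(word) is not None


def is_digit(char: str) -> bool:
    """True when the input is exactly one digit."""
    return DIGIT.fullmatch(char) is not None


print(is_brand("Malwarebytes"), is_brand("Malwarebites"), is_brand("Malwarebyites"))
print(is_digit("7"), is_digit("x"))
```

The set consumes only one character, which is why the spelling with both "y" and "i" is rejected.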

Curly brackets {}

Curly brackets are used to specify how many times the preceding character or set must repeat.

Example: [0-9]{3} matches any three-digit sequence between 000 and 999.
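A short sketch of the curly-bracket quantifier, again in Python. The function name find_three_digit_runs and the sample sentence are invented for the example.

```python
import re

# {3} asks for exactly three repetitions of the preceding set.
THREE_DIGITS = re.compile(r"[0-9]{3}")


def find_three_digit_runs(text: str) -> list[str]:
    """Return every non-overlapping run of exactly three digits found in text."""
    return THREE_DIGITS.findall(text)


print(find_three_digit_runs("Room 042, extension 7, code 999"))
```

The lone "7" is skipped because a single digit cannot satisfy the three-repetition requirement.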

Parentheses ()

Parentheses are used to group characters. Matches contain the characters in their exact order.

Example: (are) gives a match for malware, but not for aerial, because in aerial the characters appear in a different order than specified.

Pipe |

The pipe, or vertical bar, stands for the logical “or” operator, as it does in many languages.

Example: Most|more will be a match for either of the specified words.
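Grouping and alternation can be tried out with a few lines of Python. This is only a sketch; contains_are and which_word are hypothetical helper names, and search is used because the pattern may appear anywhere in the input.

```python
import re
from typing import Optional

# Parentheses group the characters; they must appear in exactly this order.
GROUP = re.compile(r"(are)")
# The vertical bar means "or": either alternative is acceptable.
EITHER = re.compile(r"Most|more")


def contains_are(word: str) -> bool:
    """True when the grouped sequence occurs somewhere in the word."""
    return GROUP.search(word) is not None


def which_word(text: str) -> Optional[str]:
    """Return the alternative that was found first, or None if neither occurs."""
    found = EITHER.search(text)
    return found.group(0) if found else None


print(contains_are("malware"), contains_are("aerial"))
print(which_word("Most people"), which_word("nothing more"), which_word("less"))
```

Note that the match is case-sensitive by default, so "Most" and "more" are matched exactly as written in the pattern.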

Period .

The dot or period acts as a wildcard. It matches any single character, except line break characters.

Example: Malwareb.tes will be a match for Malwarebytes, Malwarebites, Malwarebotes, and many others, but still not for Malwarebyites.
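The wildcard behavior of the period, including the line-break exception, can be checked with a small Python sketch. The helper name fits_wildcard is an assumption made for this example.

```python
import re

# The period stands for any single character except a line break.
WILDCARD = re.compile(r"Malwareb.tes")


def fits_wildcard(word: str) -> bool:
    """True when the whole word fits the pattern with one wildcard character."""
    return WILDCARD.fullmatch(word) is not None


for candidate in ("Malwarebytes", "Malwarebotes", "Malwarebyites", "Malwareb\ntes"):
    print(repr(candidate), fits_wildcard(candidate))
```

The last candidate fails because the character in the wildcard position is a line break, which the period does not match by default.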

Backslash \

The backslash is used to escape special characters and to give special meaning to some characters that follow it.

Examples: \d matches one digit (0-9).

\w matches one alphanumeric character or an underscore.
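Both roles of the backslash, giving letters a special meaning (\d, \w) and removing the special meaning of a metacharacter (\.), can be sketched in Python as follows. The function names and the sample text are invented for the illustration; raw strings (r"...") keep Python itself from interpreting the backslashes.

```python
import re


def is_single_digit(char: str) -> bool:
    """\\d stands for one digit."""
    return re.fullmatch(r"\d", char) is not None


def is_word_char(char: str) -> bool:
    """\\w stands for one letter, digit, or underscore."""
    return re.fullmatch(r"\w", char) is not None


def find_prices(text: str) -> list[str]:
    """The escaped period matches a literal dot, not 'any character'."""
    return re.findall(r"\d+\.\d\d", text)


print(is_single_digit("5"), is_single_digit("a"))
print(is_word_char("a"), is_word_char("_"), is_word_char("-"))
print(find_prices("Costs 3.99 or 4x50"))
```

Without the backslash in front of the period, "4x50" would also be reported, because an unescaped period accepts the "x".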

Asterisk *

The asterisk is a repeater. It matches when the character preceding it matches 0 or more times.

Example: cho*se will match for chose and choose, but also for chse (zero match).

Asterisk and period .*

The asterisk is used in combination with the period to match for any character 0 or more times.

Example: Malware.* will match for Malware, Malwarebytes, and any misspelled version that starts with Malware.

Plus sign +

The plus sign matches when the character preceding + matches 1 or more times.

Example: cho+se will match for chose and choose, but not for chse.
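The difference between the asterisk, the plus sign, and the period-asterisk combination is easiest to see side by side. The sketch below uses Python; the helper fits is a hypothetical name, and the misspelling "Malwarbytes" is included only to show a word that does not start with "Malware".

```python
import re

STAR = re.compile(r"cho*se")       # "o" may appear zero or more times
PLUS = re.compile(r"cho+se")       # "o" must appear at least once
PREFIX = re.compile(r"Malware.*")  # "Malware" followed by anything, or by nothing


def fits(pattern: re.Pattern, word: str) -> bool:
    """True when the whole word fits the given compiled pattern."""
    return pattern.fullmatch(word) is not None


for word in ("chse", "chose", "choose"):
    print(word, fits(STAR, word), fits(PLUS, word))

for word in ("Malware", "Malwarebytes", "Malwarbytes"):
    print(word, fits(PREFIX, word))
```

Only "chse" is treated differently by the two repeaters: the asterisk accepts the zero-occurrence case and the plus sign rejects it.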

There are quite a few more meta characters, but it is outside the scope of this post to explain them all in detail. For those interested, there are many basic and advanced regex tutorials available. One of them will certainly fit your specific wishes.

Responsible use

Sophisticated regexes look intimidating and confusing at first sight, but once you have constructed a few yourself, you will start recognizing what others have tried to accomplish—especially if you take them apart one piece at a time. But we do advise caution when using your own regexes on public-facing servers or apps. An inexperienced publisher could be digging his own grave by doing so.

For most common tasks, there are many examples to be found on code repositories like GitHub. But you will have to choose carefully and ask yourself:

Security-wise, is it safe to use in production?
Is it well maintained? Does it get updated regularly, or will that become your future task?

The more contributors, the better is the rule of thumb here. More contributors mean not only more eyes that check for vulnerabilities, but also more people writing new code and improving existing code.

Abuse

Like many other programming languages, JavaScript supports regexes. This capability is nice, but it also poses a problem that has been known for years. Papers describing the possibility of a regular expression denial of service (ReDoS) date back to at least 2012.

Basically, an attacker can prepare a specially crafted, and often lengthy, piece of text and feed it into an input field of a JavaScript-based web server or app. Because JavaScript executes on a single main thread, the targeted server or app is kept busy running its regex functions on the text. While it is doing that, it is unable to perform any other tasks, so the server or app will appear to be frozen. Other languages will take a long time to deal with such texts as well, but if they are multi-threaded, other requests can be dealt with at the same time and won’t have to wait until the regex functions are done processing the text.

Since the regexes a server runs are often easy to figure out, and in some cases well known, it is relatively easy to craft a text that will keep an unprotected server occupied for up to a few minutes.

For example, many servers use Node.js, a JavaScript runtime that has quite a few documented ReDoS vulnerabilities.

In other cases, attackers can search for so-called “evil regexes.” What makes a regex stand out as evil?

The regular expression applies repetition (“+”, “*”) to a complex subexpression.
For the repeated subexpression, there exists a match that is also a suffix of another valid match.
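The two criteria above can be observed with a commonly cited textbook pattern, (a+)+$, which applies + to a group that itself repeats. The sketch below is in Python and assumes a backtracking engine such as the standard re module; the pattern, the helper name timed_match, and the input sizes are choices made for this illustration, and the sizes are kept small on purpose so the script finishes quickly.

```python
import re
import time

# Repetition applied to a subexpression that itself repeats (the "evil" shape).
EVIL = re.compile(r"(a+)+$")
# Accepts the same strings, but without the nested repetition.
SAFE = re.compile(r"a+$")


def timed_match(pattern: re.Pattern, text: str) -> tuple[bool, float]:
    """Return (matched, seconds spent) for matching pattern at the start of text."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


# A run of "a" followed by "!" can never match, so a backtracking engine
# tries many ways of splitting the run between the inner and outer "+".
for length in (14, 16, 18, 20):
    text = "a" * length + "!"
    _, evil_seconds = timed_match(EVIL, text)
    _, safe_seconds = timed_match(SAFE, text)
    print(f"{length} chars: evil {evil_seconds:.4f}s, safe {safe_seconds:.6f}s")
```

On a backtracking engine the time for the evil pattern should grow sharply with each extra character, while the rewritten pattern stays fast. Exact timings depend on the machine, so none are quoted here.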

Prevention of ReDoS attacks

To prevent becoming a victim of a ReDoS attack, it is not enough to rely on the built-in security of the regex. Here are some tips:

Use atomic grouping in your regex where your regex engine supports it. An atomic group is a group that, when the regex engine exits from it, automatically throws away all backtracking positions remembered by any tokens inside the group.
Keep tabs on your regexes. When a regex takes much longer than it should, kill it at once. You can inform the user that it was stopped for this reason and as a security measure.
Validate your input, and don’t allow users to use their own regexes. If there is no other way, then pre-format the regexes and only allow certain minimal deviations.
Only write your own regexes for production servers and apps if there are no other known reliable sources available.
Use one of the verification packages that are available for regexes to have your regex checked for vulnerabilities.
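The tip about killing a regex that runs too long can be sketched as follows. Python's standard re module has no timeout option, so this illustration runs the search in a separate Python process and lets subprocess stop it when the time limit is exceeded. The function name search_with_timeout, the child script, and the default limit are all assumptions made for the example, not a recommended production design.

```python
import subprocess
import sys
from typing import Optional

# Tiny child script: exit code 0 means "match found", 1 means "no match".
_CHILD_SCRIPT = (
    "import re, sys; "
    "sys.exit(0 if re.search(sys.argv[1], sys.argv[2]) else 1)"
)


def search_with_timeout(pattern: str, text: str, seconds: float = 2.0) -> Optional[bool]:
    """Run re.search in a child process.

    Returns True or False for a finished search, or None when the search
    was killed because it exceeded the time limit.
    """
    try:
        finished = subprocess.run(
            [sys.executable, "-c", _CHILD_SCRIPT, pattern, text],
            timeout=seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    return finished.returncode == 0


print(search_with_timeout(r"\d+", "abc123"))
print(search_with_timeout(r"(a+)+$", "a" * 40 + "!", seconds=1.0))
```

Starting a process for every search is expensive, so a real service would more likely reuse worker processes or rely on a regex engine with built-in limits. The point of the sketch is only that a runaway search is stopped and reported instead of freezing the caller.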

Popular does not equal safe

Even though Node.js is an immensely popular JavaScript runtime, it is not enough to rely on the security it provides. And even though regexes can be useful tools, using them should come with some precautions. Reportedly, there has been an uptick lately in ReDoS attacks on web apps and servers.