A guide to regular expressions for regular SEOs

This is a survival manual for those who are scared of regular expressions.

A regular expression is a sequence of characters that forms a search pattern. It can be used to find patterns in text, such as particular words or phrases, or to extract information from text. Regex is very useful if you work in marketing, but writing it can sometimes feel like writing hieroglyphics. This is why I gave an entire talk on this topic at BrightonSEO. It was well received, so I hope this article will be as well.


What is a regex pattern?

A regular expression (often abbreviated as regex) is a sequence of characters that defines a search pattern, typically used for string matching or parsing. Regexes are used by many programming languages and text-editing programs to perform search-and-replace operations, validate input, and perform other tasks.

The SEO definition of a regex

  • At the most basic level, a regex pattern is a sequence of characters that is used to match a string of text.
  • The regex pattern consists of one or more characters, which can be any type of character, including letters, numbers, punctuation, and even whitespace.
  • Some regex patterns are also composed of special characters that have specific meanings. 

What do 'lazy' and 'greedy' regex patterns mean?

Lazy and greedy regex patterns are two different approaches to using regular expressions to match or extract text from a given string. It's important to understand the difference between these two approaches in order to understand why they return different results.

'Greedy' means match the longest possible string.

A greedy quantifier (such as * or +) tries to match as much of the string as possible in one go, then gives characters back only if the rest of the pattern cannot match otherwise. Greediness only matters when the pattern contains a quantifier: a fixed pattern like ‘abc’ simply matches the text ‘abc’ and nothing more. For example, let’s say you have the string ‘abcabc’ and you are searching for the pattern ‘a.*c’. The greedy ‘.*’ first runs to the end of the string and then backs up to the last ‘c’ it can find, so the match is the whole string ‘abcabc’ rather than the first ‘abc’.

'Lazy' means match the shortest possible string.

Lazy quantifiers are named this way because they do the least amount of work needed to get a match: they match as few characters as possible and only expand when the rest of the pattern cannot match otherwise. For example, if you wanted to match the word “red” in the string “red bird”, you could use a lazy pattern like “r.*?d”. The “.*?” part of the pattern is known as a lazy quantifier. In this case, the pattern matches from the first “r” up to the nearest following “d”, which gives “red”. The greedy version “r.*d” would run to the last “d” and return “red bird”.

Here's the best way to illustrate the difference between the two: The greedy h.+l matches 'hell' in 'hello' or in 'hellscape' but the lazy h.+?l matches 'hel'. 
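If you want to check this yourself, here is a minimal sketch in Python. It assumes Python's built-in re module behaves like your SEO tool for these two patterns, which is true for simple cases like this one but is not guaranteed for every tool.

```python
import re

# Greedy: .+ grabs as much as it can, then backs off to the LAST "l".
greedy = re.search(r"h.+l", "hellscape").group()

# Lazy: .+? grabs as little as it can and stops at the FIRST "l" it can use.
lazy = re.search(r"h.+?l", "hellscape").group()

print(greedy)  # hell
print(lazy)    # hel
```

Same string, same letters in the pattern, one extra question mark, and a different result.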

What are some common pitfalls to avoid when using regular expressions for SEO?

Here's the main one: you get more than you expected because your filter doesn't work as intended. If you use a pattern with a greedy quantifier like .*, you may end up matching more than you want. The solution is to use the lazy version, .*?, instead. If you want to know more about fixing your advanced filters in Screaming Frog, you can read their article: SEO Spider Regex FAQ
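Here is a rough illustration of that pitfall in a custom extraction. The HTML snippet and the span markup are made up for this example, and Python's re module stands in for the crawler's regex engine, so treat it as a sketch and not as what your tool does internally.

```python
import re

# Hypothetical HTML, for illustration only.
html = '<span class="price">19.99</span> <span class="stock">In stock</span>'

# Greedy capture: (.*) runs to the LAST </span> on the line.
greedy = re.findall(r"<span[^>]*>(.*)</span>", html)

# Lazy capture: (.*?) stops at the FIRST </span> after each opening tag.
lazy = re.findall(r"<span[^>]*>(.*?)</span>", html)

print(len(greedy))  # 1
print(lazy)         # ['19.99', 'In stock']
```

The greedy version returns one long blob that swallows both values and the markup between them. The lazy version returns the two values separately.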

Regular Expressions in SEO

Regular expressions are one of the most powerful tools in the SEO toolbox. They are very useful for any SEO specialist trying to identify patterns (like URL patterns) in large amounts of data. No need to be a pro to use them. Here's a Google Sheet Script To Convert Plain English Descriptions Into Regex Statements. Here's how to use it: 

  1. Enter a description (in English) of the Regex filter you need and you’ll get a properly formatted Regex.
  2. The script uses OpenAI’s GPT-3 machine learning model to convert standard English statements into valid Regex.
  3. Danny, the creator, has the best sales pitch for this thing: “Download a copy of my Google Sheet and script to put an end to your tears.”

 

Google Analytics allows regex filter options

In Google Analytics, regex makes it easy to find specific patterns. You can use it to find all pages within a subdirectory, all pages with a query string, deal with IP addresses, etc. Here are things most people don't know about using regular expressions in GA:

  • By default, Universal Analytics (UA) treats a regex as a "partial match". The expression will be true if the pattern is contained anywhere in the data.
  • In GA4, the default regex is a "full match." The data must exactly match the pattern you provide.
  • If you use regex in a report in Google Analytics and then navigate away from that report, you will lose that filter.
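The difference between a partial match and a full match can be sketched in Python. This is only an analogy: re.search stands in for the UA behavior and re.fullmatch for the GA4 default, and the page path is a made-up example.

```python
import re

page = "/en/blog/regex-guide"  # hypothetical page path
pattern = r"/blog/"

# Partial match (UA-style): true if the pattern appears anywhere.
partial = re.search(pattern, page) is not None

# Full match (GA4 default): the whole value must fit the pattern.
full = re.fullmatch(pattern, page) is not None

# To get the same result with a full match, describe the whole value.
full_with_wildcards = re.fullmatch(r".*/blog/.*", page) is not None

print(partial, full, full_with_wildcards)  # True False True
```

If a filter that worked in UA returns nothing in GA4, wrapping the pattern in .* on both sides is usually the first thing to try.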

Google Search Console (GSC)

One of the most useful features offered by Google Search Console is the ability to filter the Performance report using regular expressions. Regular expressions allow you to specify complex patterns for matching strings of text, which lets you quickly narrow things down in the queries report.

Remember, you need to select the REGEX option in the dropdown menu!

What are some common regular expressions used in SEO?

Find long-tail keywords with Regular Expressions

Here's a great regular expression to help you find long-tail queries in Google Search Console. You can simply copy it as is to get started. Regex to match any query that is 75 characters or longer: ^[\w\W\s\S]{75,}$

Replace 75 in the regular expression with any other character length that suits your needs. 
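To see how the length threshold behaves, here is a small Python sketch. The function name is_long_tail and its min_length parameter are my own inventions for this example and are not part of Google Search Console.

```python
import re


def is_long_tail(query, min_length=75):
    """Return True if the query is at least min_length characters long."""
    # [\w\W\s\S] means "any character at all"; {N,} means "N or more of them".
    pattern = r"^[\w\W\s\S]{" + str(min_length) + r",}$"
    return re.search(pattern, query) is not None


print(is_long_tail("regex for seo"))                 # False
print(is_long_tail("x" * 75))                        # True
print(is_long_tail("regex for seo", min_length=10))  # True
```

Note that {75,} means 75 or more, so a query of exactly 75 characters is included.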

Check for Potential Content Injections 

Content injections typically involve an attacker injecting malicious code into an application, website, or other system. Content injections can also be used to create fake content, such as malicious ads, links, and images. Use the following regex to find spammy content type issues in GSC:

.*viagra.*|.*cialis.*|.*levitra.*|.*drugs.*|.*porn.*|.*www.*www.*

Feel free to replace any keyword with something else or add more terms to the list. We just showed you the most common words usually tied to content injection issues. We did notice Shopify had some issues with "FIFA" content injection, so we would recommend adding that to the list if you are checking this type of site.
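Here is a quick way to try the spam pattern on a list of queries before pasting it into GSC. The queries are invented for this example, and the pattern is written with no spaces around the pipes, because a space inside a regex is matched literally.

```python
import re

SPAM = re.compile(
    r".*viagra.*|.*cialis.*|.*levitra.*|.*drugs.*|.*porn.*|.*www.*www.*"
)

# Hypothetical queries, for illustration only.
queries = [
    "buy levitra online",
    "regex for seo",
    "www.example.com www.spam.example",
]

flagged = [q for q in queries if SPAM.search(q)]
print(flagged)  # ['buy levitra online', 'www.example.com www.spam.example']
```

The last alternative, .*www.*www.*, flags any query that contains "www" twice, which is rarely something a real person types.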

Find the users who reach your website through “commercial” intent terms:

.*(best|top|alternate|dupe|alternative|vs|versus|reviews?).*

Once again, you can tweak this regular expression by replacing any word you see in between two pipes.
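One thing to watch with this kind of pattern is that short terms also match inside longer words. The sketch below uses invented queries and writes the last alternative as reviews? so that only the trailing "s" is optional. The stricter version with \b (word boundary) is my own suggestion and assumes your tool's regex flavor supports \b.

```python
import re

loose = re.compile(r".*(best|top|alternate|dupe|alternative|vs|versus|reviews?).*")

# Assumption: \b is supported by the regex flavor you are using.
strict = re.compile(r"\b(best|top|alternate|dupe|alternative|vs|versus|reviews?)\b")

# Hypothetical queries, for illustration only.
queries = ["best running shoes", "nike vs adidas", "laptop stand", "how to tie shoes"]

loose_hits = [q for q in queries if loose.search(q)]
strict_hits = [q for q in queries if strict.search(q)]

print(loose_hits)   # ['best running shoes', 'nike vs adidas', 'laptop stand']
print(strict_hits)  # ['best running shoes', 'nike vs adidas']
```

"laptop stand" sneaks into the loose results because "laptop" contains "top". Whether that matters depends on how clean you need the list to be.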

Find all the client queries on Google AFTER a purchase in Google Search Console

^(clean|broken|wash off|shattered|polish|problem|treat|doesn't work|replace|doesn't start|scratch|repair|manual|fix|protect|renew|coverage|warranty)[" "]

This regular expression was originally published in this Tweet by Christopher Hofman.

Compare Brand VS Non-Brand Traffic.

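Use your brand terms as a regex filter: “Matches regex” shows your branded queries, and “Doesn't match regex” filters them out so you can see the generic keywords you rank for. Here's an example for the brand H&M: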

hm|h&m|hennes|mauritz
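The brand split can be sketched in Python like this. The queries are invented, and the two lists mirror what "Matches regex" and "Doesn't match regex" would each give you in GSC.

```python
import re

BRAND = re.compile(r"hm|h&m|hennes|mauritz")

# Hypothetical queries, for illustration only.
queries = [
    "h&m dresses",
    "hennes and mauritz stores",
    "summer dresses",
    "linen shirt men",
]

brand = [q for q in queries if BRAND.search(q)]          # like "Matches regex"
non_brand = [q for q in queries if not BRAND.search(q)]  # like "Doesn't match regex"

print(brand)      # ['h&m dresses', 'hennes and mauritz stores']
print(non_brand)  # ['summer dresses', 'linen shirt men']
```

Short brand fragments like "hm" can also appear inside unrelated words, so it is worth skimming the branded list for false positives.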

Find index inconsistencies in 30 seconds.

This tip comes courtesy of Daniel Foley Carter on LinkedIn.

  • Go to Google Search Console
  • Select Last 28 Days
  • Select PAGE > Filter > Select CUSTOM (REGEX)
  • Select Matches Regex and then put .*\/$
  • Hit Apply
  • Export the URLs into a Google Sheet
  • Then, go back to the PAGE filter and change Matches Regex to Doesn't Match Regex
  • Hit Apply
  • Export the URLs into a second Google Sheet

Once that is done, compare both sheets. If your site uses URL parameters, remove those URLs first: on both sheets, SELECT ALL in Google Sheets, click DATA > Create a Filter, then on COLUMN A (the URL column) apply the filter Text Does Not Contain and specify ?. If both sheets still contain URLs, some indexed URLs end with a trailing slash and others do not, which means you have inconsistent paths in the index.
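If you prefer to do the comparison outside of Google Sheets, the same check can be sketched in Python. The URL list is a made-up export and the function name split_by_trailing_slash is hypothetical. The last step, which looks for pages that exist in both versions, goes a little further than the tip above and is my own addition.

```python
import re

# Same pattern as the GSC filter: any URL that ends with a slash.
TRAILING_SLASH = re.compile(r".*\/$")


def split_by_trailing_slash(urls):
    """Drop URLs with parameters, then split the rest into two sets."""
    clean = [u for u in urls if "?" not in u]
    with_slash = {u for u in clean if TRAILING_SLASH.search(u)}
    without_slash = {u for u in clean if not TRAILING_SLASH.search(u)}
    return with_slash, without_slash


# Hypothetical export, for illustration only.
exported = [
    "https://example.com/shoes/",
    "https://example.com/shoes",
    "https://example.com/bags/",
    "https://example.com/bags/?sort=price",
]

with_slash, without_slash = split_by_trailing_slash(exported)

# Both sets non-empty means both URL styles are present.
inconsistent = bool(with_slash) and bool(without_slash)

# Extra step (my addition): pages that show up in BOTH versions.
both_versions = sorted(u for u in without_slash if u + "/" in with_slash)

print(inconsistent)   # True
print(both_versions)  # ['https://example.com/shoes']
```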

Common Regex Operators you can use to create a custom extraction

Some characters have a special meaning. The most common special characters are the dot (.), which matches any single character; the asterisk (*), which matches zero or more of whatever comes right before it; the plus sign (+), which matches one or more of whatever comes right before it; the question mark (?), which matches zero or one of whatever comes right before it; and the parentheses (), which are used to group patterns together.

You can use this guide to figure out things: 

. A wildcard match for any single character.

.* A match for zero or more characters.

.+ A match for one or more characters.

\d A match for any single numerical digit 0-9.

? The question mark is inserted after a character to make it an optional part of the expression.

| A vertical line or ‘pipe’ character indicates an ‘or’ function.

^ Used to denote the start of a string.

$ Used to denote the end of a string.

( ) Used to group a sub-expression.

\ Inserted before an operator or special character to ‘escape’ it. 
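Here is the cheat sheet above turned into a small set of checks you can run in Python. The test strings are arbitrary examples of mine, and Python's re module is used as a stand-in for whichever tool you end up pasting the patterns into.

```python
import re

# (pattern, text, should it match?)
examples = [
    (r"b.g", "bug", True),              # .  any single character
    (r"^/blog/.*$", "/blog/", True),    # .* zero or more characters
    (r"^/blog/.+$", "/blog/", False),   # .+ needs at least one character
    (r"^\d\d\d\d$", "2024", True),      # \d a single digit
    (r"^colou?r$", "color", True),      # ?  makes the "u" optional
    (r"^(cat|dog)s$", "dogs", True),    # |  means "or", ( ) groups the options
    (r"^seo$", "technical seo", False), # ^ and $ pin the start and the end
]

results = [(re.search(p, t) is not None) == expected for p, t, expected in examples]
print(all(results))  # True
```

Changing one character in a pattern and rerunning the check is a fast way to build intuition for each operator.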

Other things to keep in mind

Match metacharacters

Use the backslash (\) to escape regex metacharacters when you need those characters to be interpreted literally. For example, each dot in an IP address must be escaped with a backslash (\.) so that it is treated as a literal period and not as a wildcard.
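A short Python sketch shows why the escaping matters. The IP address is just an example from a private range. re.escape is a real Python helper that adds the backslashes for you, but it only helps if you are building the pattern in Python.

```python
import re

unescaped = r"^192.168.1.1$"    # each . matches ANY character
escaped = r"^192\.168\.1\.1$"   # each \. matches only a literal dot

not_an_ip = "192x168y1z1"

print(re.search(unescaped, not_an_ip) is not None)  # True (false positive)
print(re.search(escaped, not_an_ip) is not None)    # False
print(re.search(escaped, "192.168.1.1") is not None)  # True

# Python can add the backslashes for you.
print(re.escape("192.168.1.1"))  # 192\.168\.1\.1
```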

Log file analysis

You can also use regex patterns when analyzing server log files to segment crawled URLs. Regular expressions allow you to filter based on complex custom segments.
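As a rough sketch of what that segmentation can look like, the Python below counts crawled URLs per segment. Everything in it is an assumption made for illustration: the log lines are invented, the log format is assumed to contain a quoted request such as "GET /path HTTP/1.1", and the segment names and patterns are placeholders you would replace with your own.

```python
import re
from collections import Counter

# Assumption: each log line contains a quoted request like "GET /path HTTP/1.1".
REQUEST = re.compile(r'"(?:GET|HEAD) (\S+) HTTP')

# Hypothetical segments. The first pattern that matches wins.
SEGMENTS = {
    "blog": re.compile(r"^/blog/"),
    "product": re.compile(r"^/products?/"),
    "parameter": re.compile(r"\?"),
}


def segment_for(path):
    """Return the name of the first segment whose pattern matches the path."""
    for name, pattern in SEGMENTS.items():
        if pattern.search(path):
            return name
    return "other"


def count_segments(log_lines):
    """Count requests per segment, skipping lines without a request."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST.search(line)
        if match:
            counts[segment_for(match.group(1))] += 1
    return counts


# Invented log lines, for illustration only.
sample = [
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /blog/regex-guide HTTP/1.1" 200 5120',
    '66.249.66.1 - - [10/Oct/2023:13:55:40 +0000] "GET /products/shoes?color=red HTTP/1.1" 200 7300',
    '66.249.66.1 - - [10/Oct/2023:13:55:44 +0000] "GET /search?q=regex HTTP/1.1" 200 2100',
    '66.249.66.1 - - [10/Oct/2023:13:55:48 +0000] "GET /about HTTP/1.1" 200 1800',
]

counts = count_segments(sample)
print(dict(counts))  # {'blog': 1, 'product': 1, 'parameter': 1, 'other': 1}
```

Because the first matching segment wins, the order of the patterns matters: the product URL with a parameter is counted as "product" here, not as "parameter".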

Regex tester online tool

Don’t forget to test your regex to make sure it works as intended. Use https://regex101.com/ to see your creations come to life.

Here's a bibliography you will find interesting:

Regex For SEO: A Guide To Regular Expressions (With Use Cases)

A Marketer's Guide to Using Regex in Search Console [Video] - Annielytics.com

Don't Be Tongue-Tied: Learn RegEx Patterns for SEO

Beginner Guide To Regex For SEO - JC Chouinard

Regular Expressions (RegEx) in Google Search Console - JC Chouinard

Regex formulas for messy UTMs

I wrote this article because I gave a talk on the topic at BrightonSEO: Myriam Jessier & Chloe Smith – IRREGULAR REGEX FOR REGULAR PEOPLE

 
