Regular Expressions: The Lasso For Data Wrangling

A regular expression, often shortened to ‘regex’, is a pattern that describes a set of text strings. Regular expressions have become a foundational skill in bioinformatics following their use in the 1990s. They are the Data Wrangler’s Lasso: they let you parse text and pull out just the key parts, because you can identify key data from the pattern it follows. Since much of biology relies on specialized terminology, it’s a natural match. For example, gene IDs tend to follow a regular pattern, so given a data set with a lot of information about a gene, you can identify just the IDs and separate them out. Regular expressions are also supported in most languages: JavaScript, Java, VB, C#, C/C++, Python, Perl, Ruby, Delphi, R, Tcl, just to name a few. The best way to learn regular expressions is to try them out, for example on the regex101 website (https://regex101.com/).

Begin by typing some simple text into the “Test String” box. In this example, I have a list of names. Using a regular expression, we are going to check whether our string contains the name “Bryan”. The match confirms that there is, in fact, a Bryan in our list! If we were to search for a name that is not in the list, our regular expression would not return any results.

Basic Patterns

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

Regular Expression Character    Function
^The                            Matches any string starting with The. The ‘^’ symbol denotes the start of the string or line
end$                            Matches any string ending with end. The ‘$’ symbol denotes the end of the string or line
word                            Matches if the literal text word appears anywhere in the string
s*                              A greedy match for 0 or more s characters; because zero is allowed, it can also match an empty string
this*                           Matches thi followed by 0 or more ‘s’: thi, this, or even thissss
this+                           Matches thi followed by 1 or more ‘s’: it matches this and thiss, but not thi
[AT]                            Matches a single A or T, which can also be written as (A|T)
\s                              (lowercase s) Matches a single whitespace character: space, newline, return, tab, form feed [ \n\r\t\f]
\S                              (upper case S) Matches any non-whitespace character
.                               Matches any single character except a newline
\w                              (lowercase w) Matches a “word” character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although “word” is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character
\b                              Boundary between word and non-word
\t, \n, \r                      Tab, newline, return
\d                              Decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
\                               Inhibits the “specialness” of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a character has special meaning, such as ‘@’, you can put a backslash in front of it, \@, to make sure it is treated just as a character
\D                              Matches anything that is not a digit [0-9]
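
To see a few of these patterns in action outside regex101, here is a minimal Python sketch. The sample strings and the helper name matching_samples are made up for illustration; only re.search comes from Python's standard re module.

```python
import re

# Hypothetical sample strings, chosen only to exercise the patterns in the table.
samples = ["The gene is expressed", "reads map to the end", "thi", "thissss", "GCGC"]

# Each pattern is paired with a short description of what it should match.
patterns = {
    r"^The": "starts with 'The'",
    r"end$": "ends with 'end'",
    r"this*": "'thi' plus zero or more 's'",
    r"this+": "'thi' plus one or more 's'",
    r"[AT]": "contains an A or a T",
    r"\d": "contains a digit",
}

def matching_samples(pattern, texts):
    """Return the texts in which the pattern is found anywhere."""
    return [t for t in texts if re.search(pattern, t)]

for pattern, label in patterns.items():
    print(f"{pattern!r:10} {label}: {matching_samples(pattern, samples)}")
```

Note how this* accepts the bare thi while this+ does not, and how [AT] is case-sensitive unless an ignore-case flag is added.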

Regex Flags

Regex flags are options that change how the search is carried out. On the regex101 site, you can change them using the icon at the end of the regular expression box. Let’s look at some useful flag options:

Flag    Function
g       Global: match as many times as possible. If a certain word occurs 4 times in the text, it will be detected every time.
m       Multiline: ‘^’ and ‘$’ match at the start and end of every line, not only of the whole text. If, for example, we are looking for the word “the” at the beginning of a line, this flag checks the condition for every line.
i       Ignore case: HELLO and hello can both be returned using the same search criteria.

Example: Using modifier flags

In this example, note the ‘gmi’ flags at the right end. This means that the search was

  • global: it did not stop after the first match
  • multiline: the start-of-line and end-of-line conditions were checked on every line, not only on the first
  • case-insensitive: it returned results regardless of the case of the letters
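
For readers who want to try the same flags in code, here is a rough Python sketch. Python's re module has no ‘g’ flag; re.findall() plays that role by returning every match, while re.M and re.I correspond to ‘m’ and ‘i’. The sample text is made up for illustration.

```python
import re

# Hypothetical two-line sample text.
text = "The cat sat.\nthe dog ran. THE END"

# No flags: the match is case-sensitive and '^' only matches at the very start of the text.
no_flags = re.findall(r"^the", text)

# re.I (ignore case) alone: still only the start of the whole text is checked.
ignore_case = re.findall(r"^the", text, re.I)

# re.M | re.I: '^' matches at the start of every line, in any letter case.
multi_ignore = re.findall(r"^the", text, re.M | re.I)

# findall() behaves like the 'g' flag: it returns every match, not just the first.
all_the = re.findall(r"the", text, re.I)

print(no_flags, ignore_case, multi_ignore, all_the)
```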

To work out what your regular expression should look like, paste a sample of your data into the test string box. Try out different patterns to find the most specific and accurate one. Useful tools on the website include the quick reference and explanation boxes. I often search for keywords such as ‘character’ or ‘digit’ in the quick reference to see how to use them in my pattern.

 

Regular Expressions in Python

Now we’ll look at some ways to use regular expressions in our Python scripts. Python’s re module provides regular expression support. At the beginning of your code, import it using import re. In Python, a regular expression search is typically written as:

match = re.search(pattern, string)

The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object; otherwise it returns None. Therefore, the search is usually immediately followed by an if-statement to test whether the search succeeded.
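
The search-then-test idiom might look like the sketch below, which revisits the earlier idea of looking for “Bryan” in a list of names. The names and the helper function report are hypothetical; re.search, match.group() and match.start() are standard.

```python
import re

# Hypothetical list of names, stored as a single string.
names = "Alice, Bryan, Carmen, Deepak"

def report(name, text):
    """Return a short message saying whether name occurs in text."""
    match = re.search(name, text)
    if match:
        # match.group() is the matched text; match.start() is where it begins.
        return f"found {match.group()} at index {match.start()}"
    return f"{name} not found"

print(report("Bryan", names))
print(report("Zoe", names))
```

Testing the result before calling group() matters: calling a method on None raises an AttributeError.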

Grouping In Regex:

Grouping is used to separate parts of our search and return the parts individually. Here’s an example:

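import re

line = "Cats are smarter than dogs"
# Group 1 (greedy) captures the text before " are "; group 2 (lazy) captures the next word.
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)
if matchObj:
    print("matchObj.group() : ", matchObj.group())    # the entire match
    print("matchObj.group(1) : ", matchObj.group(1))  # first parenthesized group
    print("matchObj.group(2) : ", matchObj.group(2))  # second parenthesized group
else:
    print("No match!!")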


Looking closer: 

  • r: the r prefix tells Python to interpret the string in its raw form, so backslashes in the pattern reach the regex engine unchanged
  • '(.*) are (.*?) .*': the first (.*) captures all the characters found before the word are. The parentheses create a group, a way of addressing this part of the match separately. You can use grouping to extract individual sections from a search.
  • The ? in (.*?) makes the match lazy, so it stops at the first following space and returns only a single word. This word is stored in the second group
  • line: we’re searching the string stored in line
  • re.M: multiline mode, in which ‘^’ and ‘$’ match at the start and end of every line of text (this pattern uses neither, so the flag has no effect here)
  • re.I: ignore case (both upper and lowercase letters are matched)
  • matchObj.group(): if we don’t specify a group number, the entire match is returned
  • matchObj.group(1): group 1 returns the match found within the first set of parentheses, group 2 returns the second, and so on

A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parentheses groups to extract the parts you want.

Substitution Using Regex:

With substitution, you are essentially carrying out this action:

re.sub(r"find this", "and replace with this", string)

Using the re.sub function, you can specify what pattern you’d like to find and the string or character you’d like to replace this with. Let’s look at an example to understand its functionality.

import re

phone = "2004-959-559 # This is Phone Number"
# Delete Python-style comments: everything from '#' to the end of the string
num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)


Looking closer: 

  • re.sub(r'#.*$', "", phone): the pattern matches a # and everything after it up to the end of the string, and the match is replaced with an empty string, which deletes the comment

Square brackets offer some more useful features. You can use a dash to indicate a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]. A caret ^ at the start of a square-bracket set inverts it, so [^ab] means any char except a or b. Using the grouping concept described in a previous section, you could isolate the username and host of an email address separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. In this case, the parentheses do not change what the pattern will match. Instead, they establish logical “groups” inside of the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.
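
The username-and-host grouping described above can be sketched as follows. The sample text and the address in it are invented for illustration.

```python
import re

# Hypothetical sample text containing one email address.
text = "purple alice-b@example.com monkey dishwasher"

match = re.search(r"([\w.-]+)@([\w.-]+)", text)
if match:
    print(match.group())   # the whole match
    print(match.group(1))  # the username (first group)
    print(match.group(2))  # the host (second group)
```

The dash sits last inside each [\w.-] set, so it is read as a literal dash rather than a range.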

Regex: returning multiple results using ‘findall()’

findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds all the matches and returns them as a list of strings.
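
As a small sketch of findall(), the example below pulls gene IDs out of a sentence, in the spirit of the gene ID use case mentioned in the introduction. The sample sentence is made up, and the pattern assumes IDs of the form ENSG followed by 11 digits. One detail worth knowing: when the pattern contains more than one group, findall() returns a list of tuples rather than a list of strings.

```python
import re

# Hypothetical annotation text; the IDs are assumed to look like 'ENSG' + 11 digits.
text = "BRCA2 (ENSG00000139618) and TP53 (ENSG00000141510) were both detected."

# No groups in the pattern: findall() returns a list of matched strings.
gene_ids = re.findall(r"ENSG\d{11}", text)
print(gene_ids)

# Two groups in the pattern: findall() returns a list of (name, id) tuples.
pairs = re.findall(r"(\w+) \((ENSG\d{11})\)", text)
print(pairs)
```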

Debugging

Regular expression patterns pack a lot of meaning into just a few characters, so you can spend a lot of time debugging them. Set up your runtime so you can run a pattern and print what it matches easily, for example by running it on a small test text and printing the result of findall(). If the pattern matches nothing, try weakening it, removing parts of it until you get too many matches. While it’s matching nothing, you can’t make any progress since there’s nothing concrete to look at. Once it’s matching too much, you can tighten it up incrementally to hit just what you want. Additionally, https://regex101.com/ can be a useful resource to visualize your matches and identify mistakes. Its reference feature lets you explore more ways of specifying your pattern.
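
The loosen-then-tighten approach might look like this in practice. The test text and all three patterns are hypothetical, chosen so that the first attempt matches nothing.

```python
import re

# Hypothetical small test text.
text = "sample_01: 12.5 ng, sample_02: 7 ng"

# First attempt matches nothing: it expects a hyphen, a decimal point, and no space before 'ng'.
strict = re.findall(r"sample-(\d+): (\d+\.\d+)ng", text)
print(strict)

# Weaken the pattern until something shows up, then inspect what the data really looks like.
loose = re.findall(r"sample\S*", text)
print(loose)

# Tighten again, now using what the loose match revealed (underscore, optional decimals, space).
tightened = re.findall(r"sample_(\d+): ([\d.]+) ng", text)
print(tightened)
```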