Regular Expressions: The Lasso For Data Wrangling
Begin by typing some simple text into the “Test String” prompt. In this example, I have a list of names. Using regular expressions, we are going to check whether our string contains the name “Bryan”. We can confirm that there is, in fact, a Bryan in our list! If we were to search for a name that was not in the list, our regular expression would not return any results.
The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns, most of which match a single character:
|Regular Expression Character||Function|
||Matches any string starting with
||The ‘$’ symbol denotes the end of the sentence or text block|
||Matches if, in the literal sense,
||This is a greedy match that will match where ever 1 or more
||This will match for
||We will match for
||This will match any
|\s|(lowercase s) matches a single whitespace character: space, newline, return, tab, form feed|
|\S|(uppercase S) matches any non-whitespace character.|
|.|Matches any single character except a newline|
|\w|(lowercase w) matches a “word” character: a letter, digit, or underscore [a-zA-Z0-9_]. Note that although “word” is the mnemonic for this, it only matches a single word character, not a whole word. \W (uppercase W) matches any non-word character|
|\b|boundary between word and non-word|
|\t, \n, \r|tab, newline, return|
|\d|decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)|
||inhibit the “specialness” of a character. So, for example, use
||Matches anything that is not
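To see a few of these character classes in action, here is a small Python sketch (Python's re module is covered later in this article). The sample sentence and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, not taken from the article's screenshots.
text = "Order 66 shipped to Bryan_K on 2021-03-04"

digits = re.findall(r"\d+", text)        # runs of decimal digits
words = re.findall(r"\w+", text)         # runs of word characters (letters, digits, underscore)
chunks = re.findall(r"\S+", text)        # runs of non-whitespace characters
whole_word = re.search(r"\bto\b", text)  # "to" only where it stands alone as a word

print(digits)
print(words)
print(chunks)
print(whole_word.group())
```

Notice that \w+ splits the date at each dash because a dash is not a word character, while \S+ keeps the date in one piece.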
Regex flags are options that change how the search is carried out. On the regex site, you can change them using the icon found at the end of the Regular Expression box. Let’s look at some useful flag options:
|g|Global: return every match instead of stopping after the first one. If a certain word is found 4 times in a script, it will be detected every time.|
|m|Multiline: ^ and $ match at the start and end of every line. If, for example, we are looking for the word “the” at the beginning of a line, the multiline flag lets you check this condition for every line.|
|i|Ignore case. HELLO and hello can both be returned using the same search criteria.|
Example: Using modifier flags
In this example, note the ‘gmi’ flag at the right end. This means that the search was:
- global: it did not stop after the first match
- multiline: each line was treated as having its own start and end, so a pattern anchored to the start of a line was also checked against line 2
- case-insensitive: it returned results regardless of the case of the letters
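For readers who want to try the same idea in code, here is a rough Python equivalent (the re module is introduced later in this article). Python has no separate global flag; re.findall() already returns every match, so only multiline and ignore-case are passed. The sample text is made up for illustration.

```python
import re

# Hypothetical three-line sample text.
text = "The cat sat.\nthe dog ran.\nOver the hill."

# Without flags, ^ only matches at the very start of the string,
# and the lowercase pattern does not match the capital "The".
no_flags = re.findall(r"^the", text)

# re.M lets ^ match at the start of every line; re.I ignores case.
with_flags = re.findall(r"^the", text, re.M | re.I)

print(no_flags)
print(with_flags)
```

The third line contains “the”, but not at its beginning, so it is not returned in either case.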
To understand what your regular expression should look like, paste a sample of your data in the test string box. Try out different patterns to understand the most specific and accurate choice. Some tools on the website include the quick reference and explanation boxes. I often search for keywords such as ‘character’ or ‘digit’ in the quick reference to understand how to use it in my pattern.
Regular Expressions in Python
Now we’ll look at some ways to use regular expressions in our Python scripts. Python’s re module provides regular expression support. At the beginning of your code, import the module using import re. In Python, a regular expression search is typically written as:
match = re.search(pattern, string)
The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object; otherwise it returns None. Therefore, the search is usually immediately followed by an if-statement to test if the search succeeded.
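As a minimal sketch of that search-then-test pattern, here is the “Bryan” check from the start of this article written in Python. The list of names is a made-up stand-in for the test string.

```python
import re

# Hypothetical list of names standing in for the test string.
names = "Alice, Bryan, Carla"

match = re.search(r"Bryan", names)
if match:
    # group() returns the text that was matched.
    print("found:", match.group())
else:
    print("did not find")

# A name that is not in the list gives None instead of a match object.
missing = re.search(r"Zelda", names)
print(missing)
```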
Grouping In Regex:
Grouping is used to separate parts of our search and return the parts individually. Here’s an example:
import re

line = "Cats are smarter than dogs"

matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")
- r: the r prefix marks the pattern as a raw string, so Python passes backslashes to the regex engine unchanged.
- '(.*) are (.*?) .*': the first (.*) matches all the characters found before the word are. The parentheses create a group, a way of addressing this part of the match separately. You can use grouping to extract individual sections from a search.
- The ? in (.*?) makes the match non-greedy: it takes as few characters as possible, so it stops at the next space and returns only a single word. This word is stored in the second group.
- line: we’re searching the string stored in line.
- re.M: multiline mode, in which ^ and $ match at the start and end of every line. This pattern uses neither, so the flag has no effect here.
- re.I: ignore the case (both upper and lowercase letters should be matched).
- matchObj.group(): if we don’t specify a group number, the entire match is returned.
- matchObj.group(1): group 1 returns the match found within the first set of parentheses. 2 would return the second, and so on.
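To make the effect of the ? clearer, here is a small sketch that runs the same pattern with and without it. The variable names lazy and greedy are just labels chosen for this illustration.

```python
import re

line = "Cats are smarter than dogs"

# (.*?) is non-greedy: it stops at the first space it can.
lazy = re.match(r"(.*) are (.*?) .*", line)

# (.*) is greedy: it grabs as much as it can and only gives back
# enough for the trailing " .*" to still match.
greedy = re.match(r"(.*) are (.*) .*", line)

print(lazy.group(2))
print(greedy.group(2))
```

The non-greedy version captures just “smarter”, while the greedy version captures “smarter than”.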
A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parentheses groups to extract the parts you want.
Substitution Using Regex:
With substitution, you are essentially carrying out this action:
re.sub(r"find this", "and replace with this", string)
Using the re.sub function, you can specify what pattern you’d like to find and the string or character you’d like to replace this with. Let’s look at an example to understand its functionality.
import re

phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)
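Two more things re.sub() can do are worth a quick sketch: replacing a whole character class, and reusing captured groups in the replacement text. The date value and variable names below are assumptions made up for illustration.

```python
import re

phone = "2004-959-559 # This is Phone Number"

# \D matches any non-digit, so replacing it with "" keeps only the digits.
# Any digits inside the comment text would be kept as well.
digits_only = re.sub(r"\D", "", phone)

# Groups captured in the pattern can be reused in the replacement as \1, \2, \3.
date = "2021-03-04"  # hypothetical sample value
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)

print(digits_only)
print(reordered)
```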
Square brackets offer some more useful features. You can also use a dash to indicate a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]. An up-hat ^ at the start of a square-bracket set inverts it, so [^ab] means any char except a or b.
Using the grouping concept described in a previous section, you could isolate the username and host of an email address separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. In this case, the parentheses do not change what the pattern will match. Instead, they establish logical “groups” inside of the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.
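Here is a short sketch of both ideas: an inverted square-bracket set, and the email pattern with its two groups. The sample strings and the address are made up for illustration.

```python
import re

# [^ab] matches any single character except a or b.
not_ab = re.findall(r"[^ab]", "cabbage")
print(not_ab)

# Hypothetical sample text containing one email address.
text = "contact: alice-b@example.com today"

match = re.search(r"([\w.-]+)@([\w.-]+)", text)
if match:
    print(match.group())   # the whole match
    print(match.group(1))  # the username
    print(match.group(2))  # the host
```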
Regex: returning multiple results using ‘findall()’
findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern.
findall() finds all the matches and returns them as a list of strings.
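A minimal sketch of findall(), using a made-up sentence with two email addresses. When the pattern contains groups, findall() returns a list of tuples instead, one tuple per match.

```python
import re

# Hypothetical sample text.
text = "purple alice@example.com, blah monkey bob@abc.example blah dishwasher"

# No groups: each item in the list is the full matched string.
emails = re.findall(r"[\w.-]+@[\w.-]+", text)

# Two groups: each item is a (username, host) tuple.
pairs = re.findall(r"([\w.-]+)@([\w.-]+)", text)

for email in emails:
    print(email)
print(pairs)
```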
Regular expression patterns pack a lot of meaning into just a few characters. You can spend a lot of time debugging your patterns. Set up your runtime so you can run a pattern and print what it matches easily, for example by running it on a small test text and printing the result of findall(). If the pattern matches nothing, try weakening the pattern, removing parts of it so you get too many matches. When it’s matching nothing, you can’t make any progress since there’s nothing concrete to look at. Once it’s matching too much, then you can work on tightening it up incrementally to hit just what you want. Additionally, https://regex101.com/ can be a useful resource to visualize your matches and identify mistakes. Using the reference feature will allow you to explore some more methods of specifying your pattern.
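The weaken-then-tighten workflow might look something like this in practice. The sample text and the three patterns are assumptions invented for this sketch.

```python
import re

# Hypothetical sample text.
text = "Call 555-1234 or 555-9876 before 5pm"

# First attempt: too strict (expects five digits after the dash), matches nothing.
strict = re.findall(r"\d{3}-\d{5}", text)

# Weakened: now it matches too much, but there is something concrete to look at.
loose = re.findall(r"\d+", text)

# Tightened back up step by step until it hits just the phone numbers.
tuned = re.findall(r"\d{3}-\d{4}", text)

print(strict)
print(loose)
print(tuned)
```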