RegEx, an Intimidating Best Friend

Erin Sellers
5 min read · May 13, 2021

Sometimes, in life, you experience something for the first time and your preference is so strong or immediate that it can be difficult to explain. Oysters (love), mustard (hate), Vampire Weekend (love), video games (hate; apologies in advance to 90% of my cohort), Settlers of Catan (love), women’s pants with a rise over nine inches (hate). My most recent gut-reaction meet-cute happened when I encountered the .split method in programming: HATE.

Maybe it was because this method seemed unnecessary. Thus far in my time on earth I’ve rarely had a string of text or numbers that I desperately needed to hack into smaller pieces. But I suspected that my hatred for .split actually came from the frequently befuddling code in the .split argument. There were coding labs where I would spend a harrowing amount of time banging my head against a wall, only to find out that the solution looked something like: self.split(/\.|\?|\!/).delete_if {|w| w.size < 2}.size. I’d throw my hands up and declare that of course I would never have gotten that, no one would ever think of that. Then I’d walk away and wash my hands of it, convinced that .split was a method I was never going to master but was also never going to need.

That turned out to be wrong. Especially as I moved into Ruby and working with back-end data, it seemed like methods such as .split, .match, and .scan were here to stay. When I finally had to face the music and started to research how to properly use the .split argument, I couldn’t find any satisfying explanations, but I kept seeing the term RegEx thrown around.
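To make those three methods a little less abstract, here is a minimal sketch of what each one does with a pattern. The helper names and the sample sentence are made up for illustration, and the pieces of the patterns themselves (\d, \s, + and {3}) are covered further down.

```ruby
# Hypothetical helpers; the sample text below is invented for this example.

# split cuts the string wherever the pattern matches and returns the pieces.
def words_in(text)
  text.split(/\s+/)
end

# match returns a MatchData object for the first match, or nil if there is none.
def first_phone(text)
  found = text.match(/\d{3}-\d{4}/)
  found && found[0]
end

# scan returns every match as an array of strings.
def all_phones(text)
  text.scan(/\d{3}-\d{4}/)
end

sample = "Call 555-1234 or 555-9876 today"
p words_in(sample)    # ["Call", "555-1234", "or", "555-9876", "today"]
p first_phone(sample) # "555-1234"
p all_phones(sample)  # ["555-1234", "555-9876"]
```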

RegEx stands for “regular expression” and is a way to search for certain patterns or strings in text. RegEx is the scary mix of “/|\*W/” I had been seeing at the end of .split. And although there was a lack of satisfying explanations for split, there was a myriad of resources on RegEx. Below are some quick takeaways so that, if you’re like me, the next time you see RegEx it’s not quite as daunting.

To start, you will often see forward slashes at the start and end of a RegEx (/123/). These are delimiters: literal syntax used to denote the beginning and end of a regular expression (think of this like the “ or ‘ used to indicate a string). Flags are something different: optional letters placed after the closing slash, such as the i in /abc/i, which makes the match case-insensitive.
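Here is a small sketch of the slashes in action in Ruby. The method names are my own, chosen just for this example. It also shows Regexp.new, which builds the same pattern from a plain string, and the i option, which goes after the closing slash.

```ruby
# A regex literal sits between forward slashes, much like quotes around a string.
def literal_pattern
  /cat/
end

# Regexp.new builds an equivalent pattern from a string.
def built_pattern
  Regexp.new("cat")
end

# An option letter goes after the closing slash; i means "ignore case".
def case_insensitive_pattern
  /cat/i
end

p literal_pattern == built_pattern           # true
p "Cat nap".match?(literal_pattern)          # false
p "Cat nap".match?(case_insensitive_pattern) # true
```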

Brackets are typically used as character classes. For example, if I had a block of text and wanted to find every vowel, searching with just the letters between slashes (/aeiou/) would look for that exact sequence, “aeiou”, in my block of text. But if I put those letters in square brackets (/[aeiou]/), now my code understands that I want any instance of any one of those letters.
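The difference is easy to see side by side. This is a rough sketch with method names I invented for the comparison:

```ruby
# Without brackets, the pattern looks for the exact sequence "aeiou".
def exact_sequence_matches(text)
  text.scan(/aeiou/)
end

# With brackets, the pattern matches any single one of the listed letters.
def vowel_matches(text)
  text.scan(/[aeiou]/)
end

p exact_sequence_matches("regular expression") # []
p vowel_matches("regular")                     # ["e", "u", "a"]
```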

Another handy bracket feature is a range. Instead of typing out each letter or number I might be searching for, if it happens to fall in a range I can use this notation ([a-z]) to search for any letter in this range. Note, these are case sensitive, so ([A-Z]) is distinct from the range above. And if I put the ^ symbol first inside the brackets ([^abc]), it negates the class and means that I am looking for any character except a, b, or c. Outside of brackets, ^ is used to indicate the start of the line, or “begins with” (while $ is used to indicate the end of the line or “ends with”).
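A quick sketch of ranges, negation, and the two anchors, using made-up method names and sample strings. Note that for negation the ^ sits inside the brackets, right after the opening one.

```ruby
# [a-z] matches any one lowercase letter; [A-Z] any one uppercase letter.
def lowercase_letters(text)
  text.scan(/[a-z]/)
end

def uppercase_letters(text)
  text.scan(/[A-Z]/)
end

# A ^ placed first inside the brackets negates the class.
def not_abc(text)
  text.scan(/[^abc]/)
end

# Outside brackets, ^ anchors to the start of a line and $ to the end.
def starts_with_ruby?(line)
  line.match?(/^Ruby/)
end

def ends_with_ruby?(line)
  line.match?(/Ruby$/)
end

p lowercase_letters("Ab1c")        # ["b", "c"]
p uppercase_letters("Ab1c")        # ["A"]
p not_abc("cabbage")               # ["g", "e"]
p starts_with_ruby?("Ruby is fun") # true
p ends_with_ruby?("Ruby is fun")   # false
```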

Metacharacters are a fun short-hand that you’ll frequently see: \d matches a single character that is a digit, \w matches any word character (letter, number, underscore), and \s is any whitespace character including tabs and line breaks. Similar to the ^ example above, making the letter in any of these metacharacters uppercase will negate the expression and prompt it to match the opposite (e.g., \D will match any non-digit). We can also match tabs \t, new lines \n, and carriage returns \r.
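Here is roughly how those short-hands behave on a tiny sample string. The method names and the sample are mine, purely for illustration.

```ruby
# \d matches one digit.
def digits(text)
  text.scan(/\d/)
end

# \D is the negated version: one character that is not a digit.
def non_digits(text)
  text.scan(/\D/)
end

# \w matches one word character: a letter, a digit, or an underscore.
def word_chars(text)
  text.scan(/\w/)
end

# \s matches one whitespace character, including tabs and line breaks.
def split_on_whitespace(text)
  text.split(/\s/)
end

sample = "a_1 b2"
p digits(sample)                         # ["1", "2"]
p non_digits(sample)                     # ["a", "_", " ", "b"]
p word_chars(sample)                     # ["a", "_", "1", "b", "2"]
p split_on_whitespace("one\ttwo\nthree") # ["one", "two", "three"]
```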

As a side note, I find it wild that computers have any reference to a carriage return and would bet the term has little meaning to most programmers as not everyone briefly got into typewriters during their old-timey phase as a teen. Some operating systems seem to agree with me and only read \n for new lines, so I’d suggest sticking with this.

Similar to other programming expressions, a pipe indicates either/or, with (e|s) searching for either “e” or “s”. RegEx also simplifies looking for repeating instances of characters or symbols: (a*) looks for zero or more instances of “a”, (a+) looks for one or more, (a{2}) looks for exactly two, (a{2,}) means two or more, and (a{2,5}) looks for between 2 and 5 a’s.
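The pipe and the repetition counts can be sketched like this. The method names are hypothetical, and I have wrapped the counting patterns in ^ and $ so that each one has to describe the whole string, which makes the counts easier to see.

```ruby
# e|s matches either an "e" or an "s".
def either_e_or_s(text)
  text.scan(/e|s/)
end

# a* means zero or more a's, so a lone "b" still qualifies.
def b_then_any_as?(text)
  text.match?(/^ba*$/)
end

# a+ means one or more a's, so a lone "b" does not qualify.
def b_then_some_as?(text)
  text.match?(/^ba+$/)
end

# a{2} means exactly two a's.
def exactly_two_as?(text)
  text.match?(/^a{2}$/)
end

# a{2,} means two or more a's.
def two_or_more_as?(text)
  text.match?(/^a{2,}$/)
end

# a{2,5} means between two and five a's.
def two_to_five_as?(text)
  text.match?(/^a{2,5}$/)
end

p either_e_or_s("seen")     # ["s", "e", "e"]
p b_then_any_as?("b")       # true
p b_then_some_as?("b")      # false
p exactly_two_as?("aaa")    # false
p two_or_more_as?("aaa")    # true
p two_to_five_as?("aaaaaa") # false
```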

As we’ve seen, there are certain characters in RegEx that have special meaning: ^ . [ $ ( ) | * + ? { \ What happens if I need to search for one of those?

Typing the character on its own won’t work. In order to search for one of these special characters, you need to put a backslash in front of it to “escape” it, so that it is taken literally. To search for every “$” in my document I could use (/\$/).
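A short sketch of escaping in Ruby, with method names of my own choosing. Regexp.escape is a built-in that adds the backslashes for you. The last method wraps the lab solution from the start of this post (there it was called on self; here I pass the text in as an argument), which turns out to be nothing more than three escaped punctuation marks joined by pipes.

```ruby
# A backslash in front of $ makes it a literal dollar sign.
def dollar_signs(text)
  text.scan(/\$/)
end

# Regexp.escape returns a copy of the string with special characters escaped.
def escaped(text)
  Regexp.escape(text)
end

# Split on a literal ".", "?" or "!", drop leftover fragments shorter than
# two characters, and count what remains.
def count_sentences(text)
  text.split(/\.|\?|\!/).delete_if { |w| w.size < 2 }.size
end

p dollar_signs('$5 + $10') # ["$", "$"]
puts escaped('$1.50')      # \$1\.50
p count_sentences("Hi there. How are you? I am fine!") # 3
```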

There are many more nuances to RegEx that make it a useful and powerful tool to understand, but hopefully after this quick review of the basics you’ll see RegEx as a potential best friend. The kind of intimidating best friend who feels comfortable making an aesthetically pleasing meal out of “whatever’s in the fridge” on a Tuesday.

For more resources on RegEx:

https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
