Regular Expressions (Part 1)

A powerful, pervasive, and yet poorly understood aspect of computer programming is the concept of “regular expressions.” Through what may at first seem like arcane symbols and syntax, system administrators and coders can search through large swaths of text quickly and efficiently, highlighting, replacing, or even deleting any content that matches their query.

My first exposure to regular expressions came from learning the Perl scripting language, which is largely designed to aid system administrators in scanning through log files and other records. There are slight variations in how different languages handle regular expressions, but Perl’s conventions are common across many languages and tools, and so they are what I present here.

By using particular switches, selectors, and symbols, programmers can tell their scripts whether they want to search for letters, numbers, or other characters. We will start with two examples that each match any single digit: \d and [0-9]

“\d” stands for “digit” and represents any single digit, 0 through 9. The second example shows a selector, or range, of possible entries, in this case also 0 through 9. With just the above as guides, how could you write “match any three-digit number”? Possible answers: \d\d\d or [0-9][0-9][0-9]. We will learn easier, more efficient methods later.
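To see the two forms side by side, here is a small sketch in Python, whose built-in re module follows Perl-style syntax closely for everything covered in this post. The helper name check and the sample strings are my own, purely for illustration. I pass re.ASCII so that \d means exactly 0 through 9; without it, Python 3 also accepts digits from other scripts.

```python
import re

SHORTHAND = re.compile(r"\d\d\d", re.ASCII)  # \d = one digit, three times
RANGES = re.compile(r"[0-9][0-9][0-9]")      # [0-9] = one character from 0 to 9

def check(text):
    """Return (shorthand_result, range_result) for a whole-string match."""
    return (SHORTHAND.fullmatch(text) is not None,
            RANGES.fullmatch(text) is not None)

# Hypothetical sample inputs: both patterns should always agree.
for sample in ["123", "007", "12", "1234", "12a"]:
    print(sample, check(sample))
```

Note that fullmatch insists the entire string fit the pattern, which is why "1234" is rejected. A plain search would happily find "123" inside it.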

Just as “\d” stands for “digit,” the switch “\w” stands for “word character.” Slightly more complicated than \d, \w matches lower- and upper-case letters, the digits 0 through 9, and the underscore. This means a selector designed to match the same items as \w needs to account for all of these: [a-zA-Z0-9_] To match letters only, drop the digits and the underscore: [a-zA-Z]

Do you see the difference from the numerical example above, and can you guess its meaning? A range of [a-z] would only match lower-case letters, while [A-Z] would only match upper-case letters. On some occasions this specificity may be desired, but it’s important to understand how character ranges work and when they are equivalent to (and differ from) switches.
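One way to get a feel for where ranges and switches agree is to test single characters against each. The sketch below is in Python with re.ASCII set, so that \w behaves as it traditionally does in Perl-style engines: letters, digits, and the underscore. The function name classify and the sample characters are just my own illustration.

```python
import re

WORD_CHAR = re.compile(r"\w", re.ASCII)  # letters, digits, and underscore
LETTERS = re.compile(r"[a-zA-Z]")        # letters of either case
LOWER = re.compile(r"[a-z]")             # lower-case letters only

def classify(ch):
    """Report which of the three single-character patterns accept ch."""
    return {
        r"\w": WORD_CHAR.fullmatch(ch) is not None,
        "[a-zA-Z]": LETTERS.fullmatch(ch) is not None,
        "[a-z]": LOWER.fullmatch(ch) is not None,
    }

# Hypothetical sample characters chosen to show where the patterns disagree.
for ch in ["q", "Q", "7", "_", "-"]:
    print(repr(ch), classify(ch))
```

The digit and the underscore are accepted by \w but by neither letter range, and the hyphen is rejected by all three.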

Can you use the above examples to create a regular expression that means “match any lower-case letter, then any digit, then any letter”? Possible answers: [a-z][0-9][a-zA-Z] or [a-z]\d[a-zA-Z] Note that [A-Z] in the last position would accept only upper-case letters, while \w would also accept digits and the underscore. Hopefully these examples make sense.
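Here is a quick sketch, again in Python, of a pattern for “a lower-case letter, then a digit, then a letter of either case,” next to a looser variant that ends in \w. The names STRICT, LOOSE, and matches, and the test strings, are hypothetical ones I picked for the demonstration.

```python
import re

# Lower-case letter, then a digit, then a letter of either case.
STRICT = re.compile(r"[a-z]\d[a-zA-Z]", re.ASCII)
# Same start, but \w in the last position also allows digits and underscore.
LOOSE = re.compile(r"[a-z]\d\w", re.ASCII)

def matches(pattern, text):
    """Return True if the whole string fits the compiled pattern."""
    return pattern.fullmatch(text) is not None

for sample in ["a1B", "a1b", "A1b", "a1_", "a17"]:
    print(sample, matches(STRICT, sample), matches(LOOSE, sample))
```

The last two samples are where the patterns part ways: "a1_" and "a17" fit the looser pattern but not the strict one.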

Sometimes it is more efficient to match multiple characters without having to specify \d or \w for each one. This is where the “+” operator comes into play. In essence, any switch or range followed by a + means “one or more of these.” Using \w+ would, therefore, match a word of any length, whether one character or fifty. \d+ would match a number of any length, so long as it only contained digits (no commas, periods, or other separators).
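As a rough illustration of “one or more,” the Python sketch below pulls every run of digits, and every run of word characters, out of a longer string. The function names and sample sentences are my own inventions. Keep in mind that \w+ treats digits and underscores as part of a “word” too.

```python
import re

NUMBER = re.compile(r"\d+", re.ASCII)  # one or more digits in a row
WORD = re.compile(r"\w+", re.ASCII)    # one or more word characters in a row

def find_numbers(text):
    """Return every unbroken run of digits found in text."""
    return NUMBER.findall(text)

def find_words(text):
    """Return every unbroken run of word characters found in text."""
    return WORD.findall(text)

print(find_numbers("Order 66 shipped 1,250 units in 2024"))
# ['66', '1', '250', '2024']  (the comma splits 1,250 into two matches)
print(find_words("log file: error_42 at 10pm"))
# ['log', 'file', 'error_42', 'at', '10pm']
```

The comma in “1,250” shows the point about separators: \d+ stops at the first character that is not a digit, so the number comes back as two pieces.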

In the next post exploring regular expressions, I’ll introduce the operators ? and *, as well as give new examples for matching specific words or phrases inside of a longer string. I find regular expressions to be fascinating, and an absolutely invaluable tool when it comes to system administration, scripting, and simplifying many repetitive computing tasks.

Have questions about regular expressions? Feel free to email me!