Regular expressions (shortened as regex or regexp) are a powerful mechanism for text processing. You can use regular expressions to search for a pattern within a block of text, to replace bits of that text with other bits of text, and to manipulate strings in various other subtle and interesting ways.
The shell itself does not support regular expressions natively. To use regular expressions, you must invoke an external tool.
Some tools that support regular expressions include:
awk—A scripting language in and of itself. Described further in How AWK-ward.
grep—Returns the list of lines that match an expression (or the lines that do not match with the -v flag). Exits with a status of true (0) if a match occurred or false (1) if no match occurred.
perl—A scripting language with more advanced regular expression functionality.
sed—A tool that performs text substitutions based on regular expressions.
For example, to change every occurrence of the letter “a” in a string to a capital “A”, you might echo the string and pipe the result to sed like this:
echo "This is a test, this is only a test" | sed 's/a/A/g'
You can also use regular expressions to search for strings in a file or a block of text by using the grep command. For example, to look for the word “bar” in the file foo.txt, you might do this:
grep "bar" foo.txt
# or, equivalently, by piping the file's contents to grep:
cat foo.txt | grep "bar"
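Because grep reports whether a match occurred through its exit status, it can drive an if statement directly. The following is a minimal sketch; the temporary directory and the contents written to foo.txt are made up for the demonstration, and the -q option (which suppresses output) is assumed to be available, as it is in POSIX grep.

```shell
# Hypothetical sample file, created only for this demonstration.
tmpdir=$(mktemp -d)
printf 'foo\nbar\nbaz\n' > "$tmpdir/foo.txt"

# -q suppresses output; only the exit status is used.
if grep -q "bar" "$tmpdir/foo.txt"; then
    found_bar=yes
else
    found_bar=no
fi

# "quux" does not appear in the file, so grep exits with status 1.
if grep -q "quux" "$tmpdir/foo.txt"; then
    found_quux=yes
else
    found_quux=no
fi

echo "bar: $found_bar, quux: $found_quux"
rm -rf "$tmpdir"
```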
Regular Expression Syntax
The fundamental format for regular expressions is one of the following, depending on what you are trying to do:
The first syntax is a basic search syntax. In the absence of a command prefix, such a regular expression returns the lines matching the search pattern. In some cases, the slash marks may be (or must be) omitted—in the pattern argument to the grep command, for example.
The second syntax is used for most commands. In this form, some operation occurs on lines matching the pattern. This may be a form of matching, or it may involve removing the portions of the line that match the pattern.
The third syntax is used for substitution commands. These can be thought of as a more complex form of search and replace.
For example, the following command searches for the word ‘test’ within the specified file:
# Expression: /test/
grep 'test' poem.txt
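As a rough illustration of the three forms described above, the following sketch uses sed, where all three can be seen side by side. The two-line sample text and the variable names are assumptions made for the demonstration; the p and d commands and the -n option are standard sed features.

```shell
# Hypothetical two-line input.
sample='this is a test
this is not'

# Search form: print only the lines matching /test/
# (-n suppresses sed's default output).
search_out=$(printf '%s\n' "$sample" | sed -n '/test/p')

# Command form: delete (d) the lines matching /test/.
delete_out=$(printf '%s\n' "$sample" | sed '/test/d')

# Substitution form: replace the first "test" on each line with "trial".
subst_out=$(printf '%s\n' "$sample" | sed 's/test/trial/')

printf '%s\n---\n%s\n---\n%s\n' "$search_out" "$delete_out" "$subst_out"
```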
Positional Anchors and Flags
A common way to significantly alter regular expression matching is through the use of positional anchors and flags.
Positional anchors allow you to specify the position within a line of text where an expression is allowed to match. There are two positional anchors that are regularly used: caret (^) and dollar ($). When placed at the beginning or end of an expression, these match the beginning and end of a line of text, respectively.
For example, the following matches the word “Mary”, but only when it appears at the beginning of a line:
# Expression: /^Mary/
grep "^Mary" < poem.txt
Similarly, the following matches the word "fox," but only at the end of a line:
# Expression: /fox$/
grep "fox$" < poem.txt
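To see how the anchors narrow a match, the sketch below counts matching lines with grep -c on a small sample. The three-line text is invented for the demonstration and is not the poem.txt file used elsewhere in this chapter.

```shell
# Hypothetical sample text.
poem='Mary had a little lamb
Then Mary saw a fox
The fox ran from Mary'

# Unanchored: every line contains "Mary".
anywhere=$(printf '%s\n' "$poem" | grep -c 'Mary')

# Anchored to the start of the line: only the first line qualifies.
at_start=$(printf '%s\n' "$poem" | grep -c '^Mary')

# Anchored to the end of the line: only the second line ends in "fox".
at_end=$(printf '%s\n' "$poem" | grep -c 'fox$')

echo "anywhere=$anywhere at_start=$at_start at_end=$at_end"
```

Single quotes are used around the patterns here so that the shell never tries to interpret the dollar sign.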
The other common technique for altering the matching behavior of a regular expression is through the use of flags. These flags, when placed at the end of a regular expression, can change whether a regular expression is allowed to match across multiple lines, whether the matching is case sensitive or insensitive, and various other aspects of matching.
Note: Different tools support different flags, and not all flags are supported with all tools. The grep command-line tool uses command-line flags instead of flags in the expression itself.
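As a small sketch of what the note means for grep, the following uses the -i (ignore case) and -v (invert match) command-line options instead of flags inside the expression. The sample text and variable names are assumptions made for the demonstration.

```shell
# Hypothetical sample text.
poem='Mary had a little lamb
its fleece was white as snow
and everywhere that MARY went'

# Case-sensitive by default: only the first line matches.
sensitive=$(printf '%s\n' "$poem" | grep -c 'Mary')

# -i makes the match case-insensitive: the first and third lines match.
insensitive=$(printf '%s\n' "$poem" | grep -c -i 'mary')

# -v selects the lines that do NOT match: the second and third lines.
inverted=$(printf '%s\n' "$poem" | grep -c -v 'Mary')

echo "sensitive=$sensitive insensitive=$insensitive inverted=$inverted"
```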
The most commonly used flag is the global flag. By default, only the first occurrence of a search term is matched. This is mainly of concern when performing substitutions. The global flag changes this so that a substitution alters every match in the line instead of just the first one.
# Expression: s/Mary/Joe/
sed "s/Mary/Joe/" < poem.txt
This replaces only the first occurrence of "Mary" on each line with "Joe." By adding the global flag to the expression, it instead replaces every occurrence, as shown in the following example:
# Expression: s/Mary/Joe/g
sed "s/Mary/Joe/g" < poem.txt
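The difference is easiest to see on a single line that contains the search term more than once. In this sketch, the input line and the variable names are made up for the demonstration.

```shell
# Hypothetical input containing three occurrences of "Mary".
line='Mary met Mary and Mary'

# Without the global flag, only the first occurrence is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/Mary/Joe/')

# With the global flag, every occurrence is replaced.
every_match=$(printf '%s\n' "$line" | sed 's/Mary/Joe/g')

echo "$first_only"    # Joe met Mary and Mary
echo "$every_match"   # Joe met Joe and Joe
```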
Wildcards and Repetition Operators
One of the most common ways to enhance searching through regular expressions is with the use of wildcard matching.
A wildcard is a symbol that takes the place of any other symbol. In regular expressions, a period (.) is considered a wildcard, as it matches any single character. For example:
# Expression: /wa./
grep 'wa.' poem.txt
This matches lines containing either "was" or "want" (or any other string that begins with "wa" followed by one more character) because the dot can match any single character.
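One detail worth checking for yourself is that the dot must consume a character, so "wa" alone at the end of a line does not match. The word list in this sketch is invented for the demonstration.

```shell
# Hypothetical word list, one word per line.
words='was
want
wa
wax'

# "wa." needs one more character after "wa", so the bare "wa" is skipped.
dot_matches=$(printf '%s\n' "$words" | grep 'wa.')

printf '%s\n' "$dot_matches"
```

Under these assumptions, the output lists was, want, and wax, but not wa.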
Wildcards are typically combined with repetition operators to match lines in which only a portion of the content is known. For example, you might want to search for every line containing "Mary" with the word "lamb" appearing later. You might specify the expression like this:
# Expression: /Mary.*lamb/
grep "Mary.*lamb" poem.txt
This searches for Mary followed by zero or more characters, followed by lamb.
Of course, you probably want at least one character between those to avoid matches for strings containing "Marylamb". The most common way to solve this is with the plus (+) operator. However, you can construct this expression in several ways:
# Expression (Basic): /Mary.\+lamb/
# Expression (Extended): /Mary.+lamb/
# Expression: /Mary..*lamb/
grep "Mary.\+lamb" poem.txt
grep -E "Mary.+lamb" poem.txt # extended regexp
grep "Mary..*lamb" poem.txt
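Note: The appearance of the plus operator differs depending on whether you are using basic or extended regular expressions; in basic regular expressions, it must be preceded by a backslash. The backslash form is an extension to basic regular expressions (GNU grep supports it, for example) and may not be available in every tool; the "Mary..*lamb" form works in either syntax.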
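The following sketch compares zero-or-more with one-or-more matching on a sample that includes the string "Marylamb". The sample lines and variable names are assumptions; the extended form is used for the plus operator so that the example does not depend on a particular grep implementation.

```shell
# Hypothetical sample: one ordinary line and one with nothing between the words.
lines='Mary had a little lamb
Marylamb'

# ".*" allows zero characters, so both lines match.
star=$(printf '%s\n' "$lines" | grep -c 'Mary.*lamb')

# ".+" (extended syntax) requires at least one character, so "Marylamb" is excluded.
plus=$(printf '%s\n' "$lines" | grep -E -c 'Mary.+lamb')

# "..*" expresses the same requirement using only the dot and asterisk.
dot_star=$(printf '%s\n' "$lines" | grep -c 'Mary..*lamb')

echo "star=$star plus=$plus dot_star=$dot_star"
```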
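Another useful repetition operator is the question mark, which makes the preceding item optional. For example, if you want to match both Mary and Marry, you might use an expression like this: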
# Expression (Basic): /Marr\?y/
# Expression (Extended): /Marr?y/
grep "Marr\?y" poem.txt
grep -E "Marr?y" poem.txt
The question mark causes the preceding r to be optional, and thus, this expression matches lines containing either “Mary” or “Marry.”
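To check which spellings the optional r actually admits, you can run the extended form against a short list of names. The list below is hypothetical, chosen to include a name with too many r characters.

```shell
# Hypothetical list of names, one per line.
names='Mary
Marry
Marrry
Mark'

# "r?" allows zero or one extra r, so "Marrry" (two extra) does not match.
optional=$(printf '%s\n' "$names" | grep -E 'Marr?y')

printf '%s\n' "$optional"
```

With this input, only Mary and Marry are printed.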
In summary, the basic wildcard and repetition operators are:
period (.)—wildcard; matches a single character.
question mark (\? or ?)—matches 0 or 1 of the previous character, grouping, or wildcard. (This operator differs depending on whether you are using basic or extended regular expressions.)
asterisk (*)—matches zero or more of the previous character, grouping, or wildcard.
plus (\+ or +)—matches one or more of the previous character, grouping, or wildcard. (This operator differs depending on whether you are using basic or extended regular expressions.)
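The summary mentions that these operators can apply to a grouping as well as to a single character. As a brief sketch of the difference, the following uses parentheses, which form a grouping in extended regular expressions; the word list and variable names are assumptions made for the demonstration.

```shell
# Hypothetical word list, one word per line.
words='la
lalala
laa
lab'

# The plus applies to the whole group "la": matches la and lalala.
grouped=$(printf '%s\n' "$words" | grep -E '^(la)+$')

# Without the group, the plus applies only to the "a": matches la and laa.
single=$(printf '%s\n' "$words" | grep -E '^la+$')

printf 'grouped:\n%s\nsingle:\n%s\n' "$grouped" "$single"
```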
Character Classes and Groups
Searching for certain keywords can be useful, but it is often not enough. You may also need to search for the presence or absence of particular characters at a given position in a search string.
For example, assume that you require the words Mary and lamb to be within the same sentence. To do this, you need to only allow certain characters to appear between the two words. This can be achieved through the use of character classes.
There are two basic types of character classes: predefined character classes and custom, or user-defined character classes. These are described in the following sections.
Predefined Character Classes
Most regular expression languages support some form of predefined character classes. When used between brackets, these define commonly used sets of characters. The most broadly supported set of predefined character classes is the set of POSIX character classes:
[:alnum:]—all alphanumeric characters (a-z, A-Z, and 0-9).
[:alpha:]—all alphabetic characters (a-z, A-Z).
[:blank:]—all whitespace within a line (spaces or tabs).
[:cntrl:]—all control characters (ASCII 0-31 and 127).
[:digit:]—all decimal digits (0-9).
[:graph:]—all alphanumeric or punctuation characters.
[:lower:]—all lowercase letters (a-z).
[:print:]—all printable characters (opposite of [:cntrl:], same as [:graph:] plus the space character).
[:punct:]—all punctuation characters.
[:space:]—all whitespace characters (space, tab, newline, carriage return, form feed, and vertical tab). (See note below about compatibility.)
[:upper:]—all uppercase letters.
[:xdigit:]—all hexadecimal digits (0-9, a-f, A-F).
For example, the following is another way to match any sentence containing Mary and lamb (but not if there are punctuation marks between them):
# Expression: /Mary[[:alpha:][:digit:][:blank:]][[:alpha:][:digit:][:blank:]]*lamb/
grep 'Mary[[:alpha:][:digit:][:blank:]][[:alpha:][:digit:][:blank:]]*lamb' poem.txt
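To see the effect of restricting the characters between the two words, the sketch below runs the same expression against a sample in which one line has a period between "Mary" and "lamb". The sample text and variable names are invented for the demonstration.

```shell
# Hypothetical sample text.
text='Mary had a little lamb
Mary left. The lamb stayed
Mary and 2 lambs'

# ".*" accepts anything between the words, so all three lines match.
any_chars=$(printf '%s\n' "$text" | grep -c 'Mary.*lamb')

# Only letters, digits, spaces, and tabs are allowed between the words,
# so the line containing a period is rejected.
same_sentence=$(printf '%s\n' "$text" |
    grep 'Mary[[:alpha:][:digit:][:blank:]][[:alpha:][:digit:][:blank:]]*lamb')

echo "any_chars=$any_chars"
printf '%s\n' "$same_sentence"
```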
Custom Character Classes
In addition to the predefined character classes, regular expression languages also allow custom, user-defined character classes. These custom character classes just look like a list of characters surrounded by square brackets.
For example, if you only want to allow spaces and letters, you might create a character class like this one:
# Expression: /Mary[a-z A-Z]*lamb/
grep "Mary[a-z A-Z]*lamb" poem.txt
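As a quick check of what this class admits, the following sketch applies the expression to lines that contain a digit or a comma between the two words. The sample text is an assumption, and LC_ALL=C is set here only so that the a-z and A-Z ranges behave the same way regardless of the current locale.

```shell
# Hypothetical sample text.
text='Mary had a little lamb
Mary had 2 lambs
Mary, the lamb'

# Only letters and spaces may appear between "Mary" and "lamb",
# so the lines containing "2" or "," are rejected.
letters_only=$(printf '%s\n' "$text" | LC_ALL=C grep 'Mary[a-z A-Z]*lamb')

printf '%s\n' "$letters_only"
```

With this input, only the first line is printed.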