Regular Expressions are well known for their power and brevity in validating textual patterns. Less known is their ability to extract substrings surrounded by known patterns of text through a construct known as round bracket groupings. The text matching the sub-expression within a pair of round brackets is captured and is available as a backreference within the regular expression itself or an indexed variable outside. For example, the PHP statement
preg_match('/Name: (.+), Age: (\d+)/', $text, $matches);
would return 1 on finding a substring that matches the specified pattern and stores the matched name, ie; the first captured group, in $matches[1]
and matched age, ie; the second captured group, in $matches[2]
. $match[0]
stores the full matched text. Other languages that support regular expressions, and the list of such languages is pretty long, have similar conventions.
Counting the capturing groups to get the index of the captured text works okay with short regualr expressions that don't change often. However, counting the position becomes tedious and error prone when the number is large and new groups may get introduced or existing ones removed as the code evolves.
If you just rely on the documentation accompanying your programming language, such as this regex syntax for PHP, or this Javadoc page for Java, then you are not likely to find a better solution to this problem. At least this is what happened to me, for I wrote code that had the magic indexes all over till I started readingJeffrey E.F. Friedl's excellent Mastering Regular Expression and came across PHP's support for named captures, a mechanism to associate symbolic names to captured groups.
What it essentially means is that I could rewrite the previous statement as
preg_match('/Name: (?P<Name>.+), Age: (?P<Age>\d+)/', $text, $matches);
and access the matched name and age as $matches['Name']
and $matches['Age']
and need not worry about introducing (or dropping) groups. It not only improves the readability but also makes the code more robust.
At this point one could argue that in this particular case the book was just incidental, for the information on named captures was already available on the Web, as my link shows, and I should just have googled it. Unfortunately, you need to know a little bit about something to search for more. Google and the Web are no good if you don't know what you don't know. This is exactly where I think the book Mastering Regular Expressions really shines. You need to go through this to realize what you didn't know and what you should look for. And be assured that there are enough aspects of regualr expressions and their implementations in various languages that you may not know to justify the cost of the book. By the way, named captures are not the only thing that I learned from this book. Other things I learnt inlcude 'x' modifiers, conditionals within regular expressions, lookaheads and lookbehinds, and many others. No wonder this book is selling almost as well as Programming Perl, 3rd Edition, the all time programming best seller from O'Reilly.
At this point I should add that named captures may not yet be widely available in all languages. In fact, as per the book, Perl doesn't have it, though my research for this post led me to this page and eventually to this page stating that Perl 5.10 has named captures. In fact, the support in Perl 5.10 are much more powerful and makes available not only the last match, as we saw in PHP, but all the matches in an array. Java and JavaScript programmers may have to wait longer for named captures, though!