Aug 16, 2008

regular expression notes


With the question mark, I have introduced the first metacharacter that is greedy. The question mark gives the regex engine two choices: try to match the part the question mark applies to, or do not try to match it. The engine will always try to match that part. Only if this causes the entire regular expression to fail, will the engine try ignoring the part the question mark applies to.

The effect is that if you apply the regex Feb 23(rd)? to the string Today is Feb 23rd, 2003, the match will always be Feb 23rd and not Feb 23. You can make the question mark lazy (i.e. turn off the greediness) by putting a second question mark after the first. * greedy, *? lazy, ? greedy, ?? lazy, + greedy, +? lazy.
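As a quick check of the greedy and lazy question mark, here is a minimal sketch using Python's re module (my choice for illustration; the variable names are made up). It applies both forms to the example string above.

```python
import re

text = "Today is Feb 23rd, 2003"

# Greedy ?: the engine first tries to include the optional "rd".
greedy = re.search(r"Feb 23(rd)?", text).group(0)

# Lazy ??: the engine first tries to skip the optional "rd".
lazy = re.search(r"Feb 23(rd)??", text).group(0)

print(greedy)  # Feb 23rd
print(lazy)    # Feb 23
```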

[\D\S] is not the same as [^\d\s]

[\D\S] is any character that is either (not a digit) or (not whitespace). Since no character is both a digit and whitespace, this class matches every character.

[^\d\s] is any character that is not (a digit or whitespace) ==> (not a digit) and (not whitespace)
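The difference is easy to see on a small sample. This is a sketch in Python's re module; the sample string is my own.

```python
import re

sample = "a1 \n"  # a letter, a digit, a space, a newline

# (not a digit) OR (not whitespace): every character qualifies.
either = re.findall(r"[\D\S]", sample)

# (not a digit) AND (not whitespace): only the letter qualifies.
neither = re.findall(r"[^\d\s]", sample)

print(either)   # ['a', '1', ' ', '\n']
print(neither)  # ['a']
```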

single-line mode and "."

The dot matches a single character, without caring what the character is. The only exceptions are newline characters. In all regex flavors discussed, "." is [^\n] (UNIX regex flavors) or [^\r\n] (Windows regex flavors).

This exception exists mostly for historic reasons. The first tools that used regular expressions were line-based. They would read a file line by line, and apply the regular expression separately to each line. The effect is that with these tools, the string could never contain newlines, so the dot could never match them. Modern tools and languages can apply regular expressions to very large strings or even entire files.

In Perl, the mode where the dot also matches newlines is called "single-line mode". This mode has nothing to do with multi-line mode. Multi-line mode only affects anchors, and single-line mode only affects the dot. You can activate single-line mode by adding an s after the regex code, like m/^regex$/s

When using the .NET Regex class, you activate this mode by specifying Regex.Match("string", "regex", RegexOptions.Singleline). In .NET, when single-line mode is off, the dot matches every character except \n, so in a \r\n line break it still matches the \r. When single-line mode is on, the dot matches \n as well.

JavaScript and VBScript do not have an option to make the dot match line break characters. In these languages, you can use a character class such as [\s\S] to match any character. This character class matches a character that is either a whitespace character (including line break characters), or a character that is not a whitespace character. Since all characters are either whitespace or non-whitespace, this character class matches any character.
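For comparison, a small sketch in Python, where the equivalent of single-line mode is the re.DOTALL flag. The test string is my own; the [\s\S] workaround behaves the same way without any flag.

```python
import re

text = "first\nsecond"

# By default the dot refuses to match the newline.
default = re.search(r"first.second", text)

# re.DOTALL is Python's name for "single-line mode".
dotall = re.search(r"first.second", text, re.DOTALL)

# The [\s\S] class matches any character without needing a flag.
workaround = re.search(r"first[\s\S]second", text)

print(default)                 # None
print(dotall is not None)      # True
print(workaround is not None)  # True
```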

Anchor and multi-line mode

Anchors are not characters, so they do not match any character or whitespace, including \n. They match a position before, after, or between characters. They can be used to "anchor" the regex match at a certain position. The caret ^ matches the position before the first character in the string. $ matches right after the last character in the string. By default they do not match at line breaks inside the string, because multi-line mode is off by default. To make ^ and $ also match at the start and end of each line, you need to turn it on. In .NET you use Regex.Match("string", "regex", RegexOptions.Multiline).
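The effect of multi-line mode can be sketched in Python, where the flag is called re.MULTILINE. The three-line string is a made-up example.

```python
import re

text = "one\ntwo\nthree"

# Without the flag, ^ and $ only anchor to the whole string,
# and \w+ cannot cross the embedded newlines, so nothing matches.
whole_string = re.findall(r"^\w+$", text)

# With re.MULTILINE, ^ and $ also match around each line break.
per_line = re.findall(r"^\w+$", text, re.MULTILINE)

print(whole_string)  # []
print(per_line)      # ['one', 'two', 'three']
```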

\A only ever matches at the start of the string. Likewise, \Z only ever matches at the end of the string. These two tokens never match at line breaks. This is true in all regex flavors discussed in this tutorial, even when you turn on "multiline mode".

Strings Ending with a Line Break

Even though \Z and $ only match at the end of the string (when the option for the caret and dollar to match at embedded line breaks is off), there is one exception. If the string ends with a line break, then \Z and $ will match at the position before that line break, rather than at the very end of the string. This "enhancement" was introduced by Perl, and is copied by many regex flavors, including Java, .NET and PCRE. In Perl, when reading a line from a file, the resulting string will end with a line break. Reading a line from a file with the text "joe" results in the string joe\n. When applied to this string, both ^[a-z]+$ and \A[a-z]+\Z will match joe.

If you only want a match at the absolute very end of the string, use \z (lower case z instead of upper case Z). \A[a-z]+\z does not match joe\n. \z matches after the line break, which is not matched by the character class.
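One caveat if you try this in Python: as far as I know, Python's $ follows the Perl behavior described above, but Python's \Z only matches at the very end of the string, so it acts like the \z of the other flavors. A minimal sketch, with joe\n as in the text:

```python
import re

line = "joe\n"

# $ matches just before the final line break, as in Perl.
dollar = re.search(r"^[a-z]+$", line)

# In Python, \Z means "absolute end of string" (like \z elsewhere),
# so the trailing \n makes this fail.
strict = re.search(r"\A[a-z]+\Z", line)

print(dollar.group(0))  # joe
print(strict)           # None
```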

Capturing Grouping, Non-Capturing grouping, Backreference

(ab) is a group; \1 (inside the regex, and in the replacement text of some flavors) or $1 (in .NET replacement text) is a backreference to it. Grouping uses () to group part of a regular expression together. The original purpose of grouping is to apply ?, +, * or {n,m} to the group.

A side effect of grouping is that each group creates a backreference, which holds the text matched by that part of the regular expression. A backreference can be used for find-and-replace. One example is checking for doubled words. When editing text, doubled words such as "the the" easily creep in. Using the regex \b(\w+)\s+\1\b in your text editor, you can easily find them. To delete the second word, simply type in $1 as the replacement text and click the Replace button.
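The same doubled-word cleanup can be sketched in Python. Note that Python's replacement text uses \1 rather than $1; the sample sentence is my own.

```python
import re

text = "Paris in the the spring"

# \1 inside the pattern must repeat whatever (\w+) captured;
# \1 in the replacement keeps only the first copy of the word.
fixed = re.sub(r"\b(\w+)\s+\1\b", r"\1", text)

print(fixed)  # Paris in the spring
```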

Another use is inside the regular expression itself. For example, in <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>, the \1 is a backreference that matches the same text the first group matched, so the closing tag must have the same name as the opening tag.
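Here is a small sketch of a backreference used inside the pattern, assuming the tag-matching regex ends with a closing tag written as </\1>. The HTML snippets are invented for the test.

```python
import re

# Group 1 captures the tag name; </\1> requires the same name again.
tag = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>")

pairs = tag.findall("<B>bold</B> and <I>italic</I>")
mismatch = tag.search("<B>oops</I>")

print(pairs)     # [('B', 'bold'), ('I', 'italic')]
print(mismatch)  # None
```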

Grouping is capturing by default, which slows down matching. If you need grouping but not the backreference, for example to match Set or SetValue, the initial regex would be Set(Value)?. This creates a capture that you are never going to use, so you can use a non-capturing group instead, Set(?:Value)?, to speed things up.
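A quick way to see the difference is to look at what each version stores. This sketch uses Python and only checks the captures; it does not measure speed.

```python
import re

capturing = re.search(r"Set(Value)?", "SetValue")
non_capturing = re.search(r"Set(?:Value)?", "SetValue")

# Both match the same text...
print(capturing.group(0), non_capturing.group(0))  # SetValue SetValue

# ...but only the capturing version stores a group.
print(capturing.groups())      # ('Value',)
print(non_capturing.groups())  # ()
```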

Named groups

Normally, you use \number (e.g. \1, \2) to backreference a group capture. Some regex engines also support named backreferences. In .NET you can use (?<group_name>regex) or (?'group_name'regex) to define a named group, and \k<group_name> or \k'group_name' to backreference its capture.
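Python spells named groups differently from .NET: (?P<name>regex) defines the group and (?P=name) refers back to it. A minimal sketch, with a group name and sentence of my own:

```python
import re

# "word" is an arbitrary group name chosen for this example.
m = re.search(r"\b(?P<word>\w+)\s+(?P=word)\b", "it was the the best")

print(m.group(0))       # the the
print(m.group("word"))  # the
```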

Multiple Groups with The Same Name

The .NET framework allows multiple groups in the regular expression to have the same name. If you do so, both groups will store their matches in the same Group object. You won't be able to distinguish which group captured the text. This can be useful in regular expressions with multiple alternatives to match the same thing. E.g. if you want to match "a" followed by a digit 0..5, or "b" followed by a digit 4..7, and you only care about the digit, you could use the regex a(?'digit'[0-5])|b(?'digit'[4-7]). The group named "digit" will then give you the digit 0..7 that was matched, regardless of the letter. Python and PCRE do not allow multiple groups to use the same name. Doing so will give a regex compilation error.
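To see the Python side of this, the sketch below tries to compile the duplicate-name regex and then shows one possible workaround with two distinct names. The names digit_a and digit_b and the helper matched_digit are hypothetical, not part of any library.

```python
import re

# Python rejects two groups with the same name at compile time.
try:
    re.compile(r"a(?P<digit>[0-5])|b(?P<digit>[4-7])")
    duplicate_allowed = True
except re.error:
    duplicate_allowed = False

# Workaround sketch: one name per alternative, then take whichever matched.
pattern = re.compile(r"a(?P<digit_a>[0-5])|b(?P<digit_b>[4-7])")

def matched_digit(text):
    """Return the digit captured by either alternative, or None."""
    m = pattern.search(text)
    if m is None:
        return None
    return m.group("digit_a") or m.group("digit_b")

print(duplicate_allowed)    # False
print(matched_digit("a3"))  # 3
print(matched_digit("b6"))  # 6
```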

Names and Numbers for Capturing Groups

Here is where things get a bit ugly. Python and PCRE treat named capturing groups just like unnamed capturing groups, and number both kinds from left to right, starting with one. The regex (a)(?P<x>b)(c)(?P<y>d) matches abcd as expected. If you do a search-and-replace with this regex and the replacement \1\2\3\4, you will get abcd. All four groups were numbered from left to right, from one till four. Easy and logical.
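The Python numbering can be checked directly. In this sketch the group names x and y are just illustrative; the replacement reverses the groups so the numbering is visible in the result.

```python
import re

pattern = re.compile(r"(a)(?P<x>b)(c)(?P<y>d)")
m = pattern.fullmatch("abcd")

# Named groups share the left-to-right numbering with unnamed ones.
numbers = dict(pattern.groupindex)

# Reversing the four numbered groups shows each number's content.
reversed_text = pattern.sub(r"\4\3\2\1", "abcd")

print(numbers)                    # {'x': 2, 'y': 4}
print(m.group(2), m.group("x"))   # b b
print(reversed_text)              # dcba
```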

Things are quite a bit more complicated with the .NET framework. The regex (a)(?<x>b)(c)(?<y>d) again matches abcd. However, if you do a search-and-replace with $1$2$3$4 as the replacement, you will get acbd. Probably not what you expected.

The .NET framework does number named capturing groups from left to right, but numbers them after all the unnamed groups have been numbered. So the unnamed groups (a) and (c) get numbered first, from left to right, starting at one. Then the named groups (?<x>b) and (?<y>d) get their numbers, continuing from the unnamed groups, in this case: three.

To make things simple, when using .NET's regex support, just assume that named groups do not get numbered at all, and reference them by name exclusively. To keep things compatible across regex flavors, I strongly recommend that you do not mix named and unnamed capturing groups at all. Either give a group a name, or make it non-capturing as in (?:nocapture). Non-capturing groups are more efficient, since the regex engine does not need to keep track of their matches.

Repetition and Backreferences

The regex engine does not permanently substitute backreferences in the regular expression. It will use the last match saved into the backreference each time it needs to be used. If a new match is found by capturing parentheses, the previously saved match is overwritten. There is a clear difference between ([abc]+) and ([abc])+. Though both successfully match cab, the first regex will put cab into the first backreference, while the second regex will only store b. That is because in the second regex, the plus caused the pair of parentheses to repeat three times. The first time, c was stored. The second time a and the third time b. Each time, the previous value was overwritten, so b remains.

This also means that ([abc]+)=\1 will match cab=cab, and that ([abc])+=\1 will not. The reason is that when the engine arrives at \1, it holds b which fails to match c. Obvious when you look at a simple example like this one, but a common cause of difficulty with regular expressions nonetheless. When using backreferences, always double check that you are really capturing what you want.
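The overwriting behavior is easy to reproduce. A minimal Python sketch using the same regexes and the string cab:

```python
import re

# The quantifier inside the group: one capture holding everything.
inside = re.match(r"([abc]+)", "cab").group(1)

# The quantifier outside the group: the capture is overwritten
# on each repetition, so only the last character survives.
outside = re.match(r"([abc])+", "cab").group(1)

print(inside)   # cab
print(outside)  # b

# Consequence for backreferences:
print(re.fullmatch(r"([abc]+)=\1", "cab=cab") is not None)  # True
print(re.fullmatch(r"([abc])+=\1", "cab=cab") is not None)  # False
print(re.fullmatch(r"([abc])+=\1", "cab=b") is not None)    # True
```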

Parentheses and Backreferences Cannot Be Used Inside Character Classes

Round brackets cannot be used inside character classes, at least not as metacharacters. When you put a round bracket in a character class, it is treated as a literal character. So the regex [(a)b] matches a, b, ( and ).

Backreferences also cannot be used inside a character class. The \1 in regex like (a)[\1b] will be interpreted as an octal escape in most regex flavors. So this regex will match an a followed by either \x01 or a b.
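Both points can be checked in Python, which (as far as I can tell from its documentation) treats numeric escapes inside a character class as characters. The test strings are my own.

```python
import re

# Parentheses inside a class are plain literals.
literals = re.findall(r"[(a)b]", "(a)b c")
print(literals)  # ['(', 'a', ')', 'b']

# Inside a class, \1 is the octal escape for character code 1,
# not a backreference to group 1.
in_class = re.compile(r"(a)[\1b]")
print(in_class.fullmatch("a\x01") is not None)  # True
print(in_class.fullmatch("ab") is not None)     # True
print(in_class.fullmatch("aa") is not None)     # False
```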

Lookaround is lookahead and lookbehind

Lookahead "?"(Positive "=" and Negative assertion "!")

Lookarounds are "zero-width assertions". They are like the ^ and $ anchors. The difference is that lookarounds actually match characters, but then give up the match and only return the result: match or no match. The assertion is written with (), but the expression inside the () is only tested, unlike a group, where the expression inside the () contributes to the match result. For example, q(?!u) matches a q not followed by a u (negative lookahead), and q(?=u) matches a q followed by a u (positive lookahead).

You can use any valid regular expression inside the lookahead. (Note that this is not the case with lookbehind. I will explain why below.) If it contains capturing parentheses, the backreferences (captures) will be saved. Note that the lookahead itself does not create a backreference (capture), so it is not included in the count towards numbering the backreferences. If you want to store the match of the regex inside a backreference, you have to put capturing parentheses around the regex inside the lookahead, like this: (?=(regex)). The other way around, ((?=regex)), will not work, because the lookahead will already have discarded the regex match by the time the backreference is to be saved.
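A short Python sketch of lookahead, including the (?=(regex)) trick for keeping what the lookahead saw. The words Iraq and quit are just sample inputs.

```python
import re

# Negative lookahead: q not followed by u.
print(re.search(r"q(?!u)", "Iraq") is not None)  # True
print(re.search(r"q(?!u)", "quit") is not None)  # False

# Positive lookahead is zero-width: the u is checked, not consumed.
peek = re.search(r"q(?=u)", "quit")
print(peek.group(0))  # q

# Capturing inside the lookahead keeps the text it looked at.
kept = re.search(r"q(?=(u\w))", "quit")
print(kept.group(0), kept.group(1))  # q ui
```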

Lookbehind "?<" (Positive "=" and Negative "!" assertion)

Lookbehind has the same effect, but works backwards. It tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there. (?

The construct for positive lookbehind is (?<=text): a pair of round brackets, with the opening bracket followed by a question mark, "less than" symbol and an equals sign. Negative lookbehind is written as (?<!text), using an exclamation point instead of an equals sign.

The good news is that you can use lookbehind anywhere in the regex, not only at the start. If you want to find a word not ending with an "s", you could use \b\w+(?<!s)\b.
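To try lookbehind out, here is a minimal Python sketch. The word list and the short cab/bed checks are my own examples.

```python
import re

# Words that do not end in "s": the lookbehind sits at the end.
no_s = re.findall(r"\b\w+(?<!s)\b", "cats dog runs bird")
print(no_s)  # ['dog', 'bird']

# Positive lookbehind: a b that is preceded by an a.
print(re.search(r"(?<=a)b", "cab").span())       # (2, 3)
print(re.search(r"(?<=a)b", "bed") is not None)  # False

# Negative lookbehind: a b that is not preceded by an a.
print(re.search(r"(?<!a)b", "cab") is not None)  # False
print(re.search(r"(?<!a)b", "bed").span())       # (0, 1)
```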

The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. Therefore, the regular expression engine needs to be able to figure out how many steps to step back before checking the lookbehind.

Therefore, many regex flavors, including those used by Perl and Python, only allow fixed-length strings. You can use any regex of which the length of the match can be predetermined. This means you can use literal text and character classes. You cannot use repetition or optional items. You can use alternation, but only if all options in the alternation have the same length.
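Python's re module enforces the fixed-length rule at compile time, which makes it easy to probe. The helper compiles below is a hypothetical convenience function written for this sketch.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r"(?<=ab|cd)e"))  # True: both alternatives have length 2
print(compiles(r"(?<=ab|c)d"))   # False: alternatives differ in length
print(compiles(r"(?<=a+)b"))     # False: repetition
print(compiles(r"(?<=ab?)c"))    # False: optional item
```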

The only regex engines that allow you to use a full regular expression inside lookbehind are the JGsoft engine and the .NET framework RegEx classes.

Lookaround is a bit confusing. The confusing part is that the lookaround is zero-width. So if you have a regex in which a lookahead is followed by another piece of regex, or a lookbehind is preceded by another piece of regex, then the regex will traverse part of the string twice.

For example, we want to find a word that is six letters long and contains the three consecutive letters cat. For condition 1, the regex is \b\w{6}\b; for condition 2, it is \b\w*cat\w*\b. We can combine them as (?=\b\w{6}\b)\b\w*cat\w*\b. This works because the lookahead is applied first, and after it matches successfully it is zero-width, meaning the current position in the string is still at the start of a six-letter word. Then the second part of the regex is applied to the same text again. To optimize it, we can use \b(?=\w{6}\b)\w{0,3}cat\w*.
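Running the optimized regex over a few words shows the two conditions working together. This is a Python sketch; the word list is made up to include words that fail one condition or the other.

```python
import re

words = "bobcat catnip cat category scatter vacate"

# The lookahead checks "exactly six word characters" without consuming
# them; the rest of the regex then scans the same six letters for "cat".
six_with_cat = re.findall(r"\b(?=\w{6}\b)\w{0,3}cat\w*", words)

print(six_with_cat)  # ['bobcat', 'catnip', 'vacate']
```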

Conditional: If-then-else conditions in regular expressions

(?(?=regex)then|else) means: if the lookahead regex matches, use the "then" regex to match; otherwise use the "else" regex.
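Python's re module does not accept a lookahead as the condition, so the sketch below uses the related form it does support, (?(1)then|else), where the condition is whether group 1 took part in the match. This is a different condition type from the one described above, shown only to illustrate the if-then-else idea; the e-mail pattern follows the example in the Python documentation.

```python
import re

# If the optional "<" (group 1) matched, require ">"; else require the end.
address = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

print(address.match("<user@host.com>") is not None)  # True
print(address.match("user@host.com") is not None)    # True
print(address.match("<user@host.com") is not None)   # False
print(address.match("user@host.com>") is not None)   # False
```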

\G(End of The Previous Match vs. Start of The Match Attempt)

I don't understand.

Mode modifier

RegexOptions option | (?mode) | Description
.Singleline | s | Causes dot to match any character
.Multiline | m | Expands where ^ and $ can match
.IgnorePatternWhitespace | x | Sets free-spacing and comment mode
.IgnoreCase | i | Turns on case-insensitive matching
.ExplicitCapture | n | Turns capturing off for (···), so only (?<name>···) capture
.ECMAScript | (none) | Restricts \w, \s, and \d to match ASCII characters only, and more
.RightToLeft | (none) | The transmission applies the regex normally, but in the opposite direction (starting at the end of the string and moving toward the start). Unfortunately, buggy.
.Compiled | (none) | Spends extra time up front optimizing the regex so it matches more quickly when applied

Instead of setting RegexOptions in .NET, you can specify these options inside the regular expression itself. For example, (?i)v(?-i)b matches vb and Vb, but not vB or VB.

Mode-modified span: (?modifier:···), such as (?i:···)

The example from the previous section can be made even simpler for systems that support a mode-modified span. Using a syntax like (?i:···), a mode-modified span turns on the mode only for what's matched within the parentheses. Using this, the (?:(?i)very) example (where ?: only makes the group non-capturing) is simplified to (?i:very).

When supported, this form generally works for all mode-modifier letters the system supports. Tcl is an example that supports the (?i) form but not the mode-modified span (?i:···) form. The same was true of Python until version 3.6, which added the span form.
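For Python 3.6 and later, the span form can be tried directly. Python has no stand-alone (?-i) to switch a mode off mid-pattern, so a pattern like (?i)v(?-i)b would be written there as a span. A minimal sketch:

```python
import re

# Case-insensitivity applies only to the v inside the span.
span = re.compile(r"(?i:v)b")
span_results = [bool(span.fullmatch(s)) for s in ["vb", "Vb", "vB", "VB"]]

# A leading (?i) applies to the whole pattern instead.
whole = re.compile(r"(?i)vb")
whole_results = [bool(whole.fullmatch(s)) for s in ["vb", "Vb", "vB", "VB"]]

print(span_results)   # [True, True, False, False]
print(whole_results)  # [True, True, True, True]
```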
