Regular Expression Tools & Reference

Post by BrienMalone » Thu Apr 20, 2006 2:16 am

Regular Expression Tools

Regular-Expressions.org
Regular expressions come in a seemingly endless number of flavors. The reference that those in the know at TextAloud point to most often is http://regular-expressions.org/reference.html.

Regular Expression Calculator
If you want to try out your regular expressions to see how they will work, visit Mark Sweeting's Regular Expression Calculator page:
http://www.sweeting.org/mark/html/revalid.php

RegEx Buddy
If you prefer a stand-alone executable to the transitory nature of web apps, try RegexBuddy, a robust regular expression parsing tool. (Free evaluation; the full version costs US$30.)
http://www.regexbuddy.com/

RegEx-Coach
My personal favorite is the free (donation requested) tool called RegEx-Coach. (I'm not completely sure that this tool speaks the same 'flavor' of regular expression used with TextAloud.)
Home: http://www.weitz.de/regex-coach/
Download: http://weitz.de/files/regex-coach.exe

---------------------------------------------------------------------
TextAloud Regular Expression Notes

How Regular Expressions are Applied in TextAloud

Order of Execution
All plain entries (those that are neither regular expressions nor masks) are applied first, in the order the words appear in the left panel (alphabetical order). Then the regular expressions and masks are applied, in the same order. If you set up two regular expressions that happen to match the same string, you can't control which one will be applied: the {{re= strings in the Word field are sorted alphabetically, and the first one in the list is the one that will be used.
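
The ordering described above can be sketched in a few lines of Python. This is only an illustration: the helper names is_pattern and execution_order are hypothetical, and treating "alphabetical" as a case-insensitive sort is an assumption, not something TextAloud documents.

```python
def is_pattern(word):
    # Regular expressions and masks are recognized by their {{re= / {{mask= prefix.
    return word.startswith(("{{re=", "{{mask="))

def execution_order(words):
    """Return the entries in the order they would be applied (hypothetical helper)."""
    # Assumption: "alphabetical" means a case-insensitive sort.
    plain = sorted((w for w in words if not is_pattern(w)), key=str.casefold)
    patterns = sorted((w for w in words if is_pattern(w)), key=str.casefold)
    # Plain entries run first, then regular expressions and masks.
    return plain + patterns

entries = ["{{re=\\b19(\\d\\d)(\\b)}}", "zebra", "{{mask=_19(##)(_)}}", "apple"]
order = execution_order(entries)
print(order)
```

If two patterns can match the same text, the one that sorts first gets there first, which is why the order cannot be controlled other than through the text of the pattern itself.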

Ignoring Word Boundaries
The default behavior is to perform substitutions only on word boundaries. You should be able to change the pronunciation of a character like "-" (hyphen symbol), but if there is text immediately before / after the hyphen you need to use a special substring character. An "&" symbol on either side of the word you're defining means you don't care about word boundaries on that side of the word.

So if you want to match the hyphen character in "First-Class", use the & symbol on both sides of the hyphen ... instead of defining "-" in the pronunciation editor, define the word "&-&".

None of this applies when regular expressions or masks are used. You can force regular expression matches to look for word boundaries using the \b metacharacter in the expression.
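
To make the "&" rule concrete, here is a minimal Python sketch of how a plain Word entry might be turned into a search pattern. The helper entry_to_pattern is hypothetical, and it assumes that a "word boundary" for plain entries means whitespace or the edge of the text; TextAloud's real matching logic may differ.

```python
import re

def entry_to_pattern(word):
    """Translate a plain Word entry into a regex (hypothetical helper)."""
    left_open = word.startswith("&")
    right_open = word.endswith("&") and len(word) > 1
    core = word[1:] if left_open else word
    core = core[:-1] if right_open else core
    # Assumption: a boundary is whitespace or the start/end of the text.
    left = "" if left_open else r"(?<!\S)"
    right = "" if right_open else r"(?!\S)"
    return left + re.escape(core) + right

bounded = entry_to_pattern("-")      # only matches a free-standing hyphen
unbounded = entry_to_pattern("&-&")  # matches a hyphen anywhere

in_word = re.search(bounded, "First-Class")
free_standing = re.search(bounded, "First - Class")
replaced = re.sub(unbounded, " dash ", "First-Class")
print(replaced)
```

With the "-" entry the hyphen inside "First-Class" is left alone, while "&-&" replaces it.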


---------------------------------------------------------------------
Regular Expression Reference

The following information was copied from http://regular-expressions.org/reference.html
Copyright © 2003-2006 Jan Goyvaerts. All rights reserved.

The regular expression reference below is split into two sections, Basic Syntax and Advanced Syntax, which are further divided into subsections.

Basic Syntax
    Characters
    Character Classes or Character Sets [abc]
    Dot
    Anchors
    Word Boundaries
    Alternation
    Quantifiers

Advanced Syntax
    Grouping and Backreferences
    Modifiers
    Atomic Grouping and Possessive Quantifiers
    Lookaround
    Continuing from the Previous Match
    Conditionals
    Comments

The reference below follows this format:
    Expression
    Expression Description
    Code: Select all
    <Example Expression Usage> Matches <Example Matching Characters>

PLEASE NOTE: Not all expressions have examples.


---------------------------------------------------------------------
Regular Expression Basic Syntax Reference

Characters

Any character except [\^$.|?*+()
All characters except the listed special characters match a single instance of themselves.
Code: Select all
a matches a


\ (backslash) followed by any of [\^$.|?*+()
A backslash escapes special characters to suppress their special meaning.
Code: Select all
\+ matches +


\xFF where FF are 2 hexadecimal digits
Matches the character with the specified ASCII/ANSI value, which depends on the code page used. Can be used in character classes.
Code: Select all
\xA9 matches © when using the Latin-1 code page.


\n, \r and \t
Match an LF character, CR character and a tab character respectively. Can be used in character classes.
Code: Select all
\r\n matches a DOS/Windows CRLF line break.
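
The entries above can be tried out with Python's re module. Python is used here only as a convenient test bench; its regex flavor is not necessarily the one TextAloud uses.

```python
import re

# Unescaped, + is a quantifier; escaped, it matches a literal plus sign.
literal_plus = re.search(r"\+", "1+1=2")
print(literal_plus.group())

# re.escape() adds the backslashes when the text should be taken literally.
escaped = re.escape("1+1=2")
whole = re.fullmatch(escaped, "1+1=2")

# \xA9 is the copyright sign; a Python str treats it as code point U+00A9.
copyright_sign = re.fullmatch(r"\xA9", "\u00a9")

# \r\n matches a DOS/Windows line break.
crlf = re.search(r"\r\n", "line one\r\nline two")
print(crlf.span())
```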


Character Classes or Character Sets [abc]

[ (opening square bracket)
Starts a character class. A character class matches a single character out of all the possibilities offered by the character class. Inside a character class, different rules apply. The rules in this section are only valid inside character classes. The rules outside this section are not valid in character classes, except \n, \r, \t and \xFF.


Any character except ^-]\
All characters except the listed special characters add that character to the possible matches for the character class.
Code: Select all
[abc] matches a, b or c


\ (backslash) followed by any of ^-]\
A backslash escapes special characters to suppress their special meaning.
Code: Select all
[\^\]] matches ^ or ]


- (hyphen) except immediately after the opening [
Specifies a range of characters. (Specifies a hyphen if placed immediately after the opening [)
Code: Select all
[a-zA-Z0-9] matches any letter or digit


^ (caret) immediately after the opening [
Negates the character class, causing it to match a single character not listed in the character class. (Specifies a caret if placed anywhere except after the opening [)
Code: Select all
[^a-d] matches x (any character except a, b, c or d)


\d, \w and \s
Shorthand character classes matching digits 0-9, word characters (letters, digits and the underscore) and whitespace respectively. Can be used inside and outside character classes.
Code: Select all
[\d\s] matches a character that is a digit or whitespace


\D, \W and \S
Negated versions of the above. Should be used only outside character classes. (Can be used inside, but that is confusing.)
Code: Select all
\D matches a character that is not a digit
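
A quick way to check the character class rules is to list every single-character match with Python's re.findall. This is a sketch in Python's flavor, which may not be identical to TextAloud's.

```python
import re

any_of = re.findall(r"[abc]", "aback")           # a, b or c
negated = re.findall(r"[^a-d]", "abcdxyz")       # anything except a through d
escaped = re.findall(r"[\^\]]", "a^b]c")         # a literal ^ or ]
leading_hyphen = re.findall(r"[-a]", "a-b")      # hyphen right after [ is literal
digit_or_space = re.findall(r"[\d\s]", "a1 b2")  # shorthands inside a class
non_digits = re.findall(r"\D", "a1b2")           # negated shorthand

print(any_of, negated, escaped, leading_hyphen, digit_or_space, non_digits)
```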


Dot

. (dot)
Matches any single character except line break characters \r and \n. Most regex flavors have an option to make the dot match line break characters too.
Code: Select all
. matches x or (almost) any other character


Anchors

^ (caret)
Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the caret match after line breaks (i.e. at the start of a line in a file) as well.
Code: Select all
^. matches a in abc\ndef. Also matches d in "multi-line" mode.


$ (dollar)
Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the dollar match before line breaks (i.e. at the end of a line in a file) as well. Also matches before the very last line break if the string ends with a line break.
Code: Select all
.$ matches f in abc\ndef. Also matches c in "multi-line" mode.


\A
Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Never matches after line breaks.
Code: Select all
\A. matches a in abc


\Z
Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Never matches before line breaks, except for the very last line break if the string ends with a line break.
Code: Select all
.\Z matches f in abc\ndef


\z
Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Never matches before line breaks.
Code: Select all
.\z matches f in abc\ndef
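
The anchor examples can be reproduced in Python, with one caveat: Python's re module spells the strict end-of-string anchor as \Z, so Python's \Z behaves like the \z entry above rather than the \Z entry. Treat this as a sketch of the concepts, not of TextAloud's flavor.

```python
import re

text = "abc\ndef"

first_char = re.findall(r"^.", text)                       # start of string only
line_starts = re.findall(r"^.", text, flags=re.MULTILINE)  # start of every line
last_char = re.findall(r".$", text)                        # end of string only
line_ends = re.findall(r".$", text, flags=re.MULTILINE)    # end of every line

# $ also matches just before a line break that ends the string ...
before_final_break = re.search(r".$", "abc\n")
# ... but Python's \Z only matches at the very end, so nothing is found here.
strict_end = re.search(r".\Z", "abc\n")

print(first_char, line_starts, last_char, line_ends)
```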


Word Boundaries

\b
Matches at the position between a word character (anything matched by \w) and a non-word character (anything matched by [^\w] or \W) as well as at the start and/or end of the string if the first and/or last characters in the string are word characters.
Code: Select all
.\b matches c in abc


\B
Matches at the position between two word characters (i.e. the position between \w\w) as well as at the position between two non-word characters (i.e. \W\W).
Code: Select all
\B.\B matches b in abc
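
Word boundaries are easiest to understand by watching what they exclude. The following Python sketch (Python's flavor, which may differ from TextAloud's) replaces "cat" only where it stands as a whole word.

```python
import re

# \b keeps the match from firing inside "concat".
whole_words = re.sub(r"\bcat\b", "dog", "cat concat cat's")
print(whole_words)

# The two examples from the reference entries above.
before_boundary = re.findall(r".\b", "abc")
inside_word = re.findall(r"\B.\B", "abc")
print(before_boundary, inside_word)
```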


Alternation

| (pipe)
Causes the regex engine to match either the part on the left side, or the part on the right side. Can be strung together into a series of options.
Code: Select all
abc|def|xyz matches abc, def or xyz


| (pipe)
The pipe has the lowest precedence of all operators. Use grouping to alternate only part of the regular expression.
Code: Select all
abc(def|xyz) matches abcdef or abcxyz
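
Because the pipe has the lowest precedence, leaving out the grouping changes the meaning of the expression. A short Python sketch (Python's flavor, not necessarily TextAloud's) shows the difference.

```python
import re

# Without grouping, the alternatives are "abcdef" and "xyz".
ungrouped = re.fullmatch(r"abcdef|xyz", "abcxyz")

# With grouping, only the tail alternates. (?:...) groups without capturing.
grouped = re.findall(r"abc(?:def|xyz)", "abcdef abcxyz abc")

# A series of options.
options = [bool(re.fullmatch(r"abc|def|xyz", s)) for s in ("abc", "def", "xyz", "abd")]

print(ungrouped, grouped, options)
```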


Quantifiers

? (question mark)
Makes the preceding item optional. Greedy, so the optional item is included in the match if possible.
Code: Select all
abc? matches ab or abc


??
Makes the preceding item optional. Lazy, so the optional item is excluded from the match if possible. This construct is often left out of documentation because of its limited use.
Code: Select all
abc?? matches ab or abc


* (star)
Repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is not matched at all.
Code: Select all
".*" matches "def" "ghi" in abc "def" "ghi" jkl


*? (lazy star)
Repeats the previous item zero or more times. Lazy, so the engine first attempts to skip the previous item, before trying permutations with ever increasing matches of the preceding item.
Code: Select all
".*?" matches "def" in abc "def" "ghi" jkl


+ (plus)
Repeats the previous item once or more. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only once.
Code: Select all
".+" matches "def" "ghi" in abc "def" "ghi" jkl


+? (lazy plus)
Repeats the previous item once or more. Lazy, so the engine first matches the previous item only once, before trying permutations with ever increasing matches of the preceding item.
Code: Select all
".+?" matches "def" in abc "def" "ghi" jkl


{n} where n is an integer >= 1
Repeats the previous item exactly n times.
Code: Select all
a{3} matches aaa


{n,m} where n >= 1 and m >= n
Repeats the previous item between n and m times. Greedy, so repeating m times is tried before reducing the repetition to n times.
Code: Select all
a{2,4} matches aa, aaa or aaaa


{n,m}? where n >= 1 and m >= n
Repeats the previous item between n and m times. Lazy, so repeating n times is tried before increasing the repetition to m times.
Code: Select all
a{2,4}? matches aa, aaa or aaaa


{n,} where n >= 1
Repeats the previous item at least n times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only n times.
Code: Select all
a{2,} matches aaaaa in aaaaa


{n,}? where n >= 1
Repeats the previous item at least n times. Lazy, so the engine first matches the previous item n times, before trying permutations with ever increasing matches of the preceding item.
Code: Select all
a{2,}? matches aa in aaaaa
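
The difference between greedy and lazy quantifiers shows up as soon as more than one match length is possible. This Python sketch reruns the examples above; Python's flavor may not be identical to TextAloud's.

```python
import re

text = 'abc "def" "ghi" jkl'

greedy = re.search(r'".*"', text).group()    # runs to the last quote
lazy = re.search(r'".*?"', text).group()     # stops at the first closing quote
each_quoted = re.findall(r'".*?"', text)     # lazy matching finds both strings

at_least_greedy = re.search(r"a{2,}", "aaaaa").group()
at_least_lazy = re.search(r"a{2,}?", "aaaaa").group()
range_greedy = re.search(r"a{2,4}", "aaaaa").group()
range_lazy = re.search(r"a{2,4}?", "aaaaa").group()

# ? makes the c optional, so both "ab" and "abc" are complete matches.
optional = [bool(re.fullmatch(r"abc?", s)) for s in ("ab", "abc", "abcc")]

print(greedy, lazy, each_quoted)
```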


Regular Expression Advanced Syntax Reference

Grouping and Backreferences

(regex)
Round brackets group the regex between them. They capture the text matched by the regex inside them that can be reused in a backreference, and they allow you to apply regex operators to the entire grouped regex.
Code: Select all
(abc){3} matches abcabcabc. First group matches abc.


(?:regex)
Non-capturing parentheses group the regex so you can apply regex operators, but do not capture anything and do not create backreferences.
Code: Select all
(?:abc){3} matches abcabcabc. No groups.


\1 through \9
Substituted with the text matched between the 1st through 9th pair of capturing parentheses. Some regex flavors allow more than 9 backreferences.
Code: Select all
(abc|def)=\1 matches abc=abc or def=def, but not abc=def or def=abc.
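
Capturing groups and backreferences can be inspected directly in Python (whose flavor may differ from TextAloud's). The sketch below shows what each kind of group captures.

```python
import re

# A capturing group keeps the text of its last repetition.
captured = re.fullmatch(r"(abc){3}", "abcabcabc")
print(captured.group(1))

# A non-capturing group matches the same text but stores nothing.
uncaptured = re.fullmatch(r"(?:abc){3}", "abcabcabc")
print(uncaptured.groups())

# \1 must repeat exactly what the first group matched.
same = re.fullmatch(r"(abc|def)=\1", "def=def")
different = re.fullmatch(r"(abc|def)=\1", "abc=def")
```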


Modifiers

(?i)
Turn on case insensitivity for the remainder of the regular expression. (Older regex flavors may turn it on for the entire regex.)
Code: Select all
te(?i)st matches teST but not TEST.


(?-i)
Turn off case insensitivity for the remainder of the regular expression.
Code: Select all
(?i)te(?-i)st matches TEst but not TEST.


(?s)
Turn on "dot matches newline" for the remainder of the regular expression. (Older regex flavors may turn it on for the entire regex.)

(?-s)
Turn off "dot matches newline" for the remainder of the regular expression.

(?m)
Caret and dollar match after and before newlines for the remainder of the regular expression. (Older regex flavors may apply this to the entire regex.)

(?-m)
Caret and dollar only match at the start and end of the string for the remainder of the regular expression.

(?i-sm)
Turns on the option "i", and turns off "s" and "m" for the remainder of the regular expression. (Older regex flavors may apply this to the entire regex.)

(?i-sm:regex)
Matches the regex inside the span with the option "i" turned on, and "s" and "m" turned off.
Code: Select all
(?i:te)st matches TEst but not TEST.
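
Here is how the modifiers look in Python. Note the flavor difference: recent Python versions only accept unscoped modifiers such as (?i) at the very start of a pattern, so examples like te(?i)st have to be written with the scoped (?i:...) form instead. TextAloud's flavor may behave differently.

```python
import re

# Scoped modifier: only "te" is case insensitive.
mixed_case = re.fullmatch(r"(?i:te)st", "TEst")
all_caps = re.fullmatch(r"(?i:te)st", "TEST")

# (?s) lets the dot match a line break.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"(?s)a.b", "a\nb")

# (?m) lets ^ match after every line break.
line_words = re.findall(r"(?m)^\w+", "one\ntwo")

print(bool(mixed_case), bool(all_caps), bool(dot_default), bool(dot_all), line_words)
```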


Atomic Grouping and Possessive Quantifiers

(?>regex)
Atomic groups prevent the regex engine from backtracking back into the group (forcing the group to discard part of its match) after a match has been found for the group. Backtracking can occur inside the group before it has matched completely, and the engine can backtrack past the entire group, discarding its match entirely. Eliminating needless backtracking provides a speed increase. Atomic grouping is often indispensable when nesting quantifiers to prevent a catastrophic amount of backtracking as the engine needlessly tries pointless permutations of the nested quantifiers.
Code: Select all
x(?>\w+)x is more efficient than x\w+x if the second x cannot be matched.


?+, *+, ++ and {m,n}+
Possessive quantifiers are a limited yet syntactically cleaner alternative to atomic grouping. Only available in a few regex flavors. They behave as normal greedy quantifiers, except that they will not give up part of their match for backtracking.
Code: Select all
x++ is identical to (?>x+)


Lookaround

(?=regex)
Zero-width positive lookahead. Matches at a position where the pattern inside the lookahead can be matched. Matches only the position. It does not consume any characters or expand the match. In a pattern like one(?=two)three, both two and three have to match at the position where the match of one ends.
Code: Select all
t(?=s) matches the second t in streets.


(?!regex)
Zero-width negative lookahead. Identical to positive lookahead, except that the overall match will only succeed if the regex inside the lookahead fails to match.
Code: Select all
t(?!s) matches the first t in streets.


(?<=text)
Zero-width positive lookbehind. Matches at a position to the left of which text appears. Since regular expressions cannot be applied backwards, the test inside the lookbehind can only be plain text. Some regex flavors allow alternation of plain text options in the lookbehind.
Code: Select all
(?<=s)t matches the first t in streets.


(?<!text)
Zero-width negative lookbehind. Matches at a position if the text does not appear to the left of that position.
Code: Select all
(?<!s)t matches the second t in streets.
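
The four lookaround examples can be checked by asking Python where each match starts in "streets" (the first t is at index 1, the second at index 5). As with the other sketches, this uses Python's flavor, which may not match TextAloud's exactly.

```python
import re

word = "streets"

lookahead = re.search(r"t(?=s)", word).start()       # t followed by s
neg_lookahead = re.search(r"t(?!s)", word).start()   # t not followed by s
lookbehind = re.search(r"(?<=s)t", word).start()     # t preceded by s
neg_lookbehind = re.search(r"(?<!s)t", word).start() # t not preceded by s

# Lookarounds are zero-width: the match itself is just the t.
matched_text = re.search(r"t(?=s)", word).group()

print(lookahead, neg_lookahead, lookbehind, neg_lookbehind, matched_text)
```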


Continuing from the Previous Match

\G
Matches at the position where the previous match ended, or the position where the current match attempt started (depending on the tool or regex flavor). Matches at the start of the string during the first match attempt.
Code: Select all
\G[a-z] first matches a, then matches b and then fails to match in ab_cd.


Conditionals

(?(?=regex)then|else)
If the lookahead succeeds, the "then" part must match for the overall regex to match. If the lookahead fails, the "else" part must match for the overall regex to match. Not just positive lookahead, but all four lookarounds can be used. Note that the lookahead is zero-width, so the "then" and "else" parts need to match and consume the part of the text matched by the lookahead as well.
Code: Select all
(?(?<=a)b|c) matches the second b and the first c in babxcac


Comments

(?#comment)
Everything between (?# and ) is ignored by the regex engine.
Code: Select all
a(?#foobar)b matches ab
Last edited by BrienMalone on Sun May 14, 2006 12:52 am, edited 2 times in total.

Post by BrienMalone » Thu Apr 27, 2006 10:51 pm

I thought I would include Jim Bretti's tutorial post here from a few years ago. It's a good jumpstart if you're new to Masks.

Posted: Fri May 07, 2004 2:00 pm Post subject: Pronunciation Editor - Masks and Regular Expressions

--------------------------------------------------------------------------------

Beta version 2.047B includes support for Masks and Regular Expressions. Since it isn't documented anywhere yet, I'll try to explain here.

For those not familiar with the term, Regular Expressions are a pattern matching language used for searching and parsing text. I won't get into regular expressions much here; if you search the web you can find plenty of references.

Regular Expressions are extremely powerful, but can be intimidating the first time you see them. So in addition to Regular Expression support, the Basic Pronunciation Editor also supports something called "Masks". A Mask is really a regular expression under the surface, but Masks are a little easier to construct, and do not require that you know anything about regular expressions.

In the Basic Pronunciation Editor, you can now use the Word field to enter either a Mask, or a Regular Expression. To illustrate how to use both, the following real problem will be used: you're using a voice engine that pronounces year numbers (like 1987) as "one thousand nine hundred and eighty seven". We would like the year pronounced as "nineteen eighty seven".

To use a Regular Expression, you would enter the following in the Word field:

{{re=\b19(\d\d)(\b)}}

The characters {{re= indicate the beginning of the regular expression. The trailing }} characters are also required to end the expression.

Inside the expression, \b is a "metacharacter" that indicates a word delimiter. \d matches any numeric character. So the pattern we're searching for is a word delimiter, followed by the string "19", two numeric characters, and another word delimiter.

Notice there are also two sets of parentheses in the expression. The first set contains the last two digits of the year number; the second set contains the trailing word delimiter.

In the Pronunciation field, enter the following:
<s>nineteen $1$2

The leading <s> indicates a space. Since we're matching on something that starts with a word delimiter preceding a year number, we need a leading space in the substituted string. After the space comes the string "nineteen". Finally, $1 and $2 refer to the first and second subexpressions mentioned above, which gives us the last two digits of the year and any trailing punctuation following the year number.
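
For anyone who wants to experiment outside TextAloud, the same pattern can be tried in Python. This is only a sketch of the matching logic: Python writes the subexpressions as \1 and \2 instead of $1 and $2, and because Python's \b is zero-width the second group is always empty and no leading space is needed. How TextAloud handles the delimiter internally is not something this example tries to reproduce.

```python
import re

YEAR = re.compile(r"\b19(\d\d)(\b)")

def speak_years(text):
    # \1 is the last two digits of the year, \2 the (empty) trailing delimiter.
    return YEAR.sub(r"nineteen \1\2", text)

spoken = speak_years("Born in 1987, not 19870.")
print(spoken)
```

The number 19870 is left alone because there is no word delimiter after its first four digits.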

The idea of the mask is that it basically does the same thing, but you can look at some simple help instead of figuring out how to write a regular expression. The following mask characters are predefined:
# - numeric character
$ - any alpha character
@ - any alphanumeric character
? - any character
_ - underscore is a word separator.
\ - escape character

To handle the year problem above, you would enter the following mask into the Word field:

{{mask=_19(##)(_)}}

This is very similar to the regular expression above, including how the parentheses are used. The pronunciation field ends up the same:

<s>nineteen $1$2
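
Since a Mask is a regular expression under the surface, the relationship can be sketched as a simple character-by-character translation. The helper mask_to_regex and the exact regex chosen for each mask character are assumptions made for illustration (Python's flavor); this is not TextAloud's actual conversion code.

```python
import re

# Assumed regex equivalent for each predefined mask character.
MASK_TOKENS = {
    "#": r"\d",           # numeric character
    "$": "[A-Za-z]",      # any alpha character
    "@": "[A-Za-z0-9]",   # any alphanumeric character
    "?": ".",             # any character
    "_": r"\b",           # word separator
}

def mask_to_regex(mask):
    """Translate a mask into a regular expression (hypothetical helper)."""
    out = []
    chars = iter(mask)
    for ch in chars:
        if ch == "\\":
            # Escape character: take the next character literally.
            out.append(re.escape(next(chars, "")))
        elif ch in MASK_TOKENS:
            out.append(MASK_TOKENS[ch])
        elif ch in "()":
            out.append(ch)  # parentheses keep their grouping role
        else:
            out.append(re.escape(ch))
    return "".join(out)

year_regex = mask_to_regex("_19(##)(_)")
print(year_regex)
```

Under these assumptions the year mask turns into the same expression that was entered by hand in the regular expression example.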

For those of you who are interested, that should be enough to get started. I'd appreciate any feedback.

Thanks
_________________
Jim Bretti
NextUp.com
The Power of Spoken Audio
http://www.NextUp.com