CSE Support: 2014

What is a Regular Expression?

A regular expression (regex or regexp) is a sequence of characters that forms a search pattern, mainly used for pattern matching in strings. The concept arose in the 1950s. Regular expressions proved so useful in computing that the various systems for writing them evolved to provide both a basic and an extended standard for the grammar and syntax. Many programming languages provide regular expression capabilities. For example, Perl, Ruby, AWK and Tcl have built-in support, while .NET languages, Java, Python, C++ and most other languages offer regular expressions through a library. This means the basic regex operators are largely independent of the programming language you are using, although the more advanced features differ between regex flavors. Put simply, the basics are a kind of universal language.

Basic concepts

Try testing your own regexes while following this tutorial. You can easily test them with an online regex testing tool.

Parentheses are used to define the scope and precedence of the operators.

| is the OR operator.

"color|colour" this reg-ex matches the words "color" or "colour"
"colo(u|)r" also does the same job. here in addition to "colo" and "r", in between them "u" OR empty character is also matched.

"?" question mark indicates there is zero or one of the preceding element.

"colou?r" also matches both "color" and "colour". here preceding element is "u"
"(pat)?tern" matches "pattern" and "tern". here preceding element is "pat"

"*" asterisk indicates there is zero or more of the preceding element

"so*n" matches "sn", "son", "soon", "sooon" and so on. here matches zero or more "o"s

"+" plus sign indicates there is one or more of the preceding element

"so*n" matches "son", "soon", "sooon" and so on. here matches one or more "o"s. so there is no "sn"

So basically "a+"="aa*"
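
Here is a small Python sketch that compares the star and the plus on the same words. The helper names are assumptions made for this example, and same_language only compares the patterns on a handful of sample strings, so it is a sanity check and not a proof.

```python
import re

candidates = ["sn", "son", "soon", "sooon"]

def full_matches(pattern, words):
    """Return the words matched by the pattern from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

def same_language(first, second, words):
    """True when both patterns accept exactly the same words from the sample."""
    return full_matches(first, words) == full_matches(second, words)

print(full_matches("so*n", candidates))  # zero or more "o"s, so "sn" is included
print(full_matches("so+n", candidates))  # one or more "o"s, so "sn" is left out
print(same_language("a+", "aa*", ["", "a", "aa", "aaa", "b", "ab"]))
```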

We can also define exactly how many occurrences should be matched using braces "{}". (Most regex flavors support this, including Java.)

"so{2}n" matches only "soon", in which "o" occurs exactly twice.
"a{2,5}" matches "aa", "aaa", "aaaa" and "aaaaa".

By combining these basic operators we can write complex regexes to identify languages and validate arithmetic operations (later in this tutorial). Let's look at a simple regex that contains the operators above.

"ab*(c|)" denotes the set of strings starting with "a", followed by zero or more "b"s and finally an optional "c". It matches "a", "ac", "ab", "abc", "abb", "abbc" and so on. "ab*c?" does the same.

"\" escape character is used to define escape sequences. {}[]()^$.|*+? and \ are also known as meta characters.With meta characters, characters that may or may not have their literal meaning. as an example "\d" does not match the letter "d".

Shorthand Character Classes

"\d", "\w", "\s" are mostly used. Capital letters give the negation of each("\D", "\W", "\S").

"\d" this matches a single digit.(0 to 9). so "\d" acts like a single unit, not as two characters. "\d+" matches any number(one or more digits).
"\w" matches any alphanumeric(a-z , A-Z and 0-9) including underscore( _ ). you must make sure that "w" is lower case. otherwise it would give a different meaning. with capital "\W" it matches all characters except alphanumeric(!,@,#,$......). but it also matches the underscore.
"\s" matches a white space.
"." dot matches a single character except a new line(\n). so "a.b" matches "abc"

"[ ]" is used to shorten the OR sets.

"[abc]" as same as "a|b|c"
"0|1|2|3|4|5|6|7|8|9" can be shorted as "[0-9]". "\d" also equals "[0-9]"
so basically "\w" is as same as "[a-zA-Z0-9_]". remember not to add any spaces between characters inside [].
if you want to match letters except "q" you can use "[a-pr-z]".
since "\[.\]" matches any single character surrounded by "[" and "]" (brackets are escaped), it matches "[a]" and "[b]".
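
The character class examples can be tried with a few lines of Python. This is a sketch; same_on_samples is a hypothetical helper that only compares two patterns on the given sample strings.

```python
import re

def same_on_samples(first, second, samples):
    """True when both patterns fully match exactly the same sample strings."""
    return all(
        bool(re.fullmatch(first, s)) == bool(re.fullmatch(second, s))
        for s in samples
    )

print(same_on_samples("[abc]", "a|b|c", ["a", "b", "c", "d", "ab", ""]))

# Every letter of "quiz" except "q" is inside [a-pr-z].
print(re.findall("[a-pr-z]", "quiz"))

# Escaped brackets are literal, and the dot takes exactly one character.
print(re.findall(r"\[.\]", "[a] [b] [cd]"))
```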

Negation

If you want to match letters except vowels you can use "[b-df-hj-np-tv-z]". This is easier to write using negation (^): "[^aeiou]" matches any character that is not a lowercase vowel. Note that this also includes digits, spaces and punctuation, not only consonants.
So since "\d" equals "[0-9]", "\D" equals "[^0-9]".
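
A short Python sketch of negated classes follows. The helper name keep_vowels and the sample strings are assumptions for this illustration.

```python
import re

def keep_vowels(text):
    """Delete every character that is not a lowercase vowel."""
    return re.sub("[^aeiou]", "", text)

print(keep_vowels("regular expression"))

# A negated class is not limited to letters: digits and "!" match too.
print(re.findall("[^aeiou]", "h3llo!"))

# \D and [^0-9] pick out the same characters.
print(re.findall(r"\D", "a1b2"), re.findall("[^0-9]", "a1b2"))
```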

Start and End Anchors

"^" matches the starting of the string while "$" matches the end of the string.

"^so" matches part "so" in string "soon" but not "so" in "picaso".
"so$" matches part "so" in "picaso" but not "so" in string "soon".
"^so$" only matches the word "so".

Standards

To use regexes in program code, you can get started by following these examples.

If-Then-Else Conditionals

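Format: "(?(?=regex)then|else)"

The condition can be a lookahead, as in the format above, or a group number, as in the following regex. Conditionals are an advanced feature and are not supported by every regex flavor.

Let's look at what the regex "(reg)?(?(1)ex|str)" does.

If the string "reg" is matched, then the string "ex" must follow. Otherwise the string "str" is matched. So this regex matches the strings "regex" and "str".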
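
Python's re module is one flavor where this regex can be tried as written. As far as the standard library goes, it supports conditions that test a group number or a group name, but not the lookahead form shown in the format line. The helper accepts is a hypothetical name for this sketch.

```python
import re

# Group 1 is the optional "reg"; (?(1)ex|str) tests whether group 1 took part in the match.
conditional = re.compile(r"(reg)?(?(1)ex|str)")

def accepts(text):
    """True when the whole text is matched by the conditional regex."""
    return conditional.fullmatch(text) is not None

for text in ["regex", "str", "regstr", "ex"]:
    print(text, accepts(text))
```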

Did you ever wonder what the "1" inside this regex is? It is a group identifier. Let's look at groups in regular expressions.

Groups

If you divide a regex using parentheses, the parts in parentheses form groups, numbered from left to right.

"(abs)(olu)?(?(2)te|tract)"

group 1: abs
group 2: olu
conditional (not a numbered group): (?(2)te|tract)

We can identify groups using the group number. Here the "2" in the if condition is true if "olu" is matched and false otherwise. If true, "te" is matched, otherwise "tract" is matched. So this regex matches "absolute" and "abstract".
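
To see the group numbers at work, here is a small Python sketch. It also shows that the conditional part is not counted as a group of its own. The helper name captured is an assumption for this example.

```python
import re

pattern = re.compile(r"(abs)(olu)?(?(2)te|tract)")

def captured(text):
    """Return the captured groups as a tuple, or None when the text does not match."""
    match = pattern.fullmatch(text)
    return match.groups() if match else None

# Only the two parenthesized parts are numbered groups.
print(pattern.groups)

# Group 2 is None when "olu" did not take part in the match.
print(captured("absolute"))
print(captured("abstract"))
print(captured("abste"))
```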

Relative group identifiers

If you add a plus or minus sign before the group number, the number is counted relative to the position of the reference: to the left with minus, or to the right with plus. (This works in flavors that support relative references, such as PCRE.)

The regex above can also be written as "(abs)(olu)?(?(-1)te|tract)", using -1 instead of 2.

The advantage of a relative identifier is that you can add groups at the beginning of the regex without conflicts. Otherwise you have to change the identifier whenever a group is added before it.

Naming groups

You can also give names to groups instead of using numbers.
"?<name>" at the start of a group is used to give it a name. If we take the earlier regex:

"(abs)(?<match>olu)?(?(match)te|tract)" Here the second group is named "match", so we can use that name to refer to the group.

When we use group names we can alter the rest of the regex without any conflict. (After a change, conflicts can sometimes occur even with relative identifiers.)

Comments

"(?#comment)" is used to write comments inside a regex.

"a|b(?#this is a comment)"

Feb 2, 2014

Useful Regex patterns

For e-mails

For Telephone Numbers

For Numbers including Floating points

For URLs

Jan 12, 2014

Regular Expression basics