What are the rules of writing regular expressions?
There are some rules for writing a regular expression (regex) in Java. Let's
discuss those rules. But first, have a look at What are regular expressions or regex?
Common matching symbols used in regex
Regular Expression | Description
. | Matches any single character (by default, any character except a line terminator).
^regex | Finds regex that must match at the beginning of the line.
regex$ | Finds regex that must match at the end of the line.
[abc] | Set definition: matches the letter a, b or c.
[abc][vz] | Set definition: matches a, b or c followed by either v or z.
[^abc] | When a caret appears as the first character inside square brackets, it negates the set. This pattern matches any character except a, b or c.
[a-d1-7] | Ranges: matches a single letter between a and d or a single digit from 1 to 7, but not the two-character sequence d1.
X|Z | Finds X or Z.
XZ | Finds X directly followed by Z.
$ | Checks whether a line end follows.
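The symbols in the table above can be tried out with a few lines of Java. This is a minimal sketch; the class name BasicSymbolsDemo, the helper found and the sample strings are assumptions made for illustration. Note that String.matches tests the whole string, while Matcher.find searches inside it.

```java
import java.util.regex.Pattern;

class BasicSymbolsDemo {
    // Returns true if the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^Java", "Java regex"));   // true: input starts with "Java"
        System.out.println(found("regex$", "Java regex"));  // true: input ends with "regex"
        System.out.println("bv".matches("[abc][vz]"));      // true: b followed by v
        System.out.println("d".matches("[^abc]"));          // true: d is not a, b or c
        System.out.println("5".matches("[a-d1-7]"));        // true: 5 lies in the range 1-7
        System.out.println("d1".matches("[a-d1-7]"));       // false: the set matches one character only
        System.out.println("Z".matches("X|Z"));             // true: X or Z
        System.out.println("XZ".matches("XZ"));             // true: X directly followed by Z
    }
}
```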
Meta characters
There are some predefined meta characters that make certain common patterns
easier to write. Let's have a look at these characters.
Regular Expression | Description
\d | Any digit, short for [0-9]
\D | A non-digit, short for [^0-9]
\s | A whitespace character, short for [ \t\n\x0b\r\f]
\S | A non-whitespace character, short for [^\s]
\w | A word character, short for [a-zA-Z_0-9]
\W | A non-word character, short for [^\w]
\S+ | One or more non-whitespace characters
\b | Matches a word boundary, where a word character is [a-zA-Z0-9_]
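A short sketch of the meta characters in use may help. The class MetaCharactersDemo, the helper findAll and the sample text are assumptions made for illustration; the backslashes are doubled because the patterns are written as Java string literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharactersDemo {
    // Collects every match of the regex in the input, in order of appearance.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "Room 42, floor 7";
        System.out.println(findAll("\\d+", text));  // [42, 7]
        System.out.println(findAll("\\w+", text));  // [Room, 42, floor, 7]
        // \b only matches "cat" as a whole word, not inside "concat" or "cats".
        System.out.println(findAll("\\bcat\\b", "cat concat cats"));  // [cat]
    }
}
```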
Quantifier
A quantifier defines how often an element can occur. The symbols ?, *, + and {}
are quantifiers.
Regular Expression | Description | Examples
* | Occurs zero or more times, short for {0,} | X* finds zero or more letters X; .* finds any character sequence
+ | Occurs one or more times, short for {1,} | X+ finds one or more letters X
? | Occurs zero or one time, short for {0,1} | X? finds zero or exactly one letter X
{X} | Occurs exactly X times; {} specifies how often the preceding element repeats | \d{3} searches for three digits; .{10} for any character sequence of length 10
{X,Y} | Occurs between X and Y times | \d{1,4} means \d must occur at least once and at most four times
*? | ? after a quantifier makes it a reluctant quantifier. It tries to find the smallest match, so matching stops at the first point where the rest of the pattern can succeed.
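The difference between a greedy and a reluctant quantifier is easiest to see side by side. The following is a minimal sketch; the class QuantifierDemo, the helper firstMatch and the sample strings are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>one</b><b>two</b>";
        // Greedy: .* grabs as much as possible, up to the last </b>.
        System.out.println(firstMatch("<b>.*</b>", html));   // <b>one</b><b>two</b>
        // Reluctant: .*? stops at the first </b> that completes the match.
        System.out.println(firstMatch("<b>.*?</b>", html));  // <b>one</b>

        System.out.println(firstMatch("\\d{3}", "tel 5551234"));  // 555
        System.out.println("2024".matches("\\d{1,4}"));           // true
        System.out.println("20245".matches("\\d{1,4}"));          // false: five digits
    }
}
```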
Grouping and back reference
We can group parts of a regular expression. In a pattern, we group elements
with round brackets, e.g., (). This allows us to assign a repetition operator
to a complete group.
In addition, these groups also create a back reference to the part of the
regular expression. This captures the group. A back reference stores the part
of the String which matched the group. This allows you to use this part in the
replacement.
Via the $ you can refer to a group. $1 is the first group, $2 the second, etc.
Let's, for example, assume we want to remove all whitespace between a word
character and a following point or comma. The point or comma has to be part of
the pattern, but it should still appear in the result.
    // Removes whitespace between a word character and . or ,
    // DATA is assumed to be a String constant defined elsewhere.
    String pattern = "(\\w)(\\s+)([\\.,])";
    System.out.println(DATA.replaceAll(pattern, "$1$3"));
This example extracts the text between the opening and closing title tags.
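    // Extract the text between the two title elements
    // EXAMPLE_TEST is assumed to be a String constant defined elsewhere.
    // The third group is assumed to be the closing </title> tag.
    pattern = "(?i)(<title.*?>)(.+?)(</title>)";
    String updated = EXAMPLE_TEST.replaceAll(pattern, "$2");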
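Both ideas from this section can be put into a self-contained sketch. The class GroupingDemo, the method names and the sample strings are assumptions made for illustration, and the closing </title> tag in the second pattern is an assumption about the intended expression. The second method reads the captured group with Matcher.group(1) instead of replaceAll, which returns only the title text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupingDemo {
    // Removes whitespace between a word character and a following . or ,
    // $1 is the word character, $3 the punctuation; group 2 (the whitespace) is dropped.
    static String tidyPunctuation(String input) {
        return input.replaceAll("(\\w)(\\s+)([.,])", "$1$3");
    }

    // Returns the text inside the first title element, or null if there is none.
    static String titleText(String html) {
        Matcher m = Pattern.compile("(?i)<title.*?>(.+?)</title>").matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(tidyPunctuation("Hello , world ."));               // Hello, world.
        System.out.println(titleText("<html><TITLE>My page</TITLE></html>")); // My page
    }
}
```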
Negative look ahead
It provides the possibility to exclude a pattern. With this we can say that a
string should not be followed by another string.
Negative lookaheads are defined via (?!pattern). For example, the following
will match "a" if "a" is not followed by "b".
a(?!b)
Specifying modes inside the regular expression
We can add the mode
modifiers to the start of the regex. To specify multiple modes, simply put them
together as in (?ismx).
· (?i) makes the regex case insensitive.
· (?s) for "single line mode" makes the dot
match all characters, including line breaks.
· (?m) for "multi-line mode" makes the caret and
dollar match at the start and end of each line in the subject string.
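The effect of each mode modifier can be checked with a small sketch. The class ModeDemo, the helper countMatches and the sample strings are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ModeDemo {
    // Counts how many times the regex matches in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // (?i): case-insensitive matching.
        System.out.println("HELLO".matches("hello"));      // false
        System.out.println("HELLO".matches("(?i)hello"));  // true

        // (?s): the dot also matches line breaks.
        System.out.println("a\nb".matches("a.b"));      // false
        System.out.println("a\nb".matches("(?s)a.b"));  // true

        // (?m): ^ and $ match at the start and end of every line.
        String lines = "one\ntwo\nthree";
        System.out.println(countMatches("^\\w+$", lines));      // 0
        System.out.println(countMatches("(?m)^\\w+$", lines));  // 3
    }
}
```

The same modes can also be passed as flags to Pattern.compile, for example Pattern.CASE_INSENSITIVE, Pattern.DOTALL and Pattern.MULTILINE.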
Backslashes in Java
The backslash \ is an escape character in Java Strings. That means the
backslash has a predefined meaning in Java. You have to use a double backslash
\\ to define a single backslash. If you want to define \w, then you must write
\\w in your regex string. If you want to match a literal backslash, you have to
type \\\\, as \ is also an escape character in regular expressions.
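The two levels of escaping can be seen in a short sketch. The class BackslashDemo, the helper splitOnBackslash and the sample path are assumptions made for illustration.

```java
import java.util.Arrays;

class BackslashDemo {
    // Splits a string on literal backslashes.
    // The Java literal "\\\\" is the two-character regex \\ which matches one backslash.
    static String[] splitOnBackslash(String path) {
        return path.split("\\\\");
    }

    public static void main(String[] args) {
        // The Java literal "a\\d" is the three-character regex a\d.
        System.out.println("a\\d".length());       // 3
        System.out.println("a1".matches("a\\d"));  // true

        // The Java literal "C:\\temp" is the string C:\temp.
        System.out.println(Arrays.toString(splitOnBackslash("C:\\temp")));  // [C:, temp]
    }
}
```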