Using Java RegExp Engine

Using Java RegExp Engine

I’m playing Java RegExp Engine recently and I’d like to share some use experiences with you in this article.

Here is the first example:

Pattern p = Pattern.compile("(^[A-Za-z]+)( [0-9]+)( [A-Za-z]+)(.*)");
String text = "foo 42 bar xyz";
Matcher matcher = p.matcher(text);
matcher.find();

We have used parentheses to group our pattern into four parts:

(^[A-Za-z]+)( [0-9]+)( [A-Za-z]+)(.*)

The matcher.find() method will help to match the text using our Pattern:

for (int i = 0; i <= matcher.groupCount(); i++) {
    System.out.println(i + ": " + matcher.group(i));
}

And the matcher.group() can help us to print the matched group defined in pattern string:

0: foo 42 bar xyz
1: foo
2:  42
3:  bar
4:  xyz

Please note the group(0) is the whole text matched by the pattern. The groups are defined by parentheses in pattern. For example, if we change our pattern from:

Pattern p = Pattern.compile("(^[A-Za-z]+)( [0-9]+)( [A-Za-z]+)(.*)");

to:

Pattern p = Pattern.compile("(^[A-Za-z]+)( [0-9]+)( [A-Za-z]+).*");

The difference is that this time we don’t quote the last .* part into parentheses. And let’s rerun the matching process:

for (int i = 0; i <= matcher.groupCount(); i++) {
    System.out.println(i + ": " + matcher.group(i));
}

The result becomes:

0: foo 42 bar xyz
1: foo
2:  42
3:  bar

We can see this time xyz does not show,because the last .* does not belong to any group. Now let’s see how to use region method. Here is the code example:

Pattern p = Pattern.compile("(.* )(.* )");
String text = "foo 42 bar xyz";
Matcher matcher = p.matcher(text);
matcher.region(1, text.length());
matcher.find();

The matcher.region(1, text.length()) is to set the start index and the end index of the text that the matcher will search for. As you can see, we have set the matcher to search from text index 1 till the end of the text, so the first character(which is index 0) of text will be ignored. Here’s the search code:

for (int i = 0; i <= matcher.groupCount(); i++) {
    System.out.println(i + ": " + matcher.group(i));
}

And here is the result:

0: oo 42 bar
1: oo 42
2: bar

We can see the first character in text, which is f of the foo, is bypassed. We can also see from the result that Java RegExp engine is greedy by default, because the group 1 returns oo 42, which matches maximum text of pattern (.* ). Now let’s see the difference between matches and find methods of Matcher class.

The matches method will try to match the pattern with whole text. If the pattern doesn’t match the whole text, the method fails. Otherwise, if there are contents in the text that can match the pattern, the find method will match it. Here is the code example:

Pattern p = Pattern.compile("[a-z][a-z][a-z]");
String text = "foo 42 bar xyz";
Matcher matcher = p.matcher(text);

I have written a pattern that will match three consecutive alphabets. Now let’s try to use ‘matches()’ method firstly:

System.out.println(matcher.matches());

The result is false. Because the pattern can’t match the whole text. Now let’s try find() method:

System.out.println(matcher.find());

The result is true, because the pattern can match partial contents in text, which could be foo. And each time we call find(), it will try to search the next partial text that can match the pattern. Let’s see the code example:

Pattern p = Pattern.compile("[a-z][a-z][a-z]");
String text = "foo 42 bar xyz";
Matcher matcher = p.matcher(text);

while (matcher.find()) {
    System.out.println(matcher.group());
}

As the example shown above, we will call find() method until it can’t find anything more. Here’s the result:

foo
bar
xyz

As the result shown above, we can see each time the find() method call match the pattern with partial text. We used group() method to return the whole matched text. Because our pattern has parentheses, so we can also use group method to get the substring we want. Here’s the code example:

Pattern p = Pattern.compile("([a-z])([a-z])([a-z])");
String text = "foo 42 bar xyz";
Matcher matcher = p.matcher(text);

while (matcher.find()) {
    System.out.println(matcher.group(1));
}

Because our pattern is ([a-z])([a-z])([a-z]),so group(1) will be the first [a-z] in matched text. Here’s the result:

f
b
x

The result is the first group in the pattern for each found text. Please note both matches and find will alter the internal state of the matcher. As we can see above, each find will search next part of the text. Let’s see this code example:

Pattern p = Pattern.compile("([a-z])([a-z])([a-z])");
String text = "foo";
Matcher matcher = p.matcher(text);

System.out.println(matcher.matches());

while (matcher.find()) {
    System.out.println(matcher.group(1));
}

We used matches method to match text will the pattern, and it will succeed. So the find method won’t match anything at all, because the previous matches method already processed the text and matches the text successfully. The result of above code will be true and nothing else. If we remove the matcher.matches(), then the result will be f, which is the matcher.group(1) output.

Non-capturing group is a concept that supports you to group something in the pattern but don’t count it in the result. Let’s see this code example:

Pattern p = Pattern.compile("([a-z])([a-z])([a-z])");
String text = "foo bar";
Matcher matcher = p.matcher(text);

while (matcher.find()) {
    System.out.println(matcher.group(1));
}

The pattern is to match three consecutive alphabets, so foo and bar will match for each find(). And group 1 is the first [a-z], so it will match the first alphabet of foo and bar. The result of above code is:

f
b

Which is the same like we expected. Now let’s modify the pattern a little bit:

Pattern p = Pattern.compile("(?:[a-z])([a-z])([a-z])");

Please note we have added ?: into first group, which means this group is a non-capturing group. The matcher will still match it, but this group will not be counted in the result.

So the matcher.group(1) this time will become the second [a-z]. Let’s rerun the codes and here’s the result:

o
a

Please note the first group, f in foo and b in bar, are bypassed. Because the first group has been marked as non-capturing group with ?:.

My Github Page: https://github.com/liweinan

Powered by Jekyll and Theme by solid

If you have any question want to ask or find bugs regarding with my blog posts, please report it here:
https://github.com/liweinan/liweinan.github.io/issues