Why is my regular expression always greedy?

I have the following string, read into MATLAB:
*aaa
$bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
1111111111111111111111
222222222222
3333333333333333333333333
4444444444555556666666
777777788899999
*ddd
$11111111111111111111111111111111
222222222222222abcdf
99999999999
*abcde99999
$eeeeeeeeeeeeeeeeeeeeee
I would like to perform a search that only extracts the text between *aaa and *ddd, using the following regexp pattern:
pattern = '(?<=\*aaa\s)(.*|\n)*?(?=\*)';
I expected the middle (.*|\n)*? to match the minimum number of "either any character other than linebreak, or a linebreak" that sits between *aaa and the closest * symbol, at *ddd. Instead, MATLAB returns the following:
$bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
1111111111111111111111
222222222222
3333333333333333333333333
4444444444555556666666
777777788899999
$11111111111111111111111
*ddd
$11111111111111111111111111111111
222222222222222abcdf
99999999999
Instead of stopping at just before *ddd, regexp continued until just before *abcde99999, despite the presence of the "?" at the end of the middle section of the pattern.
Just to make sure this isn't a lookaround issue, I also tried running
pattern = '\*(.*|\n)*?\*';
And sure enough, I get the following, with the *ddd in the middle being skipped entirely:
*aaa
$bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
1111111111111111111111
222222222222
3333333333333333333333333
4444444444555556666666
777777788899999
$11111111111111111111111
*ddd
$11111111111111111111111111111111
222222222222222abcdf
99999999999
*
Interestingly enough, when I tried this pattern on online regex testers here and here, I got the expected result. Is there any reason why the MATLAB implementation of regex remains greedy even with a "?" at the end? Any help would be appreciated!

Accepted Answer

Guillaume on 16 Oct 2019
MATLAB's regex engine has the peculiarity that . also matches \n by default, whereas most other engines don't. The lazy ? in your pattern applies only to the repetition of the group, not to the .* inside it. So your greedy .* inside the capturing group also consumes all the newlines, and the second half of the alternation never gets a chance to match anything. That behaviour can be turned off, and if you do, you get the result you expected:
regexp(yourstring, pattern, 'match', 'dotexceptnewline')
However, I don't see why you need the alternation in the first place. A simpler pattern that achieves the same result would be:
regexp(yourstring, '(?<=\*aaa\s)[^*]*(?=\*)', 'match') %dotall or dotexceptnewline doesn't matter for that one, since [^*] also matches newline.
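To make the difference concrete, here is a minimal sketch that runs the three variants side by side. The string str is a shortened, made-up stand-in for the one in the question, and the variable names are only illustrative.

```matlab
% Shortened stand-in for the string in the question (assumed sample data).
str = sprintf('*aaa\n$bbb\n111\n*ddd\n$111\n222abc\n*abcde999\n$eee');

pattern = '(?<=\*aaa\s)(.*|\n)*?(?=\*)';

% Default options: '.' also matches newline, so the inner '.*' runs to the
% end of the string and backtracks only as far as the last '*'.
mDefault = regexp(str, pattern, 'match', 'once');

% 'dotexceptnewline': '.*' stops at each line end, so the lazy outer
% repetition can stop at the first '*' that follows.
mNoNewline = regexp(str, pattern, 'match', 'once', 'dotexceptnewline');

% Negated character class: no alternation and no dependence on the dot option.
mClass = regexp(str, '(?<=\*aaa\s)[^*]*(?=\*)', 'match', 'once');

disp(isequal(mNoNewline, mClass))      % both should stop just before *ddd
disp(length(mDefault) > length(mClass)) % the default match runs past *ddd
```

The character class version is usually the easier one to reason about, because the place where the match stops is written into the pattern itself instead of depending on an option.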
  1 Comment
zhert on 16 Oct 2019
Thanks for the explanation! If I were to use [^\*], I suppose I don't even really need the (?=\*), right?
Another related question, why is it that the pattern works with "\*aaa\s", but not with "\*aaa\n"?
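Both questions can be probed with a short experiment. The sketch below uses made-up sample strings. The thread does not say why "\*aaa\n" failed; the last part only illustrates one possible cause, assumed here and not confirmed by the poster, namely Windows-style \r\n line endings in the text that was read in.

```matlab
% Made-up sample data, shortened from the string in the question.
str = sprintf('*aaa\n$bbb\n111\n*ddd\n$111');

% When a '*' follows, [^*]* already stops in front of it, so the lookahead
% does not change the match.
withLookahead    = regexp(str, '(?<=\*aaa\s)[^*]*(?=\*)', 'match', 'once');
withoutLookahead = regexp(str, '(?<=\*aaa\s)[^*]*', 'match', 'once');
disp(isequal(withLookahead, withoutLookahead))

% When no later '*' exists, the two differ: the lookahead version finds
% nothing, while the plain version matches through to the end of the string.
tail   = sprintf('*aaa\n$bbb\n111');
strict = regexp(tail, '(?<=\*aaa\s)[^*]*(?=\*)', 'match', 'once');
loose  = regexp(tail, '(?<=\*aaa\s)[^*]*', 'match', 'once');
disp(isempty(strict))
disp(loose)

% Hypothetical cause for the \s versus \n difference (an assumption):
% with \r\n line endings, '*aaa' is followed by \r, which \s matches
% but \n does not.
crlf   = sprintf('*aaa\r\n$bbb\r\n*ddd');
usingN = regexp(crlf, '(?<=\*aaa\n)[^*]*', 'match', 'once');
usingS = regexp(crlf, '(?<=\*aaa\s)[^*]*', 'match', 'once');
disp(isempty(usingN))
disp(~isempty(usingS))
```

Checking double(yourstring) for the value 13 (carriage return) would show whether the line-ending explanation applies to the actual data.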

Release

R2019b
