What is the function of a question mark '?' in a regular expression?

I have C source text containing two comments, shown below, and I want to use MATLAB's regular expression functions to remove all the comments from the file, which is called "myTextFile.txt":
/* This is my first comment */
void myCfun(void){}
/* This is my second comment */
Below is my first MATLAB script, which does not work well:
mytext = fileread('myTextFile.txt');
searchPattern = '/\*.*\*/';
matchedString = regexp(mytext,searchPattern,'match');
The result is:
matchedString = {'/* This is my first comment */void myCfun(void){}/* This is my second comment */'};
The C file contains two comments. However, without a question mark, the MATLAB regexp function returns everything between the first '/*' and the last '*/' as a single match. This is not what I want.
Then I modified the search pattern as follows:
mytext = fileread('myTextFile.txt');
searchPattern = '/\*.*?\*/';
matchedString = regexp(mytext,searchPattern,'match');
This time it gives the right answer:
matchedString = {'/* This is my first comment */', '/* This is my second comment */'};
I don't understand what the question mark does in the above example.

1 Comment

madhan ravi:
Rui Zhang:
I went to the website you linked to.
My understanding is that the question mark '?' is equivalent to {0,1}. But that doesn't help me with my question.
In my examples, it looks like the question mark changes the search direction. Without it, the function searches the given string starting from the last character; with it, it searches the string starting from the first character. This is the impression I get from my example above.
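The two readings of '?' mentioned in this comment can be told apart by what the question mark follows. A minimal sketch, assuming made-up test strings that are not from the original post: after an ordinary token, '?' means "zero or one" (the same as {0,1}); after a quantifier such as *, it makes that quantifier lazy. Neither form changes the search direction; the string is always scanned from left to right.

```matlab
% '?' directly after a token: the token is optional, same as {0,1}.
optionalMatch = regexp('color colour', 'colou?r', 'match');
% optionalMatch is {'color', 'colour'}

% '?' directly after a quantifier: the quantifier becomes lazy.
% Illustrative strings only; they are not from the original question.
greedyMatch = regexp('<a><b>', '<.*>',  'match');   % {'<a><b>'}
lazyMatch   = regexp('<a><b>', '<.*?>', 'match');   % {'<a>', '<b>'}
```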


 Accepted Answer

Let's interpret the 1st searchPattern.
searchPattern = '/\*.*\*/';
% /\* start with /*
% .* allow for any character(s) (Greedy!)
% \*/ keep searching until you get to the LAST */
This matches all of your text because your text begins with /* and ends with */. It doesn't care that there's another */ before the last one. This is known as a greedy search.
Let's interpret the 2nd searchPattern
searchPattern = '/\*.*?\*/';
% .*? allow for any character(s) until you get to....
% \*/ ....the first time this occurs
This match ends at the next */ unlike the above match that ends at the last */.
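The two patterns can be compared side by side without a file on disk. This is a sketch under one assumption: the variable mytext is built with sprintf as a stand-in for the result of fileread('myTextFile.txt'), using the three lines from the question. The last line uses regexprep, which is one possible way to do the comment removal the question asks for, not necessarily the only one.

```matlab
% Stand-in for fileread('myTextFile.txt'); \n becomes a newline character.
mytext = sprintf('/* This is my first comment */\nvoid myCfun(void){}\n/* This is my second comment */');

% Greedy: .* runs to the LAST */ (in MATLAB, . also matches newlines by default).
greedy = regexp(mytext, '/\*.*\*/', 'match');    % 1 match: the whole text

% Lazy: .*? stops at the NEXT */ after each /*.
lazy = regexp(mytext, '/\*.*?\*/', 'match');     % 2 matches: one per comment

% Remove every comment by replacing each lazy match with nothing.
codeOnly = regexprep(mytext, '/\*.*?\*/', '');
```

Here lazy comes back as {'/* This is my first comment */', '/* This is my second comment */'}, and codeOnly keeps only the function line plus the surrounding newlines, so strtrim(codeOnly) can be used to tidy it.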

6 Comments

Glad I could help. Regular expressions are so versatile that I'm always learning new ways to construct expressions despite using regular expressions for years.
I find it helpful to construct complex expressions at https://regex101.com/ (select the Python flavor for MATLAB; its syntax is largely, but not completely, the same as MATLAB's).
.*? does not strictly mean "up to the first occurrence of the next token". It means "up to the first point at which everything that follows in the pattern matches".
The difference is in what happens if there is a match failure in what follows. For example, consider the string AAAABAAABCAABCAA and the match pattern A.*BC . Here, the .* is a greedy match and goes as far as possible in the match. The initial A would match the first character, and then at first the .* would match every remaining character, and then because the B fails to match after end of string, the match would back up one position A|AAABAAABCAABCA|A and then the B would be looked for. The B is not present so the .* would be backed up one more time, A|AAABAAABCAABC|AA and the B would be looked for. It still fails, so .* is backed up again, A|AAABAAABCAAB|CAA and then again A|AAABAAABCAA|BCAA . And now the B matches, and the C after it matches, and the match is considered complete.
Now use a match pattern A.*?BC . The .*? matches as little as possible at first, namely nothing: A||AAABAAABCAABCAA . The next character is an A, which does not match the pattern's B, so the .*? is expanded, A|A|AABAAABCAABCAA , and that fails too, so it is expanded again and again until A|AAA|BAAABCAABCAA . Now the B matches, so you proceed to A|AAAB|AAABCAABCAA looking for the C. But the C is not present at that point.
If the .*? operator strictly meant "the first time" then at this point the parser would abandon the match because it expanded to the "first" B and what followed failed.
But that isn't exactly what .*? means. Instead, when the C fails to match, the parser goes back to the last quantifier and expands it past the B, giving A|AAAB|AAABCAABCAA . That is not followed by a B, so the .*? is expanded again, A|AAABA|AABCAABCAA , and again and again until A|AAABAAA|BCAABCAA . That is followed by a B, so you proceed to A|AAABAAAB|CAABCAA looking for the C, and find it, so the match is declared a success, having matched |AAAABAAABC|AABCAA . At that point, since 'once' was not specified, the parser would start looking again for the A.*?BC pattern, would find it in the AABC stretch, and that would be a second match.
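The walkthrough can be checked directly. A minimal sketch using the same string and the two patterns from this comment; the variable names are only illustrative.

```matlab
str = 'AAAABAAABCAABCAA';

% Greedy: .* grabs everything, then gives characters back until BC fits.
greedy = regexp(str, 'A.*BC', 'match');              % {'AAAABAAABCAABC'}

% Lazy: .*? starts empty and grows until BC fits; matching then resumes
% after the first match and finds a second one.
lazy = regexp(str, 'A.*?BC', 'match');               % {'AAAABAAABC', 'AABC'}

% With 'once', only the first lazy match is returned, as a char vector.
lazyOnce = regexp(str, 'A.*?BC', 'match', 'once');   % 'AAAABAAABC'
```

Note that the first lazy match runs past the first B (which is followed by A, not C), which is the point being made here: the lazy quantifier keeps expanding until the rest of the pattern succeeds.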
There is an operator that corresponds to "look for the first match and if what is after that fails, do not back up and try again". I don't think I have ever seen that operator used.
+1. It definitely takes some time to learn regex, as Adam said.
