Try to understand a regexp example

regexp is by far the function I have used most for the string-related Cody problems. I have spent a lot of time trying to understand how it works, yet many questions remain. There is an example in doc regexp under Dynamic Regular Expressions (modified below):
a = 'zzabcbagh';
regexp(a, '(.{2,}).?(??@fliplr($1))','match','once');
This is to "find palindromes that are at least four characters long". What I don't understand is why the two '.' (dots) are necessary. And what about the first '?'?
Besides, when 'once' is used, it returns a char; however, when it is omitted ('all' by default), it returns a cell. Why?
Thank you in advance.
  1 Comment
Walter Roberson on 5 Aug 2019
When 'once' is not used, the pattern is applied repeatedly to find all of the matches in the string. Suppose that, as you propose, a character vector were returned when only one match was found, and a cell were used only when there were multiple matches. Then you, as a programmer who does not know ahead of time how many matches there are, would have to interpret the result. For example:
S = char(randi(['a' 'z'], 1, 10));
R = regexp(S, '[bp]', 'match') ;
How do you process R under your proposal? There could be 0, 1, or even up to 10 matches. So ahead of time you do not know if there is going to be exactly one match, so you cannot presume that R is char, and to process you would need to do
if ischar(R)
    % exactly one match: use R as a character vector
elseif isempty(R)
    % empty cell: no matches
    % ...
else
    % R is a cell of results
end
Compare that to the current implementation: R is always a cell unless you told it 'once' and you do not need to test specially.
There is one call that returns character vector on one match and returns a cell on multiple matches: namely uigetfile with multiselect turned on. It is a nuisance that they special cased the single file possibility and people often get the call wrong.
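To make the point above concrete, here is a minimal sketch with three fixed inputs chosen for illustration (they are not from the original post) that contain zero, one, and several matches of '[bp]'. The same loop handles every case because the result is always a cell, while 'once' gives back a character vector.

```matlab
% Illustrative inputs (assumed for this sketch): 0, 1, and 3 matches of '[bp]'.
inputs = {'xyz', 'xbz', 'bpb'};
counts = zeros(1, numel(inputs));

for k = 1:numel(inputs)
    R = regexp(inputs{k}, '[bp]', 'match');   % cell array, whatever the match count
    for m = 1:numel(R)                        % body is simply skipped when R is empty
        counts(k) = counts(k) + 1;
    end
end

% With 'once', at most one match is requested, so no cell wrapper is needed.
first = regexp('bpb', '[bp]', 'match', 'once');
single = regexp('xbz', '[bp]', 'match');      % one match, but still a cell
```

No ischar/isempty branching is needed in the loop, which is the convenience the always-a-cell convention is meant to provide.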


Accepted Answer

Walter Roberson on 5 Aug 2019
The pattern translates as "at least two characters, possibly followed by another character, followed by the reverse of the original characters".
The optional middle character handles odd-length cases such as aboba while still permitting even length ones such as abooba.
If the first group were not restricted to 2 or more characters, the pattern would match double letters such as the rr in lorry: that would be a single character that does not happen to be followed by a middle character, followed by the reverse of the first (which is just itself). I presume that the problem statement says that those are not to be found. The restriction to 2 or more also has the effect of eliminating cases such as bib, so I have to presume that the problem statement says not to find those either.
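The cases described in this answer can be tried side by side. This is only a sketch: the variable names are made up for illustration, and the "loose" pattern with {1,} is the hypothetical unrestricted variant discussed above, not something from the documentation example.

```matlab
strict = '(.{2,}).?(??@fliplr($1))';   % pattern from the question: group of 2 or more
loose  = '(.{1,}).?(??@fliplr($1))';   % hypothetical variant: group of 1 or more

% Odd and even lengths are both handled by the optional middle character.
odd_len  = regexp('aboba',  strict, 'match', 'once');
even_len = regexp('abooba', strict, 'match', 'once');

% Short cases that the {2,} restriction is meant to exclude.
lorry_strict = regexp('lorry', strict, 'match', 'once');
lorry_loose  = regexp('lorry', loose,  'match', 'once');
bib_strict   = regexp('bib',   strict, 'match', 'once');
bib_loose    = regexp('bib',   loose,  'match', 'once');
```

Following the reasoning in the answer, the strict pattern should find aboba and abooba but nothing in lorry or bib, while the loose pattern should pick up rr and bib.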
  2 Comments
Edward Huang on 6 Aug 2019
These (including the comment above) are very informative answers! I really appreciate it.
Just now I changed my code a bit, as follows:
b = '23.4546258285267723';
regexp(b, '(.{1,}).?(??@fliplr($1))','match');
In this case, the first cell returns as
{'.4546'}
Why is it not {'454'} ?
Walter Roberson on 7 Aug 2019
.{1,} can match any character, including the decimal point between '23' and '4546'. When you fliplr() the matched characters, the decimal point is one of them. The flipped text is then interpreted as a pattern, and in a pattern the decimal point matches any one character, so the "reflected" decimal point is treated as the pattern character "." rather than as a literal decimal point.
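A small sketch of this effect, plus one possible workaround. The workaround (using \d instead of . so that the flipped token contains only digits) is an assumption about what was intended for a numeric string; it is not part of the original answer.

```matlab
b = '23.4546258285267723';

% The token '.4' flipped is '4.'; used as a pattern, its '.' is a wildcard,
% so it happily matches '46'.
wild = regexp('46', fliplr('.4'), 'match', 'once');

% Original pattern from the comment: the first match starts at the decimal point.
with_dot = regexp(b, '(.{1,}).?(??@fliplr($1))', 'match', 'once');

% Assumed variant: \d restricts every piece to digits, so the flipped token
% contains no regexp metacharacters.
digits_only = regexp(b, '(\d{1,})\d?(??@fliplr($1))', 'match', 'once');
```

With digits only, the first match should be the 454 that was expected in the question.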
