Wanted: Examples on how to use "Dynamic Regular Expressions" to debug regular expressions
I am trying to develop a function, is_string_constant, which takes a text string of MATLAB code and returns a logical vector that is true at the positions of string constants, e.g. '%s%f' (but not at ';c=d' in a=b';c=d'; and not inside comments).
Status
- I have not found a similar function in the FEX or elsewhere
- I found % MATLAB Comment Stripping Toolbox by Peter J. Acklam
- At regex101 I have a working regular expression (in PCRE (PHP)). MATLAB's regular expression flavor is close to PCRE.
- At regex101 my PCRE expression works under Python too. regex101 automatically generates a Python script based on my test case.
- So far I have failed to port my PCRE expression to MATLAB.
- I'm trying to use "Dynamic Regular Expressions" to understand where it goes wrong. For now, I move (?@disp($0)) around in the expression and watch the code fragments that get printed in the Command Window. However, it remains to make something useful out of it.
(?@cmd)
Execute the MATLAB command represented by cmd, but discard any output
the command returns. (Helpful for diagnosing regular expressions.)
Example: '\w*?(\w)(?@disp($1))\1\w*' matches words that include double
letters (such as pp), and displays intermediate results.
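To make the documented example runnable, here is a minimal sketch. The input 'mississippi' and the variable names are chosen only for illustration. The (?@disp($1)) checkpoint displays every candidate letter the engine captures as token 1 before it tests the backreference \1.

```matlab
str = 'mississippi';                    % illustrative input (assumption)
pat = '\w*?(\w)(?@disp($1))\1\w*';      % pattern from the documentation excerpt
m = regexp(str, pat, 'match', 'once');  % each attempted token is displayed
disp(m)                                 % the final match
```

Moving the (?@disp(...)) checkpoint to other positions in the pattern shows how far the engine gets before it backtracks.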
Questions:
- Is there already an is_string_constant function available somewhere?
- Where can I find tutorials and examples on how to debug regular expressions in MATLAB?
- Would it be crazy to try Java or Python to do the job? (I haven't used either.)
- Other tips
Accepted Answer
More Answers (1)
Regular expressions may not be that appropriate in this context; I have used them in the past for exactly this, but the result was too complicated to be really satisfactory.
I took 10 minutes to build a basic loop (for a regexp evangelist, it was quite painful ;)), which seems to work on a few test strings:
strs = {
'', ...
'abc', ...
'''abc''', ...
'% ''abc''', ...
's = ''hello'' ; b = c'' ; fprintf([''A''''s content '',''%d : %s''], i, str{:}.'') % ''abc''' ...
} ;
for sId = 1 : numel( strs )
is_string_constant( strs{sId}, true ) ;
fprintf( '\n' ) ;
end
Outputs (skipping the empty string):
>> test
abc
000
'abc'
11111
% 'abc'
0000000
s = 'hello' ; b = c' ; fprintf(['A''s content ','%d : %s'], i, str{:}.') % 'abc'
00001111111000000000000000000000111111111111111011111111100000000000000000000000
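The loop itself is not reproduced in the thread. As a rough idea of what such a scanner could look like, here is a minimal sketch. The name is_string_constant_sketch and the transpose rule are assumptions, not the code used above. It handles a single line, treats a quote directly after an identifier character, closing bracket, dot or quote as a transpose, and ignores double-quoted strings, block comments and '...' continuations.

```matlab
code = 's = ''hello'' ; b = c'' ; x = [''A''''s'', ''%d''] % ''abc''';
mask = is_string_constant_sketch(code);
disp(code)
disp(char(mask + '0'))   % show the logical mask as a row of 0/1 characters

function mask = is_string_constant_sketch(code)
% Hypothetical sketch: mark the characters of single-quoted literals.
mask = false(1, numel(code));
in_str = false;
k = 1;
while k <= numel(code)
    c = code(k);
    if in_str
        mask(k) = true;
        if c == ''''
            if k < numel(code) && code(k+1) == ''''
                mask(k+1) = true;   % '' inside a literal is an escaped quote
                k = k + 1;
            else
                in_str = false;     % closing quote
            end
        end
    elseif c == '%'
        break                       % rest of the line is a comment
    elseif c == ''''
        % Assumed rule: decide transpose vs. string start from the
        % character immediately before the quote.
        is_transpose = false;
        if k > 1
            p = code(k-1);
            is_transpose = isletter(p) || (p >= '0' && p <= '9') ...
                || any(p == '_)]}.''');
        end
        if ~is_transpose
            in_str = true;          % opening quote
            mask(k) = true;
        end
    end
    k = k + 1;
end
end
```

The state lives in one flag, in_str, so the doubled-quote case is resolved by looking one character ahead instead of by backtracking, which is the part that is awkward to express in a single pattern.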
EDIT 1: I spent another 20 minutes building a simple debug function (attached). It doesn't do much, but it avoids the hassle of updating patterns by hand. It seems to handle nested levels of parentheses and escaped parentheses well.
PS: .. but of course, I don't see why it is useful unless we can't output the match and/or tokens for a reason, so I may just have wasted 20 minutes (lol) and we are back to "However, it remains to make something useful out of it.".
>> match = regexp_debug( 'hello world', '(ll.).*?(o.l)', 'match', 'once' )
match: 'llo worl'
token_1: 'llo'
token_2: 'orl'
match =
'llo worl'
>> tokens = regexp_debug( 'hello world', '(ll.).*?(o.l)', 'tokens', 'once' )
match: 'llo worl'
token_1: 'llo'
token_2: 'orl'
tokens =
1×2 cell array
{'llo'} {'orl'}
>> [tokens, start] = regexp_debug( 'hello world', '((?<=(l|\(\)))l.).*?(o.l)', 'tokens', 'start', 'once' )
match: 'lo worl'
token_1: 'lo'
token_2: 'orl'
tokens =
1×2 cell array
{'lo'} {'orl'}
start =
4
Further EDITs:
- 15/10 - Added the match in the output.
- 16/10 @ 02:34UTC - Corrected bug in tokens count.
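The attached regexp_debug is not shown in the thread. For readers who only want similar printed output, a much simpler wrapper can be sketched; regexp_show is a hypothetical name, and unlike the attached function it does not touch the pattern at all. It just runs regexp a second time to fetch the first match and its tokens.

```matlab
tokens = regexp_show('hello world', '(ll.).*?(o.l)', 'tokens', 'once');

function varargout = regexp_show(str, pat, varargin)
% Hypothetical helper: print the first match and its tokens, then
% return whatever outputs the caller asked regexp for.
[m, t] = regexp(str, pat, 'match', 'tokens', 'once');
fprintf('match: ''%s''\n', m);
for k = 1:numel(t)
    fprintf('token_%d: ''%s''\n', k, t{k});
end
% Forward the original options unchanged.
[varargout{1:max(nargout, 1)}] = regexp(str, pat, varargin{:});
end
```

This only reports the final result; to see intermediate attempts, a (?@disp(...)) checkpoint inside the pattern is still needed.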
5 Comments
Walter Roberson
on 14 Oct 2017
Regular expressions do poorly on "balancing" problems, such as matching brackets or matching quote marks. In fact, some of the foundational theory on Regular Expressions shows that they cannot handle balancing problems: in order to handle balancing you need an indefinitely-large push-down stack or equivalent. Perl "extended regular expressions" implement that explicitly.
Quote marks have the additional complication that if a quote is followed by another quote, the pair is a single literal quote that is not considered to balance anything.
I would have to think more about how it could be done with dynamic matches. For bracket matching it would involve recursion and backtracking. You want to find the smallest match, so it is not enough to find that open and close bracket counts match, you can only accept at the point where the counts match and do not match on any substring... But for quote marks.. mumble mumble mumble.
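For the bracket case, the "accept at the first point where the counts match" rule can be illustrated without regular expressions by keeping a running depth counter. This is a minimal sketch with an invented input; it ignores brackets that appear inside quoted strings.

```matlab
s = 'f(a(b)c)(d)';                          % illustrative input (assumption)
depth = cumsum((s == '(') - (s == ')'));    % nesting depth after each character
first_open  = find(s == '(', 1);
% The smallest balanced group ends at the first ')' that brings the depth to 0.
first_close = find(depth == 0 & s == ')', 1);
group = s(first_open:first_close);
disp(group)
```

The counter plays the role of the push-down stack: a single bracket type needs only a depth, while mixed bracket types would need a real stack.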