How to capture tokens using regular expressions?

Dear all, I would like to capture two parts of a sequence of strings. I would like to call the first part "main" and the second part "digits". The expressions in the strings have a distinct pattern in that they either have ONE underscore or parentheses. What I am looking to capture is the part before the underscore or the opening parenthesis (main) and the part after the underscore or inside the parenthesis (digits). As an example, the typical exercise will be of the form
expression={'abcd_1','ghsa(22)','gaver_45','fadae(8)'}
out=regexp(expression,pattern,'names')
The result should be a cell array where each cell contains a structure with fields "main" and "digits". In the first case, for instance, the result should be
main='abcd' and digits='1'.
What I am missing is the right "pattern". Any suggestions?

5 Comments

Cedric on 16 Sep 2015
Edited: Cedric on 16 Sep 2015
Are you dealing with a cell array of strings initially or is it something else? Ideally, what kind of output do you need? Cell array, struct array, other? You showed an example with only one output, so it's difficult to say. Also, about the logic: do you need to check that what is after the underscore or within parentheses is made of digits, or do you just want to use the underscore or the opening parenthesis as a separator and separate whatever comes before from whatever comes after/within?
Dear Cedric,
I think regular expressions offer the possibility to get an output in the form of a structure or a cell array of structures. The main point however, is about the extraction of the various parts.
The input could be a string or a cell array of strings. But in all cases, the elements to process will be of the form
whatever_45
or
whatever(45)
You make a good point about whether to ensure that we indeed have digits after the underscore or inside parentheses. Normally that should be checked and an error should be issued if this is not the case. But for now, I would be happy even with a solution that does not check for errors.
Thanks a lot
Dear Patrick,
In summary, for extracting and validating digits and decimal points, I would write a pattern like
'(.*?)[\(_]([\d\.]*)'
which explicitly requires the second part to consist of zero or more (*) elements of the set ([]) of digits (\d) or decimal points (\.). Yet, if I wanted to leave validation to STR2DOUBLE, I would extract whatever is in parentheses or after the underscore:
'(.*?)[\(_]([^\)]*)'
which translates into zero or more (*) characters that are not in the set ([^]) containing the literal closing parenthesis. Another way is given by Benjamin, who adds an optional closing parenthesis.
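As a minimal sketch of the two patterns above applied to the example from the question: the named tokens main and digits are added here to match the requested output (they are not part of the patterns as written above), and the variable names are only illustrative.

```matlab
% Example input from the question.
expression = {'abcd_1', 'ghsa(22)', 'gaver_45', 'fadae(8)'};

% Variant 1: the second token may only contain digits and decimal points.
strictPattern = '(?<main>.*?)[\(_](?<digits>[\d\.]*)';
% Variant 2: the second token is anything up to a closing parenthesis.
loosePattern  = '(?<main>.*?)[\(_](?<digits>[^\)]*)';

% With a cell array input, each output is a cell array of structs.
strictOut = regexp(expression, strictPattern, 'names', 'once');
looseOut  = regexp(expression, loosePattern,  'names', 'once');

% Leave validation to STR2DOUBLE: non-numeric content would become NaN.
values = cellfun(@(s) str2double(s.digits), looseOut);
```

With the loose variant, any entry whose second part is not a number shows up as NaN in values, which can then be tested with ISNAN.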
I also asked about how these strings are defined initially, because the context is important. If you are dealing with a reasonable number of cells, performing pattern matching on a cell array will be efficient enough. If, on the contrary, you have e.g. a 1GB file of entries to process, you may be much more efficient working on it "manually". To illustrate, say the file contains
name1_45
name2(45)
name2b_32
name2c(84)
..
then you could load it as a char array, replace all '_', '(', ')', new lines, and carriage returns with white spaces, and extract names and contents in one shot with SSCANF or TEXTSCAN:
% - Dummy file content.
content = sprintf( 'name1_45\nname2(45)\nname2b_32\nname2c(84)\n' ) ;
% - Flag elements to replace.
doReplace = content == '_' | content == '(' | content == ')' | content == 10 ;
% - Replace with space.
content(doReplace) = ' ' ;
% - Parse.
parsed = textscan( content, '%s %f' ) ;
(10 is the ASCII code of the new line \n; code 13, for carriage return, should also be managed. It may be possible to make this even more efficient using BSXFUN.) With that we get
>> parsed
parsed =
{4x1 cell} [4x1 double]
>> parsed{1}
ans =
'name1'
'name2'
'name2b'
'name2c'
>> parsed{2}
ans =
45
45
32
84
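A possible variant of the replacement step that also handles carriage returns (code 13) is sketched below. It swaps the chain of comparisons for ISMEMBER, which is my substitution and not necessarily faster; the dummy content with mixed line endings is made up for illustration.

```matlab
% Dummy content with mixed Windows (\r\n) and Unix (\n) line endings.
content = sprintf('name1_45\r\nname2(45)\r\nname2b_32\nname2c(84)\n');

% Characters to replace: underscore, parentheses, line feed (10), carriage return (13).
separators = ['_()' char(10) char(13)];

% Replace every separator with a space in one logical-indexing step.
content(ismember(content, separators)) = ' ';

% Parse the name/number pairs.
parsed = textscan(content, '%s %f');
names  = parsed{1};
values = parsed{2};
```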
Thanks a lot Cedric!!!


Answers (2)

expression={'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
pattern = '(?<main>[a-zA-Z]+)(?:[_\(])(?<digits>[0-9]+)\)?';
out = regexp(expression,pattern,'once','names');
The pattern breaks down like this:
  • (?<main>[a-zA-Z]+) - A token named "main" with only letters.
  • (?:[_\(]) - A non-capturing group matching either an underscore or "(".
  • (?<digits>[0-9]+) - A token named "digits" with only numbers.
  • \)? - An optional ")" character at the end.
The 'once' option returns only the first match in each input string. I think in this case you can leave it out.
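For reference, here is a short sketch of how the result could be used. It assumes the closing parenthesis in the pattern is meant to be escaped as \)?, and it drops the non-capturing group around the separator since a plain character set behaves the same; the variable names s, mains and digits are only illustrative.

```matlab
expression = {'abcd_1', 'ghsa(22)', 'gaver_45', 'fadae(8)'};
% Assumed form of the pattern, with the trailing parenthesis escaped.
pattern = '(?<main>[a-zA-Z]+)[_\(](?<digits>[0-9]+)\)?';

% Cell array input: one struct per cell, with fields main and digits.
out = regexp(expression, pattern, 'names', 'once');

% Access a single result.
first = out{1};

% Concatenate the structs into a 1x4 struct array and collect each field.
s      = [out{:}];
mains  = {s.main};
digits = str2double({s.digits});
```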

1 Comment

Dear Benjamin,
Thanks for your input. Your solution would work but would probably need to be refined, in the sense that the first part, main, may also include some digits. For instance,
whatever345whatever_100
would also be something I would like to capture. It is the second part that would only include digits.
A potential algorithm would be to say everything before an opening parenthesis or an underscore is to be captured in "main", while everything after an underscore or inside parentheses is to be captured in "digits".
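One way this algorithm might be written as a pattern is sketched below. The anchors ^ and $, the error identifier, and the variable names are my additions. Note that this sketch does not check that an opening parenthesis is paired with a closing one, so something like whatever(45 would also be accepted.

```matlab
expression = {'whatever345whatever_100', 'abcd_1', 'ghsa(22)'};

% main:   one or more characters that are neither '_' nor '('
% digits: one or more digits after the separator, optionally followed by ')'
pattern = '^(?<main>[^_\(]+)[_\(](?<digits>\d+)\)?$';

% Flag the entries that do not follow the expected format.
isValid = ~cellfun(@isempty, regexp(expression, pattern, 'match', 'once'));
if ~all(isValid)
    % Hypothetical error identifier, for illustration only.
    error('parseNames:badFormat', 'Unexpected format: %s', ...
        strjoin(expression(~isValid), ', '));
end

% Extract the named tokens: one struct per cell.
out = regexp(expression, pattern, 'names', 'once');
```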


This isn't the most efficient or elegant solution, but it solves the problem. Let me know if your data is large enough that this code is slow. I can optimize it.
ex={'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
temp=cellfun(@(s)strsplit(s,{'_','(',')'}),ex,'UniformOutput',false);
ex_main=cellfun(@(s)s{1},temp,'UniformOutput',false);
ex_digit=cellfun(@(s)s{2},temp,'UniformOutput',false);
clear temp;

1 Comment

Dear Kirby,
There are many ways to solve this problem and what you are suggesting is definitely one way to do it. However, I would like to use the elegance of regular expressions and get to practice something I am not very good at yet.
In my current solution for instance, I first use regular expressions to transform all the inputs into the same format
whatever_45
then I look for the underscore, etc. But this entails several lines of code.
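Roughly, the two-step approach described here could look like the following sketch. This is only a guess at the idea, not the actual code; the replacement pattern and variable names are assumptions.

```matlab
expression = {'abcd_1', 'ghsa(22)', 'gaver_45', 'fadae(8)'};

% Step 1: rewrite whatever(45) as whatever_45; underscore forms are left unchanged.
normalized = regexprep(expression, '\((\d+)\)$', '_$1');

% Step 2: split each string at the underscore.
parts = regexp(normalized, '_', 'split');

% Collect the two pieces of every entry.
main   = cellfun(@(p) p{1}, parts, 'UniformOutput', false);
digits = cellfun(@(p) p{2}, parts, 'UniformOutput', false);
```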
Thanks for your input!


Asked:

on 16 Sep 2015

Commented:

on 19 Sep 2015
