How to capture tokens using regular expressions?

Question

0 votes

Dear all, I would like to capture two parts of a sequence of strings. I would like to call the first part "main" and the second part "digits". The expressions in the strings have a distinct pattern in that they either have ONE underscore or parentheses. What I am looking to capture is the part before the underscore or the opening parenthesis (main) and the part after the underscore or inside the parenthesis (digits). As an example, the typical exercise will be of the form

 expression={'abcd_1','ghsa(22)','gaver_45','fadae(8)'}
 out=regexp(expression,pattern,'names')

The result should be a cell array where each cell contains a structure with fields "main" and "digits". In the first case, for instance, the result should be

main='abcd' and digits='1'.

What I am missing is the right "pattern". Any suggestions?

5 Comments

Cedric on 17 Sep 2015

Edited: Cedric on 17 Sep 2015

Dear Patrick,

In summary, for extracting and validating digits and a decimal point, I would write a pattern like

'(.*?)[\(_]([\d\.]*)'

which explicitly requires the second part to be zero or more (*) elements of the set ([]) of digits (\d) or decimal points (\.). Yet, if I wanted to leave validation to STR2DOUBLE, I would extract whatever is in parentheses or after the underscore:

'(.*?)[\(_]([^\)]*)'

which translates into zero or more (*) elements that are not in the set ([^]) containing the literal closing parenthesis. Another way is given by Benjamin, where he adds an optional closing parenthesis.
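As a minimal sketch of how the two patterns above behave, the snippet below applies both to the example strings from the question and leaves the numeric validation of the second pattern to STR2DOUBLE. The variable names (strictPattern, loosePattern, tokStrict, tokLoose, values) are only illustrative.

```matlab
% Example strings from the question.
expression = {'abcd_1','ghsa(22)','gaver_45','fadae(8)'};

% Pattern that only accepts digits and decimal points in the second part.
strictPattern = '(.*?)[\(_]([\d\.]*)';
% Pattern that accepts anything up to a closing parenthesis.
loosePattern = '(.*?)[\(_]([^\)]*)';

% With 'tokens' and 'once', each cell holds a 1x2 cell {main, digits}.
tokStrict = regexp(expression, strictPattern, 'tokens', 'once');
tokLoose = regexp(expression, loosePattern, 'tokens', 'once');

% Leave validation to STR2DOUBLE: non-numeric content becomes NaN.
values = cellfun(@(t) str2double(t{2}), tokLoose);
```

For these four inputs both patterns extract the same tokens; they differ only when the second part contains something other than digits or decimal points.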

I also asked about how these strings are defined initially, because the context is important. If you are dealing with a reasonable number of cells, performing pattern matching on a cell array will be efficient enough. If, on the contrary, you have e.g. a 1GB file of entries to process, you may be much more efficient working on it "manually". To illustrate, say the file contains

 name1_45 
 name2(45)
 name2b_32
 name2c(84)
 ..

then you could load it as a char array, replace all '_', '(', ')', new lines, and carriage returns with white spaces, and extract names and contents in one shot with SSCANF or TEXTSCAN:

 % - Dummy file content.
 content = sprintf( 'name1_45\nname2(45)\nname2b_32\nname2c(84)\n' ) ;
 % - Flag elements to replace.
 doReplace = content == '_' | content == '(' | content == ')' | content == 10 ;
 % - Replace with space.
 content(doReplace) = ' ' ;
 % - Parse.
 parsed = textscan( content, '%s %f' ) ;

(10 is the ASCII code of the new line \n; the carriage return, code 13, should be handled the same way. It may be possible to make this even more efficient using BSXFUN.) With that we get

 >> parsed
 parsed = 
    {4x1 cell}    [4x1 double]
 >> parsed{1}
 ans = 
    'name1'
    'name2'
    'name2b'
    'name2c'
 >> parsed{2}
 ans =
    45
    45
    32
    84
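The carriage-return case mentioned above could be handled along these lines. This is only a sketch: it assumes Windows-style line endings in the dummy content, and it uses ISMEMBER to flag the characters instead of chained comparisons, which is a stylistic substitution rather than the approach shown above.

```matlab
% Dummy file content with \r\n line endings (an assumption for illustration).
content = sprintf('name1_45\r\nname2(45)\r\nname2b_32\r\nname2c(84)\r\n');

% Flag separators, new lines (10), and carriage returns (13) in one call.
doReplace = ismember(content, ['_()' char(10) char(13)]);

% Replace flagged characters with spaces, then parse name/number pairs.
content(doReplace) = ' ';
parsed = textscan(content, '%s %f');
```

As before, parsed{1} holds the names and parsed{2} holds the numeric values.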

Patrick Mboma on 19 Sep 2015

Thanks a lot Cedric!!!

Cedric on 19 Sep 2015

My pleasure!

Answer 1

Benjamin Kraus on 16 Sep 2015

3 votes

expression={'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
pattern = '(?<main>[a-zA-Z]+)(?:[_\(])(?<digits>[0-9]+)\)?';
out = regexp(expression,pattern,'once','names');

The pattern breaks down like this:

(?<main>[a-zA-Z]+) - A token named "main" with only letters.
(?:[_\(]) - A non-capturing group containing either an underscore or "(".
(?<digits>[0-9]+) - A token named "digits" with only numbers.
\)? - An optional ")" character at the end.

The 'once' option means the pattern is matched only once per input string. I think in this case you can leave it out.

1 Comment

Patrick Mboma on 17 Sep 2015

Dear Benjamin,

Thanks for your input. Your solution works for the example, but it needs to be refined because the first part, main, may also include some digits. For instance,

whatever345whatever_100

would also be something I would like to capture. It is the second part that would only include digits.

A potential algorithm would be to say everything before an opening parenthesis or an underscore is to be captured in "main", while everything after an underscore or inside parentheses is to be captured in "digits".
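One way to express that algorithm with named tokens is sketched below. The pattern is an illustration, not one posted in the thread: it assumes "main" may contain any character except an underscore or an opening parenthesis, and that "digits" contains only digits.

```matlab
% Example strings, including one whose "main" part contains digits.
expression = {'abcd_1','ghsa(22)','gaver_45','fadae(8)','whatever345whatever_100'};

% main:   one or more characters that are neither '_' nor '('.
% digits: one or more digits after '_' or '(', with an optional closing ')'.
pattern = '(?<main>[^_\(]+)[_\(](?<digits>\d+)\)?';

% With 'once', each cell of out holds a scalar struct with fields main and digits.
out = regexp(expression, pattern, 'names', 'once');

% Optionally gather the scalar structs into a 1x5 struct array.
s = [out{:}];
```

Here out{5}.main is 'whatever345whatever' and out{5}.digits is '100'.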

Answer 2

Kirby Fears on 16 Sep 2015

0 votes

This isn't the most efficient or elegant solution, but it solves the problem. Let me know if your data is large enough that this code is slow. I can optimize it.

ex={'abcd_1','ghsa(22)','gaver_45','fadae(8)'};
temp=cellfun(@(s)strsplit(s,{'_','(',')'}),ex,'UniformOutput',false);
ex_main=cellfun(@(s)s{1},temp,'UniformOutput',false);
ex_digit=cellfun(@(s)s{2},temp,'UniformOutput',false);
clear temp;

1 Comment

Patrick Mboma on 17 Sep 2015

Dear Kirby,

There are many ways to solve this problem and what you are suggesting is definitely one way to do it. However, I would like to use the elegance of regular expressions and get to practice something I am not very good at yet.

In my current solution for instance, I first use regular expressions to transform all the inputs into the same format

whatever_45

then I look for the underscore, etc. But this entails several lines of code.
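The exact expressions used in that two-step approach are not shown, so the following is only a guess at what it might look like: REGEXPREP rewrites the parenthesized form into the underscore form, and REGEXP then splits at the underscore. It assumes each string contains a single underscore or a single trailing parenthesized number.

```matlab
ex = {'abcd_1','ghsa(22)','gaver_45','fadae(8)'};

% Step 1: turn a trailing '(digits)' into '_digits', e.g. 'ghsa(22)' -> 'ghsa_22'.
normalized = regexprep(ex, '\((\d+)\)$', '_$1');

% Step 2: split each normalized string at its underscore into {main, digits}.
parts = regexp(normalized, '_', 'split', 'once');
```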

Thanks for your input!
