help with regexpi expression match

Question

J M on 4 Dec 2017


https://in.mathworks.com/matlabcentral/answers/370829-help-with-regexpi-expression-match

Edited: per isakson on 4 Dec 2017

I have a question about matching with a regexpi expression, which may be an easy one (though not for me). I have a set of strings stored in a single cell. An example is as follows:

d2 = {'chromosome 1:NC_011985.1/CP000628.1; chromosome 2:NC_011983.1/CP000629.1; plasmid pAgK84:NC_011994.1/CP000632.1; plasmid pAtK84b:NC_011990.1/CP000630.1; plasmid pAtK84c:NC_011987.1/CP000631.1 chromosome Mycobacterium_bovis_AF2122/97:NC_002945.4/LT708304.1'};

I would like a regexpi expression to pick up the two names that follow "chromosome", which are found after the ":" and the "/". For example, I want to match NC_011985.1, CP000628.1, NC_011983.1, CP000629.1, NC_002945.4 and LT708304.1, but ignore the names that follow "plasmid". I chose this large string as an example because I want the names to be preceded by the word "chromosome". However, as you can see, the word "chromosome" may be followed by a number, a word or even nothing, then a colon ":", a name (that we want to keep), a "/" and another name (that we also want to keep). Keeping all the names in one cell is fine; I just want to pick up these names.

If I use the following code:

accession6 = regexpi(d2,'(?<=:)\w+','match');

Using this as a base, I do not know how to require that the match be preceded by the word "chromosome", followed by an optional number, a word or nothing at all, without messing it up. That part would have to come before the necessary ":" and "/" parts of the expression, which precede the names we want to keep.

Any help would be super appreciated.

1 Comment

Stephen23 on 4 Dec 2017

Edited: Stephen23 on 4 Dec 2017

"Any help would be super appreciated"

You might like to download my FEX submission iregexp, an interactive regular expression tool:

https://www.mathworks.com/matlabcentral/fileexchange/48930-interactive-regular-expression-tool

It lets you quickly experiment with different regular expressions and shows all of regexp's outputs in real-time as you type.


Answer 1

per isakson on 4 Dec 2017


https://in.mathworks.com/matlabcentral/answers/370829-help-with-regexpi-expression-match#answer_294533

Edited: per isakson on 4 Dec 2017


One expression

'chromosome' followed by one or more characters other than ':' and then one ':'
capturing group of one or more letters, digits, underscores or '.' (greedy)
zero or more characters other than '/' and then one '/'
capturing group of one or more letters, digits, underscores or '.' (greedy)

And repeat until no more matches are found

>> cac = regexpi( d2, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens' );
>> cac{:}{:}
ans = 
    'NC_011985.1'    'CP000628.1'
ans = 
    'NC_011983.1'    'CP000629.1'
ans = 
    'NC_002945.4'    'LT708304.1'
>>

If d2 contains one string

>> cac = regexpi( d2{:}, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens' );
>> cac
cac = 
    {1x2 cell}    {1x2 cell}    {1x2 cell}
>>
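Since the question says that keeping all the names in one cell is fine, the nested tokens can be flattened afterwards. The following is only a sketch built on the expression above; the variable names expr, names and perString are my own and not part of the original answer.

```matlab
% Sample input copied from the question.
d2 = {'chromosome 1:NC_011985.1/CP000628.1; chromosome 2:NC_011983.1/CP000629.1; plasmid pAgK84:NC_011994.1/CP000632.1; plasmid pAtK84b:NC_011990.1/CP000630.1; plasmid pAtK84c:NC_011987.1/CP000631.1 chromosome Mycobacterium_bovis_AF2122/97:NC_002945.4/LT708304.1'};

% Same expression as in the answer above.
expr = 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)';

% One string: cac is a 1x3 cell whose elements are 1x2 cells of tokens.
cac = regexpi( d2{:}, expr, 'tokens' );

% Concatenate the 1x2 cells side by side into a single 1x6 cell row.
names = [cac{:}];

% Cell array of strings: flatten the tokens of each string separately.
perString = cellfun( @(c) [c{:}], regexpi( d2, expr, 'tokens' ), ...
                     'UniformOutput', false );
```

Here names holds the six accession names in the order they appear in the string, and perString{1} holds the same six names for the first (and only) string in d2.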

6 Comments

per isakson on 4 Dec 2017

Edited: per isakson on 4 Dec 2017


And an alternative that uses @JM's approach. In a first step match "name slash name" between

look-behind: (?<=chromosome[^:]+[:])
look-ahead: (?=;|$)

and in a second step split the two names at slash

cac = regexpi( d2{:}, '(?<=chromosome[^:]+[:])[\w\.]+[/][\w\.]+(?=;|$)', 'match' );
cac = regexp( cac, '/', 'split' );
cac{:}
ans = 
    'NC_011985.1'    'CP000628.1'
ans = 
    'NC_011983.1'    'CP000629.1'
ans = 
    'NC_002945.4'    'LT708304.1'
>>

per isakson on 4 Dec 2017

chromosome[^:;] with a semicolon (proposed by @Guillaume) is better than chromosome[^:] without one, because the latter will return a plasmid name pair if the colon after 'chromosome' is missing from the string. With the semicolon, that malformed entry is skipped altogether.
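The difference can be checked on a small test case. This is a sketch under stated assumptions: both input strings are hypothetical and do not come from the question, and the variant expression is my own combination of the [^:;] class with a * quantifier, so that nothing at all between 'chromosome' and ':' is also accepted (a case the question mentions, and which [^:]+ rejects because it needs at least one character).

```matlab
% Hypothetical inputs, not taken from the question.
noColon = 'chromosome NC_000001.1/CP000001.1; plasmid pX:NC_000002.1/CP000002.1';
noLabel = 'chromosome:NC_000003.1/CP000003.1';

% Expression from the answer, and an assumed variant of it.
original = 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)';
variant  = 'chromosome[^:;]*[:]([\w\.]+)[^/]*[/]([\w\.]+)';

a = regexpi( noColon, original, 'tokens' );  % runs past ';' and picks up the plasmid pair
b = regexpi( noColon, variant,  'tokens' );  % stops at ';', so there is no match
c = regexpi( noLabel, original, 'tokens' );  % no match: [^:]+ needs at least one character
d = regexpi( noLabel, variant,  'tokens' );  % matches the chromosome pair

% Number of matches found in each case.
fprintf( '%d %d %d %d\n', numel(a), numel(b), numel(c), numel(d) );  % prints 1 0 0 1
```

On the string from the question, both expressions return the same three pairs; they only differ on inputs like the two above.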
