help with regexpi expression match
I have a question about matching with regexpi, which may be an easy one (though not for me). I have a set of strings in a single cell. An example is as follows:
d2 = {'chromosome 1:NC_011985.1/CP000628.1; chromosome 2:NC_011983.1/CP000629.1; plasmid pAgK84:NC_011994.1/CP000632.1; plasmid pAtK84b:NC_011990.1/CP000630.1; plasmid pAtK84c:NC_011987.1/CP000631.1 chromosome Mycobacterium_bovis_AF2122/97:NC_002945.4/LT708304.1'};
I would like a regexpi expression to pick up the two names that follow "chromosome", which are found after the ":" and the "/". For example, I want to match NC_011985.1, CP000628.1, NC_011983.1, CP000629.1, NC_002945.4 and LT708304.1, but I want to ignore the names that follow "plasmid". I chose this large string as an example because I want the names to be preceded by the word "chromosome". However, as you can see, after the word "chromosome" there may be a number, a word or even nothing, followed by a colon ":", a name (that we want to keep), and then another name that follows "/" (that we also want to keep). Keeping all the names in one cell is fine; I just want to pick up these names.
I started from the following code:
accession6 = regexpi(d2,'(?<=:)\w+','match');
Using this as a base, I do not know how to require that the match be preceded by the word "chromosome", followed by an optional number, word or nothing, without messing it up. That part would have to come before the necessary ":" and "/" parts of the expression that precede the names we want to keep.
Any help would be super appreciated.
1 Comment
Stephen23
on 4 Dec 2017
Edited: Stephen23
on 4 Dec 2017
"Any help would be super appreciated"
You might like to download my FEX submission iregexp, an interactive regular expression tool.
It lets you quickly experiment with different regular expressions and shows all of regexp's outputs in real-time as you type.
Answers (1)
per isakson
on 4 Dec 2017
Edited: per isakson
on 4 Dec 2017
One expression:
- 'chromosome' followed by one or more characters other than ':', and then one ':'
- a capturing group of one or more letters, digits, underscores, or '.' (greedy)
- zero or more characters other than '/', and then one '/'
- a capturing group of one or more letters, digits, underscores, or '.' (greedy)
This is repeated until no more matches are found.
>> cac = regexpi( d2, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens' );
>> cac{:}{:}
ans =
'NC_011985.1' 'CP000628.1'
ans =
'NC_011983.1' 'CP000629.1'
ans =
'NC_002945.4' 'LT708304.1'
>>
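If d2 contains only one string, pass the string itself rather than the cell, which removes one level of nesting from the result: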
>> cac = regexpi( d2{:}, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens' );
>> cac
cac =
{1x2 cell} {1x2 cell} {1x2 cell}
>>
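Since the question says that keeping all the names in one cell is fine, the nested tokens can be flattened afterwards. The following is only a minimal sketch that reuses d2 and the pattern from this answer; the variable name names is illustrative and not part of the original code.

```matlab
% Example string from the question.
d2 = {'chromosome 1:NC_011985.1/CP000628.1; chromosome 2:NC_011983.1/CP000629.1; plasmid pAgK84:NC_011994.1/CP000632.1; plasmid pAtK84b:NC_011990.1/CP000630.1; plasmid pAtK84c:NC_011987.1/CP000631.1 chromosome Mycobacterium_bovis_AF2122/97:NC_002945.4/LT708304.1'};

% One 1x2 cell of tokens per 'chromosome' match.
cac = regexpi( d2{:}, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens' );

% Concatenate the 1x2 cells horizontally into a single 1x6 cell row.
names = [ cac{:} ];
```

Here names should hold the six accession names in the order they appear in the string, with the plasmid names left out.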
6 Comments
per isakson
on 4 Dec 2017
Edited: per isakson
on 4 Dec 2017
And an alternative that uses @JM's approach. In a first step match "name slash name" between
- look-behind: (?<=chromosome[^:]+[:])
- look-ahead: (?=;|$)
and in a second step split the two names at slash
cac = regexpi( d2{:}, '(?<=chromosome[^:]+[:])[\w\.]+[/][\w\.]+(?=;|$)', 'match' );
cac = regexp( cac, '/', 'split' );
cac{:}
ans =
'NC_011985.1' 'CP000628.1'
ans =
'NC_011983.1' 'CP000629.1'
ans =
'NC_002945.4' 'LT708304.1'
>>
per isakson
on 4 Dec 2017
chromosome[^:;]+ with a semicolon in the character class (proposed by @Guillaume) is better than chromosome[^:]+ without it, because the latter will return a plasmid name pair if the colon after 'chromosome' is missing from the string. With the semicolon, that entry is simply missed altogether and no wrong pair is returned for it.
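The difference can be seen on a small made-up string. This is only a sketch: the input str and the names loose and strict are hypothetical, and the rest of the pattern is copied from the answer above.

```matlab
% Hypothetical input: the colon after 'chromosome 1' is missing.
str = 'chromosome 1; plasmid pX:NC_000001.1/CP000001.1';

% Without ';' in the class, [^:]+ runs past the semicolon up to the plasmid's colon.
loose  = regexpi( str, 'chromosome[^:]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens' );

% With ';' in the class, the match cannot cross into the next entry.
strict = regexpi( str, 'chromosome[^:;]+[:]([\w\.]+)[^/]*[/]([\w\.]+)', 'tokens' );
```

With this input, loose contains the plasmid's pair, while strict is empty.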