Alternative for regex to find line that characters are repeated on consecutively.

Question

N/A on 15 Jul 2021


Direct link to this question

https://in.mathworks.com/matlabcentral/answers/879103-alternative-for-regex-to-find-line-that-characters-are-repeated-on-consecutively

Commented: Rik on 19 May 2023

Currently I have a very large block of data that looks like this (there are very many records like these):

>sp ASD123 OSD12_MOUSE Protein OSD12 OS=Mus musculus OX=10090 GN=OSD12 PE=1 SV=1
MSVRTLPLLFLNLGGEMLYVLDQRLRAQNIPGDKARKVLNDIISTMFNRKFMDELFKPQE
LYSKKALRTVYDRLAHASIMRLNQASMDKLYDLMTMAFKYQVLLCPRPKDVLLVTFNHLD
AIKGFVQDSPTVIHQVDETFRQLSEVEEEEDDEDEDEEEFF
>sp UISMAA PUD22_MOUSE random words PUD22 OS=Mus musculus OX=10090 GN=SUM23 PE=1 SV=1
MDPEVSLLLLCPLGGLSQEQVAVELSPAHDRRPLPGGDKAITAIWETRQQAQPWIFDAPK
FRLHSATLVSSSPEPQLLLHLGLTSYRDFLGTNWSSSASWLRQQGAADWGDKQAYLADPL
GVGAALVTADDFLVFLRRSQQVAEAPGLVDV

I am trying to make a script that finds strings of ten or more consecutive E/D characters, like the one at the end of the first block of data in the section above. Basically I am asking for an alternative to regex, as I have not found any way to make a pattern to do so on regex. I also want to know which lines of the large text file the consecutive characters were found on. If anyone has a good suggestion for an alternative to regex, I would appreciate it. This is part of the code I was using before:

inp = {''};
form = '[de]{10,}';
calc = regexp(inp,form,'match');
idx = cellfun(@(c)any(cellfun(@numel,c)>10),calc);
find(idx)

5 Comments

Stephen23 on 15 Jul 2021

Edited: Stephen23 on 15 Jul 2021

"...I have not found any way to make a pattern to do so on regex"

What is the specific problem with that regular expression? It works for me, when adjusted for the characters you want:

inp = {'>sp ASD123 OSD12_MOUSE Protein OSD12 OS=Mus musculus OX=10090 GN=OSD12 PE=1 SV=1 MSVRTLPLLFLNLGGEMLYVLDQRLRAQNIPGDKARKVLNDIISTMFNRKFMDELFKPQELYSKKALRTVYDRLAHASIMRLNQASMDKLYDLMTMAFKYQVLLCPRPKDVLLVTFNHLDAIKGFVQDSPTVIHQVDETFRQLSEVEEEEDDEDEDEEEFF';...
'>sp UISMAA PUD22_MOUSE random words PUD22 OS=Mus musculus OX=10090 GN=SUM23 PE=1 SV=1 MDPEVSLLLLCPLGGLSQEQVAVELSPAHDRRPLPGGDKAITAIWETRQQAQPWIFDAPKFRLHSATLVSSSPEPQLLLHLGLTSYRDFLGTNWSSSASWLRQQGAADWGDKQAYLADPLGVGAALVTADDFLVFLRRSQQVAEAPGLVDV'};
rgx = '[DE]{10,}';
tmp = regexp(inp,rgx,'once');
idx = ~cellfun(@isempty,tmp)
idx = 2×1 logical array
   1
   0

Did you notice that your regular expression matches the lowercase characters 'd' and 'e', although the data you want to match consists of the uppercase characters 'D' and 'E'? Did you attempt to match the correct character case, or to use REGEXPI?
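
As a small illustration of the case-sensitivity point, the sketch below keeps the lowercase character class from the question and makes the match case-insensitive instead. The two sample strings are made up for the demo and are not taken from the original file.

```matlab
% Made-up sample strings: the first ends with a run of 13 D/E characters,
% the second contains only a run of 2.
inp = {'SEVEEEEDDEDEDEEEFF'; 'VTADDFLVFL'};

% Option 1: REGEXPI ignores case, so the lowercase class still matches.
tmp = regexpi(inp, '[de]{10,}', 'once');
idx = ~cellfun(@isempty, tmp);

% Option 2: REGEXP with the 'ignorecase' option behaves the same way.
tmp2 = regexp(inp, '[de]{10,}', 'once', 'ignorecase');
idx2 = ~cellfun(@isempty, tmp2);
```

Either option avoids having to rewrite the pattern, although simply writing the class as '[DE]' is the more direct fix when the data is known to be uppercase.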

Stephen23 on 15 Jul 2021

You can loop over blocks of the lines using TEXTSCAN: did you try that?
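
A minimal sketch of what looping over blocks with TEXTSCAN could look like is given below. The demo file, its contents, the block size, and the variable names are all assumptions made for illustration. The reported line numbers also assume the file contains no blank lines, since how blank lines are returned has not been checked here.

```matlab
% Hypothetical demo file so the sketch is self-contained.
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
fprintf(fid, '%s\n', '>sp ASD123 header', 'MSVRTLPLLF', ...
    'SEVEEEEDDEDEDEEEFF', '>sp UISMAA header', 'VTADDFLVFL');
fclose(fid);

blockSize = 2;   % lines per block (tiny for the demo; use thousands for a real file)
hits = [];       % line numbers that contain a run of 10 or more D/E
offset = 0;      % number of lines consumed so far

fid = fopen(filename, 'r');
while ~feof(fid)
    % Read up to blockSize lines, one string per line.
    C = textscan(fid, '%s', blockSize, 'Delimiter', '\n', 'Whitespace', '');
    block = C{1};
    if isempty(block), break; end
    found = ~cellfun(@isempty, regexp(block, '[DE]{10,}', 'once'));
    hits = [hits; offset + find(found)]; %#ok<AGROW>
    offset = offset + numel(block);
end
fclose(fid);
delete(filename);
```

Only one block of lines is held in memory at a time, which is the point of reading in blocks rather than splitting the whole file into one large cell array.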

Rik on 19 May 2023

I recovered the removed content from the Google cache (something which anyone can do). Editing away your question is very rude. Someone spent time reading your question, understanding your issue, figuring out the solution, and writing an answer.

You chose to publish the contents of your question. You can't retract that now.


Answer 1

Walter Roberson on 15 Jul 2021


Direct link to this answer

https://in.mathworks.com/matlabcentral/answers/879103-alternative-for-regex-to-find-line-that-characters-are-repeated-on-consecutively#answer_747433

%alternative without splitting
S = fileread(filename);
matches = regexp(S, '^.*[DE]{10}.*$', 'match', 'dotexceptnewline', 'lineanchors');
matches

This is the entire code other than setting the file name. It does not split the file, so your objection to 20000 strings is avoided. It produces the lines directly without any post-processing cellfun. It is quite efficient. It has been tested.
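
The code above returns the matching lines themselves. Since the question also asks which lines they are, one possible way to get line numbers is sketched here. It does split the text into lines, so it is a variation on the answer rather than the answer's own method, and S is built from the first sample record instead of being read with fileread.

```matlab
% Stand-in for S = fileread(filename): the first record from the question.
S = sprintf('%s\n', ...
    '>sp ASD123 OSD12_MOUSE Protein OSD12 OS=Mus musculus OX=10090 GN=OSD12 PE=1 SV=1', ...
    'MSVRTLPLLFLNLGGEMLYVLDQRLRAQNIPGDKARKVLNDIISTMFNRKFMDELFKPQE', ...
    'LYSKKALRTVYDRLAHASIMRLNQASMDKLYDLMTMAFKYQVLLCPRPKDVLLVTFNHLD', ...
    'AIKGFVQDSPTVIHQVDETFRQLSEVEEEEDDEDEDEEEFF');

% Split on LF or CRLF, then test each line for a run of 10 or more D/E.
lines = regexp(S, '\r?\n', 'split');
found = ~cellfun(@isempty, regexp(lines, '[DE]{10,}', 'once'));
hits = find(found);   % line numbers within S
```

For this record the run of D/E characters sits at the end of the fourth line, so that is the line the search should report.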

It was also already posted in your earlier question, with the only difference being DE vs de.
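
For completeness, since the question asks specifically for something other than regex, here is a hedged sketch of a regex-free approach that uses ISMEMBER and DIFF to measure run lengths. It is not part of the answer above. The sample lines are taken or shortened from the question, and the variable names are illustrative.

```matlab
% Sample lines (the header is shortened; the sequence lines are from the question).
lines = {'>sp ASD123 OSD12_MOUSE Protein OSD12', ...
         'MSVRTLPLLFLNLGGEMLYVLDQRLRAQNIPGDKARKVLNDIISTMFNRKFMDELFKPQE', ...
         'AIKGFVQDSPTVIHQVDETFRQLSEVEEEEDDEDEDEEEFF', ...
         'GVGAALVTADDFLVFLRRSQQVAEAPGLVDV'};
minRun = 10;                       % shortest run of D/E that counts as a hit

hasRun = false(size(lines));
for k = 1:numel(lines)
    mask = ismember(lines{k}, 'DE');   % true where the character is D or E
    edges = diff([0, mask, 0]);        % +1 at each run start, -1 just past each run end
    runLen = find(edges == -1) - find(edges == 1);   % length of every run
    hasRun(k) = any(runLen >= minRun);
end
hits = find(hasRun);               % indices of lines with a long enough run
```

This does the same job as the pattern '[DE]{10,}' and is unlikely to be faster than the regex on a large file, so it is mainly useful when a regex is undesirable for other reasons.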
