Alternative to regex for finding lines that contain consecutively repeated characters

Currently I have a very large block of data (many entries) that looks like this:
>sp ASD123 OSD12_MOUSE Protein OSD12 OS=Mus musculus OX=10090 GN=OSD12 PE=1 SV=1
MSVRTLPLLFLNLGGEMLYVLDQRLRAQNIPGDKARKVLNDIISTMFNRKFMDELFKPQE
LYSKKALRTVYDRLAHASIMRLNQASMDKLYDLMTMAFKYQVLLCPRPKDVLLVTFNHLD
AIKGFVQDSPTVIHQVDETFRQLSEVEEEEDDEDEDEEEFF
>sp UISMAA PUD22_MOUSE random words PUD22 OS=Mus musculus OX=10090 GN=SUM23 PE=1 SV=1
MDPEVSLLLLCPLGGLSQEQVAVELSPAHDRRPLPGGDKAITAIWETRQQAQPWIFDAPK
FRLHSATLVSSSPEPQLLLHLGLTSYRDFLGTNWSSSASWLRQQGAADWGDKQAYLADPL
GVGAALVTADDFLVFLRRSQQVAEAPGLVDV
I am trying to write a script that finds runs of ten or more consecutive E/D characters, like the one in the first block of data above. I have not found a way to write a regex pattern that does this, so I am looking for an alternative to regex, if anyone has any good suggestions. I want to know which lines in the large text file contain the consecutive characters. This is part of the code I was using before.
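inp = {''};  % cell array of lines to search (contents omitted here)
form = '[DE]{10,}';  % uppercase to match the data: a run of 10 or more D/E
calc = regexp(inp,form,'match');
% flag cells that have at least one match of length 10 or more
idx = cellfun(@(c)any(cellfun(@numel,c)>=10),calc);
find(idx)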
  5 Comments
Rik on 19 May 2023
I recovered the removed content from the Google cache (something which anyone can do). Editing away your question is very rude. Someone spent time reading your question, understanding your issue, figuring out the solution, and writing an answer.
You chose to publish the contents of your question. You can't retract that now.


Accepted Answer

Walter Roberson on 15 Jul 2021
%alternative without splitting
S = fileread(filename);
matches = regexp(S, '^.*[DE]{10}.*$', 'match', 'dotexceptnewline', 'lineanchors');
matches
This is the entire code other than setting the file name. It does not split the file, so your objection to 20000 strings is avoided. It produces the lines directly without any post-processing cellfun. It is quite efficient. It has been tested.
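If the line numbers are wanted rather than the line text, one possible variation is to ask regexp for the start index of each run and count the newlines that precede it. This is only a sketch: the sample text and the variable names (runStarts, newlinePos, lineNumbers) are made up for illustration, and in practice S would come from fileread as above.

```matlab
% Made-up sample text standing in for the file contents (assumption).
S = sprintf('>sp header one\nMSVRTLPLLF\nSEVEEEEDDEDEDEEEFF\n>sp header two\nGVGAALVTADDF\n');

% Start index of each run of at least 10 consecutive D or E characters.
runStarts = regexp(S, '[DE]{10,}', 'start');

% A character's line number is 1 plus the number of newlines before it.
newlinePos = find(S == newline);
lineNumbers = unique(arrayfun(@(p) 1 + nnz(newlinePos < p), runStarts));

disp(lineNumbers)   % the run is on line 3 of the sample text
```

The call to unique is there so that a line holding two separate runs is reported only once.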
This code was also already posted in your earlier question, the only difference being DE vs de.
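For completeness, here is a rough sketch of one way to do the same search with no regular expression at all, since that is what the question asked for. It marks every D/E character with ismember, measures the runs with diff, and maps each long run back to a line number. The sample text and all variable names are assumptions for illustration, and this has not been timed against the regexp version.

```matlab
% Made-up sample text standing in for the file contents (assumption).
S = sprintf('>sp header one\nMSVRTLPLLF\nSEVEEEEDDEDEDEEEFF\n>sp header two\nGVGAALVTADDF\n');

% Logical mask of characters that are D or E.
isDE = ismember(S, 'DE');

% Run boundaries: +1 where a run begins, -1 just past where it ends.
edges = diff([0, isDE, 0]);
runStart = find(edges == 1);
runEnd = find(edges == -1) - 1;
runLength = runEnd - runStart + 1;

% Keep only runs of 10 or more characters.
longStart = runStart(runLength >= 10);

% Line number of every character: 1 plus the newlines seen so far.
% Newlines are never D or E, so a run cannot span two lines.
lineOfChar = 1 + cumsum(S == newline);
lineNumbers = unique(lineOfChar(longStart));

disp(lineNumbers)   % the run is on line 3 of the sample text
```

Like the accepted code, this works on the whole file as a single character vector, so it does not split the text into thousands of separate strings.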

Release

R2021a
