Alternative to regex for finding lines that contain consecutive repeated characters

rogox on 15 Jul 2021
Answered: Walter Roberson on 15 Jul 2021
Currently I have a very large block of data that looks like this (with very many lines):
>sp ASD123 OSD12_MOUSE Protein OSD12 OS=Mus musculus OX=10090 GN=OSD12 PE=1 SV=1
>sp UISMAA PUD22_MOUSE random words PUD22 OS=Mus musculus OX=10090 GN=SUM23 PE=1 SV=1
I am trying to make a script that finds strings of ten or more consecutive E/D characters, like in the first block of data in the section above. Basically I am asking for an alternative to regex, as I have not found any way to write a regex pattern that does this. I want to know which lines in the large text file contain the consecutive characters. Really just looking for an alternative to regex, if anyone has any good suggestions. This is part of the code I was using before.
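inp = {''};                        % cell array of input lines (left empty here)
form = '[de]{10,}';                % ten or more consecutive d/e characters (lower case only)
calc = regexp(inp,form,'match');   % one cell of matched runs per input line
idx = cellfun(@(c)any(cellfun(@numel,c)>=10),calc);   % >=10 so that runs of exactly ten are kept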

Accepted Answer

Walter Roberson on 15 Jul 2021
%alternative without splitting
S = fileread(filename);
matches = regexp(S, '^.*[DE]{10}.*$', 'match', 'dotexceptnewline', 'lineanchors');
This is the entire code other than setting the file name. It does not split the file, so your objection to 20000 strings is avoided. It produces the lines directly without any post-processing cellfun. It is quite efficient. It has been tested.
It was also already posted in your earlier question, with the only difference being DE vs de.
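
Since the question also asks which lines the runs were found on, here is a minimal sketch of one way to recover line numbers from the same unsplit approach. It is not part of the original answer. It assumes the file uses newline (char(10)) line endings, and the sample text in S is made up for illustration and stands in for fileread(filename).

```matlab
% Hypothetical sample text standing in for S = fileread(filename);
S = sprintf('>sp A1 first\nMKEEDDEEDDEEK\n>sp B2 second\nMKEDK\nDDDDDDDDDD\n');

% Same pattern as above, but also request the start index of each matching line.
[starts, matches] = regexp(S, '^.*[DE]{10}.*$', 'start', 'match', ...
    'dotexceptnewline', 'lineanchors');

% Running count of newline characters up to each position in S.
cumNewlines = cumsum(S == char(10));

% A line's number is one more than the number of newlines before its first character.
lineNumbers = 1 + cumNewlines(starts);
```

For the sample text, lineNumbers comes out as [2 5] and matches holds the text of those two lines. If the file has Windows line endings, the matched lines would likely keep a trailing carriage return, which strtrim could remove.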

