How to inclusively extract rows of a large cell array between cells given start and end patterns?

Hello Folks,
I am searching for the most efficient method to parse a large text file (typically 2-4 GB) for occurrences of a message. I have to search ~100 large files for dozens of messages, so efficiency will be quite significant. I have attached a sample_input.txt with two occurrences of the message specified in the considerations below.
Considerations:
1) start of the message is: 'Hello_Message.pdf'
2) end of the message is: '&&&'
3) store all lines of each occurrence of the message in an array within a structure
4) all messages have a header pattern '.*\.[a-zA-Z]{3}\n\r' and end with pattern '&&&\n\r'
5) I hope to avoid for loops by filtering with a function such as extractBetween, contains, regexpPattern, or other function(s)
The code below does not work but hopefully it provides an idea of what I was thinking...
clear
close all
clc
Input_fid = fopen('sample_input.txt');
ftext = textscan(Input_fid,'%s','Delimiter','\n\r');
fclose(Input_fid);
% I want to inclusively capture the start of the message 'Hello_Message.pdf' and the end
% of the message '&&&' along with all rows between the start and end of each occurrence
% of the message
for check = 1:height(ftext{1})
    HelloMsgs.Occurrences(check) = extractBetween(ftext{1},regexpPattern('Hello_Message.pdf.*\n\r'),regexpPattern('&&&\n\r'));
end
Desired Output:
HelloMsgs.Occurrences(1) <--- cell array of all lines of first occurrence of the Hello_Message in its own row cell
HelloMsgs.Occurrences(2) <--- cell array of all lines of second occurrence of the Hello_Message in its own row cell
HelloMsgs.Occurrences(3) <--- cell array of all lines of third occurrence of the Hello_Message in its own row cell
Thank you in advance for your time. I am new to posting a coding question in a forum so hopefully I explained
the problem well enough.
Jude on 18 Oct 2023
Hi Star Strider,
Thank you very much for your time and patience with me. Looks like I could have done better with how I explained the problem. I am reviewing your solution.
Star Strider on 18 Oct 2023
Thank you.
I substituted extractBetween for extractBefore since that gives the appropriate result in my ‘Extract’ cell array.


Accepted Answer

Star Strider on 18 Oct 2023
Edited: Star Strider on 18 Oct 2023
type('sample_input.txt')

Hello_Message.pdf
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
 
 
 
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh

Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds
&&&



Hello_Message.txt
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
 
 
 
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh

Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> thisdata</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds
&&&

Bye_Message.pdf
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
 
 
 
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh

Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> sadfsdfdsfasdf</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds
&&&






Hello_Message.pdf
2341234342 3214234 ert
2341234342 3214234 abc
2341234342 3214234
Some_ting
23453425
Blah_bleh
Sadf_5
Ouch 4
TEST
 
 
 
Asdff: sdf_sdf
Is_sdf: asdf
IS_ssg: sadf
NJ_T: adfgh

Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> iron </Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds\
&&&
fidi = fopen('sample_input.txt','rt');
fidi = 3
k = 1;
while ~feof(fidi)
    Line{k,:} = fgetl(fidi);
    k = k+1;
end
fclose(fidi);
k
k = 92
% Line
Line = 91×1 cell array
    {0×0 char}
    {'Hello_Message.pdf'}
    {'2341234342 3214234 ert'}
    {'2341234342 3214234 abc'}
    {'2341234342 3214234'}
    {'Some_ting'}
    {'23453425'}
    {'Blah_bleh'}
    {'Sadf_5'}
    {'Ouch 4'}
    {'TEST'}
    {' '}
    {' '}
    {' '}
    {'Asdff: sdf_sdf'}
    {'Is_sdf: asdf'}
    {'IS_ssg: sadf'}
    {'NJ_T: adfgh'}
    {0×0 char}
    {'Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds'}
    {'&&&'}
    {0×0 char}
    {0×0 char}
    {0×0 char}
    {'Hello_Message.txt'}
    {'2341234342 3214234 ert'}
    {'2341234342 3214234 abc'}
    {'2341234342 3214234'}
    {'Some_ting'}
    {'23453425'}
for k1 = 1:k-1
    if ~isempty(Line{k1,:})
        % Flag header lines: text between '_' and '.' that contains 'Message'
        Lc = strfind(extractBetween(Line{k1,:},'_','.'),'Message');
        if ~isempty(Lc)
            Start(k1) = 1;
            % sprintf('Start = %2d',k1)
        end
        % Flag terminator lines
        if strfind(Line{k1}, '&&&')
            End(k1) = 1;
            % sprintf('End = %2d',k1)
        end
    end
end
StartIdx = find(Start)
StartIdx = 1×4
2 25 46 72
EndIdx = find(End)
EndIdx = 1×4
21 44 65 91
for k = 1:numel(StartIdx)
    Extract{k,:} = Line(StartIdx(k):EndIdx(k));
end
Extract{1}
ans = 20×1 cell array
    {'Hello_Message.pdf'}
    {'2341234342 3214234 ert'}
    {'2341234342 3214234 abc'}
    {'2341234342 3214234'}
    {'Some_ting'}
    {'23453425'}
    {'Blah_bleh'}
    {'Sadf_5'}
    {'Ouch 4'}
    {'TEST'}
    {' '}
    {' '}
    {' '}
    {'Asdff: sdf_sdf'}
    {'Is_sdf: asdf'}
    {'IS_ssg: sadf'}
    {'NJ_T: adfgh'}
    {0×0 char}
    {'Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds'}
    {'&&&'}
Extract{end}
ans = 20×1 cell array
    {'Hello_Message.pdf'}
    {'2341234342 3214234 ert'}
    {'2341234342 3214234 abc'}
    {'2341234342 3214234'}
    {'Some_ting'}
    {'23453425'}
    {'Blah_bleh'}
    {'Sadf_5'}
    {'Ouch 4'}
    {'TEST'}
    {' '}
    {' '}
    {' '}
    {'Asdff: sdf_sdf'}
    {'Is_sdf: asdf'}
    {'IS_ssg: sadf'}
    {'NJ_T: adfgh'}
    {0×0 char}
    {'Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> iron </Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds\'}
    {'&&&'}
EDIT — (18 Oct 2023 at 03:42)
I am a bit lost with respect to ‘start key’ and ‘stop key’. My code defines ‘StartIdx’ and ‘EndIdx’ as the indices of the ‘Message’ and ‘&&&’ entries. The ‘Extract’ cell arrays are those lines and all the lines between them.
My initial approach was to use the fileread function and then do ‘logical indexing’; however, that failed, so the loop was the only other available option.
My code here is the same code I posted as a Comment, changed to test for all the ‘Message’ lines and not only ‘Hello_Message.pdf’ that was initially specified.
The regexp approach is not specific enough for this requirement.
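Once the lines are in a cell array, the start and end indices can also be found without the inner per-line loop. The following is only a sketch under stated assumptions, not a replacement for the code above: the short ‘Line’ array is a made-up stand-in for the real one, the header and terminator are matched exactly with strcmp, and every start line is assumed to have a terminator somewhere after it.

```matlab
% Hypothetical stand-in for the 'Line' cell array read from the file
Line = {''; 'Hello_Message.pdf'; 'abc'; '&&&'; ''; 'Bye_Message.pdf'; 'def'; '&&&'; ...
        'Hello_Message.pdf'; 'ghi'; 'jkl'; '&&&'};

isStart = strcmp(Line, 'Hello_Message.pdf');   % exact match on the header line
isEnd   = strcmp(Line, '&&&');                 % exact match on the terminator line

StartIdx = find(isStart);
allEnds  = find(isEnd);
EndIdx   = zeros(size(StartIdx));
for k = 1:numel(StartIdx)
    % First terminator after this header (assumes one exists)
    EndIdx(k) = allEnds(find(allEnds > StartIdx(k), 1));
end

% One cell per occurrence, header and terminator included
Extract = arrayfun(@(a,b) Line(a:b), StartIdx, EndIdx, 'UniformOutput', false);
HelloMsgs.Occurrences = Extract;
```

The remaining loop runs once per message rather than once per line, so it should stay cheap even for long files. The whole file still has to fit in memory for this to work.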

More Answers (1)

Jan on 18 Oct 2023
Edited: Jan on 18 Oct 2023
Why do you want to avoid loops? Reading the file completely to apply vectorized methods requires 8 GB of contiguous free RAM for a 4 GB file (16 bits per char). I'd choose such an approach only on computers with >= 32 GB RAM, while a loop method is less demanding concerning the RAM. In addition, filtering during the reading avoids keeping the complete text in the RAM.
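The 8 GB figure can be checked with a few lines. This is a rough sketch that assumes the file uses a single-byte encoding such as ASCII, so a 4 GB file holds about 4*2^30 characters; the variable names are only illustrative.

```matlab
c = repmat('a', 1, 1e6);                 % one million characters
info = whos('c');
bytesPerChar = info.bytes / numel(c);    % memory used per char in MATLAB

fileBytes = 4 * 2^30;                    % assumed file size: 4 GB, one byte per character
ramNeeded = fileBytes * bytesPerChar / 2^30;   % RAM in GB to hold the text as char
fprintf('%d bytes per char, about %d GB of RAM\n', bytesPerChar, ramNeeded);
```

Any intermediate copies made by vectorized string functions come on top of this.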
S = ParseFile("sample_input.txt");
S{1}
ans = 18×1 cell array
    {'2341234342 3214234 ert'}
    {'2341234342 3214234 abc'}
    {'2341234342 3214234'}
    {'Some_ting'}
    {'23453425'}
    {'Blah_bleh'}
    {'Sadf_5'}
    {'Ouch 4'}
    {'TEST'}
    {' '}
    {' '}
    {' '}
    {'Asdff: sdf_sdf'}
    {'Is_sdf: asdf'}
    {'IS_ssg: sadf'}
    {'NJ_T: adfgh'}
    {0×0 char}
    {'Some_data_: 4 sadf sadf asdf 45676578675 sdaf sadf asdf asdf sadf 4365436546 sdfdsf 0 sadfsdffds 0 <Item> toaster</Item> dsfasdf sadfdsakfdsfklj sdafsdafdsa fds'}
function S = ParseFile(File)
startKey = "Hello_Message.pdf";
stopKey = "&&&";
fid = fopen(File, 'r');
assert(fid > 0, "Cannot open file: %s", File);
bS = 1000; % Pre-allocate output in blocks
nS = bS;
iS = 0;
S = cell(1, nS);
buffer = cell(20, 1); % Grows iteratively at first
ibuffer = 0;
doGrab = false;
while ~feof(fid)
    Line = fgetl(fid);
    if startsWith(Line, startKey)
        buffer(:) = {[]}; % Clear the buffer
        ibuffer = 0;
        doGrab = true; % Start grabbing in next line
    elseif startsWith(Line, stopKey)
        doGrab = false; % Stop grabbing
        iS = iS + 1; % Expand output S in blocks on demand
        if iS > nS
            nS = nS + bS;
            S{nS} = [];
        end
        S{iS} = buffer(1:ibuffer); % Store the buffer
    elseif doGrab
        ibuffer = ibuffer + 1;
        buffer{ibuffer} = Line;
    end
end
fclose(fid);
if doGrab % Store last buffer, if stopKey is missing?!?
    iS = iS + 1;
    S{iS} = buffer(1:ibuffer); % Store the collected lines, not only the last line read
end
S = S(1:iS); % Crop pre-allocated output cells
end
Jude on 18 Oct 2023
Hi Jan,
With regard to my reason for wanting to avoid for loops, I "assumed" there could be a more resource/time efficient way to accomplish what I was trying to do. The input files are maintained on a network and not stored locally on the machine (64 GB RAM) where MATLAB is being executed.
I do like your approach a lot and will be looking at it in detail so that I understand what is happening...
How would your solution/code be modified so that the startkey and stopkey for the messages are included in the cell arrays captured by S?
Perhaps the startkey would need to be defined as regexpPattern('.*\.[a-z]{3}') then a filter for the message where line1 is equal to "Hello_Message.pdf" applied?
Thank you for your time.
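One possible way to include the start and stop lines is sketched below. It is not Jan's code: ParseFileInclusive is a hypothetical name, the keys are passed in as arguments, and the block pre-allocation of ParseFile is dropped to keep the example short, so the output grows one cell at a time.

```matlab
function S = ParseFileInclusive(File, startKey, stopKey)
% Hypothetical variant of ParseFile that keeps the start and stop lines.
% Example: S = ParseFileInclusive("sample_input.txt", "Hello_Message.pdf", "&&&");
fid = fopen(File, 'r');
assert(fid > 0, 'Cannot open file: %s', File);
closer = onCleanup(@() fclose(fid));   % Close the file even if an error occurs
S = {};
buffer = {};
doGrab = false;
while true
    Line = fgetl(fid);
    if ~ischar(Line), break; end       % fgetl returns -1 at the end of the file
    if startsWith(Line, startKey)
        buffer = {Line};               % New message: the header is its first line
        doGrab = true;
    elseif doGrab
        buffer{end+1, 1} = Line;       % Body lines and the stop line
        if startsWith(Line, stopKey)
            S{end+1} = buffer;         % Message complete, stop line included
            doGrab = false;
        end
    end
end
end
```

A message whose stop line is missing is discarded here rather than stored. Since startsWith accepts pattern objects from R2020b on, a regexpPattern could in principle be passed as startKey and the results filtered by their first line afterwards, but that has not been tried against the sample file.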

Release

R2022b
