Regular expressions: extracting data after certain keywords
Jens Oppliger
on 10 Oct 2020
Commented: Jens Oppliger
on 12 Oct 2020
Hello everyone,
I'm currently working on a task of extracting some data from a large .txt file. The file consists of certain keywords that are followed by arrays of data enclosed by "[...]" (please refer to the attached example file). I read the file into MATLAB via "fileread" and now I would like to perform the following two operations on the file content:
1) Extracting the data that comes after the first keyword "Excitation energy:". The result should be a 1D array (double).
2) Extracting the data that comes after the "Pixel number ..." keyword, e.g. after "Pixel number 0: 100656.007213". And this should be done for every "Pixel number ..." keyword in the file. The result should then be in the form of a 2D array (double) (one column/row for each Pixel number basically).
I have started looking into solving this problem with regexp, but I'm struggling to obtain the desired parts of the text file.
For example I tried using the following expression to obtain the text enclosed by "[...]" after "Excitation energy:"
content = fileread("example_file.txt");
expr = '((?<=Excitation energy:\s\s\[)).+(?=\])';
energy_text = regexp(content,expr,'match');
But the result is basically the complete content of the text file as a char array, whereas the match should stop before the first closing bracket "]". So I must be doing something wrong (I'm not very familiar with regexp). Does anyone have an idea of how to extract the data arrays mentioned above? As a side note, the number of values within the data arrays can vary from one text file to another, so the regexp expressions should only rely on finding the data enclosed by "[...]" after the corresponding keyword.
Maybe there is also another solution to this problem without using regexp ...
Thank you very much in advance.
2 Comments
Sindar
on 10 Oct 2020
What version are you using? Strings have come a long way in recent years
Something like this might work:
exc_str = extractBetween(content,"Excitation energy: [","]")
exc_data = sscanf(char(exc_str(1)),'%f').'; % str2double returns NaN for text holding several numbers
pixel_sets = split(content,"Pixel number");
pixel_sets(1) = [];
pixel_sets = extractBefore(pixel_sets,"]");
pixel_sets = extractAfter(pixel_sets,"[");
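pixel_data = cell2mat(cellfun(@(s) sscanf(s,'%f').', cellstr(pixel_sets), 'UniformOutput',false)); % one row per pixel, assuming equal-length blocks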
Accepted Answer
Stephen23
on 12 Oct 2020
For such a large file I would get textscan to directly import the numeric data. With a few simple file commands you can also automatically adjust the format string to the number of columns, as shown below. Note that for R2020a you will probably need to change 'EndOfLine' to 'LineEnding'.
opt = {'CollectOutput',true, 'EndOfLine',']', 'HeaderLines',1,...
'MultipleDelimsAsOne',true, 'Whitespace',' \b\n\r\t'};
[fid,msg] = fopen('example_file.txt','rt');
assert(fid>=3,msg)
% Read "Excitation energy" block:
fscanf(fid,'Excitation energy:%*[^[][');
exe = fscanf(fid,'%f',[1,Inf]);
pos = ftell(fid);
% Read first "Pixel number" block:
fscanf(fid,'%*[^[]');
fscanf(fid,'[');
tmp = fscanf(fid,'%f',[1,Inf]);
% Create TEXTSCAN format string:
fmt = repmat('%f',1,numel(tmp));
fmt = ['Pixel number%f:%f%*[^0123456789]',fmt];
% Read all "Pixel number" blocks at once:
fseek(fid,pos,'bof');
out = textscan(fid,fmt,opt{:});
fclose(fid);
out = out{1};
Giving:
>> exe
exe =
0.6 0.599925979 0.599703936 .. 0.501484417 0.496248345 0.49088983
>> out
out =
0 100656.007213 2.08147902 .. -2.25828678 -2.35062627
1 100656.116929 2.08050533 .. -2.26393975 -2.34891882
More Answers (1)
Walter Roberson
on 10 Oct 2020
expr = '((?<=Excitation energy:\s\s\[)).+?(?=\])';
or
expr = '((?<=Excitation energy:\s\s\[))[^]]+';
Remember that the + and * operators immediately extend as far as possible into the string, and then the focus point is moved backwards only as needed to match anything later in the same pattern. So in your case
expr = '((?<=Excitation energy:\s\s\[)).+(?=\])';
then the .+ would first match right to the very end of the string, and then the (?=\]) would force the focus point to move back to just before the ] that is closest before that point.
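The backtracking described above can be seen on a small string. This is only a sketch: the sample text is made up for illustration (it is not the original poster's file) and assumes exactly two whitespace characters between the colon and the opening bracket, as in the original expression.

```matlab
% Hypothetical sample with two bracketed blocks (not the real data file).
txt = sprintf('Excitation energy:  [1 2 3]\nPixel number 0: 5.5\n[4 5 6]\n');

% Greedy: .+ runs to the end of the text, then backs up to the LAST ']'.
greedy = regexp(txt, '(?<=Excitation energy:\s\s\[).+(?=\])', 'match', 'once');

% Lazy: .+? grows one character at a time and stops at the FIRST ']'.
lazy = regexp(txt, '(?<=Excitation energy:\s\s\[).+?(?=\])', 'match', 'once');

disp(lazy)             % the first block only
disp(numel(greedy))    % much longer: it spans both blocks
```

The greedy match ends just before the final closing bracket in the text, which is why the original expression returned almost the whole file.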
The *? and +? operators, on the other hand, are minimal operators, moving the focus point forward as little as possible to match what follows in the pattern. Or, as I showed, you can just tell it to move forward past all non-] characters, which is even less work for the parser. The main difference is that the [^]]+ by itself does not promise that the next character is ] the way the other possibilities do. For example if the file ended in
Excitation energy: [1 2 3 4
The (?=\]) lookahead would not match unless the data were followed by ], whereas [^]]+ alone would match to the end of the string, since everything after the [ is a non-] character.
You could make the two equivalent by adding a (?=\]) after the [^]]+
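Putting this together, here is one possible sketch that turns both kinds of block into numeric arrays. It swaps the lookarounds for capture tokens, which is a variation on the expressions above rather than the same technique. The sample text, the variable names, and the assumption that every "Pixel number" block holds the same number of values are all illustrative.

```matlab
% Hypothetical sample mimicking the described layout (not the real file).
txt = sprintf(['Excitation energy:  [0.6 0.5 0.4]\n', ...
               'Pixel number 0: 100.5\n[1 2 3]\n', ...
               'Pixel number 1: 100.7\n[4 5 6]\n']);

% Energy block: capture everything that is not ']' between the brackets.
tok = regexp(txt, 'Excitation energy:\s*\[([^\]]+)\]', 'tokens', 'once');
energy = sscanf(tok{1}, '%f').';          % 1-by-N row of doubles

% Pixel blocks: one token per "Pixel number" keyword.
tok = regexp(txt, 'Pixel number\s*\d+:[^\[]*\[([^\]]+)\]', 'tokens');
rows = cellfun(@(t) sscanf(t{1}, '%f').', tok, 'UniformOutput', false);
pixel = vertcat(rows{:});                 % one row per pixel (equal lengths assumed)
```

Because the closing \] is part of the pattern, a block with no closing bracket is simply not matched, which gives the same guarantee as adding (?=\]). If the blocks can differ in length, vertcat will error, and keeping the rows in the cell array is the safer choice.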