Regular expressions: extracting data after certain keywords

Question


example_file.txt

Hello everyone,

I'm currently working on a task of extracting some data from a large .txt file. The file consists of certain keywords that are followed by arrays of data enclosed by "[...]" (please refer to the attached example file). I read the file into MATLAB via "fileread" and now I would like to perform the following two operations on the file content:

1) Extracting the data that comes after the first keyword "Excitation energy:". The result should be a 1D array (double).

2) Extracting the data that comes after the "Pixel number ..." keyword, e.g. after "Pixel number 0: 100656.007213". And this should be done for every "Pixel number ..." keyword in the file. The result should then be in the form of a 2D array (double) (one column/row for each Pixel number basically).

I started looking into how to solve this problem using regexp. However, I'm struggling to obtain the desired parts of the text file.

For example, I tried the following expression to obtain the text enclosed by "[...]" after "Excitation energy:":

content = fileread("example_file.txt");
expr = '((?<=Excitation energy:\s\s\[)).+(?=\])';
energy_text = regexp(content,expr,'match');

But the result is essentially the complete content of the text file as a char array, whereas the match should stop before the first closing bracket "]". So I must be doing something wrong (I'm not very familiar with regexp). Does anyone have an idea how to extract the data arrays mentioned above? As a side note, the number of values within the data arrays can vary between text files, so the regexp expressions should only focus on finding the data enclosed by "[...]" after the corresponding keyword.

Maybe there is also another solution to this problem without using regexp ...

Thank you very much in advance.

2 Comments

Sindar on 10 Oct 2020


What version are you using? Strings have come a long way in recent years

Something like this might work:

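% The start pattern must match the spacing in the file exactly
exc_str = extractBetween(content,"Excitation energy: [","]");
% Parse the whitespace-separated numbers into a row vector
exc_data = sscanf(exc_str{1},'%f').';
pixel_sets = split(content,"Pixel number");
pixel_sets(1) = [];   % drop the text before the first "Pixel number"
pixel_sets = extractBefore(pixel_sets,"]");
pixel_sets = extractAfter(pixel_sets,"[");
% One row per pixel (assumes every block holds the same number of values)
pixel_data = cellfun(@(s) sscanf(s,'%f').',cellstr(pixel_sets),'UniformOutput',false);
pixel_data = vertcat(pixel_data{:});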
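
A note on the numeric conversion step: str2double expects each text element to hold a single number, so text containing a whole list of values converts to NaN. The minimal sketch below compares it with sscanf on a hypothetical extracted string (the values are made up for illustration and assume whitespace-separated numbers).

```matlab
% Hypothetical text as it might come out of extractBetween
s = '0.6 0.599925979 0.599703936';

v1 = str2double(s);       % NaN, because s is not a single number
v2 = sscanf(s,'%f').';    % 1-by-3 double row vector

disp(v1)
disp(v2)
```

sscanf skips any whitespace between values, including line breaks, so it also copes with arrays that span several lines.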

Jens Oppliger on 10 Oct 2020

Hello Sindar,

thank you very much for your quick answer. I'm using MATLAB R2020a. Just a few minutes ago I also came across the functions you mentioned, in particular "extractBetween", which I have now used to perform the desired tasks.

Thanks again.


Answer 1

Stephen23 on 12 Oct 2020


For such a large file I would get textscan to directly import the numeric data. With a few simple file commands you can also automatically adjust the format string to the number of columns, as shown below. Note that for R2020a you will probably need to change 'EndOfLine' to 'LineEnding'.

opt = {'CollectOutput',true, 'EndOfLine',']', 'HeaderLines',1,...
	'MultipleDelimsAsOne',true, 'Whitespace',' \b\n\r\t'};
[fid,msg] = fopen('example_file.txt','rt');
assert(fid>=3,msg)
% Read "Excitation energy" block:
fscanf(fid,'Excitation energy:%*[^[][');
exe = fscanf(fid,'%f',[1,Inf]);
pos = ftell(fid);
% Read first "Pixel number" block:
fscanf(fid,'%*[^[]');
fscanf(fid,'[');
tmp = fscanf(fid,'%f',[1,Inf]);
% Create TEXTSCAN format string:
fmt = repmat('%f',1,numel(tmp));
fmt = ['Pixel number%f:%f%*[^0123456789]',fmt];
% Read all "Pixel number" blocks at once:
fseek(fid,pos,'bof');
out = textscan(fid,fmt,opt{:});
fclose(fid);
out = out{1};

Giving:

>> exe
exe =
    0.6    0.599925979    0.599703936  ..  0.501484417    0.496248345    0.49088983
>> out
out =
    0     100656.007213     2.08147902  ..  -2.25828678    -2.35062627
    1     100656.116929     2.08050533  ..  -2.26393975    -2.34891882

1 Comment

Jens Oppliger on 12 Oct 2020

Hello Stephen,

your proposed code does exactly what I wanted. Thanks again for your help!


Answer 2

Walter Roberson on 10 Oct 2020


expr = '((?<=Excitation energy:\s\s\[)).+?(?=\])';

or

expr = '((?<=Excitation energy:\s\s\[))[^]]+)';

Remember that the + and * operators immediately extend as far as possible into the string, and then the focus point is moved backwards only as needed to match anything later in the same pattern. So in your case

expr = '((?<=Excitation energy:\s\s\[)).+(?=\])';

then the .+ would first match right to the very end of the string, and then the (?=\]) would force the focus point to move back to just before the ] that is closest before that point.
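
The difference between the greedy and the lazy quantifier can be seen on a short string. This is only an illustrative sketch: the text in txt is a made-up stand-in for the real file content, with two spaces after the colon to fit the \s\s in the expression.

```matlab
% Hypothetical stand-in for the file content
txt = 'Excitation energy:  [1 2 3] Pixel number 0: 5 [4 5 6]';

% Greedy .+ runs to the end and backs up to the LAST closing bracket
greedy = regexp(txt,'(?<=Excitation energy:\s\s\[).+(?=\])','match','once');

% Lazy .+? stops at the FIRST closing bracket
lazy = regexp(txt,'(?<=Excitation energy:\s\s\[).+?(?=\])','match','once');

disp(greedy)   % 1 2 3] Pixel number 0: 5 [4 5 6
disp(lazy)     % 1 2 3
```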

The *? and +? operators, on the other hand, are minimal operators, moving the focus point forward as little as possible to match what follows in the pattern. Or, as I showed, you can just tell it to move forward past all non-] characters, which is even less work for the parser. The main difference is that the [^]]+ by itself does not promise that the next character is ] the way the other possibilities do. For example if the file ended in

Excitation energy: [1 2 3 4

The (?=\]) pattern would not match unless what followed was ], whereas the [^]]+ would match to the end of the string, since everything after the [ is something that is not ].

You could make the two equivalent by adding a (?=\]) after the [^]]+.
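
A small sketch of that difference, using the truncated example from above next to a complete one. The variable names are hypothetical, and the closing bracket inside the character class is written escaped as [^\]] to be explicit.

```matlab
complete  = 'Excitation energy:  [1 2 3 4]';
truncated = 'Excitation energy:  [1 2 3 4';    % closing bracket missing

loose  = '(?<=Excitation energy:\s\s\[)[^\]]+';         % does not check for ]
strict = '(?<=Excitation energy:\s\s\[)[^\]]+(?=\])';   % requires ] to follow

a = regexp(complete, loose, 'match','once');    % '1 2 3 4'
b = regexp(truncated,loose, 'match','once');    % '1 2 3 4', runs to end of text
c = regexp(truncated,strict,'match','once');    % empty, no closing bracket found
```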

1 Comment

Jens Oppliger on 11 Oct 2020

Hello Walter,

thank you for the detailed explanation. I tried both of your suggestions, but only the first one worked. Regexp doesn't return any match for the second expression where you included the [^]].

Nevertheless your answer gave me quite some additional insight into how the different quantifiers work in a regular expression.
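
One possible reason for the missing match is that the second expression, as posted, contains one more closing parenthesis than opening ones. The sketch below uses a negated character class with balanced parentheses and capture tokens to cover both tasks from the question. It is not the code from the answer: the sample content is invented, and it assumes whitespace-separated values and the same number of values in every "Pixel number" block.

```matlab
% Hypothetical stand-in for content = fileread("example_file.txt")
content = sprintf(['Excitation energy:  [0.6 0.5 0.4]\n', ...
                   'Pixel number 0: 100656.0 [1 2 3]\n', ...
                   'Pixel number 1: 100656.1 [4 5 6]\n']);

% Task 1: bracketed block after the "Excitation energy:" keyword
tok = regexp(content,'Excitation energy:\s*\[([^\]]+)\]','tokens','once');
energy = sscanf(tok{1},'%f').';              % 1-by-N double

% Task 2: bracketed block after every "Pixel number" keyword
toks = regexp(content,'Pixel number[^\[]*\[([^\]]+)\]','tokens');
rows = cellfun(@(t) sscanf(t{1},'%f').',toks,'UniformOutput',false);
pixels = vertcat(rows{:});                   % one row per pixel

disp(energy)
disp(pixels)
```

Using \s* instead of \s\s makes the expression independent of how much whitespace separates the keyword from the opening bracket.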
