Parsing data from complicated text files

I have about 20 years of text files that contain the records of individual tests (about 8GB of plain text files, about 4,000 individual files). Each file has this format:
********************************************************************************
Test Data Report
Station ID: [Test Station ID Number]
Station Part Number: [Test Station Part Number]
Station Serial Number: [Test Station Serial Number]
Test Procedure Number: [Test Procedure Number] [Test Procedure Revision]
Operation: [colloquial test]
Serial Number of test subject: [Serial Number, plus some other info about the test]
Date: [Day, Month Date, year]
Time: [11:00:03 AM]
Operator: [Operator Name]
Number of Results: [NNNN]
Test Result: [Passed/Failed]
********************************************************************************
--------------------------------------------------------------------------------
MEASUREMENT LL READING UL UNITS STATUS
--------------------------------------------------------------------------------
Enter Testing Time: Done
--------------------------------------------------------------------------------
08:00
--------------------------------------------------------------------------------
FOE, CAL: Passed
--------------------------------------------------------------------------------
CALIBRATION IS VALID
--------------------------------------------------------------------------------
Test Start Time: Done
--------------------------------------------------------------------------------
11:00:33 AM
--------------------------------------------------------------------------------
Group Meas Init: Passed
--------------------------------------------------------------------------------
Datapoint_01 LL Measured UL Units Passed
Datapoint_02 LL Measured UL Units Passed
Datapoint_03 LL Measured UL Units Passed
Datapoint_04 LL Measured UL Units Passed
Datapoint_05 LL Measured UL Units Passed
Datapoint_06 LL Measured UL Units Passed
Datapoint_07 LL Measured UL Units Passed
Datapoint_08 LL Measured UL Units Passed
Datapoint_09 LL Measured UL Units Passed
Datapoint_10 LL Measured UL Units Passed
Datapoint_11 LL Measured UL Units Passed
Datapoint_12 LL Measured UL Units Passed
Datapoint_13 LL Measured UL Units Passed
Datapoint_14 LL Measured UL Units Passed
Datapoint_15 LL Measured UL Units Passed
Datapoint_16 LL Measured UL Units Passed
Datapoint_17 LL Measured UL Units Passed
Datapoint_18 LL Measured UL Units Passed
Datapoint_19 LL Measured UL Units Passed
Datapoint_20 LL Measured UL Units Passed
Datapoint_21 LL Measured UL Units Passed
Datapoint_22 LL Measured UL Units Passed
Datapoint_23 LL Measured UL Units Passed
Datapoint_24 LL Measured UL Units Passed
Datapoint_25 LL Measured UL Units Passed
Datapoint_26 LL Measured UL Units Passed
Datapoint_27 LL Measured UL Units Passed
Datapoint_28 Measured UL Units Passed
Datapoint_29 Measured Units Passed
--------------------------------------------------------------------------------
Group Meas Ramp: Passed
--------------------------------------------------------------------------------
Datapoint_01 LL Measured UL Units Passed
Datapoint_02 LL Measured UL Units Passed
Datapoint_03 LL Measured UL Units Passed
Datapoint_04 LL Measured UL Units Passed
Datapoint_05 LL Measured UL Units Passed
Datapoint_06 LL Measured UL Units Passed
Datapoint_07 LL Measured UL Units Passed
Datapoint_08 LL Measured UL Units Passed
Datapoint_09 LL Measured UL Units Passed
Datapoint_10 LL Measured UL Units Passed
Datapoint_11 LL Measured UL Units Passed
Datapoint_12 LL Measured UL Units Passed
Datapoint_13 LL Measured UL Units Passed
Datapoint_14 LL Measured UL Units Passed
Datapoint_15 LL Measured UL Units Passed
Datapoint_16 LL Measured UL Units Passed
Datapoint_17 LL Measured UL Units Passed
Datapoint_18 LL Measured UL Units Passed
Datapoint_19 LL Measured UL Units Passed
Datapoint_20 LL Measured UL Units Passed
Datapoint_21 LL Measured UL Units Passed
Datapoint_22 LL Measured UL Units Passed
Datapoint_23 LL Measured UL Units Passed
Datapoint_24 LL Measured UL Units Passed
Datapoint_25 LL Measured UL Units Passed
Datapoint_26 LL Measured UL Units Passed
Datapoint_27 LL Measured UL Units Passed
Datapoint_28 Measured UL Units Passed
Datapoint_29 Measured Units Passed
--------------------------------------------------------------------------------
Time (after meas): Done
--------------------------------------------------------------------------------
11:01:16 AM
--------------------------------------------------------------------------------
Group Meas Ramp: Passed
--------------------------------------------------------------------------------
Datapoint_01 LL Measured UL Units Passed
Datapoint_02 LL Measured UL Units Passed
Datapoint_03 LL Measured UL Units Passed
Datapoint_04 LL Measured UL Units Passed
Datapoint_05 LL Measured UL Units Passed
Datapoint_06 LL Measured UL Units Passed
Datapoint_07 LL Measured UL Units Passed
Datapoint_08 LL Measured UL Units Passed
Datapoint_09 LL Measured UL Units Passed
Datapoint_10 LL Measured UL Units Passed
Datapoint_11 LL Measured UL Units Passed
Datapoint_12 LL Measured UL Units Passed
Datapoint_13 LL Measured UL Units Passed
Datapoint_14 LL Measured UL Units Passed
Datapoint_15 LL Measured UL Units Passed
Datapoint_16 LL Measured UL Units Passed
Datapoint_17 LL Measured UL Units Passed
Datapoint_18 LL Measured UL Units Passed
Datapoint_19 LL Measured UL Units Passed
Datapoint_20 LL Measured UL Units Passed
Datapoint_21 LL Measured UL Units Passed
Datapoint_22 LL Measured UL Units Passed
Datapoint_23 LL Measured UL Units Passed
Datapoint_24 LL Measured UL Units Passed
Datapoint_25 LL Measured UL Units Passed
Datapoint_26 LL Measured UL Units Passed
Datapoint_27 LL Measured UL Units Passed
Datapoint_28 Measured UL Units Passed
Datapoint_29 Measured Units Passed
--------------------------------------------------------------------------------
Time (after meas): Done
--------------------------------------------------------------------------------
11:01:37 AM
--------------------------------------------------------------------------------
Now, at the moment, the only things I care about are
  1. Whether a failure occurred or not
  2. When that failure occurred
I will likely want to perform other analyses on the data in the future, but for the moment, this will suffice. I want to go through each report, determine whether a failure occurred, record when that failure occurred, and then plot all the failures as a histogram in terms of time so that I can see if there are any typical lengths of time it takes for a test to fail.
I have a fair amount of experience working with data once it is in Matlab, but I am much less experienced with importing data, especially this kind of batch importing. Is there a simple way to do this, or am I essentially just using something like textscan() or fscanf() in a loop?
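As a starting point for the batch side, I am picturing something like the sketch below: list every report with dir and hand each file to a parsing function (the folder path and the parse_report function are placeholders, not code I actually have yet).
folder = 'C:\test_reports';                      % placeholder root folder for the 4,000 files
files = dir(fullfile(folder, '**', '*.txt'));    % recursive listing of every report
results = cell(numel(files), 1);
for k = 1:numel(files)
    fname = fullfile(files(k).folder, files(k).name);
    txt = fileread(fname);                       % whole report as one char vector
    results{k} = parse_report(txt);              % placeholder per-file parser
end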
  3 Comments
Michael Browne
Michael Browne on 22 Mar 2021
I cannot post the full text files because of company policies, and this text is just illustrative of the formatting that each file has.
But yes, the text you've highlighted is the header; however, the "Time:" is when the test begins, not when it fails.
The "Test Result" section does record whether a test is an overall pass or fail, but it is a summary of all the data points and all the tests performed on those data points. It is the result of the software looking for a single failure and then recording a "Failed" result in the header. I don't really care about the header result, since what I really care about is when a failure occurs.
So what I need to build is a function that scans through the file, looking for any "Fail" results in a section like this one:
Group Meas Ramp: Passed
--------------------------------------------------------------------------------
and then jump to the time section immediately below it, like this one here:
Time (after meas): Done
--------------------------------------------------------------------------------
11:01:16 AM
--------------------------------------------------------------------------------
It would then take the difference with the time listed in the header of the file
[11:01:16 AM] - [11:00:03 AM]
Then it would store this data as a single point, which will go towards the creation of a histogram.
I can already do this for one single file using the Matlab "Import Data" tool and a lot of manual selection that is specific to each file, but the issue is that I need to do this for 4,000 files where the failure is located at the end of the file (the test hardware terminates the test in the event of a failure). So it is the automating of this data parsing that is giving me trouble.
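To make the time-difference step concrete with the values above, a sketch (assuming the two times have already been pulled out of the file as text):
t_start = datetime('11:00:03 AM', 'InputFormat', 'hh:mm:ss aa');  % header 'Time:' value
t_fail  = datetime('11:01:16 AM', 'InputFormat', 'hh:mm:ss aa');  % 'Time (after meas)' value
time_to_fail = t_fail - t_start;                                  % duration of 00:01:13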
dpb
dpb on 23 Mar 2021
Well, we still don't have a file to test with, nor is there a case that fails in the text you posted... If you expect somebody to write code, you've got to do your part and give them the help they need from your end; otherwise you'll end up with the other poster's result: wasted time/effort on code that doesn't work, because what he was given wasn't sufficient and his best guess at what it should be apparently wasn't correct.
In general, however, the idea would be to use readcell to import each file into a cell array, use contains or regexp to find rows with the key words/phrases wanted, and then parse those lines, taking into account where the group headers are to match which are which.
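Something along these lines, untested and with search strings that would need adjusting to the real files (the file name here is just a stand-in):
lines = readlines('testData.txt');                      % one string per line (R2020b+); readcell works similarly
hdrRow   = find(contains(lines, 'Time:'), 1);           % header line holding the test start time
failRows = find(contains(lines, 'Group') & contains(lines, 'Failed'));  % group summary lines that failed
timeRows = find(contains(lines, 'Time (after meas)'));  % each of these sits two lines above a time stamp
failTimes = lines(timeRows + 2);                        % the 'hh:mm:ss AM' strings themselves
% pair each entry of failRows with the first timeRows entry after it to know which failure happened when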


Accepted Answer

Michael Browne
Michael Browne on 24 Mar 2021
Edited: Michael Browne on 24 Mar 2021
Alright, after digging through @Mathieu NOE's code and seeing why it failed, it turns out that there are slight variations in the formatting of all the text files, introduced by ~20 years of test software updates, things like the exact number and types of white space characters changing. However, I did discover another timing flag that I could use, which had stayed consistent. Buried much deeper in the file is an 'elapsed time' flag that is very poorly named, which is why I missed it the first time (however, I still apologize for not including it in the posted format in the OP). This elapsed-time flag has a format that is both consistent across the years and unique among all the times listed in the data, so I was able to build a pattern for it and then detect and pull it out. Once I had all those elapsed-time entries, I just selected the one at the end of the array, since that one will always be the longest, and took it as the time it took each data report to fail.
Also, thank you for your patience @dpb. I actually found myself reading a lot of your replies to other questions about reading strings from text files. This solution of yours made me realize that I was over-thinking my problem.
Here is what I came up with:
filename_in = 'testData.txt';
[output] = extract_data(filename_in);

function [time_to_fail] = extract_data(file)
    txt = fileread(file);   % whole report as one char vector
    % Pattern definition: the elapsed-time flag looks like 'hh : mm : ss'
    elapsed_pattern = digitsPattern(2) + " : " + digitsPattern(2) + " : " + digitsPattern(2);
    time_to_fail = '';      % default output when no failure is found
    % First screen, to check for any failures
    failure_detect = strfind(txt, 'Failed');
    % If a failure is detected, pull all the elapsed times
    if ~isempty(failure_detect)
        % 'extract' pulls out the 'hh : mm : ss' flags from the text file
        % 'strrep' removes all white space, leaving 'hh:mm:ss'
        elapsed_time = strrep(extract(txt, elapsed_pattern), ' ', '');
        % 'elapsed_time(end)' grabs just the last elapsed time from the
        % data, which should always be the longest time.
        % Spit out the result from the function
        time_to_fail = elapsed_time(end);
    end
end
Now I just need to wrap my head around handling time in Matlab, but that is off-topic for this issue, and I have not had a chance to do my homework for it yet.
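For the histogram itself I am guessing it will end up as something like this once all 4,000 files have been processed (a sketch, where all_times stands for the collected array of 'hh:mm:ss' strings):
fail_durations = duration(all_times, 'InputFormat', 'hh:mm:ss'); % convert the extracted text to durations
histogram(minutes(fail_durations))                               % bin the times-to-failure in minutes
xlabel('Time to failure (minutes)')
ylabel('Number of failed tests')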

More Answers (1)

Mathieu NOE
Mathieu NOE on 23 Mar 2021
Hello,
This is my two cents of code to import the required data. The function will give you the time values (char array) and the number of failures. I tested it with two dummy files: one is your original data, and in the second one I changed the last section to create a Failed condition, plus I added another failed case with a different time value, just to check that my code would correctly detect the 2 failures.
Filename_in = 'data2.txt';
[Time_init,Time_end,fail_count] = extract_data(Filename_in);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [Time_init,Time_end,fail_count] = extract_data(Filename)
    fid = fopen(Filename);
    tline = fgetl(fid);
    % initialization
    k = 0;          % counter #1 : current line index
    fail_count = 0; % counter #2 : number of failures found
    Time_init = '';
    Time_end{1} = '';
    line_fail_ind = 0;
    fail_flag = 0;
    while ischar(tline)
        k = k+1; % loop over line index
        % store initial Time value (start Time)
        if contains(tline,'Time: [')
            Time_init = deblank(extractBetween(tline,'[',']'));
        end
        % then search for 'Failed' case in line "Group Meas Ramp"
        if contains(tline,'Group Meas Ramp') && contains(tline,'Failed')
            fail_flag = 1;
        end
        if fail_flag == 1 && contains(tline,'Time (after meas)')
            line_fail_ind = k;
        end
        % time of failure : capture when running index k = line_fail_ind + 2
        % (and fail_flag == 1)
        if fail_flag == 1 && k == line_fail_ind + 2
            fail_count = fail_count + 1;
            Time_end{fail_count} = tline;
            fail_flag = 0; % reset fail_flag
        end
        tline = fgetl(fid); % read the next line
    end
    fclose(fid);
end
  3 Comments
Mathieu NOE
Mathieu NOE on 23 Mar 2021
Hi,
Would you be able to copy-paste the section of data that does not seem to work 100% with my code?
dpb
dpb on 24 Mar 2021
Is this one test/file?
Is the Group Meas Init: section of interest? There is no time after it; an ending time is given only after the "Ramp" section. I presume that maybe if the INIT fails, the rest of the test is aborted and there consequently is no file?
Need all the ground rules...

