How, if possible, do I limit the number of times REGEXP searches for a specific pattern?
I’m using a regular expression to search blocks of text that look like the following:
MSN_BER (0:31) Observation #1 Rx'd at: (58570.000) Msg. Time: (58568.000)
Forward to IMU: true Rcv Date: 2010121 Synch: f0f0 Rep Mode: Replay_Mode
State Time: 12:00:00.000 (58571.000)
State Position: -1500.0000, -5000.0000, 4100.0000
MSN_RAM (0:32) Observation #20 Rx'd at: (58569.000) Msg. Time: (58569.000)
Forward to IMU: true Rcv Date: 2010121 Synch: f0f0 Rep Mode: Replay_Mode
Fmt: 10 (AIRBORN__ARRAY_LOT) Length: 5678 Remote Num: 1 Number of Obsevations: 1
Type: 1 Track ID: 12345 Time Tag: 58573.00000000
Band ID: 1 AD ID: 21 Scan ID: 0 LRT/HRT: 1 Valid Flag: 0
MSN_RAM (0:32) Observation #30 Rx'd at: (58569.000) Msg. Time: (58569.000)
Forward to IMU: true Rcv Date: 2010121 Synch: f0f0 Rep Mode: Replay_Mode
Fmt: 10 (AIRBORN__ARRAY_LOT) Length: 5678 Remote Num: 1 Number of Obsevations: 2
Type: 1 Track ID: 12345 Time Tag: 58583.00000000
Band ID: 1 AD ID: 31 Scan ID: 0 LRT/HRT: 1 Valid Flag: 0
Type: 1 Track ID: 12345 Time Tag: 58585.00000000
Band ID: 1 AD ID: 32 Scan ID: 0 LRT/HRT: 1 Valid Flag: 0
Note: There is no 2nd MSN_BER data block.
I’m using the following search pattern and REGEXP function to extract the time tag and AD ID values:
exp = '([\d\.]+)\s+Band[^A]+?AD ID:\s+(\d+).';
tokens3 = regexp(bufferSplit{BlockId}, exp, 'tokens');
This results in: tokens3 = {1x2 cell} {1x2 cell} {1x2 cell},
where the time tag and AD ID are contained in the cells for each occurrence in the block of text.
>> tokens3{1,1}
ans = '58573.00000000' '21'
>> tokens3{1,2}
ans = '58583.00000000' '31'
>> tokens3{1,3}
ans = '58585.00000000' '32'
What I’m attempting to accomplish is to limit the search. Specifically, I want to limit the number of times the time tag and AD ID values are matched, based on the fact that there is no 2nd MSN_BER data block. I know the 'once' option returns only the first match found. However, there could be multiple occurrences of the AD ID and its associated time tag.
The result of this would be: tokens3 = {1x2 cell}
>> tokens3{1,1}
ans = '58573.00000000' '21'
Can this be accomplished using the REGEXP function?
3 Comments
Cedric
on 16 Nov 2013
So you have a situation like the following?
MSN_BER
...
MSN_RAM
...
Type: - this block of data could occur between 1 and several hundred times
MSN_RAM ** No MSN_BER, so Type entries should be discarded.
...
Type: - this block of data could occur between 1 and several hundred times
MSN_BER
...
MSN_RAM
...
Type: - this block of data could occur between 1 and several hundred times
If so, what do you want to achieve? Is it to get a stat on time of all types which belong to any MSN_BER, or is it a stat per MSN_BER, or anything else?
Accepted Answer
Cedric
on 16 Nov 2013
Edited: Cedric
on 17 Nov 2013
I'll answer assuming that my last comment under your question is correct. Implementing complex regular expressions is good for learning, but in practice one often gets better results by splitting a one-shot complex call/pattern into a series of simpler calls/patterns. Here is an example, using the following content:
MSN_BER (0:31) Observation #1 Rx'd at: (58570.000) Msg. Time: (58568.000)
Forward to IMU: true Rcv Date: 2010121 Synch: f0f0 Rep Mode: Replay_Mode
State Time: 12:00:00.000 (58571.000)
State Position: -1500.0000, -5000.0000, 4100.0000
MSN_RAM (0:32) Observation #20 Rx'd at: (58569.000) Msg. Time: (58569.000)
Forward to IMU: true Rcv Date: 2010121 Synch: f0f0 Rep Mode: Replay_Mode
Fmt: 10 (AIRBORN__ARRAY_LOT) Length: 5678 Remote Num: 1 Number of Obsevations: 1
Type: 1 Track ID: 12345 Time Tag: 58573.00000000
Band ID: 1 AD ID: 21 Scan ID: 0 LRT/HRT: 1 Valid Flag: 0
Type: 1 Track ID: 12345 Time Tag: 58574.00000000
Band ID: 1 AD ID: 21 Scan ID: 0 LRT/HRT: 1 Valid Flag: 0
MSN_RAM (0:32) Observation #30 Rx'd at: (58569.000) Msg. Time: (58569.000)
Forward to IMU: true Rcv Date: 2010121 Synch: f0f0 Rep Mode: Replay_Mode
Fmt: 10 (AIRBORN__ARRAY_LOT) Length: 5678 Remote Num: 1 Number of Obsevations: 2
Type: 1 Track ID: 12345 Time Tag: 58583.00000000
Band ID: 1 AD ID: 31 Scan ID: 0 LRT/HRT: 1 Valid Flag: 0
Type: 1 Track ID: 12345 Time Tag: 58585.00000000
Band ID: 1 AD ID: 32 Scan ID: 0 LRT/HRT: 1 Valid Flag: 0
MSN_BER (0:31) Observation #1 Rx'd at: (58570.000) Msg. Time: (58568.000)
Forward to IMU: true Rcv Date: 2010121 Synch: f0f0 Rep Mode: Replay_Mode
State Time: 12:00:00.000 (58571.000)
State Position: -1500.0000, -5000.0000, 4100.0000
MSN_RAM (0:32) Observation #20 Rx'd at: (58569.000) Msg. Time: (58569.000)
Forward to IMU: true Rcv Date: 2010121 Synch: f0f0 Rep Mode: Replay_Mode
Fmt: 10 (AIRBORN__ARRAY_LOT) Length: 5678 Remote Num: 1 Number of Obsevations: 1
Type: 1 Track ID: 12345 Time Tag: 58578.00000000
Band ID: 1 AD ID: 41 Scan ID: 0 LRT/HRT: 1 Valid Flag: 0
Type: 1 Track ID: 12345 Time Tag: 58579.00000000
Band ID: 1 AD ID: 41 Scan ID: 0 LRT/HRT: 1 Valid Flag: 0
which is made of two MSN_BER/MSN_RAM blocks framing an MSN_RAM-only block. I assume that you want to get all AD IDs and time tags of MSN_BER/MSN_RAM blocks.
The first step is to read the file and get valid MSN_BER/MSN_RAM blocks:
content = fileread( 'bradFile.txt' ) ;
BER_blocks = regexp( content, 'MSN_BER.+?RAM(?:[^R]+|R(?!AM))*', 'match' ) ;
Running this produces:
>> BER_blocks
BER_blocks =
[1x766 char] [1x762 char]
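If you display these two blocks, you'll see that the first doesn't include the second MSN_RAM block (the one with no MSN_BER of its own). The first part of the pattern, 'MSN_BER.+?RAM', lazily matches from MSN_BER up to the first following 'RAM'. The second part, '(?:[^R]+|R(?!AM))*', then matches any run of characters that are not 'R', or an 'R' that is not followed by 'AM', so the match stops just before the next 'RAM'. This is one (not too inefficient) way to exclude a given string from the match.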
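As a minimal sketch of how this exclusion idiom behaves, the same pattern can be tried on a shortened, made-up string (the text in txt is a hypothetical stand-in for the real log, not data from the file):

```matlab
% Hypothetical miniature input: a BER/RAM pair, a stray RAM-only block,
% then a second BER/RAM pair.
txt = 'MSN_BER a MSN_RAM Rx b MSN_RAM c MSN_BER d MSN_RAM e' ;

% Same pattern as above: lazy match up to the first 'RAM', then consume
% anything that is not the start of another 'RAM'.
pat = 'MSN_BER.+?RAM(?:[^R]+|R(?!AM))*' ;
blocks = regexp( txt, pat, 'match' ) ;

% blocks{1} is 'MSN_BER a MSN_RAM Rx b MSN_'  (the 'R' of 'Rx' is allowed)
% blocks{2} is 'MSN_BER d MSN_RAM e'
% The stray block ' c ' appears in neither match.
disp( blocks ) ;
```

Each match ends with a dangling 'MSN_' because the tail stops right before the next 'RAM'. Note also that the tail only stops at 'RAM' (or the end of the text), so it would run across a following 'MSN_BER'; the approach relies on each pair being followed by a RAM-only block or by the end of the content, as in the sample above.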
The second step is to extract AD IDs and time tags from each block.
data = cell( size( BER_blocks )) ;
for bId = 1 : numel( BER_blocks )
    tokens = regexp( BER_blocks{bId}, 'Time Tag:\s*([\d\.]+).+?AD ID:\s*(\d+)', ...
                     'tokens' ) ;
    data{bId} = reshape( str2double( [tokens{:}] ), 2, [] ).' ;
end
Based on the above content, this leads to the following data cell array (each cell contains the time tags and AD IDs of one MSN_BER/MSN_RAM block):
>> celldisp( data )
data{1} =
58573 21
58574 21
data{2} =
58578 41
58579 41
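To unpack the reshape line inside the loop, here is a small sketch run on a hand-built tokens cell. The values are copied from the first block above; the names flat, vals and block are only for illustration.

```matlab
% What regexp(..., 'tokens') returns for a block with two matches:
% one 1x2 cell per match, holding the time tag and the AD ID as text.
tokens = { {'58573.00000000', '21'}, {'58574.00000000', '21'} } ;

flat  = [tokens{:}] ;               % 1x4 cell: tag, id, tag, id
vals  = str2double( flat ) ;        % 1x4 double: 58573 21 58574 21
block = reshape( vals, 2, [] ).' ;  % one row per match: [tag, id]

disp( block ) ;
```

Reshaping to 2 rows and then transposing is what puts each (time tag, AD ID) pair on its own row, since reshape fills column by column.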
You can then concatenate these cells' content if you want to have one big array instead of one array per block:
>> data = vertcat( data{:} )
data =
58573 21
58574 21
58578 41
58579 41
Let me know if it's not what you wanted.
2 Comments
Cedric
on 18 Nov 2013
You're welcome. And we actually all have a long way to go with these regular expressions, so I sympathize!
More Answers (1)
Walter Roberson
on 12 Nov 2013
After a pattern, perhaps enclosed in () or (?:), you can put {minimum,maximum} counts. For example
'(?:\d\w){3,7}'
would match 3, 4, 5, 6, or 7 occurrences of \d\w repeated.
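A quick sketch of the counted quantifier on made-up strings may help; the inputs below are hypothetical and chosen only to show the lower and upper bounds:

```matlab
pat = '(?:\d\w){3,7}' ;

% Only two digit+word-character pairs: below the minimum, so no match.
m1 = regexp( '1a2b', pat, 'match' ) ;

% Four pairs: within the 3..7 range, matched as a whole.
m2 = regexp( '1a2b3c4d', pat, 'match' ) ;

% Nine pairs: the match takes seven; the two left over are too few
% to form a second match.
m3 = regexp( '1a2b3c4d5e6f7g8h9i', pat, 'match' ) ;

disp( m2 ) ;   % '1a2b3c4d'
disp( m3 ) ;   % '1a2b3c4d5e6f7g'
```

The counts bound how many repetitions go into a single match. They do not by themselves cap how many separate matches regexp returns over the whole text.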
7 Comments
Walter Roberson
on 14 Nov 2013
Sorry, the look-ahead should be ?= rather than ?:
'((?:\w+=).*?)(?=MSN_RAM)'
The \w+= was just a sample pattern I tossed in for illustration; it matches a "word" followed by an equals sign.
The structure would be
(pattern_to_repeat)*?(?=pattern_to_stop_before)
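As an illustration of the lazy-match-plus-look-ahead idea, the sample pattern above can be run on a short made-up string (txt is hypothetical, and \w+= is still just a placeholder sub-pattern):

```matlab
% Hypothetical input with two 'MSN_RAM' markers.
txt = 'key=1 other=2 MSN_RAM tail=3 MSN_RAM' ;
pat = '((?:\w+=).*?)(?=MSN_RAM)' ;

% All matches: each token runs from a word followed by '=' up to,
% but not including, the next 'MSN_RAM'.
tok = regexp( txt, pat, 'tokens' ) ;
% tok{1}{1} is 'key=1 other=2 '
% tok{2}{1} is 'tail=3 '

% With 'once', only the tokens of the first match are returned.
first = regexp( txt, pat, 'tokens', 'once' ) ;
% first{1} is 'key=1 other=2 '

disp( tok{1}{1} ) ;
disp( tok{2}{1} ) ;
```

Because the look-ahead is zero-width, 'MSN_RAM' itself is never consumed, so it stays available as the stopping point without being part of the captured text.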