Extracting Data field of a Series in HTML file

13 views (last 30 days)
b
b on 7 Apr 2020
Commented: b on 10 May 2020
In an HTML file, there is a section like this :
series: [{
name: 'Numbers',
color: '#33CCFF',
lineWidth: 5,
data: [45,78,84,91,111,125,178,231,274,283,303,333] }],
How to extract the 'data' field into an array in a matlab code ?
There are many such series' in that same HTML file with different 'name' fields. For example, name: 'Total Value', 'Log Scale', 'Base Value' etc.
  4 Comments
b
b on 7 Apr 2020
I am trying to do the following:
url="c:\finCase\case1.html";
code=webread(url);
tree=htmlTree(code);
selector="series";
subtrees=findElement(tree,selector);
The subtrees field is empty whereas it should have all the series' corresponding to various names ('Numbers', 'Total Value', 'Log Scale', 'Base Value' etc).

Sign in to comment.

Accepted Answer

per isakson
per isakson on 7 Apr 2020
Edited: per isakson on 9 May 2020
I misunderstood your question. This is a bit of overkill.
Assumptions
  • the string, series:, always indicates the start of a block of interest
I created a sample file, cssm.txt, which I uploaded. (Matlab Answers doesn't allow the extension .html ).
This script reads all blocks
%%
chr = fileread('cssm.txt');
cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
%%
len = length( cac );
series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] );
for jj = 1 : len
txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
txt(txt== '''') = [];
series(jj).name = matlab.lang.makeValidName( txt );
txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
txt(txt== '''') = [];
series(jj).color = txt;
txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
series(jj).lineWidth = str2double( txt );
txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
series(jj).data = str2num( txt ); %#ok<ST2NM>
end
and extract "series which matches name='Numbers'. Not the other series'."
>> series(strcmp({series.name},'Numbers')).data
ans =
45 78 84 91 111 125 178 231 274 283 303 333
In response to comment below
Assumptions
  • the string, series:, always indicates the start of a block of interest
  • the string, }], indicates the end of a block of interest
  • all html-files of interest are named index.html
  • all files named index.html are of interest
  • all html-files of interest are in subfolders under a root-folder, ...\finCase
  • every html-file, index.html, contains exactly one block that has a specific value of the field name:, e.g. Numbers
The overkill is still there. However, reading and parsing four html-files (copies of cssm.txt ) takes less than 10ms.
Try
>> client_data = read_client_data( 'd:\m\cssm\finCase', 'index.html', 'Numbers' )
client_data =
4×2 cell array
{'anderson' } {1×9 double}
{'kim-j-clijsters'} {1×10 double}
{'paul-judd' } {1×11 double}
{'simmi' } {1×12 double}
>>
where (in one m-file)
function client_data = read_client_data( root, file, name )
sad = dir( fullfile( root, '**', file ) );
len = length( sad );
client_data = cell( len, 2 );
for jj = 1 : len
cac = strsplit( sad(jj).folder, filesep );
client = cac{end};
series = read_one_file_( fullfile( sad(jj).folder, sad(jj).name ) );
client_data(jj,:) = { client, series(strcmp({series.name},name)).data };
end
end
function series = read_one_file_( file )
chr = fileread( fullfile( file ) );
cac = regexp( chr, '(?<=series\:)[^\}]+\}\],', 'match' );
len = length( cac );
series(1,len) = struct( 'name','', 'color','', 'lineWidth',[], 'data',[] );
for jj = 1 : len
txt = regexp( cac{jj}, '(?<=name\:)[^,]+', 'match', 'once' );
txt(txt== '''') = [];
series(jj).name = strtrim( txt );
txt = regexp( cac{jj}, '(?<=color\:)[^,]+', 'match', 'once' );
txt(txt== '''') = [];
series(jj).color = txt;
txt = regexp( cac{jj}, '(?<=lineWidth\:)[^},]+', 'match', 'once' );
series(jj).lineWidth = str2double( txt );
txt = regexp( cac{jj}, '(?<=data\:)[^}]+', 'match', 'once' );
series(jj).data = str2num( txt ); %#ok<ST2NM>
end
end
TODO: add error handling and comments
  10 Comments
b
b on 10 May 2020
LOL on the tragedy of being Null.
The code section works nicely with output as needed.
Indebted once again.

Sign in to comment.

More Answers (0)

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!