HTML Page source info

1 view (last 30 days)
b
b on 26 Nov 2020
Commented: Rik on 3 Dec 2020
Hello, many-a-times we come across a series of numbered webpages
basePage.html?page=2
basePage.html?page=3
and so forth, wherein there are several fields identified by their labels:
<h2 class="category-heading">Name1</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name2</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name3</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
and so on.
How can the "textOfInterest" of one particular parameter, say, Parameter2, of all the Name*, of all the pages,
basePage.html?page=1toInf
be taken (outputted/exported) into one text file, say, Parameter2.txt?
The "textOfInterest" is often alphanumeric with special characters !@#$% also.
Thanks.
  6 Comments
Rik
Rik on 1 Dec 2020
The goal of Bible downloader is religious (although you can use the text of a Bible translation for non-religous purposes as well of course), but the code isn't.
Did you try adapting any of the code? I'll post some code as an answer.

Sign in to comment.

Accepted Answer

Rik
Rik on 1 Dec 2020
One possibility with strfind:
close_div=strfinf(d,'</div>');
param=1;
pat=sprintf('<label>Parameter%d : </label> <div class="category-related">',param)
position=strfind(d,pat);
position=position+numel(pat);%this will be the start of your text of interest
texts=cell(size(position));
for n=1:numel(position)
end_of_text=close_div(close_div>position(n));
end_of_text=end_of_text(1)-1;
texts{n}=d(position(n):end_of_text);
end
Or with a regexp:
d=['<h2 class="category-heading">Name1</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name2</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name3</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'];
RE=['<label>Parameter\d',... % \d matches a single digit
' : </label> <div class="category-related">',...
'(',... % use parentheses to capture a token
'[^<]*',... % this matches any number of characters other than <
')',...
'</div>'];
t=regexp(d,RE,'tokens');
clc
celldisp(t)
You can also adapt the expression to look forward to match </div> so you can use .* instead of [^<]*
  8 Comments
Rik
Rik on 2 Dec 2020
Those arrows are probably newline characters. What release are you using?
I would suggest parsing each element separately. That way you can write an empty char or whatever you prefer in the email field for that person.

Sign in to comment.

More Answers (1)

b
b on 3 Dec 2020
That is exactly how I am doing it. By parsing it separately, there is no way to correlate which Name-field has the corresponding Mail-field missing. It parses all the Name-fields, then it parses all the mail-fields, as a sequential process.
What modification should be made in the codes, so that they print 'Not Found' when the mail field is missing in the corresponding iteration? Is there a way to get the index values of the missing Mail-fields?
  3 Comments
Rik
Rik on 3 Dec 2020
You're welcome (and thanks for the limerick XD).
If you have follow-up question, feel free to post a link to it here.

Sign in to comment.

Categories

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!