I want to extract the page buttons/widgets in a website using URLREAD.
I'd like to learn the common expression for the buttons/widgets that contain the page numbers of a catalog, e.g. like on this website. In this capture you can see the numbers I'd like to get using the URLREAD command.
Do you know how to do this? You'd help me A LOT if you can. I already tried printing everything into a .txt file, but I couldn't write the whole HTML code into it. My plan was to look for the common expression manually, but I couldn't print the whole output of URLREAD into the .txt file.
Thanks a lot,
Aquiles
3 Comments
Walter Roberson
on 14 Sep 2017
Yup, I just visited the page in Firefox and hit command-U and scrolled through the HTML.
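If you would still rather dump the HTML into a text file and inspect it there, here is a minimal sketch. The question doesn't show the code that failed, so this is only a guess at the cause: passing the HTML to FPRINTF as the format string can cut the output short, because any % or \ in the page is read as a format code. Writing the characters with FWRITE avoids that. The html string and the file name below are hypothetical stand-ins.

```matlab
% Stand-in for the output of urlread (hypothetical content).
html = '<html><body>100% of page 1 \ 83</body></html>' ;

% Hypothetical output location in the temporary folder.
outFile = fullfile( tempdir, 'page_source.txt' ) ;

fid = fopen( outFile, 'w' ) ;
assert( fid ~= -1, 'Could not open the output file for writing.' ) ;
fwrite( fid, html, 'char' ) ;   % raw write: % and \ are not interpreted
fclose( fid ) ;

% Read the file back to check that nothing was lost.
written = fileread( outFile ) ;
```
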
Accepted Answer
Cedric
on 14 Sep 2017
Edited: Cedric
on 14 Sep 2017
When you start clicking on pages, the page ID appears in the URL, e.g.
https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=17
where it is the last URL parameter. It is therefore easy to build the URL for a given page with SPRINTF, e.g. in a loop:
urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
for pageId = 1 : 83
    url = sprintf( '%s%d', urlBase, pageId ) ;
    html = urlread( url ) ;
    % Do something.
end
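The loop above hard-codes 83 pages. If you would rather read the page count from the page buttons themselves, which is what the question asks about, something like the sketch below might work. It assumes that the buttons are plain links whose href contains page=N. I have not checked the site's markup for that, so the html fragment here is hypothetical.

```matlab
% Hypothetical pagination fragment; the real markup may differ.
html = ['<a href="index.php?f=2222&amp;limit=100&amp;page=1">1</a>', ...
        '<a href="index.php?f=2222&amp;limit=100&amp;page=2">2</a>', ...
        '<a href="index.php?f=2222&amp;limit=100&amp;page=83">83</a>'] ;

% Capture the digits that follow every "page=" parameter.
tokens = regexp( html, 'page=(\d+)', 'tokens' ) ;

% Each token is a 1x1 cell holding a char array; convert to numbers.
pageIds = cellfun( @(t) str2double( t{1} ), tokens ) ;

% The largest page number seen is taken as the page count.
nPages = max( pageIds ) ;
```

On the real site you would take html from the first page and then loop with for pageId = 1 : nPages.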
Then maybe you want to parse the HTML to get the table data, and you can use regular expressions for this. Training with page 1:
pageId = 1 ;
url = sprintf( '%s%d', urlBase, pageId ) ;
html = urlread( url ) ;
pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
'(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
'</td>\s*<td>(?<currency>[^<]+)'] ;
data = regexp( html, pattern, 'names' ) ;
With that you get:
>> data
data =
  1×100 struct array with fields:
    ibSymbol
    externalUrl
    name
    symbol
    currency
>> data(1)
ans =
  struct with fields:
       ibSymbol: 'AT'
    externalUrl: 'https://misc.interactivebrokers.com/cstools/contract_info/index2.php?action=Details&site=G…'
           name: 'ATLANTIC POWER CORP'
         symbol: 'AT'
       currency: 'USD'
which is a struct array with the 100 entries of the table, including the URL of the page that you get in the popup window when you click on a product. So then you can work on parsing these pages:
html_ext = urlread( data(1).externalUrl ) ;
pattern_ext = '...' ;
data_ext = regexp( html_ext, pattern_ext, ... ) ;
I'll let you develop that part, though! Putting everything together, you get a crawler/parser for the whole thing:
urlBase = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=amex&showcategories=STK&p=&cc=&limit=100&page=' ;
pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
    '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
    '</td>\s*<td>(?<currency>[^<]+)'] ;
pattern_ext = '...' ;
for pageId = 1 : 83
    url = sprintf( '%s%d', urlBase, pageId ) ;
    html = urlread( url ) ;
    data = regexp( html, pattern, 'names' ) ;
    for productId = 1 : numel( data )
        html_ext = urlread( data(productId).externalUrl ) ;
        data_ext = regexp( html_ext, pattern_ext, ... ) ;
        % Do something.
    end
end
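Note that the loop above overwrites data on every iteration. If you want to keep the rows of all pages, one option is to collect the per-page struct arrays in a cell array and concatenate them at the end. Here is a sketch that runs offline: fetchPage and rowFormat are hypothetical stand-ins for urlread that return one fake table row per page, and the conversion to a table at the end is just one possible way to store the result.

```matlab
% Same pattern as in the code above.
pattern = ['>(?<ibSymbol>[^<]+)</td>\s*<td><a href="javascript:NewWindow\(''', ...
    '(?<externalUrl>[^'']+)[^>]+>(?<name>[^<]+)</a></td>\s*<td>(?<symbol>[^<]+)', ...
    '</td>\s*<td>(?<currency>[^<]+)'] ;

% Hypothetical stand-in for urlread: one fake table row per page, so the
% sketch runs offline. Use urlread( url ) instead on the real site.
rowFormat = ['<td>SYM%d</td><td><a href="javascript:NewWindow(''https://example.com/p%d'',', ...
    '''x'');">NAME %d</a></td><td>SYM%d</td><td>USD</td>'] ;
fetchPage = @(pageId) sprintf( rowFormat, pageId, pageId, pageId, pageId ) ;

nPages = 3 ;                     % 83 in the code above
pages  = cell( 1, nPages ) ;     % one struct array per page
for pageId = 1 : nPages
    html = fetchPage( pageId ) ;
    pages{pageId} = regexp( html, pattern, 'names' ) ;
end

% Concatenate the per-page results into a single 1xN struct array.
allData = [pages{:}] ;

% Optional: one table row per product.
T = struct2table( allData(:), 'AsArray', true ) ;
```

When crawling the real site, it is probably also wise to add a short pause between requests.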
That gives you a series of concepts/tools/examples that could be useful for what may come next in your developments.
PS: if you need to learn regular expressions in MATLAB, download the "MATLAB Programming Fundamentals" PDF document and go through the doc and examples on pages 2-42 to 2-73. It is a pretty good introduction/overview.
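To get a first feel for the named tokens used in the pattern above, here is a much smaller sketch. The html fragment is hypothetical and simplified, not taken from the real page. Each (?<fieldName>...) group becomes a field of the output struct, and each match becomes one element of the struct array.

```matlab
% Hypothetical two-row table fragment, not taken from the real page.
html = ['<tr><td>AT</td> <td>USD</td></tr>', ...
        '<tr><td>BTG</td> <td>CAD</td></tr>'] ;

% [^<]+ means "one or more characters that are not <", i.e. the cell text.
pattern = '<td>(?<symbol>[^<]+)</td>\s*<td>(?<currency>[^<]+)</td>' ;

% One struct element per match, one field per named token.
rows = regexp( html, pattern, 'names' ) ;

for k = 1 : numel( rows )
    fprintf( '%s trades in %s\n', rows(k).symbol, rows(k).currency ) ;
end
```
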