Using webread for a website and all of its hyperlinked subwebsites

Greetings,
I have used the webread function successfully on multiple websites, but I was wondering how to use it effectively when the website has hyperlinks to further webpages. You may have to go several levels deep in this process. Ultimately, I am looking for certain types of files, which are stored either on the main page (which I already know how to handle) or on these hyperlinked pages (which I do not).
Consider an example website:
url = "https://en.wikipedia.org/wiki/Quantum_mechanics";
A = webread(url);
If you call webread on this page, you can parse the result and find several .pdf files. However, the page also contains quite a few links, and I now want to visit each linked page and repeat the same process.
In short, I am stuck on how exactly you would perform this recursive process in MATLAB. Any help is greatly appreciated!

Answers (1)

Guillaume on 1 Jun 2019
Even for your initial task of finding the pdf links in an HTML page, I would argue that webread is not a good tool for the job, as the raw HTML it returns can be very different from the content actually rendered by a web browser (CSS, JavaScript, etc. will affect the rendering).
It's even less adequate for your task of following links. You really need something that parses the HTML, gives you access to the document object model (DOM), and lets you navigate the DOM instead. Unfortunately, as far as I know there's no such tool in MATLAB (as that's complex).
A poor man's option would be to search for '<a href' in your downloaded HTML:
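url = "https://en.wikipedia.org/wiki/Quantum_mechanics";
A = webread(url);  % raw HTML of the page, as text
% match every <a href=...> opening tag (only when href is the first attribute)
suburls = regexp(A, '<a\s+href=\s*["''][^>]*>', 'match');
islocal = contains(suburls, '#');  % links carrying a '#' fragment, e.g. anchors within a page
otherpages = suburls(~islocal);    % note: suburls, not suburl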
But this is very fragile (e.g. an attribute between the <a and the href will stop the link being detected). Again, if you want something reliable you'll need an HTML parser.
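For the recursive part of the question, here is a minimal sketch of how the regexp approach could be extended into a depth-limited, breadth-first crawl. It assumes the same webread plus regexp technique as above, so it inherits the same limitations: it only sees the raw HTML and is not a real HTML parser. The names crawlForPdfs, extractLinks and the fetch argument are hypothetical, made up for this illustration. fetch defaults to webread and only exists so that the crawl can be tried offline with canned pages. Page-relative links such as "foo.html" are simply dropped, and a pdf is only recognised when the link ends in .pdf.

```matlab
function pdfs = crawlForPdfs(startUrl, maxDepth, fetch)
% Breadth-first crawl from startUrl, following links up to maxDepth levels
% deep and collecting every link whose address ends in .pdf.
% fetch is a hypothetical hook (default: webread) that returns the HTML of a url.
    if nargin < 3, fetch = @webread; end
    visited  = strings(0, 1);
    pdfs     = strings(0, 1);
    frontier = string(startUrl);
    for depth = 0:maxDepth
        next = strings(0, 1);
        for page = frontier(:)'
            if ismember(page, visited), continue; end
            visited(end+1, 1) = page; %#ok<AGROW>
            try
                links = extractLinks(char(fetch(page)), page);
            catch
                continue   % unreachable page or non-text content: skip it
            end
            ispdf = endsWith(links, ".pdf", 'IgnoreCase', true);
            pdfs  = unique([pdfs; links(ispdf)]);
            next  = [next; links(~ispdf)]; %#ok<AGROW>
        end
        % pages to visit at the next level: new links not seen so far
        frontier = setdiff(unique(next), visited);
    end
end

function links = extractLinks(html, baseUrl)
% Pull the href value out of each <a> tag with a regular expression
% (still not a real parser) and turn the result into absolute urls.
    expr = '<a(?:\s[^>]*?)?\shref\s*=\s*["'']([^"''#]+)';
    tok  = regexp(html, expr, 'tokens', 'ignorecase');
    links = string([tok{:}]);
    links = unique(links(:));          % column vector, fragments already cut at '#'
    origin = string(regexp(char(baseUrl), '^https?://[^/]+', 'match', 'once'));
    scheme = extractBefore(origin, "//");          % e.g. "https:"
    isProtoRel = startsWith(links, "//");          % //host/path
    links(isProtoRel) = scheme + links(isProtoRel);
    isRootRel = startsWith(links, "/");            % /path on the same host
    links(isRootRel) = origin + links(isRootRel);
    links = links(startsWith(links, "http"));      % drop mailto:, javascript:, page-relative links
end
```

Saved as crawlForPdfs.m, it could be called as pdfs = crawlForPdfs("https://en.wikipedia.org/wiki/Quantum_mechanics", 1). Keep maxDepth small: on a site like Wikipedia the number of pages grows very quickly with each level, so you may also want to restrict the followed links to the starting host and pause between requests.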
