Using webread for a website and all of its hyperlinked subwebsites

Question

dsmalenb on 1 Jun 2019


Answered: Guillaume on 1 Jun 2019

Greetings,

I have used the webread function successfully for multiple websites, but I was wondering how you would use it effectively when the website you are considering has hyperlinks to further webpages. You may have to go several levels deep in this process. Ultimately, I am looking for certain types of files, which are stored either on the main page (which I already know how to handle) or on these hyperlinked pages (which I do not know how to handle).

Consider an example website:

url = "https://en.wikipedia.org/wiki/Quantum_mechanics";
A = webread(url);

If you run webread on it, you can parse the result and find several .pdf files. However, the page contains quite a few links, and you now want to visit each linked page and repeat the same process.

In short, I am stuck on how exactly to perform this recursive process in MATLAB. Any help is greatly appreciated!


Answer 1

Guillaume on 1 Jun 2019


Even for your initial task of finding the PDF links in an HTML page, I would argue that webread is not a good tool for the job, as the raw HTML it returns can be very different from the content actually rendered by a web browser (CSS, JavaScript, etc. will affect the rendering).

It's even less adequate for your task of following links. You really need something that parses the HTML, gives you access to the document object model (DOM), and lets you navigate the DOM instead. Unfortunately, as far as I know, there's no such tool in MATLAB (as that's complex).
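One possible exception, assuming Text Analytics Toolbox is installed (an assumption, since the question does not mention it): its htmlTree function parses HTML into a tree that findElement and getAttribute can query. A minimal sketch on a hypothetical HTML snippet, not a full DOM with script execution:

```matlab
% Requires Text Analytics Toolbox (R2018b or later).
% The HTML below is a made-up stand-in for the text returned by webread.
html = ['<html><body>' ...
        '<a class="ext" href="https://example.org/paper.pdf">paper</a>' ...
        '<a href="/wiki/Physics">Physics</a>' ...
        '<a name="top">no href here</a>' ...
        '</body></html>'];

tree = htmlTree(html);                   % parse the markup into a tree
anchors = findElement(tree, "a");        % every <a> element, whatever its attributes
hrefs = getAttribute(anchors, "href");   % <missing> where an element has no href
hrefs = hrefs(~ismissing(hrefs));        % keep only real links

% Links that point directly at PDF files.
pdflinks = hrefs(endsWith(lower(hrefs), ".pdf"));
```

This still only sees the raw HTML, so links that a browser would create through JavaScript remain invisible.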

A poor man's option would be to search for '<a href' in your downloaded HTML, but that would be very fragile:

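url = "https://en.wikipedia.org/wiki/Quantum_mechanics";
A = webread(url);  % raw HTML of the page as text
% Match opening <a href=...> tags (only when href directly follows the a)
suburls = regexp(A, '<a\s+href=\s*["''][^>]*>', 'match');
% Flag links containing '#', i.e. anchors within a page
islocal = contains(suburls, '#');
otherpages = suburls(~islocal);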

This breaks easily (e.g. an attribute between the a and href will stop the link from being detected). Again, if you want something reliable, you'll need an HTML parser.
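Short of a real parser, the regex approach can be made a little less brittle and extended to follow links. The sketch below is only that: it uses a slightly more tolerant pattern (other attributes may appear between the a and href) and follows links breadth-first to a fixed depth, remembering visited pages so that pages linking to each other do not cause an endless loop. The names fetchPage, site, maxDepth and hrefPattern are made up for illustration, and a small in-memory map of hypothetical example.org pages stands in for the web so the script runs offline. Only absolute and root-relative links are handled; other relative forms are skipped.

```matlab
% In-memory stand-in for the web (hypothetical pages). For real use, replace
% fetchPage with:  fetchPage = @(u) webread(u, weboptions('ContentType', 'text'));
site = containers.Map();
site('https://example.org/')  = '<a href="/a">A</a> <a class="doc" href="/files/x.pdf">x</a> <a href="#top">top</a>';
site('https://example.org/a') = '<a href="/">home</a> <a href="/b">B</a> <a href="https://example.org/files/y.pdf">y</a>';
site('https://example.org/b') = '<a href="/files/z.pdf">z</a>';
fetchPage = @(u) site(char(u));

startUrl = "https://example.org/";
maxDepth = 1;                                  % 0 = start page only
root = regexp(startUrl, '^https?://[^/]+', 'match', 'once');

% Capture the href value of an <a> tag, allowing other attributes before it.
hrefPattern = '<a\s(?:[^>]*?\s)?href\s*=\s*["'']([^"'']*)["'']';

queue = startUrl;  depths = 0;  head = 1;      % breadth-first work list
visited = strings(0, 1);
pdfs = strings(0, 1);

while head <= numel(queue)
    url = queue(head);  depth = depths(head);  head = head + 1;
    if ismember(url, visited), continue; end
    visited(end+1, 1) = url;

    try
        html = fetchPage(url);
    catch
        continue;                              % page could not be read: skip it
    end

    tokens = regexp(html, hrefPattern, 'tokens', 'ignorecase');
    hrefs = string(cellfun(@(t) t{1}, tokens, 'UniformOutput', false));
    hrefs = hrefs(:);
    hrefs = hrefs(~contains(hrefs, "#"));      % drop anchor links, as in the code above

    % Resolve root-relative links, then keep only absolute http(s) URLs.
    isroot = startsWith(hrefs, "/") & ~startsWith(hrefs, "//");
    hrefs(isroot) = root + hrefs(isroot);
    hrefs = hrefs(startsWith(hrefs, "http"));

    % Collect PDF links found on this page.
    ispdf = endsWith(lower(hrefs), ".pdf");
    pdfs = unique([pdfs; hrefs(ispdf)]);

    % Queue same-site, non-PDF pages if one more level is allowed.
    if depth < maxDepth
        next = hrefs(~ispdf & startsWith(hrefs, root + "/"));
        queue = [queue; next];
        depths = [depths; repmat(depth + 1, numel(next), 1)];
    end
end
```

With maxDepth = 1 this visits the start page and /a and collects x.pdf and y.pdf; z.pdf sits one level deeper and is not reached. Against a real site, swap fetchPage for the webread call shown in the comment and consider adding a pause between requests.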
