How can I get all sub-URLs from a host?

Hi,
I need to read all the data from a URL, but the problem is that my URLs look like this:
"http://abc.efg.hij/klmnopqr%5stu.nsf/0/00DB180072B825?OpenDocument" or
"http://abc.efg.hij/klmnopqr%5stu.nsf/a33b09a7270068d/cc24f38720!OpenDocument"
and the last part (the serial number) changes randomly. I just want a way to go to the main site (for example, "http://abc.efg.hij/klmnopqr%5stu.nsf/") and get all the URLs in it. (There are about a thousand sub-URLs on that main site.)
Please let me know if anyone can help.
Thank you in advance
Tara

Answers (2)

In the general case, enumerating the URLs that will be accepted by a site is not possible. URLs are evaluated by programs on the server, and need not correspond to an actual file.
You can urlread() or urlwrite() a specific URL and then parse the returned HTML to find anchors such as <a HREF=", extract each one, find the unique subset, and then iterate over each of them asking to fetch it in turn.
Thank you Walter,
but my problem is that, because there are a lot of links I need to follow, I don't have enough time to go through all the sub-links on the main site and copy each address by hand.

1 Comment

You can urlread() or urlwrite() a specific URL (i.e., the main site), and then parse the returned HTML to find anchors such as <a HREF=", extract each one, find the unique subset, and then iterate over each of them asking urlread() or urlwrite() to fetch each in turn.
For example, an approximation (one that does not take comments into account) would be:
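% Fetch the main page as text
SiteContents = urlread('http://abc.efg.hij/klmnopqr%5stu.nsf/');
% Crude match: every absolute http:// URL, up to the next double quote
ContainedURLs = regexp(SiteContents, 'http://[^"]+', 'match');
UniqueURLs = unique(ContainedURLs);
% Preallocate, then fetch each linked page in turn
MinedData = cell(size(UniqueURLs));
for K = 1 : numel(UniqueURLs)
    MinedData{K} = urlread(UniqueURLs{K});
end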
But watch out in case a URL is relative instead of absolute, and watch out in case the site links to itself.
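One rough way to handle both cautions is sketched below. It is only an illustration: the link list is made up from the URLs in the question, the variable names (BaseURL, RootURL, Resolved, ToFetch) are hypothetical, and the resolution rule is deliberately simplified (it does not handle "../" segments or a <base> tag).

```matlab
% Hypothetical inputs: the main page and some links as they might appear in it
BaseURL = 'http://abc.efg.hij/klmnopqr%5stu.nsf/';
Links = {'0/00DB180072B825?OpenDocument', ...
         '/klmnopqr%5stu.nsf/a33b09a7270068d/cc24f38720!OpenDocument', ...
         'http://abc.efg.hij/klmnopqr%5stu.nsf/'};

% Scheme and host of the base, here 'http://abc.efg.hij'
RootURL = regexp(BaseURL, '^https?://[^/]+', 'match', 'once');

Resolved = cell(size(Links));
for K = 1 : numel(Links)
    ThisLink = Links{K};
    if ~isempty(regexp(ThisLink, '^https?://', 'once'))
        Resolved{K} = ThisLink;              % already absolute
    elseif strncmp(ThisLink, '/', 1)
        Resolved{K} = [RootURL ThisLink];    % relative to the host root
    else
        Resolved{K} = [BaseURL ThisLink];    % relative to the base directory
    end
end

% Drop duplicates and the page's link back to itself
ToFetch = setdiff(unique(Resolved), {BaseURL});
```

If you follow links more than one level deep, the same setdiff() idea can be applied against a growing list of already-visited URLs so that the loop does not fetch the same page twice.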
The 'http://[^"]+' pattern expects each URL to extend to the first " that follows it. And the crude code here does not attempt to distinguish image tags from anchors: that's an enhancement for you to work out.
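As a starting point for that enhancement, here is a minimal sketch that captures only the href value of anchor tags. The HTML string is a made-up sample standing in for what urlread() would return, the variable names are hypothetical, and the pattern assumes double-quoted href attributes, so treat it as an approximation rather than a real HTML parser.

```matlab
% Hypothetical sample page; in practice this would come from urlread()
SampleHTML = ['<a href="0/00DB180072B825?OpenDocument">Doc 1</a>' ...
              '<img src="http://abc.efg.hij/logo.gif">' ...
              '<A HREF="http://abc.efg.hij/other.nsf/0/AA11?OpenDocument">Doc 2</A>' ...
              '<a href="0/00DB180072B825?OpenDocument">Doc 1 again</a>'];

% Capture the quoted href value of <a ...> tags only, ignoring case
Tokens = regexpi(SampleHTML, '<a\s[^>]*?href\s*=\s*"([^"]*)"', 'tokens');

% Each token is a 1x1 cell; unwrap them into a flat cell array of strings
AnchorURLs = cellfun(@(T) T{1}, Tokens, 'UniformOutput', false);
UniqueAnchorURLs = unique(AnchorURLs);
```

Because the pattern requires the tag to start with "<a" followed by whitespace, the src of the <img> tag is skipped, and unique() collapses the repeated link.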

Asked: 16 Jun 2015

Commented: 16 Jun 2015