Image extraction from webpage
There are serial-numbered webpages (some of these numbers don't exist), which have images of interest at one particular location in the html file:
<h4 id="COMPANY">COMPANY</h4>
<p><img class="image" border="0" src="/resources/companyName_company.jpg"/></p>
The companyName is different in each numbered webpage.
However, urlwrite gives only html pages without these images. When opened in browser, these images are absent. Since it is these images that are of interest, and none of the other content of the webpage, the whole purpose is defeated. How can this be resolved ? Is there a way to get only these images, and nothing else from the webpage ?
2 Comments
b
on 27 Apr 2020
No, the html does contain these lines. But when opened in browser, there is no image. The heading in between the <h4></h4> appears correctly. The image part, which should be just below it, does not appear.
Everything else on the webpage is unneeded information. Unable to figure out how to filter that out and extract only this image part.
This structure is unique in the html pages. Every numbered html page has the structure of <heading> immediately followed by <image> .
Accepted Answer
Rik
on 27 Apr 2020
The HTML file doesn't contain the image. It contains a relative path to the image. Because you don't have the image file in the location the HTML file specifies, the image doesn't show up. You need to use the three-step process below to get the image file.
- download the HTML file
- determine '/resources/companyName_company.jpg'
- download the image from website.com/resources/companyName_company.jpg
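The three steps above can be sketched offline as follows. This is only an illustration under assumptions: the `html` variable stands in for the page source that webread would return, the company name `acme` is hypothetical, and a regular expression is used here instead of any particular parsing approach from this thread.

```matlab
% Step 1 (stand-in): page source as a char array, as webread would return it.
% The content and the company name 'acme' are hypothetical.
html = ['<h4 id="COMPANY">COMPANY</h4>' char(10) ...
        '<p><img class="image" border="0" src="/resources/acme_company.jpg"/></p>'];
base_url = 'https://companyNameWebsite.org';   % site root used in this thread

% Step 2: capture the src attribute of the first <img> after the COMPANY heading.
expr = '<h4 id="COMPANY">.*?<img[^>]*src="([^"]+)"';
tok  = regexp(html, expr, 'tokens', 'once', 'ignorecase');
if isempty(tok)
    partial_url = '';          % heading/image pair not found on this page
else
    partial_url = tok{1};      % relative path, e.g. /resources/..._company.jpg
end

% Step 3: turn the relative path into an absolute address.
img_url = [base_url partial_url];
% websave('acme_company.jpg', img_url);   % uncomment to perform the download
```

The websave call is left commented out so the sketch runs without network access.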
18 Comments
Rik
on 27 Apr 2020
If the text around that is the same every time it should not be too difficult to write some code that will find it. You don't even need to store the HTML file, you can leave it as a char array if you use webread (or urlread on older releases).
Rik
on 28 Apr 2020
Where would you start? I wasn't born knowing Matlab, so you can learn it too. How can you remove specific characters from a char array? (hint: strrep) How can you find the position of a specific string in an array? (hint: strfind)
You didn't share any details about the rest of the HTML document, so you will have to do this on your own. Show what you try and try to explain why it fails. That makes it easier to guide you.
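As a small illustration of the two hinted functions, the sketch below applies strfind and strrep to a single line of HTML. The sample line is modelled on the one in the question; the company name `acme` is a made-up placeholder.

```matlab
% One line of HTML as a char array (hypothetical company name 'acme').
txt = '<p><img class="image" border="0" src="/resources/acme_company.jpg"/></p>';

% strfind returns the start index of every occurrence of the search text.
idx_src   = strfind(txt, 'src="');   % position of the 's' in src="
idx_quote = strfind(txt, '"');       % positions of all double quotes

% The path starts right after src=" and ends before the next double quote.
first = idx_src(1) + numel('src="');
last  = idx_quote(find(idx_quote >= first, 1)) - 1;
partial_url = txt(first:last);

% strrep replaces every occurrence of a piece of text (here: with nothing).
company = strrep(strrep(partial_url, '/resources/', ''), '_company.jpg', '');
```

Here `partial_url` holds the relative image path and `company` holds only the part of the file name that varies between pages.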
Rik
on 28 Apr 2020
This is what I did:
for n=1:9
    %for n=2:9
    %for n=1:1000
    %n=0 and 1 don't exist. For those that don't exist, it gives the error:
    %Error using urlreadwrite (line 98)
    %Error downloading URL. Your network connection may be down or your proxy settings improperly configured.
    %Error in urlwrite (line 52)
    %[f,status] = urlreadwrite(mfilename,catchErrors,url,filename,varargin{:});
    try
        urlwrite(sprintf('https://companyNameWebsite.org/%i?outline=by_category',n), ...
            sprintf('company%i.html',n));
        n
    catch
        %Nothing to do if n-value company webpage doesn't exist & error shows
    end
end
How to get from here to getting images at
because the companyName_company are not numerical. Even in this term, only the 'companyName' varies, while the '_company' is the same.
Rik
on 28 Apr 2020
Step by step. First read the contents of the web page as a char array, then extract the image url.
You should also first do it for one page, then proceed to process all.
for n=2%1:1000
    %read HTML
    url=sprintf('https://companyNameWebsite.org/%i?outline=by_category',n);
    try
        data=webread(url);
    catch ME
        %check if the error is what you expect for a non-existent page
        if ~WhatYouExpect(ME)
            rethrow(ME)
        else
            continue%go to next iteration
        end
    end
    %now use strrep and strfind to find '</h4><p><img class'
    %store the image
    img_url=sprintf('%s%s','https://companyNameWebsite.org',partial_url);
    websave(___)
end
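One possible way to fill in the WhatYouExpect placeholder is sketched below. It rests on an assumption that should be verified on your own release: that webread reports a missing page with an error identifier containing 'HTTP404' (for example MATLAB:webservices:HTTP404StatusCodeError). Inspect ME.identifier for a page number you know does not exist before relying on it.

```matlab
% Hypothetical implementation of the WhatYouExpect placeholder.
% Assumption: a non-existent page yields an identifier containing 'HTTP404'.
WhatYouExpect = @(ME) ~isempty(strfind(ME.identifier, 'HTTP404'));

% Offline check with hand-made exceptions (no network access needed).
missing_page = MException('MATLAB:webservices:HTTP404StatusCodeError', 'Not Found');
other_error  = MException('MATLAB:webservices:Timeout', 'Timed out');

is_missing = WhatYouExpect(missing_page);   % page does not exist: skip it
is_other   = WhatYouExpect(other_error);    % anything else: rethrow
```

Matching on the identifier rather than the message text keeps the check independent of the display language of the error message.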
b
on 28 Apr 2020
Even after sharing what I have done, there is no help from your end. If I had that much matlab knowledge, I wouldn't be undergoing this humiliation of some 'maestro' sitting at the helm and giving guidance as if to a school kid. If I had to learn on my own, I wouldn't be posting it here, right ? Please do not respond to this or any of my other questions any further.
Star Strider
on 28 Apr 2020
b — This is not a trivial problem. There are no straightforward, general solutions.
Rik
on 28 Apr 2020
I didn't mean to come across as humiliating you. As Star Strider mentions: this isn't simple. Difficult problems should be cut down to smaller, solvable problems. I can't solve your question all at once, I can only describe what steps you need to take.
Since you don't share the HTML itself I can't help you with specific code. I did try to help you. You shared some code, in which you were using some functions; I posted code with different functions and a slightly different structure. I'm fine with it if you don't want any further help from me, but if you only want that because you feel I'm belittling you, I want you to know that is not my intention.
Intentions are difficult to judge on the internet, because either or both may not be native speakers of the language they use to communicate. Even if that were the case, there can still exist a socio-cultural difference. And of course it is difficult to convey tone in text.
b
on 28 Apr 2020
What difference does an example URL make ? How can I share HTML ? I see no bearing of the actual HTML to this problem. How can you work in the software field/industry and be unaware of Confidentiality Clauses and Agreements ? You say that you posted a code with different functions and slightly different structure, but apart from one commented line, can you tell me what is new in your code that is not already there in the example m file ? How can you frustrate people who are already grappling with problems, let alone programming skills ?
Rik
on 28 Apr 2020
The HTML has everything to do with it. If you want explicit help we will need explicit data. An explicit example will allow me to write some code that will help you read the image URL. (side note: you never mentioned an NDA before, so why would you assume I don't know about those? And are you sure you are even allowed by that NDA to post this question?) Do you think it is a smart move to tell people they frustrate you when you are the one asking for help?
Your code has a fundamentally different structure than mine, but we don't have to argue about that.
You can leave the WhatYouExpect function blank if you like, but if you want help on the part with %now use strrep and strfind to find '</h4><p><img class', you will have to provide me with an example file. It doesn't have to be an actual file. It just has to be real enough for a parser function to work and to show you how you can improve/alter it so it works on the real files.
Rik
on 29 Apr 2020
Despite what you said, the pattern you mentioned is not unique. Below is my guess for your pattern. Modify as needed.
for n=2%1:1000
    %read HTML
    %url=sprintf('https://companyNameWebsite.org/%i?outline=by_category',n);
    url='https://www.mathworks.com/matlabcentral/answers/uploaded_files/288498/company3a.txt';
    try
        data=webread(url);
    catch ME
        %check if the error is what you expect for a non-existent page
        if ~WhatYouExpect(ME)
            rethrow(ME)
        else
            continue%go to next iteration
        end
    end
    %split the page so that every cell (after the first) starts right after '<h4'
    t=strsplit(data,'<h4');
    %note: strcmp is case-sensitive, so the pattern must match the case used in the HTML
    pattern=' id="company"';numel_pattern=numel(pattern);
    partial_url='';%set a default in case of failure
    for k=1:numel(t)
        try
            if strcmp(t{k}(1:numel_pattern),pattern)
                %first character after src="
                ind1=strfind(t{k},'src="')+4;ind1=ind1(1)+1;
                %last character before the closing double quote
                ind2=strfind(t{k},'"');
                ind2=ind2(ind2>ind1);ind2=ind2(1)-1;
                partial_url=t{k}(ind1:ind2);
            end
        catch
            %line too short, or url reading failed
        end
    end
    if isempty(partial_url)
        %Should the code throw an error here? Warn? Simply continue?
    end
    %store the image
    img_url=sprintf('%s%s','https://companyNameWebsite.org',partial_url);
    websave(___)
end
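The websave call above is left as a placeholder. One way it could be completed is sketched here, assuming the local file should keep the name it has on the server; the value of `partial_url` is a hypothetical result of the parsing loop, and the choice to warn and skip when nothing was found is only one of the options raised in the code comment.

```matlab
partial_url = '/resources/acme_company.jpg';   % hypothetical parsing result
base_url    = 'https://companyNameWebsite.org';

if isempty(partial_url)
    % One possible policy: warn and skip the page instead of stopping.
    warning('No image URL found; skipping this page.');
else
    % Reuse the server-side file name for the local copy.
    [~, name, ext] = fileparts(partial_url);
    local_name = [name ext];
    img_url    = [base_url partial_url];
    % websave(local_name, img_url);   % uncomment to download the image
end
```

websave takes the target file name first and the URL second; the call is commented out so the sketch runs without network access.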
Rik
on 29 Apr 2020
Glad to be of help.
Since you suggested to be bound by an NDA not to provide more details I don't see what adding "(subject to testing)" is trying to accomplish. Obviously it works on a recent release of Matlab for this example, otherwise I wouldn't have posted it. The only thing it currently accomplishes is sounding condescending.
b
on 30 Apr 2020
This is working well. I can clearly see now how strsplit, strfind and strcmp can be used. After experimenting with this code on a few different configurations, the run-time is also reasonable - a few hours for 1000 cases. Another thing was that the file size of the image file that it retrieves is exactly the same as the original file. This may not be surprising to an experienced coder, but something could be done as a modification to bring down the run-time as well as disk-space so that the user gets an option to vary the file size of the retrieved file. If the original image file is 4MB, but maybe only 60kB suffices, then that is a reduction by ~70 times. This will translate to an almost equivalent reduction in the run-time and surely the same amount of reduction in disk-space. Instead of 4GB of space, only 60MB will be used. The trick will be in the amount of processing time taken by the dimension or the size reducing algorithm.
But that goes beyond the purview of this question thread.
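For reference, a minimal sketch of shrinking an image after it has been saved is given below. It is an illustration under assumptions, not part of the accepted answer: a random test image stands in for a downloaded file, and the `step` and `quality` values are arbitrary. Note that this saves disk space only; the full-size file still has to be transferred first, so the download time would likely stay about the same unless the server offers smaller versions.

```matlab
% Stand-in for a downloaded image (random pixels), written to a temporary file.
in_file  = [tempname '.jpg'];
out_file = [tempname '.jpg'];
imwrite(uint8(randi([0 255], 400, 600, 3)), in_file, 'Quality', 95);

step    = 4;    % keep every 4th pixel in each direction (assumed value)
quality = 60;   % JPEG quality between 0 and 100 (assumed value)

img   = imread(in_file);
small = img(1:step:end, 1:step:end, :);   % crude subsampling, no toolbox needed
imwrite(small, out_file, 'Quality', quality);

% Compare the file sizes on disk.
in_info  = dir(in_file);
out_info = dir(out_file);
fprintf('%d bytes -> %d bytes\n', in_info.bytes, out_info.bytes);
```

Plain subsampling is used so the sketch has no toolbox dependency; if imresize is available it should give smoother results than dropping pixels.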