Perform Google Search in Matlab

Question

dsmalenb on 4 Jun 2019


Direct link to this question

https://in.mathworks.com/matlabcentral/answers/465408-perform-google-search-in-matlab

Answered: DGM on 18 Sep 2024

Hi!

I am trying to figure out how to perform a Google search automatically in MATLAB and save the results in an array.

Say I wanted to save the paths to the PDF files returned by the query "site:www.cnn.com filetype:pdf".

Some answers in the list should then be:

'http://www.cnn.com/2004/images/01/23/jackson.doc.pdf'

'http://www.cnn.com/StudioTour/pdf/websitefaq.pdf'

...

I have seen some scripts (links below) but unfortunately they are outdated or simply do not work. I am guessing it may be possible to do this but I cannot seem to figure it out. Any assistance would be very welcome!

Links:

https://www.mathworks.com/matlabcentral/fileexchange/41042-google

https://www.mathworks.com/matlabcentral/fileexchange/65829-google-search-answer-links

3 Comments

dsmalenb on 4 Jun 2019

Joel,

Thank you for your response. Perhaps I am missing something significant, but after parsing through the HTML I tried to compare the parts so I could make the necessary changes. However, it does not seem as if all the necessary parts of the link are available. I have included an example below. It is for the first article that the search displays.

We have:

The file type is in GREEN
The Article's title is in YELLOW
The parts of the link are in MAGENTA

I am missing "2004" and "01/23/" to complete the link. These parts do not seem to be listed in the HTML code.

Any idea how to get these pieces?

Joel Handy on 10 Jun 2019

After doing some more research, it looks like scraping (that's what we are doing, scraping Google's search results) is against their terms of service, and they actively attempt to thwart it. That would explain why some older tools are no longer maintained. I'm not a web expert. There appear to be ways of doing what you want, but I don't think any of them are simple.

Sorry I couldn't be more help.


Answer 1

Monika Phadnis on 27 Jun 2019


Direct link to this answer

https://in.mathworks.com/matlabcentral/answers/465408-perform-google-search-in-matlab#answer_381039

https://www.mathworks.com/help/textanalytics/ref/htmltree.getattribute.html?s_tid=doc_ta

I followed the example given at this link to extract data from the URL.

For the example you gave, I passed "http://www.google.com/search?q=cnn.com+filetype%3Apdf" as the URL argument to webread. Following the linked example then gives a string array of the href links, which you can parse for the required links.

In my output, the strings starting with "/url" held the search links.
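
To make the steps above concrete, here is a minimal sketch of the parsing stage. It assumes Text Analytics Toolbox is installed (for htmlTree, findElement and getAttribute) and that result links have the form "/url?q=<target>&<other parameters>". That link format is an assumption about Google's current markup, which can change at any time. The function name extractResultLinks is hypothetical.

```matlab
function links = extractResultLinks(html)
% extractResultLinks  Pull target URLs out of search-result page HTML.
%   html  - character vector or string holding the page source
%   links - string column vector of target URLs (possibly empty)
% Requires Text Analytics Toolbox (htmlTree, findElement, getAttribute).

tree    = htmlTree(html);                 % parse the page source
anchors = findElement(tree, "a");         % every <a> element in the page
hrefs   = getAttribute(anchors, "href");  % one href (or <missing>) per anchor
hrefs   = hrefs(~ismissing(hrefs));       % drop anchors that have no href

% ASSUMPTION: result links look like "/url?q=<target>&<other parameters>"
hrefs = hrefs(startsWith(hrefs, "/url?q="));

% Keep only the target: strip the redirect prefix and everything from
% the first "&" onward
links = regexprep(hrefs, '^/url\?q=([^&]*).*$', '$1');
links = links(:);
end
```

A possible way to call it, letting webread encode the query rather than building the URL by hand:

```matlab
wopt  = weboptions('ContentType','text');
html  = webread('http://www.google.com/search', 'q', 'site:www.cnn.com filetype:pdf', wopt);
links = extractResultLinks(html)
```

The extracted targets may still be percent-encoded, and Google may return a consent or CAPTCHA page instead of results, in which case the list comes back empty.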


Answer 2

KARTIK GURNANI on 21 May 2020


Direct link to this answer

https://in.mathworks.com/matlabcentral/answers/465408-perform-google-search-in-matlab#answer_435378


This does seem to be true.

P.S.:

Microsoft introduced this feature on Bing, well before Google did, to prevent other search engines from copying its data (search results).

It seems we would be violating the terms of service of both Google and Bing.

I tried, and I got partial results.

The best possible way would be to use MATLAB to build a neural network that runs search queries from a system with a dynamic IP.

@AndrewNg might shed some better light on this.

There is a possible solution to this, but the biggest issue of all is that Google and Bing (Microsoft) might label your IP address as spam or as a bot.

That could mean no Netflix, no Hulu, and no other streaming services.

You might even get locked out of reading news on certain websites.

Even for simple web searches, you might end up solving reCAPTCHA or its newer image-based version.

A dynamic IP will help in this case, but check with your ISP before attempting this.

You might lose your security, or your plan may get suspended.

>> It will take the ISP a lot of man-hours to get that single IP cleaned up, that is, removed from the blacklists of most filters.

>> You would mostly increase their headache.

Note:
I have created a MATLAB script that can run your search query.
I am not sure about posting it here.

The limitation is that it can only run a single search query.

It works, but crawling takes a while, and it then uses PostScript to convert the results to PDF.
It works better when saving to an HTML file with images.

If anyone would like the script, please let me know.

The script is for educational purposes only.

Do not use it to violate the terms of service of any organization.

Good Luck & Stay Safe,

Kartik

2 Comments

David Chen on 27 May 2020

Edited: David Chen on 27 May 2020

"If anyone would like the script , please let me know."

I would like a copy.

Dwan Andrés Mahecha Vallejo on 17 Sep 2024

Please.


Answer 3

DGM on 18 Sep 2024


Direct link to this answer

https://in.mathworks.com/matlabcentral/answers/465408-perform-google-search-in-matlab#answer_1518440


Here's a basic example using Google's Custom Search JSON API. I'm pretty sure there are other ways of doing this, but the docs are a confusing maze. Last I checked, the DuckDuckGo (DDG) API wasn't complete enough to be useful for anything.

% your query string
query = '+site:www.cnn.com banana';
% your google custom search key, etc
% https://developers.google.com/custom-search/v1/overview
% https://developers.google.com/custom-search/v1/introduction
% https://developers.google.com/custom-search/docs/tutorial/creatingcse
% free accounts are limited to 10 results per query, 100 queries per day
% there are also rate limits
apikey = 'your_key_goes_here'; % API key
cx = 'your_cx_goes_here'; % CSE identifier
% search setup
wopt = weboptions('contenttype','json');
url = ['https://customsearch.googleapis.com/customsearch/v1?cx=' cx '&key=' apikey '&q=' query '&num=10'];
% try to perform the search
try
    S = webread(url,wopt);
catch
    % this might also happen if API call is broken somehow
    fprintf('Connection error.  Web search failed.\n')
    return;
end
% extract the urls
if isfield(S,'items')
    items = S.items;
    % depending on the results, items is either a struct array 
    % or a cell array of dissimilar structs
    if isstruct(items)
        urllist = {items.link}.';
    else
        urllist = cellfun(@(x) x.link,items,'uniform',false);
    end
else
    fprintf('No results.\n')
    return;
end
urllist
urllist = 10x1 cell array
    {'https://www.cnn.com/2020/05/02/health/banana-bread-pandemic-baking-wellness-trnd/index.html'                }
    {'https://www.cnn.com/2020/02/22/us/banana-label-collector-becky-martz-trnd/index.html'                       }
    {'https://www.cnn.com/2024/03/25/business/trader-joes-banana-price-increase/index.html'                       }
    {'https://www.cnn.com/style/article/student-eats-maurizio-cattelan-banana-art-south-korea-intl-hnk/index.html'}
    {'https://www.cnn.com/travel/article/banana-island-qatar/index.html'                                          }
    {'https://www.cnn.com/style/article/david-datuna-banana-art-basel-trnd/index.html'                            }
    {'https://www.cnn.com/2016/10/25/health/banana-extinction/index.html'                                         }
    {'https://www.cnn.com/2015/07/22/africa/banana-panama-disease/index.html'                                     }
    {'https://www.cnn.com/style/article/banana-artwork-eaten-scli-intl/index.html'                                }
    {'https://www.cnn.com/2021/12/09/entertainment/the-masked-singer-reveal/index.html'                           }
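
Since each request returns at most 10 results, collecting a longer list (such as the PDF links asked about in the question) means paging. The sketch below is one way to do that, not part of the answer above. It relies on the API's start query parameter, which sets the index of the first result to return. The names collectLinks, fetchPage and maxResults are hypothetical. The page fetcher is passed in as a function handle so that the paging logic can be checked without a network connection.

```matlab
function urllist = collectLinks(fetchPage, maxResults)
% collectLinks  Gather result links across several pages of API results.
%   fetchPage  - function handle: takes a 1-based start index and returns
%                the decoded JSON response as a struct
%   maxResults - stop once this many links have been collected
%   urllist    - cell column vector of link character vectors

pageSize = 10;            % the API returns at most 10 items per request
urllist  = {};
for start = 1:pageSize:maxResults
    S = fetchPage(start);
    if ~isfield(S,'items') || isempty(S.items)
        break;            % no (more) results
    end
    items = S.items;
    % items is either a struct array or a cell array of dissimilar structs
    if isstruct(items)
        page = {items.link}.';
    else
        page = cellfun(@(x) x.link, items, 'UniformOutput', false);
    end
    urllist = [urllist; page(:)]; %#ok<AGROW>
    if numel(page) < pageSize
        break;            % a short page means this was the last one
    end
end
urllist = urllist(1:min(numel(urllist), maxResults));
end
```

A possible call, reusing apikey and cx from the example above and letting webread encode the query parameters:

```matlab
query = 'site:www.cnn.com filetype:pdf';
base  = 'https://customsearch.googleapis.com/customsearch/v1';
wopt  = weboptions('ContentType','json');
fetch = @(start) webread(base,'cx',cx,'key',apikey,'q',query,'num',10,'start',start,wopt);
urllist = collectLinks(fetch, 30);   % uses up to 3 queries of the daily quota
```

Each page counts as one query against the daily limit, and as far as I know the API will not return more than 100 results in total for a single search, so larger values of maxResults are unlikely to help.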
