How to read files from a particular website?

Pouya on 3 Mar 2022
Answered: VINAYAK LUHA on 4 Jan 2024
Hello,
I'm having a problem with MATLAB not recognizing the files at this link ( https://swarm-diss.eo.esa.int/#swarm/Level1b/Entire_mission_data/MAGx_HR/Sat_A )
There should be multiple files, each about 300 MB, with names starting with "SW_OPER_MAGA_HR". Instead, MATLAB reads something else: a "1x136910 char" array.
Please see the code below:
clc
clear
web='https://swarm-diss.eo.esa.int/#swarm/Level1b/Entire_mission_data/MAGx_HR/Sat_A';
str=webread(web);
fn=regexpi(str,'SW[A-Z_0-9]+.zip','match');
for k=1:size(fn,2)
    file=fn{k};
    unzip([web file(8:9)]);
end
Thank you in advance.
  1 Comment
Ive J on 3 Mar 2022
Your URL is protected by cookies, so I guess your best chance is to try Python. MATLAB is quite immature for web scraping.


Answers (1)

VINAYAK LUHA on 4 Jan 2024
Hello Pouya,
I understand that you want to download the files listed in a table on that website using MATLAB, and that your attempt with the "webread" function returned a character array instead.
That result is expected: webread returned the HTML source of the page as a character vector.
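One likely reason the returned HTML does not contain the file names: under the standard URL rules, everything after the "#" is a fragment, which is handled on the client side and is not part of the HTTP request. The sketch below only splits the URL from the question to show the two parts. It makes no network call, and the variable names requestedUrl and fragment are mine, chosen for illustration.

```matlab
% The URL from the question; everything after '#' is the fragment.
web = 'https://swarm-diss.eo.esa.int/#swarm/Level1b/Entire_mission_data/MAGx_HR/Sat_A';

% Split at the '#' separator (this URL contains exactly one).
parts = strsplit(web, '#');
requestedUrl = parts{1};   % the part a server would receive
fragment     = parts{2};   % interpreted by the page's script in a browser

fprintf('Requested from server: %s\n', requestedUrl);
fprintf('Fragment (not sent):   %s\n', fragment);
```

So the server most likely sees only the site root, and the folder path is resolved later by script running in the browser.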
The table on the website is generated dynamically, so its contents are not present in the HTML that webread receives, and webread alone may not be the right tool for this task. Instead, consider saving the webpage as an HTML file and then using htmlTree (part of Text Analytics Toolbox) to extract the necessary links from the saved HTML source.
Here is code, with explanatory comments, showing how to proceed:
% Read the HTML content from a saved file
html = fileread('htmlFile.html');
% Parse the HTML content to create a tree structure
tree = htmlTree(html);
% Locate all 'a' (anchor) elements within the parsed HTML tree
anchorElements = findElement(tree, "A");
% Retrieve the 'href' attributes from the identified anchor elements
hrefAttributes = getAttribute(anchorElements, "href");
% Identify the 'href' attributes that include the download keyword
downloadLinks = hrefAttributes(contains(hrefAttributes, "?do=download"));
% Iterate over the first 10 download links (or fewer if there are not as many)
for i = 1:min(10, numel(downloadLinks))
    % URL-decode each download link to get a human-readable format
    decodedText = urldecode(downloadLinks(i));
    % Split the decoded URL by '/' to isolate the file name
    parts = strsplit(decodedText, '/');
    % Extract the file name, which is the last segment after splitting
    lastPart = parts(end);
    % Formulate the full download URL by adding the base URL to the relative path
    modifiedLink = "https://swarm-diss.eo.esa.int/" + downloadLinks(i);
    % Download the file using websave and name it with the extracted file name
    websave(lastPart{1}, modifiedLink);
end
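To make the file-name step in the loop concrete, here is a small sketch that runs it on a single link without downloading anything. The href in exampleLink is hypothetical and made up for illustration; the real links on the site may be structured differently, so check one from your saved HTML file first.

```matlab
% Hypothetical href, for illustration only; the real links may look different.
exampleLink = '?do=download&file=swarm%2FLevel1b%2FSW_OPER_MAGA_HR_EXAMPLE.zip';

% Decode percent-escapes such as %2F back into '/'.
decoded = urldecode(exampleLink);

% Split on '/' and keep the last segment as the local file name.
parts = strsplit(decoded, '/');
fileName = parts{end};

disp(fileName)
```

If the last segment of a real link is not a plain file name (for example, if further query parameters follow it), the split would need to be adjusted before passing the name to websave.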
For more details, you can refer to the MATLAB documentation for the functions used above: htmlTree, findElement, getAttribute, urldecode, and websave.
I hope this guidance clarifies how to retrieve files from the desired website. Additionally, I've included the website's HTML source code as a text file attachment for your reference.
Regards
Vinayak Luha
