Combine data from multiple netCDF files

56 views (last 30 days)
Dear reader,
I have multiple datasets for hourly wave height in the NE Atlantic Ocean that is stored in netCDF files for each month (Jan 1990 till Dec 2019). To properly work with the data I wish to combine these datasets such that I get the following output:
Vectors for longitude and latitude
Datetime array from Jan 1990 till Dec 2019
3D data for wave height stored as longitude x latitude x time\
For this, I have written a script (see below). However, at the moment this script is very inefficient and it takes hours to do the job. Is there any way to optimize this script or is there another way to efficiently combine data from multiple NETCDF files?
Your help would be very much appreciated.
param = 'hs';
year = (1990:2019)';
years = num2str(year);
month = ['01'; '02'; '03'; '04'; '05'; '06'; '07'; '08'; '09'; '10'; '11'; '12'];
%% Step 2: Process the data
% Load longitude bands
longitude = ncread(fileIN,'longitude');
% Load latitude bands
latitude = ncread(fileIN,'latitude');
% Create the timeTotal and paramTotal here
timeTotal = datetime(1900,1,1,0,0,0,'Format','dd-MMM-yyyy HH:mm:ss');
paramDataTotal = zeros(length(longitude),length(latitude),1);
% The following for loops first checks if the file exists, then extracts
% the time and param arrays and adds those to the "total dataset."
for i = 1:length(years(:,1));
for j = 1:length(month(:,1));
fileIN = ['WW3-ATNE-10M_' years(i,:) month(j,:) '_hs.nc']
if isfile([Directory_load '\' fileIN]) == 1
% Load all time arrays time array
time = ncread(fileIN,'time');
epoch = datetime(1990,01,01);
time = epoch + days(time);
% add time array to the total time array
timeTotal = cat(1,timeTotal,time);
% Load parameter data
paramData = ncread(fileIN,param);
% add param data to the total 3D paramdata
paramDataTotal = cat(3,paramDataTotal,paramData);
else
continue
end
end
end
% Remove the first entry in the total data.
timeTotal(1) = [];
paramDataTotal(:,:,1) = [];

Accepted Answer

Seth Furman
Seth Furman on 8 Nov 2021
Take a look at the documentation for reading large files and big data.
One simple approach would be to read your data into a tall timetable, e.g.
% Create some sample NetCDF files.
createSampleNetCDFFiles();
ls data
WW3-ATNE-10M_199001_hs.nc WW3-ATNE-10M_199003_hs.nc WW3-ATNE-10M_199005_hs.nc WW3-ATNE-10M_199007_hs.nc WW3-ATNE-10M_199009_hs.nc WW3-ATNE-10M_199011_hs.nc WW3-ATNE-10M_199002_hs.nc WW3-ATNE-10M_199004_hs.nc WW3-ATNE-10M_199006_hs.nc WW3-ATNE-10M_199008_hs.nc WW3-ATNE-10M_199010_hs.nc WW3-ATNE-10M_199012_hs.nc
% Read data as a tall timetable.
ds = fileDatastore("data", "ReadFcn", @readNetCDF, "UniformRead", true);
tTall = tall(ds)
tTall = M×3 tall timetable Time Latitude Longitude Data ____________________ ________ _________ ________________ 30-Dec-1989 23:59:53 -1.0001 -1.0001 706.05 438.74 03-Jan-1990 19:02:14 2.7932 2.7932 31.833 381.56 09-Jan-1990 23:34:35 8.9824 8.9824 276.92 765.52 17-Jan-1990 08:28:52 16.353 16.353 46.171 795.2 17-Jan-1990 08:39:27 16.361 16.361 97.132 186.87 18-Jan-1990 00:06:36 17.005 17.005 823.46 489.76 19-Jan-1990 19:23:06 18.808 18.808 694.83 445.59 20-Jan-1990 05:12:04 19.217 19.217 317.1 646.31 : : : : : : : :
function t = readNetCDF(filename)
Time = datetime(1990, 1, 1) + days(ncread(filename, "time"));
Latitude = ncread(filename, "time");
Longitude = ncread(filename, "time");
Data = ncread(filename, "data");
t = timetable(Time, Latitude, Longitude, Data);
end
function createSampleNetCDFFiles()
rng default
y = 1990;
latitude = sort(180 * rand([10 1]) - 90);
longitude = sort(360 * rand([10 1]) - 180);
mkdir data
cd data
for m = 1:12
filename = sprintf("WW3-ATNE-10M_%d%02d_hs.nc", y, m);
d = 28 * sort(rand([10 1]));
time = days(datetime(y, m, 0) + days(d-1) - datetime(1990, 1, 1));
data = 1000 * rand(10, 2);
nccreate(filename, "time", "Dimensions", {'r', size(time,1), 'c', size(time,2)});
nccreate(filename, "data", "Dimensions", {'r', 'c1', size(data,2)});
nccreate(filename, "latitude", "Dimensions", {'r', 'c2', size(latitude,2)});
nccreate(filename, "longitude", "Dimensions", {'r', 'c3', size(longitude,2)});
ncwrite(filename, "time", time);
ncwrite(filename, "data", data);
ncwrite(filename, "latitude", latitude);
ncwrite(filename, "longitude", longitude);
end
cd ..
end
  1 Comment
Michiel Smit
Michiel Smit on 9 Nov 2021
Dear Seth,
Thank you for your response and tips. I will take a look at it.
For the moment, I have figured out a way so that I do not need al data at once, but rather cut the coordinatespecific data from each dataset and connect those together to get a time series for the desired location. Although this still takes considerable time, it is more managable.
Best regards,
Michiel

Sign in to comment.

More Answers (1)

Kelly Kearney
Kelly Kearney on 8 Nov 2021
What are the dimensions of the total dataset? And do you really need to hold the entire thing in memory at once?
I think you could speed things up in your current script by preallocating the third dimension of paramDataTotal; right now, the constant expansion of the array in each loop iteration requires Matlab to reallocate memory, which can slow things down considerably. You may want to take a look at some of the netCDF-reading tools in the Climate Data Toolbox (particularly ncstruct.m); they are designed to optimize reading in netCDF data from a series of files.
That said, reading in the entirety of a large dataset may be inherently slow regardless of how optimized your code is, simply due to the size. Which brings me back to the question of whether you really need to hold the entire dataset in memory at once. I find it often makes more sense to rethink the rest of my calculations and only read in specific hyperslabs of data as needed (again, ncstruct may simplify the reading-in syntax for you). This read-as-needed technique is similar to how talltable and other Matlab large-dataset functions work under the hood... unfortunately, tabular constructs aren't ideal for working with 3D data, in my opinion.
  1 Comment
Michiel Smit
Michiel Smit on 9 Nov 2021
Dear Kelly,
Thank you for your response and tips.
I have figured out a way so that I do not need al data at once, but rather cut the coordinatespecific data from each dataset and connect those together to get a time series for the desired location. Although this still takes considerable time, it is more managable.
Best regards,
Michiel

Sign in to comment.

Products


Release

R2019b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!