Creating an algorithm to average large data sets with numerical and non-numerical data?

2 views (last 30 days)
Hi all, I'm currently working on a project that requires me to find the mean and other basic statistical facts from large data sets that contain numerical and non-numerical values.
This data is for air pollution levels across different counties from 1999 - 2014. The data I have does not have numbers for every county for every year. For example, some counties have data for 1999-2011 and no data for the last three years.
However, the counties that don't have complete data tend to vary in the years they are missing data. For instance, one county with incomplete data might have air pollution data for 6 separate years while another county has air pollution data for 8 distinct years.
Wherever a county doesn't have any data available, the cell contains a single period (ie: .), which is an indicator that air pollution data was not available for that county for that particular year.
I have all of the data for all counties for all 15 year stored in a single sheet in Excel, which I imported to MATLAB as a dataset array.
The code I wrote thus far allows me to get the mean of the data; however, it only returns a numerical value when a county has data for all 15 years. For all counties with incomplete data in one or more years, the code I wrote generates 'NaN'.
I need some advice on how I can edit my code so that it will return numerical data for all counties. For instance, if a county on has air pollution data available across six distinct years, I want an algorithm that will automatically take the 6 years of numerical data available, exclude the years with no numerical data, and obtain an average over those 6 years. And so on and so forth for all of the counties.
I need help writing something that will both add all of the available numerical data for each county and average that data across the years that are available.
Here is the code I have written thus far:
>> Unique_Counties = unique(ds.County);
size_unique = size(Unique_Counties);
for i=1:size_unique(1)
selection_from_unique = ds(nominal(ds.County) == char(Unique_Counties (i)),:);
average(i)=nanmean(str2num(char(cellstr(selection_from_unique.UnhealthyForSensitiveGroups))));
end
Thank you all in advance for your willingness to help. I am hoping that together we can generate a code that will allow me to easily sort through this massive amount of data. Please do not hesitate to let me know if you have any questions and/or concerns, and I will be happy to help!

Answers (1)

Sulaymon Eshkabilov
Sulaymon Eshkabilov on 23 Jun 2019
Hi,
Simple solution is to use isnan() to find out empty cells and set their values equal to 0, e.g.:
A = xlsread('MY_data.xls'); % Load MS Excel file
Index = find(isnan(A)); % To find where empty cells are
A(Index)=0; % Set equal to "0"
Good luck
  1 Comment
Matthew Moldthan
Matthew Moldthan on 23 Jun 2019
Edited: Matthew Moldthan on 23 Jun 2019
Hi,
First of all, I greatly appreciate your willingness to help. I thought about this option. I was unfortunately disappointed with the result because it included the 0 value in the average.
For instance, data is available for a county for 6 years. I want to get the mean of the available data without including the other 9 years as 0 values in the total. ie (6+0+0+5+9)/5 vs. (6+0+0+5+9)/3
How can I convert the values in the non-numerical cells to 0 and instruct the function to preclude those values in the numerator and the number of years in the divisor.
Thank you for your help. I am looking forward to hearing your thoughts.
Best,

Sign in to comment.

Categories

Find more on Data Import from MATLAB in Help Center and File Exchange

Products


Release

R2018b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!