Creating an algorithm to average large data sets with numerical and non-numerical data?

Question

Matthew Moldthan on 23 Jun 2019

0
Link

Direct link to this question

https://in.mathworks.com/matlabcentral/answers/468436-creating-an-algorithm-to-average-large-data-sets-with-numerical-and-non-numerical-data

Edited: Matthew Moldthan on 23 Jun 2019

Hi all, I'm currently working on a project that requires me to find the mean and other basic statistical facts from large data sets that contain numerical and non-numerical values.

This data is for air pollution levels across different counties from 1999 - 2014. The data I have does not have numbers for every county for every year. For example, some counties have data for 1999-2011 and no data for the last three years.

However, the counties that don't have complete data tend to vary in the years they are missing data. For instance, one county with incomplete data might have air pollution data for 6 separate years while another county has air pollution data for 8 distinct years.

Wherever a county doesn't have any data available, the cell contains a single period (ie: .), which is an indicator that air pollution data was not available for that county for that particular year.

I have all of the data for all counties for all 15 year stored in a single sheet in Excel, which I imported to MATLAB as a dataset array.

The code I wrote thus far allows me to get the mean of the data; however, it only returns a numerical value when a county has data for all 15 years. For all counties with incomplete data in one or more years, the code I wrote generates 'NaN'.

I need some advice on how I can edit my code so that it will return numerical data for all counties. For instance, if a county on has air pollution data available across six distinct years, I want an algorithm that will automatically take the 6 years of numerical data available, exclude the years with no numerical data, and obtain an average over those 6 years. And so on and so forth for all of the counties.

I need help writing something that will both add all of the available numerical data for each county and average that data across the years that are available.

Here is the code I have written thus far:

>> Unique_Counties = unique(ds.County);

size_unique = size(Unique_Counties);

for i=1:size_unique(1)

selection_from_unique = ds(nominal(ds.County) == char(Unique_Counties (i)),:);

average(i)=nanmean(str2num(char(cellstr(selection_from_unique.UnhealthyForSensitiveGroups))));

end

Thank you all in advance for your willingness to help. I am hoping that together we can generate a code that will allow me to easily sort through this massive amount of data. Please do not hesitate to let me know if you have any questions and/or concerns, and I will be happy to help!

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Sign in to answer this question.

Answer 1

Sulaymon Eshkabilov on 23 Jun 2019

0
Link

Direct link to this answer

https://in.mathworks.com/matlabcentral/answers/468436-creating-an-algorithm-to-average-large-data-sets-with-numerical-and-non-numerical-data#answer_380380

Hi,

Simple solution is to use isnan() to find out empty cells and set their values equal to 0, e.g.:

A = xlsread('MY_data.xls'); % Load MS Excel file

Index = find(isnan(A)); % To find where empty cells are

A(Index)=0; % Set equal to "0"

Good luck

1 Comment
Show -1 older commentsHide -1 older comments

Matthew Moldthan on 23 Jun 2019

Edited: Matthew Moldthan on 23 Jun 2019

Hi,

First of all, I greatly appreciate your willingness to help. I thought about this option. I was unfortunately disappointed with the result because it included the 0 value in the average.

For instance, data is available for a county for 6 years. I want to get the mean of the available data without including the other 9 years as 0 values in the total. ie (6+0+0+5+9)/5 vs. (6+0+0+5+9)/3

How can I convert the values in the non-numerical cells to 0 and instruct the function to preclude those values in the numerator and the number of years in the divisor.

Thank you for your help. I am looking forward to hearing your thoughts.

Best,

Sign in to comment.

Creating an algorithm to average large data sets with numerical and non-numerical data?

0 Comments
Show -2 older commentsHide -2 older comments

Answers (1)

1 Comment
Show -1 older commentsHide -1 older comments

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

Creating an algorithm to average large data sets with numerical and non-numerical data?

0 Comments Show -2 older commentsHide -2 older comments

Answers (1)

1 Comment Show -1 older commentsHide -1 older comments

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

0 Comments
Show -2 older commentsHide -2 older comments

1 Comment
Show -1 older commentsHide -1 older comments