What is the best way to count occurrences of data from an excel file?

9 views (last 30 days)
I am trying to count the occurences of strings in an excel doc but there are some complications.
My dataset is large but a sample would be the following: (there is no pattern)
C01 C02 " "
B01 C02 C02 D04
C05 C01 " "
... (there are many rows and columns)
The " symbol occurs because in a previous step the data was converted from delimited to non-delimited, there are also some 'Error" values in the data.
I want my output to tell me: (using example data above)
C01 occurred 2
C02 occurred 3
etc.
I have looked into using tabulate(), hist(), and groupcounts() but i haven't figured it out.
What method would you recommend?

Accepted Answer

Voss
Voss on 9 Mar 2022
There are different ways to read an excel doc, and the function you use might treat some things differently (in particular, the cells with ").
Here are two functions you might use for reading the data, readcell() and xlsread(). The counting of the data in the cells is done the same way in both cases, in the function report_occurrences(), defined at the bottom.
C = readcell('data.xlsx')
C = 3×4 cell array
{'C01'} {'C02'} {0×0 char} {0×0 char} {'B01'} {'C02'} {'C02' } {'D04' } {'C05'} {'C01'} {0×0 char} {0×0 char}
report_occurrences(C);
ans =
' occurred 4 B01 occurred 1 C01 occurred 2 C02 occurred 3 C05 occurred 1 D04 occurred 1 '
[~,~,C] = xlsread('data.xlsx')
C = 3×4 cell array
{'C01'} {'C02'} {'"' } {'"' } {'B01'} {'C02'} {'C02'} {'D04'} {'C05'} {'C01'} {'"' } {'"' }
report_occurrences(C);
ans =
'" occurred 4 B01 occurred 1 C01 occurred 2 C02 occurred 3 C05 occurred 1 D04 occurred 1 '
function report_occurrences(C)
[uC,~,jj] = unique(C(:));
counts = zeros(numel(uC),1);
for ii = 1:numel(uC)
counts(ii) = nnz(jj == ii);
end
result = [uC num2cell(counts)].';
sprintf('%s occurred %d\n',result{:})
end
  6 Comments
Voss
Voss on 23 Mar 2022
Edited: Voss on 23 Mar 2022
C = readcell('example_data.csv','delimiter',' ')
C = 13×18 cell array
{'CPC' } {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {'B64D'} {'F42B' } {'F42B' } {'B64D' } {'B60K' } {'B64D' } {'B64C' } {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {'B64C'} {'B64C' } {'B64D' } {'B64D' } {'B64C' } {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {'B64C'} {'B64C' } {'B64C' } {'B64C' } {'B64C' } {'B64D' } {'B64C' } {'B64C' } {'B64C' } {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {'G06G'} {'F02K' } {'G01M' } {'B64C' } {'F02K' } {'B64F' } {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {'F41G'} {'F42C' } {'F41G' } {'F42B' } {'F41F' } {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {'F42B'} {'F41F' } {'F42B' } {'Expired' } {'F42B' } {'B64C' } {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {'F42B'} {'F42B' } {'Error' } {'F16F' } {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {'B64C'} {'B64C' } {'B64D' } {'B64D' } {'B64D' } {'F02C' } {'B64D' } {'G21D' } {'G21D' } {'B64D' } {'G21H' } {'F02C' } {'B64D' } {'G21D' } {'B64C' } {'B64C' } {'B64C' } {'B64C' } {'B64C'} {'B64C' } {'B64C' } {'B64C' } {'B64C' } {'B64C' } {'B64C' } {'B64C' } {'B64C' } {'F01D' } {'B64C' } {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {'G05D'} {'G05D' } {'B64C' } {'Error' } {'B64C' } {'G05D' } {'B64C' } {'B64C' } {'G05D' } {'F41G' } {'G01S' } {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {'B64C'} {'B64C' } {'B64C' } {'B64C' } {'B64C' } {'F16N' } {'B64C' } {'G01F' } {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {'B64C'} {'B64C' } {'B64C' } {'B64C' } {'B64C' } {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing} {1×1 missing}
C = C(~cellfun(@(x)isa(x,'missing'),C)) % remove the 'missing's from cell C
C = 96×1 cell array
{'CPC' } {'B64D'} {'B64C'} {'B64C'} {'G06G'} {'F41G'} {'F42B'} {'F42B'} {'B64C'} {'B64C'} {'G05D'} {'B64C'} {'B64C'} {'F42B'} {'B64C'} {'B64C'} {'F02K'} {'F42C'} {'F41F'} {'F42B'} {'B64C'} {'B64C'} {'G05D'} {'B64C'} {'B64C'} {'F42B'} {'B64D'} {'B64C'} {'G01M'} {'F41G'}
report_occurrences(C); % now run the function like usual
ans =
'B60K occurred 1 B64C occurred 45 B64D occurred 12 B64F occurred 1 CPC occurred 1 Error occurred 2 Expired occurred 1 F01D occurred 1 F02C occurred 2 F02K occurred 2 F16F occurred 1 F16N occurred 1 F41F occurred 2 F41G occurred 3 F42B occurred 8 F42C occurred 1 G01F occurred 1 G01M occurred 1 G01S occurred 1 G05D occurred 4 G06G occurred 1 G21D occurred 3 G21H occurred 1 '
function report_occurrences(C) % function definition is the same as before
[uC,~,jj] = unique(C(:));
counts = zeros(numel(uC),1);
for ii = 1:numel(uC)
counts(ii) = nnz(jj == ii);
end
result = [uC num2cell(counts)].';
sprintf('%s occurred %d\n',result{:})
end
Michael Cottingham
Michael Cottingham on 30 Mar 2022
Thank you, this method worked but I believe it also capped out at 10,000 data points.
Now I just have to figure out how to keep associated years with it.
cheers

Sign in to comment.

More Answers (0)

Products


Release

R2020a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!