Finding groups based on matching multiple values in a column

80 views (last 30 days)
I'd like to find groups in my table based on the values of multiple columnns- but I'd like to have the groups allow multiple specific values in one of the columns.
For the "I'd like to find groups in my table based on the values of multiple columnns" part, I can do that:
>> T=table({'Home Depot';'Lowes';'Home Depot';'Menards';'Menards';'Home Depot'},{'USA';'China';'Canada';'France';'USA';'Canada'},{'Smith';'Johnson';'Jones';'Miller';'Williams';'Brown'},[123;456;789;1010;1112;1314],'VariableNames',{'Store' 'Country' 'Manager' 'Revenue'});
T =
6×4 table
Store Country Manager Revenue
______________ __________ ____________ _______
{'Home Depot'} {'USA' } {'Smith' } 123
{'Lowes' } {'China' } {'Johnson' } 456
{'Home Depot'} {'Canada'} {'Jones' } 789
{'Menards' } {'France'} {'Miller' } 1010
{'Menards' } {'USA' } {'Williams'} 1112
{'Home Depot'} {'Canada'} {'Brown' } 1314
>> [G,~,idx]=unique(T(:,[1 2]),'stable') %want 'stable' option so not using findgroups
G =
5×2 table
Store Country
______________ __________
{'Home Depot'} {'USA' }
{'Lowes' } {'China' }
{'Home Depot'} {'Canada'}
{'Menards' } {'France'}
{'Menards' } {'USA' }
idx =
1
2
3
4
5
3
But what I'd like to do is introduce an 'or' grouping rule to designate the region that Country is in. In this case, I want to group the Country by region as well. I can also do that... though not sure my converting to cell method is the best way, but it works.
>> NA=find(ismember(table2cell(T(:,2)),{'USA','Canada','Mexico'}))
NA =
1
3
5
6
But ultimately, I want to do this at the same time- show groups that match both the Store as well as the Region. I am not sure the best way to go about this- my actual table is very large. I thought of creating a new varibale for Region and then match off that- is that the way to go or can I combine these into one sort? I was also considering looping methods, but not sure the most efficient way to proceed. My desired result would be:
T =
6×4 table
Store Country Manager Revenue
______________ __________ ____________ _______
{'Home Depot'} {'USA' } {'Smith' } 123
{'Lowes' } {'China' } {'Johnson' } 456
{'Home Depot'} {'Canada'} {'Jones' } 789
{'Menards' } {'France'} {'Miller' } 1010
{'Menards' } {'USA' } {'Williams'} 1112
{'Home Depot'} {'Canada'} {'Brown' } 1314
idx =
1
2
1
3
4
1
Thanks for any insight.
  3 Comments
Marcus Glover
Marcus Glover on 6 Feb 2021
Edited: Marcus Glover on 6 Feb 2021
Thanks. Something like the binning in groupsummary may work- but I'll have to take a look and see how I can do that with strings.
My question is more of how to do this 'better' or 'correctly'. I only ask because I have a huge table and I am not the strongest coder, but really waiting 2 minutes instead of 2 seconds is not a major issue for me. Perhaps an example will make it more clear-the following does what I want but I think it is not efficient and is probably poor code. I feel like there may be a way to do it all at once instead of creating a new variable.
%starting Table
>> T=table({'Home Depot';'Lowes';'Home Depot';'Menards';'Menards';'Home Depot'},{'USA';'China';'Canada';'France';'USA';'Canada'},{'Smith';'Johnson';'Jones';'Miller';'Williams';'Brown'},[123;456;789;1010;1112;1314],'VariableNames',{'Store' 'Country' 'Manager' 'Revenue'});
T =
6×4 table
Store Country Manager Revenue
______________ __________ ____________ _______
{'Home Depot'} {'USA' } {'Smith' } 123
{'Lowes' } {'China' } {'Johnson' } 456
{'Home Depot'} {'Canada'} {'Jones' } 789
{'Menards' } {'France'} {'Miller' } 1010
{'Menards' } {'USA' } {'Williams'} 1112
{'Home Depot'} {'Canada'} {'Brown' } 1314
% adds new variable for the 'Region' which makes subgroups of column 2
>> T.Region=repmat({''},size(T,1),1);
T =
6×5 table
Store Country Manager Revenue Region
______________ __________ ____________ _______ __________
{'Home Depot'} {'USA' } {'Smith' } 123 {0×0 char}
{'Lowes' } {'China' } {'Johnson' } 456 {0×0 char}
{'Home Depot'} {'Canada'} {'Jones' } 789 {0×0 char}
{'Menards' } {'France'} {'Miller' } 1010 {0×0 char}
{'Menards' } {'USA' } {'Williams'} 1112 {0×0 char}
{'Home Depot'} {'Canada'} {'Brown' } 1314 {0×0 char}
%Now I want to assign group values to the variable 'Region' based on value in 'Country'
%This step I would like to roll in to one sort on columns 1 and 2 without
% creating column 5, but maybe it is fine. My real table is very very large.
>> Europe=find(ismember(table2cell(T(:,2)),{'UK','France','Gremany'}))
Europe =
4
>> T(Europe,5)={'Europe'};
>> Asia=find(ismember(table2cell(T(:,2)),{'China','Japan','SouthKorea'}))
Asia =
2
>> T(Asia,5)={'Asia'}
>> NA=find(ismember(table2cell(T(:,2)),{'USA','Canada','Mexico'}))
NA =
1
3
5
6
>> T(NA,5)={'NA'};
>> T
T =
6×5 table
Store Country Manager Revenue Region
______________ __________ ____________ _______ __________
{'Home Depot'} {'USA' } {'Smith' } 123 {'NA' }
{'Lowes' } {'China' } {'Johnson' } 456 {'Asia' }
{'Home Depot'} {'Canada'} {'Jones' } 789 {'NA' }
{'Menards' } {'France'} {'Miller' } 1010 {'Europe'}
{'Menards' } {'USA' } {'Williams'} 1112 {'NA' }
{'Home Depot'} {'Canada'} {'Brown' } 1314 {'NA' }
% Now I find groups that are unique in columns 1 and 5
>> [G,~,idx]=unique(T(:,[1 5]),'stable');
>> idx
idx =
1
2
1
3
4
1

Sign in to comment.

Accepted Answer

Adam Danz
Adam Danz on 6 Feb 2021
Edited: Adam Danz on 6 Feb 2021
I recommend adding the region column to your tables. I doubt it will have an impact on computation time relative to using a separate column and it will keep the data tidy.
I prefer to work with catgegories over string when working with categorical data.
T=table(categorical({'Home Depot';'Lowes';'Home Depot';'Menards';'Menards';'Menards'}),...
categorical({'ISS';'Mexico';'Canada';'France';'USA';'Germany'}),...
categorical({'Smith';'Johnson';'Jones';'Miller';'Williams';'Brown'}),...
[123;456;789;1010;1112;1314],...
'VariableNames',{'Store' 'Country' 'Manager' 'Revenue'}) % modified example
T = 6x4 table
Store Country Manager Revenue __________ _______ ________ _______ Home Depot ISS Smith 123 Lowes Mexico Johnson 456 Home Depot Canada Jones 789 Menards France Miller 1010 Menards USA Williams 1112 Menards Germany Brown 1314
Add region column for regions (assuming you have a key already)
% Column 1 is region name
% Column 2 is a list of countries in the region
regions = {
categorical({'North America'}), categorical({'Canada', 'USA', 'Mexico'});
categorical({'Europe'}), categorical({'France', 'Germany', 'Bulgaria'});
};
T.Region = repmat(categorical({'NA'}), height(T),1); % Default is NA
for i = 1:size(regions,1);
T.Region(ismember(T.Country, regions{i,2})) = regions{i,1};
end
disp(T)
Store Country Manager Revenue Region __________ _______ ________ _______ _____________ Home Depot ISS Smith 123 NA Lowes Mexico Johnson 456 North America Home Depot Canada Jones 789 North America Menards France Miller 1010 Europe Menards USA Williams 1112 North America Menards Germany Brown 1314 Europe
Summary of Stores by region using
Tsummary = groupsummary(T,["Store","Region"],["sum","mean"], "Revenue")
Tsummary = 5x5 table
Store Region GroupCount sum_Revenue mean_Revenue __________ _____________ __________ ___________ ____________ Home Depot NA 1 123 123 Home Depot North America 1 789 789 Lowes North America 1 456 456 Menards North America 1 1112 1112 Menards Europe 2 2324 1162
Time test
I replicated the demo table many times to create a single table with 184,320 rows and times this solution 1000 times using tic/toc (including the construction of the regions array but not including the disp()). The median of the distribution of durations was 0.07 seconds.
  2 Comments
Marcus Glover
Marcus Glover on 6 Feb 2021
Edited: Marcus Glover on 6 Feb 2021
Thank you, that is good advice. I am constructing the table from a SQL database using select. I believe I can use setoptions to make the appropriate variables categorical.
I was looking for the indices of the rows for each group, but it seems like I will be able to pass a function handle as the method in groupsummary and skip my intended plan to loop through each group of indices. That may or may not save me time but it certianly seems nicer.
Looking at your time test, I guess I was worried for nothing! Out of curiosity, what is the reson for the (admittedly small) distribution- just whatever also may be happening in the background of your computer?
Adam Danz
Adam Danz on 6 Feb 2021
> I believe I can use setoptions to make the appropriate variables categorical.
Perhaps (I haven't looked into it) but you could also convert from char to categorical the way I did.
> That may or may not save me time but it certianly seems nicer.
I think it will save you time.
> what is the reson for the (admittedly small) distribution
There will always be variation. Here are some reasons.
  • Random fluxuation
  • JIT compilation (see this answer)
  • Other stuff my computer is doing that may limit resources.
  • Other stuff I don't know about :D

Sign in to comment.

More Answers (0)

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!