Removing rows duplicates based on some conditions

8 views (last 30 days)
Hello. I already asked a very similar question here, but I didn't understand the commands well, so I decided to ask for help again. I now have a file with 7 columns. I need to exclude the lines where the values ​​in the third column are the same, but based on the conditions: the date (column 1), the time (column2) and the index (column3) cannot be repeated together and you must keep only the line where the value of the seventh column is the largest. The matrix looks like this:
20140101 2100 1762 -5.983 -37.483 26.8 11000
20140101 2100 1762 -5.983 -37.483 36.8 15000
20140109 2040 1762 -5.983 -37.483 25.0 3000
20140109 2040 1762 -5.983 -37.483 28.0 7000
20140130 2250 1762 -5.983 -37.483 22.0 3000
20140109 2040 1763 -5.992 -37.483 25.0 4000
20140109 2040 1763 -5.992 -37.483 23.2 6000
20140114 1940 1763 -5.992 -37.483 34.0 6000
And I want this result:
20140101 2100 1762 -5.983 -37.483 36.8 15000
20140109 2040 1762 -5.983 -37.483 28.0 7000
20140130 2250 1762 -5.983 -37.483 22.0 3000
20140109 2040 1763 -5.992 -37.483 23.2 6000
20140114 1940 1763 -5.992 -37.483 34.0 6000
In other words, the conditions are: if you have more than one row with the same date (for example, 2040101) in column 1, the same time (for example, 2100) in column 2 and the same index (for example, 1762 ) in the third column, keep only the row where the value in the last column is greater. In the case of the example I gave, just keep:
20140101 2100 1762 -5.983 -37.483 36.8 15000
20140109 2040 1762 -5.983 -37.483 28.0 7000
20140130 2250 1762 -5.983 -37.483 22.0 3000
20140109 2040 1763 -5.992 -37.483 23.2 6000
20140114 1940 1763 -5.992 -37.483 34.0 6000
Thank you so much for your help!

Accepted Answer

Adam Danz
Adam Danz on 25 Sep 2020
Edited: Adam Danz on 25 Sep 2020
Method 1
Here's a simple solution that assumes matching rows in columns 1:3 are next to each other. The assert command tests this assumption and will throw an error if the assumption is not met.
data = [20140101 2100 1762 -5.983 -37.483 26.8 11000
20140101 2100 1762 -5.983 -37.483 36.8 15000
20140109 2040 1762 -5.983 -37.483 25.0 3000
20140109 2040 1762 -5.983 -37.483 28.0 7000
20140130 2250 1762 -5.983 -37.483 22.0 3000
20140109 2040 1763 -5.992 -37.483 25.0 4000
20140109 2040 1763 -5.992 -37.483 23.2 6000
20140114 1940 1763 -5.992 -37.483 34.0 6000];
[~,~,groupID] = unique(data(:,1:3),'rows','stable');
assert(all(diff(groupID)>=0),'Assumption violation: Matching rows in cols 1:3 are not juxtaposed.')
[~,groupMax] = splitapply(@max,data(:,7),groupID);
m = data(cumsum(groupMax),:)
Result
m =
2.014e+07 2100 1762 -5.983 -37.483 36.8 15000
2.014e+07 2040 1762 -5.983 -37.483 28 7000
2.014e+07 2250 1762 -5.983 -37.483 22 3000
2.014e+07 2040 1763 -5.992 -37.483 23.2 6000
2.014e+07 1940 1763 -5.992 -37.483 34 6000
Method 2
This method does not make any assumptions about row order.
[~,~,groupID] = unique(data(:,1:3),'rows','stable');
[~,groupMax] = splitapply(@max,data(:,7),groupID);
groups = unique(groupID);
finalRowNums = zeros(numel(groups),1);
for i = 1:numel(groups)
rowNums = find(groupID==i);
finalRowNums(i) = rowNums(groupMax(i));
end
m = data(finalRowNums, :);
  7 Comments
Adam Danz
Adam Danz on 25 Sep 2020
Edited: Adam Danz on 28 Sep 2020
It works perfectly fine for me on the sample you provided. Run the code below. If it results in an error, tell us what Matlab release you're using and the full error message. If Matlab crahses, something unrelated is wrong on your system.
If you attach the mat file so I can simply load the matrix into Matlab, I can test the full dataset.
data = [20140101 0 69760 -5.965 -36.250 26.0 2000
20140101 0 69761 -5.974 -36.250 23.5 2000
20140101 0 73180 -5.247 -36.187 23.5 2000
20140101 0 73678 -5.229 -36.178 26.5 2000
20140101 0 74178 -5.229 -36.169 26.5 2000
20140101 0 128828 -6.576 -35.181 22.6 2000
20140101 0 138373 -6.980 -35.009 20.8 2000
20140101 0 139404 -7.259 -34.991 22.0 2000
20140101 0 139904 -7.259 -34.982 23.0 2000
20140101 0 140375 -6.998 -34.973 22.2 2000
20140101 0 140404 -7.259 -34.973 24.5 2000
20140101 0 140903 -7.250 -34.964 22.2 2000
20140101 0 140904 -7.259 -34.964 27.2 2000
20140101 0 141403 -7.250 -34.955 29.2 2000
20140101 0 141404 -7.259 -34.955 28.5 2000
20140101 0 141874 -6.989 -34.946 23.4 2000
20140101 0 141902 -7.241 -34.946 20.8 2000
20140101 0 141903 -7.250 -34.946 32.0 2000
20140101 0 141904 -7.259 -34.946 27.0 2000
20140101 0 142403 -7.250 -34.937 22.5 2000
20140101 0 142404 -7.259 -34.937 23.9 2000
20140101 0 142903 -7.250 -34.928 23.2 2000
20140101 0 142904 -7.259 -34.928 24.5 2000
20140101 0 143403 -7.250 -34.919 24.5 2000
20140101 0 146874 -6.989 -34.855 21.5 2000
20140101 0 146934 -7.528 -34.855 21.5 2000
20140101 0 148907 -7.286 -34.819 21.2 2000
20140101 0 149406 -7.277 -34.810 26.0 2000];
[~,~,groupID] = unique(data(:,1:3),'rows','stable');
[~,groupMax] = splitapply(@max,data(:,7),groupID);
groups = unique(groupID);
finalRowNums = zeros(numel(groups),1);
for i = 1:numel(groups)
rowNums = find(groupID==i);
finalRowNums(i) = rowNums(groupMax(i));
end
m = data(finalRowNums, :);
Result
m =
2.014e+07 0 69760 -5.965 -36.25 26 2000
2.014e+07 0 69761 -5.974 -36.25 23.5 2000
2.014e+07 0 73180 -5.247 -36.187 23.5 2000
2.014e+07 0 73678 -5.229 -36.178 26.5 2000
2.014e+07 0 74178 -5.229 -36.169 26.5 2000
2.014e+07 0 1.2883e+05 -6.576 -35.181 22.6 2000
2.014e+07 0 1.3837e+05 -6.98 -35.009 20.8 2000
2.014e+07 0 1.394e+05 -7.259 -34.991 22 2000
2.014e+07 0 1.399e+05 -7.259 -34.982 23 2000
2.014e+07 0 1.4038e+05 -6.998 -34.973 22.2 2000
2.014e+07 0 1.404e+05 -7.259 -34.973 24.5 2000
2.014e+07 0 1.409e+05 -7.25 -34.964 22.2 2000
2.014e+07 0 1.409e+05 -7.259 -34.964 27.2 2000
2.014e+07 0 1.414e+05 -7.25 -34.955 29.2 2000
2.014e+07 0 1.414e+05 -7.259 -34.955 28.5 2000
2.014e+07 0 1.4187e+05 -6.989 -34.946 23.4 2000
2.014e+07 0 1.419e+05 -7.241 -34.946 20.8 2000
2.014e+07 0 1.419e+05 -7.25 -34.946 32 2000
2.014e+07 0 1.419e+05 -7.259 -34.946 27 2000
2.014e+07 0 1.424e+05 -7.25 -34.937 22.5 2000
2.014e+07 0 1.424e+05 -7.259 -34.937 23.9 2000
2.014e+07 0 1.429e+05 -7.25 -34.928 23.2 2000
2.014e+07 0 1.429e+05 -7.259 -34.928 24.5 2000
2.014e+07 0 1.434e+05 -7.25 -34.919 24.5 2000
2.014e+07 0 1.4687e+05 -6.989 -34.855 21.5 2000
2.014e+07 0 1.4693e+05 -7.528 -34.855 21.5 2000
2.014e+07 0 1.4891e+05 -7.286 -34.819 21.2 2000
2.014e+07 0 1.4941e+05 -7.277 -34.81 26 2000
pink flower
pink flower on 30 Sep 2020
These commands worked very well, it only takes a long time to process, after all there are more than 10 million lines. Thank you very much!

Sign in to comment.

More Answers (0)

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!