Efficient ways to match observations

2 views (last 30 days)
Daniel Pinto
Daniel Pinto on 1 Apr 2021
Edited: Daniel Pinto on 1 Apr 2021
Suppose I have a table "FILE A" that has two variables "Name,ID, Date".
Name is the name of an individual, e.g. "John The Cowboy".
ID identifies an individual. Eg. 'AVD4345'.
Date is a datetiime field with a date. There is a data point for each ID for each month, hence, each month-year only appears once for each ID.
A given individual,with a give name, might have one ID in one year and another ID in another. IDs are not reused, though.Hence, ID "AVD4345" always refers to "John The Cowboy" but "John The Cowboy" might have other IDs for some other dates.
Then I have another table "FILE B" that has three variables "ID, Date1,Date2,VariableX".
ID is the same as above. VariableX is a variable that I want to assign to FileA from FileB (can be double, character, etc).
Date1 and Date2 are datetime variables that indicate, respectively, the beginning and end of the date range for which a given ID is active. For instance, 'AVD4345' might be active from the January 5th 1983 (Date1) until March 24th 1987 (Date2) with a given value of VariableX, and then be active from March 25th 1987 (Date1) until November 28th 2010 (Date2) with a new value of VariableX
I want to find the match of each observation in File A (an ID-date combination) in File B to retrieve the correct value of VariableX.
It can happen that there is no match because File B is incomplete, for instance.
One approach is to loop through File A, and for each ID (in each row), find all the ID's
FileA.VariableX = repmat({'[]'},size(FileA,1),1); % pre-allocates
for kk = 1 : size(FileA,1)
FileBTemp = FileB(ismember(FileB.ID,FileA.ID(kk)),:);
FileBTemp = FileBTemp(FileA.Date(kk) >= FileBTemp.Date1 & FileA.Date(kk) <= FileBTemp.Date2,:);
if size(FileBTemp,1)==1 % if match unique, then assign value
FileA.VariableX(kk) = FileBTemp.VariableX;
elseif isempty(FileBTemp) % do nothing (i.e., leave {'[]'})
else
error('error') % size(FileBTemp,1)>1 should not happen, so if it does I'd have to check what's going on in the database
end
end
This is in principle fine. However, it is not efficient for a large value of
size(FileA,1)
What is a more efficient way of doing this?
One option would be to just transform "File B" so that an ID that is valid from January 5th 1983 (Date1) until March 24th 1987 (Date2) is repeated several times from January 1983 until March 1987 for each month, and then try to apply ismember() or intersection() on ID-month. However, this generates the problem that some ID-month in File A would have more than one match in File B (when variable X changes within the month in File B). Another option that would work would be to transform file B and repeat dates on a day by day basis (e.g., from January 5th 1983 (Date1) until March 24th 1987 I would repeat that row in File B once for each day in the data range and just change the date to account for all possible days). The issue with this is that it would generate a horribly tall array. I could in principle work with tall arrays, of course. Another way is to loop through File B, which is smaller. But I am wondering whether there's an overall more straighforward way of solving this kind of issue efficiently without loops and / or tall arrays.

Answers (0)

Categories

Find more on Dates and Time in Help Center and File Exchange

Tags

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!