Match names from two different columns - Comparing strings of different lengths
6 views (last 30 days)
Show older comments
I have a cell variable A with about 4000 rows and 5 columns:
C1 c2 c3 c4 c5
A={1997 'Michelle Applebaum' 'Salmon' 'BASIC' 'STEEL'
1997 'Jambardella Arnold' 'Butter' 'BASIC' 'STEEL'
1999 'Cai von Rumohr' 'Cow' 'CAPITAL' 'AEROPLANE'
2011 'Pierre Smith' 'Milk' 'GOOD' 'AEROPLANE'
2004 'Jinder Kauffman' 'Star' 'CAPITAL' 'PHONE'
And I have a second cell variable B with about 200 000 rows and 8 columns:
c1 c2 c3 c4 c5 c6 c7 c8
B={2013 29 2225 'ELD1' 29 'SMITH P' 4817 'HAYWOOD'
2013 70 2628 'CCRN' 70 'FRANCE J' 11688 'CANTORFZ'
2013 02 952 'ABFS' 02 'KAUFFMAN J' 356 'BUCK'
2013 20297 157 'DUK' 20297 'ARNOLD J' 1382 'LAWRENCE'
2013 78362 260 'APA' 78362 'ARIF A' 2213 'STIFEL'}
The focus is on c2 of A and on c6 of B.
- C2 of A gives the complete first name and last name (and sometimes other names in between) of an individual.
- C6 of B gives the last name (in capital letters) and only the initial of the first name of an individual.
I am trying to match both cells . So in case both last names are the same (or silmilar) I would like to add the columns of B to A.
C1 c2 c3 c4 c5
A={1997 'Michelle Applebaum' 'Salmon' 'BASIC' 'STEEL'
1997 'Jambardella Arnold ' 'Butter' 'BASIC' 'STEEL' 2013 20297 157 'DUK' 20297 'ARNOLD J' 1382 'LAWRENCE'
1999 'Cai von Rumohr' 'Cow' 'CAPITAL' 'AEROPLANE'
2011 'Pierre Smith' 'Milk' 'GOOD' 'AEROPLANE' 2013 29 2225 'ELD1' 29 'SMITH P' 4817 'HAYWOOD'
2004 'Jinder Kauffman' 'Star' 'CAPITAL' 'PHONE' 2013 02 952 'ABFS' 02'KAUFFMAN J' 356 'BUCK'
I never matched cells comparing names, and the function has to be case insensitive, ignore points, commas or spaces, and also ignore if the same letter appears twice in a row. I say this because variable A was wrote by me by hand, so it's possible the last name is not exactly equal in both variables. `strrep & strtrim` functions can help solving the problem.
Can someone help me please? Thank you.
0 Comments
Accepted Answer
Guillaume
on 6 Sep 2014
You've not specified what happened to compound surnames (like 'von Ruhmor' in your example) or first names (like 'Jean-Pierre'). it also appears that the fields in B are fixed width so what happens to very long surname? So, I've assumed that only the last part of a compound surname is in B, only the initial of compound first name is in B and for very long surname B is larger. I'm also ignoring case where names don't match exactly:
%extract names from A and transform in uppercase 'SURNAME 1STLETTEROFNAME':
namesfromA = upper(regexprep(A(:,2), '([A-Z]).* ([A-Z][a-z]+)', '$2 $1'));
%x=extract names from B and remove extra spaces:
namesfromB = regexprep(B(:, 6), '([A-Z]+) +([A-Z])', '$1 $2');
%find the intersection and postion of matches. Assume there is only ever one match
[~, ia, ib] = intersect(namesfromA, namesfromB);
%add matches to A:
A(ia, 6:13) = B(ib, :);
If surnames don't match exactly, then it gets a lot more complicated. There are a numbers of algorithm ( Damerau-Levenshtein, Hamming) that allows you to find how two strings are similar, but I don't think any of them are built-in in matlab.
More Answers (1)
Stephen23
on 6 Sep 2014
You are trying to match words in two cell arrays, including allowing for punctuation characters and possibly slightly different spellings... not an easy task to perform! You will also need to consider initials and name order.
It really depends on how different the strings might be: if the differences only include repeated characters, then you might be able to get away with creating some regexp pattern to help with that.
If the differences are more complicated, then you need to find a similarity measure. Some common similarity measures are the Rabin-Karp algorithm, the Levenshtein distance, the Needleman–Wunsch algorithm, or the Hamming distance.
You will also find submissions on MATLAB File Exchange that support several of these measures for analyzing string similarity, and plenty of examples online.
See Also
Categories
Find more on Entering Commands in Help Center and File Exchange
Products
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!