How do I consider words with hyphens and spaces as equivalent while tokenizing using the Text Analytics toolbox in MATLAB R2022a?

I would like to be able to add known tokens when I tokenize a document. I would like "in person", "in-person", and "inperson" to all be equivalent to "in-person". How and where would I add this during the text preparation process?

Accepted Answer

The 'CustomTokens' option of tokenizedDocument may help here. Please refer to the 'Specify Custom Tokens' section of the tokenizedDocument documentation page.

The following example illustrates this suggestion:
str = "I can be at the following locations, in-person, inperson and in-person and finally in person";

% Default tokenization, for comparison
documents = tokenizedDocument(str)

% Treat each spelling variant as a single custom token
documents = tokenizedDocument(str,'CustomTokens',["in-person" "inperson" "in person"])
tdetails = tokenDetails(documents)

% Alternatively, pass a table so that each custom token also gets a type
T = table;
T.Token = ["in-person" "inperson" "in person"]';
T.Type = ["location" "location" "location"]';
documents = tokenizedDocument(str,'CustomTokens',T);
tdetails = tokenDetails(documents)

% Perform other preprocessing steps if needed.

% Find the tokens that share the custom type
idx = tdetails.Type == 'location';
% Replace every variant with the single word 'inperson'
% (this changes the details table only, not the documents variable)
tdetails.Token(idx) = 'inperson';
% Show a word cloud of the tokens for illustration purposes
wordcloud(tdetails.Token)
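Another place to add this during text preparation is before tokenization, on the raw string. The following is a minimal sketch rather than a documented recipe: the regular expression and the variable name normalized are assumptions made for illustration, and "in-person" is taken as the canonical form because that is what the question asks for. It rewrites all three spellings to one form and then keeps that form as a single custom token.

```matlab
str = "I can be at the following locations, in-person, inperson and in-person and finally in person";

% Rewrite "in person", "in-person" and "inperson" to one canonical spelling.
% \< and \> anchor the match at word boundaries; [\s-]? allows an optional
% whitespace character or hyphen between "in" and "person".
normalized = regexprep(str, "\<in[\s-]?person\>", "in-person", "ignorecase");

% Keep the canonical form together as one token (requires Text Analytics Toolbox)
documents = tokenizedDocument(normalized, 'CustomTokens', "in-person");
tdetails = tokenDetails(documents)
```

Because the replacement happens on the text itself, every later step (tokenizing, bag-of-words counts, word clouds) sees only the canonical spelling.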
Alternatively, the following two resources might be helpful:
To specify custom tokens using regular expressions:
Erase punctuation from text and documents (might be of interest for hyphens):

Release

R2022a
