How do I consider words with hyphens and spaces as equivalent while tokenizing using the Text Analytics toolbox in MATLAB R2022a?

Question

I would like to be able to add known tokens when I tokenize a document. I would like "in person", "in-person", and "inperson" to all be equivalent to "in-person". How and where would I add this during the text preparation process?

Answer 1

MathWorks Support Team on 17 Nov 2022

The 'CustomTokens' name-value option of tokenizedDocument may help here. See the 'Specify Custom Tokens' section of the tokenizedDocument documentation page:

https://www.mathworks.com/help/textanalytics/ref/tokenizeddocument.html#mw_f9d9d081-a9ca-4188-80b5-bf5a2bc5ea4c

Please also refer to the following example that illustrates the above suggestion:

str = "I can be at the following locations, in-person, inperson and in-person and finally in person";
% Default tokenization, shown for comparison
documents = tokenizedDocument(str)
% Treat every spelling variant as a single custom token
documents = tokenizedDocument(str,'CustomTokens',["in-person" "inperson" "in person"])
tdetails = tokenDetails(documents)
% Alternatively, pass a table so the custom tokens get their own type
T = table;
T.Token = ["in-person" "inperson" "in person"]';
T.Type = ["location" "location" "location"]';
documents = tokenizedDocument(str,'CustomTokens',T);
tdetails = tokenDetails(documents)
% Perform other preprocessing steps if needed.
% Find the tokens that share the custom token type
idx = tdetails.Type == 'location';
% Replace every variant with the single form "in-person" requested in the question
tdetails.Token(idx) = "in-person";
% Word cloud of the tokens, for illustration purposes
wordcloud(tdetails.Token)
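
Note that the example above changes only the tokenDetails table; the tokenizedDocument itself still holds the original variants. If the normalized form is needed in the document for later steps, one possible approach is replaceWords. The following is a minimal sketch, not part of the original answer; the variable name variants is only illustrative, and it assumes the custom token "in person" is detected as a single token as described above:

```matlab
str = "I can be at the following locations, in-person, inperson and in-person and finally in person";

% Spelling variants to keep together as single tokens (illustrative name)
variants = ["in-person" "inperson" "in person"];
documents = tokenizedDocument(str,'CustomTokens',variants);

% Map the other variants onto "in-person" inside the document itself
documents = replaceWords(documents,["inperson" "in person"],["in-person" "in-person"]);

% Inspect the result
tdetails = tokenDetails(documents)
```

After this step, later preprocessing and counting functions operate on documents that contain only the "in-person" form.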

Alternatively, the following two resources might be helpful:

To specify custom tokens using Regular Expressions:

https://www.mathworks.com/help/releases/R2022a/textanalytics/ref/tokenizeddocument.html#mw_72ad54b6-c767-4124-98c3-41bb12fc708f
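
As a rough sketch of the regular-expression route, a single pattern can stand in for the list of variants. The pattern below is an assumption written for this question and is not taken from the documentation page; it matches "in", an optional hyphen or space, then "person", and it may also match inside longer words, so check it against real data:

```matlab
str = "I can be at the following locations, in-person, inperson and in-person and finally in person";

% One pattern covering "in-person", "inperson" and "in person" (assumed pattern)
documents = tokenizedDocument(str,'RegularExpressions',"in[- ]?person");
tdetails = tokenDetails(documents);

% Normalize whichever variants were detected to "in-person"
tdetails.Token = regexprep(tdetails.Token,"^in[- ]?person$","in-person")
```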

Erase Punctuation from text and documents (might be of interest for hyphens):

https://www.mathworks.com/help/releases/R2022a/textanalytics/ref/tokenizeddocument.erasepunctuation.html
