searching a string for a word
I have a text file and a list of words, and I want to count how often each of those words appears in the file. I used strfind, but if one of the words is short, say "and", it is also found inside other words such as "band", whereas I only want to count it when it stands alone. I tried searching for the word with a space before and after it, but that misses the word when it is first or last on a line in the text file. My code is below.
A = fileread(txt)
fh = fopen(txt,'r')
B = strfind(A, firstword);
line = fgetl(fh)
C = strfind(A,secondword);
vec = [length(B),length(C)];
1 Comment
Cedric
on 13 Oct 2017
Edited: Cedric
on 13 Oct 2017
Part of the code is useless. The following
A = fileread(txt)
already opens the file, reads it as text, and closes it. After it is executed, A contains the full content of the file. So then there is no need to open the file again and read one line (and forget to close it).
As Per explains below, STRFIND only matches exact occurrences of the text you are looking for, which makes it awkward for matching patterns (anything more flexible than a fixed sequence of characters). Looking for white space before and after the word was a good first attempt, but it fails in some cases (for example at the start or end of a line, or next to punctuation), and there is also the upper/lower case issue.
All of this signals that you need a slightly more elaborate approach based on pattern matching with regular expressions, which is what Per develops. Note that he uses REGEXPI rather than REGEXP to make the solution case-insensitive.
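To illustrate the case issue, here is a minimal sketch using the word-boundary pattern from Per's answer. The sample string is made up for the illustration and is not taken from the question's file.

```matlab
% Hypothetical sample text, chosen only to show the difference.
sample = 'And, and other words';

% REGEXP is case-sensitive: only the lowercase "and" is found.
posSensitive = regexp( sample, '\<and\>' ) ;

% REGEXPI ignores case: both "And" and "and" are found.
posInsensitive = regexpi( sample, '\<and\>' ) ;
```

With this sample, the case-sensitive call should report a single start index and the case-insensitive call two.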
Your code should look a bit like the following:
textContent = fileread( textFile ) ;
countWord1 = length( regexpi( textContent, ... )) ;
countWord2 = length( regexpi( textContent, ... )) ;
counts = [countWord1, countWord2] ;
where ... are appropriate arguments (at least the pattern). Even better:
wordsToFind = {'and', 'here', 'not'} ;
textFile = 'MyFile.txt' ;
counts = zeros( size( wordsToFind )) ;
textContent = fileread( textFile ) ;
for wordId = 1 : numel( wordsToFind )
    pattern = sprintf( '\\<%s\\>', wordsToFind{wordId} ) ;
    counts(wordId) = length( regexpi( textContent, pattern )) ;
end
where we loop over a series of words defined in a cell array, and we build the pattern proposed by Per dynamically.
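The loop can be tried without a file by substituting an in-memory string for the file content. The sketch below does that with a made-up two-line text. It also passes each word through regexptranslate so that any regular-expression metacharacters in a search word are treated literally. That escaping step is an addition for robustness, not part of the code above, and it changes nothing for plain alphabetic words.

```matlab
wordsToFind = {'and', 'here', 'not'} ;

% Hypothetical stand-in for fileread( textFile ): two lines of sample text.
textContent = sprintf( 'And here we are.\nA band is not here, and that is that.' ) ;

counts = zeros( size( wordsToFind )) ;
for wordId = 1 : numel( wordsToFind )
    % Escape metacharacters so the word is matched literally (assumption:
    % the search words are meant as plain text, not as patterns).
    safeWord = regexptranslate( 'escape', wordsToFind{wordId} ) ;
    % Wrap the word in beginning-of-word and end-of-word anchors.
    pattern = ['\<', safeWord, '\>'] ;
    counts(wordId) = numel( regexpi( textContent, pattern )) ;
end
```

For this sample, "band" is not counted as "and", while "And" at the start of the first line is.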
Answers (1)
per isakson
on 13 Oct 2017
Edited: per isakson
on 13 Oct 2017
Try
>> regexpi( 'And, and other words and_ 2and and', '(^|\W)and(\W|$)', 'start' )
ans =
1 5 31
The match includes the character before the word "and", so the returned start index often points at a space rather than at the word itself.
Better
>> regexpi( 'And, and other words and_ 2and and', '\<and\>', 'start' )
ans =
1 6 32
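Besides the shifted start index, the first pattern has another drawback that matters for counting: it consumes the separator after each match, so a second occurrence that directly follows cannot use that same separator as its leading character. A minimal sketch with a made-up input:

```matlab
% Hypothetical input: the same word twice, separated by a single space.
sample = 'and and' ;

% The space after the first "and" is consumed by that match, so the
% second "and" has no preceding \W (or ^) left to match against.
posDelimited = regexpi( sample, '(^|\W)and(\W|$)', 'start' ) ;

% The \< and \> anchors match positions, not characters, so nothing
% is consumed between the two words.
posAnchored = regexpi( sample, '\<and\>', 'start' ) ;
```

Here the delimiter-based pattern finds one occurrence and the anchored pattern finds two, which is why the anchored form is the safer basis for a frequency count.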
Why read line by line and not the entire text in one go?
str = fileread( filespec );
pos = regexpi( str, '\<and\>', 'start' );
Doc says:
- \W Any character that is not alphabetic, numeric, or underscore. For English character sets, \W is equivalent to [^a-zA-Z_0-9]
- \<expr Beginning of a word.