I have a text file that I would like to split into an array. Each array cell should be a word, not a sentence or line in the file.

This is what I have so far, but it does not actually solve my problem:
file= fopen('marktwain.txt','r');
string= fread(file, [1, inf], 'char');
fclose(file);
CStr = dataread('file', 'marktwain.txt', '%s', 'delimiter', '\n');
I have little clue where to go from here.

Accepted Answer

buffer = fileread('marktwain.txt') ;
words = regexp(buffer, '\<\w+', 'match') ;
...and we can discuss the pattern if you want to refine the regexp. For example, you could have "it's" or "John's" count as single words (rather than two) using (EDITED):
words = regexp(buffer, '\<[\w'']+', 'match') ;
The final answer, after the discussion below, is:
buffer = fileread('marktwain.txt') ;
words = regexp(buffer, '\<[\w''\-,]+', 'match') ;
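
To see how the three patterns in this answer differ, here is a minimal sketch run on a short sample string. The sample text is made up for illustration and does not come from marktwain.txt; the variable names w1, w2, and w3 are also hypothetical.

```matlab
% Hypothetical sample text, used only to compare the three patterns.
buffer = 'Don''t stop: well-known words, like E102!';

% Letters, digits, and underscores only.
w1 = regexp(buffer, '\<\w+', 'match');
% w1 is {'Don','t','stop','well','known','words','like','E102'}

% Apostrophes are also allowed inside words.
w2 = regexp(buffer, '\<[\w'']+', 'match');
% w2 is {'Don''t','stop','well','known','words','like','E102'}

% Apostrophes, hyphens, and commas are all allowed inside words.
w3 = regexp(buffer, '\<[\w''\-,]+', 'match');
% w3 is {'Don''t','stop','well-known','words,','like','E102'}
```

Note that with the last pattern a trailing comma stays attached to the word ('words,'), which may or may not be what you want.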

8 Comments

I like it but there is a new issue; words like don't are split into 'don' and 't' rather than one word. Other than that it is fine.
Apostrophe is usually counted as part of {,";:?! etc. that you said should be excluded.
Are numbers to be excluded? So if the topic was food additives and E102 was mentioned, then how should that be handled?
The second regexp solves this issue (EDITED).
words = regexp(buffer, '\<[\w'']+', 'match') ;
Here, we define words as sequences of characters from the set [\w'], where \w stands for any alphabetic, numeric, or underscore character. The doubled single quote ('') in the expression that I give is just the way to get one literal single quote, as single quotes are used as string delimiters in MATLAB.
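
A quick way to check the quoting is to display the pattern and count its characters. This is only a sketch; the test sentence is made up, and the variable names n and words are chosen for illustration.

```matlab
% Two single quotes inside a character vector are stored as one quote.
pattern = '\<[\w'']+';
disp(pattern)           % displays \<[\w']+
n = numel(pattern);     % 8 characters, not 9

% Made-up test sentence containing apostrophes.
words = regexp('it''s John''s book', pattern, 'match');
% words is {'it''s','John''s','book'}
```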
words = regexp(buffer, '\<[\w''-]+', 'match') ;
The [] frame the set of characters that can be part of words, so I just added the '-'. To be precise, the 2nd argument of the call to REGEXP is the pattern: the \< matches the beginning of a word, the content of the [] defines the set of characters that compose words, and the + indicates that the preceding element (the []) must be matched 1 or more times (as many times as possible). The regexp engine matches the first occurrence of the pattern and extracts it, then moves on to the next one, and so on. So the pattern that you ultimately want is
\<[\w'-]+
which is expressed in MATLAB as
pattern = '\<[\w''-]+' ;
that I wrote directly in the expression at the top of this comment.
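
As a small illustration of how the engine walks through the text match by match, the sketch below also asks REGEXP for the start index of each match. The sample string is made up, and the names words and starts are hypothetical.

```matlab
% Made-up sample with hyphenated words.
buffer  = 'A well-known half-truth';
pattern = '\<[\w''-]+';

% Outputs come back in the same order as the requested options.
[words, starts] = regexp(buffer, pattern, 'match', 'start');
% words  is {'A','well-known','half-truth'}
% starts is [1 3 14]
```

Because the + is greedy, each match runs on through the hyphen until it reaches a character outside the [] set (here, a space or the end of the text).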
Thank you very much, I understand this much better now.
You want the comma to be part of words? If so, you have probably figured out by now that you can match it with
words = regexp(buffer, '\<[\w'',-]+', 'match') ;
Note that the dash has a special meaning inside [] when it sits between two literals (it codes a range, as in A-Z, which means A to Z), so you have to escape it if it doesn't come last within the []:
words = regexp(buffer, '\<[\w''\-,]+', 'match') ;
This is why I put the comma before the dash in the first expression.
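
The effect of the dash position can be checked on a tiny made-up string. This is a sketch for illustration only; the names r1, r2, and r3 are hypothetical.

```matlab
str = 'a-b-c';

% Dash between two literals: a range from a to c, so '-' itself is not matched.
r1 = regexp(str, '[a-c]+', 'match');    % {'a','b','c'}

% Escaped dash: the set is the three characters a, -, and c.
r2 = regexp(str, '[a\-c]+', 'match');   % {'a-','-c'}

% Dash placed last: also a literal dash, no escape needed.
r3 = regexp(str, '[ac-]+', 'match');    % {'a-','-c'}
```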

More Answers (2)

file = fopen('marktwain.txt', 'rt');
CStr = textscan(file, '%s');
fclose(file);
Only problem: you have not defined exactly what a "word" is for your purposes, so the above is going to break things up at whitespace.
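
To illustrate the whitespace splitting, here is a sketch that runs TEXTSCAN on a made-up character vector instead of a file. The REGEXPREP cleanup at the end is only one possible way to strip punctuation, under the assumption that apostrophes should be kept; it is not part of the answer above.

```matlab
% Made-up sample text; TEXTSCAN also accepts a character vector.
sample = 'Hello, world! It''s fine.';

C = textscan(sample, '%s');   % 1-by-1 cell holding the results
tokens = C{1};                % column cell array, split at whitespace
% tokens is {'Hello,'; 'world!'; 'It''s'; 'fine.'}

% Assumed cleanup: drop everything except \w characters and apostrophes.
cleaned = regexprep(tokens, '[^\w'']', '');
% cleaned is {'Hello'; 'world'; 'It''s'; 'fine'}
```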

1 Comment

All right: a word is made of letters, like the a in apple, between spaces, excluding { , " ; : ? ! etc.

For example:
>> allwords('This is what I got so far. But it does not actually solve my problem.')
ans =
'This' 'is' 'what' 'I' 'got' 'so' 'far' 'But' 'it' 'does' 'not' 'actually' 'solve' 'my' 'problem'

2 Comments

I tried allwords, but MATLAB didn't recognize the function. It looks useful; do I have to download it?
Yes, you would have to download it from the link that was given.

Asked on 17 Mar 2013
