I have a text file that I would like to split into an array. Each array cell should be a word, not a sentence or line in the file.

Question

0 votes

This is what I got so far. But it does not actually solve my problem.

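file = fopen('marktwain.txt', 'r');
string = fread(file, [1, inf], 'char');   % with 'char' precision, fread returns character codes as doubles
fclose(file);
% dataread is an older function (textscan is the documented alternative);
% with a '\n' delimiter it gives one cell per line, not one per word
CStr = dataread('file', 'marktwain.txt', '%s', 'delimiter', '\n');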

I have little clue where to go from here.

Answer 1

Cedric on 17 Mar 2013

Edited: Cedric on 17 Mar 2013

0 votes

 buffer = fileread('marktwain.txt') ;
 words = regexp(buffer, '\<\w+', 'match') ;

... and we can discuss the pattern if you want to refine the regexp. You could, for example, have "it's" or "John's" count as single words (rather than two each) by adding the apostrophe to the character class (EDITED):

words = regexp(buffer, '\<[\w'']+', 'match') ;
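
To see what the apostrophe in the character class changes, here is a minimal sketch comparing the two patterns. The string `sample` is made up for illustration and is not taken from marktwain.txt.

```matlab
% Hypothetical sample text, used only for illustration.
sample = 'It''s John''s book.';

% \w+ alone stops at the apostrophe, so "It's" splits into "It" and "s".
plain = regexp(sample, '\<\w+', 'match');

% With the apostrophe in the character class, contractions stay together.
withApos = regexp(sample, '\<[\w'']+', 'match');

disp(plain)
disp(withApos)
```

Here `plain` should contain five pieces (It, s, John, s, book) and `withApos` three (It's, John's, book). Note that the doubled `''` is only how MATLAB writes a single quote inside a character vector; the regexp engine sees one apostrophe.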

The final answer, after the discussion below, is:

 buffer = fileread('marktwain.txt') ;
 words = regexp(buffer, '\<[\w''\-,]+', 'match') ;

8 Comments

Marco on 17 Mar 2013

Thank you very much, I understand this much better now.

Cedric on 17 Mar 2013

Edited: Cedric on 17 Mar 2013

You want the comma to be part of words? If so, you have probably figured out by now that you can match it with

words = regexp(buffer, '\<[\w'',-]+', 'match') ;

Note that the dash has a special meaning inside [] when it sits between two literals (it codes a range, as in A-Z, which means A to Z), so you have to escape it if it doesn't come last within the []:

words = regexp(buffer, '\<[\w''\-,]+', 'match') ;

This is why I put the comma before the dash in the first expression.
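
As a quick check of the point about the dash, the sketch below runs both patterns on a made-up string (`sample` is an assumption for illustration) and compares the results.

```matlab
% Hypothetical sample text, used only for illustration.
sample = 'well-known, old-fashioned words';

% Dash placed last in the class: treated as a literal dash.
dashLast = regexp(sample, '\<[\w'',-]+', 'match');

% Dash escaped in the middle of the class: also a literal dash.
dashEscaped = regexp(sample, '\<[\w''\-,]+', 'match');

% Both patterns should return the same cell array of matches.
disp(isequal(dashLast, dashEscaped))
disp(dashLast)
```

Because the comma is in the class, the first match keeps its trailing comma (`well-known,`). An unescaped dash between the apostrophe and the comma would most likely be read as a range from `'` to `,`, which in ASCII also covers `( ) * +`, so the escaped or trailing form is the safer choice.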

Answer 2

Walter Roberson on 17 Mar 2013

0 votes

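file = fopen('marktwain.txt', 'rt');   % open the file for reading in text mode
CStr = textscan(file, '%s');           % %s reads whitespace-delimited tokens
fclose(file);
CStr = CStr{1};                        % textscan returns a 1-by-1 cell; unwrap the list of tokens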

Only problem: you have not defined exactly what a "word" is for your purposes, so the above is going to break things up at whitespace.
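
To illustrate what splitting at whitespace means in practice, here is a small sketch that runs textscan on a made-up character vector instead of a file (the string `sample` and the clean-up step are assumptions for illustration, not part of the answer above).

```matlab
% Hypothetical sample text; textscan also accepts a character vector.
sample = 'Hello, world! It''s fine.';

% %s reads whitespace-delimited tokens, so punctuation stays attached.
C = textscan(sample, '%s');
tokens = C{1};              % unwrap the 1-by-1 cell returned by textscan

% One possible clean-up: drop everything except word characters and apostrophes.
cleaned = regexprep(tokens, '[^\w'']', '');

disp(tokens)
disp(cleaned)
```

The tokens come back as a column cell array with the punctuation still attached (`Hello,`, `world!`, and so on); the regexprep call is just one way to strip it afterwards, and which characters to keep depends on how a "word" is defined.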

1 Comment

Marco on 17 Mar 2013

All right; a word is a run of letters between spaces (like the "a" in "apple"), excluding punctuation such as { , " ; : ? ! etc.
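
Under that definition (letters only, punctuation excluded), one possible sketch is to match runs of letters directly. The string `sample` is made up for illustration, and the pattern assumes plain ASCII letters.

```matlab
% Hypothetical sample text, used only for illustration.
sample = 'He said: "Stop!" {really}; why?';

% Match runs of ASCII letters; digits, underscores and punctuation are skipped.
words = regexp(sample, '[A-Za-z]+', 'match');

disp(words)
```

This keeps only the letter runs (He, said, Stop, really, why). With this stricter pattern, contractions such as "it's" would be split at the apostrophe.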

Answer 3

Image Analyst on 17 Mar 2013

Edited: Image Analyst on 17 Mar 2013

0 votes

Try John D'Errico's allwords(): http://www.mathworks.com/matlabcentral/fileexchange/27184-allwords

For example:

>> allwords('This is what I got so far. But it does not actually solve my problem.')
ans = 
    'This'    'is'    'what'    'I'    'got'    'so'    'far'    'But'    'it'    'does'    'not'    'actually'    'solve'    'my'    'problem'

2 Comments

Marco on 17 Mar 2013

I tried allwords, but MATLAB didn't recognize the function. It looks useful; do I have to download it?

Walter Roberson on 17 Mar 2013

Yes, you would have to download it from the link that was given.
