textscan or import of unicode encoded textfile

Question 1: Are textscan and importdata supposed to work with Unicode-encoded text files?
Question 2: After a UTF-8 encoded file is opened with the correct encoding specified in the fopen arguments, the textscan output contains three characters (ï»¿) preceding the very first valid data in the file. Is this expected but undocumented behavior?

Answers (2)

Anne on 5 Dec 2011
I have the same problem with my old MATLAB 7.3.0. textscan does not read Unicode files correctly, but it can handle Unicode text that is already in a string.
Thus a simple (but slow) workaround is to read the text first with fscanf and then run textscan on that text.
[f,msg]=fopen(nomfic,'r','n','UTF-8');
LIGNES=textscan(f,'%[^\n]','delimiter','\n');
does not work with Unicode-encoded characters, but
[f,msg]=fopen(nomfic,'r','n','UTF-8');
txt=fscanf(f,'%c');
LIGNES=textscan(txt,'%[^\n]','delimiter','\n');
does.
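If the text read this way still begins with a byte order mark, it can be removed before the text is passed to textscan. The following is only a minimal sketch: strip_bom is a hypothetical helper name, not a built-in MATLAB function, and it assumes the mark appears either as the single decoded character U+FEFF or as the three undecoded bytes 0xEF, 0xBB, 0xBF.

```matlab
function txt = strip_bom(txt)
% STRIP_BOM  Remove a leading UTF-8 byte order mark from a char vector.
% Hypothetical helper for illustration; not part of MATLAB.

if ~isempty(txt) && double(txt(1)) == 65279
    % The BOM was decoded into the single character U+FEFF.
    txt = txt(2:end);
elseif numel(txt) >= 3 && isequal(double(txt(1:3)), [239 187 191])
    % The BOM bytes 0xEF 0xBB 0xBF were read as three separate characters.
    txt = txt(4:end);
end
end
```

With this helper, the workaround above would become txt=strip_bom(fscanf(f,'%c')); followed by the same textscan call on txt.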

Walter Roberson on 22 Sep 2011
Answer 1: textscan() is supposed to; I do not know about importdata.
Answer 2: When you explicitly specify one of the UTF-* encodings, the MATLAB code does not look for a Byte Order Mark and leaves any Byte Order Mark in the file stream. If you do not explicitly specify the encoding, the byte stream is examined for a Byte Order Mark and, if one is found, the encoding is determined from it.
Using a Byte Order Mark with UTF-8 is not recommended, but some Windows editors insert one anyway. The Byte Order Mark represented in UTF-8 is 0xEF,0xBB,0xBF, which shows up exactly as the characters you noticed. See reference
I have not checked whether it makes a difference whether you open the file with 'r' or 'rt'. I use 'rt' when referring to text files, as it can make a difference in some instances.
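Since an explicitly specified encoding leaves the Byte Order Mark in the stream, one option is to check the first three bytes yourself and start reading after them. This is a rough sketch under that assumption: open_utf8_skip_bom is a hypothetical helper name, and whether textscan then decodes the rest of the file correctly may depend on the MATLAB release.

```matlab
function fid = open_utf8_skip_bom(fname)
% OPEN_UTF8_SKIP_BOM  Open a UTF-8 text file and skip a leading BOM.
% Hypothetical helper for illustration; not part of MATLAB.

fid = fopen(fname, 'r', 'n', 'UTF-8');
if fid == -1
    error('open_utf8_skip_bom:cannotOpen', 'Could not open %s', fname);
end

% Read the first three raw bytes (not decoded characters).
head = fread(fid, 3, 'uint8')';

% 0xEF 0xBB 0xBF is the UTF-8 form of the Byte Order Mark.
if ~isequal(head, [239 187 191])
    frewind(fid);   % no BOM: go back to the start of the file
end
end
```

The returned file identifier can then be passed to textscan or fscanf as usual and should be closed with fclose when done.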
