Cannot read emojis correctly from an imported json file

I have a JSON file exported from a Telegram chat. I wrote some code to import and process it into a 2-column cell array, where the first column is the sender and the second column is the message. The problem: the message text comes through fine, except for the emojis. They show up either as a weird character followed by a bunch of squares or as just a single square, depending on how I import the JSON. The following code is the importing section:
fname = 'result.json';
fid = fopen(fname);
raw = fread(fid,inf);
% this way, an emoji is shown as ð followed by squares
% fid = fopen(fname,'r','n','UTF-8');
% raw = fread(fid,'*char');
% this way, an emoji is shown as a single square
str = char(raw');
fclose(fid);
val = jsondecode(str);
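data = cell(size(val.messages,1),2);
for i = 1:size(val.messages,1)
    % strcmp compares whole strings; == compares character by character
    % and errors if the two lengths differ
    if strcmp(val.messages{i,1}.type,'message')
        data{i,1} = val.messages{i,1}.from;
        data{i,2} = val.messages{i,1}.text;
    else
        data{i,1} = 'not message';
        data{i,2} = 'not message';
    end
end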
I need some help to figure out how to show emojis properly. Or at least to have a way to distinguish them (like some ID string/code or something), since I need to do some data analysis down the line. How could I find a solution to this? Can I use some different importing step?
I also have HTML files of the chat and I found a way to import them successfully (even though the processing is much more difficult). The emojis in the files show fine but are not shown in MATLAB. This might be a second step in case I can't solve the json problem. Any help is appreciated, thank you.
Edit: I've seen some ready-made whatsapp parsers that automatically organize data in tables. Alternatively, if someone has something similar for telegram raw data, it would be nice. I'd love to solve this problem directly to learn more, but if that isn't possible then I'd love an alternative solution.
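On the question of a different importing step, one option is to read the file as raw bytes and decode them explicitly as UTF-8 before calling jsondecode. The following is a minimal sketch under assumptions, not a confirmed fix: the temporary file name and the one-message sample are made up so the snippet runs on its own, and a real Telegram export will have more fields.

```matlab
% Hypothetical sample file so the sketch is self-contained;
% replace fname with 'result.json' for the real export.
fname = fullfile(tempdir, 'emoji_sample.json');
sample = ['{"messages":[{"type":"message","from":"A","text":"hi ' ...
          char([55357 56832]) '"}]}'];      % U+1F600 as a UTF-16 surrogate pair
fid = fopen(fname, 'w');
fwrite(fid, unicode2native(sample, 'UTF-8'), 'uint8');
fclose(fid);

% Read the raw bytes without any implicit character conversion.
fid = fopen(fname, 'r');
bytes = fread(fid, inf, 'uint8=>uint8')';
fclose(fid);

% Decode the bytes as UTF-8 into a MATLAB char vector, then parse the JSON.
str = native2unicode(bytes, 'UTF-8');
val = jsondecode(str);
msg = val.messages(1).text;   % the sample decodes to a 1x1 struct, not a cell
```

With this route the emoji is stored in msg as two UTF-16 code units. Whether it is drawn as a picture or as a square in the Command Window depends on the display font, so a square on screen does not necessarily mean the data was lost.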

Answers (1)

Poorna on 7 Apr 2024
Hi Paye,
I see that you want to extract emojis from your exported Telegram chat. If you have access to the HTML files of the chat, you can use the "extractHTMLText" function to extract the text from the HTML file. In most cases this will also read the emojis in the text. You can then use the "tokenizedDocument" function to tokenize the extracted text. This function automatically detects emojis and assigns them the token type "emoji".
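The two steps described here can be sketched roughly as follows. This is an illustration only: it assumes Text Analytics Toolbox is installed, and the HTML snippet is made up rather than taken from a real Telegram export.

```matlab
% Made-up HTML containing one emoji (U+1F600 written as a surrogate pair).
html = "<html><body><p>Good morning " + char([55357 56832]) + "</p></body></html>";

% Step 1: strip the markup and keep the text.
txt = extractHTMLText(html);

% Step 2: tokenize; each token gets a type such as letters, punctuation or emoji.
doc = tokenizedDocument(txt);
details = tokenDetails(doc);

% Keep only the tokens whose detected type is emoji.
emojiTokens = details.Token(details.Type == "emoji");
```

The Type column of the tokenDetails table is what lets you separate emojis from ordinary words for later analysis, even when the emoji itself is not rendered on screen.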
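To learn more about the above functions, refer to the Text Analytics Toolbox documentation for "extractHTMLText" and "tokenizedDocument".
To learn more about analyzing emojis in MATLAB, refer to the Text Analytics Toolbox documentation on analyzing text data containing emojis.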
Hope this Helps!
  2 Comments
Paye on 8 Apr 2024
Hi Poorna, thanks for your response. I already tried the extractHTMLText function on the html file with the following code:
fname="html/messages.html";
fid=fileread(fname,"Encoding","UTF-8");
str=string(fid);
elem=extractHTMLText(str);
tok=tokenizedDocument(elem);
Unfortunately, emojis do not appear anywhere in the output string "str", or even in "fid". I'm not sure why that is. Is my code wrong? Because the emojis show up in the HTML file just fine.
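One way to check whether the emojis are really missing, or are present but simply not drawn, is to look at the numeric code units of the text instead of its display. The sketch below is a hedged illustration on a made-up sample string: it finds UTF-16 surrogate pairs and turns each into a "U+..." ID. It only catches characters outside the Basic Multilingual Plane, so emojis with lower code points (some hearts and older symbols, for example) would not be found this way.

```matlab
% Made-up sample; in practice txt would be char(str) from the imported file.
txt = ['ok ' char([55357 56832]) ' bye'];   % U+1F600 stored as a surrogate pair

u = double(txt);                            % UTF-16 code units
isHigh = u >= 55296 & u <= 56319;           % high surrogates, D800..DBFF
idx = find(isHigh(1:end-1));                % candidates that can start a pair

% Keep only candidates that are followed by a low surrogate, DC00..DFFF.
idx = idx(u(idx+1) >= 56320 & u(idx+1) <= 57343);

% Combine each pair into a single Unicode code point.
codePoints = (u(idx) - 55296) * 1024 + (u(idx+1) - 56320) + 65536;

% Readable ID strings that can be counted or grouped later.
ids = compose("U+%X", codePoints(:));
```

If idx comes back empty on the real data, the characters were lost or altered before this point. If it is non-empty, the emojis are in the string and the squares are only a display limitation.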
