How to use Unicode numeric values in regexprep?

How can "Häagen-Dasz" be converted to "Haagen-Dasz" using Unicode numeric values? For example,
regexprep('Häagen-Dasz','ä','A')
works fine, but
regexprep('Häagen-Dasz','\x{C4}','a')
does not. Here, the hexadecimal \x{C4} stands for [latin capital letter a] with diaeresis, i.e. [ä].

Accepted Answer

Yash on 28 Mar 2024
Edited: Yash on 28 Mar 2024
Hi Vlad,
'\x{C4}' represents the Unicode character Ä (Latin Capital Letter A with Diaeresis) in hexadecimal notation.
If you want to replace ä (Latin Small Letter A with Diaeresis), you should use \x{E4}, which is its Unicode hexadecimal representation.
Since you want to replace ä with a, use the code point of ä in the pattern and a plain a as the replacement:
regexprep('Häagen-Dasz','\x{E4}','a')
ans = 'Haagen-Dasz'
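If you are unsure which code point a character has, you can look it up in MATLAB itself. The following is a minimal sketch, not the only way to do it; the variable names (str, codePoint, hexCode, pattern) are illustrative, and char(228) is used instead of a literal ä only so the snippet does not depend on the file encoding.

```matlab
% Build the test string; char(228) is ä (U+00E4).
str = ['H', char(228), 'agen-Dasz'];

% Look up the numeric value of the second character.
codePoint = double(str(2));            % decimal code point
hexCode   = dec2hex(codePoint);        % same value as a hexadecimal string

% Assemble the \x{...} token and use it as the regexprep pattern.
pattern = sprintf('\\x{%s}', hexCode); % backslash is escaped for sprintf
out = regexprep(str, pattern, 'a');
disp(out)
```

Because hexadecimal digits are not case sensitive, \x{E4} and \x{e4} match the same character.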
Hope this helps!

More Answers (2)

Stephen23 on 28 Mar 2024
inp = 'Häagen-Dasz';
baz = @(v)char(v(1)); % only need the first decomposed character.
out = arrayfun(@(c)baz(py.unicodedata.normalize('NFKD',c)),inp) % remove diacritics.
out = 'Haagen-Dasz'
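To see why keeping only the first decomposed character works, it may help to inspect what NFKD normalization returns for ä. This is a small sketch that assumes MATLAB is configured with a working Python installation (as the code above also requires); the variable name decomposed is illustrative.

```matlab
% Decompose ä (U+00E4) with Python's unicodedata and convert back to char.
decomposed = char(py.unicodedata.normalize('NFKD', char(228)));

% Show the code points of the result: the base letter comes first,
% followed by the combining diaeresis.
disp(double(decomposed))   % 97 (a) and 776 (U+0308, combining diaeresis)
```

Characters without a decomposition, such as the hyphen, come back unchanged, so taking the first element leaves them as they were.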
Read more:
https://docs.python.org/3/library/unicodedata.html
https://stackoverflow.com/questions/16467479/normalizing-unicode

VBBV on 28 Mar 2024
regexprep('Häagen-Dasz','ä','A')
ans = 'HAagen-Dasz'
regexprep('Häagen-Dasz','ä','\x{C4}')
ans = 'HÄagen-Dasz'
  2 Comments
VBBV on 28 Mar 2024
Moved: VBBV on 28 Mar 2024
regexprep('Häagen-Dasz','\x{e4}','a')
ans = 'Haagen-Dasz'
VBBV on 28 Mar 2024
The Unicode code point for the small letter ä (a with diaeresis) is \x{e4}; \x{C4} is the capital letter Ä.
