The difference between Unicode normalization forms
Recently I have been working on an NLP task in Japanese, and one problem is converting special characters to a normalized form. I did a little research and wrote this post for anyone who has the same need.
Japanese text mixes different forms of the same character. Latin letters, for example, have two forms: a full-width form (ＫＡＤＯＫＡＷＡ) and a half-width form (KADOKAWA). The full-width form looks odd and is also hard to use in downstream processing, so we need to convert it to a normalized form.
TL;DR
Use the NFKC method.
from unicodedata import normalize
s = "株式会社ＫＡＤＯＫＡＷＡ　Ｆｕｔｕｒｅ　Ｐｕｂｌｉｓｈｉｎｇ"
normalize('NFKC', s)
'株式会社KADOKAWA Future Publishing'
Unicode normalization forms
(Table: the four Unicode normalization forms, from Wikipedia)
There are 4 kinds of Unicode normalization forms. This article gives a very detailed explanation, but I will explain the differences in a simple, easy-to-understand way.
First, let's look at the results below for an intuitive understanding.
ｱｲｳｴｵ (NFC)> ｱｲｳｴｵ
ｱｲｳｴｵ (NFD)> ｱｲｳｴｵ
ｱｲｳｴｵ (NFKC)> アイウエオ
ｱｲｳｴｵ (NFKD)> アイウエオ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ (NFC)> ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ (NFD)> ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ (NFKC)> パピプペポ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ (NFKD)> パピプペポ
パピプペポ (NFC)> パピプペポ
パピプペポ (NFD)> パピプペポ
パピプペポ (NFKC)> パピプペポ
パピプペポ (NFKD)> パピプペポ
ａｂｃＡＢＣ (NFC)> ａｂｃＡＢＣ
ａｂｃＡＢＣ (NFD)> ａｂｃＡＢＣ
ａｂｃＡＢＣ (NFKC)> abcABC
ａｂｃＡＢＣ (NFKD)> abcABC
１２３ (NFC)> １２３
１２３ (NFD)> １２３
１２３ (NFKC)> 123
１２３ (NFKD)> 123
＋－．～）｝ (NFC)> ＋－．～）｝
＋－．～）｝ (NFD)> ＋－．～）｝
＋－．～）｝ (NFKC)> +-.~)}
＋－．～）｝ (NFKD)> +-.~)}
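The conversions above can be reproduced with Python's unicodedata.normalize; a minimal sketch using half-width katakana and full-width Latin letters as inputs:

```python
from unicodedata import normalize

half_kana = "ｱｲｳｴｵ"       # half-width katakana
full_latin = "ａｂｃＡＢＣ"  # full-width Latin letters

# NFC/NFD keep compatibility variants as-is;
# NFKC/NFKD fold them into the canonical characters.
print(normalize("NFC", half_kana))    # unchanged: ｱｲｳｴｵ
print(normalize("NFKC", half_kana))   # full-width: アイウエオ
print(normalize("NFKC", full_latin))  # half-width: abcABC
```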
There are two ways to classify these 4 forms. The letters stand for:
D = Decomposition
C = Composition
K = Compatibility
1 Whether the original form is changed or not
K means compatibility, and it is what distinguishes NFKC/NFKD from NFC/NFD: compatibility normalization may replace a character with a different but compatible character (e.g. full-width Ａ becomes half-width A), so the original form changes, and the length may change as well.
s = '…'
normalize('NFKC', s)
'...'
len(s)
1
len(normalize('NFC', s))
1
len(normalize('NFKC', s))
3
len(normalize('NFD', s))
1
len(normalize('NFKD', s))
3
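Since Python 3.8, unicodedata.is_normalized lets you test whether a string is already in a given form without building the normalized copy; a small sketch with the ellipsis character:

```python
from unicodedata import is_normalized, normalize

s = "…"  # U+2026 HORIZONTAL ELLIPSIS

print(is_normalized("NFC", s))        # True: NFC leaves it unchanged
print(is_normalized("NFKC", s))       # False: NFKC would change it
print(normalize("NFKC", s) == "...")  # True: expanded to three dots
```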
2 Whether the length of the original form is changed or not
パピプペポ (NFC)> パピプペポ
パピプペポ (NFD)> パピプペポ
パピプペポ (NFKC)> パピプペポ
パピプペポ (NFKD)> パピプペポ
You might be wondering why the length changes. Please see the test below.
from unicodedata import normalize
s = "パピプペポ"
len(s)
5
len(normalize('NFC', s))
5
len(normalize('NFKC', s))
5
len(normalize('NFD', s))
10
len(normalize('NFKD', s))
10
We can see that the "decomposition" forms double the length.
(Figure: decomposition of ポ into ホ plus a combining mark, from Unicode正規化とは)
This is because NFD and NFKD decompose each precomposed character into a base character plus a combining mark. For example, ポ (U+30DD) = ホ (U+30DB) + the combining semi-voiced sound mark (U+309A), so the length changes from 5 to 10. NFC and NFKC compose the separated code points back together, so the length is unchanged.
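You can inspect this decomposition directly by listing the code point of every character in each form:

```python
from unicodedata import name, normalize

s = "ポ"  # U+30DD KATAKANA LETTER PO
for form in ("NFC", "NFD"):
    out = normalize(form, s)
    print(form, [f"U+{ord(c):04X} {name(c)}" for c in out])
# NFC keeps the single precomposed character (U+30DD);
# NFD yields U+30DB (ホ) followed by U+309A (the combining mark)
```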
Python Implementation
You can use the unicodedata library to get different forms.
from unicodedata import normalize
s = "パピプペポ"
len(s)
5
len(normalize('NFC', s))
5
len(normalize('NFKC', s))
5
len(normalize('NFD', s))
10
len(normalize('NFKD', s))
10
Takeaway
Usually, we can use either NFKC or NFKD to get the normalized form. The length difference only causes trouble if your NLP task is length-sensitive. I usually use NFKC.
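As a sketch of how this fits into a preprocessing pipeline (clean_text is a hypothetical helper name, not part of any library):

```python
from unicodedata import normalize

def clean_text(text: str) -> str:
    """Fold full-width/half-width variants into one canonical form
    before tokenization (NFKC, as recommended above)."""
    return normalize("NFKC", text)

print(clean_text("株式会社ＫＡＤＯＫＡＷＡ　１２３"))  # 株式会社KADOKAWA 123
```

Note that NFKC also converts the full-width ideographic space (U+3000) into an ordinary space, which is usually what a tokenizer expects.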
Check out my other posts on Medium with a categorized view!
GitHub: BrambleXu
LinkedIn: Xu Liang
Blog: BrambleXu
Reference
https://unicode.org/reports/tr15/#Norm_Forms
https://www.wikiwand.com/en/Unicode_equivalence#/Normal_forms
http://nomenclator.la.coocan.jp/unicode/normalization.htm
https://maku77.github.io/js/string/normalize.html
http://tech.albert2005.co.jp/501/
https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c