
Difference Between NFD, NFC, NFKD, and NFKC Explained with Python Code


The difference between Unicode normalization forms

Recently I have been working on an NLP task in Japanese, and one problem was converting special characters to a normalized form. So I did a little research and wrote this post for anyone who has the same need.

Japanese text contains different forms of the same character. For example, Latin letters have two forms: full-width and half-width.

The full-width form looks awkward and is hard to use in downstream processing, so we need to convert it to a normalized form.

TL;DR
Use NFKC.

from unicodedata import normalize
s = "株式会社ＫＡＤＯＫＡＷＡ Ｆｕｔｕｒｅ Ｐｕｂｌｉｓｈｉｎｇ"
normalize("NFKC", s)
'株式会社KADOKAWA Future Publishing'
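As a rough sketch, the TL;DR can be wrapped in a small helper and checked with a round trip. The function names `to_full_width` and `to_nfkc` are hypothetical, introduced only for this illustration. The full-width input is built in code because ASCII U+0021..U+007E maps to the full-width block U+FF01..U+FF5E by adding 0xFEE0, and the space maps to U+3000.

```python
from unicodedata import normalize


def to_full_width(text: str) -> str:
    """Map printable ASCII to full-width forms (hypothetical helper, demo only)."""
    out = []
    for ch in text:
        if ch == " ":
            out.append("\u3000")               # ideographic (full-width) space
        elif "!" <= ch <= "~":
            out.append(chr(ord(ch) + 0xFEE0))  # U+0021..U+007E -> U+FF01..U+FF5E
        else:
            out.append(ch)                     # leave everything else alone
    return "".join(out)


def to_nfkc(text: str) -> str:
    """Fold compatibility characters such as full-width letters (hypothetical helper)."""
    return normalize("NFKC", text)


original = "KADOKAWA Future Publishing"
wide = to_full_width(original)

print(wide)                       # the full-width rendering of the text
print(to_nfkc(wide) == original)  # True: NFKC folds it back to plain ASCII
```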
Unicode normalization forms

(Figure: Unicode normalization forms, from Wikipedia)
There are four Unicode normalization forms. This article gives a very detailed explanation, but I will explain the differences in a simple, easy-to-understand way.

First, let's look at the results below to get an intuitive understanding.

ｱｲｳｴｵ (NFC)> ｱｲｳｴｵ
ｱｲｳｴｵ (NFD)> ｱｲｳｴｵ
ｱｲｳｴｵ (NFKC)> アイウエオ
ｱｲｳｴｵ (NFKD)> アイウエオ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ (NFC)> ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ (NFD)> ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ (NFKC)> パピプペポ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ (NFKD)> パピプペポ
パピプペポ (NFC)> パピプペポ
パピプペポ (NFD)> パピプペポ
パピプペポ (NFKC)> パピプペポ
パピプペポ (NFKD)> パピプペポ
ａｂｃＡＢＣ (NFC)> ａｂｃＡＢＣ
ａｂｃＡＢＣ (NFD)> ａｂｃＡＢＣ
ａｂｃＡＢＣ (NFKC)> abcABC
ａｂｃＡＢＣ (NFKD)> abcABC
１２３ (NFC)> １２３
１２３ (NFD)> １２３
１２３ (NFKC)> 123
１２３ (NFKD)> 123
＋－．～）｝ (NFC)> ＋－．～）｝
＋－．～）｝ (NFD)> ＋－．～）｝
＋－．～）｝ (NFKC)> +-.~)}
＋－．～）｝ (NFKD)> +-.~)}
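A listing like the one above can be regenerated with a short loop. This is only a sketch: the helper name `normalization_rows` is made up for the example, and the samples are written as escape sequences so that the half-width and full-width inputs survive copy and paste. The lengths are printed as well, because decomposed output looks identical to composed output on screen.

```python
from unicodedata import normalize

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalization_rows(samples):
    """Yield (input, form, output) for every sample and every form."""
    for s in samples:
        for form in FORMS:
            yield s, form, normalize(form, s)


samples = [
    "\uff71\uff72\uff73\uff74\uff75",        # half-width katakana a-i-u-e-o
    "\u30d1\u30d4\u30d7\u30da\u30dd",        # full-width katakana pa-pi-pu-pe-po
    "\uff41\uff42\uff43\uff21\uff22\uff23",  # full-width abcABC
    "\uff11\uff12\uff13",                    # full-width 123
]

for s, form, out in normalization_rows(samples):
    print(f"{s} ({form})> {out}  len {len(s)} -> {len(out)}")
```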
These four forms can be classified in two ways.

1 Whether the original form is changed or not

  • A (not changed): NFC & NFD
  • B (changed): NFKC & NFKD

2 Whether the length of the original form is changed or not

  • A (not changed): NFC & NFKC
  • B (changed): NFD & NFKD

1 Whether the original form is changed or not
ａｂｃＡＢＣ (NFC)> ａｂｃＡＢＣ
ａｂｃＡＢＣ (NFD)> ａｂｃＡＢＣ
ａｂｃＡＢＣ (NFKC)> abcABC
ａｂｃＡＢＣ (NFKD)> abcABC

The first classification is based on whether the original form is changed. More specifically, the forms in group A (NFC and NFD) do not contain K, while the forms in group B (NFKC and NFKD) do. What does K mean?

D = Decomposition
C = Composition
K = Compatibility
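K stands for compatibility. The K forms replace compatibility characters, such as full-width letters, with their standard equivalents, so the text no longer matches the original form. Because one compatibility character can map to several characters, the length may change as well.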

from unicodedata import normalize
s = '…'
normalize('NFKC', s)
'...'
len(s)
1
len(normalize('NFC', s))
1
len(normalize('NFKC', s))
3
len(normalize('NFD', s))
1
len(normalize('NFKD', s))
3
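One way to see in advance which characters the K forms will touch is to look at their decomposition mapping. The sketch below uses `unicodedata.decomposition`, which returns a string that starts with a tag such as `<compat>`, `<wide>` or `<narrow>` for compatibility mappings, and an untagged string (or an empty one) otherwise. The helper name `is_compatibility_char` is hypothetical, and it is only a rough per-character check, not a full description of what NFKC does.

```python
import unicodedata


def is_compatibility_char(ch: str) -> bool:
    """True if ch has a tagged (compatibility) decomposition mapping."""
    return unicodedata.decomposition(ch).startswith("<")


for ch in ["\u2026", "\uff21", "\uff71", "\u30dd", "A"]:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        repr(unicodedata.decomposition(ch)),
        is_compatibility_char(ch),
    )
```

The horizontal ellipsis (U+2026) reports `<compat> 002E 002E 002E`, which is why its length goes from 1 to 3 under NFKC and NFKD, while ポ (U+30DD) reports the untagged canonical mapping `30DB 309A`.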
2 Whether the length of the original form is changed or not
パピプペポ (NFC)> パピプペポ
パピプペポ (NFD)> パピプペポ
パピプペポ (NFKC)> パピプペポ
パピプペポ (NFKD)> パピプペポ

The second classification is based on whether the length of the original form is changed. Group A (NFC and NFKC) contains C (Composition), which keeps characters in their composed form, so text that is already composed keeps its length. Group B (NFD and NFKD) contains D (Decomposition), which can make the text longer.

You might be wondering why the length changes. Please see the test below.

from unicodedata import normalize
s = "パピプペポ"
len(s)
5
len(normalize("NFC", s))
5
len(normalize("NFKC", s))
5
len(normalize("NFD", s))
10
len(normalize("NFKD", s))
10
We can see that decomposition doubles the length of this string.

(Figure from "Unicode正規化とは" ["What is Unicode normalization?"])
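This is because NFD and NFKD decompose each composed character into a base character and a combining mark. For example, ポ (U+30DD) = ホ (U+30DB) + the combining semi-voiced sound mark (U+309A). Every character in this string decomposes into two, so the length changes from 5 to 10. Characters that have no decomposition are left as they are. NFC and NFKC compose the separated characters back together, so the length is unchanged.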
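The decomposition of ポ can be inspected directly by printing code points. This is a minimal sketch, and `code_points` is a hypothetical helper name. It also shows that a string with nothing to decompose keeps its length under NFD.

```python
from unicodedata import name, normalize


def code_points(text: str) -> list:
    """Return the code points of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


po = "\u30dd"                      # KATAKANA LETTER PO
decomposed = normalize("NFD", po)

print(code_points(decomposed))     # ['U+30DB', 'U+309A']
print(name(decomposed[1]))         # COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
print(normalize("NFC", decomposed) == po)  # True: composition reverses it

# Characters without a canonical decomposition are left alone by NFD.
plain = "\u30a2\u30a4\u30a6\u30a8\u30aa"   # katakana a-i-u-e-o
print(len(plain), len(normalize("NFD", plain)))  # 5 5
```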

Python Implementation
You can use the unicodedata library to get different forms.

from unicodedata import normalize
s = "パピプペポ"
len(s)
5
len(normalize("NFC", s))
5
len(normalize("NFKC", s))
5
len(normalize("NFD", s))
10
len(normalize("NFKD", s))
10
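A practical consequence is that two strings can look identical and still compare unequal until both are normalized to the same form. The sketch below assumes Python 3.8 or later, where `unicodedata.is_normalized` is available. `same_text` is a hypothetical helper name.

```python
from unicodedata import is_normalized, normalize


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return normalize(form, a) == normalize(form, b)


composed = "\u30dd"          # one code point
decomposed = "\u30db\u309a"  # base character plus combining mark

print(composed == decomposed)             # False: different code points
print(same_text(composed, decomposed))    # True: equal once normalized
print(is_normalized("NFC", decomposed))   # False
print(is_normalized("NFD", decomposed))   # True
```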

Take Away
Usually, either NFKC or NFKD will give you a normalized form. The difference in length only matters if your NLP task is length sensitive. I usually use NFKC.

GitHub: BrambleXu
LinkedIn: Xu Liang
Blog: BrambleXu

Reference
https://unicode.org/reports/tr15/#Norm_Forms
https://www.wikiwand.com/en/Unicode_equivalence#/Normal_forms
http://nomenclator.la.coocan.jp/unicode/normalization.htm
https://maku77.github.io/js/string/normalize.html
http://tech.albert2005.co.jp/501/

https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c
