tok

The fastest and most complete/customizable tokenizer in Python.

It is roughly 25x faster than spaCy's and NLTK's regex-based tokenizers.

Using the Aho-Corasick algorithm makes it a novelty and allows it to be both fast and explainable in how it splits text.

The heavy lifting is done by textsearch and pyahocorasick, which allows tok itself to be written in only ~200 lines of code.

Unlike regex-based approaches, it goes over each character in a text only once. Read below to see how this works.
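To see why a single pass is enough, here is a minimal sketch using pyahocorasick directly (just the underlying idea, not tok's actual internals; the patterns are made up for illustration):

import ahocorasick

automaton = ahocorasick.Automaton()
# Register the patterns we care about; the stored value can be anything.
for pattern in ["n't", "http", "."]:
    automaton.add_word(pattern, pattern)
automaton.make_automaton()

# One left-to-right scan over the text reports every match with its end index.
for end_index, pattern in automaton.iter("I wouldn't do that."):
    print(end_index, pattern)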

Installation

pip install tok

Usage

By default it handles contractions, URLs, (floating-point) numbers and currencies.

from tok import word_tokenize

word_tokenize("I wouldn't do that.... would you?")

['I', 'would', 'not', 'do', 'that', '...', 'would', 'you', '?']
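URLs and floating-point numbers are likewise kept together as single tokens. The following is an illustrative example (the input is made up, and the output is what the defaults described above imply, not a quoted result):

word_tokenize("See https://example.com it weighs 3.5 kg")

['See', 'https://example.com', 'it', 'weighs', '3.5', 'kg']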

Or configure it yourself:

from tok import Tokenizer

tokenizer = Tokenizer(protected_words=["some.thing"]) # still using the defaults

tokenizer.word_tokenize("I want to protect some.thing")

['I', 'want', 'to', 'protect', 'some.thing']

Split by sentences:

from tok import sent_tokenize

sent_tokenize("I wouldn't do that.... would you?")

[['I', 'would', 'not', 'do', 'that', '...'], ['would', 'you', '?']]

For more options, check the documentation of the Tokenizer class.

Further customization

Given:

from tok import Tokenizer

t = Tokenizer(protected_words=["some.thing"]) # still using the defaults

You can add your own ideas to the tokenizer by using:

t.keep(x, reason): Whenever it finds x, it will not add whitespace. Prevents direct tokenization.

t.split(x, reason): Whenever it finds x, it will surround it by whitespace, thus creating a token.

t.drop(x, reason): Whenever it finds x, it will remove it but add a split.

t.strip(x, reason): Whenever it finds x, it will remove it without splitting.

t.drop("bla", "bla is not needed")

t.word_tokenize("Please remove bla, thank you")

['Please', 'remove', ',', 'thank', 'you']
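t.split and t.strip work the same way. For example, a hypothetical rule that turns hyphens into their own tokens (the output shown is what the rules above imply, not a quoted result):

t.split("-", "hyphens should be tokens of their own")

t.word_tokenize("state-of-the-art")

['state', '-', 'of', '-', 'the', '-', 'art']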

Explainable

Explain what happened:

t.explain("bla")

[{'from': 'bla', 'to': ' ', 'explanation': 'bla is not needed'}]

To see everything that is in there (this will help you understand how it works):

t.explain_dict

How it works

It will always keep only the longest match. By introducing a space in your tokens, you can make sure a split will happen there.

To see how the tokenization of "." works, consider these rules:

When it finds " A." it will keep it as " A." (single-letter abbreviations)

When it finds ".0" it will keep it as ".0" (decimal numbers)

When it finds "." it will turn it into " . " (thus making a split)

If you want to make sure something containing a dot stays intact, you can, for example, use:

t.keep("cool.")

Contributing

Contributions to this library would be greatly appreciated.

It would also be great to add contractions for other languages.
