Sentencepiece as a spacy.Language

Camphr supports Sentencepiece as a spacy.Language, so you can load and use it the same way as en or any other spaCy language.

Usage

Pass the path to your trained spiece.model file to spacy.blank through the meta argument, as follows:

>>> import spacy
>>> nlp = spacy.blank("sentencepiece", meta={"tokenizer": {"model_path": "/path/to/your/spiece.model"}})

Now you can use nlp as you normally would:

>>> doc = nlp("I saw a  girl with a telescope.")
>>> print(list(doc))
[I, saw, a, girl, with, a, te, le, s, c, o, pe, .]

(The tokenization depends on your spiece.model, so your output may differ from the above.)

The raw result of Sentencepiece can be obtained via doc._.spm_pieces_:

>>> print(doc._.spm_pieces_)
['▁I', '▁saw', '▁a', '▁girl', '▁with', '▁a', '▁', 'te', 'le', 's', 'c', 'o', 'pe', '.']
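
The "▁" character (U+2581) in a piece marks a position where whitespace preceded it. As a minimal sketch, assuming the pieces listed above (hard-coded here, so neither Camphr nor a model file is needed), joining the pieces and replacing the marker with a space recovers the whitespace-normalized text. The name SPACE_MARK is introduced only for this illustration and is not part of Camphr:

```python
# Sentencepiece marks whitespace with U+2581 ("▁").
# SPACE_MARK is a hypothetical name used only in this sketch.
SPACE_MARK = "\u2581"

# Pieces copied from the example output above; a real model may split differently.
pieces = ["▁I", "▁saw", "▁a", "▁girl", "▁with", "▁a", "▁",
          "te", "le", "s", "c", "o", "pe", "."]

# Concatenate the pieces, turn each marker into a space, and drop the leading space.
text = "".join(pieces).replace(SPACE_MARK, " ").strip()
print(text)  # I saw a girl with a telescope.
```

Note that the recovered text contains single spaces only, so it is not guaranteed to be identical to the original input string.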

You can easily get an alignment between doc._.spm_pieces_ and doc with pytokenizations:

>>> import tokenizations
>>> a2b, b2a = tokenizations.get_alignments(doc._.spm_pieces_, [token.text for token in doc])
>>> print(a2b)
[[0], [1], [2], [3], [4], [5], [], [6], [7], [8], [9], [10], [11], [12]]
>>> print(list(doc[1:4]))
[saw, a, girl]
>>> import itertools
>>> print([doc._.spm_pieces_[i] for i in itertools.chain.from_iterable(b2a[1:4])])
['▁saw', '▁a', '▁girl']
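
To make the idea of the alignment concrete, here is a simplified, standard-library-only sketch. It is not how pytokenizations is implemented: it assumes that the pieces, once the "▁" marker is removed, spell exactly the same characters as the tokens, which holds for the example above but not for arbitrary normalization. The function names align_pieces and invert are hypothetical and exist only in this sketch:

```python
SPACE_MARK = "\u2581"  # "▁", the Sentencepiece whitespace marker


def align_pieces(pieces, tokens):
    """Map each piece index to the token indices it shares characters with.

    Simplified illustration only: requires that the pieces (without the
    marker) and the tokens concatenate to the same string.
    """
    stripped = [p.replace(SPACE_MARK, "") for p in pieces]
    if "".join(stripped) != "".join(tokens):
        raise ValueError("pieces and tokens do not spell the same text")
    # For every character of the concatenated tokens, record its token index.
    token_of_char = [i for i, tok in enumerate(tokens) for _ in tok]
    a2b, pos = [], 0
    for text in stripped:
        a2b.append(sorted(set(token_of_char[pos:pos + len(text)])))
        pos += len(text)
    return a2b


def invert(a2b, n_tokens):
    """Build the reverse mapping (token index -> piece indices)."""
    b2a = [[] for _ in range(n_tokens)]
    for piece_idx, token_idxs in enumerate(a2b):
        for token_idx in token_idxs:
            b2a[token_idx].append(piece_idx)
    return b2a


pieces = ["▁I", "▁saw", "▁a", "▁girl", "▁with", "▁a", "▁",
          "te", "le", "s", "c", "o", "pe", "."]
tokens = ["I", "saw", "a", "girl", "with", "a",
          "te", "le", "s", "c", "o", "pe", "."]

a2b = align_pieces(pieces, tokens)
b2a = invert(a2b, len(tokens))

print(a2b[6])  # [] : the lone "▁" piece has no matching token
selected = [pieces[i] for idxs in b2a[1:4] for i in idxs]
print(selected)  # ['▁saw', '▁a', '▁girl']
```

For real use, prefer tokenizations.get_alignments, which also copes with inputs whose characters do not match exactly.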