Sentencepiece as a spacy.Language
Camphr supports Sentencepiece as a spacy.Language.
You can use Sentencepiece as you would use en or other languages.
Usage
Pass the path to your trained spiece.model file to spacy.blank, as follows:
>>> import spacy
>>> nlp = spacy.blank("sentencepiece", meta={"tokenizer": {"model_path": "/path/to/your/spiece.model"}})
Now you can use nlp as you normally would:
>>> doc = nlp("I saw a girl with a telescope.")
>>> print(list(doc))
[I, saw, a, girl, with, a, te, le, s, c, o, pe, .]
(The tokenization depends on your spiece.model, so your result may differ from the one above.)
The raw output of Sentencepiece is available via doc._.spm_pieces_:
>>> print(doc._.spm_pieces_)
["▁I", "▁saw", "▁a", "▁girl", "▁with", "▁a", "▁", "te", "le", "s", "c", "o", "pe", "."]
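Sentencepiece writes "▁" (U+2581) where the original text had a space, so the pieces can be turned back into plain text. The following is a minimal sketch that runs without Camphr. The helper name pieces_to_text is hypothetical, the piece list is copied by hand from the output above, and the leading "▁" is assumed to come from Sentencepiece's default behavior of prefixing the input with a space.

```python
# Hypothetical helper for illustration; it is not part of Camphr or Sentencepiece.
WHITESPACE_MARK = "\u2581"  # "▁", the marker Sentencepiece emits in place of a space


def pieces_to_text(pieces):
    """Join Sentencepiece pieces and turn the "▁" markers back into spaces."""
    joined = "".join(pieces)
    # The first piece usually starts with "▁", so strip the resulting leading space.
    return joined.replace(WHITESPACE_MARK, " ").strip()


# Piece list copied by hand from the example output above.
pieces = ["▁I", "▁saw", "▁a", "▁girl", "▁with", "▁a", "▁",
          "te", "le", "s", "c", "o", "pe", "."]
text = pieces_to_text(pieces)
print(text)
```

With a real doc, the same helper could be called as pieces_to_text(doc._.spm_pieces_).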
You can easily get an alignment between doc._.spm_pieces_ and doc with pytokenizations:
>>> import tokenizations
>>> a2b, b2a = tokenizations.get_alignments(doc._.spm_pieces_, [token.text for token in doc])
>>> print(a2b)
[[0], [1], [2], [3], [4], [5], [], [6], [7], [8], [9], [10], [11], [12]]
>>> print(list(doc[1:4]))
[saw, a, girl]
>>> import itertools
>>> print([doc._.spm_pieces_[i] for i in itertools.chain.from_iterable(b2a[1:4])])
["▁saw", "▁a", "▁girl"]
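To illustrate what such an alignment represents, here is a rough sketch that aligns pieces and tokens by comparing character offsets. It is a simplified stand-in, not how pytokenizations is implemented, and it assumes that the pieces and the tokens spell exactly the same text once the "▁" markers are removed. The function names and the toy input are hypothetical.

```python
# Simplified offset-based alignment for illustration only.
# pytokenizations handles far more cases (e.g. text that differs between the two sides).
WHITESPACE_MARK = "\u2581"  # "▁"


def char_spans(items):
    """Return the (start, end) character offsets of each item in their concatenation."""
    spans, pos = [], 0
    for item in items:
        spans.append((pos, pos + len(item)))
        pos += len(item)
    return spans


def overlaps(span_a, span_b):
    """True if two non-empty spans share at least one character."""
    (a_start, a_end), (b_start, b_end) = span_a, span_b
    return a_start < a_end and b_start < b_end and a_start < b_end and b_start < a_end


def align_by_offsets(pieces, tokens):
    """Return (a2b, b2a): piece-to-token and token-to-piece index lists."""
    stripped = [piece.replace(WHITESPACE_MARK, "") for piece in pieces]
    if "".join(stripped) != "".join(tokens):
        raise ValueError("pieces and tokens do not spell the same text")
    piece_spans = char_spans(stripped)
    token_spans = char_spans(tokens)
    a2b = [[j for j, t in enumerate(token_spans) if overlaps(p, t)] for p in piece_spans]
    b2a = [[i for i, p in enumerate(piece_spans) if overlaps(p, t)] for t in token_spans]
    return a2b, b2a


# Toy input: the lone "▁" piece has no matching token, so it maps to an empty list.
pieces = ["▁a", "▁", "te", "le"]
tokens = ["a", "te", "le"]
a2b, b2a = align_by_offsets(pieces, tokens)
print(a2b)
print(b2a)
```

A piece that consists only of "▁" contributes no characters, which is why it aligns to no token.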