

Camphr provides Transformers models as spaCy pipeline components. You can access the Transformers outputs through the spaCy interface and fine-tune the models for downstream tasks.

This section explains how to use Transformers models as text embedding layers. To fine-tune the models, see Fine tuning Transformers.


$ pip install camphr


Create transformers_tokenizer and transformers_model and add them to nlp:

>>> import spacy
>>> nlp = spacy.blank("en")
>>> config = {"trf_name_or_path": "bert-base-cased"}
>>> nlp.add_pipe(nlp.create_pipe("transformers_tokenizer", config=config))
>>> nlp.add_pipe(nlp.create_pipe("transformers_model", config=config))

You can also build this nlp more easily with camphr.load:

>>> import camphr
>>> nlp = camphr.load(
... """
... lang:
...     name: en
... pipeline:
...     transformers_model:
...         trf_name_or_path: xlnet-base-cased # Models other than BERT can also be used.
... """
... ) # pass any config that omegaconf can parse (YAML, JSON, dict, ...)
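Since the comment above says the config may be YAML, JSON, or a dict, it can help to see the same configuration in the other two forms. The following is a minimal sketch using only the standard library; the names `config_dict` and `config_json` are illustrative and not part of Camphr, and the sketch does not call Camphr itself.

```python
import json

# The same configuration as the YAML string above, written as a Python dict.
config_dict = {
    "lang": {"name": "en"},
    "pipeline": {
        "transformers_model": {"trf_name_or_path": "xlnet-base-cased"},
    },
}

# The same configuration again, written as a JSON string.
config_json = """
{
    "lang": {"name": "en"},
    "pipeline": {
        "transformers_model": {"trf_name_or_path": "xlnet-base-cased"}
    }
}
"""

# Both forms describe the same nested structure.
same_structure = json.loads(config_json) == config_dict
print(same_structure)
```

Assuming camphr.load accepts these forms as the comment states, either object could be passed in place of the YAML string.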

Transformers computes the vector representation of an input text:

>>> doc = nlp("BERT converts text to vector")
>>> doc.tensor
tensor([[-0.5427, -0.9614, -0.4943,  ...,  2.2654,  0.5592,  0.4276],
    [ 0.2395,  0.5651, -0.0630,  ..., -0.5684,  0.3808,  0.2490]])

>>> doc[0].vector # token vector
array([-5.42725086e-01, -9.61372316e-01, -4.94263291e-01,  4.83379781e-01,
   -1.52603614e+00, -1.25056303e+00,  6.28554821e-01,  2.57751465e-01,
    3.44272882e-01, -3.19559097e-01, -6.80006146e-01,  1.15556490e+00,
    ... ]

>>> doc2 = nlp("Doc similarity can be computed based on doc.tensor")
>>> doc.similarity(doc2)

>>> doc[0].similarity(doc2[0]) # tokens similarity
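spaCy's default similarity is the cosine similarity of the two vectors. Assuming Camphr follows that convention for the vectors derived from doc.tensor, the computation can be sketched in plain Python as follows. The helper `cosine_similarity` and the toy vectors are illustrative stand-ins, not Camphr code.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for doc.vector or token.vector.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [-3.0, 0.0, 1.0]  # orthogonal to vec_a

sim_parallel = cosine_similarity(vec_a, vec_b)
sim_orthogonal = cosine_similarity(vec_a, vec_c)
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, regardless of their lengths.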

Use nlp.pipe to process multiple texts at once:

>>> texts = ["I am a cat.", "As yet I have no name.", "I have no idea where I was born."]
>>> docs = nlp.pipe(texts)

Use nlp.to to move the pipeline to a GPU for faster processing (CUDA is required):

>>> import torch
>>> nlp.to(torch.device("cuda"))
>>> docs = nlp.pipe(texts)

Load local models

You can also use models stored in local directories:

>>> nlp = camphr.load(
... """
... lang:
...     name: en
... pipeline:
...     transformers_model:
...         trf_name_or_path: /path/to/your/model/directory
... """
... )


