Transformers

Overview

Camphr provides Transformers models as spaCy pipelines. You can consume the Transformers outputs through the spaCy interface and fine-tune the models for downstream tasks.

In this section, we explain how to use Transformers models as text embedding layers. See Fine tuning Transformers for fine-tuning them on downstream tasks.

Installation

$ pip install camphr

Usage

Create the transformers_tokenizer and transformers_model pipes and add them to nlp:

>>> import spacy
>>> nlp = spacy.blank("en")
>>> config = {"trf_name_or_path": "bert-base-cased"}
>>> nlp.add_pipe(nlp.create_pipe("transformers_tokenizer", config=config))
>>> nlp.add_pipe(nlp.create_pipe("transformers_model", config=config))
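
You can check that both pipes are registered on the pipeline; with the factories above, the names should come out as follows:

>>> nlp.pipe_names
['transformers_tokenizer', 'transformers_model']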

You can also build this nlp object more easily with camphr.load:

>>> import camphr
>>> nlp = camphr.load(
...     """
...     lang:
...         name: en
...     pipeline:
...         transformers_model:
...             trf_name_or_path: xlnet-base-cased  # models other than BERT can be used
...     """
... )  # pass any config that omegaconf can parse (YAML, JSON, dict, ...)
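
Because the config is parsed with omegaconf, a plain dict with the same keys works as well (a minimal sketch mirroring the YAML above):

>>> nlp = camphr.load(
...     {
...         "lang": {"name": "en"},
...         "pipeline": {"transformers_model": {"trf_name_or_path": "xlnet-base-cased"}},
...     }
... )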

The Transformers model computes vector representations of the input text:

>>> doc = nlp("BERT converts text to vector")
>>> doc.tensor
tensor([[-0.5427, -0.9614, -0.4943,  ...,  2.2654,  0.5592,  0.4276],
        ...
        [ 0.2395,  0.5651, -0.0630,  ..., -0.5684,  0.3808,  0.2490]])
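
The tensor holds one row per spaCy token, as the per-token .vector access below suggests, so its shape should be (number of tokens, hidden size); both bert-base-cased and xlnet-base-cased use a hidden size of 768:

>>> doc.tensor.shape  # 5 tokens in "BERT converts text to vector"
torch.Size([5, 768])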

>>> doc[0].vector # token vector
array([-5.42725086e-01, -9.61372316e-01, -4.94263291e-01,  4.83379781e-01,
       -1.52603614e+00, -1.25056303e+00,  6.28554821e-01,  2.57751465e-01,
        3.44272882e-01, -3.19559097e-01, -6.80006146e-01,  1.15556490e+00,
       ...])

>>> doc2 = nlp("Doc similarity can be computed based on doc.tensor")
>>> doc.similarity(doc2)
0.8234463930130005

>>> doc[0].similarity(doc2[0]) # token similarity
0.4105265140533447
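
spaCy computes similarity as the cosine of the two vectors, so you can reproduce the value by hand (a minimal numpy sketch; small floating-point differences are possible):

>>> import numpy as np
>>> v1, v2 = doc.vector, doc2.vector
>>> v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))  # ~= doc.similarity(doc2)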

Use nlp.pipe to process multiple texts at once:

>>> texts = ["I am a cat.", "As yet I have no name.", "I have no idea where I was born."]
>>> docs = nlp.pipe(texts)
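
nlp.pipe is lazy and returns an iterator; spaCy's standard batch_size argument controls how many texts are processed per batch (2 here is arbitrary):

>>> docs = list(nlp.pipe(texts, batch_size=2))
>>> len(docs)
3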

Use nlp.to to run the pipeline on a GPU for faster processing (CUDA is required):

>>> import torch
>>> nlp.to(torch.device("cuda"))
>>> docs = nlp.pipe(texts)
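
To keep the same code runnable on machines without a GPU, you can fall back to the CPU with torch's availability check (a minimal sketch):

>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> nlp.to(device)
>>> docs = nlp.pipe(texts)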

Load local models

You can also use models stored in local directories:

>>> nlp = camphr.load(
...     """
...     lang:
...         name: en
...     pipeline:
...         transformers_model:
...             trf_name_or_path: /path/to/your/model/directory
...     """
... )
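
Assuming trf_name_or_path is forwarded to the transformers library's from_pretrained, a directory created with save_pretrained should work here (a sketch; the path is a placeholder):

>>> from transformers import AutoModel, AutoTokenizer
>>> AutoModel.from_pretrained("bert-base-cased").save_pretrained("/path/to/your/model/directory")
>>> AutoTokenizer.from_pretrained("bert-base-cased").save_pretrained("/path/to/your/model/directory")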

See also

Fine tuning Transformers: fine-tune the models for downstream tasks such as named entity recognition or text classification.