Fine-tuning Transformers

Overview

Camphr provides a command line interface to fine-tune Transformers’ pretrained models for downstream tasks, e.g. text classification and named entity recognition.

Text classification

You can fine-tune Transformers pretrained models for text classification tasks as follows:

$ camphr train model.task=textcat \
               train.data.path=./train.jsonl \
               model.labels=./label.json  \
               model.pretrained=bert-base-cased  \
               model.lang.name=en

Let’s look at the details.

1. Prepare training data

Two files are required for training: train.jsonl and label.json. As in spaCy, train.jsonl contains the training data in the JSON Lines (JSONL) format:

["Each line contains json array", {"cats": {"POSITIVE": 0.1, "NEGATIVE": 0.9}}]
["Each array contains text and gold label", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}]
 ...

label.json is a JSON file defining the classification labels. For example:

["POSITIVE", "NEGATIVE"]

2. Choose Transformers pretrained models

The following models can be used:

  1. Transformers official models

You can use the pretrained models listed on the Transformers website without downloading them manually. The models are downloaded and cached automatically at the start of training.

  2. Transformers models in a local directory

You can also use local models created with the transformers.PreTrainedModel.save_pretrained method by passing model.pretrained=/path/to/your-models on the CLI.
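
For example, you could save an official model to a local directory once and point model.pretrained at that directory afterwards. A minimal sketch using the standard transformers API (the directory name is arbitrary):

from transformers import AutoModel, AutoTokenizer

# Download once, then save both the model and the tokenizer locally
model = AutoModel.from_pretrained("bert-base-cased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model.save_pretrained("./local-bert-base-cased")
tokenizer.save_pretrained("./local-bert-base-cased")
# Train with: model.pretrained=./local-bert-base-cased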

3. Configure and Start fine-tuning

The following is the minimal configuration to fine-tune bert-base-cased with an English tokenizer.

$ camphr train model.task=textcat \
               train.data.path="./train.jsonl" \
               model.labels="./label.json" \
               model.pretrained=bert-base-cased  \
               model.lang.name=en

Of course, you can also use non-English languages by changing model.lang.name:

$ camphr train model.task=textcat \
               train.data.path="./train.jsonl" \
               model.labels="./label.json" \
               model.pretrained=bert-base-multilingual-cased  \
               model.lang.name=ja # Japanese

Note: If CUDA is available, it will be enabled automatically.

The trained models, logs and data are stored in “./outputs/YYYY-mm-dd/HH-MM-SS/”, thanks to Hydra’s functionality. See Hydra’s documentation to customize the directory location or logging.
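
If you script your experiments, you can locate the newest run directory programmatically. A small sketch assuming the default date/time layout shown above (standard library only):

from pathlib import Path

# ./outputs/YYYY-mm-dd/HH-MM-SS/ -- the lexicographically largest path is the latest run
runs = sorted(p for p in Path("outputs").glob("*/*") if p.is_dir())
latest_run = runs[-1]
print(latest_run)  # e.g. outputs/2020-01-30/19-31-23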

You can easily configure learning rate, batch size, and other hyperparameters. See Advanced Configuration for details.

4. Use fine-tuned models

Fine-tuned models are stored in “./outputs/DATE/TIME/models/${i}” directories. ${i} is the index of the training step (e.g. 0 for the first training loop). The model can be loaded as follows:

>>> import spacy
>>> nlp = spacy.load("./outputs/2020-01-30/19-31-23/models/0")

And you can use it as follows:

>>> doc = nlp("Hi, this is a fine-tuned model")
>>> texts = ["You can process multiple texts at once.", "Use nlp.pipe."]
>>> docs = nlp.pipe(texts)

>>> import torch
>>> nlp.to(torch.device("cuda"))
>>> docs = nlp.pipe(texts) # use gpu to process faster (CUDA is required)
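
To read the predicted scores, inspect doc.cats, assuming the text classification pipe follows spaCy’s convention of storing label scores there:

>>> doc = nlp("I really enjoyed this movie.")
>>> doc.cats  # e.g. {"POSITIVE": 0.98, "NEGATIVE": 0.02} (illustrative scores)
>>> best_label = max(doc.cats, key=doc.cats.get)  # label with the highest score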

To create a Python package, use the spacy package CLI:

$ mkdir packages
$ spacy package ./outputs/2020-01-30/19-31-23/models/0 ./packages

See also

spacy package: create a Python model package to distribute the models.

Multi-label text classification

Camphr also enables you to fine-tune Transformers pretrained models for multi-label text classification tasks:

$ camphr train model.task=multilabel_textcat \
               train.data.path=./train.jsonl \
               model.labels=./label.json  \
               model.pretrained=bert-base-cased  \
               model.lang.name=en

Let’s look at the details.

1. Prepare training data

Two files are required for training: train.jsonl and label.json. As in spaCy, train.jsonl contains the training data in the JSON Lines (JSONL) format:

["Each line contains json array", {"cats": {"A": 0.1, "B": 0.8, "C": 0.8}}]
["Each array contains text and gold label", {"cats": {"A": 0.1, "B": 0.9, "C": 0.8}}]
 ...

Because the task is multi-label, the scores for the labels do not have to sum to 1.

label.json is a JSON file defining the classification labels. For example:

["A", "B", "C"]

2. Choose Transformers pretrained models

The following models can be used:

  1. Transformers official models

You can use the pretrained models listed on the Transformers website without downloading them manually. The models are downloaded and cached automatically at the start of training.

  2. Transformers models in a local directory

You can also use local models created with the transformers.PreTrainedModel.save_pretrained method by passing model.pretrained=/path/to/your-models on the CLI.

3. Configure and Start fine-tuning

The following is the minimal configuration to fine-tune bert-base-cased with an English tokenizer.

$ camphr train model.task=multilabel_textcat \
               train.data.path="./train.jsonl" \
               model.labels="./label.json" \
               model.pretrained=bert-base-cased  \
               model.lang.name=en

Of course, you can also use non-English languages by changing model.lang.name:

$ camphr train model.task=multilabel_textcat \
               train.data.path="./train.jsonl" \
               model.labels="./label.json" \
               model.pretrained=bert-base-multilingual-cased  \
               model.lang.name=ja # Japanese

Note: If CUDA is available, it will be enabled automatically.

The trained models, logs and data are stored in “./outputs/YYYY-mm-dd/HH-MM-SS/”, thanks to Hydra’s functionality. See Hydra’s documentation to customize the directory location or logging.

You can easily configure learning rate, batch size, and other hyperparameters. See Advanced Configuration for details.

4. Use fine-tuned models

Fine-tuned models are stored in “./outputs/DATE/TIME/models/${i}” directories. ${i} is the index of the training step (e.g. 0 for the first training loop). The model can be loaded as follows:

>>> import spacy
>>> nlp = spacy.load("./outputs/2020-01-30/19-31-23/models/0")

And you can use it as follows:

>>> doc = nlp("Hi, this is a fine-tuned model")
>>> texts = ["You can process multiple texts at once.", "Use nlp.pipe."]
>>> docs = nlp.pipe(texts)

>>> import torch
>>> nlp.to(torch.device("cuda"))
>>> docs = nlp.pipe(texts) # use gpu to process faster (CUDA is required)
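
For multi-label predictions you usually keep every label whose score clears a threshold instead of taking a single argmax. A sketch, again assuming the scores are exposed via doc.cats:

>>> doc = nlp("This text may belong to several classes.")
>>> threshold = 0.5  # illustrative cut-off
>>> predicted = [label for label, score in doc.cats.items() if score >= threshold]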

To create a Python package, use the spacy package CLI:

$ mkdir packages
$ spacy package ./outputs/2020-01-30/19-31-23/models/0 ./packages

See also

spacy package: create a Python model package to distribute the models.

Named entity recognition

You can also fine-tune Transformers models for named entity recognition with Camphr’s CLI:

$ camphr train model.task=ner \
               train.data.path="./train.jsonl" \
               model.labels="./label.json" \
               model.pretrained=bert-base-cased  \
               model.lang.name=en

Let’s look at the details.

1. Prepare training data

Two files are required for training: train.jsonl and label.json. As in spaCy, train.jsonl contains the training data in the JSON Lines (JSONL) format:

["I live in Japan.", {"entities": [[10, 15, "LOCATION"]] }]
["Today is January 30th", {"entities": [[9, 21, "DATE"]] }]
 ...

“entities” is an array of arrays, each of which consists of [start_char_pos, end_char_pos, label_type].

label.json is a JSON file defining the entity labels. For example:

["DATE", "PERSON", "ORGANIZATION"]

2. Choose Transformers pretrained models

The following models can be used:

  1. Transformers official models

You can use the pretrained models listed on the Transformers website without downloading them manually. The models are downloaded and cached automatically at the start of training.

  2. Transformers models in a local directory

You can also use local models created with the transformers.PreTrainedModel.save_pretrained method by passing model.pretrained=/path/to/your-models on the CLI.

3. Configure and Start fine-tuning

The following is the minimal configuration to fine-tune bert-base-cased with an English tokenizer.

$ camphr train model.task=ner \
               train.data.path="./train.jsonl" \
               model.labels="./label.json" \
               model.pretrained=bert-base-cased  \
               model.lang.name=en

You can also use non-English languages by changing model.lang.name:

$ camphr train model.task=ner \
               train.data.path="./train.jsonl" \
               model.labels="./label.json" \
               model.pretrained=bert-base-multilingual-cased  \
               model.lang.name=ja # Japanese

Note: If CUDA is available, it will be enabled automatically.

You can easily configure learning rate, batch size, and other hyperparameters. See Advanced Configuration for details.

The trained models, logs and data are stored in “./outputs/YYYY-mm-dd/HH-MM-SS/”, thanks to Hydra’s functionality. See Hydra’s documentation to customize the directory location or logging.

4. Use fine-tuned models

Fine-tuned models are stored in “./outputs/DATE/TIME/models/${i}” directories. ${i} is the index of the training step (e.g. 0 for the first training loop). The model can be loaded as follows:

>>> import spacy
>>> nlp = spacy.load("./outputs/2020-01-30/19-31-23/models/0")

And you can use it as follows:

>>> doc = nlp("Hi, this is a fine-tuned model")
>>> texts = ["You can process multiple texts at once.", "Use nlp.pipe."]
>>> docs = nlp.pipe(texts)

>>> import torch
>>> nlp.to(torch.device("cuda"))
>>> docs = nlp.pipe(texts) # use gpu to process faster (CUDA is required)
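
Recognized entities should then be available on doc.ents, assuming the NER pipe follows spaCy’s usual convention:

>>> doc = nlp("I live in Japan.")
>>> [(ent.text, ent.label_) for ent in doc.ents]  # e.g. [("Japan", "LOCATION")] if the model predicts that span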

To create a Python package, use the spacy package CLI:

$ mkdir packages
$ spacy package ./outputs/2020-01-30/19-31-23/models/0 ./packages

See also

spacy package: create a Python model package to distribute the models.

Advanced Configuration

Camphr uses Hydra as its training configuration system, and the configuration can be customized following Hydra’s conventions.

First, let’s see a sample configuration:

$ camphr train example=ner --cfg job

model:
    lang:
        name: en
    ner_label: ~/irex.json
    pipeline: null
    pretrained: bert-base-cased
train:
    data:
        ndata: -1
        path: ~/train.jsonl
        val_size: 0.1
    nbatch: 16
    niter: 10
    optimizer:
        class: transformers.optimization.AdamW
        params:
            eps: 1.0e-08
            lr: 2.0e-05
    scheduler:
        class: transformers.optimization.get_linear_schedule_with_warmup
        params:
            num_training_steps: 7
            num_warmup_steps: 3

As you can see, the configuration is defined in YAML format.

You can override values in the loaded configuration from the command line. For example, to replace model.lang.name with ja, pass model.lang.name=ja on the CLI:

$ camphr train model.lang.name=ja

model:
    lang:
        name: ja
...

Pass a YAML file

The more items you wish to override, the more tedious it becomes to enter them all on the command line.

You can override the configuration with a YAML file instead of command line options. For example, prepare user.yaml as follows:

model:
    lang:
        name: ja
train:
    data:
        ndata: -1
        path: ~/train.jsonl
        val_size: 0.1
    nbatch: 128
    niter: 30
    optimizer:
        class: transformers.optimization.AdamW
        params:
            eps: 1.0e-05
            lr: 1.0e-03

Then pass the YAML file to the CLI as follows:

$ camphr train user_config=user.yaml

See also

Transformers: for using embedding vectors without fine-tuning.