Fine-tuning Transformers¶
Overview¶
Camphr provides a command line interface to fine-tune Transformers' pretrained models for downstream tasks, e.g. text classification and named entity recognition.
Text classification¶
You can fine-tune Transformers pretrained models for text classification tasks as follows:
$ camphr train model.task=textcat \
train.data.path=./train.jsonl \
model.labels=./label.json \
model.pretrained=bert-base-cased \
model.lang.name=en
Let’s look at the details.
1. Prepare training data¶
Two files are required for training: train.jsonl and label.json.
Like spaCy, train.jsonl contains the training data in the following format, known as JSONL:
["Each line contains json array", {"cats": {"POSITIVE": 0.1, "NEGATIVE": 0.9}}]
["Each array contains text and gold label", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}]
...
label.json is a JSON file defining the classification labels. For example:
["POSITIVE", "NEGATIVE"]
2. Choose Transformers pretrained models¶
The following models can be used:
Transformers official models
You can use pretrained models listed on the Transformers website without manually downloading them. The models will be automatically downloaded and cached at the start of the training.
Transformers models in local directory
You can also use local models created by the transformers.PreTrainedModel.save_pretrained method, by passing pretrained=/path/to/your-models on the CLI.
3. Configure and Start fine-tuning¶
The following is the minimal configuration to fine-tune bert-base-cased with an English tokenizer.
$ camphr train model.task=textcat \
train.data.path="./train.jsonl" \
model.labels="./label.json" \
model.pretrained=bert-base-cased \
model.lang.name=en
Of course, you can also use non-English languages, by changing model.lang.name:
$ camphr train model.task=textcat \
train.data.path="./train.jsonl" \
model.labels="./label.json" \
model.pretrained=bert-base-multilingual-cased \
model.lang.name=ja # Japanese
Note
If CUDA is available, it will be enabled automatically.
The trained models, logs and data are stored in “./outputs/YYYY-mm-dd/HH-MM-SS/”, thanks to Hydra’s functionality. See Hydra’s documentation to customize the directory location or logging.
You can easily configure learning rate, batch size, and other hyperparameters. See Advanced Configuration for details.
4. Use fine-tuned models¶
Fine-tuned models are stored in “./outputs/DATE/TIME/models/${i}” directories, where ${i} is the number of training steps (e.g. 0 for the first training loop).
The model can be loaded as follows:
>>> import spacy
>>> nlp = spacy.load("./outputs/2020-01-30/19-31-23/models/0")
And you can use it as follows:
>>> doc = nlp("Hi, this is a fine-tuned model")
>>> texts = ["You can process multiple texts at once.", "Use nlp.pipe."]
>>> docs = nlp.pipe(texts)
>>> import torch
>>> nlp.to(torch.device("cuda"))
>>> docs = nlp.pipe(texts) # use gpu to process faster (CUDA is required)
To create a Python package, use the spacy package CLI:
$ mkdir packages
$ spacy package ./outputs/2020-01-30/19-31-23/models/0 ./packages
See also
spacy.package : Create a Python model package to distribute the model.
Multilabel Text classification¶
Camphr enables you to fine-tune Transformers pretrained models for multi-label text classification tasks:
$ camphr train model.task=multilabel_textcat \
train.data.path=./train.jsonl \
model.labels=./label.json \
model.pretrained=bert-base-cased \
model.lang.name=en
Let’s look at the details.
1. Prepare training data¶
Two files are required for training: train.jsonl and label.json.
Like spaCy, train.jsonl contains the training data in the following format, known as JSONL:
["Each line contains json array", {"cats": {"A": 0.1, "B": 0.8, "C": 0.8}}]
["Each array contains text and gold label", {"cats": {"A": 0.1, "B": 0.9, "C": 0.8}}]
...
Because the task is multi-label, the scores do not have to sum to 1.
label.json is a JSON file defining the classification labels. For example:
["A", "B", "C"]
2. Choose Transformers pretrained models¶
The following models can be used:
Transformers official models
You can use pretrained models listed on the Transformers website without manually downloading them. The models will be automatically downloaded and cached at the start of the training.
Transformers models in local directory
You can also use local models created by the transformers.PreTrainedModel.save_pretrained method, by passing pretrained=/path/to/your-models on the CLI.
3. Configure and Start fine-tuning¶
The following is the minimal configuration to fine-tune bert-base-cased with an English tokenizer.
$ camphr train model.task=multilabel_textcat \
train.data.path="./train.jsonl" \
model.labels="./label.json" \
model.pretrained=bert-base-cased \
model.lang.name=en
Of course, you can also use non-English languages, by changing model.lang.name:
$ camphr train model.task=multilabel_textcat \
train.data.path="./train.jsonl" \
model.labels="./label.json" \
model.pretrained=bert-base-multilingual-cased \
model.lang.name=ja # Japanese
Note
If CUDA is available, it will be enabled automatically.
The trained models, logs and data are stored in “./outputs/YYYY-mm-dd/HH-MM-SS/”, thanks to Hydra’s functionality. See Hydra’s documentation to customize the directory location or logging.
You can easily configure learning rate, batch size, and other hyperparameters. See Advanced Configuration for details.
4. Use fine-tuned models¶
Fine-tuned models are stored in “./outputs/DATE/TIME/models/${i}” directories, where ${i} is the number of training steps (e.g. 0 for the first training loop).
The model can be loaded as follows:
>>> import spacy
>>> nlp = spacy.load("./outputs/2020-01-30/19-31-23/models/0")
And you can use it as follows:
>>> doc = nlp("Hi, this is a fine-tuned model")
>>> texts = ["You can process multiple texts at once.", "Use nlp.pipe."]
>>> docs = nlp.pipe(texts)
>>> import torch
>>> nlp.to(torch.device("cuda"))
>>> docs = nlp.pipe(texts) # use gpu to process faster (CUDA is required)
To create a Python package, use the spacy package CLI:
$ mkdir packages
$ spacy package ./outputs/2020-01-30/19-31-23/models/0 ./packages
See also
spacy.package : Create a Python model package to distribute the model.
Named entity recognition¶
You can also fine-tune Transformers models for named entity recognition with Camphr’s CLI:
$ camphr train model.task=ner \
train.data.path="./train.jsonl" \
model.labels="./label.json" \
model.pretrained=bert-base-cased \
model.lang.name=en
Let’s look at the details.
1. Prepare training data¶
Two files are required for training: train.jsonl and label.json.
Like spaCy, train.jsonl contains the training data in the following format, known as JSONL:
["I live in Japan.", {"entities": [[10, 15, "LOCATION"]] }]
["Today is January 30th", {"entities": [[9, 21, "DATE"]] }]
...
“entities” is an array of [start_char_pos, end_char_pos, label_type] triples.
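To make the offset convention concrete, here is a quick check (reusing the examples above) that start_char_pos and end_char_pos are character indices such that text[start:end] recovers the entity string:

```python
examples = [
    ("I live in Japan.", [[10, 15, "LOCATION"]]),
    ("Today is January 30th", [[9, 21, "DATE"]]),
]

for text, entities in examples:
    for start, end, label in entities:
        # The entity span is recovered by plain string slicing on character offsets.
        print(text[start:end], "->", label)  # e.g. "Japan -> LOCATION"
```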
label.json is a JSON file defining the entity labels. For example:
["DATE", "PERSON", "ORGANIZATION"]
2. Choose Transformers pretrained model¶
The following models can be used:
Transformers official models
You can use pretrained models listed on the Transformers website without manually downloading them. The models will be automatically downloaded and cached at the start of the training.
Transformers models in local directory
You can also use local models created by the transformers.PreTrainedModel.save_pretrained method, by passing pretrained=/path/to/your-models on the CLI.
3. Configure and Start fine-tuning¶
The following is the minimal configuration to fine-tune bert-base-cased with an English tokenizer.
$ camphr train model.task=ner \
train.data.path="./train.jsonl" \
model.labels="./label.json" \
model.pretrained=bert-base-cased \
model.lang.name=en
You can also use non-English languages, by changing model.lang.name:
$ camphr train model.task=ner \
train.data.path="./train.jsonl" \
model.labels="./label.json" \
model.pretrained=bert-base-multilingual-cased \
model.lang.name=ja # Japanese
Note
If CUDA is available, it will be enabled automatically.
You can easily configure learning rate, batch size, and other hyperparameters. See Advanced Configuration for details.
The trained models, logs and data are stored in “./outputs/YYYY-mm-dd/HH-MM-SS/”, thanks to Hydra’s functionality. See Hydra’s documentation to customize the directory location or logging.
4. Use fine-tuned models¶
Fine-tuned models are stored in “./outputs/DATE/TIME/models/${i}” directories, where ${i} is the number of training steps (e.g. 0 for the first training loop).
The model can be loaded as follows:
>>> import spacy
>>> nlp = spacy.load("./outputs/2020-01-30/19-31-23/models/0")
And you can use it as follows:
>>> doc = nlp("Hi, this is a fine-tuned model")
>>> texts = ["You can process multiple texts at once.", "Use nlp.pipe."]
>>> docs = nlp.pipe(texts)
>>> import torch
>>> nlp.to(torch.device("cuda"))
>>> docs = nlp.pipe(texts) # use gpu to process faster (CUDA is required)
To create a Python package, use the spacy package CLI:
$ mkdir packages
$ spacy package ./outputs/2020-01-30/19-31-23/models/0 ./packages
See also
spacy.package : Create a Python model package to distribute the model.
Advanced Configuration¶
Camphr uses Hydra as its training configuration system, and the configuration can be customized following Hydra's conventions.
First, let’s see a sample configuration:
$ camphr train example=ner --cfg job
model:
lang:
name: en
ner_label: ~/irex.json
pipeline: null
pretrained: bert-base-cased
train:
data:
ndata: -1
path: ~/train.jsonl
val_size: 0.1
nbatch: 16
niter: 10
optimizer:
class: transformers.optimization.AdamW
params:
eps: 1.0e-08
lr: 2.0e-05
scheduler:
class: transformers.optimization.get_linear_schedule_with_warmup
params:
num_training_steps: 7
num_warmup_steps: 3
As you can see, the configuration is defined in YAML format.
You can override values in the loaded configuration from the command line.
For example, to replace model.lang.name with ja, pass model.lang.name=ja on the CLI:
$ camphr train model.lang.name=ja
model:
lang:
name: ja
...
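Conceptually, a dotted override like model.lang.name=ja walks the nested configuration and replaces one leaf. The sketch below illustrates the idea with plain dictionaries; it is not Hydra's actual implementation:

```python
# A rough sketch of how a Hydra-style dotted override updates a nested config.
def apply_override(config: dict, override: str) -> dict:
    key, _, value = override.partition("=")
    parts = key.split(".")
    node = config
    # Walk down to the parent of the target leaf, creating nodes as needed.
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = value
    return config

config = {"model": {"lang": {"name": "en"}, "pretrained": "bert-base-cased"}}
apply_override(config, "model.lang.name=ja")
print(config["model"]["lang"]["name"])  # ja
```

Only the addressed leaf changes; all other values keep their defaults.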
Pass YAML¶
The more items you wish to override, the more tedious it becomes to enter them on the command line.
Instead of command-line options, you can override the configuration with a YAML file.
For example, prepare user.yaml as follows:
model:
lang:
name: ja
train:
data:
ndata: -1
path: ~/train.jsonl
val_size: 0.1
nbatch: 128
niter: 30
optimizer:
class: transformers.optimization.AdamW
params:
eps: 1.0e-05
lr: 1.0e-03
And pass the yaml to CLI as follows:
$ camphr train user_config=user.yaml
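The effect is a recursive merge of user.yaml into the base configuration: values present in the user file win, everything else keeps its default. A rough stdlib-only sketch of that behavior (not Hydra's actual merge logic):

```python
# Recursively merge an override dict into a base dict; override values win.
def deep_merge(base: dict, override: dict) -> dict:
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"model": {"lang": {"name": "en"}, "pretrained": "bert-base-cased"},
        "train": {"nbatch": 16, "niter": 10}}
user = {"model": {"lang": {"name": "ja"}},
        "train": {"nbatch": 128, "niter": 30}}
config = deep_merge(base, user)
```

Here model.pretrained keeps its default while model.lang.name and the train values come from the user file.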
See also
Transformers: For using embedding vectors without fine-tuning