BERT Classifier Running in Goole Colab (TPU & GPU)
Introduction
I recently took part of the Jigsaw Unintended Bias in Toxicity Classification Kaggle competition.
My final solution was a combined ensemble of multiple BERT and GPT2 models built using the popular PyTorch implementation of BERT. All training was done either using cloud P100s or on my local 1070 Ti GPU. I had some remaining free credits on Google Cloud so I did not mind, but, after the competition, it got me thinking on how I could have done the same training while reducing the overall compute cost and time (if possible). That is what I will explore in this post.
I will compare the performances of my local GPU, Colab using a GPU (a slow Tesla K80) and Colab using a TPU. What I am really interested about is this free TPU option. How fast is it? Should I have been using it during the competition to save time?
Additionnally, I will present a clean and minimal implementation of a classifier built on top of BERT. It will be based on the official Google repo, however, I will remove all the stuff not needed and clarifiy the implementation.
At the moment of this writing, TPUs are not available yet for TensorFlow 2.0, therefore, version 1.14.0 will be used.
Both implementations (GPU and TPU) are available here
Data
To get the data (csv format), navigate to the Kaggle competition site. The comment_text
and target
columns contain respectively comments collected from the web and the associated label indicating if it was judged toxic.
Model
The first step is to create the model_fn
function. The actual graph is defined in the build_model
function. It uses TensorFlow Hub to get a pretrained BERT model and define a fully connected layer on top of the BERT pooling layer.
def model_fn(features, labels, mode, params):
input_ids = features["input_ids"]
input_mask = features["input_mask"]
segment_ids = features["segment_ids"]
label_ids = features["label_ids"]
loss, train_op, eval_op, accuracy = build_model(
params['config'],
input_ids,
input_mask,
segment_ids,
label_ids
)
if mode == tf.estimator.ModeKeys.TRAIN:
spec = tf.estimator.EstimatorSpec(
mode=tf.estimator.ModeKeys.TRAIN,
loss=loss,
train_op=train_op
)
elif mode == tf.estimator.ModeKeys.EVAL:
spec = tf.estimator.EstimatorSpec(
mode=mode,
loss=loss,
eval_metric_ops=eval_op
)
return spec
An EstimatorSpec
is returned based on the current mode (eval, train).
Input
To manage inputs we use the input_fn
function. It returns a dataset
based on the running mode.
def input_fn(features, seq_length, batch_size, mode):
all_input_ids = []
all_input_mask = []
all_segment_ids = []
all_label_ids = []
for feature in features:
all_input_ids.append(feature.input_ids)
all_input_mask.append(feature.input_mask)
all_segment_ids.append(feature.segment_ids)
all_label_ids.append(feature.label_id)
num_examples = len(features)
dataset = tf.data.Dataset.from_tensor_slices({
"input_ids":
tf.constant(
all_input_ids, shape=[num_examples, seq_length],
dtype=tf.int32),
"input_mask":
tf.constant(
all_input_mask,
shape=[num_examples, seq_length],
dtype=tf.int32),
"segment_ids":
tf.constant(
all_segment_ids,
shape=[num_examples, seq_length],
dtype=tf.int32),
"label_ids":
tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),
})
if mode == tf.estimator.ModeKeys.TRAIN:
dataset = dataset.repeat()
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(batch_size=batch_size, drop_remainder=True)
elif mode == tf.estimator.ModeKeys.EVAL:
dataset = dataset.batch(batch_size=batch_size, drop_remainder=False)
return dataset
Estimator
Finally, we can create the Estimator
by passing in our model_fn
. Some running configuration are also provided.
run_config = tf.estimator.RunConfig(
log_step_count_steps=10,
save_summary_steps=10
)
classifier = tf.estimator.Estimator(
model_fn=model_fn,
config=run_config,
params={
'config': config},
model_dir='tmp'
)
Train
To start training, we simply call the train
providing the input_fn
.
%%time
classifier.train(
input_fn=lambda: input_fn(features, config.maxlen, config.bs, tf.estimator.ModeKeys.TRAIN),
max_steps=config.num_train_steps
)
Results
Below is a comparison of the run time for a GPU and TPU on Google Colab. The difference is huge with minimal changes required to the code between.
Env. | Wall time |
---|---|
GPU Colab | 6036 sec |
TPU Colab | 850 sec |