Boosting BERT Performances With Low Resources



At Citibeats, we often attempt to build new classifiers to gain insights from social media posts and provide the best context we can to our users around citizens' needs. For instance, we developed some tools to extract information from those posts: a request detector, gratitude detector, concern detector, human/bot or male/female detectors.

Nevertheless, one problem we regularly face when building our models is the lack of good-quality labeled data. Therefore, we tend to rely on alternative methodologies rather than classical supervised learning on a large, already labeled dataset. You can read more about some of these methodologies in “Developing Models With Low Resources.”

Now, we’d like to introduce one fast solution the Citibeats Data Science team implemented. It boosts the performance of some of our machine-learning classification models when they are trained with low resources. Some of the ideas are also tackled in Zhuang et al. (2021). In our case, boosting performance means improving the main metrics on a given test set: F1, recall and precision.
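
For reference, the three metrics can be computed from binary predictions as in the minimal sketch below. The function name and the toy labels are illustrative assumptions and are not taken from the Citibeats repository.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example (made-up labels): 1 = hatred speech, 0 = normal text.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Precision penalizes false alarms, recall penalizes missed hatred texts, and F1 is their harmonic mean.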

To highlight the methodology, we took a concrete example with an open source dataset and built a hatred speech detector from a dataset of hatred speech on Twitter during the #MeToo movement in the United States. We also provide a GitHub repository where you can test the code we implemented to train the models, and a Colab notebook to reproduce the experiment.

*DISCLAIMER*: The objective is not to beat the state-of-the-art solutions, but to provide one methodology that classifies texts efficiently with low resources.


*Warning*: The following image includes disturbing language.

Examples of hate speech and how Citibeats qualifies as hatred or not hatred

Figure 1: Examples of hatred speech or not


The problem

First, let’s surface the main problem we tackle here, which is recurrent for data scientists: You have a really good idea of a concept you would like to detect, but unfortunately, it isn’t exactly a problem the community has already tackled. It’s most likely that you won’t find any good quality datasets: labeled data that you would use to train your model from scratch in a supervised way.

For instance, detecting a particular type of hatred speech, like LGBTQ+ assault, would be complicated, not because you would lack hatred texts, but because you would struggle to find a dataset with texts explicitly tagged as LGBTQ+ assault. Similarly, you may want to detect whether a text covers a specific topic, such as transportation or health, without any existing labeled dataset.

To deal with this problem, you may want to annotate some texts to consolidate a training set that would enable you to build your machine-learning classifier. But annotation is time-consuming and may not be the best way to build a large, good-quality dataset.

However, you can annotate a few texts and leverage existing technology to get a first version that is acceptable! That’s what we attempted: use a BERT-like architecture with a few labeled texts and then try to boost its performance.

In the following sections, we detail one method to deal with low resources, leveraging existing Natural Language Processing (NLP) technologies. As mentioned previously, we take a real example with an open source dataset and provide code and a notebook so you can run your own experiments.

The data and repo

First, let’s focus on the dataset we used. We found an open source dataset on Kaggle that gathered tweets with hatred speech during the #MeToo movement in the U.S. Of course, this dataset is big and does not fit our low-resource context, so we adapted it to highlight the methodology we developed:

Let’s say we have a test dataset to assess the performance of the classifier we want to build and only a training dataset of 100 texts with 50 texts labeled as hatred speech and 50 texts labeled as normal texts. How can we build a solid classifier with only 100 texts and boost the classifier performance?
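
As an illustration, a balanced training set like this one could be drawn from a larger labeled corpus as sketched below. This is an assumption about the setup, not the procedure actually used to build the files in the repo; the function name and the placeholder corpus are hypothetical.

```python
import random


def balanced_sample(rows, n_per_class, seed=0):
    """Draw n_per_class rows for each label (0 and 1) from (text, label) pairs."""
    rng = random.Random(seed)
    sample = []
    for label in (0, 1):
        candidates = [row for row in rows if row[1] == label]
        sample.extend(rng.sample(candidates, n_per_class))
    rng.shuffle(sample)  # avoid having all texts of one class first
    return sample


# Placeholder corpus standing in for the full labeled dataset.
corpus = [(f"text {i}", i % 2) for i in range(1000)]
train_set = balanced_sample(corpus, n_per_class=50)
```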

One big benefit of social media data is that we can access a high volume of data, even if it’s unlabeled. We’ll try to leverage this asset in our investigation. So we’ll keep about 120,000 unlabelled tweets for later on.


Table 1: Datasets description

You can find the training set and the testing set in the data folder in the repo, so you can reproduce the experiment on your own or in the Colab notebook.

You will find three columns in those datasets: the original text, the pre-processed text and the category (1 for hatred speech, 0 for non-hatred speech). The 120,000 unlabelled tweets only come with the original text and its pre-processed version.

A word on the text pre-processing, whose code is available in the repo and can be applied directly from it: we only remove hashtags, mentions, URLs and duplicated tokens, and expand contractions into whole words. The idea is to keep the full context while removing some of the informal writing found in social media posts.
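
The sketch below shows what such a cleaning step could look like. It is not the code from the repo: the regular expressions, the lowercasing, the choice to drop whole hashtags, and the tiny contraction map are all assumptions made for illustration.

```python
import re

# Small illustrative subset; a real contraction map would be much longer.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "i'm": "i am",
    "it's": "it is",
}


def preprocess(text):
    """Remove URLs, mentions and hashtags, expand contractions, drop repeated tokens."""
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)  # mentions and hashtags
    tokens = []
    for token in text.split():
        token = CONTRACTIONS.get(token, token)
        # Skip a token that repeats the previous one ("so so sad" -> "so sad").
        if not tokens or tokens[-1] != token:
            tokens.append(token)
    return " ".join(tokens)


example = preprocess("@user I can't can't believe this!!! #MeToo https://t.co/abc")
# example == "i cannot believe this!!!"
```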

Bag of words…or fine-tuning a BERT model

The first thing we did was establish a baseline using classical machine-learning models. 

We created a pipeline with scikit-learn, a well-known open-source library that makes it easy to train machine-learning models. The pipeline consists of two steps. The first encodes the texts with a TF-IDF vectorizer using basic parameters: English stop words are removed, n-grams of length 1 to 3 are considered, and only tokens that appear at least twice in the training set are kept. The second step is the classifier trained on those vectors (see Figure 2).

Scikit-Learn pipeline with text, pre-processing, vectorizer and regression

Figure 2: Scikit-Learn pipeline
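
A minimal sketch of such a pipeline is given below. The vectorizer parameters follow the description above; the classifier is an assumption (the text does not name it, so logistic regression is used as one common choice), and the four training texts are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

baseline = Pipeline([
    # min_df=2 keeps terms found in at least two training documents.
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 3), min_df=2)),
    # Assumption: the classifier is not named in the text.
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tiny made-up training set; the real one has 100 pre-processed tweets.
texts = [
    "you are awful people",
    "awful people everywhere",
    "have a lovely day",
    "what a lovely day today",
]
labels = [1, 1, 0, 0]

baseline.fit(texts, labels)
# Probability of the positive class (hatred speech) for a new text.
probabilities = baseline.predict_proba(["such awful people"])[:, 1]
```

Keeping the vectorizer and the classifier in one `Pipeline` object means the same object can be fitted on the 100 training texts and then applied directly to raw test texts.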


This bag-of-words representation may not be the best, and some work on hyperparameters or feature engineering would likely give better results. However, it gives us a good idea of the magnitude of the results and the complexity of the task.

The second model we wanted to implement is the BERT pipeline. Thanks to the Hugging Face hub, it is quite easy to download pre-trained models and fine-tune them for our task. In our implementation, we only added a dense layer and another sigmoid layer on top of it. We also considered a dropout parameter at the end of each layer in order to limit overfitting as we are only training on 100 texts.

BERT pipeline with text, pre-processing, and dense and sigmoid layers

Figure 3: BERT pipeline
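
The head added on top of the pre-trained encoder could look like the PyTorch sketch below. The framework, the intermediate layer size, the activation and the dropout rate are assumptions; only the overall shape (dense layer, sigmoid output, dropout) comes from the description above. Random vectors stand in for the encoder output so that the sketch runs without downloading a model.

```python
import torch
from torch import nn


class ClassificationHead(nn.Module):
    """Dense layer plus sigmoid output, with dropout on the encoder output and after the dense layer."""

    def __init__(self, hidden_size=768, dense_size=128, dropout=0.2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size, dense_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dense_size, 1),
            nn.Sigmoid(),
        )

    def forward(self, sentence_embedding):
        # sentence_embedding: (batch, hidden_size), e.g. the encoder's first-token vector.
        return self.layers(sentence_embedding).squeeze(-1)


head = ClassificationHead()
head.eval()  # disable dropout for inference
with torch.no_grad():
    # Random vectors standing in for the encoder outputs of 4 texts.
    scores = head(torch.randn(4, 768))
```

With a real encoder, `hidden_size` would match its embedding size, and the head would be trained jointly with it using a binary cross-entropy loss.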


For other implementations of a BERT classification model, you can go directly to the dedicated page on Hugging Face, which offers many tutorials.

Also, it is important to mention that training and inference are not free of carbon footprint, and you should take it into account when training models or putting them into production. In Strubell et al. (2019), the authors provide some recommendations to reduce environmental costs (and therefore financial ones too!).

In our investigation, we opted for one of the most downloaded models: distilBERT uncased, which is a good way to speed up the investigation and reduce inference time, limiting your carbon footprint. This model was developed by Sanh et al. (2019); it retains 97% of the performance of the original BERT while being 60% faster.

We also used another model called XtremeDistilTransformers, which is smaller still, has better results than distilBERT in some benchmarks, and is more than 5 times faster than the original BERT. This model was developed at Microsoft (Mukherjee et al., 2021). Thank you, Jo Kristian Bergum, for the inspiration!

If you want to compare effectiveness on your own, you can use the Colab notebook we provide and test other models. We trained distilBERT for only 15 epochs and XtremeDistilTransformers for 100 epochs to reach more or less the same loss. Of course, these hyperparameters deserve a more careful study, but that was not the main objective of this experiment. We used a batch size of 16 and a learning rate of 2e-5 for each model.

The results on the hatred speech test set for this few-shot learning setup are really interesting:


Charting Random, Bag of Words, distilBERT and XtremeDistil for precision, recall, time to train and inference time metrics

Figure 4: First results


First, we see that all the models perform better than a random classifier. The precision is more or less the same for all three trained models. The real difference lies in the recall, with better results for the BERT-like models. This is consistent with the fact that BERT models capture context and not only tokens, unlike the scikit-learn pipeline. It is also in line with Brown et al. (2020), who show that pre-trained language models are good few-shot learners: here, BERT-like models are too!

That said, the bag of words is still a pretty interesting model to try before going to deep learning, at least to get a good baseline. There’s a chance some feature engineering would enable it to perform better and maybe outperform the BERT models.

The second observation concerns XtremeDistil and distilBERT. On this task, distilBERT outperforms XtremeDistil by more than a point. However, the training and inference steps, measured when applying each pipeline to 10k texts, are around 5 times faster for XtremeDistil. So we may consider using XtremeDistilTransformers instead of distilBERT, even though the latter scores higher.

Model ensembling with BERT

The first results were only meant to give an idea of the magnitude of the results. We then trained the models several more times to get a better estimate of the metrics. Indeed, each time we launch a new training, some weights are initialized randomly, and gradient descent is also a stochastic process, which can lead to different results from one training to another.

Thus, the results presented above are actually a mean of 15 different fine-tunings. We add the standard deviation in the following chart:


F1 Charting distilBERT and XtremeDistil for precision, recall, time to train and inference time metrics

Figure 5: First results - with standard deviation
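
Summarizing repeated fine-tunings by their mean and standard deviation takes only a few lines, as in this sketch. The scores are made-up values, not the article's results.

```python
from statistics import mean, stdev

# Made-up F1 scores from repeated fine-tunings of the same model.
f1_per_run = [0.71, 0.69, 0.74, 0.70, 0.72]

f1_mean = mean(f1_per_run)
f1_std = stdev(f1_per_run)  # sample standard deviation (n - 1 in the denominator)
print(f"F1 = {f1_mean:.3f} +/- {f1_std:.3f}")
```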


We see that the standard deviation is much higher, around 3 times higher, for distilBERT than for XtremeDistilTransformers. The results look more stable for XtremeDistilTransformers.

We should study the exact reasons for this difference in variance, but the embedding size and the model size, both much lower for XtremeDistilTransformers (embedding size: 384 vs 768; number of parameters: 22M vs 66M), are probably among the main factors: the higher the number of parameters, the higher the variance, as argued in Bendersky et al. (2020).

Those variances imply that some BERT models, called BERT experts from now on, may perform better than others depending on the texts we want to process. Thus, we had the idea to check whether combining all the BERT experts could improve the results. The simplest idea we had was to compute the final prediction as the mean of the predictions of all the BERT experts.


BERT pipeline with text, pre-processing, and dense and sigmoid layers for hatred

Figure 6: Ensembling BERT pipeline
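
Averaging the experts can be sketched as follows. The function name, the 0.5 decision threshold and the example probabilities are assumptions made for illustration.

```python
def ensemble_predict(expert_probabilities, threshold=0.5):
    """Average per-text probabilities over experts, then threshold the mean."""
    n_experts = len(expert_probabilities)
    mean_probabilities = [
        sum(column) / n_experts for column in zip(*expert_probabilities)
    ]
    labels = [int(p >= threshold) for p in mean_probabilities]
    return mean_probabilities, labels


# Made-up outputs of 3 experts on 4 texts (one row per expert).
expert_probabilities = [
    [0.9, 0.4, 0.2, 0.6],
    [0.8, 0.6, 0.1, 0.3],
    [0.7, 0.8, 0.3, 0.3],
]
mean_probabilities, labels = ensemble_predict(expert_probabilities)
```

On the second and fourth texts the experts disagree, and the mean settles the decision: this is where averaging can correct the errors of an individual expert.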


Each expert is trained independently from the others and then we consolidate the results by a mean. The following chart sums up the results:


distilBERT and XtremeDistil data points for precision and recall

Figure 6: Mixing up BERT experts


We see that this first rough model-ensembling, just by taking the mean of all experts, enables us to outperform the results we had at first by 3 points for the distilBERT and by almost 1 point for the XtremeDistilTransformers. 

To sum up, on Colab, we can train 15 different distilBERT models in less than 30 minutes, then combine the results of those 15 experts to boost the performance of the initial model.

Nevertheless, while the idea looks easy, it comes with a trade-off: the more experts, the slower the inference and, above all, the higher the energy consumption and carbon footprint.

The main objective now is to find a way to get results similar to those of the combined experts with a model the size of the initial one. In other words, we want to compress the ensemble into a smaller model.

This is exactly what we can achieve with a distillation process.

Model distillation

The distillation process was introduced in Buciluǎ et al. (2006), and the idea resurfaced when studies began to compress neural networks, as in Hinton et al. (2015). The main objective is to use the outputs or predictions of a big model (the teacher) to train a smaller model (the student), so that the student learns to behave like its teacher.

This method has had several successes, especially for distilling task-specific knowledge, as in Tang et al. (2019). We decided to apply this methodology to distill the knowledge of the combined experts.


Random texts running through dense and sigmoid layers for hatred

Figure 7: Distillation process


Luckily, we already had 120,000 unlabelled texts. As we already mentioned, the cost of getting text data from social media is pretty low. As a result, it’s easy to gather data and compute the predictions of a big model on it.

We applied the experts to those 120,000 unlabelled texts and computed the mean prediction for each text. Those mean predictions are now the targets our student model (the same architecture as in Figure 3) has to learn. The main change from the initial training is the loss function: instead of a binary cross-entropy, we now use a mean squared error with a temperature factor that smooths the predictions to learn. We also changed the batch size to 128 and the number of epochs to 5 to limit the training time.
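
The sketch below shows one way such a loss could be written. How exactly the temperature enters the loss is not specified here, so this version makes an assumption: each teacher probability is converted to a logit, divided by the temperature, and mapped back to a probability before the mean squared error is computed. The function names and the temperature value are hypothetical.

```python
import math


def soften(probability, temperature=2.0, eps=1e-6):
    """Smooth a teacher probability by dividing its logit by the temperature."""
    p = min(max(probability, eps), 1.0 - eps)  # keep the logit finite
    logit = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def distillation_mse(student_probabilities, teacher_probabilities, temperature=2.0):
    """Mean squared error between student outputs and softened teacher targets."""
    targets = [soften(p, temperature) for p in teacher_probabilities]
    squared_errors = [(s - t) ** 2 for s, t in zip(student_probabilities, targets)]
    return sum(squared_errors) / len(squared_errors)


# Made-up mean expert predictions for three unlabelled texts.
teacher = [0.95, 0.50, 0.10]
soft_targets = [soften(p) for p in teacher]
# A student that reproduces the softened targets exactly has zero loss.
loss = distillation_mse(student_probabilities=soft_targets, teacher_probabilities=teacher)
```

With a temperature above 1, confident teacher predictions are pulled toward 0.5, so the student is not pushed to reproduce extreme scores.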


distilBERT, XtremeDistil results for precision, recall

Figure 8: Distillation Results


The results are really interesting for two reasons. First, they are similar to those we get when we combine the experts: the precision is only 0.1 to 0.2 points below the one obtained by averaging all the individually trained models.

Second, we improved the recall by around 0.5 to 1 point, which makes the student outperform the experts.

But how can we explain that the student outperformed its teachers, when we would expect its metrics to be below theirs?

One solid hypothesis is the following: the experts were trained with only 100 texts, whereas the student was trained on the 120,000 texts labeled by the teacher. Most likely, those texts acted like an augmented dataset that improved the recall, letting us cover more social posts than before with the same precision.

We could also relate those results to the PET algorithm developed in Schick et al. (2021), where iterative PET uses different language models on unlabelled data and gets better results. This method explores the space of possibilities more.


Conclusion and Next Steps

In this experiment, we saw several results. First, we saw how to implement a BERT classifier and fine-tune it with models that need fewer resources than the original BERT. These models also achieve decent results in few-shot learning contexts.

We also highlighted that the variance of the BERT implementation seems to depend on the number of parameters: the higher this number, the higher the variance. In fact, we can leverage that variance by combining several BERT models into an ensemble model that outperforms the original architecture. It appears that the higher the variance, the larger the gap between the ensemble model and the original architecture.

Finally, in a social media context, as we have low-cost unlabelled data at our disposal, we can distill the predictions of this ensemble into a smaller model. This model, the same size as distilBERT or XtremeDistilTransformers, produces even better recall than the ensemble, thanks to the dataset augmentation provided by the few-shot BERT experts.

Several improvements and investigations may arise from this work. First, we could study the optimal number of annotated and unlabelled texts to use with this method to reach supervised learning results, a bit like in Le Scao et al. (2021). We can also work on the ideal number of experts for this strategy, as we arbitrarily chose 15 models, or on how to smooth the predictions in the loss function when distilling the model. Finally, the two most important questions are how well this method adapts to multiple languages and how to reduce the size of the models to lower our carbon footprint, as we need to train not 1 but 16 models.

These are the next steps that we will share in our next blog posts. Feel free to take a look at the repo and the Colab notebook to try it out.



[1] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - Devlin et al. - 2019

[2] Developing Models with Low Resources - Gombert - 2021

[3] Ensemble Distillation for BERT-Based Ranking - Zhuang et al. - 2021

[4] Precision and Recall

[5] Bag of words representation

[6] Transformers: State-of-the-Art Natural Language Processing - Wolf et al. - 2020

[7] Energy and Policy Considerations for Deep Learning in NLP - Strubell et al. - 2019

[8] DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter - Sanh et al. - 2019

[9] XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation - Mukherjee et al. - 2021

[10] Language Models are Few-Shot Learners - Brown et al. - 2020

[11] RRF102: Meeting the TREC-COVID Challenge with a 100+ Runs Ensemble - Bendersky et al. - 2020

[12] Model Compression - Buciluǎ et al. - 2006

[13] Distilling the Knowledge in a Neural Network - Hinton et al. - 2015

[14] Distilling Task-Specific Knowledge from BERT into Simple Neural Networks - Tang et al. - 2019

[15] Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference - Schick et al. - 2021

[16] How Many Data Points is a Prompt Worth? - Le Scao et al. - 2021