Developing Models With Low Resources


Every day, our data science team tackles challenges to improve our product. To give you insight into the process, we will describe the development of our multilingual complaint detector for social media texts and explain how we solved implementation issues.

*DISCLAIMER* We tried to limit technical language to make this article approachable to a wider audience. 

The Algorithm

Various clients worldwide, including the Inter-American Development Bank and the WHO COVID observatory, use Citibeats' algorithm, which supports more than 50 languages with precision and recall higher than 75%.

But before the production phase, the data science team faced challenges that are well known in the machine learning community. For instance: how do you get a well-labeled dataset to train the detector in various languages?

 

Main Example

To highlight the challenge, let's look at a simpler problem: we will follow the workflow of creating a complaint detector for English-language tweets alone.

 

The Challenge

*DISCLAIMER* We have simplified the problem and will not reach the results of our production model. The objective is to give an overview of potential hard tasks when tackling an a priori simple data science problem.

Analyzing the literature around complaint detection in social media, we find that researchers generally formalize the problem as a classification task. Something like: “We would like to have a decision system that takes a text as an input and returns 1 if it carries a complaint. Otherwise, it returns 0.” So far, so good. It looks just like a typical machine learning problem.

Figure 1: Complaint detector formalized

If we take into consideration papers that address the problem, such as Preotiuc-Pietro et al. (2019), we see that they generally use a supervised algorithm that learns from labeled examples. Preotiuc-Pietro et al. used around 2,000 labeled examples to train their final classifier. Nevertheless, here lies the main issue: How do you get high-quality labeled data to train a classifier in a supervised way?

In this specific case, the authors freely released their own dataset to enable faster training of classifiers. But for most lower-resource languages, open-source datasets are scarce, which calls for new ways of acquiring labeled data.

 

Labeled Data Creation

There are several ways to get labeled data. Here are some methods:

  • Manually Annotate Your Own Data
    While this may be the most straightforward solution, it is not really scalable. At Citibeats, we maintain a lot of multilingual binary classifiers, so for a single classifier we would have to annotate around 3,000 texts in each of our supported languages. Clearly, this is not feasible.
  • Bootstrap From a Small Annotated Set
    You can use a small batch of annotated data to find similar, unlabeled data and thus grow the annotated dataset. For instance, with Snorkel, or by applying machine learning algorithms to collect high-confidence examples.
  • Use Pseudo-Labels
    On social media, you often find tags, such as hashtags on Twitter or Instagram. You can use certain hashtags to create pseudo-labeled data to feed an algorithm.
  • Automatically Create Your Own Dataset
    With the rise of very large models (billions of parameters) in natural language processing and their new text generation capabilities, it is worth trying to automate the construction of your dataset.

We have mentioned some methods, but there are many more. For example, using few-shot learning to adapt pre-trained models to a new task or data augmentation to increase the coverage and make the model focus on the words or phrases that carry the meaning of the complaint.

Another method worth mentioning is the prompting strategy. For instance, Schick describes a prompting approach that outperforms GPT-3 with 99.9% fewer parameters.

In a nutshell, the classification task is framed as a cloze question. It means that we ask an already-trained language model to process some text and fill in the blank: “This text is about ___.” The label word produced by the model is then mapped to the labels of your downstream classification problem.
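The cloze idea can be sketched in a few lines. The `fill_blank` function below is a hypothetical stand-in for a real pretrained language model call, and the prompt and label words are illustrative, not the ones used in production:

```python
# Cloze-style labeling sketch: a real implementation would query a pretrained
# masked language model; `fill_blank` is a hypothetical stand-in that returns
# the model's most likely word for the blank.
PROMPT = 'This text is about ___. Text: "{text}"'

# Verbalizer: map the label words the model may produce to class ids.
VERBALIZER = {"complaints": 1, "problems": 1, "praise": 0, "news": 0}

def fill_blank(prompt: str) -> str:
    # Placeholder for a masked-LM call; always answers "complaints" here.
    return "complaints"

def pseudo_label(text: str) -> int:
    word = fill_blank(PROMPT.format(text=text))
    # Default to "No Complaint" (0) when the word is not in the verbalizer.
    return VERBALIZER.get(word, 0)
```

The verbalizer (the word-to-label mapping) is the key design choice in this family of methods: the label words must be terms the language model is actually likely to produce.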

 

Experiments

*DISCLAIMER* For the sake of comparison, we used a pre-existing annotated dataset of nearly 1,500 texts, around 600 of which are complaints. This gave us a test set to estimate the performance of the model depending on the training dataset we feed it.

To compare results depending on the training set, we applied the same training pipeline to all the trials. 

We fitted a TF-IDF + Logistic Regression pipeline on all experimental datasets (without any cross-validation step, to keep things simple). We used an NLTK lemmatizer in the TF-IDF tokenizer, considered only words appearing at least twice in the training dataset, and used n-grams of up to four tokens.
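A minimal scikit-learn version of this training pipeline looks like the sketch below. The NLTK lemmatization step is omitted to keep the sketch dependency-free, and the toy training texts are invented for illustration:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# TF-IDF features (words seen in at least 2 documents, n-grams of up to
# 4 tokens) feeding a logistic regression, as in the experiments.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=2, ngram_range=(1, 4))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Toy training data: 1 = "Complaint", 0 = "No Complaint".
texts = [
    "the service was terrible and slow",
    "terrible support and very slow response",
    "slow delivery and terrible packaging",
    "i love this great product",
    "great product and great service",
    "love the support love the product",
]
labels = [1, 1, 1, 0, 0, 0]

pipeline.fit(texts, labels)
probs = pipeline.predict_proba(["terrible slow service"])
```

Swapping the training set while keeping this pipeline fixed is what makes the experiments below comparable.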

The first experiment was to manually annotate around 450 texts with labels "Complaint" and "No Complaint." We called the resulting dataset of this experiment v0.

The second experiment consisted of gathering a list of hashtags that would be associated with a complaint in a text. For instance, we considered the hashtags "#badbusiness" or "#neveragain" to be associated with "Complaints." We did the same for "No Complaints" with random hashtags, such as "#dogecoin" or "#NationalBurgerDay." We called the resulting experimental dataset HT.

Figure 2: Using hashtags as pseudo-labels
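The hashtag pseudo-labeling step can be sketched as follows. The hashtag lists reuse the examples above but are otherwise illustrative; in practice they would be curated per language and domain:

```python
# Sketch of hashtag-based pseudo-labeling for the HT dataset.
COMPLAINT_TAGS = {"#badbusiness", "#neveragain"}
NEUTRAL_TAGS = {"#dogecoin", "#nationalburgerday"}

def pseudo_label_from_hashtags(text: str):
    tags = {w.lower() for w in text.split() if w.startswith("#")}
    if tags & COMPLAINT_TAGS:
        return 1            # "Complaint"
    if tags & NEUTRAL_TAGS:
        return 0            # "No Complaint"
    return None             # no matching hashtag: leave unlabeled

def build_ht_dataset(tweets):
    labeled = [(t, pseudo_label_from_hashtags(t)) for t in tweets]
    # Keep only tweets we could pseudo-label, and strip the hashtags so the
    # classifier cannot simply memorize them instead of the complaint wording.
    return [
        (" ".join(w for w in t.split() if not w.startswith("#")), y)
        for t, y in labeled
        if y is not None
    ]
```

Stripping the hashtags from the final text is an important detail: otherwise the pseudo-label is trivially recoverable from the input.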

The third experiment was the one that generated the most enthusiasm. We tried the implementation from Schick et al. (2021). The idea was to give clear instructions, like “write a complaint about XXX,” to a big pre-trained model, such as GPT-3, to create a dataset of generated texts with labels.

Figure 3: Using DINO to generate reviews

To adapt the model to our own problem, we changed the instructions with two different approaches. First, we used the exact same implementation. Second, we adapted an open-source GPT-3 alternative trained on the Pile, a big open-source dataset. In the end, we got two datasets with texts labeled "Complaint" and "No Complaint." We called the two experiments DINO and DINO-GPT3.

We also considered a last experiment. Once we trained the whole pipeline on a dataset, we applied this pipeline to unlabeled data in order to find new “confident” data. Then, we mixed this new labeled data with the previous training set and retrained the whole pipeline. 

To define “confident” data, after applying the already-trained pipeline to unlabeled data, we looked at the probability of an unlabeled text carrying a complaint. We labeled all texts with a probability higher than 0.9 as "Complaint" and all texts with a probability lower than 0.1 as "No Complaint." We applied this procedure twice in a row, and we called this experiment BStrap.

Figure 4: Bootstrap procedure illustration
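The confident-data selection step can be sketched as a small helper. The function name and thresholds mirror the description above but are our own illustration, not the production code:

```python
def select_confident(pipeline, unlabeled_texts, hi=0.9, lo=0.1):
    """Pick high-confidence pseudo-labels from an already-trained pipeline.

    Texts whose predicted complaint probability exceeds `hi` become
    "Complaint" (1); below `lo`, "No Complaint" (0); the rest are discarded.
    """
    # Column 1 of predict_proba is the probability of the positive class.
    probs = pipeline.predict_proba(unlabeled_texts)[:, 1]
    texts, labels = [], []
    for text, p in zip(unlabeled_texts, probs):
        if p > hi:
            texts.append(text)
            labels.append(1)
        elif p < lo:
            texts.append(text)
            labels.append(0)
    return texts, labels
```

The bootstrap loop then concatenates these confident examples with the original training set and refits the same pipeline; repeating it twice gives the BStrap experiment.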

To establish a solid ground of comparison for our experiments, we took into consideration the performance of a random classifier (Bernoulli random variable with p=0.5) and trained the same pipeline with the open-source data provided by Preotiuc-Pietro et al. (2019).

Here is a synthesis of the datasets we used and the number of complaints after applying the dataset constructions:

Table 1: Datasets description

 

Results

Our results and conclusions are not universal and apply only to this case. To make them more general, we would need to push forward the analysis.

The Four Experiments

First, we considered each experiment, ignoring the bootstrap step:

Table 2: First results

We saw that all the models achieved an F1 higher than that of the random process (which was lower than 0.5 because the test dataset was not balanced). We also noticed that the dataset provided by Preotiuc-Pietro et al. (2019) reached the best results, with an F1 of 65%.

Second, the results highlighted that with only 450 labeled texts (v0), we reached a precision equivalent to that of the best model (trained on the whole Preotiuc-Pietro dataset). For this pipeline, then, the learning curve could be pretty flat from 450 to 2,000 labeled texts; the additional data seems to mainly improve coverage. The precision was higher with the manually labeled dataset than with the other methods.

Third, the results from the DINO implementations were pretty disappointing. While it was appealing to create our dataset from scratch, our results with this pipeline were poor, with a precision equivalent to that of the random classifier.

However, we did not generate a dataset as big as Schick's, and we will try to improve this pipeline in the future. Moreover, automating such a pipeline would be incredibly useful for a lot of binary classification problems.

Finally, the best outcomes came from the HT dataset. With this pseudo-labeled dataset, we got close to the best results, those achieved by the pipeline on the bigger manually labeled dataset. Even though the precision was much lower, the coverage was high, and the prospect of using this technique to improve results looked promising.

Adding a Bootstrap Layer

In a second analysis, we looked at how the models behaved when adding data labeled from the BStrap strategy. In other words, we fitted a first pipeline, collected "confident" data to increase the first version of the dataset, and trained the pipeline again on the new bootstrapped dataset.

In the chart below, we see the metric differences before and after applying the bootstrap methodology.

Table 3: Bootstrap effect results—metrics difference before and after applying the bootstrap strategy

There are a few things worth noting. First, the recall increased for two out of three datasets. Second, it decreased only for the HT dataset that had the best recall by a large margin before the bootstrap step. Finally, the precision went the other way around: the higher the gain in recall, the higher the loss in precision. 

So if we consider F1, the decrease in precision was compensated by the larger increase in recall. Thus, the results seemed better on this metric. The original dataset sizes may also have played a role in these changes.

This experiment seemed to confirm that some bootstrap process can improve F1. Nevertheless, the trade-off was clear: any gain in recall comes with a loss in precision. Whether to apply the bootstrapping process to your classifier depends on the goals of your product.

Add Pseudo Labels to Labeled Data

Finally, as the pseudo-labeled dataset (HT) looked promising, we tried to concatenate it with manually annotated data. We thought that a mix of both could result in a good trade-off between annotating data and using an automatic procedure to label text.

Below, you can see the results with the original manual datasets and when the pseudo-labels extracted from hashtags were added:

Table 4: Results by combining datasets

As you can see, we improved the results for the two datasets in F1 thanks to a higher recall (+20pts for v0 and +5pts for Preotiuc-Pietro dataset) while we kept a stable precision (less than 1pt decrease for both datasets). 

When adding the pseudo-labels, we noticed a clear difference in the size of the improvement between the two manually annotated datasets. This difference could depend on the initial volume of each dataset.

Finally, when looking at the F1 results, we noticed that the difference between v0 + HT and the Preotiuc-Pietro dataset + HT was lower than 1pt. They had more or less the same precision and differed mainly on recall. Yet the Preotiuc-Pietro dataset contains 10 times more annotated data than v0.

Compared to using the Preotiuc-Pietro dataset alone, the combination of a small batch of annotations and pseudo-labels beat the results of the classifier trained on the bigger manually annotated dataset.

 

Next Steps

In this blog post, we presented one big challenge we faced at Citibeats that's very common in the data science industry: how to develop a model with low resources. In other words, how to overcome the lack of high-quality labeled data to train a classifier.

As we’ve seen, part of the work at Citibeats is finding ways to solve this problem. For example, by using tricks like pseudo-labeling or unsupervised learning to grow a small batch of labeled data, or by implementing more advanced machine learning techniques like text generation to create a labeled dataset. In our case, these approaches were really helpful for training a complaint detector.

But this is just one of the many tasks we perform at Citibeats. Every day, we face new, inspiring challenges, such as adapting methods to work with more than 100 different languages or reducing the bias of the results.