Every day, our data science team tackles challenges to improve our product. To give you insight into the process, we will describe the development of our multilingual complaint detector for social media texts and explain how we solved implementation issues.
*DISCLAIMER* We tried to limit technical language to make this article approachable to a wider audience.
Various clients worldwide, including the Inter-American Development Bank and the WHO COVID observatory, use the Citibeats algorithm, which supports more than 50 languages with precision and recall higher than 75%.
But before the production phase, the data science team faced challenges that are well known in the machine learning community. For instance: how do you get a well-labeled dataset to train the detector in various languages?
To highlight the challenge, let’s look at an easier problem. We will follow the workflow of creating a complaint detector for English-language tweets alone.
*DISCLAIMER* We have simplified the problem and will not reach the results of our production model. The objective is to give an overview of potential hard tasks when tackling an a priori simple data science problem.
Analyzing the literature around complaint detection in social media, we find that researchers generally formalize the problem as a classification task. Something like: “We would like to have a decision system that takes a text as an input and returns 1 if it carries a complaint. Otherwise, it returns 0.” So far, so good. It looks just like a typical machine learning problem.
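In code, this formalization boils down to a function from raw text to a binary label. Here is a minimal sketch of that contract; the function name and example texts are purely illustrative, not our production interface:

```python
def detect_complaint(text: str) -> int:
    """Return 1 if the text carries a complaint, 0 otherwise.

    Placeholder for whatever trained model sits behind the decision system.
    """
    raise NotImplementedError

# Expected behavior of such a system:
# detect_complaint("My order arrived broken and nobody answers my emails.")  -> 1
# detect_complaint("Thanks for the quick delivery!")                         -> 0
```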
If we look at papers that address the problem, such as Preotiuc-Pietro et al. (2019), we see that they generally use a supervised algorithm that learns from labeled examples. Preotiuc-Pietro et al. used around 2,000 labeled examples to train their final classifier. And here lies the main issue: how do you get high-quality labeled data to train a classifier in a supervised way?
In this specific case, the authors freely released their own dataset to enable faster training of classifiers. But for lower-resource languages, open-source datasets are scarce, so new ways of acquiring labeled data are needed.
Labeled Data Creation
There are several ways to get labeled data. Here are some methods:
- Manually Annotate Your Own Data
While this may be the most straightforward solution, it is not really scalable. At Citibeats, we have many multilingual binary classifiers, so we would have to annotate around 3,000 texts in every supported language for a single classifier. Clearly, this is not feasible.
- Bootstrap From a Small Annotated Set
You can use a small batch of annotated data to find similar (“close”) data and thus grow the annotated dataset, for instance with Snorkel or by applying machine learning algorithms to harvest confident predictions (see the sketch after this list).
- Use Pseudo-Labels
On social media, you often find tags, such as hashtags on Twitter or Instagram. You can use certain hashtags to create pseudo-labeled data to feed an algorithm.
- Automatically Create Your Own Dataset
With the rise of very large models (billions of parameters) in natural language processing and their new text-generation capabilities, it is worth trying to automate the construction of your dataset.
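To make the bootstrapping idea above more concrete, here is a minimal weak-supervision sketch with Snorkel. It assumes a pandas DataFrame of unlabeled tweets with a `text` column, and the two heuristic labeling functions are invented for illustration; they are not the rules we use at Citibeats.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

COMPLAINT, NO_COMPLAINT, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_negative_keyword(x):
    # Toy heuristic: the word "worst" suggests a complaint.
    return COMPLAINT if "worst" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_gratitude(x):
    # Toy heuristic: expressions of thanks suggest no complaint.
    return NO_COMPLAINT if "thank" in x.text.lower() else ABSTAIN

df_unlabeled = pd.DataFrame(
    {"text": ["This is the worst customer service ever", "Thank you, great app!"]}
)

# Apply the labeling functions, then let the label model aggregate their votes
# into pseudo-labels that can seed a larger training set.
applier = PandasLFApplier(lfs=[lf_negative_keyword, lf_gratitude])
L_train = applier.apply(df=df_unlabeled)

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=200, seed=42)
pseudo_labels = label_model.predict(L=L_train)
```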
We have mentioned some methods, but there are many more. For example, using few-shot learning to adapt pre-trained models to a new task or data augmentation to increase the coverage and make the model focus on the words or phrases that carry the meaning of the complaint.
In a nutshell, the classification task is recast as a closed question: we ask an already-trained language model to process some text and fill in the blank, “this text is about ___.” The label provided by the model is then used for the downstream classification problem.
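One common way to implement that closed question is zero-shot classification with a pre-trained natural language inference model. The sketch below uses the Hugging Face `transformers` pipeline; the model choice and candidate labels are illustrative, not necessarily what runs in our production system.

```python
from transformers import pipeline

# An NLI model scores how well each candidate label completes the implicit
# question "this text is about ___".
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "My package never arrived and support keeps ignoring me.",
    candidate_labels=["complaint", "no complaint"],
)
print(result["labels"][0])  # highest-scoring label, e.g. "complaint"
```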
*DISCLAIMER* For the sake of comparison, we used a pre-existing annotated dataset of nearly 1,500 texts, around 600 of which are complaints. This provided a test set to estimate the performance of the model depending on the training dataset we fed it.
To compare results depending on the training set, we applied the same training pipeline to all the trials.
We fitted a TF-IDF + Logistic Regression pipeline to all experimental datasets (without any cross-validation step, to keep things simple). We used an NLTK lemmatizer in the TF-IDF tokenizer, kept only words appearing at least twice in the training dataset, and used n-grams up to length 4.
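Here is a sketch of that kind of pipeline, assuming scikit-learn and NLTK; hyperparameters other than the ones stated above (minimum document frequency of 2, n-grams up to 4) are illustrative defaults.

```python
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

nltk.download("punkt")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def lemma_tokenizer(text):
    # Tokenize, then lemmatize each token before TF-IDF weighting.
    return [lemmatizer.lemmatize(token) for token in word_tokenize(text.lower())]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        tokenizer=lemma_tokenizer,
        min_df=2,             # keep only words seen at least twice in the training set
        ngram_range=(1, 4),   # unigrams up to 4-grams
    )),
    ("clf", LogisticRegression(max_iter=1000)),
])

# pipeline.fit(train_texts, train_labels)
# predictions = pipeline.predict(test_texts)
```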
The first experiment was to manually annotate around 450 texts with the labels "Complaint" and "No Complaint." We called the resulting dataset v0.
The second experiment consisted of gathering a list of hashtags that would be associated with a complaint in a text. For instance, we considered the hashtags "#badbusiness" or "#neveragain" to be associated with "Complaint." We did the same for "No Complaint" with random hashtags, such as "#dogecoin" or "#NationalBurgerDay." We called the resulting experimental dataset HT.
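A minimal sketch of this hashtag-based pseudo-labeling, using the hashtags mentioned above; the matching logic and variable names are illustrative.

```python
COMPLAINT_HASHTAGS = {"#badbusiness", "#neveragain"}
NEUTRAL_HASHTAGS = {"#dogecoin", "#nationalburgerday"}  # compared in lowercase

def pseudo_label(tweet: str):
    """Assign a pseudo-label based on the hashtags a tweet contains, or None."""
    tags = {token.lower() for token in tweet.split() if token.startswith("#")}
    if tags & COMPLAINT_HASHTAGS:
        return "Complaint"
    if tags & NEUTRAL_HASHTAGS:
        return "No Complaint"
    return None  # tweet is left out of the HT dataset

# Keep only the tweets that received a pseudo-label.
tweets = ["This airline lost my bag again #neveragain", "To the moon #dogecoin"]
ht_dataset = []
for tweet in tweets:
    label = pseudo_label(tweet)
    if label is not None:
        ht_dataset.append((tweet, label))
```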
The third experiment was the one that generated the most enthusiasm. We tried the implementation from Schick et al. (2021). The idea was to give clear instructions, like “write a complaint about XXX,” to a big pre-trained model, such as GPT-3, to create a dataset of generated texts with labels.
To adapt the method to our own problem, we changed the instructions and tried two different approaches. First, we used the exact same implementation. Second, we swapped in an open-source GPT-3-style model trained on the Pile, a large open-source dataset. In the end, we got two datasets of texts labeled "Complaint" and "No Complaint." We called the two experiments DINO and DINO-GPT3.
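Here is a rough sketch of the instruction-based generation idea, not the exact DINO implementation: prompt an open-source GPT-style model trained on the Pile to write complaints and keep the generations as pseudo-labeled "Complaint" examples. The model name, prompt, and sampling parameters are illustrative.

```python
from transformers import pipeline

# GPT-Neo is an open-source GPT-3-style model trained on the Pile.
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

prompt = 'Write a complaint about a food delivery service: "'
outputs = generator(
    prompt,
    max_new_tokens=60,
    num_return_sequences=5,
    do_sample=True,
    temperature=0.9,
)

# Strip the instruction and keep each generation as a pseudo-labeled example.
generated_complaints = [
    (output["generated_text"][len(prompt):].split('"')[0], "Complaint")
    for output in outputs
]
```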
We also considered one last experiment. Once we had trained the whole pipeline on a dataset, we applied it to unlabeled data in order to find new “confident” data. Then, we mixed this newly labeled data with the previous training set and retrained the whole pipeline.
To define “confident” data, we looked at the probability, according to the already-trained pipeline, that an unlabeled text carries a complaint. We labeled all texts with a probability higher than 0.9 as "Complaint" and all texts with a probability lower than 0.1 as "No Complaint." We applied this procedure twice in a row, and we called this experiment BStrap.
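A minimal sketch of one BStrap round under the thresholds described above, assuming a scikit-learn-style pipeline with a `predict_proba` method; the names are illustrative.

```python
def bootstrap_round(pipeline, train_texts, train_labels, unlabeled_texts):
    """Pseudo-label the confident unlabeled texts, then retrain the pipeline."""
    proba = pipeline.predict_proba(unlabeled_texts)[:, 1]  # P("Complaint")

    confident = (proba >= 0.9) | (proba <= 0.1)
    new_texts = [t for t, keep in zip(unlabeled_texts, confident) if keep]
    new_labels = (proba[confident] >= 0.9).astype(int)     # 1 = Complaint, 0 = No Complaint

    train_texts = list(train_texts) + new_texts
    train_labels = list(train_labels) + list(new_labels)
    pipeline.fit(train_texts, train_labels)                # retrain on the enlarged set
    return pipeline, train_texts, train_labels

# BStrap: train once on the initial set, then apply the procedure twice in a row.
# pipeline.fit(train_texts, train_labels)
# for _ in range(2):
#     pipeline, train_texts, train_labels = bootstrap_round(
#         pipeline, train_texts, train_labels, unlabeled_texts
#     )
```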
To establish a solid baseline for our experiments, we considered the performance of a random classifier (a Bernoulli random variable with p=0.5) and also trained the same pipeline on the open-source data provided by Preotiuc-Pietro et al. (2019).
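For reference, the random baseline is trivial to write down; this sketch simply flips a fair coin per text.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def random_classifier(texts):
    """Bernoulli(p=0.5) baseline: predict "Complaint" (1) with probability 0.5."""
    return rng.binomial(n=1, p=0.5, size=len(texts))
```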
Here is a summary of the datasets we used and the number of complaints in each after applying the dataset-construction methods: