Citibeats collects people’s opinions from social media data and turns them into actionable insights. With AI tools, we have developed different models to structure vast amounts of data and provide context on users’ and citizens’ needs.
One important part of structuring these opinions is detecting requests. We define requests as any social posts that require an action from a person, institution, or, more generally, the recipient of the message. Identifying requests in social networks gives us valuable information on the main actions that people are asking for.
At Citibeats, we built a model to identify requests. In this post we describe the steps followed to build a request detector.
Figure 1: Identify whether a text is a request or not
We’ll outline two different model approaches for detecting requests. The first approach combines TF-IDF embeddings, an algorithm that transforms sentences into vectors, with a logistic regression. The second approach is based on the deep-learning BERT model.
Today, deep-learning models outperform classical machine learning models on many classification problems. For example, the BERT model became the new state of the art on 11 natural language processing tasks in 2018 (J. Devlin et al., 2018). However, deep-learning models are difficult to interpret and require more computational power than classical approaches, as explained in N. Thompson et al. (2020). The idea is to compare models and use the one that produces better results. To measure the performance of our models, we use the accuracy, F1, recall and precision metrics.
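As a reminder of what these four metrics measure, here is a minimal sketch in plain Python that treats "request" as the positive class. The function name and the toy labels are made up for illustration and are not taken from our pipeline.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example: 1 = request, 0 = non request (made-up labels).
y_true = [1, 0, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

On this toy data the accuracy is 0.75 while precision, recall and F1 are all 0.5, which is why accuracy alone can look flattering when non requests dominate the dataset.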
Both approaches use supervised learning. Supervised learning is a subclass of machine learning that uses a labeled dataset to teach the model to yield desired outputs. Building the request detector requires labeled data. Our model needs to learn from examples of requests and non requests to detect them. The classical approach for creating labeled datasets is manual annotation. Data labeling is crucial for the development of supervised machine learning models. Nevertheless, it is expensive and time-consuming to manually annotate large datasets. Therefore, we’ll also describe ways of building performant models with few resources.
*DISCLAIMER*: The objective is not to beat the state-of-the-art solutions, but to provide one methodology that, with low resources, efficiently classifies texts.
Detecting Requests: A Difficult Task
Detecting requests is not a trivial task. Many texts are ambiguous and difficult to classify even for a human. Therefore, we discarded all the texts expressing a request as an opinion, such as, “Ok now calculate the people infected and recovered. It's no where near 75%. That's why we need the vaccine dummy.”
Moreover, we discarded requests referring to a divinity. For instance, texts like, “I can't study anymore, I’m afraid I won’t pass my exam tomorrow😔 .Pls God, I need help,” are not considered requests.
Requests are easily confused with questions or queries because a written request often ends with a question mark. Interrogative texts that call only for an answer, and not for a clear action, are not considered requests. For example, “Would you bring your laptop tomorrow?” is not a request since it only requires an answer.
However, many interrogative texts that call only for an answer have a very similar structure to requests. For example, by replacing the word “would” with “could” in the previous example, we would end up with a request.
The request detector must recognize the difference between these two types of texts.
Figure 2: Examples of requests or non requests
Another challenge in classifying texts is the huge variability present in language. Let’s show an example to better illustrate the problem:
“Hey, can I use your phone for a sec?”
“Excuse me, can I use your phone?”
“I need your phone please”
The sentences above express the same intent in different ways. Thus, we need a model that can adapt to language variations without seeing an exhaustive list of examples.
Another challenge is that social network texts contain abbreviations such as “pls” (please), “sec” (second) or “thnx” (thank you). Additionally, many texts have grammar and spelling mistakes, which increases their complexity. Therefore, our request detector needs to adapt to these practices present in social networks.
Labeling Datasets with Low Resources
Now that the challenges regarding the request detector are clear, let’s set up the models 🚀.
First, we have to create a labeled dataset to train our models and assess their performance. We split it into three sets:
- The training set is used by our models to “learn.” Our models learn to detect requests thanks to the examples in the training dataset.
- The development set is used to improve the model and tune its hyperparameters. We can improve our models by analyzing their incorrect predictions on the development set. We also select the hyperparameters with the best results on the main metrics (F1, recall and precision) in the development set.
- The testing set serves to assess the performance of our final models. It is never used to train or tune the models in any way.
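The three-way split described above can be sketched as follows. The 70/15/15 proportions, the seed and the function name are assumptions for the example and are not the partition we actually used.

```python
import random


def split_dataset(examples, dev_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle once, then cut into train / development / test partitions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = round(len(shuffled) * dev_fraction)
    n_test = round(len(shuffled) * test_fraction)
    test = shuffled[:n_test]                 # held out until the very end
    dev = shuffled[n_test:n_test + n_dev]    # used for tuning and error analysis
    train = shuffled[n_test + n_dev:]        # used to fit the model
    return train, dev, test


# Made-up (text, label) pairs standing in for annotated posts.
examples = [(f"text {i}", i % 2) for i in range(100)]
train, dev, test = split_dataset(examples)
```

Fixing the seed keeps the partition reproducible, so the test set stays the same across experiments.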
We started constructing these datasets by annotating data manually. We collected a list of texts and tagged them as “requests” or “non requests.” The bigger our datasets are, the more examples our models have for learning and testing.
We manually annotated 1,837 texts collected from social networks. Table 1 shows the partition between datasets.
Table 1: Hand Labelled Datasets
The training set has few examples, but annotating more data is tedious and time-consuming. To increase the number of examples in the training set, we used data augmentation techniques, which expand a labeled dataset without manual annotation. Here are the main techniques that we used to increase our training set:
- Soft-Label Generation: Create patterns with high precision for requests and non requests. For example, the pattern “can you” + action verb + “me” in a sentence suggests that the sentence is a request (e.g., “Can you send me an e-mail with the details?”). Then, we collect unlabelled data and automatically label it with these patterns. Creating patterns requires some descriptive data analysis to find the most accurate ones. Once you have patterns, you can use Python's regular expression module (re) or rule-based matching from spaCy. An alternative is to use the pipeline from the Snorkel project described by Ratner et al. (2017).
Figure 3: Soft-label generation with spaCy.
- Text Transformation: We transform texts from our training set to create slightly different new texts. We can create new sentences with the same meaning by replacing some words with synonyms, deleting words like “please” or adding words that carry no information. For example, we can transform the request, “Send me the details by email,” into the request, “Send me the details by email please” or “Send the details by email.” By adding these new labeled texts to our training set, we expand its size. An alternative is to use the library nlpaug by Edward Ma (2019); it contains many functions that automatically transform your texts. For example, given the input text, “The quick brown fox jumps over the lazy dog,” one of the library's functions transforms it into, “even the quick brown fox usually jumps over the lazy dog.” The GitHub repo contains examples of the different functionalities.
Figure 4: Text transformation swapping synonyms and deleting stopwords
- Bootstrap Augmentation: This technique involves having a preliminary request detector trained on the annotated texts. Then you collect unlabelled data and apply the model to it. We automatically annotate the texts where the model is very confident in its prediction. The figure below illustrates the process.
Figure 5: Bootstrap Augmentation
- Back Translation: This method is related to text transformation and is very fast to implement. Back translation, explained in Edunov et al. (2018), requires a sentence translator. In this project we followed the steps described in this post written by Amit Chaudhary. We selected some texts from our training dataset and translated them into French using Google Translate. Then, we translated them back into English and used them as new texts.
Figure 6: Back Translation, image from Chaudhary's post.
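To make the soft-label idea from the list above concrete, here is a minimal sketch of the “can you” + action verb + “me” pattern using Python's re module. The verb list and the function name are hypothetical; real patterns would come out of the descriptive data analysis mentioned earlier.

```python
import re

# Hypothetical, hand-picked verb list; a real one would come from data analysis.
ACTION_VERBS = ["send", "give", "help", "show", "tell", "call"]

REQUEST_PATTERN = re.compile(
    r"\bcan you\s+(?:please\s+)?(?:" + "|".join(ACTION_VERBS) + r")\s+me\b",
    flags=re.IGNORECASE,
)


def soft_label(text):
    """Return 'request' when the high-precision pattern fires, else None."""
    if REQUEST_PATTERN.search(text):
        return "request"
    return None  # abstain: the pattern says nothing about this text


unlabelled = [
    "Can you send me an e-mail with the details?",
    "Would you bring your laptop tomorrow?",
]
# Keep only the texts for which a pattern produced a label.
labelled = [(t, soft_label(t)) for t in unlabelled if soft_label(t) is not None]
```

Returning None instead of "non request" when nothing matches keeps the pattern high-precision: texts it does not recognize are simply left unlabelled.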
Using these techniques, we expanded our training sets and achieved better results on the development set. The final training set has 4,009 sentences of which 2,027 are requests.
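The confidence filter at the heart of bootstrap augmentation can be sketched as below. The 0.95 threshold, the function name and the stand-in scores are assumptions for the example; predict_proba represents whatever preliminary detector is available.

```python
def pseudo_label(texts, predict_proba, threshold=0.95):
    """Keep only the texts the preliminary model is very confident about.

    predict_proba is a hypothetical callable returning P(request) for a text.
    """
    new_examples = []
    for text in texts:
        p_request = predict_proba(text)
        if p_request >= threshold:
            new_examples.append((text, 1))   # confident request
        elif p_request <= 1 - threshold:
            new_examples.append((text, 0))   # confident non request
        # otherwise the model is unsure and the text stays unlabelled
    return new_examples


# Stand-in for a trained preliminary detector, for illustration only.
fake_scores = {
    "Please fix the street lights": 0.98,
    "Lovely weather today": 0.02,
    "Would you bring your laptop tomorrow?": 0.60,
}
augmented = pseudo_label(fake_scores, fake_scores.get)
```

A stricter threshold adds fewer examples but lowers the risk of feeding the model its own mistakes.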
Machine Learning to Detect Requests
In this section, we describe the first approach for our model construction. At Citibeats, we built a multilingual request detector. However, in this post we consider an English detector to reduce the complexity.
Our first approach to building a request detector combines thorough data preprocessing with term frequency-inverse document frequency (TF-IDF) embeddings and a logistic regression.
Preprocessing techniques are commonly used in natural language processing. In this step, we clean our texts and remove as much noise as possible to make classification easier. We follow an approach similar to the one in H. Duong (2021). During this step, we get rid of the abbreviations and misspellings that are common in social networks.
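As an illustration of this kind of cleaning, here is a minimal sketch. The abbreviation map and the exact steps (lowercasing, removing URLs and mentions) are assumptions for the example and do not reproduce our full preprocessing.

```python
import re

# Hypothetical abbreviation map; a real list would be much longer.
ABBREVIATIONS = {"pls": "please", "sec": "second", "thnx": "thank you"}


def preprocess(text):
    """Lowercase, strip URLs and mentions, expand abbreviations, tidy spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"@\w+", " ", text)            # remove user mentions
    tokens = re.findall(r"[a-z0-9_']+", text)    # keep word-like tokens only
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in tokens]
    return " ".join(tokens)


cleaned = preprocess("Pls @bob can I use your phone for a sec? https://t.co/x")
```

Here cleaned is "please can i use your phone for a second". Note that this sketch also drops punctuation, including question marks, which may or may not be desirable for request detection.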
Next is the feature extraction step. A text classification model needs to transform text into something that a machine understands. To do so, we transform our texts into vectors of numbers.
One of the simplest approaches to vectorize a sentence is the bag-of-words technique introduced by Zellig S. Harris (1950). To construct bag-of-words vectors, we need a dataset with sentences and a vocabulary corpus where all the words from our dataset appear. For each sentence we create a vector of 1s and 0s of size equal to the number of words in the corpus depending on whether the words of the corpus appear in the sentence (assigned value 1) or not (assigned value 0). Figure 7 illustrates the technique for a dataset with only 2 documents.
Figure 7: Vectorization using Bag of Words
Therefore, two sentences with similar words will have a similar bag-of-words representation. Bag-of-words representation has many limitations. It does not account for the position of the words or the importance of the words. For example, the sentences, “I love winter and hate summer,” and, “I love summer and hate winter,” have the same representation for a bag of words on unigrams but a different meaning.
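The limitation is easy to reproduce. The sketch below builds binary bag-of-words vectors on unigrams for the two sentences above, using simple whitespace tokenization (an assumption for the example); the helper names are illustrative.

```python
def build_vocabulary(sentences):
    """Sorted list of the distinct lowercase words in the dataset."""
    return sorted({word for s in sentences for word in s.lower().split()})


def bag_of_words(sentence, vocabulary):
    """Binary vector: 1 if the vocabulary word occurs in the sentence, else 0."""
    words = set(sentence.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


sentences = ["I love winter and hate summer", "I love summer and hate winter"]
vocabulary = build_vocabulary(sentences)
vectors = [bag_of_words(s, vocabulary) for s in sentences]
```

Both sentences map to the same vector, [1, 1, 1, 1, 1, 1], even though their meanings are opposite.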
The TF-IDF approach is similar to bag-of-words but encodes the “importance” of the word for each document. In the bag-of-words approach, every word has the same score, 1 or 0. There is no distinction between words. TF-IDF assigns different scores to each word depending on the frequency of the word in a text and in the whole dataset.
The post written by Adem Akdogan (2021) explains this technique in detail. We use TF-IDF on unigrams and bigrams. Therefore, our vocabulary contains all the unigrams and bigrams.
This representation highlights the importance of each word. If a word has little relevance to the sentence, its score is smaller. On the other hand, if the word is important, its score is bigger. We fit the TF-IDF and a logistic regression to classify requests. See figure 8 below.
Figure 8: TF-IDF + logistic regression pipeline
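To show how the scores are computed, here is a minimal pure-Python sketch of TF-IDF on unigrams and bigrams. It uses one common textbook variant, tf × log(N / df); libraries usually add smoothing and normalization, so their numbers will differ. The function names and the three toy documents are made up.

```python
import math
from collections import Counter


def ngrams(text):
    """Unigrams and bigrams of a lowercase, whitespace-tokenized text."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf(documents):
    """One {term: score} dict per document, using tf * log(N / df)."""
    term_lists = [ngrams(doc) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    scores = []
    for terms in term_lists:
        counts = Counter(terms)
        scores.append({
            term: (count / len(terms)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = ["can you help me", "can you see me", "i will help"]
scores = tfidf(docs)
```

In the first document, the bigram "you help" appears in only one of the three documents and therefore scores higher than "can", which appears in two.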
Logistic regression is a well-known linear classifier. Chung (2020) describes this classifier in detail. Specific to our problem, the logistic regression assigns a weight, 𝛃, to each TF-IDF score of unigram and bigrams. The bigger the weight, the more likely the post is a request. This classifier is highly interpretable, and the parameters of the model give insights on requests.
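In other words, the model computes a weighted sum of the TF-IDF scores and passes it through a sigmoid. The sketch below illustrates this with made-up weights and features; the names are hypothetical and the numbers are not our trained coefficients.

```python
import math


def request_probability(features, weights, bias=0.0):
    """Sigmoid of the weighted sum of the TF-IDF features."""
    z = bias + sum(weights.get(term, 0.0) * value
                   for term, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative weights: positive values push towards "request".
weights = {"help": 2.5, "can you": 1.2, "i will": -1.0}
# Illustrative TF-IDF scores for one post.
features = {"help": 0.4, "can you": 0.3}

p_request = request_probability(features, weights)
```

Because each weight is attached to a readable unigram or bigram, inspecting the largest positive and negative weights shows which expressions the model associates with requests.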
To improve the results of the model, we performed some feature engineering:
- Error Analysis: We studied the false positives and false negatives of the model on the development set. We observed that many errors could be avoided with better preprocessing. For example, the unigram “help” had a large positive coefficient in the model for requests (𝛃=2.5), so a sentence with this unigram was likely to be tagged as a request. There were many false positives containing the trigram “I will help.” To solve these errors, we replaced the bigram “will help” with the unigram “help_future.”
- Fine-tuning Hyperparameters: We tried different initializations of the model (different hyperparameters) and selected the hyperparameters with the best metrics.
- Data Augmentation on Specific Examples: In the error analysis, we spotted some examples that were recurrent. To force our model to detect them, we created similar examples and added them to the training set.
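The “will help” replacement from the error analysis can be expressed as a single substitution. This is a minimal sketch with a hypothetical function name; it only covers this one bigram.

```python
import re


def mark_future_help(text):
    """Replace the bigram 'will help' with the single token 'help_future'."""
    return re.sub(r"\bwill help\b", "help_future", text, flags=re.IGNORECASE)


before = "I will help you move tomorrow"
after = mark_future_help(before)
```

After the substitution, the offer “I help_future you move tomorrow” no longer contains the unigram “help,” while a request such as “Can you help me?” is left untouched.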
Figure 9 shows the results of the model on the test dataset. The blue bars correspond to the results of the model before feature engineering and data augmentation. In contrast, the red bars correspond to the model after feature engineering and data augmentation.
Figure 9: Comparison of metric results between TF-IDF + logistic regression after feature engineering and data augmentation (red bars) and TF-IDF + logistic regression before feature engineering (blue bars).
The results of the model after feature engineering are much better: all the metrics improved. Accuracy increased by 10%, F1 by 13%, recall by 11% and precision by 13%. Every metric therefore increased by at least 10%, which is a big improvement. It highlights the importance of feature engineering in machine learning.
Thanks to feature engineering and data augmentation, we achieved an accuracy greater than 80%. However, F1, precision, and recall remain lower, between 60% and 65%. Our model shows good results detecting non requests but has difficulties detecting requests.
Our second approach uses a BERT (Bidirectional Encoder Representations from Transformers) model. BERT is built on the transformer architecture and has become a standard for many natural language processing tasks, as shown in Devlin et al. (2018).
Transformer models were introduced by Vaswani et al. (2017) and were a breakthrough in NLP, as explained in Wolf et al. (2020). The main reason for transformers’ success is their ability to capture the relation between words in a sentence, even if those words are far from each other.
In the previous part, we described TF-IDF representation. We saw that TF-IDF creates vectors that encode the following information:
- The words present in a sentence
- The importance of these words in the sentence and in the training set
However, these vectors do not give you any information about the position of words, the relation between words and the context of the words.
The self-attention mechanism of transformers can encode the relationship between words and provide context. Consider the sentence, “They are my best students.” The attention mechanism could capture that the relation between “they” and “students” is stronger than the one between “they” and “my.”
A main advantage of BERT is that it is pre-trained on more than 2,500 million words (Devlin et al., 2018) and is openly available through the Hugging Face model hub (Wolf et al., 2020). Once the model is loaded, we can fine-tune it on a specific dataset so that its parameters adapt to our problem.
Figure 10: Citibeats BERT-based request detector with a vector embedding of size 256.
For our project, we fine-tuned the twitter-roberta-base model on our training data. This model is a RoBERTa (Robustly Optimized BERT Pretraining Approach) model, described in Y. Liu et al. (2019). It has the same architecture as BERT but with some modifications to the pretraining phase.
Then, we added a linear layer and a softmax activation function on top of the BERT embeddings to output the probability that a text is a request. Figure 10 shows the pipeline we followed.
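The classification head itself is small. The sketch below shows, in plain Python with tiny made-up numbers, what a linear layer followed by a softmax computes on top of a sentence embedding. It is only an illustration of the arithmetic, not our implementation, which relies on a deep-learning framework and much larger vectors.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_head(embedding, weights, biases):
    """Linear layer (one row of weights per class) followed by a softmax."""
    logits = [sum(w * x for w, x in zip(row, embedding)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


# Tiny made-up numbers; a real sentence embedding has hundreds of dimensions.
embedding = [0.2, -0.1, 0.4]
weights = [[0.5, 0.1, -0.3],    # class 0: non request
           [-0.2, 0.3, 0.8]]    # class 1: request
biases = [0.0, 0.1]
probs = classification_head(embedding, weights, biases)
```

The two outputs sum to 1, and the second one is read as the probability that the text is a request.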
To select the optimal hyperparameters, we performed several initializations with different hyperparameters and chose the hyperparameters with best results on the development set. The set of hyperparameters selected are:
- Input size: 128
- Batch size: 16
- Epochs: 5
- Learning rate: 1e-5
The input size encodes the length of the embedding vectors from BERT. The other hyperparameters are specific to the training step of the model on our training set. Deep-learning models are trained using an optimization algorithm, such as stochastic gradient descent, that updates the parameters of the model to minimize a cost function between the predictions of the model and the true values of the training set. The batch size refers to the number of samples to work through before updating the parameters of the model. The number of epochs is the number of times that the optimization algorithm works through the entire training set. Finally, the learning rate is a weight that determines the step size of each parameter update.
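To see how batch size, epochs and learning rate interact, here is a minimal mini-batch gradient descent loop for a one-feature logistic model in plain Python. The toy data, the function names and the learning rate of 0.1 used in the example run are assumptions chosen so that the toy problem converges quickly; they are unrelated to our BERT training.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(data, w, b):
    """Mean binary cross-entropy of a one-feature logistic model."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def train(data, batch_size=16, epochs=5, learning_rate=1e-5, seed=0):
    """Mini-batch SGD; returns the parameters and the number of updates."""
    rng = random.Random(seed)
    w, b, updates = 0.0, 0.0, 0
    data = list(data)
    for _ in range(epochs):                      # one epoch = one full pass
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in batch) / len(batch)
            grad_b = sum(sigmoid(w * x + b) - y for x, y in batch) / len(batch)
            w -= learning_rate * grad_w          # step size set by the learning rate
            b -= learning_rate * grad_b
            updates += 1                         # one update per batch
    return w, b, updates


# Toy 1-D data: positive x means "request" (made up for the illustration).
toy = [(x / 10.0, 1 if x > 0 else 0) for x in range(-50, 51) if x != 0]
w, b, updates = train(toy, batch_size=16, epochs=5, learning_rate=0.1)
```

With 100 toy samples and a batch size of 16, each epoch performs 7 updates, so 5 epochs give 35 updates. By the same arithmetic, a training set of 4,009 sentences with a batch size of 16 would give 251 updates per epoch and 1,255 updates over 5 epochs.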
Figure 11: Comparison of metric results between BERT with data augmentation (green bars), TF-IDF + Logistic regression after feature engineering (red bars) and TF-IDF + Logistic regression before Feature Engineering (blue bars).
Figure 11 shows the results of our models. Both models perform much better than a random classifier. Our RoBERTa model outperforms the TF-IDF plus logistic regression model on all metrics. Recall increases by more than 10 percentage points, so it detects requests much better. The main disadvantage we observed in the TF-IDF pipeline with respect to the RoBERTa model is the difficulty in detecting requests with tokens that do not appear in the training set. These results suggest that, for our problem, the BERT model detects requests better than more established methods.
Conclusion and Next Steps
In this post, we described the steps to build a performant request detector despite the complexity of the task. We presented the main challenges of text classification and different request detector models.
We highlighted some techniques for data augmentation. These techniques enabled us to add more examples to our training dataset and increased the performance of our models.
Additionally, the feature engineering process significantly improved our model results.
Finally, we presented a BERT implementation and showed how to adapt a pre-trained BERT model to detect requests. This last model showed a considerable improvement with respect to the logistic regression.
Our final model is 1.04 GB in size, so we need a considerable amount of space to store it. Furthermore, inference on a CPU runs at approximately one prediction per second. Considering that at Citibeats we work with large amounts of data, a next step for this project would be to analyze and try to reduce the costs of our model following the steps described in Gombert (2021). Using knowledge distillation, introduced in Buciluǎ et al. (2006), we could reduce the size of our models while preserving most of their performance. The idea of distillation is to train a smaller model on the predictions of the original model, so that the smaller model learns from the bigger one. At Citibeats we are concerned about the carbon footprint of our models, and we would like to reduce it without losing significant efficiency.
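As a rough illustration of what “training on the predictions of the original model” means, the sketch below computes a distillation loss: the cross-entropy between the teacher's predicted distribution and the student's. The temperature parameter is a common addition in the distillation literature and is an assumption here, as are the function names and the logits.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with an optional temperature that softens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


# Made-up logits for one text: [non request, request].
teacher_logits = [2.0, -1.0]
close_loss = distillation_loss(teacher_logits, [1.8, -0.9])   # student agrees
far_loss = distillation_loss(teacher_logits, [-1.0, 2.0])     # student disagrees
```

The loss is lower when the student's prediction is close to the teacher's, so minimizing it over many unlabelled texts pushes the small model to imitate the large one.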