Using Machine Learning to Calibrate Online Opinion Bias

Every day at Citibeats, we strive to gain a deeper understanding of people's opinions with the intention of having their voices be heard — by leaders, decision-makers, or whoever is best positioned to address their needs.

We draw on many text sources, such as Twitter, forums and blogs. However, opinions on online platforms are not representative of the overall population. For instance, in 2020, women represented 26.2% of Twitter users in France, whereas Insee reported that women made up 51.7% of the total French population. Clearly, there is a discrepancy. So we have made it one of our main goals to correct for this bias in people's opinions and to calibrate the results before delivering them to our clients.

Here, we will explain why reporting results without calibration may lead to false interpretations and how we deal with this issue at Citibeats.

The Representativeness Issue in Opinions

The best way to illustrate the bias issue is with an example. The government of Skotoprigonievsk, an imaginary city-state with a population of 20 million and a 1:1 male-to-female ratio, asked us to report on people's opinions about a new policy proposal. To do so, we collected 10 million users' opinions on online platforms. The raw results can be seen below:

Illustration 1: Raw results on policy proposal

For the government, such results look great. However, none of the characteristics of the users whose opinions we collected were taken into account. With Citibeats' technology, it becomes possible to infer such characteristics with precision.

On platforms like Twitter, a significant percentage of posts come from NGOs, firms, famous figures or bots. Thus, as we pointed out above, the demographics may not be representative of reality. Below, we present the demographics of the collected opinions and how they change once they have been calibrated.

Illustration 2: Results before and after unskewing

In fact, 20% of the opinions collected came from firms or NGOs. Moreover, after discarding those institutions, 80% of the opinions came from men. The results therefore needed to be calibrated so that the opinions were not skewed and were representative of the Skotoprigonievsk population. Seeing the calibrated results rather than the raw ones would change the Skotoprigonievsk government's conclusions considerably.
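
The reweighting behind such a calibration can be sketched as follows. This is a minimal post-stratification example, not Citibeats' actual pipeline: the function name `poststratify`, the sample of 1,000 opinions and the support rates per gender are all hypothetical. Only the 80/20 split between men and women in the sample and the 1:1 ratio in the population come from the Skotoprigonievsk example.

```python
from collections import Counter


def poststratify(opinions, population_share):
    """Return the share of supporters after reweighting each group.

    opinions: list of (group, supports) pairs, institutions already discarded.
    population_share: dict mapping each group to its share of the population.
    """
    n = len(opinions)
    sample_counts = Counter(group for group, _ in opinions)
    # Weight = share in the population / share in the collected sample.
    weights = {
        group: population_share[group] / (count / n)
        for group, count in sample_counts.items()
    }
    weighted_yes = sum(weights[group] for group, supports in opinions if supports)
    weighted_total = sum(weights[group] for group, _ in opinions)
    return weighted_yes / weighted_total


# Hypothetical sample: 800 men (70% in favour) and 200 women (40% in favour).
opinions = (
    [("male", True)] * 560 + [("male", False)] * 240
    + [("female", True)] * 80 + [("female", False)] * 120
)
population_share = {"male": 0.5, "female": 0.5}  # 1:1 ratio from the example

raw_support = sum(supports for _, supports in opinions) / len(opinions)
calibrated_support = poststratify(opinions, population_share)
print(f"raw: {raw_support:.2f}, calibrated: {calibrated_support:.2f}")
```

With these made-up numbers, the over-represented group pulls the raw figure upwards, and the calibrated figure is noticeably lower once each gender counts for half of the population.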

This example highlights the challenge that every opinion survey faces: ensuring that a representative sample of the population is polled, in order to avoid misleading people and exacerbating common misconceptions about Internet data. 

At Citibeats, we worked on demographic segmentation, not only to assess users' gender but also their age and location, as well as on institution and bot detection, reducing the noise and highlighting people's opinions. This way, internet opinions can be collected in a representative, calibrated form, making them comparable to carefully prepared surveys, with the added advantage of being available in real time and at enormous scale.

We now present the steps we followed to identify demographics in order to calibrate our results. We focus only on our first attempt at gender segmentation (male/female/institution) on Twitter. We don't support non-binary gender yet, but work is in progress to become fully representative.

To tackle gender segmentation on Twitter, we used data science and machine learning algorithms. We did all development and model training from scratch, drawing inspiration from Wang et al., 2019 for our research.

The following two sections explain our methodology and are not necessary for overall comprehension; readers can go directly to the results.

Data Collection

First, we had to collect labeled data, in other words, user account information with an easily identified gender (or institution status). For our first attempt, we used a priori gender markers:

Illustration 3: Collect user and identify gender from scratch

We randomly collected users on Twitter with their name, username and biography. We used that data only to train our models, without storing it, to protect users' privacy. To extract the labels, we counted gender markers in names and descriptions. In the second example above, Esther is a female marker, and the emojis give one male and one female marker. With two female markers against one male marker, we classified this user as female.
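
The marker-counting rule can be sketched as below. This is an illustration under stated assumptions rather than our production code: the marker lists, the tokenizer, the tie-breaking rule (no label on a tie) and the function name `label_user` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical marker lists; a real list would be far larger and per language.
MARKERS = {
    "female": {"esther", "maria", "mother", "\U0001F467", "\U0001F469"},  # girl, woman
    "male": {"john", "carlos", "father", "\U0001F466", "\U0001F468"},     # boy, man
    "institution": {"official", "ngo", "company", "foundation"},
}

# A token is either a run of letters or a single emoji-range character.
TOKEN = re.compile(r"[^\W\d_]+|[\U0001F300-\U0001FAFF]")


def label_user(name, description):
    """Count markers per class and return the majority class, or None."""
    tokens = TOKEN.findall(f"{name} {description}".lower())
    counts = Counter()
    for token in tokens:
        for label, markers in MARKERS.items():
            if token in markers:
                counts[label] += 1
    if not counts:
        return None  # no marker found: the user stays unlabeled
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between classes: too ambiguous to label
    return ranked[0][0]


# The Esther example: one female name, one boy emoji, one girl emoji.
print(label_user("Esther", "\U0001F466\U0001F467"))
```

Leaving ambiguous users unlabeled trades dataset size for cleaner labels, which seems a reasonable choice when the labels are produced without human review.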

This method enabled us to collect a wide variety of examples: we succeeded in labeling more than 50k institution accounts and more than 100k users for both main genders, in four languages (English, French, Portuguese and Spanish).

We then tried several approaches:

  1. A rule-based model with a priori gender markers, often used to make human comparisons.
  2. A bag-of-words representation combined with a logistic regression.
  3. A deep learning model inspired by Wang et al., 2019.

Modeling the Genders' Probability

The final model takes the user name, screen name and description as inputs. The output is a probability distribution across genders. In the inputs, we mask all the gender markers that helped us determine the gender or organisation status of a user.
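
A minimal sketch of this masking step is shown below. The marker list, the `<mask>` token and the function name `mask_markers` are assumptions made for illustration; the post does not specify how the masking is implemented.

```python
import re

# Hypothetical markers: the same ones that were used to produce the labels.
GENDER_MARKERS = ["esther", "mother", "father", "\U0001F466", "\U0001F467"]
MASK_TOKEN = "<mask>"  # assumed placeholder token

# Word markers must match whole words; emoji markers can match anywhere.
_words = [re.escape(m) for m in GENDER_MARKERS if m.isalpha()]
_symbols = [re.escape(m) for m in GENDER_MARKERS if not m.isalpha()]
MARKER_PATTERN = re.compile(
    r"\b(?:" + "|".join(_words) + r")\b|" + "|".join(_symbols),
    re.IGNORECASE,
)


def mask_markers(text):
    """Replace every known gender marker in `text` with the mask token."""
    return MARKER_PATTERN.sub(MASK_TOKEN, text)


masked = mask_markers("Esther, mother of \U0001F466\U0001F467")
print(masked)
```

One plausible motivation for masking is that the model would otherwise only relearn the labeling rules instead of picking up other cues in the name and description.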


First, for each input, we train an independent deep learning model to predict the gender (right, in illustration 4). For instance, using only the name as input, we train a deep learning model (a bidirectional long short-term memory recurrent neural network, or LSTM RNN) to predict whether the name is more likely to belong to a man, a woman or an organization.

Illustration 4: Deep Learning model architecture.
Left: the global architecture. Right: the detailed architecture of the deep learning models

Second, we take the three trained models and discard the final softmax layer of each. Instead, we concatenate the last 'Concat' layer (right, in illustration 4) of the three models and add a new softmax layer on top of it (the new classifier).

We train this architecture in two steps (left, in illustration 4). During the first one, the warm-up step, we freeze all layers except the top one, the final softmax layer, and train the model. The second step consists of unfreezing all layers and training them together.

We used bidirectional LSTM RNNs to learn the best representation of each input. Concretely, learning the best representation of an input, such as the name, means extracting the information that best helps the classifier (the softmax layer) to output accurate gender probabilities.

Illustration 5: RNN architecture

The advantage of LSTMs is that they can learn long-distance relationships. For instance, at the first layer of the LSTM, each block receives a letter as input, along with a hidden state written by the previous recurrent block.

In fact, each block has learnt what to forget from the previous hidden state, and what to write based on the input and on what it has read from the previous hidden state. In other words, the 4th block has learnt what to forget from "arn" and what to write as a new hidden state from "arn" and the new input "a".
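
For reference, this forget-and-write behaviour corresponds to the gates of a standard LSTM cell. The equations below are the usual textbook formulation, not necessarily the exact variant used in the model above. Here x_t is the input at step t (an encoded letter), h_{t-1} the previous hidden state, c_{t-1} the previous memory cell, \sigma the logistic sigmoid and \odot element-wise multiplication; the W, U and b terms are learnt parameters.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(new hidden state)}
\end{aligned}
```

In the name example, f_4 controls how much of the memory built from "arn" is kept, while i_4 and \tilde{c}_4 control what is written from the new letter "a".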

The blog post "Written Memories: Understanding, Deriving and Extending the LSTM" is an excellent resource for understanding LSTMs.

Results of the Modeling

In this part, we present our models' results. We refer to the results of Wang et al. as the state of the art (SOA).

We inspect two tasks: Organization status identification and Male or Female identification.

We should mention the main differences between Wang et al.'s model and ours: they also trained their network to predict age; they used profile pictures, a feature we don’t want to use; and they trained on 32 languages, with a dataset 200 times bigger than ours.

Institution Identification

First, we look at institution detection. Our best model is not far from the model proposed by Wang et al. We say "not far" because we trained our models on only a little more than 100k users, whereas they used 24 million users, and also because we don’t use profile pictures.

Illustration 6: Institution identification Results

The deep learning model clearly outperforms the baseline, but the linear model reaches equivalent results. This suggests that we need to increase the size of our training dataset to realize the full potential of the deep learning architecture.

Nevertheless, our methodology is heading in the right direction: we reach close results with relatively little data, and we may match or beat the state of the art with more data.

Gender Identification

When we look at gender differentiation, we are almost at the level of the SOA. We also outperform our two other models: the linear one and the rule-based one.

Illustration 7: Gender Identification results

We get good results whatever the language: all languages have an F1 score over 90%, even English, where gender is more difficult to detect than in the Latin languages.

Illustration 8: Male Vs Female - All languages

We have evidence that our model learnt well how to differentiate women and men. First, we computed the empirical probabilities associated with some features, such as diminutive names (left, in illustration 9): Cris has a 75% probability of being associated with women, as it can also be used by men.

We also noticed an interesting fact: men use fewer emojis. And the probability of being a man when using a man's-head emoji is not 100% but 89%. Indeed, in the data collection section above, we saw in the Esther example that women may use emojis to indicate the sex of their children.
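
Such empirical probabilities are simple conditional frequencies. The sketch below shows one way to compute them; the function name `empirical_probabilities`, the feature names and the five toy users are hypothetical and were chosen only so that the arithmetic is easy to follow.

```python
from collections import Counter


def empirical_probabilities(users, feature):
    """Estimate P(gender | feature) as a relative frequency.

    users: list of (features, gender) pairs, where features is a set.
    """
    counts = Counter(gender for features, gender in users if feature in features)
    total = sum(counts.values())
    if total == 0:
        return {}  # feature never observed
    return {gender: n / total for gender, n in counts.items()}


# Toy data: four users named Cris, three of them classified as female.
users = [
    ({"cris"}, "female"),
    ({"cris"}, "female"),
    ({"cris", "man_emoji"}, "male"),
    ({"cris"}, "female"),
    ({"man_emoji"}, "male"),
]

cris = empirical_probabilities(users, "cris")
print(cris)
```

Here the gender attached to each user could be either a collected label or a model prediction; the post does not say which was used for illustration 9.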

Illustration 9: Some examples of features learnt

These results are also promising because our simple data collection method made some labeling mistakes, and when we look at the predictions on the training set, the deep learning algorithm corrects them.

Illustration 10: Example of the algorithm correction

Typically, such corrections can help us apply a bootstrapping methodology: classifying many new users to increase the size of our training dataset and, why not, beat the state of the art.

We applied the algorithm to datasets we collected in South America to compare our Twitter demographics estimates with Hootsuite's surveys.

We are close to Hootsuite's assessments for many countries; we highlight the results for four of them in the following illustration. This partly validates our approach to estimating Twitter demographics.
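
One common way to bootstrap a training set is to keep only the model's most confident predictions on new users as extra labels. The sketch below assumes that approach; the function name `select_pseudo_labels`, the 0.95 threshold and the toy probabilities are hypothetical, and `predict_proba` stands in for any trained model that returns one probability per class.

```python
def select_pseudo_labels(users, predict_proba, threshold=0.95):
    """Keep only the users whose most likely class reaches `threshold`."""
    selected = []
    for user in users:
        probabilities = predict_proba(user)  # dict: class -> probability
        label = max(probabilities, key=probabilities.get)
        if probabilities[label] >= threshold:
            selected.append((user, label))
    return selected


def toy_predict_proba(user):
    """Stand-in for a trained model, with made-up probabilities."""
    table = {
        "user_a": {"male": 0.02, "female": 0.97, "institution": 0.01},
        "user_b": {"male": 0.55, "female": 0.40, "institution": 0.05},
        "user_c": {"male": 0.01, "female": 0.01, "institution": 0.98},
    }
    return table[user]


new_examples = select_pseudo_labels(
    ["user_a", "user_b", "user_c"], toy_predict_proba
)
print(new_examples)
```

The uncertain user is left out, so the added examples stay relatively clean at the cost of covering fewer hard cases.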

Illustration 11: Comparison with Hootsuite surveys

Gender identification is one example of the features we develop at Citibeats to assess population demographics on online platforms. We also work on other user characteristics, such as age or location, to sharpen our demographic assessments.


In a nutshell, at Citibeats, we dedicate much effort to unskewing the results from the data we collect. We can thus be more representative of people's opinions by discarding noise from corporations, brands or NGOs, and by assessing the demographics of collected opinions to calibrate the final results.

One of Citibeats' main concerns is using AI for good, or Ethical AI. Therefore, our data collection and processing methodology takes special care of people’s privacy, making sure that the data is securely stored and deleted when it’s no longer needed. Also, all gender segmentation data is always shown in aggregated form, so the individual is protected from any targeting.