
Reducing Gender Bias in Language Models: Less Is More And Fair


Two of the defining traits of recent large language models (LLMs) are their ever-increasing size (measured in number of parameters) and the growing volume of their training data.

Illustration 1: Evolution of LLM size between 2018 and 2022 (source: Julien Simon, "Large Language Models: A New Moore's Law?")

Over the last five years, LLMs have been built from large volumes of internet text that are hard to filter and curate properly. The Common Crawl dataset, for instance, was collected over eight years of internet crawling and contains petabytes of data, most of which was used to pre-train GPT-3. One might expect such a dataset to include the viewpoints of everyone in the world, but internet participation is far from fairly distributed: most internet users are younger people from wealthy areas.

Some models, like GPT-2, were pre-trained on text scraped from outbound Reddit links. Statistics show that, as of 2016, 67% of Reddit users in the U.S. were men and 64% were between the ages of 18 and 29. Similarly, 2011 surveys of Wikipedia revealed that women constituted only 9% of its editors worldwide.

As a result, we run the risk of deploying models that exhibit stereotypical social associations and negative sentiment towards specific social groups. Kurita et al. (2019) demonstrate that BERT can exhibit human-like biases, expressing a strong preference for male pronouns in positive contexts related to careers, skills, and salaries. Using pre-trained BERT models to build classifiers intended for hiring, for instance, would further enforce and amplify sexist viewpoints within the hiring field.

At Citibeats, our mission is to lend an attentive ear to every voice. We, therefore, strive to detect and mitigate biases in our models. In this article, we walk you through Citibeats’ gender bias investigation. Bit by bit, we will detail the steps we took to quantify and mitigate bias in our intent classifiers.


Defining And Detecting Gender Bias

We collect people’s opinions on social media with the aim of building actionable insights that stakeholders and policymakers can use to make appropriate decisions. It is therefore essential to be inclusive and representative when collecting and processing these insights. Excluding any opinions from this process can lead to decisions that put some groups at a disadvantage and aggravate conflicts.

Some of the most important thoughts that people express on social media are questions and complaints. Complaints refer to posts where people express dissatisfaction towards a person, concept, service, or product. We explain how we developed a multilingual complaint detector here.

Expressions of doubt or requests for information, on the other hand, are what we refer to as questions or queries.

The subjects of our investigation are two classifiers that we use to detect complaints and questions in arbitrary social media documents. We consider a classifier biased once it exhibits different results for a specific group or sub-group. More precisely, our models exhibit gender bias if they show different results for men and women.

The American Psychological Association’s Dictionary defines Gender Bias as:

Any one of a variety of stereotypical beliefs about individuals on the basis of their sex, particularly as related to the differential treatment of females and males. These biases often are expressed linguistically, as in use of the phrase physicians and their wives (instead of physicians and their spouses, which avoids the implication that physicians must be male) or of the term he when people of both sexes are being discussed.

Throughout the investigation, we will refer to the model’s bias towards the author of the text simply as bias. Since our current production algorithm builds only on the existing men/women gender literature, we maintain the binary gender approach for the current analysis.

Meet The Models

The two models are binary classifiers that take any pre-processed social media text as input and return ‘1’ if the text carries a complaint/query, or ‘0’ if it doesn’t.

The table below contains examples of the kinds of texts the models classify:

  • Can you believe what this tiktoker did?
  • What are the Delta variant symptoms?
  • My flight was delayed again. I can’t believe this is happening to me!
  • My order arrived a bit late but I like the product.

Table 1: Examples of Complaints & Queries


For our two classifiers, we use the cased DistilBERT base multilingual model (Sanh et al., 2020). This architecture is a distilled version of the BERT base multilingual model.

From the model’s page on HuggingFace:

The model is trained on the concatenation of Wikipedia in 104 different languages listed here. The model has 6 layers, 768 dimension and 12 heads, totalizing 134M parameters (compared to 177M parameters for mBERT-base). On average DistilmBERT is twice as fast as mBERT-base.

The models are fine-tuned on texts in various languages. To assess their quality, we considered global metrics: Accuracy, Precision, Recall, and F1 Score. These metrics are easier to grasp if we think of the classifier as a COVID-19 test. Positive COVID-19 cases correctly detected by the test are True Positives (TPs), and negative cases correctly predicted are True Negatives (TNs). Negative cases falsely flagged by the test are False Positives (FPs), while positive cases missed by the test are False Negatives (FNs).

Having defined all of the possible outcomes of the test, we define the global evaluation metrics as follows:

  • Accuracy tells us how often we can expect the test to correctly predict the outcome out of the total number of times it was taken. It is calculated as (TPs + TNs) / (TPs + FPs + TNs + FNs).
  • Precision tells us what proportion of the cases flagged as positive are truly positive. We can think of it as a measure of the exactness of the COVID-19 test: if the test has 99% accuracy but only 50% precision, then half of the cases it flags as COVID-19 positive are actually false positives. It is calculated as TPs / (TPs + FPs).
  • Recall measures the test’s ability to capture positive COVID-19 cases out of the total number of positive cases that took the test. It is calculated as TPs / (TPs + FNs).
  • F1 Score is an alternative to accuracy that does not depend on the number of true negatives. It is the harmonic mean of precision and recall, calculated as 2 × Precision × Recall / (Precision + Recall).
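To make the definitions concrete, here is a minimal sketch of how the four metrics are computed; the confusion counts are made up for illustration, not taken from our models:

```python
def classification_metrics(tp, fp, tn, fn):
    """Global evaluation metrics computed from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical COVID-19 test results: 90 detected cases, 10 missed cases,
# 880 correct negatives, and 20 false alarms.
metrics = classification_metrics(tp=90, fp=20, tn=880, fn=10)
```

Note how the 20 false alarms barely dent accuracy (0.97) but pull precision down to about 0.82, which is exactly why accuracy alone can hide an imprecise classifier.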

The following table presents an overview of the models’ metrics at the beginning of the investigation:

Table 2: Initial Models’ Metrics


Even though these metrics tell us a lot about the models’ overall performance, they tell us little about that performance with respect to the speaker’s gender. We therefore introduce two new metrics, the False Negative Rate and the False Positive Rate, to measure the models’ error likelihood, and we compute them for each gender to detect any potential gender bias.

Continuing with the same COVID-19 test analogy:

  • The False Negative Rate, or the risk of missing a COVID-19 patient: in our case, the FNR measures the probability that we miss complaints or questions. It is calculated as FNs / (FNs + TPs).
  • The False Positive Rate, or the risk of raising a false COVID alarm: in our case, the FPR measures the probability that we mistake negative examples for complaints or questions. It is calculated as FPs / (FPs + TNs).
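The two rates, and the per-gender gaps we track, can be sketched as follows; the confusion counts here are invented for illustration, not taken from our test sets:

```python
def error_rates(tp, fp, tn, fn):
    """False positive rate and false negative rate from confusion counts."""
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    return fpr, fnr

# Hypothetical per-gender confusion counts for one classifier.
counts = {
    "men":   {"tp": 180, "fp": 12, "tn": 288, "fn": 20},
    "women": {"tp": 165, "fp": 15, "tn": 285, "fn": 35},
}

rates = {gender: error_rates(**c) for gender, c in counts.items()}
fpr_gap = abs(rates["men"][0] - rates["women"][0])
fnr_gap = abs(rates["men"][1] - rates["women"][1])
```

With these invented counts, the FNR gap (0.075) dwarfs the FPR gap (0.01): such a classifier would be missing women’s posts far more often than men’s, which is precisely the kind of disparity we want to shrink.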


The aim of this investigation is to achieve a fair classification where the model yields similar results for both men and women. When the fairness criteria are met, the following three conditions are verified: 

  1. Equivalent accuracies
  2. Equivalent FPRs
  3. Equivalent FNRs


In this case, we consider that we achieved Separation between the two groups. The separation condition guarantees that different groups don’t experience different costs of misclassification.

A higher FNR for women, however, would indicate that we miss complaints or questions written by women more often than those written by men. In that case, we risk building insights that over-represent men’s dissatisfactions and inquiries.

In practice, we cannot expect strictly equal values, but we always aim to reduce the disparity.

Now that we have introduced the models and the motivation behind our investigation, it’s time to delve into the details of the experiments.


Test Data

Our analysis requires two test datasets, one per gender, containing each text’s intent as well as its author’s gender. To build these datasets, we sampled COVID-19 tweets collected during 2021 from the United Kingdom. We then used doccano to annotate queries and complaints, and relied on Citibeats’ API to obtain gender information from our in-house gender detector.

Finally, to avoid an extreme class imbalance in complaints and queries, we applied the annotations separately for men and women.

The following tables describe the two test sets:

Table 3: Queries test set (Query vs. No Query counts by gender)

Table 4: Complaints test set (Complaint vs. No Complaint counts by gender)




We begin the analysis by assessing the level of gender disparity that our models may exhibit. The following table contains the detailed gender metrics yielded by the two classifiers.

Table 5: Gender Metrics (Accuracy, FP Rate, and FN Rate by gender)

In Table 5, we can see that the Queries classifier’s accuracy and FPRs are almost equal for men and women. However, there is a small difference in FNRs, with a higher rate for men. This means that, in rare cases, we may miss questions written by men more often than questions written by women. Nevertheless, as Twitter is skewed towards men, we judge the overall probability of missing men’s queries to be low. We therefore meet 2 of the 3 Separation conditions with the queries classifier: the bias is light, and it is in women’s favor.

With Complaints, on the other hand, none of the metrics are equal for men and women. As opposed to Queries, the FNR is significantly higher for women, meaning that we tend to miss complaints written by women more often than those written by men. FPRs are also slightly higher for women, so, in rare cases, we mistake negative examples for complaints more often for women than for men. Finally, there is a 6-to-7-point difference in accuracy in men’s favor. We can conclude that the bias in the Complaints classifier is clear, and it favors men.

Now that we have managed to assess the level of gender bias each model exhibits, let’s find out how we can mitigate it for Complaints.

Definition of Experts

The inspiration for this investigation comes from Suau et al. (2022) on Self-Conditioning Pre-Trained Language Models. The paper describes a generative mechanism that takes advantage of special neurons, called experts, found inside Transformer-based Language Models (TLMs). These units are called experts because they are the ones that make the best decisions about a given concept. The idea builds on the Product of Experts (PoE) formulation by Hinton (1999).

In Machine Learning, the PoE technique models the network’s probability distribution as a combination of the outputs of several simpler distributions, the experts.

By tweaking between 5 and 15 neurons, Suau et al. managed to reduce gender disparity in text generation. Following the same principle, we will attempt to identify and tweak expert neurons in our fine-tuned models and observe the effect on the classification.

Our classifiers’ final decision comes from the CLS token representation, a vector of 768 units on which we will focus for the rest of the analysis.


Illustration 2: The Classifiers’ CLS Token 


We locate experts by treating each unit as an individual classifier of a specific concept; in our case, the concept is either gender or complaint. Concretely, we use each unit’s output to compute its area under the Precision-Recall curve. The units with the highest areas are the ones capable of making the best decisions.

We chose this metric because it is the most robust against unbalanced datasets and rare binary events, as opposed to the F1 score, for instance.
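The ranking step can be sketched as follows: treat each unit’s activation as a score for the concept label and compute its average precision (the area under its Precision-Recall curve). The activation matrix below is a synthetic stand-in for what the fine-tuned model would produce on the labelled test set:

```python
import numpy as np

def average_precision(scores, labels):
    """Area under the precision-recall curve for one unit's activations."""
    order = np.argsort(-scores)          # highest activation first
    sorted_labels = labels[order]
    cum_tp = np.cumsum(sorted_labels)
    precision_at_k = cum_tp / np.arange(1, len(sorted_labels) + 1)
    return float((precision_at_k * sorted_labels).sum() / sorted_labels.sum())

def rank_experts(activations, labels):
    """Rank units (columns) by how well their activation predicts the concept."""
    ap = np.array([average_precision(activations[:, j], labels)
                   for j in range(activations.shape[1])])
    return np.argsort(-ap), ap

# Synthetic stand-in: unit 0 tracks the concept perfectly, unit 1 is anti-correlated.
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])
activations = np.stack([
    labels.astype(float),
    np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6]),
], axis=1)
ranking, ap = rank_experts(activations, labels)
```

In the real pipeline the activation matrix has 768 columns, one per CLS unit, and one row per labelled test document.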


We will now embark on the journey of reducing the gender disparity in our complaints classifier. The idea is to tweak experts by transforming the unit’s weight to obtain a minimal difference in metrics and, therefore, to meet the fairness criteria. 

1st Experiment - Deactivating Gender Experts

Our first idea was to identify units in the output layer that consider gender an important feature and set their weights to 0. We call these units gender experts. We experimented with shutting these units off while observing the effect on the model’s gender metrics.

To identify gender experts, we use the complaints classifier as a gender classifier: if a unit is able to distinguish the author’s gender, we consider it a biased expert. We then rank the units by the area under their Precision-Recall curves and verify that none of the complaint experts are also among the top gender experts.

Gradually deactivating more and more gender experts left the metrics exactly the same: the experiment had no apparent effect on complaint classification. This is probably because none of the high-ranking gender experts are also complaint experts.

2nd Experiment - Using Top Complaints Experts Only

In this experiment, we explored a different analysis path. We rank the units from the output layer based on their Complaints expertise. We then experiment with gradually reducing the number of Complaints experts in the output layer while shutting down the rest of the units.
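As a rough sketch of what shutting units down means here, assume the final decision is a linear function of the 768-dimensional CLS vector; the weights and expert ranking below are random stand-ins, not the production model’s:

```python
import numpy as np

def predict_with_top_k(cls_vector, weights, bias, expert_ranking, k):
    """Binary decision using only the top-k expert units; the rest are zeroed."""
    mask = np.zeros_like(cls_vector)
    mask[expert_ranking[:k]] = 1.0
    score = float(weights @ (cls_vector * mask) + bias)
    return int(score > 0)

# Random stand-ins for a 768-dim CLS representation and its decision weights.
rng = np.random.default_rng(0)
cls_vec = rng.normal(size=768)
w = rng.normal(size=768)
b = 0.0
ranking = np.argsort(-np.abs(w))   # hypothetical complaint-expert ranking

full_model = predict_with_top_k(cls_vec, w, b, ranking, k=768)   # no units masked
top_10 = predict_with_top_k(cls_vec, w, b, ranking, k=10)        # top 10 experts only
```

In the experiment we sweep k downwards and recompute the per-gender metrics at each step.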

The following plots show the evolution in metrics disparity by the number of used Complaints experts in the output layer.


Illustration 4: Evolution of the difference in metrics by the number of used complaint experts


We can see that removing the worst 200 or so complaint experts yields significantly better results for women, all the way down to using only the top unit.

The same removal yields significantly worse results for men; the men’s results closest to the original model are obtained using the top 3 units only.

We also see that, as the number of complaint experts decreases, FNRs decrease overall while FPRs increase overall for both genders. FPRs become nearly equal for men and women at around 20 complaint experts, and the FNR difference reaches its minimum at around three.

The plots also show the lowest equal-error-rate values near the lowest numbers of used units (3 and 1). By using between 10 and 20 of the top units, we obtain a more accurate and less biased classification than by using the entire model.

The table below details the model’s gender metrics achieved by using the top 10 Complaints experts only:

Table 6: Gender Results by Language - 10 Experts only (Accuracy, FP Rate, and FN Rate by gender and language)


The results of this experiment demonstrate that we can achieve a more accurate and less biased complaint classification, with a relatively smaller model, by using only the top 10 experts in the output layer. By the end of the investigation, we had reduced the gender disparity in English by 2.41 points in accuracy, 1.316 points in FPR, and 6.14 points in FNR. We suspect that the deactivated units responsible for the model’s bias are units that overfit on gender components.

We also observe that Spanish shows a lower gender disparity than English and French. This might be because Spanish is more gendered than English and French: it provides more gender features for the model to learn, through gendered words, different conjugations, and pronouns.

Conclusions & Next Steps

During this investigation, we managed to evaluate the level of bias in our two models and reduced it for the complaints classifier. We have introduced the principle of expert units for the gender and the complaint concepts, and we concluded that transforming gender experts does not influence the classification of complaints. 

Finally, we concluded that using only the top complaint experts lets us get rid of the units that have likely learned correlations between gendered or indirectly gendered words and complaints; in other words, the units that have overfitted on gender features.

Having done that, we end up with a fairer and more accurate complaint classification.




[1] Language Models - Wikipedia

[2] Julien Simon - Large Language Models: A New Moore's Law?

[3] GPT-3 Model Card

[4] Radford et al. - Language Models Are Unsupervised Multitask Learners

[5] M. Barthel, G. Stocking, J. Holcomb and A. Mitchell - Reddit news users more likely to be male, young and digital in their news preferences

[6] Kurita et al. - Measuring Bias in Contextualized Word Representations

[7] Gender Bias Definition

[8] Arnault Gombert - Using Machine Learning to Calibrate Online Opinion Bias

[9] HuggingFace - DistilBERT base multilingual model (cased)

[10] Suau et al. - Self-conditioning Pre-Trained Language Models

[12] Vaswani et al. - Attention Is All You Need

[13] Product of experts

[14] R. Sofaer et al. - The area under the precision-recall curve as a performance metric for rare binary events

[15] doccano

[16] Bender et al. - On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

[18] Complaints on Social Media

[19] Classification - Achieving Separation

[20] Accuracy - Wikipedia

[21] Precision & Recall - Wikipedia

[22] F1 Score - Wikipedia

[23] Sanh et al. - DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

[24] Gender Bias on Wikipedia