How to use Citibeats’ state-of-the-art multilingual gender detector

 

Motivation

In today's digital age, social media has become a powerful tool to gauge the pulse of society. However, it's important to recognize that social media data is biased when it comes to demographics such as gender or age. Simply assuming that it is representative of the entire population can lead to flawed conclusions. For example, in France, women account for 51.6% of the population, but on Twitter, they represent only 33.5% of users. This significant difference highlights the need to accurately identify gender and age before diving into deeper analysis.

At Citibeats, we recognize the importance of reducing bias in social media data, which is why we are proud to introduce MiniAM2. This cutting-edge model is designed to identify the gender and organization status of Twitter accounts simply by analyzing the user's name, screen name, and bio description. Currently, MiniAM2 is available in English, Spanish, and French, but our team is working to expand its capabilities to other languages.

Citibeats is committed to responsible data management and to protecting individuals' privacy when handling social media data. Hence, MiniAM2 is used in the Citibeats platform only for bias reduction purposes, on anonymized data and under strict privacy guidelines, so that sensitive information about individuals is not exposed and user trust is maintained.

The name MiniAM2 stands for Mini Assemblage distillation of M2, where M2 means Multilanguage and Multi attributes. Our team is excited to share more about this innovative model in an upcoming article.

So, how can you benefit from MiniAM2? This blog post is here to answer that question and help you better understand the value of reducing bias in social media data.

MiniAM2 on Hugging Face

We created an organization account for Citibeats on Hugging Face. All our open models will be uploaded there.


The MiniAM2 model can be found in the Citibeats account on Hugging Face under the name ‘miniam2_en_es_fr’. Its model card is worth checking. It provides:

  • A general description of the model
  • Intended uses & limitations
  • Training and evaluation data
  • Training procedure
  • Training hyperparameters
  • How to use
  • Example
  • Recommendations

 

How to Use MiniAM2: Generating Predictions with Citibeats’ Multilingual Gender Detection Model on Hugging Face

If you're looking for an easy-to-use gender detection model, look no further than Citibeats’ MiniAM2, available on Hugging Face. In this guide, we'll walk you through the steps to get started with MiniAM2 and generate your own predictions.

Step 1: Open a Python Notebook

To begin, open a Python notebook. We created a Colab project with all the instructions, which you can run by following this link to Colab. We recommend using a Colab notebook so that you can follow along easily.

Step 2: Install Required Packages

Before we can download and use MiniAM2, make sure your environment has the following packages installed: tensorflow, huggingface_hub, and keras.
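
Before going further, it can help to confirm that these packages are importable in the current environment. The following is a minimal sketch of such a check, not part of the original Colab; the helper name missing_packages is hypothetical and introduced only for illustration.

```python
from importlib.util import find_spec

# Packages listed in this step.
REQUIRED = ["tensorflow", "huggingface_hub", "keras"]


def missing_packages(names):
    """Return the names that cannot be found in this environment.

    Hypothetical helper: it only looks the packages up, it does not import them.
    """
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If anything is reported as missing, installing it with pip from a notebook cell (for example, %pip install followed by the package name) is usually enough.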

Step 3: Download the Model

Downloading MiniAM2 is simple and only requires two lines of code. Run the following in your notebook:

Python
from huggingface_hub import from_pretrained_keras
miniam2 = from_pretrained_keras('CitibeatsAI/miniam2')

 

Step 4: Use MiniAM2 for Gender Detection

Assume you have a DataFrame with information about social media users, including their screen names, names, and bios. 

Python
import pandas as pd
# Some sample data: one entry per account
names = ['Jacobeo', 'Radio de Barcelona', 'Ms. Dragon']
screen_names = ['@elktb', '@radioBarcelona', '@msdragon']
bios = [
    'me gusta ir a la montaña y otear pájaros',
    'La programacion de la mejor musica',
    'I play all kind of board and role-playing games',
]
# Column names used later when calling the model
data_dic = {
    "name": names,
    "screen_name": screen_names,
    "bio": bios,
}
df = pd.DataFrame(data_dic)

 

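This produces a DataFrame containing the following information:

name                screen_name      bio
Jacobeo             @elktb           me gusta ir a la montaña y otear pájaros
Radio de Barcelona  @radioBarcelona  La programacion de la mejor musica
Ms. Dragon          @msdragon        I play all kind of board and role-playing games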

You can use MiniAM2 to generate predictions for each user by passing the model a list of three tensors or DataFrame columns: the screen names, the names, and the bios, in that order.

Python
miniam2([df["screen_name"], df["name"], df["bio"]])

 

This will return the output:

Python
<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[0.001466  , 0.79294753, 0.20558645],
       [0.9579734 , 0.02286685, 0.01915975],
       [0.0070298 , 0.1392757 , 0.8536945 ]], dtype=float32)>

The output will be a tensor containing the predictions for each user. The first column represents the probability that the user is an organization, the second column represents the probability that the user is male, and the third column represents the probability that the user is female.

For example, for the data we created, the first user account belongs to a male, the second one to an organization, and the third one to a female, according to the predictions of the model.
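
To make that reading explicit, the probabilities can be turned into labels by picking the largest value in each row. The following is a minimal sketch in plain Python, assuming the column order described above (organization, male, female); the probabilities are copied from the example output, and the helper name to_labels is hypothetical.

```python
# Column order as described above: organization, male, female.
LABELS = ["organization", "male", "female"]

# Probabilities copied from the example output above (one row per account).
probabilities = [
    [0.001466, 0.79294753, 0.20558645],
    [0.9579734, 0.02286685, 0.01915975],
    [0.0070298, 0.1392757, 0.8536945],
]


def to_labels(rows, labels=LABELS):
    """Pick the label with the highest probability for each row."""
    return [labels[max(range(len(row)), key=row.__getitem__)] for row in rows]


predicted = to_labels(probabilities)
print(predicted)  # ['male', 'organization', 'female']
```

With the real model output, calling .numpy().tolist() on the returned tensor should give a nested list in the same shape as probabilities here.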

Use Case: Social risk monitor

To show why identifying the gender of Twitter accounts matters, we present a use case. In this scenario, a Citibeats client wants to address societal problems effectively by understanding which issues concern society the most. The client uses the Citibeats social risk monitor, which provides trends per category to help find the best solutions. All the graphs and data shown are real data from the Citibeats social risk monitor.

The monitor shows the trend of the Health category compared to the rest of the categories, summarized in the graph below:

[Figure: trend of the Health category compared to the rest of the categories]

The graph shows that this category has experienced an upward trend in the last two weeks. It also indicates that the Health category received less than 7% of the total activity. The client might draw these two conclusions from the data: that interest in Health is rising, and that Health accounts for less than 7% of what concerns people. Both could be false due to data bias.

To avoid this, we use MiniAM2 to identify the gender of Twitter accounts and plot the activity in the Health category by gender:

[Figure: activity in the Health category by gender]

We omitted the organization label for simplicity. The graph contradicts both conclusions. To start with, over 10% of the Twitter activity by women occurred in the Health category, compared with around 5.5% of the activity by men. Since women make up more than 50% of the population, a total that reflected the population would be at least the midpoint of the two figures, about 7.75%, so the real share of the Health category cannot be lower than 7%.
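
The arithmetic behind this argument can be sketched as a weighted average. This is an illustration under stated assumptions, not the calculation used in the monitor: the per-gender shares are the approximate values quoted above, the weights are the France figures from the Motivation section (the use case does not say which country it covers), organizations are ignored, and the helper name weighted_share is hypothetical.

```python
def weighted_share(share_women, share_men, weight_women):
    """Overall category share when women carry `weight_women` of the weight."""
    return weight_women * share_women + (1.0 - weight_women) * share_men


# Approximate values quoted in the text (percent of each group's activity).
HEALTH_WOMEN = 10.0
HEALTH_MEN = 5.5

# Illustrative weights only: the France figures from the Motivation section.
TWITTER_WOMEN = 0.335      # share of women among Twitter users
POPULATION_WOMEN = 0.516   # share of women in the population

raw = weighted_share(HEALTH_WOMEN, HEALTH_MEN, TWITTER_WOMEN)
adjusted = weighted_share(HEALTH_WOMEN, HEALTH_MEN, POPULATION_WOMEN)

print(f"Twitter-weighted share:    {raw:.2f}%")       # 7.01%
print(f"Population-weighted share: {adjusted:.2f}%")  # 7.82%
```

Under these assumed weights, reweighting by population moves the Health share from roughly 7% to close to 8%, which is the direction of the correction described above.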

The graph also shows that the slight upward trend of the Health category comes from the activity of men on Twitter. Observe how closely the total curve follows the men's curve, and how much it differs from the women's curve. Women's activity in the Health category is roughly double that of men, so a representative total should lean towards the women's curve, which shows a slightly decreasing trend. The total curve is therefore significantly biased by the greater presence of men on Twitter.

This use case demonstrates how identifying the gender of Twitter accounts can reveal hidden data biases and give a better understanding of social trends, enabling the client to make informed decisions.

Conclusions

Social media has become a significant source of data for analyzing the pulse of society. However, it's essential to recognize the inherent bias that can exist in this data, particularly regarding demographics such as gender or age. The Citibeats team has developed the MiniAM2 model, which identifies the gender and organization status of Twitter accounts, helping reduce bias in social media data. The model is available in English, Spanish, and French and can be downloaded from Citibeats' account in Hugging Face. In this article, we have provided step-by-step instructions on how to use the MiniAM2 model in Python to generate predictions. We have also shown a use case of how the social risk monitor of Citibeats can be used to identify the most pressing social issues and help clients make data-driven decisions. Overall, the MiniAM2 model is a valuable tool for accurately identifying gender in social media data, enabling researchers and businesses to make more informed decisions.