May 9, 2023
Behavioral screening of Large Language Models

The success of Large Language Models (LLMs) has become more and more dramatic in recent years, and it has even led to a torrent of new AI applications outside of the NLP field. Alas, this success is afflicted by the models’ inability to prove that their predictions were made on good premises. We know their inputs, their outputs, and how likely the outputs are to be correct, but we still don’t have a clear understanding of what happens in between.
This lack of transparency might lead to users abstaining from using the algorithm because they simply cannot trust it. Understandably, major risks can arise if such models are misunderstood or improperly applied to make business and governance decisions.
Accuracy, being computed on a test set that is limited in size and in time, provides no guarantee of how well the model will handle the real-life data it will face after its release. The possibility that the model is overfitting, by taking shortcuts in the training data that happen to lead to correct predictions, can’t be ruled out. Because of this opacity, large language models, and deep-learning models in general, have been used as “black boxes”: as shown in Illustration 1, from a given input they generate an output that provides no insight into the model’s internal behavior. Yet this obscure output might be taken as ground truth by different users of the model.
Illustration 1 - Black-box models
What’s more, these models are often trained on massive volumes of data collected from people’s daily internet activities, where people love to voice their opinions, wishes, and complaints. This has led researchers to warn about the dangers of learning from people’s prejudices and biases, such as racism and gender discrimination. Illustration 2 displays a vulnerability of a commercial sentiment analysis model: changing location names should not change the sentiment, yet it does.
Illustration 2 - A commercial sentiment analysis model showing vulnerability to NER perturbations
Ribeiro et al., 2020 - Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
For these reasons, we need a set of processes and techniques that allow human users to understand and trust the results produced by LLMs, as well as their behavior. These methods are called eXplainable Artificial Intelligence (XAI) solutions, and despite the growing focus the Machine Learning (ML) community has recently given the task, there is still a major lack of reliable solutions tailored to each model’s specificities.
At Citibeats, we rely on LLMs to process human opinions and extract insights that policymakers and stakeholders can use to make appropriate decisions on various social matters. We therefore aim to build and deploy Machine Learning (ML) systems that are not only accurate, but also fair, representative of all the relevant opinions, transparent, and trustworthy. To do that, we designed a framework of tests that provides a detailed explanation of our models’ results and their expected behavior. This framework constitutes the first step towards a Model Explainability open-source project under Citibeats’ open Data Science initiative.
In addition to gaining a proper understanding of the models that we are working with, we also want to make sure that we are testing and deploying them in an efficient and reliable way.
Objectives
The interpretability of Machine Learning (ML) systems can be achieved either by ante-hoc or post-hoc methods. Ante-hoc methods work by designing intrinsically explainable models, where the decision is explained from the beginning of the training while maintaining the best possible model performance. Although they produce transparent and trustworthy models, their training process is slower and more complex, and it generally results in lower performance than black-box models. An example of such self-explaining models is RETAIN (Choi et al., 2017).
Post-hoc methods, on the other hand, are model-agnostic in nature. They are not tied to a particular type of Machine Learning (ML) model, which separates predictions from their explanation. A post-hoc method relies on an additional, “surrogate” model that generates user-readable explanations. It is the use of this additional, often black-box, model that makes post-hoc methods less transparent than their ante-hoc counterparts. However, they offer more flexible interpretation options and eliminate performance trade-offs. This strategy has led to several popular approaches: LIME (Ribeiro et al., 2016) and SHAP (Lundberg et al., 2017), for instance, are prevailing feature-importance-based methods; Anchors (Ribeiro et al., 2018) is a well-known rule-based method that describes complex models with a set of high-precision prediction rules; and CheckList (Ribeiro et al., 2020) proposes a matrix of linguistic capabilities and test types that facilitates designing comprehensive tests.
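As an illustration of how a feature-importance method is queried in practice, the sketch below runs LIME on a toy classifier (the predict_proba function is a stand-in for any black-box model, not one of our systems):

```python
# Sketch: post-hoc, feature-importance explanation of a black-box text classifier with LIME.
import numpy as np
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    # Stand-in for a real model: any callable mapping a list of texts
    # to an (n, 2) array of class probabilities works here.
    positive = np.array(
        [0.9 if "like" in t and "don't" not in t else 0.2 for t in texts]
    )
    return np.column_stack([1 - positive, positive])

explainer = LimeTextExplainer(class_names=["negative", "positive"])
explanation = explainer.explain_instance(
    "I don't like my house",  # text to explain
    predict_proba,            # black-box prediction function
    num_features=5,           # number of tokens to keep in the explanation
)
print(explanation.as_list())  # tokens ranked by their contribution to the prediction
```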
Some Citibeats models rely on pre-trained LLMs. Thanks to their capacity to be trained on massive volumes of data in a relatively short time, LLMs hold the highest ranks in terms of performance. Yet, because of their large number of parameters, they rank lowest in terms of explainability.
Illustration 3 - Explainability vs Performance
Inspired by AAAI XAI Tutorial 2020
Our Machine Learning (ML) team designs transformer-based NLP systems used in various tasks such as document classification, text summarization, and information extraction. Adhering to the wisdom of the late Uncle Ben, “With great power comes great responsibility”, we aim to align these powerful systems with Citibeats’ ethical and technical standards.
One of DistilBERT’s children is DistilBERT base uncased finetuned SST-2, a fine-tuned checkpoint of DistilBERT for sentiment classification. According to the model’s Risks, Limitations, and Biases section on HuggingFace, this model has been observed generating predictions that reveal bias against underrepresented populations:
For sentences like This film was filmed in COUNTRY, this binary classification model will give radically different probabilities for the positive label depending on the country (0.89 if the country is France, but 0.08 if the country is Afghanistan) when nothing in the input indicates such a strong semantic shift.
Illustration 4 - Bias map detailing probabilities for each country
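This behavior is straightforward to reproduce. The sketch below queries the public checkpoint named in the model card through the transformers pipeline API (the exact probabilities you obtain may differ slightly from those reported above):

```python
# Sketch: probing the published SST-2 checkpoint for country-dependent predictions.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    return_all_scores=True,
)

template = "This film was filmed in {}"
for country in ["France", "Afghanistan"]:
    scores = classifier(template.format(country))[0]
    positive = next(s["score"] for s in scores if s["label"] == "POSITIVE")
    print(f"{country}: P(positive) = {positive:.2f}")
```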
In order to avoid such risks, we must examine these aspects with respect to our use case. Since we are using pre-trained models, we must rely on post-hoc, model-agnostic techniques to carry out this examination. The goal of this process is to ensure that our NLP systems meet the ethical and technical requirements intended by their human designers. These requirements refer to certain linguistic capacities of the model, its fairness, and its robustness in a production environment.
In addition to meeting these requirements, our team aims to deliver speedy and reliable end-to-end NLP applications that remain in touch with stakeholders’ feedback. We therefore follow Continuous Integration and Continuous Delivery (CI/CD) practices to bridge the gap between the different phases of this process, as well as between its participants.
In this article, we walk the reader through the process of unifying our NLP cycle. We start by presenting our testing framework for a classification task, then introduce Giskard, a quality assessment tool for AI models, and finish by demonstrating our approach on a Citibeats Complaint classifier.
NLP tests
One way of explaining black-box models consists in understanding the connection between the model’s input and prediction. It is a process of inspection that provides a more detailed view of the model’s behavior, its strengths as well as its weaknesses. This approach includes the feature-based methods, like LIME and SHAP, mentioned in the previous section.
Another explainability path relies on testing. This approach focuses more on treating the issues related to performance drops and biases exhibited in production.
Inspired by Ribeiro et al., 2020, we decided to implement our own tests to evaluate models with only one textual input. In the rest of this article, we will be referring to these tests and models as NLP tests and NLP models.
The idea is to challenge our model in different aspects. We first define a set of skills/capacities that we deem essential for our NLP model to have. Then, we test its ability to handle the challenges related to each capacity.
For a classification task, we believe it is fundamental for the model to meet several linguistic, robustness, and ethical requirements.
To achieve that, we test the capabilities presented in Table 1:
Linguistic:
- Vocabulary and POS: the model’s ability to correctly handle neutral language, to recognize synonyms in the form of n-grams, to recognize the sentiment of adjectives, and to differentiate between the different components of a sentence.
- Negation: the model’s ability to recognize synonyms in the form of double negation, negation opposites, as well as other forms of negation.
- Temporal: the model’s ability to recognize and correctly handle variations of tense in a sentence.

Robustness:
- Typos and misspellings: the model’s ability to handle mistakes in real-world text.

Fairness:
- Fairness tests ensure that the model remains invariant to changes in attributes such as writing style, gender, religion, ethnicity, nationality, and sexuality.

Table 1 - Linguistic, robustness, and ethical requirements
To spot potential capability failures, we introduce two different test types that reveal possible sources of error in each capability. They were inspired by the principles of behavioral testing in software engineering: software testing aims to detect faults in software applications, and we use a similar approach to detect faults in models. According to Ribeiro et al., 2020:
NLP practitioners with CheckList uncovered almost three times as many bugs, compared to users without CheckList.
To keep things simple, we describe our testing approach for binary classification models. For multi-class models, the approach remains the same, only with more tests specific to each class.
We test each capability using two different types of tests, Minimum Functionality Tests and Metamorphic tests.
Minimum functionality tests are modeled after unit testing in software engineering. We label a set of samples for each capability and compare the model’s outputs with our labels.
The first step involves creating labeled sentences that challenge each capability. We then use these sentences to test the model and evaluate its performance using metrics such as accuracy, precision, and recall. This step gives us a general view of the model’s ability to handle each capability.
Labeling texts is a time-consuming task, so we only label 10 samples per class and capability. However, to obtain significant metrics, we require larger sample sizes; otherwise, we risk drawing incorrect conclusions about the model’s real abilities. To address this issue, we rely on augmentation techniques that transform the original text by applying slight perturbations that do not alter the text’s label. An example of this is shown in Illustration 5.
Illustration 5 - Example of augmentation function
As a result, we are able to triple the number of unit tests. Specifically, for a binary model, we have 60 labeled samples for each capability. It’s important to note that the texts are highly correlated due to their similarities. If the model classifies a text correctly, it’s more likely that it will classify the perturbed text correctly as well, and vice versa.
Here are some examples of unit tests that we use to evaluate how well the Sentiment Classifier understands negation:
- (“I don’t like my house”, Negative)
- (“Nowhere have I been as good as in Barcelona”, Positive)
After applying augmentation, we obtain:
- (“I don’t like my house”, Negative)
- (“I don’t like my home”, Negative)
- (“I do not like my home”, Negative)
- (“Nowhere have I been as good as in Barcelona”, Positive)
- (“Nowhere have I been as good as in Casablanca”, Positive)
- (“Nowhere have I been as great as in Barcelona”, Positive)
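A minimal sketch of how such label-preserving perturbations can be generated (the substitution list is illustrative, not our production augmenter):

```python
# Sketch: a simple label-preserving augmenter. Each rule rewrites a text without
# changing its label, so every labeled example yields additional unit tests.
SUBSTITUTIONS = [
    ("house", "home"),
    ("don't", "do not"),
    ("Barcelona", "Casablanca"),
    ("good", "great"),
]

def augment(text, label):
    """Yield (text, label) pairs obtained by applying one substitution at a time."""
    yield text, label
    for old, new in SUBSTITUTIONS:
        if old in text:
            yield text.replace(old, new), label

examples = [
    ("I don't like my house", "Negative"),
    ("Nowhere have I been as good as in Barcelona", "Positive"),
]
for text, label in examples:
    for augmented_text, augmented_label in augment(text, label):
        print((augmented_text, augmented_label))
```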
Once we have completed the unit tests for a given capability, we analyze the Accuracy, Precision, and Recall scores to obtain an overview of how well the model handles that capability. Based on these scores, we then set a threshold to determine whether the model has passed or failed the unit tests: if the scores are above the threshold, the test is passed; if they are below, it is failed. This approach allows us to ensure that the model meets our standards for each capability.
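The scoring step can then be as simple as the following sketch (the model object and its predict method are assumed; the threshold is illustrative):

```python
# Sketch: scoring the unit tests of one capability and applying a pass/fail threshold.
from sklearn.metrics import accuracy_score, precision_score, recall_score

def run_minimum_functionality_test(model, samples, threshold=0.7):
    """samples: list of (text, label) pairs for one capability, with labels in {0, 1}."""
    texts, y_true = zip(*samples)
    y_pred = [model.predict(text) for text in texts]  # model.predict is assumed to return 0 or 1
    scores = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }
    passed = all(score >= threshold for score in scores.values())
    return scores, passed
```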
Another type of test for assessing a capability is the Metamorphic test. This type of test is influenced by software testing, as in Segura et al. (2016): we test how the model’s behavior changes when we transform the input. The following examples will make things clearer.
Let’s go back to our sentiment classifier and its ability to handle negation. Our goal is for the model to understand that different forms of negation have similar meanings. For instance, for the negation “I don’t like my house,” we want our sentiment classifier to behave the same way as for the text “I dislike my house.” Instead of comparing the model’s output with a label, we check whether the outputs for the two texts are similar.
We can represent this type of test as a triple (M, T, E), where M stands for the model, T for the transformation, and E for the expected behavior of the model. In our approach, the transformation is related to the tested capability. In the previous example, we used a transformation related to negation because we were testing the negation capability.
The expected behavior of our model refers to the difference between the outputs before and after the transformation. For a binary model, the output is a 1-D array with two values: the probability of belonging to class 0 and the probability of belonging to class 1. For the previous example, the model output for a text will be the probability of expressing a negative sentiment and the probability of expressing a positive sentiment.
We separate these transformations into two subtypes: Invariant and Non-Invariant transformations. Invariant transformations are label-preserving perturbations: the input is modified, but the label, and therefore the expected behavior of the model, remains the same. For instance, in the previous example we transformed “don’t like” into “dislike,” but the sentiment of the sentence stayed unchanged, so we expect the output probabilities to remain the same, or at least very similar.
Non-Invariant transformations, on the other hand, are perturbations that change the label, or the degree of class membership, of the input. For example, if we modify the text “I don’t like my house” to “I like my house”, we switch the sentiment from negative to positive. Similarly, if we transform “I’m happy” into “I’m extremely happy”, both texts express positive sentiment but with different degrees of intensity. In both cases, we expect the output probabilities to move towards a more positive sentiment: from negative to positive in the first example, and from positive to even more positive in the second. By “moving from positive to even more positive,” we mean that if the model’s probability for the positive class was, say, 0.65 before the transformation, we expect it to increase after the transformation.
For binary models, Non-Invariant transformations can move the output probabilities in two directions, depending on whether they increase the probability of belonging to class 0 or class 1.
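In code, a metamorphic test is simply the triple (M, T, E): a transformation together with the direction in which we expect the positive-class probability to move (a sketch; the model object M is assumed to expose a predict_proba method returning P(positive)):

```python
# Sketch: representing a metamorphic test as the triple (M, T, E) for a binary model.
from dataclasses import dataclass
from typing import Callable

@dataclass
class MetamorphicTest:
    name: str
    transform: Callable[[str], str]  # T: perturbation applied to the input text
    expectation: str                 # E: "invariant", "increase", or "decrease"

# Invariant: rewriting the negation should leave the prediction (almost) unchanged.
negation_synonym = MetamorphicTest(
    name="negation_synonym",
    transform=lambda text: text.replace("don't like", "dislike"),
    expectation="invariant",
)

# Non-invariant: removing the negation should push the output towards positive.
negation_removal = MetamorphicTest(
    name="negation_removal",
    transform=lambda text: text.replace("don't like", "like"),
    expectation="increase",
)

# Usage with an assumed model M:
# p_before = M.predict_proba("I don't like my house")
# p_after = M.predict_proba(negation_removal.transform("I don't like my house"))
```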
Finally, we need a metric to measure the similarity between the expected behavior and the actual behavior, as well as a test to decide whether the model has passed or failed. For unit testing, we used Accuracy, Precision, and Recall as the metric and a threshold as the test.
Our approach relies on paired t-tests. Let’s describe how they work, under what conditions they can be used, and why they apply to our case.
A paired t-test is used to compare the means of two samples when each observation in one sample can be paired with an observation in the other sample.
Typically, the measurements are taken before and after some treatment, as in Rono et al. (2014): the first population consists of the measurements taken before the treatment, and the second of the ones taken after it.
Our case follows the same scenario. Given a dataset of texts and a transformation, we select only the texts affected by the transformation. The first population consists of the probabilities of belonging to class 1 before transforming the selected texts, and the second population of the probabilities of belonging to class 1 after applying the transformation.
To apply the paired t-test, the following assumptions must hold:
- Observations from the same sample must be independent.
- Each pair of observations must come from the same source.
- The measured differences must be normally distributed.
The first two conditions clearly apply to our case: the texts are independent of each other, and each pair of observations comes from the same source, the input text. The third condition may not always be met, but, as stated in Shamsudheen et al. (2021), t-tests are robust against non-normality if the sample is not too small. As a general rule of thumb, a sample size larger than 30 is considered good enough. For smaller samples we can use alternatives such as the Wilcoxon signed-rank test, but in this post we focus on paired t-tests.
The test requires us to formulate a null and an alternative hypothesis. The intuition behind t-tests is to assume that our data follows the null hypothesis and then show, thanks to the t-test statistic, that under the null it is very unlikely to observe our data.
The null hypothesis should state the opposite of what we want to show.
Back to our sentiment classifier example, suppose we have a transformation that, each time it sees the expression “I’m sad” in a sentence, replaces it with “I’m happy”. For example, “I’m sad because Barcelona lost yesterday against Inter” becomes “I’m happy because Barcelona lost yesterday against Inter”.
We assume that this transformation generally shifts the predicted sentiment probabilities towards a positive sentiment, although there may be exceptions. Therefore, our null hypothesis is the opposite.
Let p_original be the vector of probabilities of expressing a positive sentiment outputted by our model for the texts containing the expression “I’m sad”, and let p_perturbed be the corresponding probabilities for the transformed texts where “I’m sad” is replaced by “I’m happy”. Then:
Null hypothesis: mean(p_original) ≥ mean(p_perturbed)
Alternative hypothesis: mean(p_original) < mean(p_perturbed)
This specific test is called a one-sided (left-tailed) paired t-test, because the null hypothesis assumes that the difference is greater than or equal to 0.
The paired t-test tells us that, under the null hypothesis, the statistic below follows a Student’s t distribution with n − 1 degrees of freedom:
T = mean(diff) / (std(diff) / √n), where diff = p_original − p_perturbed, std is the sample standard deviation, and n is the number of paired observations.
Once we calculate this statistic, we can check how unlikely it is to obtain such a value from a Student’s t distribution. The p-value for this test is:
p-value = P(X < T), the probability of drawing a value smaller than our T statistic.
If this probability is very small, it means that, if our data followed the null hypothesis, it would be very unlikely to observe such a T statistic. Therefore, we can reject the null hypothesis and conclude that our data supports the alternative hypothesis.
Back to our case: a p-value of 0.01 would mean that, if mean(p_original) ≥ mean(p_perturbed) were true, there would be only a 1% chance of observing the probabilities that we sampled. We would then have shown, with 99% confidence, that the transformation replacing “I’m sad” with “I’m happy” tends to shift the probabilities towards expressing a more positive sentiment.
The p-value will therefore be our metric, and 0.05 our threshold: if the p-value is smaller than 0.05, the model has passed the test, and we have 95% confidence that it behaves as expected with respect to the transformation.
However, we have to be careful, since the type of paired t-test is specific to the expected behavior of the model. If we expect the transformation to increase the probabilities of belonging to class 1, we use a left-tailed paired t-test, as above. On the contrary, if we expect the transformation to decrease the probabilities of belonging to class 1, we perform a right-tailed paired t-test. For example, if we replace the previous transformation by its opposite, which replaces “I’m happy” with “I’m sad”, then we perform a right-tailed test.
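Both directional tests map directly onto scipy’s paired t-test through its alternative argument (a sketch; the probability arrays below are made up for illustration):

```python
# Sketch: one-sided paired t-tests on the positive-class probabilities
# before (p_original) and after (p_perturbed) applying a transformation.
import numpy as np
from scipy import stats

def directional_test(p_original, p_perturbed, expected="increase", alpha=0.05):
    """expected="increase": HA is mean(p_original) < mean(p_perturbed) (left-tailed).
    expected="decrease": HA is mean(p_original) > mean(p_perturbed) (right-tailed)."""
    alternative = "less" if expected == "increase" else "greater"
    _, p_value = stats.ttest_rel(p_original, p_perturbed, alternative=alternative)
    return p_value, p_value < alpha  # passed if the null is rejected at level alpha

# The transformation "I'm sad" -> "I'm happy" is expected to increase P(positive).
p_before = np.array([0.10, 0.22, 0.15, 0.30, 0.05])
p_after = np.array([0.80, 0.75, 0.66, 0.90, 0.58])
print(directional_test(p_before, p_after, expected="increase"))
```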
Finally, for invariant transformations, the procedure is more involved. The complexity comes from the fact that we want to show evidence that p_original and p_perturbed are similar, and we cannot do that directly with a single paired t-test.
To show it, we use equivalence tests, as in Lakens et al. (2018). An equivalence test is used to show that the means of two populations are similar. In our case, this test relies on two paired t-tests.
The null hypothesis of the equivalence test is:
H0: mean(p_perturbed) ≤ mean(p_original) − 0.1 or mean(p_perturbed) ≥ mean(p_original) + 0.1
and the alternative hypothesis is:
HA: mean(p_original) − 0.1 < mean(p_perturbed) < mean(p_original) + 0.1
To reject the null hypothesis, we perform two one-sided paired t-tests, and our metric is the larger of the two p-values. If this larger p-value is smaller than 0.05, the test is passed and the probabilities are considered close enough.
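Written out, the equivalence test amounts to two one-sided paired t-tests against the ±0.1 margin, following the TOST procedure (a sketch; the probabilities below are made up for illustration):

```python
# Sketch: equivalence test (TOST) built from two one-sided paired t-tests.
import numpy as np
from scipy import stats

def equivalence_test(p_original, p_perturbed, margin=0.1, alpha=0.05):
    """Pass if the mean shift between the two populations lies within +/- margin."""
    diff = np.asarray(p_perturbed) - np.asarray(p_original)
    n = len(diff)
    se = diff.std(ddof=1) / np.sqrt(n)

    # H0a: mean(diff) <= -margin  vs  HAa: mean(diff) > -margin
    p_lower = stats.t.sf((diff.mean() + margin) / se, df=n - 1)
    # H0b: mean(diff) >= +margin  vs  HAb: mean(diff) < +margin
    p_upper = stats.t.cdf((diff.mean() - margin) / se, df=n - 1)

    p_value = max(p_lower, p_upper)  # both one-sided nulls must be rejected
    return p_value, p_value < alpha

# Example: an invariant transformation should barely move the probabilities.
p_before = np.array([0.62, 0.48, 0.71, 0.55, 0.64, 0.58])
p_after = np.array([0.60, 0.50, 0.69, 0.57, 0.63, 0.59])
print(equivalence_test(p_before, p_after))
```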
Unified Machine Learning (ML) process - Giskard
Godfather of DevOps, Patrick Debois, says:
The biggest advantage of DevOps is the insight it provides.
In the software community, the umbrella term DevOps refers to a combination of practices, tools, and culture changes aiming to ensure a high software quality while maintaining a short development cycle and a continuous delivery. In many of its aspects, DevOps is derived from Agile principles.
Similarly to DevOps, MLOps seeks to boost the automation and the quality of models in production while meeting their ethical and business constraints. It is quickly becoming the standard approach to managing the life cycle of Machine Learning (ML) models. According to the Continuous Delivery Foundation (CDF), MLOps aims to unify the release cycle and reduce the technical debt of machine learning and software applications. It enables:
- automated testing of machine learning artifacts, such as data validation, ML model testing, and integration testing;
- the application of agile collaboration principles to machine learning projects;
- treating models and datasets as first-class citizens within CI/CD systems.
To run our battery of NLP tests, we use Giskard, a collaborative, open-source CI/CD tool for Machine Learning (ML) teams. It provides code presets that enable users to efficiently write and automate different types of tests. Table 2 and Illustration 6 detail Giskard’s test presets.
| Category | Use |
| --- | --- |
| Metamorphic | Tests to evaluate the output after applying perturbations to the input. |
| Heuristic | Tests to evaluate the model’s adherence to business rules. |
| Performance | Tests to evaluate the model’s performance within particular data slices. |
| Data Drift | Tests to evaluate feature drift between the reference and the actual dataset. |
| Prediction Drift | Tests to evaluate the presence of concept drift inside the model. |

Table 2 - Different categories of Giskard’s test presets
Illustration 6 - Heuristic test presets: Kolmogorov-Smirnov
In addition to these presets, Giskard lets us write our own custom tests and provides an inspection tool that instantly shows the model’s output for a given set of feature values. This tool is extremely useful, as it allows the tester to visualize feature contributions, directly modify feature values, and monitor the resulting output.
Illustration 7 shows Giskard’s intuitive model inspection interface. Besides supporting the investigation of the model’s behavior, this interface empowers collaboration: it enables our team members to leave notes and tag teammates directly, using discussion threads that facilitate follow-up and decision making.
Illustration 7 - Giskard’s model inspection interface
Experiments
We used the procedure described previously to test in-house NLP models. Let’s describe the benefits of the procedure when testing a complaint detector.
The model used is a binary classification model trained at Citibeats that predicts whether a text is a complaint or not: it returns 1 if the text carries a complaint and 0 otherwise. The architecture of the model is described in Gombert, 2021.
When the model was built, we evaluated its performance on a labeled test set annotated by Preotiuc-Pietro et al. (2019). The F1, Precision, Recall, and Accuracy metrics can be found below:
| F1 | P | R | Acc |
| --- | --- | --- | --- |
| 75.01 | 70.53 | 80.27 | 77.47 |

Table 3 - Metrics on the test set
Let’s now test the model using our screening:
In total, we ran 4 series of tests to evaluate the model’s linguistic capacities. In each series we tested one or more capabilities, following, for each one, the procedure described in the NLP tests section. Here we present the results for 3 of the series. In the first series, we test Vocabulary, POS, Taxonomy, Semantic Role Labeling, and Temporality; in the second, Negation; and in the third, Fairness. The results are shown in Tables 4, 5, and 6, respectively.
To decide whether the model passes or fails a test, we selected critical values. For Minimum Functionality Tests, we chose the threshold to be 10 points below the corresponding metric on the test set. We set the threshold below the test-set metrics because the idea is to challenge the model while staying close to them, since the test-set metrics ultimately reflect the performance in production. For metamorphic tests, we set the p-value threshold at 0.05: to pass a test, the confidence has to be greater than 95%. This is the most common threshold for rejecting the null hypothesis in paired t-tests.
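These decision rules can be summarized in two small helpers (a sketch; the reference metrics are the ones from Table 3, expressed as proportions):

```python
# Sketch: pass/fail criteria used in the experiments.
TEST_SET_METRICS = {"accuracy": 0.7747, "precision": 0.7053, "recall": 0.8027}

def passes_mft(metric_name, score, margin=0.10):
    """Minimum Functionality Test: the score must stay within 10 pts of the test-set metric."""
    return score >= TEST_SET_METRICS[metric_name] - margin  # e.g. accuracy threshold ~0.67

def passes_metamorphic(p_value, alpha=0.05):
    """Metamorphic test: passed when the null is rejected with more than 95% confidence."""
    return p_value < alpha

print(passes_mft("accuracy", 0.533))  # False -> failed, as in Table 4
print(passes_metamorphic(0.01))       # True  -> passed
```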
As we can observe in the tables, the model fails most of the Minimum Functionality tests; for negation, it fails on all three metrics. Nonetheless, the model passes almost all the invariant tests: it looks stable with respect to the invariant transformations we defined.
| Test | Type | Threshold | Example | Result | Status |
| --- | --- | --- | --- | --- | --- |
| Accuracy | Unit | 0.67 | | 0.533 | failed |
| Precision | Unit | 0.6 | | 0.625 | passed |
| Recall | Unit | 0.7 | | 0.556 | failed |
| Synonyms | INV | 0.95 | I’m upset → I’m angry | 0.99 | passed |
| Stopwords | INV | 0.95 | I’m upset → hi, I’m upset | 0.99 | passed |
| Robustness typo | INV | 0.95 | I’m upset → 1’m upset | 0.99 | passed |
| Strong Expressions | DIR | 0.95 | “I’m upset” → “I’m really upset” | 0.88 | failed |

Table 4 - Results for Vocabulary, POS, Taxonomy, Semantic Role Labeling and Temporality
| Test | Type | Threshold | Example | Result | Status |
| --- | --- | --- | --- | --- | --- |
| Accuracy | Unit | 0.67 | | 0.455 | failed |
| Precision | Unit | 0.6 | | 0.5 | failed |
| Recall | Unit | 0.7 | | 0.5 | failed |
| Double Negation | INV | 0.95 | you are late → you are not in time | 0.97 | passed |
| Negation Opposites | INV | 0.95 | you are selfish → you are not generous | 0.98 | passed |
| Other types of Negation | INV | 0.95 | I don’t like anything → I like nothing | 0.97 | passed |

Table 5 - Results for Negation
| Test | Type | Threshold | Result | Status |
| --- | --- | --- | --- | --- |
| Accuracy | Unit | 0.67 | 0.692 | passed |
| Precision | Unit | 0.6 | 1.0 | passed |
| Recall | Unit | 0.7 | 0.333 | failed |
| Writing Style | INV | 0.95 | 0.97 | passed |
| Gender | INV | 0.95 | 0.98 | passed |
| Religion | INV | 0.95 | 0.99 | passed |
| Ethnicity | INV | 0.95 | 0.99 | passed |
| Nationality | INV | 0.95 | 0.94 | failed |
| Sexuality | INV | 0.95 | 0.99 | passed |

Table 6 - Results for Fairness
Thanks to this test suite, we are able to spot the areas where the model is weakest and try to reinforce them. Moreover, we detected the capabilities where the model failed the most, which are the most urgent to fix. The next step is to enhance the model, by feeding more data related to these capabilities into the training or through some feature engineering, and then to test it again and check whether the results improve.
Conclusion
In essence, this behavioral screening approach enables us to release fair and trustworthy NLP models: models that not only abide by Citibeats’ Ethical AI values, but also show an accurate understanding of natural language.
Thanks to Giskard, our data scientists can quickly implement and automate this screening process. Additionally, they can collaborate with business stakeholders to exchange feedback around the results of the screening in a centralized and straightforward manner. In this article, we introduced the reader to what constitutes one of our first steps towards a Model Explainability open-source project under Citibeats’ open Data Science initiative.
If you are concerned about bias in AI models, join Citibeats’ Ethical AI community and help decision-makers make better decisions based on unbiased data.
Bibliography
[1] Ribeiro et al., 2020 - Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
[2] Giskard - Open Source CI/CD Platform for ML teams
[3] Ribeiro et al., 2016 - “Why Should I Trust You?”: Explaining the Predictions of Any Classifier
[4] Lundberg et al., 2017 - A Unified Approach to Interpreting Model Predictions
[6] A Method for Testing Model to Text Transformations
[7] The differences and similarities between two-sample t-test and paired t-test
[8] Wikipedia - Language Models
[10] Ribeiro et al., 2018 - Anchors: High-Precision Model-Agnostic Explanations
[12] Lakens et al., 2018 - Equivalence Testing for Psychological Research: A Tutorial
[13] Wikipedia - Agile Software Development
[14] Wikipedia - Concept Drift
[15] Github - Giskard-AI/giskard
[16] Rogers et al., 2020 - A Primer in BERTology: What We Know About How BERT Works
[17] Bender et al., 2021 - On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
[18] Github - openai/gpt3
[20] Vilone et al., 2020 - Explainable Artificial Intelligence: a Systematic Review
[21] Wikipedia - Ngrams
[22] Wikipedia - Part-of-speech tagging
[23] Segura et al., 2016 - A Survey on Metamorphic Testing
[24] Murphy et al., 2022 - Properties of Machine Learning Applications for Use in Metamorphic Testing
[25] Shamsudheen et al., 2021 - Should we test the model assumptions before running a model-based test?
[26] Princeton University "About WordNet." WordNet. Princeton University. 2010.
[28] Papers With Code - Text Classification on GLUE
[29] Wikipedia - Natural-Language Understanding
[30] Abby Seneor & Matteo Mezzanotte - Open source data science: How to reduce bias in AI
[31] Child et al., 2019 - Generating Long Sequences with Sparse Transformers
[32] Gombert - Boosting BERT Performances With Low Resources