Toxicity/hate speech classification with one line of code
The issue of online toxicity is one of the most challenging problems on the internet today. We know it can amplify discord and discrimination, from racism and anti-semitism, to misogyny and homophobia. In some cases, toxic comments online can result in real life violence [1,2].
Human moderators are struggling to keep up with the increasing volumes of harmful content, a job that often leads to PTSD. It shouldn’t come as a surprise, therefore, that the AI community has been trying to build models to detect such toxicity for years.
In 2017, Alphabet’s Perspective API, an AI solution for detecting toxic comments online, was met with criticism when users found examples of bias related to race, sexual orientation, or disability. In particular, they found a positive correlation between toxic labels and comments containing identity terms referring to race, religion, or gender. Because of this skew in the data, models were likely to associate these neutral identity terms with toxicity. For example, “I am a gay black woman” received a toxicity score of 87%, while “I am a woman who is deaf” received 77%.
This led Jigsaw to create 3 Kaggle challenges in the following years aimed at building better toxicity models: the Toxic Comment Classification Challenge, the Unintended Bias in Toxicity Classification challenge, and the Multilingual Toxic Comment Classification challenge.
These Jigsaw challenges on Kaggle have pushed things forward and encouraged developers to build better toxicity detection models using recent breakthroughs in natural language processing.
Detoxify is a simple Python library designed to easily predict whether a comment contains toxic language. It can automatically load one of 3 trained models: original, unbiased, and multilingual. Each model was trained on data from one of the 3 Jigsaw challenges using the 🤗 transformers library.
The library can be easily installed from the terminal and imported in Python.
$ pip install detoxify
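Usage then comes down to picking a model and calling predict. A minimal sketch following the repository’s documented API (the exact output keys depend on which model is loaded):

```python
from detoxify import Detoxify

# load one of the 3 pre-trained models: 'original', 'unbiased' or 'multilingual'
model = Detoxify('original')

# predict returns a dictionary of scores per class (e.g. toxicity, insult, threat)
results = model.predict('I am tired of writing this stupid essay')

# a list of comments can also be scored in a single call
batch_results = model.predict(['this is great', 'you are an idiot'])
```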
The multilingual model was trained on 7 different languages, so it should only be used on: English, French, Spanish, Italian, Portuguese, Turkish, and Russian.
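For non-English text, the same call applies with the multilingual model (the French sentence below is just an illustrative input):

```python
from detoxify import Detoxify

# score a French comment with the multilingual model
results = Detoxify('multilingual').predict("Je suis fatigué d'écrire cet essai stupide")
```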
You can find more details about the training and prediction code on unitaryai/detoxify.
During the experimentation phase, we tried a few transformer variants from 🤗 Hugging Face; however, the best ones turned out to be those already suggested in the discussions of the top Kaggle solutions.
Originally introduced by Google AI in 2018, BERT is a deep bidirectional transformer pre-trained on unlabelled text from the internet that achieved state-of-the-art results on a variety of NLP tasks, such as Question Answering and Natural Language Inference. Its bidirectional approach resulted in a deeper understanding of context than previous unidirectional (left-to-right or right-to-left) approaches.
Built by Facebook AI in July 2019, RoBERTa is an optimised way of pre-training BERT. What they found was that removing BERT’s next sentence prediction objective, training with much larger mini-batches and learning rates, and training for an order of magnitude longer resulted in better performance on the masked language modelling objective, as well as on downstream tasks.
Proposed by Facebook AI in late 2019, XLM-RoBERTa is a multilingual model built on top of RoBERTa and pre-trained on 2.5TB of filtered CommonCrawl data. Despite being trained on 100 different languages, it manages not to sacrifice per-language performance and remains competitive with strong monolingual models.
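Each Detoxify model fine-tunes one of these backbones with a classification head via the 🤗 transformers library. As a rough sketch of how such a backbone can be set up for multi-label toxicity prediction (the checkpoint names below are the standard Hugging Face ones for each architecture and are an assumed mapping; the exact configuration in unitaryai/detoxify may differ):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# standard Hugging Face checkpoints for the three backbones (assumed mapping)
BACKBONES = {
    'original': 'bert-base-uncased',      # BERT
    'unbiased': 'roberta-base',           # RoBERTa
    'multilingual': 'xlm-roberta-base',   # XLM-RoBERTa
}

checkpoint = BACKBONES['original']
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# multi-label head: one sigmoid output per toxicity class
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=6,
    problem_type='multi_label_classification',
)
```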
The 2nd challenge required thinking about the training process more carefully. With additional identity labels (only present for a fraction of the training data), the question was how to incorporate them in a way that would minimise bias.
Our loss function was inspired by the 2nd place solution, which combined a weighted toxicity loss with an identity loss to ensure the model learns to distinguish between the 2 types of labels. Additionally, the toxicity labels are weighted more heavily when identity labels are present for a specific comment.
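As an illustration of that idea (a sketch only, not the exact code in unitaryai/detoxify; the weighting scheme and names are assumptions), such a combined loss can be written in PyTorch as:

```python
import torch.nn.functional as F

def combined_loss(toxicity_logits, toxicity_targets,
                  identity_logits, identity_targets, has_identity):
    """Weighted toxicity loss + identity loss (illustrative sketch).

    has_identity: float tensor of shape (batch,), 1.0 where a comment carries
    identity annotations and 0.0 otherwise.
    """
    # up-weight the toxicity term for comments that have identity labels
    sample_weights = 1.0 + has_identity  # assumed weighting scheme
    toxicity_loss = F.binary_cross_entropy_with_logits(
        toxicity_logits, toxicity_targets,
        weight=sample_weights.unsqueeze(-1),
    )

    # the identity loss only contributes where identity labels exist
    identity_loss = F.binary_cross_entropy_with_logits(
        identity_logits, identity_targets, reduction='none'
    )
    identity_loss = (identity_loss.mean(dim=-1) * has_identity).mean()

    return toxicity_loss + identity_loss
```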
This challenge also introduced a new bias metric, which calculates the ROC-AUC on 3 specific test subsets for each identity:
- Subgroup AUC: only the comments that mention the identity;
- BPSN (Background Positive, Subgroup Negative) AUC: toxic comments that do not mention the identity, together with non-toxic comments that do;
- BNSP (Background Negative, Subgroup Positive) AUC: non-toxic comments that do not mention the identity, together with toxic comments that do.
These are then combined into the generalised mean of bias AUCs to get an overall measure.
The final score combines the overall AUC with the generalised mean of bias AUCs.
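Concretely, the evaluation can be sketched as follows (the power p = -5 and the equal 0.25 weights follow the competition’s description of the metric; y_true, y_pred and subgroup are assumed to be boolean labels, predicted scores and boolean identity flags over the test set):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auc(y_true, y_pred, subgroup):
    # AUC restricted to comments that mention the identity
    return roc_auc_score(y_true[subgroup], y_pred[subgroup])

def bpsn_auc(y_true, y_pred, subgroup):
    # Background Positive, Subgroup Negative:
    # toxic comments without the identity + non-toxic comments with it
    mask = (~subgroup & y_true) | (subgroup & ~y_true)
    return roc_auc_score(y_true[mask], y_pred[mask])

def bnsp_auc(y_true, y_pred, subgroup):
    # Background Negative, Subgroup Positive:
    # non-toxic comments without the identity + toxic comments with it
    mask = (~subgroup & ~y_true) | (subgroup & y_true)
    return roc_auc_score(y_true[mask], y_pred[mask])

def generalised_mean(aucs, p=-5):
    # generalised (power) mean across identities; p = -5 penalises low AUCs
    return np.mean(np.asarray(aucs) ** p) ** (1 / p)

def final_score(overall_auc, subgroup_aucs, bpsn_aucs, bnsp_aucs, w=0.25):
    # overall AUC plus the generalised mean of each family of bias AUCs
    bias_terms = [generalised_mean(a) for a in (subgroup_aucs, bpsn_aucs, bnsp_aucs)]
    return w * overall_auc + w * sum(bias_terms)
```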
The combination of these resulted in less biased predictions on non-toxic sentences that mention identity terms.
If words associated with swearing, insults, hate speech, or profanity are present in a comment, it is likely to be classified as toxic (even by the unbiased model), regardless of the author’s tone or intent, e.g. humorous or self-deprecating. For example, ‘I am tired of writing this stupid essay’ gives a toxicity score of 99.70%, while the same sentence without the word ‘stupid’, ‘I am tired of writing this essay’, gives 0.05%.
However, this doesn’t necessarily mean that the absence of such words will result in a low toxicity score. For example, a common sexist stereotype such as ‘Women are not as smart as men.’ gives a toxicity score of 91.41%.
Some useful resources about the risk of different biases in toxicity or hate speech detection are:
Moreover, since these models were tested mostly on the test sets provided by the Jigsaw competitions, they are likely to behave in unexpected ways on data in the wild, which will have a different distribution from the Wikipedia and Civil Comments data in the training sets.
Last but not least, the definition of toxicity is itself subjective. Perhaps due to our own biases, both conscious and unconscious, it is difficult to come to a shared understanding of what should or should not be considered toxic. We encourage users to see this library as a way of identifying the potential for toxicity. We hope it can help researchers, developers, and content moderators flag extreme cases more quickly and fine-tune the models on their own datasets.
While it seems that current hate speech models are sensitive to particular toxic words and phrases, we still have a long way to go until algorithms can capture actual meanings without being easily fooled.
For now, diverse datasets that reflect the real world and full context (e.g. accompanying image/video) are one of our best shots at improving toxicity models.
At Unitary, we build visual understanding AI capable of interpreting visual content in context, and our mission is to stop online harm.
You can find more about our mission and motivation in our previous post.