Old moderation methods are not keeping up with the rise of video. Current analysis lets harmful content slip through, failing to account for nuance and context. We need advanced, multimodal solutions to ensure online safety for brands and users alike.
The early internet was defined by text-based interactions, on forums and through messaging apps like AOL and MSN Messenger. After that came hours spent writing on people’s Facebook walls and scrolling through the 700 pictures you posted of a single party, commenting on each one. Then we got a window into what our friends (and quite a few strangers) were having for breakfast, lunch and dinner. Nowadays, though, you probably spend most of your time online watching videos, sucked into reel after reel of ‘get-ready-with-me’ clips and football highlights.
Online video is the fastest-growing kind of online content, expanding at an exponential rate across social media platforms. Cisco projected that by 2022, online video would make up more than 82% of all consumer internet traffic. Every minute, 500 hours of video are uploaded to YouTube, a figure that grew by around 40% between 2014 and 2020. As of 2023, videos tagged #FYP on TikTok had been viewed a total of 35 trillion times. It is a constant barrage of upload and view, of like and subscribe.
This deluge has activated loyalties and called users to action in new ways. Video conveys tone and emotion in powerful ways – video killed the radio star for a reason, after all. TikTok users in the US are 2.3x more likely than users of any other platform to create a post tagging a brand, and in 2021, TikTok briefly dethroned Google as the world’s most popular domain, according to the Washington Post. Short-form video isn’t just where people turn for entertainment – it’s where they look for information.
Platforms currently rely on human moderators because they know that automated filters screen uploads quickly but lack nuance, letting harmful content slip through. A large platform like Meta employs around 15,000 human moderators; a video-first site like YouTube has a 10,000-strong global workforce of moderators. The realities of this work occasionally surface in the press: moderators in China, for example, were reported to process 1,600 video clips in a 12-hour shift – about 133 clips per hour. The vast majority of automated solutions were built for the image- and text-heavy social media of yesterday, and the same tonal and emotional context that makes video so compelling also makes it difficult to classify, hence the continued reliance on humans. The result is a process that is emotionally taxing and painfully slow: 1,600 clips in 12 hours is a drop in the ocean of videos uploaded to platforms every day.
The consequences of this are real, and they are felt by users and brands alike. Aggressive takedowns disproportionately censor marginalised groups, while traumatic and misleading videos spread widely and unchecked. Brands miss out on opportunities to get their content in front of users who may become loyal customers, and users are exposed to content that makes them feel unsafe or uncomfortable online. It’s a problem that will only get worse as more content is uploaded to platforms and more users are exposed to it. As one trust and safety professional told us: “The scary part is that there's always something new as the space evolves and iterates [...] There's a lot of bad actors and there's only so much you can do to identify them [...] and it can be very difficult to respond to situations at scale.”
Overall, social media companies need to reimagine moderation for an online world where video dominates. As Brand Safety Institute’s COO, AJ Brown, tells us: “Especially in the world of video, human review is still viewed as the most trustworthy moderation approach in the industry today. Human teams can review video in context and identify emerging issues more reliably than legacy image- and text-based detection systems, but this approach doesn’t scale. Automated multimodal solutions are best poised to keep people and brands safe online as the volume and velocity of video content online continue to increase.”
So: how short are platforms and brands really falling? We’ve built an industry around the idea that brand safety solutions catch 99% of harmful content, with the tacit understanding that the remaining 1% of edge cases is a tiny, nigh-impossible challenge to meet. The traditional solution, the one that catches ‘99%’ of harmful content, is often described as ‘frame by frame’: an AI model that treats a video as a series of images and classifies the content according to a handful of images chosen to be representative of the video. But a video has a lot more going on inside it than what’s visible in individual frames (otherwise we wouldn’t get so angry about clickbait). Frame-by-frame approaches fail to consider the sound and the caption, as well as what happens in the video over time.
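As a rough illustration (not any vendor’s actual system), this is what a frame-by-frame pipeline boils down to; the score_image function here is a hypothetical stand-in for a single-image classifier:

```python
from typing import Callable, Sequence

def moderate_frame_by_frame(
    frames: Sequence,
    score_image: Callable[[object], float],   # hypothetical single-image harm model
    sample_every: int = 30,
    threshold: float = 0.8,
) -> bool:
    """Flag the video if any sampled still image looks harmful on its own."""
    sampled = list(frames)[::sample_every]            # a few 'representative' stills
    scores = [score_image(frame) for frame in sampled]
    # Note what never enters the decision: the audio, the caption, on-screen
    # text, and how the frames relate to each other over time.
    return max(scores, default=0.0) >= threshold

# Toy example: an educational clip about the opioid crisis. One still of pills
# scores high, so the whole video is flagged, even though the narration is
# what actually determines whether it is harmful.
frames = ["intro_slide", "presenter", "pills_closeup", "charts"]
fake_scores = {"pills_closeup": 0.9}
print(moderate_frame_by_frame(frames, lambda f: fake_scores.get(f, 0.1), sample_every=1))
# -> True: flagged on imagery alone
```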
Imagine watching a film on mute, with no subtitles. Would you really be able to tell me the story? What if, instead of showing you a film, I just sent you five screenshots from different points in the movie? Would you be able to tell me whether I could show that film to an eight-year-old? Approaches that focus on only one modality (in this case, images) can fail to pick up on a huge amount of context and content, allowing potentially harmful content to proliferate. The inverse is also true: a news report on the devastation caused by earthquakes in Libya and Morocco might be inappropriately removed by a blunt tool mistakenly identifying violence. This is something that professionals in the field are aware of; a trust and safety professional at a major social media platform told us that video posed a particular challenge because of the additional nuance inherent in the format: “[Moderation is] harder in video because there is a lot of nuance there. [...] You could use ML or human review to classify an image and identify the presence of something that might be considered harmful, like drugs and alcohol [...] but it's actually an educational video that's explaining the opioid crisis. [The classifier] is just seeing the actual image of something that might be considered harmful rather than listening to the way in which it's being described.”
In order to effectively moderate video content at scale, we need a solution that analyses video content in the same way that people consume it: holistically.
A multimodal solution takes everything into account: the visual signal as it unfolds over time (rather than single images), alongside the aural and textual content within and around it. Where we could once only be confident of capturing ‘99%’ of harmful content within the bounds of a unimodal classifier, we now have access to technology that vastly lifts the ceiling on the harmful content we can, and should, classify and act on.
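In spirit, and simplifying heavily, a multimodal decision looks something like the sketch below. Real systems fuse learned representations rather than hand-weighted scores, and the three scoring functions here are hypothetical placeholders:

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

@dataclass
class Video:
    frames: Sequence          # the ordered frame sequence, not isolated stills
    transcript: str           # what is actually being said
    caption: str              # the text posted with and around the video

def moderate_multimodal(
    video: Video,
    score_visual_sequence: Callable[[Sequence], float],  # hypothetical temporal visual model
    score_transcript: Callable[[str], float],            # hypothetical speech/audio model
    score_caption: Callable[[str], float],               # hypothetical text model
    weights: Tuple[float, float, float] = (0.5, 0.3, 0.2),
    threshold: float = 0.6,
) -> bool:
    """Blend visual, aural and textual signals into a single decision."""
    visual = score_visual_sequence(video.frames)   # what happens over time
    audio = score_transcript(video.transcript)     # the way things are described
    text = score_caption(video.caption)            # the context around the upload
    combined = weights[0] * visual + weights[1] * audio + weights[2] * text
    return combined >= threshold

# The educational opioid-crisis video again: the imagery alone looks risky,
# but the narration and caption pull the combined score below the threshold.
clip = Video(frames=["pills_closeup", "charts"],
             transcript="explaining the opioid crisis",
             caption="public health explainer")
print(moderate_multimodal(clip, lambda f: 0.9, lambda t: 0.1, lambda c: 0.1))
# -> False: the same imagery a frame-by-frame filter would block survives in context
```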
It’s not only about identifying harm, but about identifying anything at all! Unimodal classifiers, by definition, ignore every signal outside their single modality. To put this to the test, we ran a dataset of 3,000 randomly selected user-generated videos through a unimodal system, which flagged only 170 videos as potentially harmful. Running the same dataset through a multimodal machine learning model flagged nearly 400 – which means the unimodal system missed around 55% of the potentially harmful content the multimodal model caught.
Functionally, the ‘99% accurately classified’ claim ends up looking more like 45% of harmful videos actually caught. And the problem scales – the bigger the dataset, the knottier the moderation challenge.
When running a 2 million impression campaign, appearing next to even just 1% of harmful videos would still mean 20,000 unsafe impressions. Now imagine 55% of the harmful content going undetected: not only might your ads be showing next to the harmful content that was missed, you could also be missing out on a large part of the inventory you wanted to appear against.
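To make the stakes concrete, here is the back-of-the-envelope arithmetic behind those figures. The multimodal count is only given as ‘nearly 400’ above, so 378 is assumed here purely for illustration; the other numbers are the ones quoted in this piece:

```python
# Back-of-the-envelope arithmetic with the figures quoted above.
unimodal_flagged = 170
multimodal_flagged = 378          # assumed stand-in for 'nearly 400'

catch_rate = unimodal_flagged / multimodal_flagged
print(f"unimodal catch rate: {catch_rate:.0%}")     # ~45% of what multimodal finds
print(f"missed: {1 - catch_rate:.0%}")              # ~55% slips through

impressions = 2_000_000
harmful_adjacency = 0.01                            # even a 1% rate of harmful adjacency
print(f"unsafe impressions: {impressions * harmful_adjacency:,.0f}")   # 20,000
```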
We know that brands cannot afford a single unsafe impression: Joe Fried, VP of biddable media at Inspire Brands, told us that “A brand stands to lose everything if there's a brand safety problem [...] One impression is too many.” He made it clear that brand suitability can affect every aspect of a brand, and wipe out hard-earned consumer trust in an instant. Brands know that the ecosystem is evolving rapidly and standards are changing: when told about the results of our research, Joe admitted he was not surprised at the discrepancy between unimodal and multimodal solutions. What worked yesterday may fail you tomorrow. The right AI system can help you unlock the power of your inventory in a safe and responsible way. Really, it’s a no-brainer.
Today, we have an opportunity to rise to the challenge of video, and the tools are on our side: as AI continues getting more advanced, we’ll be able to move towards sophisticated approaches that understand the meaning of media and act accordingly. But this can only happen if we recognise that the old methods have become outdated. The scope of this challenge is massive; companies must begin taking steps now. With openness, creativity and care, we can make sure that online video is a force for good.