Content Moderation · Feb 11, 2026

Perspective API toxicity: how the scores actually work

Ruud Visser

Founder & CEO

Perspective API returns a toxicity score between 0 and 1 for any text you send it. But that number is more nuanced than most developers realize.

The TLDR:

  • Perspective API toxicity scores measure probability, not severity: a 0.9 doesn't mean "very toxic," it means the model is confident it resembles comments humans tagged as toxic.
  • The model struggles with context, sarcasm, and identity terms: it can't tell genuine hate from quoted speech, and it over-flags content mentioning marginalized groups.
  • Threshold choice matters more than the score itself: a single cutoff for all use cases is the most common mistake developers make with this API.

What the Perspective API toxicity score actually measures

When you send text to Perspective API, the TOXICITY attribute returns a single number between 0.0 and 1.0. Most developers treat this as a "toxicity percentage." It isn't.

The score represents the probability that a human reader would perceive the comment as toxic. Perspective defines toxic as "a rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion." So a score of 0.85 means roughly 85% of human raters would consider the text toxic by that definition.
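
To make that concrete, here is a minimal sketch of a TOXICITY request in Python using the requests library. The endpoint and response fields follow the Comment Analyzer API; the key is a placeholder and error handling is left out.

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: a key with the Comment Analyzer API enabled
URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity_score(text: str) -> float:
    body = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
        "languages": ["en"],
    }
    resp = requests.post(URL, params={"key": API_KEY}, json=body, timeout=10)
    resp.raise_for_status()
    # summaryScore.value is the probability-style number discussed in this article
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```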

That word "perceived" does a lot of heavy lifting. The model doesn't determine whether something is harmful in any objective sense. It predicts how a crowd of annotators would react to it. Those annotators labeled comments drawn from The New York Times comment section, Wikipedia talk pages, and online forums.

This is an important distinction because different communities have wildly different norms for what counts as toxic. What passes as normal banter in a gaming lobby would get flagged immediately in a newspaper's comment section. Perspective API doesn't know which community you're moderating. It applies one standard across all of them.

For a full overview of how the API works under the hood (CNNs, training data, and all available attributes), see our guide: What is Perspective API.


Toxicity vs. the other attributes

TOXICITY is Perspective API's primary attribute, but it's not the only one relevant to harmful content. Here's how the toxicity-related attributes differ from each other.

TOXICITY is the broadest category, covering anything rude, disrespectful, or unreasonable. Think of it as the widest net: it catches insults, threats, and profanity, but also sweeps up sarcasm, strong disagreement, and blunt criticism.

SEVERE_TOXICITY is a stricter version that only flags content likely to be considered very hateful, aggressive, or disrespectful. It has far fewer false positives than TOXICITY but misses more subtle harm. If TOXICITY is a smoke detector, SEVERE_TOXICITY only goes off when there's an actual fire.

INSULT specifically targets demeaning or inflammatory language aimed at a person or group. PROFANITY catches swear words and obscene language regardless of intent. THREAT detects statements that describe an intention to harm. IDENTITY_ATTACK focuses on hateful content targeting someone's race, religion, gender, sexuality, or similar identity.
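
These attributes can be requested together in a single call; you don't need one request per attribute. A minimal sketch, assuming the same Comment Analyzer endpoint and your own API key (error handling omitted):

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def analyze(text, attributes=("TOXICITY", "SEVERE_TOXICITY", "INSULT", "THREAT")):
    """Request several production attributes in one call and return
    a dict of {attribute name: summary score}."""
    body = {
        "comment": {"text": text},
        "requestedAttributes": {name: {} for name in attributes},
        "languages": ["en"],
    }
    resp = requests.post(URL, params={"key": API_KEY}, json=body, timeout=10)
    resp.raise_for_status()
    scores = resp.json()["attributeScores"]
    return {name: scores[name]["summaryScore"]["value"] for name in attributes}
```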

Attribute Comparison

TOXICITY (broad net) vs. SEVERE_TOXICITY (tight filter)

TOXICITY: the widest net, with more false positives. It catches:

  • Explicit insults and slurs
  • Direct threats
  • Profanity (any intent)
  • Strong opinions and blunt criticism
  • Sarcasm the model misreads
  • Informal or aggressive tone

SEVERE_TOXICITY: a tight filter, with fewer false positives. It catches:

  • Explicit hate speech
  • Direct threats of violence
  • Extreme harassment
  • Severe identity attacks

The production attributes (TOXICITY, SEVERE_TOXICITY, IDENTITY_ATTACK, INSULT, PROFANITY, THREAT) are stable and well-tested. Perspective also offers experimental attributes like ATTACK_ON_COMMENTER, INFLAMMATORY, SPAM, and OBSCENE. These are less reliable and can change without notice.

For the full list of all attributes (including experimental ones), check our What is Perspective API article, which includes a complete attribute table.


How to read the scores (and what most developers get wrong)

Here's how to read the score ranges, and where nearly every developer integrating Perspective API toxicity for the first time gets tripped up.

0.0 to 0.3: likely not toxic. The model is fairly confident the text wouldn't bother most readers. You can generally let these through without review.

0.3 to 0.7: the uncertain middle. This is where the model is essentially guessing. A score of 0.5 is a coin flip. The text might contain strong language used positively ("that was f***ing amazing"), legitimate criticism that sounds harsh, or sarcasm the model can't parse. Most false positives and false negatives live in this range.

0.7 to 0.9: probably toxic. The model sees patterns that match its training data for toxic content. Most platforms use a threshold somewhere in this range for automated flagging or human review queues.

0.9 to 1.0: high confidence. The model is very sure the text resembles toxic comments. But here's the catch: a 0.92 score on "you smell bad and are stupid" and a 0.95 on "I am going to kill you" both land in this range. The score tells you confidence of toxicity, not severity of harm. A mild insult and a death threat can score nearly the same.

This is the single biggest misunderstanding developers have with Perspective API toxicity scores. The number doesn't measure how bad something is. It measures how sure the model is that humans would call it toxic.
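
One way to keep that distinction straight in code is to route by range instead of a single cutoff. A minimal sketch: the boundaries mirror the ranges above, and the lane names are purely illustrative, not part of the API.

```python
def route_by_toxicity(score: float) -> str:
    """Map a TOXICITY summary score to a moderation lane.

    Boundaries mirror the ranges discussed above; the lane names are
    illustrative, not part of the Perspective API.
    """
    if score < 0.3:
        return "approve"    # likely not toxic: let through
    if score < 0.7:
        return "review"     # uncertain middle: human judgment needed
    if score < 0.9:
        return "flag"       # probably toxic: automated flag or review queue
    return "escalate"       # high confidence it resembles toxic content (not severity)
```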

Common mistakes:

Treating the score as a severity scale. A high score doesn't necessarily mean the content is more dangerous. It means the model is more confident in its classification.

Using a single threshold for all actions. Sending everything above 0.7 to the same queue means your moderators spend equal time reviewing mild profanity and genuine threats.

Ignoring the middle range. Many developers only act on high scores and auto-approve low ones. The 0.3 to 0.7 range contains the most context-dependent content, exactly the kind that needs human judgment.

Not combining attributes. Running only TOXICITY misses important nuance. A comment might score 0.6 on TOXICITY but 0.9 on THREAT. Combining attributes gives you a much clearer picture.
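
Here's a sketch of what combining attributes can look like, using a dict of per-attribute scores like the analyze() helper above returns. Every cutoff here is illustrative, not a recommendation.

```python
def triage(scores: dict) -> str:
    """Decide an action from {attribute: summary score}.

    Narrow, high-signal attributes are checked before the broad TOXICITY
    score, so a comment at 0.6 TOXICITY but 0.9 THREAT gets escalated
    instead of waved through. All cutoffs are illustrative.
    """
    if scores.get("THREAT", 0.0) >= 0.8 or scores.get("SEVERE_TOXICITY", 0.0) >= 0.8:
        return "escalate"             # likely violence or extreme content
    if scores.get("IDENTITY_ATTACK", 0.0) >= 0.8:
        return "remove_and_review"    # hateful targeting of identity
    if scores.get("TOXICITY", 0.0) >= 0.85:
        return "hide_pending_review"  # broad toxicity, high confidence
    if scores.get("TOXICITY", 0.0) >= 0.7:
        return "review_queue"         # borderline: human decides
    return "approve"
```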


Choosing the right threshold for your use case

Perspective API doesn't recommend a single threshold. That's by design: the right cutoff depends entirely on what you're building and how much risk you're willing to accept.

Here's a practical framework.

For comment pre-screening on news sites or blogs (where false positives annoy readers and suppress discussion): use TOXICITY at 0.85 or higher for auto-hiding, and route 0.7 to 0.85 to a human review queue. Pair with SEVERE_TOXICITY at 0.8+ as a secondary filter.

For real-time chat moderation in gaming or social apps (where speed matters more than precision): use SEVERE_TOXICITY at 0.7+ for auto-removal and TOXICITY at 0.8+ for warnings. Gaming communities tolerate more rough language, so relying on SEVERE_TOXICITY as the primary filter reduces false positives on trash talk.

For child safety in educational platforms (where false negatives are unacceptable): use lower thresholds. TOXICITY at 0.6+, THREAT at 0.5+, SEXUALLY_EXPLICIT at 0.5+ for immediate flagging. Accept higher false positive rates in exchange for catching more harmful content.

For research or analytics (where you're measuring toxicity trends, not acting on individual comments): threshold matters less than consistency. Pick one (the 0.7 threshold is conventional in academic research) and apply it uniformly across your dataset.

Threshold Guide

Choosing a threshold by use case

  1. News sites and blogs: TOXICITY 0.85+ auto-hide; route 0.7 to 0.85 to human review. Priority: reduce false positives.
  2. Gaming and social chat: SEVERE_TOXICITY 0.7+ auto-remove; TOXICITY 0.8+ for warnings. Priority: speed, low false positives.
  3. Child safety and education: TOXICITY 0.6+, THREAT 0.5+, SEXUALLY_EXPLICIT 0.5+ for immediate flagging. Priority: catch everything.
  4. Research and analytics: fixed TOXICITY threshold at 0.7, applied uniformly across your dataset. Priority: consistency.

Whatever threshold you choose, test it against your own data first. Pull a sample of 200 to 500 comments from your platform, run them through the API, and manually review the results at different cutoffs. The ideal threshold is the one that matches your community's norms, not a number from a blog post.
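
A sketch of that evaluation step: given a sample you've already labeled by hand, sweep a few candidate cutoffs and compare precision and recall at each one.

```python
def sweep_thresholds(samples, cutoffs=(0.6, 0.7, 0.8, 0.85, 0.9)):
    """samples: list of (toxicity_score, is_violation) pairs, where
    is_violation is your own reviewers' verdict on that comment.

    Prints precision and recall at each candidate cutoff so you can see
    the trade-off on your own data rather than guessing.
    """
    for cutoff in cutoffs:
        tp = sum(1 for s, y in samples if s >= cutoff and y)
        fp = sum(1 for s, y in samples if s >= cutoff and not y)
        fn = sum(1 for s, y in samples if s < cutoff and y)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        print(f"cutoff={cutoff:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```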


Where the toxicity model breaks down

The Perspective API toxicity model has well-documented blind spots. Understanding them isn't optional if you're relying on these scores for moderation decisions.

Context blindness. Perspective scores text in isolation. It can't consider who's speaking, who they're speaking to, or what came before. "I'm going to kill it on stage tonight" and "I'm going to kill you" look very different to a human but can score similarly. Quoted speech ("He called me a [slur]") gets flagged even when the person is reporting abuse, not committing it.

Identity term bias. Research has consistently shown that text mentioning LGBTQ+ terms, Black English, or disability-related language gets higher toxicity scores than equivalent text without those terms. The model learned this from its training data: because hateful comments often contain identity terms, the model associates those terms with toxicity even in non-toxic contexts. "I'm a proud gay man" can score higher than "you're annoying" simply because of the identity term.

Language bias. A 2023 study analyzing multilingual Perspective API scores found that German-language text consistently receives higher toxicity scores than equivalent content in other languages. The same sentence translated between English and German produces meaningfully different scores. This bias affects any platform moderating multilingual content with a single threshold.

Sarcasm and coded language. "Oh, what a brilliant idea" reads as positive to the model. Meanwhile, coded hate speech that avoids explicit slurs flies under the radar. The model matches surface-level patterns. It doesn't understand intent.

The formality trap. Academic research has identified formality as a latent attribute in Perspective scores. Informal writing (slang, abbreviations, unconventional grammar) tends to score higher for toxicity even when the content itself is benign. This disproportionately affects younger users and non-native English speakers.

These aren't edge cases. They're structural limitations of how the model was built. Perspective uses a CNN trained to match patterns in labeled examples, and pattern matching is fundamentally different from understanding meaning.


The silent update problem

There's a practical issue with Perspective API toxicity scores that most developers don't discover until it causes problems: the underlying models change without warning.

Google periodically retrains and updates Perspective's models to improve accuracy. But there's no versioning system. You can't pin to a specific model version, and you don't get notified when the model changes. The same text sent to the API six months apart can return different scores.

Research published in 2025 demonstrated this concretely. When researchers rescored text from the RealToxicityPrompts benchmark using the current version of Perspective API, scores dropped significantly compared to the original evaluations. One model's toxicity ranking shifted by 11 positions in the HELM benchmark purely because of Perspective API model updates that happened between evaluation dates.

For platforms using Perspective API toxicity scores to trigger automated actions (auto-hiding comments, muting users, escalating to review), this means your moderation behavior can change without you deploying any code. A comment that was safely below your threshold last month might suddenly exceed it, or vice versa.

Google calibrates scores to minimize this drift, but calibration isn't elimination. If you're building on Perspective API, you need a monitoring system that detects when score distributions shift.
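
One lightweight approach: keep a fixed "canary" set of comments along with their original scores, rescore the same texts on a schedule, and alert when the distribution moves. A sketch, with an illustrative tolerance and threshold:

```python
import statistics

def detect_drift(baseline_scores, current_scores, mean_tolerance=0.05, threshold=0.85):
    """Compare a canary set's stored scores with freshly fetched ones.

    baseline_scores: scores recorded when you last tuned your thresholds.
    current_scores:  today's scores for the exact same texts.
    mean_tolerance and threshold are illustrative values to tune.
    """
    shift = statistics.mean(current_scores) - statistics.mean(baseline_scores)
    crossings = sum(
        1 for old, new in zip(baseline_scores, current_scores)
        if (old >= threshold) != (new >= threshold)  # comments that changed sides
    )
    print(f"mean shift: {shift:+.3f}, threshold crossings: {crossings}")
    return abs(shift) > mean_tolerance or crossings > 0
```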


How modern AI handles toxicity differently

Perspective API was groundbreaking when it launched in 2017. It was one of the first freely available ML tools for toxicity detection, and it set the standard for how platforms thought about content moderation.

But the approach has fundamental constraints. A CNN trained on pattern matching against historical labels can score text for surface-level toxicity. It can't understand context, adapt to your platform's specific policies, or distinguish between intent and impact.

A new generation of AI moderation tools takes a different approach. Instead of returning a generic toxicity probability, these tools evaluate content against configurable policy rules. You define what counts as harmful in your community, and the system enforces those specific standards.

This is where tools like Lasso Moderation represent a shift in how toxicity detection works. Rather than asking "does this look like something humans labeled as toxic?", modern LLM-based systems ask "does this violate this specific platform's policies, given the context of the conversation?" The difference is subtle, but it changes everything about accuracy, false positive rates, and how much human review you actually need.

Context-aware moderation can distinguish between someone reporting abuse and committing it. It can recognize sarcasm. It can apply different standards to different parts of your platform. And it can adapt as your policies evolve without waiting for a model retrain from Google.

For a full comparison of alternatives to Perspective API (including pricing, features, and migration considerations), see our complete guide to finding an alternative.

Complete Guide

Ready to evaluate alternatives?

If these limitations are affecting your moderation quality, it may be time to look beyond Perspective API. We compare the top alternatives, including pricing, features, and what to prioritize in your migration.

Read the full guide

Frequently asked questions

What does a Perspective API toxicity score of 0.7 mean?

A score of 0.7 means the model estimates about 70% of human raters would consider the text toxic. It's not a measure of how harmful the content is. Many platforms use the 0.7 to 0.8 range as a threshold for flagging content for human review, though the right threshold depends on your specific use case and tolerance for false positives.

Is Perspective API toxicity the same as hate speech detection?

No. Perspective's TOXICITY attribute broadly covers rude, disrespectful, or unreasonable content. Hate speech is a narrower category. The IDENTITY_ATTACK attribute is closer to hate speech detection, but even that doesn't map perfectly to legal or policy definitions. For hate speech specifically, you'll typically need to combine multiple attributes or use a tool designed for policy-based classification.

Can Perspective API detect sarcasm or context?

Not reliably. Perspective scores each comment in isolation without considering who said it, who it's directed at, or what the conversation is about. Sarcastic praise ("Oh, brilliant take, really") typically scores low for toxicity, while sarcastic insults may score high but not consistently. If your platform relies heavily on sarcastic or ironic communication, expect higher false positive and false negative rates.

Why does the same comment score differently when I send it again weeks later?

Perspective API's models are periodically updated without versioning. Google retrains the models to improve accuracy and reduce bias, but this means scores can shift over time. If you're storing scores for comparison or analytics, be aware that rescoring the same text at a later date may return different results.

Is Perspective API going away?

Yes. Google announced that Perspective API will sunset after December 31, 2026, with no migration path provided. If you're currently using it for toxicity scoring, now is the time to evaluate alternatives. Our pricing article covers the cost implications, and our pillar guide walks through the full alternatives landscape.

How Lasso Moderation Can Help

At Lasso, we believe that online moderation technology should be affordable, scalable, and easy to use. Our AI-powered moderation platform allows moderators to manage content more efficiently and at scale, ensuring safer and more positive user experiences. From detecting harmful content to filtering spam, our platform helps businesses maintain control, no matter the size of their community.

Book a demo here.
