Benchmark Series

Evidence-backed. AI-adjudicated. Human-reviewed.

Benchmark Series Summary: GEO Audit Accuracy Across Ahrefs, HubSpot, and Stripe

This is a 3-site benchmark series, not universal proof across the web. It exists so teams can evaluate whether the workflow is credible enough to use on their own domains, compare against other tooling, and understand the caveats before they move into the product flow.

The same bounded deep-audit workflow, evidence capture, Codex verification, adjudication layer, and human review standard were applied across Ahrefs, HubSpot, and Stripe.

For technical buyers and advanced marketers, this serves as public benchmark evidence: strict precision, weighted precision, and operational FP share are shown together so the series does not cherry-pick the most flattering number.

Publishing note

These public benchmark pages and samples are publishing artifacts, not anonymized customer stories. They exist so buyers can inspect the output quality before more rollout proof is published.

Strict precision

0.594

Fully correct findings only: 60 true positives across 101 scored findings.

Weighted precision

0.757

True positives plus half credit for partials: 76.5 weighted-correct findings out of 101.

Operational FP share

0.079

Outright wrong findings among scored findings: 8 of 101.

Scored findings

101

Aggregate denominator: 108 total findings minus seven exclusions that could not be scored fairly.
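The three headline metrics follow directly from the adjudicated counts. The sketch below is illustrative rather than the pipeline's own code: the function names are assumptions, and the partial count of 33 is taken from the aggregate results reported later on this page.

```python
def scored_total(tp: int, fp: int, partial: int) -> int:
    """Scored findings: every finding except exclusions."""
    return tp + fp + partial


def strict_precision(tp: int, fp: int, partial: int) -> float:
    """Fully correct findings only."""
    return tp / scored_total(tp, fp, partial)


def weighted_precision(tp: int, fp: int, partial: int,
                       partial_credit: float = 0.5) -> float:
    """True positives plus half credit for each partial finding."""
    return (tp + partial_credit * partial) / scored_total(tp, fp, partial)


def operational_fp_share(tp: int, fp: int, partial: int) -> float:
    """Outright wrong findings as a share of scored findings."""
    return fp / scored_total(tp, fp, partial)


# Aggregate counts from this series: 60 TP, 8 FP, 33 Partial (7 Excluded).
TP, FP, PARTIAL = 60, 8, 33

print(scored_total(TP, FP, PARTIAL))                     # 101
print(round(strict_precision(TP, FP, PARTIAL), 3))       # 0.594
print(round(weighted_precision(TP, FP, PARTIAL), 3))     # 0.757
print(round(operational_fp_share(TP, FP, PARTIAL), 3))   # 0.079
```

Excluded findings never enter a denominator, which is why the metrics are computed over 101 findings rather than 108.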

What this means for your site

Want to benchmark your own site?

Turn curiosity from the benchmark into a real audit on your own site. The same workflow gives your team a scored report, linked evidence, and a prioritized implementation handoff you can actually ship.

Your site will be audited using the same methodology shown here; it is not scored as a direct side-by-side contest against these brands.

Unfamiliar with the methodology? Read what an AI visibility audit actually measures.

Executive Summary

Across the first three benchmark sites, the GEO audit system was generally credible and useful. The main quality issue was not outright wrong findings but partial findings: outputs that captured a real issue direction but overstated scope, certainty, or current live applicability.

The operational false-positive share remained low on all three sites. Deterministic findings such as metadata checks, stable schema absence, and preserved performance evidence held up better than interpretation-heavy findings that depended on excerpt completeness or wording.

Sites Included

  • Ahrefs: A large public SEO software site with product, editorial, and comparison surfaces that stress deterministic checks and evidence traceability.
  • HubSpot: A schema-drift-heavy SaaS benchmark that exposes where stored artifacts and current live markup no longer line up cleanly.
  • Stripe: A mixed product, pricing, support, newsroom, and customer-story benchmark that adds crawl-graph artifacts and newsroom schema drift.

Methodology

  1. Deep audit: each site used the deep audit package with a bounded crawl rather than an unbounded sweep of the entire domain.
  2. Evidence capture: the pipeline preserved screenshots, extracted content, DB exports, report artifacts, and evidence packages for later review.
  3. Codex verification: emitted findings were checked against stored artifacts plus current live spot checks where useful.
  4. Adjudication layer: ambiguous findings were classified conservatively as TP, FP, Partial, or Excluded rather than being rewritten into certainty.
  5. Human sign-off where needed: interpretation-heavy findings and unresolved contradictions remained visible in the final publication output.
  6. Exclusions preserved: query-signal stale exclusions and other pipeline-level caveats were excluded from scoring instead of being buried.
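The adjudication and exclusion steps above can be sketched as a small tally. This is a minimal illustration, not the pipeline's implementation: the `Verdict` enum, the `tally` helper, and the sample verdicts are hypothetical; only the four labels come from the methodology.

```python
from collections import Counter
from enum import Enum


class Verdict(Enum):
    """The four adjudication labels named in the methodology."""
    TP = "TP"
    FP = "FP"
    PARTIAL = "Partial"
    EXCLUDED = "Excluded"


def tally(verdicts):
    """Count verdicts and return (counts, scored), where scored omits exclusions."""
    counts = Counter(verdicts)
    scored = sum(n for verdict, n in counts.items()
                 if verdict is not Verdict.EXCLUDED)
    return counts, scored


# Hypothetical verdicts for five findings (illustrative data only).
sample = [Verdict.TP, Verdict.TP, Verdict.PARTIAL, Verdict.FP, Verdict.EXCLUDED]
counts, scored = tally(sample)

print(counts[Verdict.EXCLUDED])  # 1
print(scored)                    # 4
```

Keeping `Excluded` as an explicit label, rather than dropping those rows, is what lets the exclusions stay visible in the published counts while staying out of the scoring denominator.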

Per-Site Results

Site | TP | FP | Partial | Excluded | Strict precision | Weighted precision | Operational FP share
Ahrefs | 23 | 3 | 8 | 2 | 0.676 | 0.794 | 0.088
HubSpot | 17 | 3 | 13 | 2 | 0.515 | 0.712 | 0.091
Stripe | 20 | 2 | 12 | 3 | 0.588 | 0.765 | 0.059
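The per-site rows can be rolled up as a consistency check. The sketch below assumes the aggregate is pooled (counts summed across sites before dividing), which matches the published totals; the variable names are illustrative.

```python
# (TP, FP, Partial, Excluded) per site, as reported in the per-site results.
SITE_COUNTS = {
    "Ahrefs": (23, 3, 8, 2),
    "HubSpot": (17, 3, 13, 2),
    "Stripe": (20, 2, 12, 3),
}

# Sum each column across the three sites.
tp, fp, partial, excluded = (sum(col) for col in zip(*SITE_COUNTS.values()))
scored = tp + fp + partial
total = scored + excluded

# Pooled strict precision versus the unweighted mean of per-site values.
pooled_strict = tp / scored
mean_strict = sum(t / (t + f + p)
                  for t, f, p, _ in SITE_COUNTS.values()) / len(SITE_COUNTS)

print(tp, fp, partial, excluded)   # 60 8 33 7
print(scored, total)               # 101 108
print(round(pooled_strict, 3))     # 0.594
print(round(mean_strict, 3))       # 0.593
```

The pooled figure and the simple mean nearly coincide here because the three sites have almost the same number of scored findings (34, 33, and 34).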

Aggregate Results

Total findings: 108
Scored findings: 101
TP / FP / Partial / Excluded: 60 / 8 / 33 / 7
Aggregate metrics: Strict 0.594 | Weighted 0.757 | FP share 0.079

What the Results Mean

  • The system is generally credible and useful across this benchmark series, especially when findings are grounded in deterministic evidence capture.
  • False-positive share stayed low across all three sites, which matters operationally because outright wrong findings were limited relative to scored output volume.
  • The bigger quality issue is partial findings, not rampant wrong findings. Many outputs had a real factual core but overstated scope, certainty, or live applicability.
  • Deterministic findings such as metadata, stable schema absence, and preserved performance evidence were more reliable than interpretation-heavy findings.

Cross-Site Patterns

  • Stored/live schema contradictions appeared across sites and could not always be resolved, because the pipeline does not preserve runtime HTML snapshots.
  • Query-signal stale exclusions are systemic pipeline-level caveats and were excluded from scoring.
  • Content-heavy sites produced more Partial findings because interpretation and excerpt completeness mattered more.
  • Operational FP share stayed low across all three sites.
  • Strict precision varied more than weighted precision because ambiguity drove Partial classifications rather than pure false positives.
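The last pattern can be checked from the per-site counts. With half credit for partials, the gap between weighted and strict precision on a site equals half that site's Partial share, so sites with more Partial findings recover more of the difference. The sketch below is illustrative, and the helper name is an assumption.

```python
# (TP, FP, Partial) per site, as reported in the per-site results.
SITE_COUNTS = {
    "Ahrefs": (23, 3, 8),
    "HubSpot": (17, 3, 13),
    "Stripe": (20, 2, 12),
}


def metrics(tp: int, fp: int, partial: int):
    """Return (strict, weighted, partial_share) for one site."""
    scored = tp + fp + partial
    strict = tp / scored
    weighted = (tp + 0.5 * partial) / scored
    return strict, weighted, partial / scored


rows = {site: metrics(*counts) for site, counts in SITE_COUNTS.items()}

strict_values = [r[0] for r in rows.values()]
weighted_values = [r[1] for r in rows.values()]
strict_range = max(strict_values) - min(strict_values)
weighted_range = max(weighted_values) - min(weighted_values)

for site, (strict, weighted, partial_share) in rows.items():
    # The weighted-minus-strict gap is exactly half the Partial share.
    print(f"{site}: partial share {partial_share:.3f}, gap {weighted - strict:.3f}")

print(round(strict_range, 3), round(weighted_range, 3))  # 0.161 0.082
```

Across these three sites the spread in strict precision is roughly twice the spread in weighted precision.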

Limitations

  • This is a 3-site benchmark series, not universal proof across the web.
  • Operational FP share is not a classical false-positive rate because there is no enumerated true-negative universe.
  • Runtime HTML was not preserved, which is why stored/live schema contradictions remain unresolved in some cases.
  • Pipeline-level exclusions such as stale query-signal rows were excluded from scoring rather than normalized away.
  • The series is strong enough to publish as a benchmark program, but not a basis for universal claims about all sites or all audit tools.
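To make the second limitation concrete, the two quantities can be written side by side. The notation is illustrative: TN denotes correctly withheld findings, which this series does not enumerate.

```latex
\text{Operational FP share}
  = \frac{FP}{TP + FP + \text{Partial}}
  = \frac{8}{101} \approx 0.079
\qquad\text{versus}\qquad
\text{FPR} = \frac{FP}{FP + TN}
```

Because TN is not enumerated, the classical rate on the right cannot be computed from this data; the share on the left describes only the findings the system actually emitted.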

Links to Individual Benchmarks

Ahrefs

A large public SEO software site with product, editorial, and comparison surfaces that stress deterministic checks and evidence traceability.

Read the benchmark

HubSpot

A schema-drift-heavy SaaS benchmark that exposes where stored artifacts and current live markup no longer line up cleanly.

Read the benchmark

Stripe

A mixed product, pricing, support, newsroom, and customer-story benchmark that adds crawl-graph artifacts and newsroom schema drift.

Read the benchmark

GEO Scores

In addition to accuracy validation, each audit produced a GEO Score measuring overall AI visibility readiness. The full GEO score analysis, signal comparison, and cross-site findings are available in the research study.

Site | GEO Score
Stripe | 81
Ahrefs | 77
HubSpot | 76

Conclusion

This benchmark series is strong enough to publish as a benchmark program and as a product proof asset for evaluators who want evidence instead of narrative alone.

It is not yet a basis for universal claims about all sites, all audit categories, or all SEO and GEO workflows.