Case Study 1

Evidence-backed. AI-adjudicated. Human-reviewed.

Case Study: SEO Audit of Ahrefs.com

A deep-audit benchmark on a demanding public site, written for operators who want to know whether the pipeline is accurate enough to trust, not merely polished enough to market.

This page doubles as a real SEO audit example based on a deep audit of Ahrefs. It remains a case study first: the point is to show what the pipeline found, what held up under adjudication, and where the benchmark stays conservative.

If you are looking for an SEO audit report example, this benchmark shows the methodology, findings, evidence trail, and the limitations that keep the result publication-safe.

Publishing note

These public benchmark pages and samples are publishing artifacts, not anonymized customer stories. They exist so buyers can inspect the output quality before more rollout proof is published.

What This Means for Your Company

Even a leading SEO platform like Ahrefs shows measurable gaps in AI visibility. The zero structured data score means AI engines must rely entirely on unstructured text extraction, which is less reliable and harder to verify.

Despite near-perfect citation frequency, the absence of schema markup limits how accurately AI systems can represent Ahrefs' distinct products and capabilities in generated answers.

These gaps are within Ahrefs' control. Implementing structured data and entity definitions would directly address the four zero-scoring signals and could materially move the overall GEO Score.

GEO Score

77

Overall AI visibility readiness out of 100.

Strict precision

0.676

Fully correct findings only: 23 correct out of 34 scored findings.

Weighted precision

0.794

Fully correct findings plus half credit for partials: 27 weighted correct out of 34.

Operational FP share

0.088

Outright wrong findings among scored findings: 3 of 34.

GEO Dimension Scores

Dimension	Score	Interpretation
Authority Signals	88	Strong backlink profile, web vitals, and crawl efficiency
Entity Clarity	82	Brand is recognizable but lacks formal entity disambiguation
Extractability Index	76	Content is parseable but missing structured data layers
Citation Readiness	71	Citable content exists but lacks schema reinforcement
Answerability	68	AI engines can find answers but coverage gaps remain

GEO Audit Key Findings

Zero structured data across all 30 crawled pages — no JSON-LD, Microdata, or RDFa detected
Strong technical foundations: Core Web Vitals, crawl efficiency, and canonical integrity all scored 100
AI citation frequency near-perfect at 99 — brand already frequently cited by AI engines
Metadata clarity inconsistent at 56.67 — meta descriptions and title tags missing or poor on some pages
Featured snippet coverage low at 42.50 — content not structured for snippet capture at scale
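
The structured-data finding is the kind of deterministic check a reader can reproduce. The following is a minimal sketch, assuming the raw HTML of each crawled page is available. The class and function names are hypothetical, and this is not the Citemeter pipeline's actual detector. It flags the three markup families named in the finding using only the Python standard library.

```python
from html.parser import HTMLParser


class StructuredDataDetector(HTMLParser):
    """Hypothetical detector for JSON-LD, Microdata, and RDFa markup."""

    def __init__(self):
        super().__init__()
        self.found = {"json_ld": False, "microdata": False, "rdfa": False}

    def handle_starttag(self, tag, attrs):
        # Boolean attributes such as itemscope arrive with a value of None.
        attr = {name: (value or "") for name, value in attrs}
        # JSON-LD lives in <script type="application/ld+json"> blocks.
        if tag == "script" and attr.get("type", "").strip().lower() == "application/ld+json":
            self.found["json_ld"] = True
        # Microdata is signalled by itemscope / itemtype / itemprop.
        if any(key in attr for key in ("itemscope", "itemtype", "itemprop")):
            self.found["microdata"] = True
        # RDFa is approximated by typeof / vocab. The bare `property`
        # attribute is skipped on purpose because Open Graph meta tags use it.
        if any(key in attr for key in ("typeof", "vocab")):
            self.found["rdfa"] = True


def detect_structured_data(html: str) -> dict:
    """Return which markup families appear in one page's HTML."""
    detector = StructuredDataDetector()
    detector.feed(html)
    return detector.found


def structured_data_coverage(pages: dict) -> float:
    """Share of pages (0.0 to 1.0) with at least one markup family."""
    if not pages:
        return 0.0
    hits = sum(1 for html in pages.values() if any(detect_structured_data(html).values()))
    return hits / len(pages)
```

A coverage of 0.0 across every crawled page would correspond to the "zero structured data" finding. Because a check like this reads the stored HTML, it can disagree with a live spot check if the page changed after the crawl.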

What this means for your site

See how your site compares to Ahrefs

Turn curiosity from the benchmark into a real audit on your own site. The same workflow gives your team a scored report, linked evidence, and a prioritized implementation handoff you can actually ship.

Executive Summary

This benchmark asked a narrow question: when the Citemeter GEO audit pipeline emits a finding, how often is that finding correct enough to rely on? Ahrefs was chosen because it is a large, public, content-rich site with enough product and editorial surface area to expose weak heuristics, weak evidence chains, and overscoped recommendations.

The authoritative run was a deep audit against ahrefs.com, using a bounded crawl, screenshots, LLM analysis, query simulation, and evidence packaging. The result was not perfection, but it was materially better than guesswork: most findings were confirmed, outright false positives were limited, and the strongest performance came from deterministic, evidence-backed checks.

SEO Audit of Ahrefs.com

The target was ahrefs.com under the deep audit package for the authoritative benchmark run, using a bounded same-origin crawl rather than an unbounded sweep of the whole property.
Coverage included page capture, screenshots, metadata and structured-data checks, LLM analysis, simulated query testing, packaged report output, and preserved evidence exports.
Ahrefs is a strong benchmark because it is a large public site with product, editorial, and comparison surfaces that can expose weak heuristics, weak evidence chains, and overscoped recommendations.
This is still a bounded deep audit of one public site and one authoritative run, not a universal claim about every Ahrefs page or about how every technical SEO audit should score.

Why This Benchmark Matters

SEO and GEO tools are often sold on narrative rather than validation. That creates a trust problem: a report can look persuasive while still emitting findings that do not survive review. This benchmark matters because it tests whether the pipeline deserves operational use on a real public site, not whether it can produce attractive output.

Evidence traceability matters for the same reason. Buyers and internal stakeholders need to be able to inspect the path from page record to screenshot to finding to recommendation. Without that chain, confidence becomes branding rather than proof.

Test Setup

Target: ahrefs.com
Configuration: deep audit package
Scope: bounded same-origin crawl with page and screenshot caps
Capture: screenshots, LLM analysis, simulated query testing, and evidence packaging
Artifacts: report PDF, evidence ZIP, DB exports, storage object records, and adjudication outputs

Validation Workflow

Audit pipeline: the system crawled the site, captured screenshots, extracted content, ran analysis, simulated query responses, and stored the resulting artifacts.
Codex verification: the emitted findings were reviewed against the stored benchmark artifacts to identify obvious true positives, false positives, exclusions, and escalation items.
Claude adjudication: ambiguous items were adjudicated conservatively, with particular attention to scope, wording, evidence strength, and unresolved contradictions.
Human sign-off: two interpretive findings were resolved as partials, while the homepage schema contradiction was preserved as an unresolved caveat rather than silently flattened.

What Our Audit Found

Structured-data presence and entity-alignment checks were among the clearest signals in the benchmark, although the homepage schema contradiction remains unresolved, so the affected findings are scored conservatively.
One copy-positioning finding was narrowed to a partial because Ahrefs' positioning looked clear but dense, so the stronger confusion claim did not survive as written.
A trust-and-credibility finding also stayed partial: obvious testimonial proof was limited in the analyzed excerpts, while credibility came through more clearly in guides and topic coverage.
The strongest performance came from deterministic checks such as metadata, structured-data presence, and mobile-render outputs; interpretation-heavy findings produced more partials.
Query-dependent citation findings carried an explicit caveat because the run snapshot showed stale query-signal mismatches, so those items were excluded rather than overstated.

Final Findings Overview

Classification	Count	Meaning
TP	23	Fully correct findings where the emitted issue matched the available evidence.
FP	3	Findings that were emitted but did not hold up under review.
Partial	8	Findings that captured a real direction but overstated scope, certainty, or wording.
Excluded	2	Findings removed from scoring because the benchmark itself could not test them fairly.

Metrics

Strict precision = 0.676. This counts only fully correct findings, so it is the hardest accuracy threshold in the benchmark.

Weighted precision = 0.794. This gives half credit to partial findings, which is useful when an emitted finding was directionally helpful but too broad or too certain.

Operational FP share = 0.088. This shows how often emitted findings were outright wrong among scored findings, which is the most useful false-positive measure this single benchmark can support.
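
The three metrics follow directly from the counts in the findings overview. The sketch below reproduces the arithmetic; the class and method names are illustrative and are not taken from the pipeline's code. Excluded findings are left out of the denominator, which matches the "34 scored findings" stated on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdjudicationCounts:
    """Final classification counts from the findings overview table."""
    tp: int
    fp: int
    partial: int
    excluded: int

    @property
    def scored(self) -> int:
        # Excluded findings are removed from scoring entirely.
        return self.tp + self.fp + self.partial

    def strict_precision(self) -> float:
        # Only fully correct findings count.
        return self.tp / self.scored

    def weighted_precision(self) -> float:
        # Partials earn half credit.
        return (self.tp + 0.5 * self.partial) / self.scored

    def operational_fp_share(self) -> float:
        # Outright wrong findings among scored findings.
        return self.fp / self.scored


ahrefs = AdjudicationCounts(tp=23, fp=3, partial=8, excluded=2)
print(ahrefs.scored)                            # 34
print(round(ahrefs.strict_precision(), 3))      # 0.676
print(round(ahrefs.weighted_precision(), 3))    # 0.794
print(round(ahrefs.operational_fp_share(), 3))  # 0.088
```

The weighted figure comes from 23 + 0.5 × 8 = 27 weighted correct findings out of 34.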

Why This Works as an SEO Audit Example

Readers looking for a website audit example or site audit report sample usually want something more credible than a template. This page is useful because it preserves the uncomfortable parts as well as the wins: a real public benchmark target, a real evidence trail, real adjudication, and a balanced accounting of correct, partial, false-positive, and excluded findings.

It uses a real public site rather than a toy sample, so the findings and caveats look like the tradeoffs teams actually face in production audits.
It preserves the evidence chain from page and screenshot records through findings, recommendations, and packaged report artifacts.
It reports the full balance of TP, FP, Partial, and Excluded outcomes instead of flattening the benchmark into marketing-only wins.
It shows real adjudication: ambiguous items were narrowed, excluded, or left conservative instead of being rewritten into certainty.

If you want to compare this benchmark against the product's report surfaces, review the sample report and the sample evidence index.

What an Audit Report Should Include

A strong audit report is not just a list of issues. It should show the boundaries of the audit, the evidence behind each claim, and the limitations that affect how confidently a reader should act. The Ahrefs benchmark is useful precisely because it keeps those reporting disciplines visible.

Scope: define the target, audit package, crawl boundaries, and what the report does not claim to cover.
Evidence: show the screenshots, exports, and traceability chain behind the findings.
Prioritization: separate fully supported issues from partials, false positives, and exclusions.
Methodology: explain how the audit ran, how findings were reviewed, and how adjudication changed the final read.
Limitations: keep methodological caveats visible, especially where query data, schema evidence, or extraction quality were constrained.
Next actions: tell the reader what to validate next and what could still move the benchmark.

Key Patterns from the Benchmark

Most emitted findings were fully correct once they were checked against the stored run artifacts and live spot checks.
Outright false positives were limited, which matters more operationally than theoretical false-positive claims this benchmark cannot support.
Partial findings clustered around interpretation-heavy judgments, excerpt truncation, and unresolved stored/live contradictions.
Deterministic signals such as metadata, structured-data presence, and mobile-render outputs held up under review more reliably than some LLM-mediated additional findings.
Traceability was generally strong at the artifact level, even where some signal rows lacked direct evidence references.

What This Means for Users

The product was useful and generally credible on this run, especially for deterministic issues that can be traced back to stored evidence.
It was not perfect. Some findings needed narrowing, and a few did not survive review.
The strongest layer is evidence-backed technical detection. The weaker layer is interpretive judgment where page context, excerpt completeness, or wording nuance matters.

Limitations

Stale query-signal caveat: 12 query rows persisted, but some query-dependent signal rows still showed total_queries = 0. Those findings were excluded from scoring.
The query outputs in this benchmark were simulated AI-engine responses, not live ChatGPT, Perplexity, or Google behavior.
Single-benchmark limitation: this was one benchmark on one site and one authoritative run. It does not establish cross-site generalization.
Homepage schema stored/live contradiction unresolved: the findings tied to signal:entity_disambiguation:1 and signal:knowledge_graph_alignment:1 are scored conservatively rather than treated as settled.
Analyzed pages were truncated during extraction. That weakens some LLM-mediated additional findings.
No classical false-positive rate is computable from one audit because there is no enumerated true-negative universe.
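
The stale query-signal caveat can be expressed as a simple consistency check. This is a sketch under stated assumptions: only total_queries appears in the benchmark notes, while the query_dependent flag, the row layout, the example signal names, and the function name are hypothetical. A query-dependent signal row that reports zero queries while query rows were persisted is treated as stale and set aside rather than scored.

```python
from typing import Dict, List, Tuple

SignalRow = Dict[str, object]


def split_stale_signals(
    signal_rows: List[SignalRow], persisted_query_rows: int
) -> Tuple[List[SignalRow], List[SignalRow]]:
    """Return (scorable, excluded) rows based on the query-count mismatch."""
    scorable: List[SignalRow] = []
    excluded: List[SignalRow] = []
    for row in signal_rows:
        # Stale means: the signal depends on queries, queries were stored,
        # yet the signal row still claims it saw none of them.
        stale = (
            bool(row.get("query_dependent"))
            and persisted_query_rows > 0
            and row.get("total_queries") == 0
        )
        (excluded if stale else scorable).append(row)
    return scorable, excluded


# Hypothetical rows for illustration; not the benchmark's real signal IDs.
rows = [
    {"signal": "example:query_signal_a", "query_dependent": True, "total_queries": 0},
    {"signal": "example:query_signal_b", "query_dependent": True, "total_queries": 12},
    {"signal": "example:metadata_signal", "query_dependent": False, "total_queries": 0},
]
scorable, excluded = split_stale_signals(rows, persisted_query_rows=12)
print([r["signal"] for r in excluded])  # ['example:query_signal_a']
```

Excluding the mismatched rows keeps them out of both the numerator and the denominator, so they neither inflate nor depress the precision figures.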

Falsifiability

This case study is only useful if it can be disproved or revised when better evidence appears. The benchmark should move if contrary run-time proof, cleaner query ordering, or fuller page extraction changes the review basis.

If the stored homepage evidence proves Organization schema was present during the run, two conservative partial findings should move materially.
If query-signal ordering is repaired and rerun cleanly, the currently excluded AI-visibility findings could change status.
If longer page extraction preserves the omitted context, some interpretation-heavy additional findings could improve or degrade.
If repeated on more public benchmark sites, the overall credibility assessment could move up or down rather than staying fixed to this single run.

What Could Change These Results

Homepage structured-data confirmation could change two conservative findings.
Fixing query-signal ordering could change excluded AI-visibility findings.
Longer extraction could change some additional findings.
Repeating on more benchmark sites could move the credibility assessment.

Evidence Appendix

The technical appendix is preserved in the repo as docs/case-studies/ahrefs-accuracy-validation-evidence.md. The buyer-facing page keeps the proof chain legible without exposing internal run identifiers or storage paths inline.

The strongest reviewer-friendly traceability chain in the run goes from a high-value product page, to its stored page record and screenshot capture, to the related structured-data signal, to linked recommendations, and finally to the packaged report and evidence archive.
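
One way to make that chain checkable is to model each link explicitly and report which links are missing. The sketch below is illustrative only: the field names and placeholder values are hypothetical and do not reflect the internal run identifiers or storage paths, which this page deliberately does not expose.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvidenceChain:
    """Hypothetical record linking one page to its downstream artifacts."""
    page_url: str
    page_record: Optional[str] = None
    screenshot: Optional[str] = None
    signal: Optional[str] = None
    recommendations: List[str] = field(default_factory=list)
    report: Optional[str] = None
    evidence_archive: Optional[str] = None

    def missing_links(self) -> List[str]:
        """Names of links that are empty, in chain order."""
        links = {
            "page_record": self.page_record,
            "screenshot": self.screenshot,
            "signal": self.signal,
            "recommendations": self.recommendations,
            "report": self.report,
            "evidence_archive": self.evidence_archive,
        }
        return [name for name, value in links.items() if not value]

    def is_traceable(self) -> bool:
        return not self.missing_links()


# Placeholder values for illustration only.
chain = EvidenceChain(
    page_url="https://example.com/product",
    page_record="page-record-placeholder",
    screenshot="screenshot-placeholder",
    signal="signal-placeholder",
    recommendations=[],
    report="report-placeholder.pdf",
    evidence_archive="evidence-placeholder.zip",
)
print(chain.missing_links())  # ['recommendations']
print(chain.is_traceable())   # False
```

A check like this would surface the case noted in the key patterns, where traceability holds at the artifact level but an individual row lacks a direct evidence reference.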

For a broader library of benchmark pages, return to case studies. For product-facing report format references, the sample report and sample evidence index show how the audit output is packaged.

Technical validation details

Internal validation materials retain the authoritative run identifiers, exported datasets, packaged downloads, and reviewer workpapers so the benchmark can be replayed without exposing raw engineering artifacts on the buyer-facing page.

Run scope

Bounded deep audit with crawl limits, screenshots, extracted content, simulated query checks, and packaged deliverables.

Evidence retention

Internal validation materials keep the run identifiers, exported datasets, and packaged downloads together for reviewer replay.

Review chain

Verification notes, adjudication decisions, and final scoring outputs were preserved before publication.

Conclusion

The safest publication-ready conclusion is that this benchmark supports moderate credibility on this site and run. The pipeline was strong enough to be useful and serious, especially on deterministic, evidence-backed issues, but it is not yet a basis for universal claims about performance on every site or every class of finding.