CertREV Performance Methodology — v1.0.0

Released 2026-05-07 · Status: current

This document defines, in full, how CertREV measures the impact of expert content certification on search performance. Every published claim — every sales-page bullet, every case-study chart, every "X% lift" — points back to a specific methodology version. This is v1.0.0.

If we change how a metric is computed, we increment the version and re-snapshot under the new version. We never silently change historical claims. Old versions remain published indefinitely so any previously cited result remains auditable.

Why a methodology page exists

Prospective customers and skeptics deserve to see the math, not just the output. Anecdotal "look at this one ranking jump" doesn't survive scrutiny. Diff-in-diff against a matched control cohort, with snapshot immutability and a public methodology, is the bar we hold ourselves to.

The methodology must be cleaner than the marketing.

Cohort definition

Each measurement compares two cohorts of URLs from the same brand:

  • Treated cohort. Every URL that received a CertREV certification stamp during the certification window. No cherry-picking — all certified URLs in the window enter the cohort, including those that later under-performed.
  • Control cohort. Uncertified URLs from the same brand, drawn from the same domain, that never received a stamp during the measurement window.

Treatment criteria

A URL is in the treated cohort if and only if all of the following hold at snapshot time:

  1. The URL has a submissions.publishedAt (live on the brand's domain).
  2. The URL has a submissions.certificationStampedAt inside the certification window (the contiguous date range during which all treatment URLs were stamped).
  3. The URL has at least 14 days of pre-stamp Google Search Console data and at least 28 days of post-stamp Google Search Console data within the measurement window.
  4. The URL is not in the cohort's exclusion list (see below).
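
As a sketch, the four criteria collapse into a single predicate. The TypeScript below mirrors the field names above; daysOfGscData and the certWindow argument are illustrative assumptions, not the reference implementation:

    type Submission = {
      url: string;
      publishedAt: Date | null;              // criterion 1
      certificationStampedAt: Date | null;   // criterion 2
    };
    // Assumed helper: days of GSC coverage inside the measurement window.
    declare function daysOfGscData(s: Submission, side: "pre" | "post"): number;

    function isTreated(
      s: Submission,
      certWindow: { start: Date; end: Date },
      exclusions: Set<string>,
    ): boolean {
      return (
        s.publishedAt !== null &&                         // 1: live on the domain
        s.certificationStampedAt !== null &&
        s.certificationStampedAt >= certWindow.start &&   // 2: stamped inside
        s.certificationStampedAt <= certWindow.end &&     //    the window
        daysOfGscData(s, "pre") >= 14 &&                  // 3: data floors
        daysOfGscData(s, "post") >= 28 &&
        !exclusions.has(s.url)                            // 4: not excluded
      );
    }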

Control matching

Each treated URL is matched to up to 5 control URLs from the same brand using a deterministic, two-key matching algorithm:

  1. URL age bucket. The control's publishedAt must be within ±90 days of the treated URL's publishedAt. Same-vintage content controls for the "older URLs rank better" SEO maturation effect.
  2. Pre-stamp position bucket. The control's mean Google position over the pre-window (computed below) must be within ±3 positions of the treated URL's pre-window mean position. Same starting line means the diff-in-diff is comparing realistic peers, not a top-3 page against a page-2 page.

If a treated URL has fewer than 3 candidate controls passing both keys, it is flagged as under-matched and excluded from the headline aggregate. Its per-URL impact card may still be displayed for transparency, marked clearly as under-matched.

Exclusion criteria

A URL is excluded from both cohorts if:

  • It receives a non-CertREV brand-side intervention during the measurement window (significant rewrite, redirect, deindex). Brands self-disclose these; any disclosed intervention triggers exclusion.
  • The URL was de-indexed by Google for any reason at any point in the measurement window.
  • The URL has fewer than 5 days of GSC data in either the pre-window or post-window. (We do not impute missing data.)
  • The URL is a homepage, category landing page, or PDF asset. The methodology applies to article-format certified content only.

Matching algorithm

For each treated URL, matching runs as below (a TypeScript rendering of the same two-key selection; daysBetween and markUnderMatched are assumed helpers):

    for (const t of treated) {
      const candidates = controlPool.filter((c) =>
        Math.abs(daysBetween(c.publishedAt, t.publishedAt)) <= 90 &&          // key 1
        Math.abs(c.preWindowMeanPosition - t.preWindowMeanPosition) <= 3.0 && // key 2
        !exclusions.has(c.url));
      // Keep the 5 controls closest in pre-window mean position.
      const matched = candidates
        .sort((a, b) =>
          Math.abs(a.preWindowMeanPosition - t.preWindowMeanPosition) -
          Math.abs(b.preWindowMeanPosition - t.preWindowMeanPosition))
        .slice(0, 5);
      if (matched.length < 3) markUnderMatched(t);
    }

Matching is performed once at snapshot time and frozen via the cohort_definition table (see "Snapshot immutability" below). Re-running the match later would produce a different cohort, because the underlying GSC data keeps changing; the frozen list is what every published claim refers to.

Window sizes

Default windows in v1.0.0:

Window               Length          Anchored on
Pre-window           28 days         Ends at certificationStampedAt - 14 days; runs back 28 days from there
Exclusion gap        28 days         The 14 days immediately before the stamp plus the 14 days immediately after
Post-window          28 days         Starts 14 days after the stamp; runs 28 days
Measurement window   84 days total   Pre + gap + post

The exclusion gap (14 days on either side of the stamp) removes ranking volatility associated with the moment of publication or stamping itself. The pre- and post-windows are 28 days each so each contains four full weeks, controlling for day-of-week traffic seasonality.

Brands whose URLs lack a full post-window (28 contiguous days after the gap) are not snapshotted. We never publish numbers that don't have the full window.
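
A minimal sketch of the window arithmetic implied by the table (pure date math; names are ours, not the reference implementation):

    const DAY = 24 * 60 * 60 * 1000;

    // All three windows hang off certificationStampedAt.
    function windows(stampedAt: Date) {
      const t = stampedAt.getTime();
      return {
        preWindow:  { start: new Date(t - 42 * DAY), end: new Date(t - 14 * DAY) }, // 28 days
        gap:        { start: new Date(t - 14 * DAY), end: new Date(t + 14 * DAY) }, // 28 days
        postWindow: { start: new Date(t + 14 * DAY), end: new Date(t + 42 * DAY) }, // 28 days
      };
    }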

Sample size and statistical power

Every snapshot stores sample_size_certified, sample_size_control, and statistical_power. We require power ≥ 0.80 (Cohen's convention) before any brand-level lift number is shown in a public claim or case study draft.

For headline aggregate lift across a brand:

  • Default minimum detectable effect (MDE). 5 percentage points of CTR change.
  • Default α. 0.05 (Bonferroni-corrected when multiple metrics are reported together; see below).
  • Default sample-size floor. 10 treated URLs with ≥3 matched controls each, and the full 28 days of post-window data per URL. Brands not meeting this floor are flagged in the admin UI as "insufficient power" and their numbers are not published.

Power is computed analytically using a two-sample t-test on the diff-in-diff delta, with pooled standard deviations from the pre-window. The exact formula and reference implementation live in src/lib/analytics/power.ts.
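
For intuition, a back-of-envelope version of that analytic form (a normal approximation to the two-sample t-test; a sketch, not the power.ts code):

    // Approximate power to detect an effect of size `mde`, given the pooled
    // pre-window standard deviation and per-group sample size n, at
    // two-sided alpha = 0.05 (z ≈ 1.96).
    function approxPower(mde: number, pooledSd: number, n: number): number {
      const zAlpha = 1.959964;
      return phi(mde / (pooledSd * Math.sqrt(2 / n)) - zAlpha);
    }

    // Standard normal CDF via an Abramowitz-Stegun erf approximation.
    function phi(x: number): number {
      const s = x < 0 ? -1 : 1;
      const a = Math.abs(x) / Math.SQRT2;
      const t = 1 / (1 + 0.3275911 * a);
      const y = 1 - ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
        - 0.284496736) * t + 0.254829592) * t * Math.exp(-a * a);
      return 0.5 * (1 + s * y);
    }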

Statistical method

The headline metrics are diff-in-diff (DiD) deltas for clicks, average position, and CTR. For each metric:

DiD_metric = (treated_post_mean - treated_pre_mean)
           - (control_post_mean - control_pre_mean)

Where each _mean is the per-URL average across the relevant window, then averaged across the cohort.
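
In code, with illustrative field names (one mean per URL per window):

    type UrlStats = { preMean: number; postMean: number };
    type TreatedUrl = UrlStats & { matchedControls: UrlStats[] };

    // One DiD value per treated URL: its own pre-to-post delta minus the
    // mean delta across its matched controls.
    function perUrlDid(t: TreatedUrl): number {
      const ctrlDelta =
        t.matchedControls.reduce((s, c) => s + (c.postMean - c.preMean), 0) /
        t.matchedControls.length;
      return (t.postMean - t.preMean) - ctrlDelta;
    }

    // Headline number: average the per-URL DiD across the cohort.
    const cohortDid = (cohort: TreatedUrl[]) =>
      cohort.map(perUrlDid).reduce((s, d) => s + d, 0) / cohort.length;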

Significance test

For each metric we run, in order:

  1. Welch's two-sample t-test on the per-URL DiD values (treated vs. their matched controls). Welch is preferred over Student's t because we cannot assume equal variance between treated and control distributions.
  2. Mann-Whitney U as a non-parametric backup when the per-URL DiD distribution fails the Shapiro-Wilk normality test (p < 0.05). We report whichever the gate selected; we do not select post-hoc based on which gives a smaller p-value.
  3. Bonferroni correction when reporting more than one metric together. If we report DiD-clicks, DiD-position, and DiD-CTR jointly, each metric's α is divided by 3 (so a "p<0.05" claim across three metrics actually means p<0.0167 per metric).

A claim is "statistically significant" only when the corrected p-value clears the threshold. Claims that do not clear are still surfaced internally but are labeled "directional" and never published in case studies.
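
The gate, as a sketch; shapiroWilkP, welchTTestP, and mannWhitneyUP stand in for real statistics routines and are assumptions of this sketch, not names from the CertREV codebase:

    // Assumed signatures: each returns a p-value.
    declare function shapiroWilkP(xs: number[]): number;
    declare function welchTTestP(a: number[], b: number[]): number;
    declare function mannWhitneyUP(a: number[], b: number[]): number;

    function testMetric(
      perUrlDid: number[],      // one DiD per treated URL (normality gate)
      treatedDeltas: number[],  // per-URL pre-to-post deltas, treated
      controlDeltas: number[],  // per-URL pre-to-post deltas, control
      metricsReported: number,  // e.g. 3 when clicks, position, CTR ship together
    ) {
      const alpha = 0.05 / metricsReported;               // Bonferroni correction
      const parametric = shapiroWilkP(perUrlDid) >= 0.05; // gate fixed up front
      const p = parametric
        ? welchTTestP(treatedDeltas, controlDeltas)       // unequal variances allowed
        : mannWhitneyUP(treatedDeltas, controlDeltas);    // non-parametric backup
      return { p, significant: p < alpha, test: parametric ? "welch" : "mann-whitney" };
    }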

Effect-size reporting

We report Cohen's d alongside p-values for every DiD claim. Small (d≈0.2), medium (d≈0.5), and large (d≈0.8) effect-size labels appear on every internal chart. A statistically significant tiny effect is not a sales claim.
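
The pooled-standard-deviation form of Cohen's d we report, as a self-contained sketch:

    const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
    const variance = (xs: number[]) => {
      const m = mean(xs);
      return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
    };

    // Cohen's d: mean difference scaled by the pooled standard deviation.
    function cohensD(a: number[], b: number[]): number {
      const pooledVar =
        ((a.length - 1) * variance(a) + (b.length - 1) * variance(b)) /
        (a.length + b.length - 2);
      return (mean(a) - mean(b)) / Math.sqrt(pooledVar);
    }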

Snapshot immutability

Once a URL is snapshotted into certification_impact_snapshot under a given methodology_version, the row is frozen. The unique constraint (submission_id, methodology_version) enforces this at the database level.

If we change how a metric is computed:

  1. We increment the methodology version (rules below) and tag the commit methodology/<version>.
  2. We re-snapshot affected URLs under the new version.
  3. The previous version's rows remain queryable; old case studies that cited them stay correct.

Version-bumping rules

Change                                                       Bump
Add a new metric (e.g. add session_duration to DiD output)   Minor (v1.0.0 → v1.1.0)
Change cohort matching keys or thresholds                    Major (v1.0.0 → v2.0.0)
Change window sizes                                          Major
Bug fix to an existing metric's formula                      Major (historically, the fixed metric is a different metric than the broken one)

Known limitations

The methodology is honest about what it can and cannot prove.

Small brands

Brands with fewer than 10 certified URLs in a 90-day window cannot reach the power threshold. We do not publish lift numbers for these brands; we surface per-URL impact cards instead, with explicit "small sample" labeling.

Recent certifications

URLs certified within the last 56 days (the 14-day post-stamp half of the gap, plus the 28-day post-window, plus 14 days of slack) are not yet snapshot-ready. They appear in the admin UI as "pending power" and become eligible once the post-window completes. We never project or back-fill outcomes for incomplete windows.

GSC sampling caveats

Google Search Console aggregates queries below an undisclosed privacy threshold; for low-volume URLs, daily click counts can read 0 even when real traffic occurred. Our pipeline:

  1. Pulls daily, per-URL data via the GSC API (not the UI export, which has different sampling).
  2. Excludes URLs whose pre-window had fewer than 100 total impressions across 28 days. Below that floor, GSC sampling noise dominates any DiD signal.
  3. Reports the impressions floor on every snapshot for auditability.

We do not correct for sampling. We exclude under-sampled URLs instead.
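
Step 2 of the pipeline is a plain filter; as a sketch (field name illustrative):

    type UrlRow = { url: string; preWindowImpressions: number };

    // Floor from step 2: under 100 impressions across the 28-day pre-window,
    // GSC sampling noise dominates any DiD signal.
    const wellSampled = (rows: UrlRow[]) =>
      rows.filter((r) => r.preWindowImpressions >= 100);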

Country and device

v1.0.0 aggregates across all countries and devices. A future version may stratify by country or by mobile/desktop, which would produce a different control cohort and a fresh methodology version.

AI Overviews and SGE

GSC click data does not capture AI Overview citations as a separate metric. URLs that lose clicks because Google AI Overview answered the query without a click-through can read as a position-stable, click-down result. Our v1.0.0 DiD treats this as noise. AI citation tracking (M2 of the analytics roadmap) is a separate methodology and is not included in the v1.0.0 DiD claims.

Selection bias in case studies

Brands featured in published case studies are sampled by criteria stated on the case study microsite (e.g., "all brands with ≥10 certifications, ≥90 days of post-cert data, and a power ≥0.80 aggregate"). The sampling rule is disclosed; we do not publish only winners and silently skip losers.

Confounders we do not control for

  • Off-platform brand-side SEO investments concurrent with certification (other than those self-disclosed).
  • Algorithm updates from Google during the measurement window. Major algorithm-update dates are recorded; snapshots whose post-window overlaps a named update are flagged in the admin UI.
  • Backlink acquisition. M5 of the analytics roadmap will add backlink delta as a covariate; v1.0.0 does not.

Glossary

  • Certification stamp. The moment when CertREV's expert-review block (with verifiable credentials) is embedded into the published article and the JSON-LD schema goes live on the URL. Recorded as submissions.certificationStampedAt.
  • Cohort. A frozen list of URLs that participate in a measurement. Recorded in cohort_definition with selection criteria and member URLs at capture time.
  • Control cohort. The matched set of uncertified URLs from the same brand. Selected by the matching algorithm above; frozen at snapshot time.
  • Diff-in-diff (DiD). The difference between the treated cohort's pre→post delta and the control cohort's pre→post delta. Cancels common-mode shifts (seasonality, brand-wide ranking changes) so the residual is the certification-attributable lift.
  • Measurement window. The full 84-day span (pre + gap + post) used for one snapshot.
  • Methodology version. A semver string (v1.0.0, v1.1.0, ...) tagged at the commit that defines the methodology. Stored as methodology_version on every snapshot row.
  • Per-URL impact card. The transparent unit-of-account: one card per treated URL, showing its DiD vs. its specific matched controls. Aggregated numbers always link back to the cards.
  • Power. The probability that a true effect of the minimum-detectable size would have been detected by this snapshot. We require ≥ 0.80 before publishing.
  • Snapshot. An immutable row in certification_impact_snapshot capturing pre-window, post-window, and DiD metrics for one URL under one methodology version. Once written, never updated.
  • Treated cohort. The set of certified URLs participating in a measurement.
  • Under-matched. A treated URL with fewer than 3 candidate controls passing both matching keys. Excluded from headline aggregates; surfaced individually with a clear label.
  • Window (pre/post). The 28-day data window on either side of the exclusion gap. The exclusion gap itself is a 14-day buffer on either side of the stamp date.

What's next

v1.0.0 is the foundation. Tracked under the Performance Analytics & Proof project, future bumps include:

  • v1.1.0 (planned): per-query DiD as an additional output metric (queries ranked, not just URLs ranked).
  • v2.0.0 (planned): country and device stratification, producing per-segment cohorts.
  • v2.x (planned): backlink delta as a covariate, contingent on M5 vendor selection.

Each bump will appear at this URL with its own page. Old versions stay live.

Contact

Questions about the methodology — including bug reports — go to methodology@certrev.com. We treat methodology bugs the same way we treat security bugs: acknowledged within 24 hours, fix-or-explain within a week, and any correction publicly noted on the affected version's page.