Performance attribution test data for reporting platforms

WealthSchema Staff · Pipeline architecture · May 9, 2026 · 5 min read

A wealth-tech reporting platform that produces performance numbers without an attribution decomposition is a platform whose users stop trusting the numbers the first time they're asked "why did we underperform the benchmark?" Performance attribution is how a reporting product earns its place in an institutional workflow — and it is also the part of the product that's hardest to test, because the attribution algebra interacts with corporate actions, regime changes, and multi-period linking in ways that single-snapshot mock data cannot exercise.

This article is the schema view: what a reporting platform's attribution engine consumes, what the realistic test inputs look like, and where the bugs hide. It's intended for engineers and product managers building the kind of reporting that has to certify against GIPS standards or comparable institutional benchmarks.

The three attribution methods you have to test

Most reporting platforms ship with one of three attribution decompositions — sometimes all three. Each decomposes the same total return differently, each consumes a different test-data shape, and each has its own most-common implementation bug.

Method	Decomposes return into	Required inputs	Most common implementation bug
Brinson (BHB / BF)	Allocation effect, selection effect, interaction	Portfolio weights & returns by sector, benchmark weights & returns by sector	Treating allocation effect as if benchmark sector return is zero
Factor attribution	Per-factor contribution + residual	Factor loadings per holding, factor returns, residual return	Cross-product term in factor decomposition is dropped
Multi-period linking	Period-by-period contributions linked over time	Each period's attribution + a linking algorithm (Carino, Menchero, GRAP)	Drift between sum of period contributions and total period return

Brinson — the test that any platform should pass

Brinson-Hood-Beebower (BHB) decomposes the active return — portfolio return minus benchmark return — into allocation, selection, and an interaction term. The sector-level allocation effect captures how the portfolio's sector weights differed from the benchmark's; the selection effect captures how the portfolio's within-sector returns differed from the benchmark's; the interaction is the cross-product.

The standard formula:

Formula

Brinson-Hood-Beebower attribution

A_i = (w_i^p - w_i^b) × r_i^b
S_i = w_i^b × (r_i^p - r_i^b)
I_i = (w_i^p - w_i^b) × (r_i^p - r_i^b)

where:
A_i: allocation effect for sector i
S_i: selection effect for sector i
I_i: interaction effect for sector i
w_i^p: portfolio weight in sector i
w_i^b: benchmark weight in sector i
r_i^p: portfolio return in sector i
r_i^b: benchmark return in sector i

Example

Portfolio is 25% tech, 15% energy. Benchmark is 20% tech, 20% energy. Tech returns: portfolio +12%, benchmark +10%. Energy returns: portfolio -5%, benchmark -8%. Allocation effect (tech): (0.25 − 0.20) × 0.10 = +50 bps. Selection effect (tech): 0.20 × (0.12 − 0.10) = +40 bps. Interaction (tech): (0.25 − 0.20) × (0.12 − 0.10) = +10 bps. Total tech effect: +100 bps, which matches 0.25 × 0.12 − 0.20 × 0.10. The same algebra applies to energy and to every remaining sector; summing the sector totals gives the total active return.
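The worked example can be reproduced with a few lines of Python. This is a minimal sketch, not a production attribution engine: the `SectorInput` class and the `bhb_effects` function are illustrative names introduced here, and the inputs are the tech and energy figures from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectorInput:
    """Weights and returns for one sector, all expressed as decimals."""
    w_p: float  # portfolio weight
    w_b: float  # benchmark weight
    r_p: float  # portfolio return within the sector
    r_b: float  # benchmark return within the sector


def bhb_effects(s: SectorInput) -> dict[str, float]:
    """Return the BHB allocation, selection, and interaction effects."""
    allocation = (s.w_p - s.w_b) * s.r_b
    selection = s.w_b * (s.r_p - s.r_b)
    interaction = (s.w_p - s.w_b) * (s.r_p - s.r_b)
    return {
        "allocation": allocation,
        "selection": selection,
        "interaction": interaction,
        "total": allocation + selection + interaction,
    }


sectors = {
    "tech": SectorInput(w_p=0.25, w_b=0.20, r_p=0.12, r_b=0.10),
    "energy": SectorInput(w_p=0.15, w_b=0.20, r_p=-0.05, r_b=-0.08),
}

for name, s in sectors.items():
    effects = bhb_effects(s)
    # Identity check: the three effects must sum to w_p*r_p - w_b*r_b.
    assert abs(effects["total"] - (s.w_p * s.r_p - s.w_b * s.r_b)) < 1e-12
    # Print each effect in basis points.
    print(name, {k: round(v * 1e4, 1) for k, v in effects.items()})
```

For energy the same algebra gives +40 bps allocation (an underweight in a falling sector), +60 bps selection, and −15 bps interaction, for +85 bps in total. The per-sector identity check is the kind of assertion a test suite can run on every synthetic household.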

The data your test corpus needs is a portfolio-vs-benchmark divergence at the sector level. That means your synthetic households need positions classified into sectors, the sectors need to track a real-world sector taxonomy (GICS, NAICS, ICB), and the test corpus needs households whose sector weights deliberately differ from any plausible benchmark. Synthetic households where every household holds 100% VTI are useless for attribution testing because the BHB decomposition is degenerate: the portfolio's sector weights and returns sit almost exactly on a broad-market benchmark, so every effect collapses toward zero.
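One way to screen a corpus for degenerate households is to measure how far each household's sector weights sit from the benchmark's. The sketch below uses a sector-level active share (half the sum of absolute weight differences); the function names and the 5% threshold are assumptions chosen for illustration, not values from any standard.

```python
def sector_active_share(portfolio: dict[str, float], benchmark: dict[str, float]) -> float:
    """Half the sum of absolute sector weight differences (0 = identical weights)."""
    sectors = set(portfolio) | set(benchmark)
    return 0.5 * sum(abs(portfolio.get(s, 0.0) - benchmark.get(s, 0.0)) for s in sectors)


def is_degenerate(portfolio: dict[str, float], benchmark: dict[str, float],
                  min_active_share: float = 0.05) -> bool:
    """Flag households whose sector weights barely differ from the benchmark.

    min_active_share is a hypothetical threshold; tune it to your corpus.
    """
    return sector_active_share(portfolio, benchmark) < min_active_share


# Hypothetical three-sector benchmark and two households.
benchmark = {"tech": 0.30, "energy": 0.10, "other": 0.60}
index_hugger = {"tech": 0.30, "energy": 0.10, "other": 0.60}
tilted = {"tech": 0.40, "energy": 0.05, "other": 0.55}

print(is_degenerate(index_hugger, benchmark))  # household mirrors the benchmark
print(is_degenerate(tilted, benchmark))        # household has a deliberate tilt
```

A household that passes this screen still needs within-sector return differences to exercise the selection effect, so the check is necessary rather than sufficient.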

Factor attribution — the schema gets richer

Factor attribution decomposes return into the contributions of named factors (size, value, momentum, quality, low-vol) plus a residual idiosyncratic component. The output is the equity analyst's tool: "why did the portfolio outperform" decomposes into "tilted toward value, against momentum, with positive selection within both."

The synthetic-data requirement for factor attribution is materially heavier than for Brinson. Each holding needs factor loadings (typically 5–10 factors), and the factor returns over the period need to be available. Factor loadings change over time as company fundamentals change; mock data that assigns a single static factor exposure per ticker can't exercise the time-varying-loading code path that real factor engines depend on.

// Holding-level factor exposure (one snapshot)
{
  "holding_id": "H-VTI-2025-09-30",
  "symbol": "VTI",
  "snapshot_date": "2025-09-30",
  "factor_exposures": {
    "size": -0.42,           // negative = large-cap
    "value": -0.05,          // slight growth tilt
    "momentum": 0.12,        // positive momentum
    "quality": 0.18,         // high-quality
    "low_volatility": -0.08, // slight high-vol
    "yield": 0.05            // dividend yield positive
  },
  "factor_model_version": "MSCI-USE5-2024.04",
  "residual_volatility_pct": 0.5
}
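To show how a snapshot like this feeds the decomposition, here is a minimal sketch that assumes a simple single-period linear model: each factor's contribution is the holding's exposure times that factor's return, and the residual is whatever is left of the holding's return. The exposures are copied from the snapshot above; the factor returns, the total return, and the `factor_attribution` function are hypothetical values and names made up for illustration.

```python
def factor_attribution(exposures: dict[str, float],
                       factor_returns: dict[str, float],
                       total_return: float) -> tuple[dict[str, float], float]:
    """Split a holding's return into per-factor contributions plus a residual."""
    missing = set(exposures) - set(factor_returns)
    if missing:
        # A loading with no matching factor return usually signals a model mismatch.
        raise KeyError(f"no factor return for: {sorted(missing)}")
    contributions = {f: exposures[f] * factor_returns[f] for f in exposures}
    residual = total_return - sum(contributions.values())
    return contributions, residual


# Exposures from the VTI snapshot above.
exposures = {
    "size": -0.42, "value": -0.05, "momentum": 0.12,
    "quality": 0.18, "low_volatility": -0.08, "yield": 0.05,
}
# Hypothetical factor returns for the period (not real market data).
factor_returns = {
    "size": 0.010, "value": -0.020, "momentum": 0.030,
    "quality": 0.015, "low_volatility": -0.005, "yield": 0.002,
}
total_return = 0.0100  # hypothetical holding return for the period

contributions, residual = factor_attribution(exposures, factor_returns, total_return)
# By construction, contributions plus residual reconcile to the total return.
assert abs(sum(contributions.values()) + residual - total_return) < 1e-12
```

Commercial factor models typically include additional terms (a market or country factor, industry factors) that this sketch omits. To exercise the time-varying-loading path, a test would repeat the call with a different exposure snapshot for each period.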

The factor model version field is not optional. Factor attribution outputs are model-dependent — the same portfolio analyzed under MSCI USE5 and Axioma WW21 will produce different per-factor contributions because the factor definitions differ. Test data has to record which factor model produced the loadings, and the reporting platform's attribution code path has to handle different factor models. A test corpus that uses only one factor model can't exercise the model-switching code, which is a source of subtle bugs in multi-currency portfolios where different regions use different factor models.

Multi-period linking — where most platforms fail

Single-period attribution is straightforward. Multi-period attribution — linking month-by-month attribution to produce a year-to-date or since-inception attribution — is the hard part, and the part where most reporting platforms have known correctness gaps.

The naïve approach is to sum up monthly attribution numbers. That's almost always wrong, because total period return is not the sum of monthly returns (it's the geometric link), and the sum of monthly attribution effects therefore doesn't equal the total period active return. The math discrepancy is small over short windows and grows over longer ones; the discrepancy is what makes the attribution numbers stop reconciling with the headline performance number that the same platform reports elsewhere.
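The size of the gap is easy to see with a small calculation. The sketch below uses made-up monthly returns (3% for the portfolio, 1% for the benchmark, twelve months in a row) and compares the sum of monthly active returns, which is what naïvely summed attribution effects add up to, against the geometrically linked active return.

```python
from math import prod


def geometric_link(returns: list[float]) -> float:
    """Compound a series of periodic returns into one total return."""
    return prod(1.0 + r for r in returns) - 1.0


# Hypothetical monthly returns: 2% active return in every period.
portfolio = [0.03] * 12
benchmark = [0.01] * 12

# What naively summed monthly attribution effects would add up to.
arithmetic_active = sum(p - b for p, b in zip(portfolio, benchmark))

# What the headline performance report shows.
geometric_active = geometric_link(portfolio) - geometric_link(benchmark)

linking_residual = geometric_active - arithmetic_active
print(f"sum of monthly active returns: {arithmetic_active:.4f}")
print(f"linked active return:          {geometric_active:.4f}")
print(f"residual to be redistributed:  {linking_residual:.4f}")
```

With these inputs the summed figure is 24% while the linked figure is close to 30%, a residual of several percentage points that a linking algorithm has to distribute across the effects.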

The three most common linking algorithms (Carino smoothing, Menchero smoothing, and GRAP, named after the Groupe de Recherche en Attribution de Performance) each redistribute the linking residual differently. GIPS-compliant platforms typically certify against one of them and have to be tested against cases where the residual is meaningful (multiple periods of >2% active return per period are sufficient to make the linking residual visible).

Why three linking algorithms?

The arithmetic-vs-geometric reconciliation has no unique mathematical solution: any allocation of the residual between allocation and selection effects produces a valid identity. Carino's algorithm allocates the residual proportionally to the size of each period's contribution. Menchero's allocates it to preserve relative sizing across periods. GRAP smooths the residual across the entire period. All three are GIPS-acceptable; platforms have to commit to one and disclose it. A test corpus has to produce inputs where the three algorithms give measurably different answers, so QA can verify the platform implements the chosen algorithm correctly rather than accidentally implementing one of the other two.

Source: Menchero (2000); Carino (1999); GIPS Standards (2020 edition)
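As one concrete reference point for QA, the following sketch implements the logarithmic-coefficient form of linking usually attributed to Carino (1999): each period's effects are scaled by k_t / k, where k_t = (ln(1+R_t) − ln(1+B_t)) / (R_t − B_t) and k is the same ratio computed on the compounded returns. The function names and the two-period inputs are hypothetical, and a given platform's variant may differ in detail, so treat this as a cross-check rather than a specification.

```python
from math import log, prod


def carino_coefficient(r: float, b: float) -> float:
    """Logarithmic coefficient for a (portfolio, benchmark) return pair."""
    if abs(r - b) < 1e-12:
        # Limit of the ratio as r approaches b.
        return 1.0 / (1.0 + r)
    return (log(1.0 + r) - log(1.0 + b)) / (r - b)


def carino_link(port_returns, bench_returns, period_effects):
    """Link per-period effects so they sum to the compounded active return.

    period_effects holds one dict per period, e.g. {"allocation": ..., "selection": ...},
    whose values sum to that period's arithmetic active return.
    """
    total_r = prod(1.0 + r for r in port_returns) - 1.0
    total_b = prod(1.0 + b for b in bench_returns) - 1.0
    k_total = carino_coefficient(total_r, total_b)

    linked: dict[str, float] = {}
    for r, b, effects in zip(port_returns, bench_returns, period_effects):
        scale = carino_coefficient(r, b) / k_total
        for name, value in effects.items():
            linked[name] = linked.get(name, 0.0) + scale * value
    return linked


# Hypothetical two-period case with large active returns.
port = [0.05, -0.01]
bench = [0.02, 0.03]
effects = [
    {"allocation": 0.010, "selection": 0.020},    # sums to 0.05 - 0.02
    {"allocation": -0.015, "selection": -0.025},  # sums to -0.01 - 0.03
]
linked = carino_link(port, bench, effects)

total_active = (prod(1.0 + r for r in port) - 1.0) - (prod(1.0 + b for b in bench) - 1.0)
# The linked effects reconcile to the compounded active return.
assert abs(sum(linked.values()) - total_active) < 1e-12
```

The reconciliation assertion at the end is the property to test for any linking algorithm. Telling the three algorithms apart additionally requires comparing the individual linked effects, not just their sum.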

What a realistic test corpus needs

Pulling the requirements together, here's the minimum viable synthetic-data shape for an attribution-testing engagement:

Requirement	What it exercises
Sector-classified holdings (GICS or equivalent)	Brinson allocation/selection algebra; sector-reclassification handling
Sector divergence vs. benchmark	The Brinson decomposition itself; all-VTI portfolios are degenerate
Factor loadings per holding per snapshot	Time-varying-exposure code path; same-ticker loading drift over time
Factor model version label	Multi-model / multi-region attribution code path
Benchmark return series at sector level	The 'compared to what' side; total-return (dividend-reinvested) benchmarks required
Multi-period test cases (12+ periods, 2%+ active each)	Linking-residual code (Carino / Menchero / GRAP)
Corporate-action-aware holdings	Sector and factor transitions across spinoffs and mergers
Currency layer for international holdings	Currency attribution decomposition; FX-consistency check

Currency attribution — the often-skipped layer

Any reporting platform that supports international holdings has to handle currency attribution. The portfolio's USD return decomposes into a local-currency return component, a currency-translation component, and an interaction term. The same allocation/selection algebra applies, but with a currency overlay that has its own factor exposures and its own benchmark.

Synthetic data for currency attribution testing needs both the local-currency price history and the FX rate history for every non-USD holding. The two have to be consistent: if the euro appreciates 5% against the dollar over a month and a Eurostoxx position gains 3% in EUR, the USD return is approximately 8% (exactly, (1.05)(1.03) − 1 = 8.15%, where the extra 0.15% is the interaction term). Mock-data tools that generate USD returns and EUR returns independently can't exercise the FX consistency check, which is where the next class of bugs lives.

// International holding return decomposition
{
  "holding_id": "H-EWG-2025-09-30",
  "symbol": "EWG",
  "period": "2025-09-01 to 2025-09-30",
  "local_currency_return": 0.0312,      // EUR return
  "fx_translation_return": 0.0148,      // EUR/USD appreciation
  "interaction_return": 0.0005,         // (1.0312)(1.0148) - 1 - 0.0312 - 0.0148
  "base_currency_return": 0.0465,       // USD return
  "benchmark_local_return": 0.0289,
  "benchmark_fx_return": 0.0148,
  "benchmark_base_return": 0.0441
}
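A consistency check over records of this shape could look like the sketch below. It assumes the field names from the sample record and a tolerance of 1e-4 to absorb four-decimal rounding; the `check_fx_consistency` function is a hypothetical helper, not part of any existing tool.

```python
def check_fx_consistency(record: dict, tol: float = 1e-4) -> list[str]:
    """Return a list of problems found in one return-decomposition record."""
    problems = []
    local = record["local_currency_return"]
    fx = record["fx_translation_return"]

    # Base-currency return must be the compounded local and FX returns.
    expected_base = (1.0 + local) * (1.0 + fx) - 1.0
    if abs(record["base_currency_return"] - expected_base) > tol:
        problems.append(f"base_currency_return should be about {expected_base:.4f}")

    # The interaction term is the cross-product of the two components.
    expected_interaction = local * fx
    if abs(record["interaction_return"] - expected_interaction) > tol:
        problems.append(f"interaction_return should be about {expected_interaction:.4f}")

    return problems


# Values from the sample record above.
record = {
    "local_currency_return": 0.0312,
    "fx_translation_return": 0.0148,
    "interaction_return": 0.0005,
    "base_currency_return": 0.0465,
}
print(check_fx_consistency(record))  # an empty list means the record is consistent

# A record whose USD return was generated independently of its components.
broken = dict(record, base_currency_return=0.0600)
print(check_fx_consistency(broken))
```

The same check applies to the benchmark fields by passing the benchmark's local, FX, and base returns through the same function under the corresponding keys.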

How this shows up in the corpus

The institutional and high-net-worth bundles in our synthetic data catalog include factor loadings on every holding, sector classifications updated for any reclassifications during the longitudinal window, and a benchmark series (Russell 3000 for equities, Bloomberg US Agg for fixed income; international benchmarks for international holdings). The test households are deliberately constructed with sector and factor tilts away from the benchmarks so that the resulting attribution decompositions exercise the full arithmetic — including the cases where the linking residual is large enough to distinguish Carino, Menchero, and GRAP.

For more on the underlying time-series-fidelity properties any reporting platform's tests depend on, see the Time-Series Fidelity theme, and especially Modeling corporate actions in synthetic portfolios (which covers the sector-reclassification and merger cases that attribution engines have to handle correctly). For the buyer-side QA view, see Detecting unrealistic patterns in synthetic time-series wealth data.