Reconciling aggregator output with custodian source-of-truth

WealthSchema Staff · Integration patterns · May 9, 2026 · 5 min read

A wealth-tech platform that uses both aggregator data (Plaid/Yodlee/Akoya/MX) and direct custodian feeds (Schwab/Fidelity/Pershing/BNY) for the same household — which is the typical case at institutional scale — has to handle a fundamental data-quality question: which feed is the source of truth, and what happens when the two disagree?

The two will disagree. Not occasionally — routinely. The disagreements are mostly small, mostly recoverable, and mostly invisible until they manifest as a reconciliation break in a year-end report or a wrong number in a tax form. This article is the reconciliation contract: where the disagreements come from, how to design the platform to handle them, and what synthetic test data has to look like to exercise the reconciliation logic.

Why the two feeds disagree

The disagreement isn't a bug in either feed. It's structural — the two are answering slightly different questions:

Aggregator data is normalized. Plaid receives raw data from the institution and projects it through Plaid's normalized schema. The projection is lossy by design — it has to span 12,000+ institutions with very different underlying systems. Some fidelity is sacrificed for portability.
Custodian-direct data is institution-shaped. A direct feed from Schwab carries Schwab's exact data shape, with Schwab's conventions for timing, rounding, and classification. Higher fidelity, but only for that one institution.
The two feeds run on different cadences. Aggregators typically refresh hourly to daily; custodian-direct feeds can be intraday, end-of-day, or batch. The same account at the same wall-clock moment can show different values from each feed depending on when each last refreshed.
The two feeds can have different reconciliation states. An aggregator may surface a pending transaction that the custodian feed has already settled, or vice versa. The settlement-state convention differs across feeds.

Source of disagreement	Typical magnitude	Frequency
Refresh-cycle lag	Hours to a day	Always present, especially mid-day
Pending vs. settled state	Per-transaction	5–15% of transactions on active accounts
Fractional-share rounding	Sub-cent to a few cents per holding	On every fractional-share position
Distribution character classification	Categorical (qualified vs. return of capital)	On every multi-character distribution
Account-type taxonomy mismatch	Categorical	On accounts the aggregator misclassifies (~1–3% of IRA/Roth distinctions)
Corporate-action timing	Days	Around every ex-date for affected holdings
Cost-basis depth	Lot-level vs. aggregate	On every position with multiple lots

The reconciliation contract

The platform has to make a few core decisions to handle dual-source data correctly:

1. Source-of-truth assignment per field

The right source of truth differs by field. The contract for a typical institutional platform:

Field	Aggregator wins?	Custodian wins?	Rationale
Account existence	Yes (broad)	—	Aggregator surfaces accounts the user authorized; custodian is single-institution
Position quantities	—	Yes	Custodian is canonical; aggregator typically derives from custodian feed anyway
Lot-level basis	—	Yes (when present)	Aggregator typically lacks lot data; custodian carries CBRS-format lots
Pending transactions	Yes	—	Aggregator surfaces pending faster; custodian publishes after settlement
Settled transactions	—	Yes	Custodian's settlement is canonical; aggregator inherits
Distribution character (year-end)	—	Yes (1099-DIV)	Year-end 1099 is the only correct classification source
Account-type / registration	—	Yes	Custodian holds the legal account-type record
Real-time intraday balances	Maybe	Maybe	Depends on which feed is fresher; usually a max(timestamp) decision

The contract is best implemented as a per-field source-of-truth registry rather than a per-feed precedence rule. Some fields prefer aggregator data; some prefer custodian data; some are time-conditional.
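A per-field registry of this kind can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Observation` type, the resolver functions, the field names, and the tie-breaking choice (custodian wins when timestamps are equal) are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Observation:
    """One feed's value for a field, plus when that feed last refreshed."""
    value: Any
    as_of: datetime


# A resolver takes (aggregator observation, custodian observation) and picks one.
Resolver = Callable[[Observation, Observation], Observation]


def aggregator_wins(agg: Observation, cust: Observation) -> Observation:
    return agg


def custodian_wins(agg: Observation, cust: Observation) -> Observation:
    return cust


def custodian_wins_when_present(agg: Observation, cust: Observation) -> Observation:
    # Falls back to the aggregator only when the custodian carries no value.
    return cust if cust.value is not None else agg


def freshest_wins(agg: Observation, cust: Observation) -> Observation:
    # Time-conditional rule; ties go to the custodian (an assumption).
    return agg if agg.as_of > cust.as_of else cust


# Hypothetical field names, mirroring the rows of the table above.
SOURCE_OF_TRUTH: Dict[str, Resolver] = {
    "pending_transactions": aggregator_wins,
    "position_quantity": custodian_wins,
    "lot_level_basis": custodian_wins_when_present,
    "settled_transactions": custodian_wins,
    "account_type": custodian_wins,
    "intraday_balance": freshest_wins,
}


def resolve(field_name: str, agg: Observation, cust: Observation) -> Observation:
    """Look up the field's rule and apply it; unknown fields fail loudly."""
    if field_name not in SOURCE_OF_TRUTH:
        raise KeyError(f"no source-of-truth rule registered for {field_name!r}")
    return SOURCE_OF_TRUTH[field_name](agg, cust)
```

Keeping the rules in a lookup table rather than in branching code means a new field needs only a new registry entry, and a field with no rule raises an error instead of silently inheriting a per-feed default.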

2. Reconciliation cadence

The platform has to run a reconciliation pass on a regular cadence — typically nightly batch, with intraday spot-checks for high-stakes flows (trade execution, withdrawal initiation). The pass compares the two feeds field-by-field, applies the source-of-truth rules, and surfaces unresolved disagreements as exceptions.

Realistic exception rates per 1,000 active accounts: 5–20 exceptions per nightly run on routine fields, dropping to single digits once the platform reaches steady state. Accounts in their first 90 days run higher exception rates because aggregator linking is most fragile early on.
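The comparison step of such a pass might look like the sketch below. It covers only the field-by-field comparison and the exception output, not the source-of-truth resolution. The tolerance values, the field names, and the `ReconException` shape are hypothetical and would need tuning against real feeds.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Any, Dict, List

# Hypothetical tolerances: numeric fields within these bounds count as agreeing.
TOLERANCES: Dict[str, Decimal] = {
    "market_value": Decimal("0.05"),
    "quantity": Decimal("0.0001"),
}


@dataclass(frozen=True)
class ReconException:
    account_id: str
    field_name: str
    aggregator_value: Any
    custodian_value: Any


def values_agree(field_name: str, agg_value: Any, cust_value: Any) -> bool:
    """Exact match for categorical fields, tolerance match for numeric ones."""
    tolerance = TOLERANCES.get(field_name)
    if tolerance is None or agg_value is None or cust_value is None:
        return agg_value == cust_value
    # Go through str() so float inputs do not drag binary noise into Decimal.
    gap = abs(Decimal(str(agg_value)) - Decimal(str(cust_value)))
    return gap <= tolerance


def reconcile_account(
    account_id: str, agg_view: Dict[str, Any], cust_view: Dict[str, Any]
) -> List[ReconException]:
    """Compare the fields both feeds report and return the disagreements."""
    exceptions = []
    for name in sorted(agg_view.keys() & cust_view.keys()):
        if not values_agree(name, agg_view[name], cust_view[name]):
            exceptions.append(
                ReconException(account_id, name, agg_view[name], cust_view[name])
            )
    return exceptions
```

With these tolerances, a fractional-share quantity of 10.3333 against 10.33334 passes as agreement, while an IRA-versus-Roth mismatch on `account_type` is surfaced as an exception.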

3. The duplicate-account problem

One reconciliation failure mode deserves its own treatment: an aggregator sometimes surfaces the same logical account under a different identifier than the custodian feed uses. The customer's Schwab account #82145678 might be linked through Plaid as Plaid account ID P_abc123xyz with a Plaid-internal account number 82145678, while the custodian-direct feed reports it with Schwab's institutional ID SCH-INST-82145678. The platform's account-matching logic has to recognize these as the same account, or it will produce a household with double the actual position count.

The matching key can't be the account number alone, because different institutions use overlapping number spaces. The reliable matching key is the (institution_id, account_number) tuple, with a fallback to (institution_id, last_4_account, registration_match, position_overlap) for feeds that expose only a masked account number. Test data has to include at least some same-account-different-source cases so the matching logic gets exercised.

// Same account, two views
// Plaid view
{
  "plaid_account_id": "P_abc123xyz",
  "institution": { "name": "Charles Schwab", "institution_id": "ins_109511" },
  "mask": "5678",
  "official_name": "INDIVIDUAL Account 82145678",
  "type": "investment",
  "subtype": "brokerage"
}

// Schwab direct view
{
  "schwab_account_number": "82145678",
  "account_type": "INDIVIDUAL_TAXABLE",
  "registration": "JOHN Q SMITH"
}

// Platform's reconciled record
{
  "platform_account_id": "ACC-2025-9821",
  "institution_id": "schwab",
  "account_number": "82145678",
  "match_keys": [
    { "source": "plaid", "key": "P_abc123xyz" },
    { "source": "schwab_direct", "key": "82145678" }
  ],
  "primary_data_source": "schwab_direct",
  "secondary_data_source": "plaid"
}
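The matching rule described above (primary key first, fallback tuple second) could be sketched as follows. The `AccountView` type, the Jaccard measure used for position overlap, and the 0.8 threshold are assumptions for illustration. The sketch also assumes each feed's institution identifier has already been mapped to a platform-level `institution_id`.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class AccountView:
    """One feed's view of an account, reduced to the fields used for matching."""
    institution_id: str           # platform-normalized, not the feed's own id
    last4: str
    registration: str
    symbols: FrozenSet[str]       # tickers currently held
    account_number: Optional[str] = None  # None when the feed only gives a mask


def normalize_name(name: str) -> str:
    # Case- and whitespace-insensitive comparison of registration names.
    return " ".join(name.upper().split())


def position_overlap(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard overlap of held symbols; 0.0 when both accounts are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_account(a: AccountView, b: AccountView, min_overlap: float = 0.8) -> bool:
    if a.institution_id != b.institution_id:
        return False
    # Primary key: (institution_id, account_number) when both feeds carry it.
    if a.account_number and b.account_number:
        return a.account_number == b.account_number
    # Fallback: last four digits, registration name, and holdings overlap.
    return (
        a.last4 == b.last4
        and normalize_name(a.registration) == normalize_name(b.registration)
        and position_overlap(a.symbols, b.symbols) >= min_overlap
    )
```

One consequence of this particular fallback is that two empty accounts never match on holdings alone, so newly opened accounts would need either the full account number or a manual review path.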

What synthetic test data has to look like

The minimum-viable corpus for testing dual-source reconciliation:

Test scenario	What it exercises
Same account from both sources	Account-matching logic; per-field source-of-truth resolution
Aggregator-only account	Coverage of accounts the platform doesn't have direct custodian feed for
Custodian-only account	Coverage of accounts the user hasn't linked via aggregator
Disagreement on pending transactions	Pending-vs-settled reconciliation; aggregator-surfaced pending that custodian shows as already-settled
Disagreement on fractional shares	Sub-cent rounding logic; tolerance thresholds for 'agreement'
Distribution character override	Year-end 1099-DIV reclassification overriding aggregator's preliminary classification
Account-type mismatch	Aggregator misclassifies a Roth IRA as a Traditional IRA; custodian is correct; platform's mismatch-resolution logic
Aggregator-link broken	Plaid token expired, Yodlee re-auth required; platform falls back to custodian-direct only and resumes when aggregator returns

A test corpus that includes only single-source households cannot exercise the reconciliation code paths. Realistic dual-source data — same household, same account, two views — is the test shape that matters.
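As one illustration of the "two views" shape, the sketch below derives an aggregator-side view from a custodian transaction list, leaving a fixed share of transactions in pending state. The function name, the `status` field, and the dictionary layout are hypothetical. Seeding the generator keeps the fixture reproducible across test runs.

```python
import random
from copy import deepcopy
from typing import Any, Dict, List


def make_aggregator_view(
    custodian_txns: List[Dict[str, Any]], pending_rate: float, seed: int = 0
) -> List[Dict[str, Any]]:
    """Copy the custodian transactions and mark a fixed share as still pending.

    Models the case where the aggregator still shows a transaction as pending
    after the custodian has already settled it.
    """
    if not 0.0 <= pending_rate <= 1.0:
        raise ValueError("pending_rate must be between 0 and 1")
    rng = random.Random(seed)  # local generator, so global random state is untouched
    n_pending = round(pending_rate * len(custodian_txns))
    pending_idx = set(rng.sample(range(len(custodian_txns)), n_pending))

    aggregator_txns = deepcopy(custodian_txns)  # never mutate the custodian view
    for i, txn in enumerate(aggregator_txns):
        if i in pending_idx:
            txn["status"] = "pending"
    return aggregator_txns
```

Calling `make_aggregator_view(custodian_txns, pending_rate=0.10)` on 100 settled transactions yields an aggregator view with exactly 10 marked pending, which gives the reconciliation tests a known number of disagreements to find.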

The year-end reconciliation cycle

The single highest-stakes reconciliation moment is the year-end 1099-DIV / 1099-B reconciliation. Aggregator feeds may have classified distributions, basis, and gains throughout the year based on preliminary information. The year-end 1099 is the canonical correction — and the gap between aggregator-classified data and 1099-corrected data is sometimes substantial.

Common year-end corrections:

Distribution character reclassification. A REIT distribution classified as "qualified dividend" through the year is reclassified at year-end to ~50% qualified, ~30% return-of-capital, ~15% ordinary, ~5% Section 199A. The 1099-DIV is the only correct source.
Wash-sale adjustments. Wash sales identified by the custodian's tax-lot software (FIFO-applied across the year) may differ from what the aggregator reported in real-time. The 1099-B is canonical.
Cost-basis corrections. Non-covered lots may have updated cost basis at year-end (customer-supplied documentation, custodian back-research). The 1099-B carries the corrected basis.

Test data has to include the year-end correction step: both the pre-correction state (what the aggregator showed throughout the year) and the post-correction state (what the 1099 ultimately said). Platforms tested only against post-correction data never exercise the in-flight reconciliation logic that produces tax-aware insights mid-year, so failures there go undetected.
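A paired pre- and post-correction fixture for the distribution-character case can be produced with a small helper like the one below. It is a sketch under stated assumptions: the character labels are hypothetical, the percentages in the usage note are the illustrative REIT split from the list above, and pushing leftover rounding cents into the largest bucket is one possible convention rather than a rule taken from any custodian.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Dict

CENT = Decimal("0.01")


def reclassify_distribution(
    amount: Decimal, shares: Dict[str, Decimal]
) -> Dict[str, Decimal]:
    """Split a distribution across tax characters so the parts sum to the total.

    `shares` maps a character label to its fraction of the distribution and
    must sum to exactly 1. Any rounding remainder is added to the largest
    bucket (an assumed convention).
    """
    if sum(shares.values()) != Decimal("1"):
        raise ValueError("character shares must sum to 1")
    split = {
        label: (amount * share).quantize(CENT, rounding=ROUND_HALF_UP)
        for label, share in shares.items()
    }
    remainder = amount - sum(split.values())
    largest = max(shares, key=lambda label: shares[label])
    split[largest] += remainder
    return split
```

For a 1,000.00 distribution carried all year as a qualified dividend, applying shares of 0.50, 0.30, 0.15, and 0.05 produces a post-correction state of 500.00 qualified, 300.00 return of capital, 150.00 ordinary, and 50.00 Section 199A. Storing both states side by side lets a test assert on what the platform showed before the 1099-DIV arrived as well as after.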

How this shows up in our catalog

The institutional bundles in the catalog ship with paired aggregator-and-custodian views per account where both sources are configured. The disagreements are deliberately calibrated: 5–10% of transactions in pending-vs-settled disagreement at any snapshot, sub-cent fractional-share rounding gaps on every fractional position, multi-character distribution classification on REIT/MLP/BDC holdings with year-end 1099-DIV correction events. The matching keys are intentionally non-trivial — test households include at least some accounts where naive account-number matching would fail.

For the broader integration context, see Aggregator & Custodian Integration and the procurement-side Aggregator API vs. direct custodian feed comparison. For the per-source data shapes, see Modeling Plaid, Yodlee, Akoya, and MX outputs and Custodian-specific data quirks. For the time-series-fidelity properties that the dual-source reconciliation touches, see Time-Series Fidelity in Synthetic Wealth Data.