
Reconciling aggregator output with custodian source-of-truth

Two correct data sources can disagree by design. The platform's job is to know which to trust per field, per moment, per use case.

WealthSchema Staff · Integration patterns · May 9, 2026 · 5 min read

A wealth-tech platform that uses both aggregator data (Plaid/Yodlee/Akoya/MX) and direct custodian feeds (Schwab/Fidelity/Pershing/BNY) for the same household — which is the typical case at institutional scale — has to handle a fundamental data-quality question: which feed is the source of truth, and what happens when the two disagree?

The two will disagree, and not occasionally: routinely. The disagreements are mostly small, mostly recoverable, and mostly invisible until they surface as a reconciliation break in a year-end report or a wrong number on a tax form. This article lays out the reconciliation contract: where the disagreements come from, how to design the platform to handle them, and what synthetic test data has to look like to exercise the reconciliation logic.

Why the two feeds disagree

The disagreement isn't a bug in either feed. It's structural — the two are answering slightly different questions:

  • Aggregator data is normalized. Plaid receives raw data from the institution and projects it through Plaid's normalized schema. The projection is lossy by design — it has to span 12,000+ institutions with very different underlying systems. Some fidelity is sacrificed for portability.
  • Custodian-direct data is institution-shaped. A direct feed from Schwab carries Schwab's exact data shape, with Schwab's conventions for timing, rounding, and classification. Higher fidelity, but only for that one institution.
  • The two feeds run on different cadences. Aggregators typically refresh hourly to daily; custodian-direct feeds can be intraday, end-of-day, or batch. The same account at the same wall-clock moment can show different values from each feed depending on when each last refreshed.
  • The two feeds can have different reconciliation states. An aggregator may surface a pending transaction that the custodian feed has already settled, or vice versa. The settlement-state convention differs across feeds.
| Source of disagreement | Typical magnitude | Frequency |
|---|---|---|
| Refresh-cycle lag | Hours to a day | Always present, especially mid-day |
| Pending vs. settled state | Per-transaction | 5-15% of transactions on active accounts |
| Fractional-share rounding | Sub-cent to a few cents per holding | On every fractional-share position |
| Distribution character classification | Categorical (qualified vs RoC) | On every multi-character distribution |
| Account-type taxonomy mismatch | Categorical | On accounts the aggregator misclassifies (~1-3% of IRA/Roth distinctions) |
| Corporate-action timing | Days | Around every ex-date for affected holdings |
| Cost-basis depth | Lot-level vs aggregate | On every position with multiple lots |

The reconciliation contract

The platform has to make a few core decisions to handle dual-source data correctly:

1. Source-of-truth assignment per field

Different fields have different sources of truth. A typical contract for an institutional platform:

| Field | Aggregator wins? | Custodian wins? | Rationale |
|---|---|---|---|
| Account existence | Yes (broad) | | Aggregator surfaces accounts the user authorized; custodian is single-institution |
| Position quantities | | Yes | Custodian is canonical; aggregator typically derives from custodian feed anyway |
| Lot-level basis | | Yes (when present) | Aggregator typically lacks lot data; custodian carries CBRS-format lots |
| Pending transactions | Yes | | Aggregator surfaces pending faster; custodian publishes after settlement |
| Settled transactions | | Yes | Custodian's settlement is canonical; aggregator inherits |
| Distribution character (year-end) | | Yes (1099-DIV) | Year-end 1099 is the only correct classification source |
| Account-type / registration | | Yes | Custodian holds the legal account-type record |
| Real-time intraday balances | Maybe | Maybe | Depends on which feed is fresher; usually a max(timestamp) decision |

The contract is best implemented as a per-field source-of-truth registry rather than a per-feed precedence rule. Some fields prefer aggregator data; some prefer custodian data; some are time-conditional.
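A minimal sketch of such a registry, assuming each feed's value arrives with the timestamp of its last refresh. The field names, the `Observation` type, and the `prefer` / `freshest` helpers are illustrative, not part of any aggregator or custodian API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Observation:
    """One feed's value for one field, plus when that feed last refreshed."""
    source: str          # "aggregator" or "custodian"
    value: Any
    as_of: datetime


Resolver = Callable[[Optional[Observation], Optional[Observation]], Optional[Observation]]


def prefer(primary: str) -> Resolver:
    """Take the named source when it has a value; otherwise fall back to the other."""
    def resolve(agg, cust):
        first, second = (agg, cust) if primary == "aggregator" else (cust, agg)
        return first if first is not None else second
    return resolve


def freshest(agg, cust):
    """Time-conditional rule: whichever feed refreshed most recently wins."""
    candidates = [obs for obs in (agg, cust) if obs is not None]
    return max(candidates, key=lambda obs: obs.as_of, default=None)


# Hypothetical field names; the rules mirror the contract table above.
REGISTRY: dict[str, Resolver] = {
    "account_existence": prefer("aggregator"),
    "position_quantity": prefer("custodian"),
    "lot_basis": prefer("custodian"),
    "pending_transactions": prefer("aggregator"),
    "settled_transactions": prefer("custodian"),
    "distribution_character": prefer("custodian"),
    "account_type": prefer("custodian"),
    "intraday_balance": freshest,
}


def resolve_field(field: str, agg: Optional[Observation], cust: Optional[Observation]):
    """Look up the field's rule and apply it; unknown fields raise KeyError."""
    return REGISTRY[field](agg, cust)
```

Keeping the rule next to the field name means adding a new field forces an explicit decision about its source of truth, which a blanket "custodian beats aggregator" rule would hide.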

2. Reconciliation cadence

The platform has to run a reconciliation pass on a regular cadence — typically nightly batch, with intraday spot-checks for high-stakes flows (trade execution, withdrawal initiation). The pass compares the two feeds field-by-field, applies the source-of-truth rules, and surfaces unresolved disagreements as exceptions.

Realistic exception rates per 1,000 active accounts: 5–20 exceptions per nightly run on routine fields, dropping to single digits once the integration reaches a steady state. Accounts in their first 90 days run higher exception rates because aggregator linking is most fragile early on.
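The pass described above can be sketched as a field-by-field comparison. This is an illustration under stated assumptions: the `RULES` table, its field names, and its tolerances are hypothetical, and a real platform would draw them from its own source-of-truth contract.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ReconException:
    account_id: str
    field: str
    aggregator_value: object
    custodian_value: object


# Hypothetical rules: field -> (winning source or None, numeric tolerance or None).
RULES = {
    "quantity": ("custodian", Decimal("0.0001")),
    "account_type": ("custodian", None),
    "market_value": (None, Decimal("0.05")),  # no fixed winner: escalate if outside tolerance
}


def agrees(a, b, tolerance):
    """Numeric values agree within the tolerance; everything else must match exactly."""
    if tolerance is not None and isinstance(a, Decimal) and isinstance(b, Decimal):
        return abs(a - b) <= tolerance
    return a == b


def reconcile(account_id, agg, cust):
    """Compare two feed records; return (resolved values, unresolved exceptions)."""
    resolved, exceptions = {}, []
    for field in sorted(agg.keys() | cust.keys()):
        a, c = agg.get(field), cust.get(field)
        winner, tolerance = RULES.get(field, (None, None))
        if a is None or c is None:
            # Only one feed carries the field, so there is nothing to compare.
            resolved[field] = c if a is None else a
        elif agrees(a, c, tolerance):
            # Within tolerance: keep the custodian value by convention (an assumption).
            resolved[field] = c
        elif winner is not None:
            resolved[field] = c if winner == "custodian" else a
        else:
            exceptions.append(ReconException(account_id, field, a, c))
    return resolved, exceptions
```

Fields with a declared winner resolve silently; only disagreements with no applicable rule reach the exception queue, which is what keeps the nightly count manageable.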

3. The duplicate-account problem

A specific reconciliation failure mode that deserves its own treatment: the same logical account is sometimes surfaced by an aggregator under a different identifier than the custodian feed uses. The customer's Schwab account #82145678 might be linked through Plaid as Plaid-account-id P_abc123xyz with a Plaid-internal account number 82145678, but the custodian-direct feed reports it with Schwab's institutional ID SCH-INST-82145678. The platform's account-matching logic has to recognize these as the same account or it will produce a household with double the actual position count.

The matching key can't be the account number alone, because different institutions use overlapping number spaces. The reliable matching key is the (institution_id, account_number) tuple, with a fallback to (institution_id, last_4_account, registration_match, position_overlap) when one source does not expose the full account number. Test data has to include at least some same-account-different-source cases so the matching logic gets exercised.

// Same account, two views
// Plaid view
{
  "plaid_account_id": "P_abc123xyz",
  "institution": { "name": "Charles Schwab", "institution_id": "ins_109511" },
  "mask": "5678",
  "official_name": "INDIVIDUAL Account 82145678",
  "type": "investment",
  "subtype": "brokerage"
}

// Schwab direct view
{
  "schwab_account_number": "82145678",
  "account_type": "INDIVIDUAL_TAXABLE",
  "registration": "JOHN Q SMITH"
}

// Platform's reconciled record
{
  "platform_account_id": "ACC-2025-9821",
  "institution_id": "schwab",
  "account_number": "82145678",
  "match_keys": [
    { "source": "plaid", "key": "P_abc123xyz" },
    { "source": "schwab_direct", "key": "82145678" }
  ],
  "primary_data_source": "schwab_direct",
  "secondary_data_source": "plaid"
}
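A minimal sketch of the matching rule, assuming both views have already been normalized to a shared platform institution ID and a shared registration category. The `AccountView` type, the Jaccard overlap measure, and the 0.8 threshold are illustrative assumptions, not a prescribed algorithm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccountView:
    source: str
    institution_id: str                    # normalized platform ID, e.g. "schwab"
    last4: str
    account_number: Optional[str] = None   # None when the source exposes only a mask
    registration: str = ""                 # normalized category, e.g. "individual"
    symbols: frozenset = frozenset()       # tickers currently held


def position_overlap(a: AccountView, b: AccountView) -> float:
    """Jaccard overlap of held symbols; 0.0 when both views are empty."""
    union = a.symbols | b.symbols
    return len(a.symbols & b.symbols) / len(union) if union else 0.0


def same_account(a: AccountView, b: AccountView, min_overlap: float = 0.8) -> bool:
    """Primary key first, then the weaker composite fallback."""
    if a.institution_id != b.institution_id:
        return False
    if a.account_number and b.account_number:
        return a.account_number == b.account_number
    return (
        a.last4 == b.last4
        and a.registration == b.registration
        and position_overlap(a, b) >= min_overlap
    )


plaid_view = AccountView("plaid", "schwab", "5678", None, "individual",
                         frozenset({"VTI", "VXUS", "BND"}))
schwab_view = AccountView("schwab_direct", "schwab", "5678", "82145678", "individual",
                          frozenset({"VTI", "VXUS", "BND"}))
matched = same_account(plaid_view, schwab_view)
```

The fallback is deliberately conservative: a shared last-four alone never matches, since two accounts at the same institution can collide on the mask.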

What synthetic test data has to look like

The minimum-viable corpus for testing dual-source reconciliation:

| Test scenario | What it exercises |
|---|---|
| Same account from both sources | Account-matching logic; per-field source-of-truth resolution |
| Aggregator-only account | Coverage of accounts the platform doesn't have direct custodian feed for |
| Custodian-only account | Coverage of accounts the user hasn't linked via aggregator |
| Disagreement on pending transactions | Pending-vs-settled reconciliation; aggregator-surfaced pending that custodian shows as already-settled |
| Disagreement on fractional shares | Sub-cent rounding logic; tolerance thresholds for 'agreement' |
| Distribution character override | Year-end 1099-DIV reclassification overriding aggregator's preliminary classification |
| Account-type mismatch | Aggregator misclassifies a Roth IRA as a Traditional IRA; custodian is correct; platform's mismatch-resolution logic |
| Aggregator-link broken | Plaid token expired, Yodlee re-auth required; platform falls back to custodian-direct only and resumes when aggregator returns |

A test corpus that includes only single-source households cannot exercise the reconciliation code paths. Realistic dual-source data — same household, same account, two views — is the test shape that matters.
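One way to guard against a single-source corpus is a coverage check that runs before the reconciliation tests. The sketch below covers only the three source-shape scenarios and assumes each synthetic account record carries a `sources` set; that field name and the scenario labels are hypothetical.

```python
REQUIRED_SHAPES = {"dual_source", "aggregator_only", "custodian_only"}


def classify(sources):
    """Map the set of feeds that report an account to a source-shape label."""
    has_agg = "aggregator" in sources
    has_cust = "custodian" in sources
    if has_agg and has_cust:
        return "dual_source"
    if has_agg:
        return "aggregator_only"
    if has_cust:
        return "custodian_only"
    return "unsourced"


def missing_shapes(corpus):
    """Return the required source shapes that no account in the corpus exhibits."""
    seen = {classify(account["sources"]) for account in corpus}
    return sorted(REQUIRED_SHAPES - seen)


corpus = [
    {"id": "A1", "sources": {"aggregator"}},
    {"id": "A2", "sources": {"custodian"}},
]
gaps = missing_shapes(corpus)
```

Here `gaps` reports that the dual-source shape is absent, so a test run against this corpus would never reach the matching or per-field resolution code.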

The year-end reconciliation cycle

The single highest-stakes reconciliation moment is the year-end 1099-DIV / 1099-B reconciliation. Aggregator feeds may have classified distributions, basis, and gains throughout the year based on preliminary information. The year-end 1099 is the canonical correction — and the gap between aggregator-classified data and 1099-corrected data is sometimes substantial.

Common year-end corrections:

  • Distribution character reclassification. A REIT distribution classified as "qualified dividend" through the year is reclassified at year-end to ~50% qualified, ~30% return-of-capital, ~15% ordinary, ~5% Section 199A. The 1099-DIV is the only correct source.
  • Wash-sale adjustments. Wash sales identified by the custodian's tax-lot software (FIFO-applied across the year) may differ from what the aggregator reported in real-time. The 1099-B is canonical.
  • Cost-basis corrections. Non-covered lots may have updated cost basis at year-end (customer-supplied documentation, custodian back-research). The 1099-B carries the corrected basis.

Test data has to include the year-end correction step — both the pre-correction state (what the aggregator showed throughout the year) and the post-correction state (what the 1099 ultimately said). Platforms tested only against post-correction data will fail the in-flight reconciliation logic that produces tax-aware insights mid-year.
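The pre-correction and post-correction states can be generated from one distribution amount plus the year-end character split. The sketch below uses the illustrative REIT percentages from the list above; the function name, the character labels, and the policy of assigning rounding residue to the largest bucket are assumptions rather than a custodian convention.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def reclassify(amount, shares):
    """Split a distribution across characters so the parts sum exactly to the amount."""
    if sum(shares.values()) != Decimal("1"):
        raise ValueError("character shares must sum to 1")
    parts = {
        character: (amount * share).quantize(CENT, rounding=ROUND_HALF_UP)
        for character, share in shares.items()
    }
    # Rounding each part independently can drift by a cent or two;
    # push the residue onto the largest bucket (an assumed policy).
    residue = amount - sum(parts.values())
    largest = max(parts, key=parts.get)
    parts[largest] += residue
    return parts


# Pre-correction: what the aggregator showed during the year.
pre_correction = {"qualified": Decimal("123.45")}

# Post-correction: the year-end split, using the article's illustrative percentages.
post_correction = reclassify(
    Decimal("123.45"),
    {
        "qualified": Decimal("0.50"),
        "return_of_capital": Decimal("0.30"),
        "ordinary": Decimal("0.15"),
        "section_199a": Decimal("0.05"),
    },
)
```

A synthetic corpus would store both records for the same distribution event, so tests can assert on mid-year behavior against `pre_correction` and on tax-form output against `post_correction`.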

How this shows up in our catalog

The institutional bundles in the WealthSynth catalog ship with paired aggregator-and-custodian views per account where both sources are configured. The disagreements are deliberately calibrated: 5–10% of transactions in pending-vs-settled disagreement at any snapshot, sub-cent fractional-share rounding gaps on every fractional position, multi-character distribution classification on REIT/MLP/BDC holdings with year-end 1099-DIV correction events. The matching keys are intentionally non-trivial — test households include at least some accounts where naive account-number matching would fail.

For the broader integration context, see Aggregator & Custodian Integration and the procurement-side Aggregator API vs. direct custodian feed comparison. For the per-source data shapes, see Modeling Plaid, Yodlee, Akoya, and MX outputs and Custodian-specific data quirks. For the time-series-fidelity properties that the dual-source reconciliation touches, see Time-Series Fidelity in Synthetic Wealth Data.