Playbook: Load-Testing Wealth-Tech APIs at 10,000 Synthetic Households

Published May 8, 2026

Most wealth-tech load tests follow a predictable pattern: a developer runs k6 against the API at 1,000 RPS for 5 minutes with a 50-household fixture, sees latency stay flat, and declares the system 'load-tested.' That test misses the interesting failure modes: hot-account contention, query-plan degradation under realistic data volume, and the per-customer-state dependencies that don't surface until the corpus is big enough. This playbook covers a load-test pattern that catches those production-readiness issues, using a 10K synthetic-household corpus structured for the test.

Sizing: why 10K and not 1K or 100K

1K is too small to exercise query-plan transitions, index-vs-scan thresholds, or the long-tail customer behaviors that surface at 1-in-1000 frequency. 100K stresses the test infrastructure beyond what most pre-prod environments can sustain and produces test runs that take days. 10K is the sweet spot: large enough to exercise realistic behavior distribution, small enough that the test itself doesn't become an operational burden.
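The 1-in-1000 argument can be made concrete with a little arithmetic. The sketch below assumes each household shows a rare behavior independently with probability p, which is a simplification, and the function names are illustrative.

```javascript
// Expected number of households in a corpus of size n that show a behavior
// occurring independently with probability p.
function expectedCount(n, p) {
  return n * p;
}

// Probability that the corpus contains at least one such household.
function probAtLeastOne(n, p) {
  return 1 - Math.pow(1 - p, n);
}

const p = 1 / 1000; // the 1-in-1000 long-tail behavior
for (const n of [1000, 10000, 100000]) {
  console.log(
    `n=${n} expected=${expectedCount(n, p)} P(at least one)=${probAtLeastOne(n, p).toFixed(3)}`
  );
}
```

Under that assumption, a 1K corpus is expected to hold a single such household and has only about a 63% chance of holding any at all, while a 10K corpus is expected to hold about ten.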

If your production scale is materially above 100K, run the 10K test as the standard nightly and a 100K test pre-release. The 10K test catches 80%+ of production-readiness issues; the 100K test catches the remainder.

Corpus structure: what the synthetic dataset must contain

A 10K corpus dropped onto an API doesn't load-test the API — it load-tests an unrealistic uniform distribution. The corpus must reflect the production behavior distribution.

· Account-count distribution: the modal household has 2-3 accounts, but 5% have 8+. Hot-account-per-customer logic depends on this.
· Position-count distribution: most households have <50 positions, but 10% have 200+. Query patterns differ at the long tail.
· Activity distribution: most households generate <10 events / day, but 1% generate 100+. This tail stresses webhook subscriber patterns.
· Multi-currency representation: a small share of households, but disproportionate code-path coverage.
· Document-attachment distribution: most households have a few documents, some have hundreds (consolidated tax filings, etc.). Object-store access patterns matter.
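One way to check that a custom corpus carries these tails is to generate it from explicit shares and then measure them. The following is a minimal sketch, not the WealthSchema generator: the tail shares (5%, 10%, 1%) come from the list above, while the value ranges inside each bucket, the 3% multi-currency share, and every function name are assumptions made for illustration.

```javascript
// Small seeded PRNG (mulberry32) so the same seed always yields the same corpus.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Uniform integer in [lo, hi], inclusive.
function randInt(rng, lo, hi) {
  return lo + Math.floor(rng() * (hi - lo + 1));
}

// Random 16-hex-digit ID, so keys are not sequential.
function randomId(rng) {
  const part = () => Math.floor(rng() * 0x100000000).toString(16).padStart(8, '0');
  return part() + part();
}

// Non-tail account counts, weighted so that 2-3 accounts is the mode (assumed weights).
const COMMON_ACCOUNT_COUNTS = [1, 2, 2, 3, 3, 4];

function makeHousehold(rng) {
  return {
    id: randomId(rng),
    accounts: rng() < 0.05 ? randInt(rng, 8, 15) : COMMON_ACCOUNT_COUNTS[randInt(rng, 0, 5)],
    positions: rng() < 0.10 ? randInt(rng, 200, 600) : randInt(rng, 1, 49),
    eventsPerDay: rng() < 0.01 ? randInt(rng, 100, 300) : randInt(rng, 0, 9),
    multiCurrency: rng() < 0.03, // assumed share; the text only says "small"
  };
}

function makeCorpus(seed, size) {
  const rng = mulberry32(seed);
  return Array.from({ length: size }, () => makeHousehold(rng));
}

// Fraction of households matching a predicate.
function share(corpus, predicate) {
  return corpus.filter(predicate).length / corpus.length;
}

const corpus = makeCorpus(42, 10000);
console.log('8+ accounts:', share(corpus, (h) => h.accounts >= 8));
console.log('200+ positions:', share(corpus, (h) => h.positions >= 200));
console.log('100+ events/day:', share(corpus, (h) => h.eventsPerDay >= 100));
```

Measuring the shares after generation, rather than trusting the generator's parameters, is the part worth keeping even if the generator itself looks nothing like this.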

Concurrency profile: shape the load like production traffic

Constant-rate load tests are the easy case to pass and an unrealistic model of production. Real production load bursts at quarter-end, at market close, and at the 9:30am ET market open. The load profile should reflect this.

The profile we recommend: a baseline rate that ramps over 10 minutes (mimics a workday warm-up), holds for 30 minutes, transitions to a 5x burst for 5 minutes (mimics market open or quarterly close), drops back to baseline for 20 minutes, then ramps down over 5 minutes. Including the 2-minute ramp into the burst and the 3-minute tail out of it, the total run is 75 minutes, which is enough to surface query-plan transitions, GC pauses on JVM-based services, and connection-pool saturation.

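// k6 load profile: warm-up, baseline, 5x burst, recovery, ramp down (75 minutes)
import http from 'k6/http';

// BASE_URL and the /health path are placeholders; point them at your staging API.
const BASE_URL = __ENV.BASE_URL || 'https://staging.example.com';

export const options = {
  // Top-level stages ramp virtual users (VUs), so each target is a VU count.
  stages: [
    { duration: '10m', target: 100 },  // ramp up
    { duration: '30m', target: 100 },  // hold baseline
    { duration: '2m',  target: 500 },  // burst ramp
    { duration: '5m',  target: 500 },  // burst hold
    { duration: '3m',  target: 100 },  // burst tail
    { duration: '20m', target: 100 },  // post-burst hold
    { duration: '5m',  target: 0 },    // ramp down
  ],
  thresholds: {
    // Abort the run as soon as p99 latency crosses 500 ms.
    http_req_duration: [{
      threshold: 'p(99) < 500',
      abortOnFail: true,
    }],
  },
};

// Each VU runs this function in a loop for the duration of the test.
export default function () {
  http.get(`${BASE_URL}/health`);
}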
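The stages above ramp virtual users, whereas the prose describes a request rate. If a rate-driven profile is wanted, k6's `ramping-arrival-rate` executor is one option. The sketch below writes that scenario as a plain object and checks its arithmetic; in a k6 script the object would be exported as `options`. The scenario name, the rates, and the VU pool sizes are illustrative assumptions, not recommendations.

```javascript
// Rate-based variant of the same 75-minute profile.
const options = {
  scenarios: {
    market_day: {
      executor: 'ramping-arrival-rate',
      startRate: 0,
      timeUnit: '1s',        // stage targets are iterations started per second
      preAllocatedVUs: 200,  // assumed pool size
      maxVUs: 2000,          // assumed ceiling
      stages: [
        { duration: '10m', target: 100 }, // ramp up
        { duration: '30m', target: 100 }, // hold baseline
        { duration: '2m', target: 500 },  // burst ramp
        { duration: '5m', target: 500 },  // burst hold
        { duration: '3m', target: 100 },  // burst tail
        { duration: '20m', target: 100 }, // post-burst hold
        { duration: '5m', target: 0 },    // ramp down
      ],
    },
  },
};

// Parse the minute-only durations used above ("10m" -> 10).
function minutes(duration) {
  const match = /^(\d+)m$/.exec(duration);
  if (!match) throw new Error(`unsupported duration: ${duration}`);
  return Number(match[1]);
}

// Total run length and the ratio of the burst rate to the baseline rate.
function profileStats(stages) {
  const totalMinutes = stages.reduce((sum, s) => sum + minutes(s.duration), 0);
  const targets = stages.map((s) => s.target).filter((t) => t > 0);
  return {
    totalMinutes,
    burstMultiple: Math.max(...targets) / Math.min(...targets),
  };
}

console.log(profileStats(options.scenarios.market_day.stages));
// { totalMinutes: 75, burstMultiple: 5 }
```

A check like this is cheap to keep next to the profile so that edits to the stages do not silently change the run length or the burst ratio.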

Hot-account simulation: the failure mode you're testing for

In production, some accounts get queried more than others: high-AUM clients under active management, accounts that just had a major event, accounts being viewed by a CSR investigating a ticket. Real hot-account distribution is roughly Pareto-shaped: 10% of accounts get 50% of the traffic.

Synthetic load that's uniformly distributed across the 10K-household corpus misses this. Configure the load to follow a Pareto distribution: 10% of households get 50% of requests. This produces hot-account contention that surfaces caching layer issues, lock contention on per-account state, and connection-pool starvation when the same hot account holds connections.
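A simple way to hit the 10%/50% split is a two-tier sampler: mark 10% of households as hot, then send each request to the hot tier with probability 0.5. This is a sketch of that approximation rather than a true continuous Pareto draw, and the function names and the choice of which households are hot are assumptions.

```javascript
// Small seeded PRNG (mulberry32) so test runs are repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Returns a function that picks the household for the next request.
// hotShare: fraction of households treated as hot.
// hotTraffic: fraction of requests routed to the hot tier.
function makeHotColdSampler(householdIds, rng, hotShare = 0.1, hotTraffic = 0.5) {
  const hotCount = Math.max(1, Math.round(householdIds.length * hotShare));
  const hot = householdIds.slice(0, hotCount); // assumes the list is already shuffled
  const cold = householdIds.slice(hotCount);
  return function next() {
    const useHot = cold.length === 0 || rng() < hotTraffic;
    const pool = useHot ? hot : cold;
    return pool[Math.floor(rng() * pool.length)];
  };
}

// Example: 10K households, 100K requests, measure the share hitting the hot tier.
const ids = Array.from({ length: 10000 }, (_, i) => i);
const nextHousehold = makeHotColdSampler(ids, mulberry32(1));
let hotHits = 0;
const requests = 100000;
for (let i = 0; i < requests; i++) {
  if (nextHousehold() < 1000) hotHits++; // the first 1,000 ids form the hot tier
}
console.log('share of requests to hot tier:', hotHits / requests);
```

Inside a k6 script the same sampler could be driven by `Math.random` instead of the seeded generator; the seed is only there to make a sketch like this reproducible.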

Read-vs-write mix

Wealth-tech APIs are typically read-heavy — 90%+ reads is common. But the writes are where most production incidents start. Configure the load to mirror your production read/write mix, then run a separate write-heavy test to specifically stress the write paths.

Write operations to include: account updates, transaction posting, document uploads, profile changes. Each of these has different downstream effects (cache invalidation, webhook fan-out, audit-log generation) that need to be exercised.
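The mix can be expressed as a weight table that each simulated request draws from. The sketch below assumes a 90/10 read/write split for the standard run and an inverted mix for the write-heavy run; the individual weights and operation names are placeholders to be replaced with measured production ratios.

```javascript
// Illustrative weights only; replace with your measured production mix.
const PRODUCTION_MIX = [
  { name: 'read', weight: 90 },
  { name: 'account_update', weight: 3 },
  { name: 'transaction_post', weight: 4 },
  { name: 'document_upload', weight: 1 },
  { name: 'profile_change', weight: 2 },
];

// Separate write-heavy run that deliberately stresses the write paths.
const WRITE_HEAVY_MIX = [
  { name: 'read', weight: 20 },
  { name: 'account_update', weight: 25 },
  { name: 'transaction_post', weight: 30 },
  { name: 'document_upload', weight: 10 },
  { name: 'profile_change', weight: 15 },
];

// Pick an operation name given u, a uniform random number in [0, 1).
function pickOperation(mix, u) {
  const total = mix.reduce((sum, op) => sum + op.weight, 0);
  let remaining = u * total;
  for (const op of mix) {
    if (remaining < op.weight) return op.name;
    remaining -= op.weight;
  }
  return mix[mix.length - 1].name; // guards against floating-point edge cases
}

// Tally 10,000 draws to see the realized mix.
const tally = {};
for (let i = 0; i < 10000; i++) {
  const op = pickOperation(PRODUCTION_MIX, Math.random());
  tally[op] = (tally[op] || 0) + 1;
}
console.log(tally);
```

Keeping the two mixes as data rather than code makes it easy to run the same script twice, once per mix, and compare the results.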

What to measure: the metrics the post-test review consumes

The post-test review wants four metric groups: latency profile, throughput sustained, error rate by error class, and resource utilization at the bottleneck. Don't just report aggregate latency — report by endpoint, by account-segment (hot vs. cold), and by request size.

· Latency by endpoint at p50, p95, p99, p99.9. The p99.9 is where production incidents originate.
· Throughput sustained: peak RPS and the duration it was sustained without latency degradation. 'Peaked at 5000 RPS' is meaningless without 'sustained for 5 minutes.'
· Error rate by class: 4xx (client errors, usually a load-test bug), 5xx (server errors, a production-readiness issue), timeout (often the most informative).
· Resource at bottleneck: CPU, memory, DB connection pool, IOPS. Saturation of one of these is usually the root cause of the failure.
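The latency and error-class breakdowns can be computed from raw request samples. This is a minimal sketch using nearest-rank percentiles; the sample shape (`endpoint`, `ms`, `status`, `timedOut`) is an assumption about what the load tool exports, not a k6 output format.

```javascript
// Nearest-rank percentile over an ascending-sorted array: the smallest value
// with at least p percent of samples at or below it.
function percentile(sortedMs, p) {
  if (sortedMs.length === 0) return NaN;
  // The small epsilon absorbs floating-point error in p * n / 100.
  const rank = Math.ceil((p * sortedMs.length) / 100 - 1e-9);
  const clamped = Math.min(sortedMs.length, Math.max(1, rank));
  return sortedMs[clamped - 1];
}

// Map one sample to the error classes used in the review.
function classify(sample) {
  if (sample.timedOut) return 'timeout';
  if (sample.status >= 500) return '5xx';
  if (sample.status >= 400) return '4xx';
  return 'ok';
}

// Group samples by endpoint and report percentiles plus error-class counts.
function summarize(samples) {
  const byEndpoint = new Map();
  for (const s of samples) {
    if (!byEndpoint.has(s.endpoint)) {
      byEndpoint.set(s.endpoint, { latencies: [], classes: {} });
    }
    const entry = byEndpoint.get(s.endpoint);
    entry.latencies.push(s.ms);
    const cls = classify(s);
    entry.classes[cls] = (entry.classes[cls] || 0) + 1;
  }
  const report = {};
  for (const [endpoint, entry] of byEndpoint) {
    const sorted = entry.latencies.slice().sort((a, b) => a - b);
    report[endpoint] = {
      count: sorted.length,
      p50: percentile(sorted, 50),
      p95: percentile(sorted, 95),
      p99: percentile(sorted, 99),
      p999: percentile(sorted, 99.9),
      classes: entry.classes,
    };
  }
  return report;
}

// Tiny hand-made input; the endpoint paths are hypothetical.
console.log(summarize([
  { endpoint: '/accounts', ms: 12, status: 200, timedOut: false },
  { endpoint: '/accounts', ms: 480, status: 503, timedOut: false },
  { endpoint: '/documents', ms: 30000, status: 0, timedOut: true },
]));
```

The same grouping key can be swapped for account segment (hot vs. cold) or request-size bucket to produce the other breakdowns.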

Hot-key avoidance: the synthetic-data-specific risk

Synthetic corpora generated naively can have hot-keys — IDs that are sequential, timestamps that cluster, hash distributions that aren't uniform. Hot-keys in your test data produce hot-keys in your test load that are artifacts of generation, not realistic patterns.

Verify: ID distribution is uniform across the keyspace, timestamps are spread realistically across the longitudinal window, secondary keys (account types, product mixes) match production's distribution. WealthSchema corpora are generated with these properties; if you're using a custom-generated corpus, validate before load-testing.
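Two cheap checks cover the most common generation artifacts: whether IDs were issued sequentially, and whether keys pile up in one part of the keyspace. The sketch below assumes numeric IDs in a known keyspace; the bucket count and any pass/fail cutoffs are arbitrary choices to tune, and the function names are illustrative.

```javascript
// Fraction of adjacent IDs (in generation order) that differ by exactly 1.
// A value near 1 means the IDs were issued sequentially.
function sequentialFraction(ids) {
  if (ids.length < 2) return 0;
  let steps = 0;
  for (let i = 1; i < ids.length; i++) {
    if (ids[i] - ids[i - 1] === 1) steps++;
  }
  return steps / (ids.length - 1);
}

// Split the keyspace [0, keyspaceSize) into equal buckets and return the ratio
// of the fullest bucket to the mean bucket size. About 1 means evenly spread;
// a value equal to the bucket count means everything landed in one bucket.
function bucketSkew(values, keyspaceSize, buckets = 16) {
  const counts = new Array(buckets).fill(0);
  for (const v of values) {
    const index = Math.floor((v / keyspaceSize) * buckets);
    counts[Math.min(buckets - 1, Math.max(0, index))]++;
  }
  const mean = values.length / buckets;
  return Math.max(...counts) / mean;
}

const KEYSPACE = 2 ** 32;
const sequentialIds = Array.from({ length: 10000 }, (_, i) => i);
const spreadIds = sequentialIds.map((i) => i * 429496); // evenly spaced across the keyspace

console.log(sequentialFraction(sequentialIds), bucketSkew(sequentialIds, KEYSPACE)); // 1 16
console.log(sequentialFraction(spreadIds), bucketSkew(spreadIds, KEYSPACE));
```

The same `bucketSkew` call works for timestamps if the keyspace is replaced by the length of the longitudinal window, which gives a quick read on clustering.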

Post-test: artifact for the production-readiness review

The artifact the review consumes is a 1-page summary plus a detailed appendix. Summary: 'sustained N RPS for X minutes at p99 < Y ms with Z% error rate.' Appendix: per-endpoint detail, hot-account vs. cold-account separation, error class breakdown, bottleneck identification, and the recommended remediation if any thresholds were exceeded.

File the artifact in the same place your other production-readiness reviews live. Re-run the test on a regular cadence: typically nightly during active development, weekly during stable maintenance, and before every release.

Key takeaways

10K is the right scale for a nightly load test: large enough to exercise realistic behavior distribution, small enough that the test isn't its own operational burden.
Uniform-distributed load against a synthetic corpus is unrealistic. Pareto-shaped hot-account access is what catches the cache, lock, and connection-pool issues that bite in production.
Burst profiles (5x for 5 minutes) catch query-plan transitions and GC pauses that constant-rate loads miss. The bursts are where production incidents actually originate.
Hot-key artifacts in synthetic data corrupt load tests if not addressed. Verify ID, timestamp, and secondary-key distributions before relying on the test.

FAQ

Can we run this against the production environment if it has spare capacity?

No. Load tests against production risk customer impact even with capacity. The cost-savings of running against prod aren't worth the customer-trust risk. Run against a production-shaped staging environment with the same database, same caching layer, same API gateway configuration.

What if our staging is materially different from production?

Then load tests in staging are bounded in usefulness. Track the gap explicitly — 'staging has 2 vCPU per service vs. production's 8' — and apply scaling factors when extrapolating. Where staging-prod parity gaps are large, prioritize closing the gap before relying on staging load tests.

How do we reset state between load test runs?

Either restore the database from a snapshot pre-test (cleanest) or restore from the synthetic corpus seed (cheaper). The corpus-from-seed approach requires the corpus generation to be deterministic — same seed produces same corpus. WealthSchema corpora support this; verify your custom corpus does.
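Determinism of a custom generator can be verified by fingerprinting its output: generate twice from the same seed and compare hashes. The sketch below uses a stand-in generator and a 32-bit FNV-1a hash over the serialized corpus; `generateCorpus` is a hypothetical placeholder for your own generator, and its fields are invented for the example.

```javascript
// Small seeded PRNG (mulberry32).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Stand-in for a custom corpus generator (hypothetical fields).
function generateCorpus(seed, size) {
  const rng = mulberry32(seed);
  return Array.from({ length: size }, () => ({
    householdKey: Math.floor(rng() * 0x100000000),
    accounts: 1 + Math.floor(rng() * 4),
  }));
}

// 32-bit FNV-1a hash of a string, returned as 8 hex digits.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, '0');
}

// Fingerprint of the whole corpus; equal fingerprints imply equal serialized output
// up to hash collisions.
function corpusFingerprint(corpus) {
  return fnv1a(JSON.stringify(corpus));
}

const first = corpusFingerprint(generateCorpus(2026, 1000));
const second = corpusFingerprint(generateCorpus(2026, 1000));
console.log(first === second ? 'deterministic' : 'NOT deterministic', first);
```

Recording the fingerprint alongside each test run also makes it possible to confirm later that two runs started from the same data.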

What do we do if the test reveals capacity isn't enough for projected production load?

Three paths: (1) optimize the bottleneck identified in the test; (2) provision more infrastructure and re-test; (3) implement load shedding / rate limiting at the saturated boundary. Most teams do all three in parallel.
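For path (3), a token bucket is one common way to rate-limit at a saturated boundary. The following is a minimal single-process sketch with an injected clock so it can be tested; the class name and parameters are illustrative, and a production limiter would usually live in the API gateway or a shared store rather than in application memory.

```javascript
// Token bucket: allows bursts up to `burst` requests and refills at `ratePerSec`.
class TokenBucket {
  constructor(ratePerSec, burst, nowMs = 0) {
    this.ratePerSec = ratePerSec;
    this.burst = burst;
    this.tokens = burst; // start full
    this.lastMs = nowMs;
  }

  // Returns true if the request may proceed, false if it should be shed.
  tryAcquire(nowMs) {
    const elapsedSec = Math.max(0, nowMs - this.lastMs) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.ratePerSec);
    this.lastMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Example: 10 requests/second sustained, bursts of up to 5.
const bucket = new TokenBucket(10, 5);
const decisions = [];
for (let i = 0; i < 7; i++) decisions.push(bucket.tryAcquire(0));
console.log(decisions); // [ true, true, true, true, true, false, false ]
```

Re-running the load test with the limiter in place shows whether shed requests come back as fast, explicit rejections instead of timeouts.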

How does this interact with chaos engineering tests?

Run load tests and chaos tests separately first. Once both pass independently, combine: chaos under load is the highest-realism test but the hardest to debug if something fails. Earn the right to combine by passing both individually.

What about distributed-system load tests across multiple regions?

Same playbook applies, multiplied by region count. The key addition is testing region-failover under load — drop one region mid-test and verify the system stays within SLA. This is the failure mode that most often surprises in real outages.

How does the cost compare to using anonymized prod data?

Substantially lower. Anonymized prod data has its own ongoing cost (re-anonymization on each refresh) plus the architectural risk discussed in the prod-to-synthetic migration playbook. Synthetic load-test data is the cheaper long-term path.