Playbook: Migrating Non-Production Environments from Production Data to Synthetic
Most fintech environments started in a world where 'use a sanitized prod copy in staging' was the default. The Safeguards Rule amendments, SOC 2 expectations, and growing confidence in breach-cost modeling have made that posture untenable. The migration to synthetic data isn't a flip; it's a parallel-validation pattern executed across data, integrations, and the team's mental model. This playbook is the 6-week sequence we've seen succeed.
Week 1 — Inventory and target-coverage map
Start by enumerating exactly what 'production data in non-prod' currently means in your environment. Most teams underestimate the surface area by 3-5x. The output of week one is a per-environment, per-system inventory mapped to a target-coverage matrix.
- Per-environment inventory: dev, staging, integration, performance, demo, sales-engineering. Each environment may have different data needs.
- Per-system inventory within each environment: relational stores, document stores, search indices, cache, observability, vendor sandboxes.
- Per-data-class identification: customer profile, account, transaction, document, audit-log. Different classes need different synthetic strategies.
- Target-coverage matrix: for each environment × system × data-class, the target coverage (full synthetic, masked, deleted, retain-with-tighter-controls).
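The target-coverage matrix can be kept as plain data so that undecided cells are easy to list. The following is a minimal sketch, assuming the dimension values from the lists above; the `Target` enum, the `undecided_cells` helper, and the sample decisions are hypothetical names introduced for illustration.

```python
from enum import Enum
from itertools import product


class Target(Enum):
    """Target coverage options named in the week-1 matrix."""
    FULL_SYNTHETIC = "full synthetic"
    MASKED = "masked"
    DELETED = "deleted"
    RETAIN_TIGHTER_CONTROLS = "retain-with-tighter-controls"


# Dimensions taken from the inventory lists above; trim to what you actually run.
ENVIRONMENTS = ["dev", "staging", "integration", "performance", "demo", "sales-engineering"]
SYSTEMS = ["relational", "document", "search", "cache", "observability", "vendor-sandbox"]
DATA_CLASSES = ["customer-profile", "account", "transaction", "document", "audit-log"]


def undecided_cells(decisions):
    """Return every environment x system x data-class cell with no target yet."""
    return [
        cell
        for cell in product(ENVIRONMENTS, SYSTEMS, DATA_CLASSES)
        if cell not in decisions
    ]


# Hypothetical decisions; real ones come out of the week-1 inventory.
decisions = {
    ("dev", "relational", "customer-profile"): Target.FULL_SYNTHETIC,
    ("dev", "cache", "customer-profile"): Target.DELETED,
}

remaining = undecided_cells(decisions)
total = len(ENVIRONMENTS) * len(SYSTEMS) * len(DATA_CLASSES)
print(f"{len(remaining)} of {total} cells still need a target")
```

Not every cell will apply to every environment; cells that do not exist in your setup can be recorded as deleted or dropped from the dimension lists.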
Week 2 — Synthetic corpus selection / generation
Choose between a packaged corpus (faster, less customization), a custom-generated corpus (slower, exact-fit), or a hybrid (packaged + supplementary archetypes). For most fintechs, the hybrid is right — start with a packaged corpus matching the modal customer base, then add custom archetypes for the long-tail edge cases your specific product hits.
The corpus must structurally match the data shape currently in non-prod. Schema-mapping is where most migrations stall. A packaged corpus that matches your industry's modal schema may still need 10-20% of fields mapped or supplemented for your specific schema.
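Before committing to a corpus, it can help to measure how much of your schema a candidate mapping actually covers. The sketch below assumes the mapping is a flat dictionary from your field path to the corpus field path; `mapping_coverage` is a hypothetical helper, and the field paths reuse the ones in this playbook's sample mapping document.

```python
def mapping_coverage(schema_fields, mapping):
    """Split in-scope schema fields into mapped and unmapped, plus the unmapped share."""
    mapped = sorted(f for f in schema_fields if f in mapping)
    unmapped = sorted(f for f in schema_fields if f not in mapping)
    share = len(unmapped) / len(schema_fields) if schema_fields else 0.0
    return mapped, unmapped, share


# Hypothetical in-scope fields (in practice, read these from your schema).
schema_fields = {
    "users.tax_id",
    "users.tax_id_type",
    "accounts.account_type",
    "users.referral_source",
    "accounts.advisor_notes",
}

# Your field path -> synthetic corpus field path.
mapping = {
    "users.tax_id": "household.demographics.tax_id",
    "users.tax_id_type": "household.demographics.tax_id_type",
    "accounts.account_type": "household.accounts[].account_type",
}

mapped, unmapped, share = mapping_coverage(schema_fields, mapping)
print(f"unmapped: {unmapped} ({share:.0%} of in-scope fields)")
```

The unmapped list is the work queue for custom mapping or supplementary generation; the toy numbers here are not representative of the 10-20% figure.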
// Sample mapping document structure
{
  "your_field_path": "synthetic_corpus_field_path",
  "users.tax_id": "household.demographics.tax_id",
  "users.tax_id_type": "household.demographics.tax_id_type",
  "accounts.account_type": "household.accounts[].account_type",
  // ... continue for all in-scope fields
  "_unmapped_fields": [
    "users.referral_source",   // Custom field
    "accounts.advisor_notes"   // Custom field
  ]
}
Week 3 — Parallel validation in shadow environment
Stand up a shadow environment loaded with the synthetic corpus alongside the existing production-data-in-non-prod environment. Every test that runs against the existing environment runs against the shadow. Compare results.
The goal is to identify tests that pass against production data but fail against synthetic — these are tests that have grown to depend on real-data idiosyncrasies (specific customer IDs hardcoded, specific transaction patterns assumed). Each one needs to be either updated to be data-agnostic or supplemented with a custom synthetic case that satisfies it.
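The comparison itself can be as simple as a diff of pass/fail results between the two runs. This is a minimal sketch, assuming each run is exported as a dictionary of test name to pass/fail; `data_dependent_tests` and the test names are hypothetical.

```python
def data_dependent_tests(baseline, shadow):
    """Tests that pass on the existing environment but fail, or never ran, on the shadow."""
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not shadow.get(name, False)
    )


# Hypothetical results: True means the test passed.
baseline = {"test_login": True, "test_statement_export": True, "test_wire_limits": False}
shadow = {"test_login": True, "test_statement_export": False, "test_wire_limits": False}

suspects = data_dependent_tests(baseline, shadow)
print(suspects)  # ['test_statement_export']
```

Tests that fail in both environments are excluded on purpose: they are broken regardless of the data and belong in a separate queue.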
Week 4 — Integration rewiring
External integrations (CRM sandbox, custodian sandbox, payment processor sandbox) often have data references that propagate through to non-prod. Your synthetic corpus needs to be reflected in the sandbox-side data, or the integration test will silently use a record that doesn't exist on the synthetic side.
Three integration patterns to evaluate: (1) the integration uses your IDs and you control the mapping — easiest, just map; (2) the integration provides its own IDs that get stamped into your data — you must seed the sandbox with synthetic-corresponding records; (3) the integration uses a real-data lookup against a third-party — you need a sandbox-mode toggle on the integration side.
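The three patterns can be turned into a small triage helper for the integration review. The sketch below is one possible encoding; the `Pattern` enum, the two yes/no questions, and the precedence between them (a third-party real-data lookup is checked first) are assumptions, not part of the playbook.

```python
from enum import Enum


class Pattern(Enum):
    OWN_IDS = 1             # integration uses your IDs, you control the mapping
    VENDOR_IDS = 2          # integration stamps its own IDs into your data
    THIRD_PARTY_LOOKUP = 3  # integration does a real-data lookup against a third party


ACTIONS = {
    Pattern.OWN_IDS: "map your synthetic IDs",
    Pattern.VENDOR_IDS: "seed the sandbox with synthetic-corresponding records",
    Pattern.THIRD_PARTY_LOOKUP: "require a sandbox-mode toggle on the integration side",
}


def classify(vendor_issues_ids, real_data_lookup):
    """Pick the pattern from two yes/no answers gathered per integration."""
    if real_data_lookup:
        return Pattern.THIRD_PARTY_LOOKUP
    if vendor_issues_ids:
        return Pattern.VENDOR_IDS
    return Pattern.OWN_IDS


# Hypothetical integrations: (vendor issues IDs?, real-data lookup?).
integrations = {
    "crm-sandbox": (False, False),
    "custodian-sandbox": (True, False),
    "payment-processor-sandbox": (True, True),
}

for name, answers in integrations.items():
    pattern = classify(*answers)
    print(f"{name}: pattern {pattern.value}, action: {ACTIONS[pattern]}")
```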
Week 5 — Cutover with rollback gate
Cut over by environment, not by system, and not all at once. Recommended order: dev (lowest risk), then performance and integration, then staging, then demo and sales-engineering. Each cutover has a 7-day stability gate before the next.
During cutover, retain the production-data-in-non-prod environment in read-only mode as a fallback. If the synthetic environment has a critical issue in the first 7 days, you can flip back. After the stability gate, the read-only fallback is decommissioned per the Safeguards Rule's secure-disposal requirement.
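The cutover order and the stability gate translate directly into a schedule. The following sketch computes the earliest cutover date per environment, assuming every gate passes without a rollback; the start date is arbitrary and `cutover_schedule` is a hypothetical helper.

```python
from datetime import date, timedelta

# Waves in the recommended order; environments in the same wave cut over together.
WAVES = [
    ["dev"],
    ["performance", "integration"],
    ["staging"],
    ["demo", "sales-engineering"],
]
GATE = timedelta(days=7)  # stability gate between consecutive waves


def cutover_schedule(start, waves=WAVES, gate=GATE):
    """Earliest cutover date per environment, assuming each gate passes cleanly."""
    schedule = {}
    for index, wave in enumerate(waves):
        for env in wave:
            schedule[env] = start + index * gate
    return schedule


schedule = cutover_schedule(date(2025, 1, 6))  # arbitrary example start date
for env, day in schedule.items():
    print(f"{env}: {day.isoformat()}")
```

With these defaults the last wave starts 21 days after the first, and a rollback in any wave pushes every later date out by at least the length of the repeated gate.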
Week 6 — Decommission and document
Decommission the production-data-in-non-prod environments per documented secure-disposal procedure. Update the data inventory map (week-1 deliverable) to reflect the new state. Document the per-environment synthetic-data substitution pattern in the SOC 2 / Safeguards Rule program documentation.
The completion criterion is that an auditor can read the data inventory, confirm that no non-production environment holds customer information it does not need, and trace the substitution decision for each environment back to a documented rationale.
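A completion check along these lines can be run against the updated inventory before the auditor sees it. This is a sketch under assumed field names (`environment`, `holds_customer_info`, `retention_justification`, `rationale`); none of them come from a standard, and the sample rows are invented.

```python
def audit_gaps(inventory):
    """List (environment, problem) pairs that would block sign-off."""
    gaps = []
    for row in inventory:
        if not row.get("rationale"):
            gaps.append((row["environment"], "no documented rationale"))
        if row.get("holds_customer_info") and not row.get("retention_justification"):
            gaps.append(
                (row["environment"], "customer information retained without justification")
            )
    return gaps


# Hypothetical inventory rows after decommissioning.
inventory = [
    {"environment": "dev", "holds_customer_info": False,
     "rationale": "full synthetic corpus, no real-data dependency"},
    {"environment": "staging", "holds_customer_info": True, "rationale": ""},
    {"environment": "demo", "holds_customer_info": False,
     "rationale": "curated synthetic subset"},
]

for environment, problem in audit_gaps(inventory):
    print(f"{environment}: {problem}")
```

An empty result does not replace the auditor's review; it only confirms that every row has the two fields the completion criterion depends on.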
Key takeaways
- The migration is parallel-validation followed by cutover, not a flip. Most teams that fail this migration tried to flip and discovered that downstream tests had grown to depend on real-data idiosyncrasies.
- Schema mapping is where most migrations stall. Plan 10-20% custom-mapping work even with a well-matched packaged corpus.
- Integration rewiring is invisible until it breaks. Sandbox-side data references propagate quietly — explicitly evaluate every external integration in week 4.
- Retain the production-data fallback in read-only mode through the stability gate, then decommission it per the Safeguards Rule's secure-disposal requirement. The decommission record is the audit artifact.
FAQ
What if the production data has been in non-prod long enough that nobody knows what the original sanitization rules were?
This is common. Treat it as discovery — week 1 inventory should explicitly list 'sanitization rules: undocumented' for any system where this is true. The migration becomes higher-priority for those systems because they represent the largest residual exposure.
Can we do this faster than 6 weeks?
For a single environment with a single integration, yes — 2-3 weeks. The 6-week version is for the full multi-environment, multi-integration scope that mid-stage fintechs typically have. The cost of moving faster than the parallel-validation step allows is rediscovering test-data idiosyncrasies in production.
How do we handle test data that's tied to specific real customer scenarios reproduced from prod issues?
Two paths: (1) recreate the scenario as a custom synthetic case, capturing only the structural pattern and not the customer; (2) document the scenario in the test-data inventory with a 'scenario captured but not in synthetic' note so future debugging can re-create it. Option 1 is cleaner; option 2 is sometimes necessary under time pressure.
Does the synthetic corpus need to match production volume?
Match production volume in performance and integration environments — that's where realistic load matters. Dev and demo can use a smaller curated subset. Staging is environment-dependent; if staging is your final pre-prod gate, match volume.
What about data subject access requests (DSARs) under CCPA / GDPR? Do they apply to non-prod?
Yes. If your non-prod environment holds identifiable customer data, that data is in scope for DSARs. Migrating to synthetic removes this exposure once no identifiable customer data remains in non-prod, which is a side benefit of the migration. Document this in the privacy program update.
How does this interact with the SOC 2 audit window?
Plan the migration to complete before the start of the audit window if possible. Mid-window migrations are auditable but require explicit documentation of the change to controls, the testing of the new controls, and any gap during the change.
Is there a model contract clause for vendors that should mirror this migration?
Yes — vendor contracts should require either fully synthetic data in vendor non-prod environments or written attestation that production data is not retained in vendor non-prod. The migration is a good moment to push this into vendor renegotiations.