GLBA, GDPR, and CCPA — why fully synthetic data sits outside personal-data regimes

WealthSchema Staff | Compliance & legal | May 8, 2026 | 7 min read

Every fintech team that adopts synthetic data goes through the same conversation with their compliance counsel. "We want to use synthetic data to back the new system. Does it count as personal data under our regulatory framework?" The answer is, almost universally, no — but the path from "no" to a written sign-off is bumpy if the team doesn't know what counsel needs to see.

This article is the working reference we hand fintech engineering leaders to short-circuit that conversation. It walks through GLBA, GDPR, and CCPA in turn, identifies the decisive language in each, and provides a documentation pattern designed to hold up in compliance reviews.

The underlying principle

All three regimes share a structural pattern. Each defines a category of regulated information based on the information's relationship to a real natural person. GLBA covers Non-Public Personal Information ("NPI") about real consumers; GDPR covers personal data about identified or identifiable natural persons; CCPA covers personal information about consumers in California.

In each case, the regulated information is information about a real person. Synthetic data, correctly produced, describes entities that do not exist in any registry, payroll system, custodial account, or property record. There is no real person to which the information could be linked. The information therefore falls outside the scope of all three regimes by definition.

The principle is straightforward. The execution is where the compliance review tends to bog down, because counsel has to satisfy themselves that the synthetic data really is non-derivable from real records. The documentation pattern below is what makes that satisfaction practical.

GLBA

The Gramm-Leach-Bliley Act regulates financial institutions' handling of "Nonpublic Personal Information": information about a consumer that is provided to the institution in connection with a financial product or service. The FTC implements the Act through the Privacy Rule (16 CFR Part 313), which defines the term, and the Safeguards Rule (16 CFR Part 314), which applies to non-bank financial institutions; the OCC, FDIC, FRB, NCUA, and SEC maintain parallel rules for institutions in their respective jurisdictions.

The decisive language is in 16 CFR § 313.3(n)(1): "Nonpublic personal information means: (i) Personally identifiable financial information; and (ii) Any list, description, or other grouping of consumers...derived using any personally identifiable financial information that is not publicly available." NPI is, by definition, information about real consumers.

The compliance documentation we typically see succeed for GLBA review:

A written certification from the synthetic-data vendor that the dataset describes no real consumers and was not derived from real-consumer records.
A description of the source data used to derive the synthetic distribution (public aggregates only — FRB SCF, IRS SOI, BLS CES — not licensed real records).
A risk assessment that confirms the dataset cannot be reverse-engineered to identify real individuals via auxiliary data.
A formal scope determination by the institution's compliance officer that the dataset is outside Safeguards Rule scope.

The document set is short. The scope determination is usually a one-page memo. The memo carries through every subsequent audit and any GLBA-related due diligence.

GDPR

The General Data Protection Regulation defines personal data in Article 4(1) as "any information relating to an identified or identifiable natural person." The decisive question is whether synthetic data meets the "identifiable" prong. Article 4(1) defines identifiable broadly, but Recital 26 narrows the practical scope: "to determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used...by the controller or by another person to identify the natural person directly or indirectly."

The practical test is the "means reasonably likely" test, often shorthanded as the motivated intruder test. Could a reasonably skilled and motivated attacker, using available auxiliary data, link the synthetic record back to a real individual?

For a correctly produced synthetic dataset — generated from public aggregates, with no real-record provenance, with structurally non-joinable identifiers — the answer is no. There is no real individual to identify; the records describe entities that do not exist. The motivated intruder has nothing to attack.
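
Counsel will usually still want that answer demonstrated rather than asserted. One slice of a motivated-intruder analysis can be made mechanical: the Python sketch below flags synthetic records whose quasi-identifier tuple matches exactly one record in a reference set of real records, which is the situation a linkage attack exploits. The field names (`zip3`, `birth_year`, `household_size`), the function names, and the sample records are hypothetical, and an exact-match check does not cover fuzzy matching or membership inference.

```python
from collections import Counter

# Hypothetical quasi-identifier fields; pick the ones an intruder could
# plausibly obtain from auxiliary data.
QUASI_IDENTIFIERS = ("zip3", "birth_year", "household_size")


def qi_key(record, fields=QUASI_IDENTIFIERS):
    """Project a record onto its quasi-identifier tuple."""
    return tuple(record[f] for f in fields)


def linkage_hits(synthetic, real, fields=QUASI_IDENTIFIERS):
    """Return synthetic records whose quasi-identifier tuple matches
    exactly one real record (a unique, and therefore linkable, match)."""
    real_counts = Counter(qi_key(r, fields) for r in real)
    return [s for s in synthetic if real_counts.get(qi_key(s, fields), 0) == 1]


# Illustrative data only.
real = [
    {"zip3": "941", "birth_year": 1980, "household_size": 3},
    {"zip3": "941", "birth_year": 1980, "household_size": 3},
    {"zip3": "100", "birth_year": 1975, "household_size": 1},
]
synthetic = [
    {"zip3": "941", "birth_year": 1980, "household_size": 3},  # matches two real records
    {"zip3": "100", "birth_year": 1975, "household_size": 1},  # matches exactly one
    {"zip3": "606", "birth_year": 1990, "household_size": 2},  # matches none
]

hits = linkage_hits(synthetic, real)
print(f"{len(hits)} synthetic record(s) need manual review")
```

A hit does not show that a record was derived from a real person; coincidental matches are expected when both datasets follow the same population statistics. It only marks records the risk assessment should discuss explicitly.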

Data category	GDPR scope	What changes	What stays the same
Real consumer data	In scope (Article 4(1))	All controller/processor obligations apply, DPA required, lawful basis required	—
Pseudonymized real data	Still in scope (Recital 26, second sentence)	Reduced rather than eliminated obligations	Re-identification risk is the operative concern
Anonymized real data (true anonymization)	Out of scope (Recital 26, fifth sentence)	GDPR doesn't apply	Burden of proof is on the controller to demonstrate true anonymization
Fully synthetic data (no real provenance)	Out of scope (no data subject)	GDPR doesn't apply	Documentation of non-derivability is the audit artifact

The compliance documentation for GDPR review:

A Data Protection Impact Assessment (DPIA) confirming that the dataset is outside GDPR scope on the basis of Article 4(1) and Recital 26.
A technical description of the generation pipeline demonstrating no real-record provenance.
A risk assessment of re-identification scenarios using motivated-intruder analysis (auxiliary data, linkage attacks, membership inference).
A processor-side certification from the synthetic-data vendor.

Counsel will sometimes ask whether the generation process itself involves processing of personal data. If your vendor trained generative models on real customer records, the answer changes — the training data was personal data, the model is downstream of personal data, and there is a regulatory chain of custody to address. Vendors that derive their archetype distributions from public aggregates and use prompted LLMs (which themselves were trained on public data) avoid this chain entirely.

CCPA / CPRA

The California Consumer Privacy Act (and the California Privacy Rights Act, which amends it) regulates the handling of "personal information" about California consumers. CCPA defines personal information at Cal. Civ. Code § 1798.140(v)(1) (subdivision (o)(1) before the CPRA amendments renumbered the section) as "information that identifies, relates to, describes, is reasonably capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household."

The structure is similar to GDPR: regulated information is information that can be linked to a real consumer. Synthetic data describing non-existent entities is not within scope.

CCPA also includes an explicit carve-out for deidentified information at § 1798.140(m) (subdivision (h) before the CPRA amendments): "Deidentified means information that cannot reasonably be used to infer information about, or otherwise be linked to, a particular consumer." The carve-out is narrower than the GDPR Recital 26 exclusion, because it is conditioned on reasonable measures against re-identification, a public commitment not to re-identify, and contractual obligations on recipients. Synthetic data, with no consumer to identify, sits even further outside scope than deidentified data.

Other regimes worth knowing about

The pattern repeats across most modern privacy regimes. The same basic conclusion is reached under:

Brazil's LGPD: defines personal data similarly to GDPR, so the identifiability analysis used for GDPR Article 4(1) carries over.
UK GDPR: substantively identical to EU GDPR post-Brexit; the ICO has confirmed Recital 26 carries through.
PIPEDA (Canada): uses an "identifiability" test; OPC guidance treats true synthetic data as outside scope.
PIPL (China): narrower scope than GDPR, but synthetic data with no real-person derivation is uncontroversially outside.
HIPAA (US health): not directly applicable to fintech, but the two de-identification methods (Safe Harbor and Expert Determination) are useful reference points; synthetic patient data that satisfies either method is a related precedent.

Some sector-specific regimes go further than the general privacy laws. Financial institutions subject to bank examination still have to address operational-risk and model-risk concerns about synthetic data even when privacy concerns are resolved. SR 11-7 (US bank model-risk management) has specific requirements that are outside the privacy frame entirely.

The documentation package that works

Across the three regimes, the documentation that consistently moves a review from "we need to think about this" to "approved" is:

Artifact 1: Vendor certification of non-derivability
A signed statement from the synthetic-data vendor that the dataset describes no real persons and was not produced by training on real-person records. One page.

Artifact 2: Source data inventory
List of every input to the generation pipeline: public aggregates, regulatory filings, archetype definitions. Demonstrates the absence of real-record provenance.

Artifact 3: Re-identification risk assessment
Written analysis of motivated-intruder scenarios. Should explicitly address linkage attacks, membership inference, and tail-record risk.

Artifact 4: Internal scope determination
Memo from the institution's compliance officer concluding the dataset is outside the relevant regulatory scope. The artifact that carries through to future audits.

Artifact 5: Operational controls description
Even though regulatory scope doesn't apply, most institutions apply NPI-grade controls operationally. A short description of the controls that do apply (encryption at rest, access controls, retention policy) closes the operational-risk loop.

The package totals 5–10 pages. The review usually takes one meeting, sometimes two. The artifacts carry through to subsequent audits without re-litigation.
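
Teams that track the package alongside their code sometimes keep a small manifest and check it before the review meeting. The following Python sketch is one way that could look, not a prescribed format: the artifact keys, the `DocumentationPackage` and `SourceInput` classes, the input-kind labels, and the file paths are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# Keys mirror the five artifacts described above; the key names are hypothetical.
REQUIRED_ARTIFACTS = (
    "vendor_certification",
    "source_data_inventory",
    "reidentification_risk_assessment",
    "internal_scope_determination",
    "operational_controls_description",
)

# Input kinds this sketch treats as free of real-record provenance (an assumption).
ALLOWED_INPUT_KINDS = {"public_aggregate", "regulatory_filing", "archetype_definition"}


@dataclass
class SourceInput:
    name: str
    kind: str  # one of ALLOWED_INPUT_KINDS, or anything else to be flagged


@dataclass
class DocumentationPackage:
    artifacts: dict = field(default_factory=dict)  # artifact key -> document path
    inputs: list = field(default_factory=list)     # SourceInput entries


def review_gaps(package):
    """List missing artifacts and pipeline inputs that are not on the allowed list."""
    gaps = [
        f"missing artifact: {key}"
        for key in REQUIRED_ARTIFACTS
        if not package.artifacts.get(key)
    ]
    gaps += [
        f"input needs provenance review: {item.name} ({item.kind})"
        for item in package.inputs
        if item.kind not in ALLOWED_INPUT_KINDS
    ]
    return gaps


# Illustrative package with one missing artifact and one questionable input.
package = DocumentationPackage(
    artifacts={
        "vendor_certification": "docs/vendor_cert.pdf",
        "source_data_inventory": "docs/inventory.md",
        "reidentification_risk_assessment": "docs/reid_risk.pdf",
        "internal_scope_determination": "docs/scope_memo.pdf",
    },
    inputs=[
        SourceInput("FRB SCF summary tables", "public_aggregate"),
        SourceInput("customer_book_extract", "real_records"),
    ],
)

gaps = review_gaps(package)
for gap in gaps:
    print(gap)
```

An empty result only means the package is complete on paper; whether each artifact is convincing remains counsel's call.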

What can go wrong

The package fails when the synthetic-data vendor can't make the non-derivability claim convincingly. The most common failure modes:

The vendor used a real-data corpus to train a generative model. This creates a chain of custody from real personal data to synthetic output. The chain is breakable but only with technical analysis (membership inference attacks, formal differential privacy bounds) that adds weeks to the compliance review. If your vendor's generative model was trained on real records, ask them to articulate the privacy-preservation argument in writing — and budget time for counsel to evaluate it.
The synthetic dataset includes records suspiciously close to real households the institution knows about. This sometimes happens when archetype distributions are calibrated against the institution's own customer book. Even if the math is sound, the optics fail the compliance review. Vendors that derive archetypes from public aggregates avoid this.
The vendor cannot describe the generation pipeline. Black-box vendors fail every compliance review because the controller cannot demonstrate what the regulated thing is or isn't. Ask for a one-page architectural description before the procurement decision; if the vendor demurs, the procurement decision is already made.
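
The second failure mode, synthetic records that sit suspiciously close to known households, can be screened for before counsel ever sees the dataset. A common approach is a distance-to-closest-record check; the Python sketch below is a minimal version under stated assumptions. The numeric fields (`income`, `net_worth`, `age`), the min-max scaling, the `0.05` threshold, and the sample records are all illustrative choices, not values from any vendor's pipeline.

```python
import math

FIELDS = ("income", "net_worth", "age")  # hypothetical numeric fields


def field_ranges(records, fields=FIELDS):
    """Min and max of each field, used for min-max scaling."""
    return {f: (min(r[f] for r in records), max(r[f] for r in records)) for f in fields}


def scaled(record, ranges):
    """Scale each field to [0, 1]; a constant field maps to 0.0."""
    return [
        (record[f] - lo) / (hi - lo) if hi > lo else 0.0
        for f, (lo, hi) in ranges.items()
    ]


def closest_real_distance(synth_record, real_records, ranges):
    """Euclidean distance from one synthetic record to its nearest real record."""
    s = scaled(synth_record, ranges)
    return min(math.dist(s, scaled(r, ranges)) for r in real_records)


def too_close(synthetic, real, threshold=0.05):
    """Return synthetic records closer than `threshold` to any real record.
    The threshold is an arbitrary placeholder and needs tuning per dataset."""
    ranges = field_ranges(real + synthetic)
    return [s for s in synthetic if closest_real_distance(s, real, ranges) < threshold]


# Illustrative data only.
real = [
    {"income": 85000, "net_worth": 420000, "age": 47},
    {"income": 52000, "net_worth": 90000, "age": 33},
]
synthetic = [
    {"income": 85100, "net_worth": 421000, "age": 47},    # near-duplicate of a real record
    {"income": 140000, "net_worth": 1200000, "age": 61},  # far from both real records
]

flagged = too_close(synthetic, real)
print(f"{len(flagged)} synthetic record(s) within threshold of a real record")
```

The check compares every synthetic record against every real record, so it is quadratic in dataset size; for large books a nearest-neighbor index would be the usual substitute.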

Key takeaways

GLBA, GDPR, and CCPA all define their regulated information as information about real persons. Synthetic data with no real-person provenance is outside the scope of all three by definition.
GDPR Recital 26 is the decisive language — anonymous information is excluded from GDPR scope, and synthetic data meets the standard when correctly produced.
The documentation package is 5–10 pages: vendor certification, source data inventory, re-identification risk assessment, internal scope determination, operational controls description.
Compliance reviews usually conclude in one or two meetings when the documentation package is in place. Reviews that don't have the package routinely take months.
The package fails when the vendor's pipeline trained on real-person records. Vendors that derive archetypes from public aggregates and use prompt-based generation avoid this failure mode.

Frequently asked questions

Does the institution still need a Data Processing Agreement (DPA) with the synthetic-data vendor?

If the dataset is fully synthetic and the vendor never processes real customer data on the institution's behalf, a DPA is not required by GDPR Article 28. Most institutions execute a DPA anyway as belt-and-suspenders, especially if any future expansion of the relationship might involve real data. The DPA in this case is a contractual artifact, not a regulatory one.

What if our jurisdiction's regulator hasn't issued specific guidance on synthetic data?

Most major regulators (ICO, CNIL, EDPB, FTC, state AGs) have issued either formal guidance or informal commentary that aligns with the Recital-26 analysis. Where formal guidance is absent, the same first-principles analysis carries: regulated information is information about real persons; synthetic data is not about real persons; the regulation does not apply. We recommend filing the scope determination with a citation to whatever closest analog exists and being prepared to update if formal guidance issues later.

Does the analysis change for state insurance regulators or self-regulatory organizations like FINRA?

Mostly the analysis transfers, but the audit process is different. State insurance regulators look at illustration-validation data through the lens of NAIC model regulations; FINRA looks at suitability-test data through the lens of Reg BI / Rule 2111. In both cases, the synthetic-data status is rarely the issue — the issue is whether the synthetic dataset is fit for purpose for the use case (illustration validation, suitability testing). The privacy analysis is upstream and usually uncontroversial.

We have customers in many jurisdictions. Do we need a separate scope determination for each?

Practically, no — the same package addresses GDPR (covers EEA), UK GDPR, CCPA/CPRA, GLBA, and most other modern regimes. Sector-specific regulations (HIPAA, PCI DSS, SOX) and jurisdictions with materially different definitions (China's PIPL, India's DPDP) get a one-page addendum that addresses the specific definitional differences. The base package is reusable.