Synthetic vs. Anonymized Real Data — Decision Tree
The synthetic-vs-anonymized decision often gets framed as 'how realistic does the data need to be?' That's the wrong question: both can be made realistic. The right questions are about legal-risk tolerance, edge-case coverage, and how closely the test environment will be scrutinized. This decision tree walks through them.
What you walk away with
~2 min · 3 questions · 3 possible outcomes
- A specific 'use synthetic' / 'use anonymized' / 'either, with caveats' recommendation.
- A one-paragraph rationale citing the binding constraint.
- A linked artifact (the comparison, the relevant assessment, or a 'talk to us' if your situation is unusual).
Does the test / dev / demo environment ever process production PII or NPI?
Even occasional flow counts. If anyone has had to file a finding about this, the answer is yes.
FAQ
Why doesn't this go deeper into anonymization techniques?
Because anonymization rigor is downstream of the legal-risk decision. If the firm accepts the residual re-identification risk, the technique discussion is engineering. If it doesn't, no technique is sufficient.
What about a hybrid?
A hybrid uses anonymized real data for distributional grounding and synthetic data to top up edge cases. The wizard surfaces this option on the anonymized branch; 'talk to us' opens that conversation.
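For readers who want a concrete picture of the hybrid, here is a minimal sketch in Python. It assumes the anonymized rows and the synthetic edge cases are already available as lists of dicts. The function name build_hybrid_dataset, the edge_case_fraction parameter, and the source field are hypothetical names chosen for illustration; they are not part of the wizard or of any particular tool.

```python
import random

def build_hybrid_dataset(anonymized_rows, synthetic_edge_cases,
                         edge_case_fraction=0.1, seed=0):
    """Combine anonymized real rows with a top-up of synthetic edge cases.

    Hypothetical helper: edge_case_fraction is the approximate share of the
    final dataset that should be synthetic (an assumed knob, not a standard).
    """
    if not 0 <= edge_case_fraction < 1:
        raise ValueError("edge_case_fraction must be in [0, 1)")

    # Number of synthetic rows needed so they make up roughly
    # edge_case_fraction of the combined dataset.
    n_real = len(anonymized_rows)
    n_synth = round(n_real * edge_case_fraction / (1 - edge_case_fraction))
    if n_synth > 0 and not synthetic_edge_cases:
        raise ValueError("no synthetic edge cases available for the top-up")

    rng = random.Random(seed)  # seeded so test fixtures are reproducible

    # Tag every row with its origin so the two sources stay distinguishable.
    base = [dict(row, source="anonymized") for row in anonymized_rows]
    top_up = [dict(rng.choice(synthetic_edge_cases), source="synthetic")
              for _ in range(n_synth)]

    combined = base + top_up
    rng.shuffle(combined)
    return combined


# Illustrative data only; the field names are made up for this example.
real = [{"id": i, "amount": 100 + i} for i in range(90)]
edge = [{"id": -1, "amount": 0}, {"id": -2, "amount": 10**9}]
dataset = build_hybrid_dataset(real, edge, edge_case_fraction=0.1, seed=1)
```

Tagging each row with its origin keeps the synthetic top-up separable from the anonymized base, which may help if the anonymized portion later has to be pulled or reviewed on its own.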