
Insurance Claims & Cybersecurity Risk Pack

Fraud detection is one of the few problems in financial services where the typical training case is misleading. The vast majority of transactions are legitimate and the vast majority of claims are valid, so a fraud-detection model trained on a representative sample optimizes for the modal case at the expense of the rare but high-cost adversarial cases that actually drive risk. The Insurance Claims & Cybersecurity Risk Pack contains 50 households built explicitly as an adversarial training set: labeled fraud scenarios, contested claim trajectories, account-takeover patterns, predatory-lending exposures, and the elder-financial-abuse markers that AML and supervisory systems are increasingly required to detect.

Households: 50
Archetypes: 3
Formats: JSON, CSV, Parquet
Deviation: High

Why this Data Set exists

Building a fraud-detection or supervisory-monitoring model requires labeled adversarial examples — and most institutions have very few. The labels in production data are sparse (fraud is a low-base-rate event, and many fraud events are never identified or never labeled correctly), the structural diversity of attack patterns is high (each new fraud scheme creates a new pattern type), and the regulatory pressure to detect new scheme types is increasing (FinCEN AML rules, FINRA elder-abuse rules, SEC supervisory expectations).

The alternative — building a synthetic adversarial corpus — has its own challenges. Generic synthetic data tends to under-represent the structural complexity of real fraud (the cluster of small signals that together indicate a fraud event). Hand-built adversarial examples tend to converge on a small number of attack types the team happens to think of.

This Data Set is a focused 50-household corpus where the adversarial scenarios are carefully calibrated against published FinCEN AML typologies, FINRA elder-abuse case studies, and CFPB predatory-lending enforcement actions. Every household carries explicit fraud-scenario labels, structured insurance-claim disputes, account-takeover patterns where applicable, and the markers that supervisory systems should detect. The corpus is high-deviation by design — for adversarial ML training, you want the cases that stress the model, not the cases that confirm it.

Use Cases

Fraud detection algorithm training
Insurance claims dispute analytics
Account takeover pattern recognition
AML transaction monitoring

Who uses this Data Set

Fraud Detection ML Engineer

Trains and validates fraud-detection models on a corpus with explicit adversarial labels: insurance-claim disputes, account-takeover patterns, AML-relevant transaction patterns, predatory-lending exposures. The labeled structure makes supervised learning possible against fraud types where production-data labels are too sparse for training.

AML Compliance Officer at a Bank

Tests the firm's transaction-monitoring system against realistic SAR-trigger scenarios calibrated to current FinCEN typologies, ensuring the system catches the patterns the regulator has flagged and avoids false positives on lawful activity that superficially resembles those patterns.

Insurance Claims Analyst

Validates the firm's claims-dispute-resolution workflow against realistic contested-claim scenarios where the structural facts make the case ambiguous (medical-necessity disputes, pre-existing-condition exclusions, disability-claim documentation gaps).

Cybersecurity Operations Lead

Tests the firm's account-takeover detection logic against labeled ATO patterns: credential stuffing, phishing-derived credential reuse, SIM-swap-followed-by-password-reset, and the increasingly common authenticated-but-anomalous-behavior patterns.

FINRA Elder-Abuse Compliance Specialist

Tests the firm's elder-financial-abuse detection process against labeled markers: unusual large withdrawals, new caregiver as agent on POA, beneficiary changes coinciding with cognitive-decline markers, isolation patterns. Ensures the supervisory process catches the cases regulators expect to be caught.

What's inside

The 50 households cluster around three archetypes deliberately chosen for adversarial-data utility: medical-debt-crisis households (S-03) where insurance-claim disputes are central; low-income working families (U-02) where predatory-lending exposure and AML-relevant transaction patterns concentrate; and neurodiverse / disability households (X-04) where elder-abuse-precursor markers surface (the disability-status field overlaps with cognitive-vulnerability flags in important ways).

Every household carries structured fraud-scenario labels — the scenario type (claim dispute, ATO, predatory loan, elder financial abuse, AML-suspicious-pattern), the structural markers that should trigger detection, and the resolution outcome (where applicable). Insurance-claim disputes include the disputed amount, the dispute reason (medical-necessity, pre-existing exclusion, treatment classification), the carrier and provider responses, and the resolution status. Account-takeover scenarios include the structured attack pattern (credential origin, authentication anomalies, behavioral changes post-compromise). AML-relevant patterns are calibrated to current FinCEN guidance — structuring patterns, unusual-pattern velocity, geographic-anomaly transactions, third-party-funded purchases.
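The label structure described above lends itself to a quick sanity check when loading the files. The sketch below is only illustrative: `fraud.scenario_type` and `insurance.claims_history` appear in the sample schema, but the five enum strings and the `checkScenarioLabel` helper are hypothetical, and the shipped files may spell the scenario types differently.

```javascript
// Hypothetical enum values for the five scenario types named in the text;
// the shipped files may use different spellings.
const SCENARIO_TYPES = new Set([
  'claim_dispute',
  'ato',
  'predatory_loan',
  'elder_financial_abuse',
  'aml_suspicious_pattern',
]);

// Returns a list of human-readable problems; an empty list means the
// label passed these two (deliberately minimal) checks.
function checkScenarioLabel(household) {
  const problems = [];
  const type = household.fraud?.scenario_type;
  if (!SCENARIO_TYPES.has(type)) {
    problems.push(`unknown scenario_type: ${type}`);
  }
  // A claim-dispute label should be backed by at least one claims record.
  if (type === 'claim_dispute' &&
      !(household.insurance?.claims_history?.length > 0)) {
    problems.push('claim_dispute label without any claims_history entries');
  }
  return problems;
}

// Made-up records for illustration only.
const labeledOk = {
  fraud: { scenario_type: 'claim_dispute' },
  insurance: { claims_history: [{ status: 'pending' }] },
};
const labeledBad = {
  fraud: { scenario_type: 'claim_dispute' },
  insurance: { claims_history: [] },
};

console.log(checkScenarioLabel(labeledOk));
console.log(checkScenarioLabel(labeledBad));
```

A check like this is no substitute for the taxonomy in the Methodology PDF; it only catches labels that are missing or obviously unsupported by the record.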

The Data Set ships as JSON, CSV, and Parquet. The WealthSynth Methodology PDF documents the fraud-scenario taxonomy, the calibration sources (FinCEN advisory bulletins, FINRA elder-abuse case studies, CFPB predatory-lending enforcement actions, NAIC fraud-detection guidance), and the AML/elder-abuse markers that should drive supervisory routing.

Preview a sample household

A redacted summary of one household from this Data Set — names, employers, exact balances, and metro area are stripped. Ages are bucketed, income and net worth are reported as bands. The full record (and all 50 like it) ships in the ZIP.

S-03 · Medical Debt Crisis (representative archetype household)
Household: Married Joint
State: NY
Gross income (band): $50k–$100k
Net worth (band):
Dependents: 3
Income source types: w2 salary, w2 bonus
Members (5):
primary, Age 50–54, retail
spouse, Age 55–59, retail
dependent, Age 10–14
dependent, Age 5–9
dependent, Age 0–4

Technical Highlights

Labeled adversarial scenarios for ML
FinCEN AML flag alignment
FINRA elder abuse taxonomy
Insurance claims dispute records

Sample Schema Fields

sample_record.json
{
  "fraud.scenario_type": <value>,
  "insurance.claims_history[]": <value>,
  "fraud.account_takeover_flags": <value>,
  "compliance.aml_flags": <value>,
  "fraud.predatory_lending_indicators": <value>
}
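The schema keys above are written as dotted paths, and the sample queries read them as nested properties (`h.fraud.scenario_type`). Assuming the JSON export is nested in that way, a small path resolver can be handy for generic tooling. `getPath` is a hypothetical helper, not part of the Data Set, and the record below is made up for illustration.

```javascript
// Hypothetical helper: resolve a dotted schema path such as
// "fraud.scenario_type" or "insurance.claims_history[]" against a record.
// A trailing "[]" only marks the field as an array, so it is stripped.
function getPath(record, path) {
  const keys = path.replace(/\[\]$/, '').split('.');
  let value = record;
  for (const key of keys) {
    if (value == null) return undefined; // stop at the first missing level
    value = value[key];
  }
  return value;
}

// Made-up record for illustration only; values are not from the Data Set.
const sampleHousehold = {
  fraud: { scenario_type: 'claim_dispute' },
  insurance: { claims_history: [{ status: 'pending' }] },
};

console.log(getPath(sampleHousehold, 'fraud.scenario_type'));
console.log(getPath(sampleHousehold, 'insurance.claims_history[]').length);
console.log(getPath(sampleHousehold, 'compliance.aml_flags'));
```

If the CSV or Parquet exports flatten these paths into column names, the same strings could be used directly as column keys instead.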

Sample queries

Find labeled fraud scenarios for ML training

Returns households with at least one labeled fraud scenario, grouped by scenario type — the structured training data for supervised fraud-detection models.

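// Group labeled households by fraud scenario type.
// `!= null` also skips records where the field is missing (undefined).
const byScenarioType = households
  .filter(h => h.fraud?.scenario_type != null)
  .reduce((acc, h) => {
    const type = h.fraud.scenario_type;
    if (!acc[type]) acc[type] = [];
    acc[type].push(h);
    return acc;
  }, {});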
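Before training, it is often useful to know how many examples each scenario type contributes. The following is a minimal sketch of that tally; `countByScenarioType` and the `demoHouseholds` array are hypothetical, and the scenario strings are placeholders rather than values taken from the Data Set.

```javascript
// Count labeled households per scenario type, skipping unlabeled records.
function countByScenarioType(households) {
  const counts = {};
  for (const h of households) {
    const type = h.fraud?.scenario_type;
    if (type == null) continue; // null or missing label
    counts[type] = (counts[type] ?? 0) + 1;
  }
  return counts;
}

// Made-up input for illustration only.
const demoHouseholds = [
  { id: 'h1', fraud: { scenario_type: 'ato' } },
  { id: 'h2', fraud: { scenario_type: 'claim_dispute' } },
  { id: 'h3', fraud: { scenario_type: 'ato' } },
  { id: 'h4', fraud: { scenario_type: null } },
];

console.log(countByScenarioType(demoHouseholds)); // { ato: 2, claim_dispute: 1 }
```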
Surface elder-financial-abuse precursor markers

Returns households where a cognitive-decline marker coincides with a POA change or a beneficiary change in the past six months. The other markers in this Data Set (POA granted to a non-family member, isolation patterns, unusual large withdrawals) are not checked by this query and would need additional conditions.

// monthsSince(date) is assumed to return the whole months elapsed since `date`.
households.filter(h => {
  // Treat a missing cognitive_status the same as 'none'.
  const cognitive = h.members?.some(m =>
    m.cognitive_status != null && m.cognitive_status !== 'none') ?? false;
  const recentPoaChange = h.legal?.poa_history?.some(p =>
    monthsSince(p.update_date) < 6) ?? false;
  const recentBeneficiaryChange = h.estate?.beneficiaries_by_account
    ?.some(b => monthsSince(b.last_updated) < 6) ?? false;
  return cognitive &&
    (recentPoaChange || recentBeneficiaryChange);
})
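The query above calls `monthsSince`, which is not defined on this page. One possible implementation is sketched below, assuming the date fields are ISO 8601 strings (for example `2024-01-15`); the optional `now` parameter is an addition that makes the function testable.

```javascript
// Hypothetical implementation of the monthsSince helper used by the query.
// Counts whole calendar months between `date` and `now`, in UTC.
function monthsSince(date, now = new Date()) {
  const then = new Date(date);
  // An unparseable date should never count as "recent".
  if (Number.isNaN(then.getTime())) return Infinity;
  let months =
    (now.getUTCFullYear() - then.getUTCFullYear()) * 12 +
    (now.getUTCMonth() - then.getUTCMonth());
  // If the day of month has not been reached yet, the last month is incomplete.
  if (now.getUTCDate() < then.getUTCDate()) months -= 1;
  return months;
}

const asOf = new Date('2024-07-15T00:00:00Z');
console.log(monthsSince('2024-01-15', asOf)); // 6
console.log(monthsSince('2024-01-16', asOf)); // 5
console.log(monthsSince('not a date', asOf)); // Infinity
```

With this definition, a change made exactly six months ago is not counted as recent, because the query tests `< 6`.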
Identify AML-suspicious transaction patterns

Returns households whose transaction patterns match FinCEN-typology categories: structuring (just-below-CTR-threshold deposits), velocity anomalies (sudden 10x increase in transaction frequency), or geographic anomalies (transactions in high-risk jurisdictions).

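// Keep households carrying at least one of the three typology flags.
// Optional chaining guards against records without a compliance block.
households.filter(h =>
  h.compliance?.aml_flags?.some(f =>
    ['structuring', 'velocity_anomaly',
     'geographic_anomaly'].includes(f)
  ) ?? false
)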
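To make the structuring typology concrete, here is a rough heuristic that looks for several cash deposits just under the $10,000 CTR threshold within a short window. Everything except the threshold itself is an assumption for illustration: the transaction shape (`date`, `amount`, `type`), the `looksLikeStructuring` name, and the default parameters are hypothetical and are not taken from the Data Set or from FinCEN guidance.

```javascript
// US Currency Transaction Reports apply to cash transactions over $10,000.
const CTR_THRESHOLD_USD = 10000;
const DAY_MS = 24 * 60 * 60 * 1000;

// Illustrative heuristic only: true when at least `minCount` cash deposits
// between `nearFraction` of the threshold and the threshold itself fall
// within any `windowDays`-day window.
function looksLikeStructuring(transactions, {
  nearFraction = 0.8,
  minCount = 3,
  windowDays = 7,
} = {}) {
  const times = transactions
    .filter(t => t.type === 'cash_deposit' &&
      t.amount >= CTR_THRESHOLD_USD * nearFraction &&
      t.amount <= CTR_THRESHOLD_USD)
    .map(t => new Date(t.date).getTime())
    .filter(Number.isFinite) // drop unparseable dates
    .sort((a, b) => a - b);

  // Slide a window over the sorted timestamps.
  let start = 0;
  for (let end = 0; end < times.length; end++) {
    while (times[end] - times[start] > windowDays * DAY_MS) start++;
    if (end - start + 1 >= minCount) return true;
  }
  return false;
}

// Made-up transactions for illustration only.
const clustered = [
  { date: '2024-03-01', amount: 9500, type: 'cash_deposit' },
  { date: '2024-03-03', amount: 9800, type: 'cash_deposit' },
  { date: '2024-03-05', amount: 9200, type: 'cash_deposit' },
];
const spreadOut = [
  { date: '2024-03-01', amount: 9500, type: 'cash_deposit' },
  { date: '2024-03-15', amount: 9800, type: 'cash_deposit' },
  { date: '2024-03-30', amount: 9200, type: 'cash_deposit' },
];

console.log(looksLikeStructuring(clustered)); // true
console.log(looksLikeStructuring(spreadOut)); // false
```

A rule this simple will also fire on lawful cash-heavy households, which is exactly the false-positive behavior the lawful look-alike patterns in this corpus are meant to expose.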
Track account-takeover pattern signatures

Returns ATO scenarios with their structural signatures — credential-origin path, authentication anomalies, post-compromise behavioral changes — useful for training pattern-recognition components of ATO detection systems.

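// Keep households with account-takeover flags set, then project each one
// down to the fields that describe the attack signature.
households.filter(h =>
  h.fraud?.account_takeover_flags
).map(h => ({
  id: h.id,
  attack_pattern: h.fraud.ato_pattern,
  authentication_anomalies: h.fraud.auth_anomalies,
  behavioral_changes: h.fraud.post_compromise_behavior
}))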

Methodology

Each household's fraud-scenario assignment is generated against archetype-specific risk patterns. Medical-debt-crisis households (S-03) carry insurance-claim disputes calibrated against actual NAIC dispute typologies. Low-income working families (U-02) carry predatory-lending exposures calibrated against CFPB enforcement actions and AML-relevant patterns calibrated against FinCEN advisories. Neurodiverse / disability households (X-04) carry elder-abuse precursor markers calibrated against FINRA and SEC supervisory guidance. Account-takeover scenarios use realistic attack-pattern signatures derived from published cybersecurity research (Verizon DBIR, IBM Threat Intelligence).

The scenarios are deliberately concentrated for adversarial utility: the 50-household corpus is roughly equivalent in adversarial-coverage value to a 1000-household corpus with realistic base rates, since the rare events are upweighted.

The corpus passes the WealthSynth consistency validator (fraud-scenario labels are structurally consistent with the underlying household data; AML flags fire when the underlying transaction patterns match the typology) and the LLM-as-judge gate. Annual refresh tracks FinCEN, FINRA, NAIC, and CFPB guidance updates.
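One way to read the 50-versus-1000 equivalence is as simple base-rate arithmetic. The sketch below assumes that coverage scales with the number of adversarial households, and that every household in this corpus counts as adversarial. The 5% base rate is merely the rate implied by the ratio 50/1000, not a figure stated in the methodology, and `equivalentCorpusSize` is a hypothetical helper.

```javascript
// How many households a corpus with a realistic base rate would need in
// order to contain the same number of adversarial households.
function equivalentCorpusSize(adversarialHouseholds, baseRate) {
  if (!(baseRate > 0 && baseRate <= 1)) {
    throw new RangeError('baseRate must be in (0, 1]');
  }
  // Households are whole units, so round the result.
  return Math.round(adversarialHouseholds / baseRate);
}

console.log(equivalentCorpusSize(50, 0.05)); // 1000
console.log(equivalentCorpusSize(50, 0.01)); // 5000
```

Under this reading, the lower the real-world base rate of a scenario type, the larger the representative corpus needed to match the same number of labeled examples.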


Frequently asked questions

Why only 50 households for an ML training corpus?

Adversarial training value isn't proportional to corpus size — it's proportional to scenario diversity within the corpus. 50 households with 50 well-calibrated scenarios across multiple typologies gives more training value than 5000 households where scenarios cluster on the modal case. For production-scale ML training, this corpus is best used for fine-tuning or as a labeled validation set; the bulk training would still come from the institution's own production data.

Are FinCEN advisories current?

Yes. The AML scenario calibration tracks current FinCEN advisory bulletins (2023-2024 advisories on synthetic identity, business email compromise, real estate money laundering, and cyber-enabled fraud). Annual refresh tracks subsequent advisory updates.

How are elder-abuse markers structured?

Per FINRA Rule 2165 and the SEC's senior-investor focus, elder-abuse markers are structured around the patterns regulators have repeatedly flagged: cognitive-decline indicators paired with new caregiver-as-agent designations, beneficiary changes coinciding with isolation patterns, unusual-large-withdrawal patterns, and the involvement of unrelated third parties as account agents. About 30% of the corpus has at least one elder-abuse precursor marker firing.

Are predatory-lending exposures realistic?

Yes. The structured predatory-lending markers cover the patterns CFPB has emphasized in enforcement: triple-digit-APR personal loans, undisclosed fees in title-loan contexts, balloon-payment auto loans with negative-equity rollover, and refund-anticipation-loan churn. About 18% of the corpus's underbanked households have at least one predatory-loan exposure structurally documented.

Does the corpus include cyber-insurance claim scenarios?

Yes — about 12% of the corpus has a cyber-insurance claim scenario: identity theft with documented financial loss, account-takeover with funds-recovery sub-claim, ransomware affecting a small business owned by the household. The structured claim data lets your tools test the carrier-and-policyholder coordination workflow specifically.

How are insurance-claim disputes structured?

Disputed claims include the disputed amount, the dispute reason, the carrier's position with citation to the policy section, the policyholder's position with documentation, and the resolution status (pending, denied-final, mediated, paid-after-appeal). About 25% of the corpus has at least one disputed claim. The structured data supports both training claim-decisioning models and testing claims-dispute workflow software.

Are AML scenarios concentrated in specific archetypes?

Most AML-relevant patterns concentrate in the U-02 low-income-working-family archetype (where structuring-pattern false positives are common) and in cash-economy-adjacent profiles. The corpus deliberately includes lawful patterns that look adversarial alongside the genuinely adversarial ones. Testing AML systems on lawful patterns is critical to avoid false-positive SAR filings on legitimate underbanked-customer behavior.

How does this fit alongside B10 (InsurTech Illustration)?

B10 covers the pre-issue side of insurance: needs analysis, illustration, underwriting. B15 (this Data Set) covers the post-issue / claims side: dispute resolution, fraud detection, AML monitoring, elder-abuse detection. Carriers serving the full insurance lifecycle typically buy both. ML / fraud-detection teams typically buy B15 standalone for the labeled adversarial corpus.


$5,500
one-time purchase
50 households (ZIP)
Methodology PDF
JSON, CSV, Parquet formats
Account required to purchase

Purchases are for internal use only. Redistribution or resale of data is prohibited under the WealthSchema Data License.
