Data Generators

Functions that produce Dataset objects with controlled missingness for simulation studies. Each function returns a populated Dataset with full_data set, so that results can be evaluated against the complete data.

All generators accept ci=True (conditional independence holds) or ci=False (outcome influences missingness).
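To make the ci flag concrete, here is a minimal sketch and not the library's implementation. It assumes a logistic missingness model in which the outcome y enters only when ci=False. The function name missing_prob and the coefficients beta_x and beta_y are hypothetical and chosen for illustration.

```python
import math


def missing_prob(x1, y, ci=True, beta_x=1.0, beta_y=2.0):
    """Hypothetical missingness probability for a single row.

    With ci=True the outcome y is ignored, so missingness is conditionally
    independent of y given x1. With ci=False, y shifts the probability.
    """
    logit = beta_x * x1
    if not ci:
        logit += beta_y * y  # the outcome influences missingness
    return 1.0 / (1.0 + math.exp(-logit))


# Same covariate, different outcomes: identical probability when ci=True.
p_ci = missing_prob(0.5, y=1.0, ci=True)
p_ci_other_y = missing_prob(0.5, y=-1.0, ci=True)

# With ci=False the outcome changes the probability.
p_no_ci = missing_prob(0.5, y=1.0, ci=False)
```

Under this assumed model, changing y has no effect on the missingness probability when ci=True, while a larger y raises it when ci=False.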

Synthetic DGPs

Simple linear data-generating processes for controlled experiments.

| Function | Missingness | Description |
| --- | --- | --- |
| single_mar | MAR | Simple linear DGP, missingness on X2 |
| single_mnar | MNAR | Linear DGP, missingness depends on unobserved Z |
| MAR1 | MAR | King (2001) DGP with multi-variable missingness |
| MNAR1 | MNAR | MNAR variant of King (2001) DGP |

single_mar(n, ci, missing_mech='linear')

Simulated linear DGP with MAR missingness on X2.
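As an illustration of what a linear DGP with MAR missingness on X2 can look like, the sketch below uses only the standard library. It is not the code behind single_mar: the coefficients, the logistic masking rule, and the name simulate_mar are assumptions made for the example.

```python
import math
import random


def simulate_mar(n, seed=0):
    """Hypothetical linear DGP: y = 1 + x1 + x2 + noise, with x2 masked MAR."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = 0.5 * x1 + rng.gauss(0, 1)
        y = 1.0 + x1 + x2 + rng.gauss(0, 1)
        # MAR: the chance that x2 is missing depends only on the observed x1.
        p_miss = 1.0 / (1.0 + math.exp(-x1))
        x2_obs = None if rng.random() < p_miss else x2
        # Keep the unmasked value alongside the masked one for evaluation.
        rows.append({"x1": x1, "x2": x2_obs, "x2_full": x2, "y": y})
    return rows


rows = simulate_mar(1000)
n_missing = sum(r["x2"] is None for r in rows)
```

Because the masking probability is a function of x1 alone, the missingness can be explained entirely by a column that remains observed, which is the defining property of MAR.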

single_mnar(n, ci, missing_mech='linear')

Simulated linear DGP with MNAR missingness on X2.
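The MNAR case can be sketched in the same spirit. The example below is an assumption-laden illustration rather than the implementation of single_mnar: a latent variable z (hypothetical coefficients, hypothetical function name simulate_mnar) drives both x2 and the probability that x2 is missing, and z is never included in the returned rows.

```python
import math
import random


def simulate_mnar(n, seed=0):
    """Hypothetical MNAR sketch: a latent z drives both x2 and its missingness."""
    rng = random.Random(seed)
    observed, full = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)  # latent variable, never released
        x1 = rng.gauss(0, 1)
        x2 = x1 + z + rng.gauss(0, 1)
        y = 1.0 + x1 + x2 + rng.gauss(0, 1)
        # Missingness depends on z, which no returned column contains.
        p_miss = 1.0 / (1.0 + math.exp(-z))
        is_missing = rng.random() < p_miss
        full.append({"x1": x1, "x2": x2, "y": y})
        observed.append({"x1": x1, "x2": None if is_missing else x2, "y": y})
    return observed, full


def mean(values):
    return sum(values) / len(values)


observed, full = simulate_mnar(2000)
x2_seen = [o["x2"] for o in observed if o["x2"] is not None]
x2_hidden = [f["x2"] for o, f in zip(observed, full) if o["x2"] is None]
gap = mean(x2_hidden) - mean(x2_seen)  # positive when hidden values run larger
```

In this sketch z raises both x2 and the chance that x2 is masked, so the hidden values tend to be larger than the observed ones, and the analyst has no observed column that fully accounts for the difference.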

MAR1(n, ci=True, missing_mech='linear', beta_y=2.0)

MAR-1 DGP from King (2001) with multi-variable missingness.

MNAR1(n, ci=True, missing_mech='linear', beta_y=2.0)

MNAR-1 variant of the King (2001) DGP (X2 depends on unobserved Z).

Real-data DGPs

Download real datasets and impose controlled missingness mechanisms.
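The general pattern of imposing a mask on existing data while keeping a complete copy for evaluation can be sketched as follows. This is a hedged illustration, not the library's code: the small in-memory table stands in for a downloaded dataset, the column names age and edu are made up, and impose_mar and mean_impute_rmse are hypothetical helpers. The sketch does not model the mcar_prop parameter.

```python
import copy
import math
import random


def impose_mar(rows, target, driver, center, scale, seed=0):
    """Mask `target` with a probability that rises with the observed `driver`."""
    rng = random.Random(seed)
    full = copy.deepcopy(rows)    # complete copy kept for evaluation
    masked = copy.deepcopy(rows)
    for row in masked:
        p_miss = 1.0 / (1.0 + math.exp(-(row[driver] - center) / scale))
        if rng.random() < p_miss:
            row[target] = None
    return masked, full


def mean_impute_rmse(masked, full, target):
    """RMSE of mean imputation, measured only on the masked entries."""
    observed = [r[target] for r in masked if r[target] is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = sum(observed) / len(observed)
    errors = [(fill - f[target]) ** 2
              for m, f in zip(masked, full) if m[target] is None]
    return math.sqrt(sum(errors) / len(errors)) if errors else 0.0


# Stand-in for a real table: 50 rows with made-up columns.
table = [{"age": a, "edu": 8 + (a % 9)} for a in range(20, 70)]
masked, full = impose_mar(table, target="edu", driver="age", center=45, scale=10)
rmse = mean_impute_rmse(masked, full, target="edu")
```

Keeping the unmasked copy is what makes the error measurable: the imputed values are compared against entries the analyst would never see in a real study.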

| Function | Source | Default n |
| --- | --- | --- |
| adult | UCI Adult (census income) | 1000 |
| adult_mnar | UCI Adult with MNAR via unobserved sex | 1000 |
| mushrooms | UCI Mushroom | 1000 |
| breast_cancer | Wisconsin Breast Cancer | 500 |
| wine | UCI Wine | 500 |
| diabetes | Diabetes progression | 442 |
| covertype | Covertype | 5000 |
| california_housing | California housing | 20000 |
| german_credit | German credit | 1000 |
| bank_marketing | Bank marketing | 10000 |
| ames_housing | Ames housing | 3000 |
| give_me_some_credit | Give Me Some Credit | 10000 |

adult(n=1000, ci=True, mcar_prop=0.5, k=None, missing_mech='linear', beta_y=6.0)

UCI Adult (census income) with MAR masking on education.

adult_mnar(n=1000, ci=True, mcar_prop=0.5, missing_mech='linear', beta_y=6.0)

UCI Adult with MNAR masking via unobserved sex.

mushrooms(n=1000, ci=True, mcar_prop=0.5, missing_mech='linear')

UCI Mushroom dataset with MAR masking on odor columns.

breast_cancer(n=500, ci=True, mcar_prop=0.5, missing_mech='linear')

Wisconsin Diagnostic Breast Cancer dataset with MAR/MNAR-style masks.

wine(n=500, ci=True, mcar_prop=0.5, missing_mech='linear')

UCI Wine dataset with controllable MAR/MNAR masking.

diabetes(n=442, ci=True, mcar_prop=0.5, missing_mech='linear')

Diabetes progression dataset with MAR/MNAR masks (regression target).

covertype(n=5000, ci=True, mcar_prop=0.3, missing_mech='linear')

Forest CoverType dataset with MAR/MNAR-style masking.

california_housing(n=20000, ci=True, mcar_prop=0.3, missing_mech='linear')

California Housing regression dataset with MAR/MNAR masks.

german_credit(n=1000, ci=True, mcar_prop=0.3, missing_mech='linear')

German credit (OpenML id=31) with MAR/MNAR masking on categorical features.

bank_marketing(n=10000, ci=True, mcar_prop=0.3, missing_mech='linear')

Bank marketing (OpenML id=1461) with MAR/MNAR masking.

ames_housing(n=3000, ci=True, mcar_prop=0.3, missing_mech='linear')

Ames housing prices (OpenML id=43952) with MAR/MNAR masking.

give_me_some_credit(n=10000, ci=True, mcar_prop=0.3, missing_mech='linear')

Give Me Some Credit (OpenML) credit default with MAR/MNAR masking.