Data Generators
Functions that produce Dataset objects with controlled missingness for simulation studies. Each function returns a populated Dataset whose full_data is set, so that results can be evaluated against the complete data.
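To illustrate why a complete copy of the data is useful for evaluation, here is a minimal sketch. `ToyDataset`, its `data` field, and `mean_impute_rmse` are hypothetical names introduced for this example only; the real Dataset class may be structured differently. Only the `full_data` name comes from the text above.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyDataset:
    """Hypothetical stand-in; the real Dataset class may look different."""
    data: list        # observed values, None where a value was masked
    full_data: list   # complete values, kept only for evaluation


def mean_impute_rmse(ds):
    """RMSE of mean imputation, measured on the masked entries only."""
    observed = [v for v in ds.data if v is not None]
    fill = sum(observed) / len(observed)
    errors = [
        (fill - truth) ** 2
        for value, truth in zip(ds.data, ds.full_data)
        if value is None
    ]
    if not errors:
        return 0.0
    return math.sqrt(sum(errors) / len(errors))


ds = ToyDataset(data=[1.0, None, 3.0, None], full_data=[1.0, 2.0, 3.0, 6.0])
rmse = mean_impute_rmse(ds)
```

Because the masked entries are known in the complete copy, an imputation method can be scored directly rather than through a proxy.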
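All generators accept a ci flag: ci=True means conditional independence holds, and ci=False means the outcome influences missingness. single_mar and single_mnar take ci as a required argument; the other generators default to ci=True.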
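The effect of the ci flag can be sketched with a toy generator. This is an illustration under stated assumptions, not the library's implementation: `toy_generator`, the coefficients, and the logistic link on a linear score are all hypothetical choices made for this example.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def toy_generator(n, ci=True, seed=0):
    """Hypothetical stand-in for a generator (not the library's code).

    Returns (full_rows, observed_rows). Each row is (x1, x2, y), and a
    masked x2 is represented by None in observed_rows.
    """
    rng = random.Random(seed)
    full_rows, observed_rows = [], []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        y = 1.0 + 2.0 * x1 - x2 + rng.gauss(0.0, 1.0)
        # Assumed missingness score: x1 always contributes;
        # the outcome y contributes only when ci=False.
        score = x1 + (0.0 if ci else y)
        missing = rng.random() < sigmoid(score)
        full_rows.append((x1, x2, y))
        observed_rows.append((x1, None if missing else x2, y))
    return full_rows, observed_rows


full_ci, obs_ci = toy_generator(500, ci=True)
full_dep, obs_dep = toy_generator(500, ci=False)
```

With ci=True the mask is driven by x1 alone, so it is independent of y once x1 is known. With ci=False the outcome enters the score directly, which is the case the text describes as the outcome influencing missingness.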
Synthetic DGPs
Simple linear data-generating processes for controlled experiments.
| Function | Missingness | Description |
|---|---|---|
| single_mar | MAR | Simple linear DGP, missingness on X2 |
| single_mnar | MNAR | Linear DGP, missingness depends on unobserved Z |
| MAR1 | MAR | King (2001) DGP with multi-variable missingness |
| MNAR1 | MNAR | MNAR variant of King (2001) DGP |
single_mar(n, ci, missing_mech='linear')
Simulated linear DGP with MAR missingness on X2.
single_mnar(n, ci, missing_mech='linear')
Simulated linear DGP with MNAR missingness on X2.
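The difference between the MAR and MNAR variants can be sketched as follows. This is a toy illustration, not the code behind single_mar or single_mnar: the function `mask_x2`, the variable names, and the coefficients are assumptions. The only detail taken from the documentation is that the MNAR case depends on an unobserved Z.

```python
import math
import random


def mask_x2(n, mechanism="mar", seed=0):
    """Toy illustration; names and coefficients are assumptions."""
    if mechanism not in ("mar", "mnar"):
        raise ValueError("mechanism must be 'mar' or 'mnar'")
    rng = random.Random(seed)
    observed = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)   # always observed
        z = rng.gauss(0.0, 1.0)    # never released to the analyst
        x2 = 0.5 * x1 + 0.5 * z + rng.gauss(0.0, 1.0)
        # MAR: the mask is driven by the observed x1.
        # MNAR: the mask is driven by z, which is absent from the output.
        driver = x1 if mechanism == "mar" else z
        p_missing = 1.0 / (1.0 + math.exp(-driver))
        x2_obs = None if rng.random() < p_missing else x2
        observed.append({"x1": x1, "x2": x2_obs})
    return observed


mar_rows = mask_x2(300, "mar")
mnar_rows = mask_x2(300, "mnar")
```

In the MAR case the missingness can be modelled from the released columns. In the MNAR case it cannot, because the variable that drives the mask is not in the observed records.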
MAR1(n, ci=True, missing_mech='linear', beta_y=2.0)
MAR-1 DGP from King (2001) with multi-variable missingness.
MNAR1(n, ci=True, missing_mech='linear', beta_y=2.0)
MNAR-1 variant of the King (2001) DGP (X2 depends on unobserved Z).
Real-data DGPs
Download real datasets and impose controlled missingness mechanisms.
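As a rough sketch of imposing a missingness mechanism on an existing table, the example below subsamples rows, keeps a complete copy, and masks one column with a probability that depends on another observed column. The function `impose_missingness`, its parameters, and the in-memory rows standing in for a downloaded dataset are all hypothetical. The mcar_prop parameter is not modelled here because its semantics are not stated in this section.

```python
import copy
import math
import random


def impose_missingness(rows, target, driver, n=None, seed=0):
    """Mask `target` with probability increasing in `driver` (assumed design)."""
    rng = random.Random(seed)
    if n is not None and n < len(rows):
        rows = rng.sample(rows, n)          # subsample to n rows
    full = copy.deepcopy(rows)              # complete copy for evaluation

    # Standardise the driver so the mask rate stays near one half.
    values = [r[driver] for r in full]
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0

    observed = []
    for r in full:
        p_missing = 1.0 / (1.0 + math.exp(-(r[driver] - mean) / sd))
        masked = dict(r)
        if rng.random() < p_missing:
            masked[target] = None
        observed.append(masked)
    return full, observed


# Stand-in rows; a real run would start from a downloaded dataset.
table = [{"age": 20 + i % 50, "education": i % 5} for i in range(200)]
full, observed = impose_missingness(table, target="education", driver="age", n=100)
```

Because the driver column stays fully observed, this particular sketch produces a MAR-style mask.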
| Function | Source | Default n |
|---|---|---|
| adult | UCI Adult (census income) | 1000 |
| adult_mnar | UCI Adult with MNAR via unobserved sex | 1000 |
| mushrooms | UCI Mushroom | 1000 |
| breast_cancer | Wisconsin Breast Cancer | 500 |
| wine | UCI Wine | 500 |
| diabetes | Diabetes progression | 442 |
| covertype | Covertype | 5000 |
| california_housing | California housing | 20000 |
| german_credit | German credit | 1000 |
| bank_marketing | Bank marketing | 10000 |
| ames_housing | Ames housing | 3000 |
| give_me_some_credit | Give Me Some Credit | 10000 |
adult(n=1000, ci=True, mcar_prop=0.5, k=None, missing_mech='linear', beta_y=6.0)
UCI Adult (census income) with MAR masking on education.
adult_mnar(n=1000, ci=True, mcar_prop=0.5, missing_mech='linear', beta_y=6.0)
UCI Adult with MNAR masking via unobserved sex.
mushrooms(n=1000, ci=True, mcar_prop=0.5, missing_mech='linear')
UCI Mushroom dataset with MAR masking on odor columns.
breast_cancer(n=500, ci=True, mcar_prop=0.5, missing_mech='linear')
Wisconsin Diagnostic Breast Cancer dataset with MAR/MNAR-style masks.
wine(n=500, ci=True, mcar_prop=0.5, missing_mech='linear')
UCI Wine dataset with controllable MAR/MNAR masking.
diabetes(n=442, ci=True, mcar_prop=0.5, missing_mech='linear')
Diabetes progression dataset with MAR/MNAR masks (regression target).
covertype(n=5000, ci=True, mcar_prop=0.3, missing_mech='linear')
Forest CoverType dataset with MAR/MNAR-style masking.
california_housing(n=20000, ci=True, mcar_prop=0.3, missing_mech='linear')
California Housing regression dataset with MAR/MNAR masks.
german_credit(n=1000, ci=True, mcar_prop=0.3, missing_mech='linear')
German credit (OpenML id=31) with MAR/MNAR masking on categorical features.
bank_marketing(n=10000, ci=True, mcar_prop=0.3, missing_mech='linear')
Bank marketing (OpenML id=1461) with MAR/MNAR masking.
ames_housing(n=3000, ci=True, mcar_prop=0.3, missing_mech='linear')
Ames housing prices (OpenML id=43952) with MAR/MNAR masking.
give_me_some_credit(n=10000, ci=True, mcar_prop=0.3, missing_mech='linear')
Give Me Some Credit (OpenML) credit default with MAR/MNAR masking.