
Getting Started

Installation

pip install midasverse-citest

citest depends on PyTorch and midas2 for the default imputer. These are installed automatically.

Prepare your data

Start with a pandas DataFrame where missing values are encoded as NaN. Wrap it in a Dataset object:

import pandas as pd
from citest.data import Dataset

data = pd.read_csv("my_data.csv")

dataset = Dataset()
dataset.make(
    data,
    y="target_variable",
    expl_vars=["x1", "x2", "x3"],
)
  • y -- the outcome variable whose relationship to missingness you want to test.
  • expl_vars -- covariates to condition on. If omitted, all columns except y are used.

Categorical columns are one-hot encoded automatically.
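
If the raw file marks missing values with sentinel codes instead of empty cells, those codes need to become NaN before the DataFrame is wrapped. The sketch below uses the `na_values` argument of `pandas.read_csv`. The inline CSV text, the column values, and the `-999` / `unknown` sentinels are hypothetical and stand in for your own file.

```python
import io

import pandas as pd

# Hypothetical file contents: -999 and "unknown" mark missing cells.
csv_text = """target_variable,x1,x2
1.2,10,a
-999,12,unknown
3.4,-999,b
0.7,9,a
"""

# na_values tells pandas which raw strings to read as NaN.
data = pd.read_csv(io.StringIO(csv_text), na_values=["-999", "unknown"])

# Count missing cells per column as a quick sanity check.
missing_counts = data.isna().sum()
print(missing_counts)
```

Checking `missing_counts` before building the Dataset is a cheap way to confirm that every sentinel was caught.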

Run the test

from citest import CIMissTest

test = CIMissTest(
    dataset,
    classifier_args={"n_estimators": 20, "target_n_jobs": 8},
)
test.run()

This performs multiple imputation, trains classifiers to predict missingness with and without the outcome, and combines the results into a test statistic.
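
The with/without comparison can be illustrated on synthetic data. This is a simplified sketch, not the package's implementation: it skips multiple imputation entirely, uses scikit-learn's `LogisticRegression` rather than the default random forest, and every variable name and the data-generating process are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 2000

# Synthetic covariates and an outcome that depends on the first covariate.
x = rng.normal(size=(n, 2))
y = x[:, 0] + rng.normal(size=n)

# Synthetic missingness indicator whose probability depends on the outcome.
p_missing = 1.0 / (1.0 + np.exp(-1.5 * y))
r = (rng.random(n) < p_missing).astype(int)


def cv_bce(features):
    """Out-of-fold binary cross-entropy for predicting r from features."""
    proba = cross_val_predict(
        LogisticRegression(), features, r, cv=5, method="predict_proba"
    )[:, 1]
    return log_loss(r, proba)


bce_without_y = cv_bce(x)                       # covariates only
bce_with_y = cv_bce(np.column_stack([x, y]))    # covariates plus outcome
bce_diff = bce_without_y - bce_with_y
print(bce_diff)
```

Because missingness here was generated from the outcome, including `y` should lower the cross-entropy, so `bce_diff` comes out positive.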

Interpret the results

test.summary()

The summary reports:

  • Mean difference in BCE -- the average reduction in binary cross-entropy when the real outcome is included. Positive values indicate the outcome helps predict missingness.
  • t-statistic / p-value -- a one-sided test of the null hypothesis that the outcome does not improve missingness prediction. A small p-value provides evidence against conditional independence (i.e. evidence of MNAR-type missingness).
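
The two reported quantities can be reproduced by hand on toy numbers. The sketch below computes a BCE difference and a plain one-sample t-statistic using only the standard library. All numbers are hypothetical, and the simple standard error used here is an assumption for illustration; the package's `"mi_crossfit"` variance estimator may be computed differently.

```python
import math
from statistics import mean, stdev


def bce(labels, probs, eps=1e-12):
    """Mean binary cross-entropy of predicted probabilities against 0/1 labels."""
    total = 0.0
    for r, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(r * math.log(p) + (1 - r) * math.log(1 - p))
    return total / len(labels)


# Hypothetical missingness indicators (1 = missing) and predicted probabilities.
is_missing = [1, 0, 1, 0]
probs_without_y = [0.6, 0.4, 0.6, 0.4]
probs_with_y = [0.9, 0.1, 0.8, 0.2]

# Positive when including the outcome lowers the cross-entropy.
bce_diff = bce(is_missing, probs_without_y) - bce(is_missing, probs_with_y)

# Hypothetical BCE differences, one per imputed dataset.
diffs = [0.031, 0.024, 0.040, 0.018, 0.027]

# One-sample t-statistic against a null mean difference of zero.
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
print(bce_diff, t_stat)
```

For a one-sided test, the p-value is the upper-tail probability of `t_stat` under a t distribution with `len(diffs) - 1` degrees of freedom.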

Key parameters

| Parameter | Default | Description |
| --- | --- | --- |
| m | 10 | Number of multiply imputed datasets |
| n_folds | 10 | Number of cross-validation folds |
| variance_method | "mi_crossfit" | Variance estimator |
| target_level | "variable" | Missingness granularity: "variable" or "column" |
| random_state | 42 | Random seed for reproducibility |

Choosing an imputer

| Class | Description | When to use |
| --- | --- | --- |
| MidasImputer (default) | MIDAS denoising autoencoder via midas2 | General purpose; handles mixed types well |
| IterativeImputer | scikit-learn iterative imputer with posterior sampling | Faster; good for moderate-sized numeric data |
| IterativeImputer2 | Robust variant with numerical guards | Wide or sparse data where IterativeImputer fails |

from citest.imputer import IterativeImputer

test = CIMissTest(dataset, imputer=IterativeImputer)

Choosing a classifier

| Class | Description | When to use |
| --- | --- | --- |
| RFClassifier (default) | Random forest with auto-tuned hyperparameters | General purpose; robust default |
| ETClassifier | Extremely randomized trees | Faster training; more variance |
| LogisticClassifier | Logistic regression | Linear relationships; fast |

from citest.classifier import RFClassifier

test = CIMissTest(
    dataset,
    classifier=RFClassifier,
    classifier_args={"n_estimators": 100, "target_n_jobs": 8},
)