Overview

imputetoolkit is an R package for evaluating and benchmarking missing-data imputation techniques using a unified R + C++ backend.

It automates the full workflow of data preparation → missingness injection → imputation → metric-based evaluation → visualization, providing both numeric and categorical assessments.

With parallel computation, C++ acceleration, and an intuitive S3 interface (print, summary, plot), the package allows users to quantitatively justify imputation strategy choices.


Key Features

| Category | Description |
| --- | --- |
| Imputation Methods | Mean/Mode (skewness-aware log transform), Median/Mode, MICE, and mixed-type KNN (parallelized) |
| Evaluation Metrics (Rcpp) | Numeric: RMSE, MAE, R², Correlation, KS; Categorical: Accuracy, Kappa, F1, MacroF1, Balanced Accuracy |
| Parallelization | Multi-core KNN imputation using FNN, foreach, and doParallel |
| Scaling | Min-max scaling of numeric columns for fair metric comparison (see the sketch below) |
| Visualization | ggplot2-based metric and density plots (per method or all methods) |
| Recommendation | Automatic “best-method” suggestions per metric type |
| Reproducibility | Deterministic pipeline with fixed seeds and consistent outputs |
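
The min-max scaling mentioned in the Scaling row can be sketched in base R; this is illustrative only, not the package's internal code:

# Illustrative min-max scaling of a numeric vector to [0, 1]
scale_minmax <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}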

Package Structure


imputetoolkit/
├── DESCRIPTION                     # Package metadata (name, version, author, title, description, dependencies)
├── NAMESPACE                       # Auto-generated by Roxygen; lists exported/imported functions
│
├── R/                              # Core R functions
│   ├── evaluator.R                 # Main pipeline: data validation → imputation → metric evaluation (S3 methods defined here)
│   ├── api.R                       # Public-facing wrapper/helper functions (clean user interface)
│   ├── RcppExports.R               # Auto-generated bridge between R and C++ (created by Rcpp::compileAttributes())
│
├── src/                            # High-performance C++ backend
│   ├── evaluator.cpp               # Core metrics computation (numeric & categorical in one pass)
│   ├── imputer.cpp                 # Internal functions for imputation logic and handling mixed data types
│   └── RcppExports.cpp             # Auto-generated by Rcpp — connects compiled C++ to R functions
│
├── inst/extdata/                   # Example data files bundled with the package
│   ├── sample_dataset.csv          # Small mixed-type dataset (for quick testing)
│   └── synthetic_mixed_dataset.csv # Larger synthetic dataset used in vignette + examples
│
├── man/                            # Auto-generated documentation (one .Rd per exported function)
│   ├── evaluator.Rd                # Docs for main function
│   ├── plot_metrics.Rd             # Docs for visualisation helper
│   ├── suggest_best_method.Rd      # Docs for recommendation helper
│   └── ...                         # (Other .Rd files for all exported functions)
│
├── vignettes/                      # Long-form user guide in R Markdown format
│   └── imputetoolkit.Rmd           # Full walkthrough (loading data → imputation → evaluation → plots)
│
├── tests/                          # Unit testing infrastructure
│   ├── testthat.R                  # Entry point for testthat (auto-runs all test files below)
│   └── testthat/                   # Folder containing detailed test files
│       ├── test-evaluator.R        # Tests for pipeline, error handling, and reproducibility
│       ├── test-rcpp-evaluator.R   # Tests for C++ metrics (numeric/categorical accuracy)
│       └── ...                     # (Future tests for new methods or features)
│
├── README.md                       # Main package overview (installation, examples, visuals, and test plan)
├── LICENSE / LICENSE.md            # License declaration (MIT License)
│
├── .Rbuildignore                   # Excludes local/non-package files from build (e.g., .git, .Rproj, docs/)
├── .gitignore                      # Git version control exclusions
│
├── imputetoolkit.Rproj             # RStudio project file (makes package development reproducible)
│
├── _pkgdown.yml                    # Configuration for the pkgdown documentation website
├── docs/                           # Auto-generated pkgdown site (HTML pages hosted on GitHub Pages)
│   ├── index.html                  # Home page for pkgdown site
│   ├── reference/                  # Function documentation in HTML form
│   ├── articles/                   # Vignettes rendered as articles
│   └── figures/                    # Plot and logo images used in docs
│
└── pkgdown/                        # Static assets for website styling
    ├── extra.css                   # Custom CSS for theme styling
    └── favicon/                    # App icons (favicon, apple-touch-icon, etc.)

Core Functions

evaluator()

Main pipeline that executes:

1. Data loading or input validation
2. Missingness injection
3. Imputation via all four methods
4. Metric evaluation (numeric & categorical)
5. Return of structured evaluator objects

Returns:

list(mean_mode, median_mode, mice, knn)

Each object contains:

$method
$metrics_numeric
$metrics_categorical
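
For orientation, a minimal sketch of inspecting one of the returned objects, using the fields listed above (illustrative; assumes raw_data is a data frame containing missing values):

# Inspect the KNN result returned by evaluator()
res <- evaluator(data = raw_data)
res$knn$method                 # imputation method label
res$knn$metrics_numeric        # RMSE, MAE, R2, Correlation, KS
res$knn$metrics_categorical    # Accuracy, Kappa, F1, MacroF1, BalancedAccuracy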

S3 Methods

The package provides print, summary, and plot S3 methods for evaluator objects; see Example Usage and the Test Plan below for their output.

C++ Backend (evaluate_imputation_split.cpp)

Efficient Rcpp backend that calculates both numeric and categorical metrics in one pass.

Numeric metrics

| Metric | Description |
| --- | --- |
| RMSE | Root Mean Square Error |
| MAE | Mean Absolute Error |
| R² | Coefficient of determination |
| Correlation | Pearson correlation |
| KS | Kolmogorov–Smirnov statistic |

Categorical metrics

| Metric | Description |
| --- | --- |
| Accuracy | Exact match ratio |
| Kappa | Chance-adjusted agreement |
| F1 | Micro-F1 score |
| MacroF1 | Mean F1 over all classes |
| BalancedAccuracy | Mean recall per class |

All metrics are implemented in C++ using Rcpp, ensuring fast and stable performance.
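
For reference, these metrics can be sanity-checked against plain-R equivalents. The sketch below is illustrative only (it is not the package's C++ code) and assumes truth and imputed are aligned vectors of original and imputed values:

# Plain-R reference formulas for the split metrics (illustrative)
rmse    <- function(truth, imputed) sqrt(mean((truth - imputed)^2))
mae     <- function(truth, imputed) mean(abs(truth - imputed))
r2      <- function(truth, imputed) 1 - sum((truth - imputed)^2) / sum((truth - mean(truth))^2)
pearson <- function(truth, imputed) cor(truth, imputed)
ks_stat <- function(truth, imputed) as.numeric(ks.test(truth, imputed)$statistic)

accuracy   <- function(truth, imputed) mean(truth == imputed)
kappa_stat <- function(truth, imputed) {
  classes <- union(unique(truth), unique(imputed))
  po  <- mean(truth == imputed)                                    # observed agreement
  p_t <- table(factor(truth,   levels = classes)) / length(truth)
  p_i <- table(factor(imputed, levels = classes)) / length(imputed)
  pe  <- sum(p_t * p_i)                                            # chance agreement
  (po - pe) / (1 - pe)
}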


Unit Tests (tests/testthat/)

Comprehensive suite covering:

| Test Area | Focus |
| --- | --- |
| Pipeline | Verifies all four imputation methods run successfully |
| S3 Methods | print, summary, and invisibility checks |
| Wrapper Functions | extract_metrics, plot_metrics, suggest_best_method |
| Error Handling | Invalid inputs, unsupported file formats |
| Reproducibility | Consistent metrics with a fixed seed |
| Parallel KNN | Multi-core execution validation |
| C++ Evaluator | Direct unit tests for numeric/categorical metrics |

Example output (from testthat::test_package("imputetoolkit")):

[ FAIL 0 | WARN 1 | SKIP 0 | PASS 76 ]

Installation

Requires R ≥ 4.0 and a working C++ toolchain for Rcpp (Rtools on Windows).

Install from GitHub

install.packages("remotes")
remotes::install_github("tanveer09/imputetoolkit@main", build_vignettes = TRUE, INSTALL_opts = c("--install-tests"))

Load the package

library(imputetoolkit)

Browse documentation

browseVignettes("imputetoolkit")

Dependencies

Before installation, ensure the following R packages are installed:

install.packages(c(
  "Rcpp", "FNN", "mice", "ggplot2", "dplyr",
  "knitr", "rmarkdown", "testthat", "foreach", "doParallel"
))

Example Usage

1. Load Data

file <- system.file("extdata", "synthetic_mixed_dataset.csv", package = "imputetoolkit")
raw_data <- read.csv(file, stringsAsFactors = TRUE)

2. Run the Evaluator

res <- evaluator(data = raw_data)

3. Inspect One Method

print(res$mean_mode)
summary(res$mean_mode)

4. Compare All Methods

print_metrics(res)
plot_metrics(res, "ALL")

5. Suggest the Best Method

suggest_best_method(res, "ALL")

Example output:

Numeric Columns:
     Best imputation method as per "RMSE, MAE, R2, KS" metric: Mean/Mode
     Best imputation method as per "Correlation" metric: MICE

Categorical Columns:
     Best imputation method as per "Kappa" metric: KNN
     Best imputation method as per "Accuracy, F1, BalancedAccuracy" metric: Mean/Mode
     Best imputation method as per "MacroF1" metric: MICE

Example Visuals

Metric Comparison Across Methods

plot_metrics(res, "ALL")

(Figure: Comparison of imputation methods across all metrics)

Single Metric (e.g. RMSE)

plot_metrics(res, "RMSE")

(Figure: Comparison of methods on RMSE)

Density Comparison

eval_list <- get_eval_list(res)
plot_density_per_column(eval_list, "age")
plot_density_all(eval_list)

(Figure: True vs imputed densities)

Evaluation Metrics Summary

| Metric | Type | Description |
| --- | --- | --- |
| RMSE | Numeric | Root mean squared error |
| MAE | Numeric | Mean absolute error |
| R² | Numeric | Variance explained |
| Correlation | Numeric | Pearson correlation |
| KS | Numeric | Distribution similarity |
| Accuracy | Categorical | Exact match rate |
| Kappa | Categorical | Chance-corrected agreement |
| F1 / MacroF1 | Categorical | Balance between precision & recall |
| BalancedAccuracy | Categorical | Mean recall per class |

Additional Features

  • Split-metric evaluation: numeric and categorical metrics handled separately.
  • C++ acceleration: high-performance evaluation with safe NA handling.
  • Parallel KNN: multi-core computation for mixed-type datasets.
  • Skewness-aware mean imputation: applies log + geometric mean for |skewness| > 1 (sketched below).
  • Reproducible pipeline: consistent metrics across runs (set.seed() used).
  • Custom plotting: side-by-side bar charts, density overlays, and panel comparisons.
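
The skewness-aware rule can be illustrated with a minimal sketch; this mirrors the idea rather than the package's internal implementation and assumes strictly positive values for the log transform:

# Illustrative skewness-aware mean imputation for a numeric vector x
impute_mean_skew <- function(x) {
  obs  <- x[!is.na(x)]
  skew <- mean((obs - mean(obs))^3) / sd(obs)^3      # moment-based skewness
  fill <- if (abs(skew) > 1 && all(obs > 0)) {
    exp(mean(log(obs)))                              # geometric mean via log transform
  } else {
    mean(obs)                                        # plain arithmetic mean
  }
  x[is.na(x)] <- fill
  x
}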

Documentation

Function documentation is available in the Reference section of the pkgdown site, and the full workflow vignette can be opened with browseVignettes("imputetoolkit").

Test Plan

This section provides a detailed plan for testing the imputetoolkit package. Follow the steps in order. Each step includes the purpose, example commands, and the expected outcome. This plan allows reviewers to verify both functionality and error handling.


1. Environment Setup

Purpose:

Set up the environment and ensure the package installs and loads correctly.

Steps:

1. Install remotes (if not already installed):

install.packages("remotes")

2. Install imputetoolkit from GitHub:

remotes::install_github(
  "tanveer09/imputetoolkit@draft",
  build_vignettes = TRUE,
  INSTALL_opts = c("--install-tests")
)

This command builds vignettes and installs all test files automatically. If the package was already installed, reinstall with the same command to ensure the tests are available.

3. Load the package:

library(imputetoolkit)

Expected outcome: Package installs and loads without errors.

4. Check vignette availability:

browseVignettes("imputetoolkit")

Expected outcome: At least one vignette appears, linking to the HTML vignette documentation.


2. Data Requirements

Purpose:

Confirm that reviewers use a dataset suitable for imputation evaluation.

Requirements:

Your dataset should:

  • Contain at least one numeric and/or categorical variable.
  • Include some missing values for imputation.
  • Have no constant or identical columns.
  • Avoid extremely small datasets (fewer than ~5 rows), as MICE may fail due to collinearity.

Example:

filename <- system.file("extdata", "sample_dataset.csv", package = "imputetoolkit")
raw_data <- read.csv(filename, stringsAsFactors = TRUE)
head(raw_data)

Expected outcome: A data frame similar to the example below, with some missing values.

  age income score height color city   education satisfaction purchased
1  56 126091  51.19  166.1   Red  DAL High School      Neutral       Yes
2  46 129500  52.75  182.1 Green   LA        <NA>         <NA>       Yes
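
Optionally, the data requirements listed above can be checked programmatically; a minimal sketch using base R (illustrative):

# Quick sanity checks for the data requirements (illustrative)
stopifnot(any(is.na(raw_data)))                 # some missing values present
stopifnot(nrow(raw_data) >= 5)                  # not extremely small
constant_cols <- sapply(raw_data, function(x) length(unique(na.omit(x))) <= 1)
stopifnot(!any(constant_cols))                  # no constant columns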

3. Evaluator Pipeline

Purpose:

Run the evaluator function and confirm that it detects missing values, validates inputs, and produces results.

results <- evaluator(data = raw_data)
names(results)

Expected outcome: A list with four elements: "mean_mode", "median_mode", "mice", and "knn".

Example failure case:

  • If the dataset has no missing values, the function prints:

    “No missing values detected. Please provide data with missing entries.”

  • If both data and filename are provided, message:

    “Both data and filename provided — using data input and ignoring filename.”


4. Inspect Individual Results

Purpose:

Validate print and summary outputs for each imputation method.

4.1 print()

print(results$mean_mode)

Expected: Displays global metrics (RMSE, MAE, R², Correlation, KS, Accuracy).

4.2 summary()

summary(results$mean_mode)

Expected: Returns a per-column metrics data frame and a “GLOBAL” summary row.


5. Compare Methods

Purpose:

Confirm comparative visualization and tabular summaries of metrics.

5.1 print_metrics()

print_metrics(results)

Expected: Table comparing all imputation methods by metric.

5.2 plot_metrics()

plot_metrics(results, "R2")

Expected: Faceted bar chart comparing methods by the selected metric.


6. Suggest Best Method

Purpose:

Check that the helper correctly identifies the top-performing imputation method(s).

output <- suggest_best_method(results, "ALL")

Expected: Displays methods performing best for each metric (R², KS, RMSE, etc.).


7. Vignette Validation

Purpose:

Ensure the vignette compiles successfully and demonstrates the full workflow.

browseVignettes("imputetoolkit")

Expected: Vignette opens and runs without errors, showcasing data loading, imputation, and evaluation.


8. Error Handling

Purpose:

Test robustness and clarity of error messages.

# Missing arguments
try(evaluator())

# Invalid file path
try(evaluator(filename = "fake.csv"))

Expected outcomes:

  • Missing arguments → “Please provide either a filename or a data.frame.”

  • Invalid file path → “File not found: fake.csv”

  • Small dataset / constant variables → “MICE imputation failed: dataset may be too small or contain collinear variables.”


9. Reproducibility Check

Purpose:

Verify results are consistent across runs.

res1 <- evaluator(data = raw_data)
res2 <- evaluator(data = raw_data)
identical(res1, res2)

Expected outcome: TRUE — confirming reproducibility.


10. Run Unit Tests

Purpose:

Execute all built-in unit tests to confirm functionality.

library(testthat)
test_package("imputetoolkit")

Expected outcome:

[ FAIL 0 | WARN 1 | SKIP 0 | PASS 76 ]

Development Notes

  • Modular design: easy extension for new imputation algorithms (e.g. missForest, EM).
  • Full testthat coverage with >50 passing tests.
  • Fully vectorized Rcpp implementation.
  • Works seamlessly with knitr / rmarkdown for reproducible reports.

Feedback and Contributions

Feedback, feature requests, or bug reports are welcome via the Issues page.


License

Released under the MIT License.


Citation

Singh, Tanveer. (2025). imputetoolkit: An R Package for Evaluating Missing Data Imputation Methods. Victoria University of Wellington.