
CASIdata

CASIdata provides the datasets from Efron & Hastie (2016, ISBN: 9781108107952), Computer Age Statistical Inference: Algorithms, Evidence, and Data Science, in an accessible R format for teaching, study, or attempts to reproduce or extend analyses from the book. The files were downloaded from Trevor Hastie’s website, https://hastie.su.domains/CASI_files/DATA/, but quite a few were messy and required some processing to turn them into R datasets.

Even so, some of the datasets may require data cleaning, renaming of variables, re-shaping or other tidying steps to be useful for analysis. But that’s part of learning.

Installation

This package is not yet on CRAN. You can install it from this GitHub repo or from R-universe:

remotes::install_github("friendly/CASIdata")
install.packages('CASIdata', repos = c('https://friendly.r-universe.dev'))

Datasets included here


Dataset dim Title
DTI 15443x4 DTI Brain Imaging Data
als 1822x371 ALS Data
baseball 18x3 Baseball Batting Averages
bivnorm 40x2 Bivariate Normal Data
butterfly 24x2 Butterfly Species Data
cellinfusion 25x4 Cell Infusion Data
cholesterol 164x2 Cholesterol Data
diabetes 442x12 Diabetes Data
doseresponse 11x2 Dose Response Data
galaxy 270x3 Galaxy Data
haplotype 197x102 Human Ancestry Haplotype Data
insurance 60x3 Insurance Life Table Data
leukemia_small 3571x72 Leukemia Gene Expression Data (Small)
ncog 96x6 NCOG Head and Neck Cancer Data
nodes 844x2 Lymph Nodes Cancer Data
pediatric 1620x7 Pediatric Cancer Survival Data
police 2748x1 Police Racial Bias Data
prostz 6032x1 Prostate Cancer Z-values
student_score 22x5 Student Score Data
supernova 39x11 Type Ia Supernova Data
vasoconstriction 39x2 Vasoconstriction Data
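
A listing like the one above can be approximated with base R's data(), which reports the datasets a package ships. This is only a sketch: the helper name list_datasets is hypothetical and not part of CASIdata, and the call on CASIdata assumes the package is installed.

```r
# Hypothetical helper (not part of CASIdata): list the datasets in a package.
list_datasets <- function(pkg) {
  # Return NULL rather than fail when the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) return(NULL)
  # data(package = ) returns a character matrix with columns
  # Package, LibPath, Item and Title
  res <- data(package = pkg)$results
  as.data.frame(res[, c("Item", "Title"), drop = FALSE],
                stringsAsFactors = FALSE)
}

# Prints the first few entries only if CASIdata is installed
casi <- list_datasets("CASIdata")
if (!is.null(casi)) print(head(casi))
```

The dimensions column is not reproduced here; obtaining it would require loading each dataset and calling dim() on it.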

Missing Datasets

The following dataset appears in data-raw/CASI-save.R but is not (yet) included in the package:

Dataset Reason
SPAM Variable names need cleanup; requires mapping from UCI Spambase documentation

See data-raw/missing-datasets.md for details on resolving this.

External Datasets (Not Included)

These large datasets are referenced in the book but not included in the package due to size constraints. They can be downloaded directly from the sources listed below.

CASI datasets (too large for CRAN)

Image datasets (hosted externally)

Variable Renaming

Some datasets had variables renamed for clarity:

Dataset Original Renamed
butterfly x, y k, count
police X2.411 z
prostz X1.47236666651029 z
galaxy Reshaped from wide to long format with mag, red, freq
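
The two kinds of changes in this table can be illustrated with toy data. Names such as X2.411 look like what read.csv() generates when a headerless file is read with header = TRUE; that is an assumption about the raw files, not something documented here. The reshaping step below is likewise a generic base R sketch on invented counts, not the code used to build galaxy.

```r
# --- 1. Headerless file read with the default header = TRUE ---------------
# Toy values; only the first one mimics the original 'police' column name.
raw <- "2.411\n1.2\n0.5"

bad <- read.csv(text = raw)             # first value is consumed as the header
names(bad)                              # "X2.411"

good <- read.csv(text = raw, header = FALSE, col.names = "z")
names(good)                             # "z", and all three values are kept

# --- 2. Wide table of counts reshaped to long format ----------------------
# Rows are magnitude bins, columns are redshift bins (labels are invented).
wide <- matrix(c(1, 6, 3,
                 2, 0, 5),
               nrow = 2, byrow = TRUE,
               dimnames = list(mag = c("m1", "m2"),
                               red = c("r1", "r2", "r3")))

# as.table() + as.data.frame() yields one row per (mag, red) cell,
# with the cell value stored in the column named by responseName
long <- as.data.frame(as.table(wide), responseName = "freq",
                      stringsAsFactors = FALSE)
long
```

After the second step, long has one row for each cell of wide and the columns mag, red and freq, which matches the layout described for galaxy.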

Example

No examples yet.

library(CASIdata)
## basic example code