
CASIdata provides the datasets from Efron & Hastie (2016, ISBN: 9781108107952), Computer Age Statistical Inference: Algorithms, Evidence, and Data Science, in an accessible R format for anyone who wants to use them for teaching or study, or to reproduce or extend analyses from the book. They were downloaded from Trevor Hastie’s web site, https://hastie.su.domains/CASI_files/DATA/, but quite a few files were messy and required some processing to turn them into R datasets.
Even so, some of the datasets may require data cleaning, renaming of variables, re-shaping or other tidying steps to be useful for analysis. But that’s part of learning.
This package is not yet on CRAN. You can install it from this GitHub repo or from R-universe:
remotes::install_github("friendly/CASIdata")
install.packages('CASIdata', repos = c('https://friendly.r-universe.dev'))
| Dataset | dim | Title |
|---|---|---|
| DTI | 15443x4 | DTI Brain Imaging Data |
| als | 1822x371 | ALS Data |
| baseball | 18x3 | Baseball Batting Averages |
| bivnorm | 40x2 | Bivariate Normal Data |
| butterfly | 24x2 | Butterfly Species Data |
| cellinfusion | 25x4 | Cell Infusion Data |
| cholesterol | 164x2 | Cholesterol Data |
| diabetes | 442x12 | Diabetes Data |
| doseresponse | 11x2 | Dose Response Data |
| galaxy | 270x3 | Galaxy Data |
| haplotype | 197x102 | Human Ancestry Haplotype Data |
| insurance | 60x3 | Insurance Life Table Data |
| leukemia_small | 3571x72 | Leukemia Gene Expression Data (Small) |
| ncog | 96x6 | NCOG Head and Neck Cancer Data |
| nodes | 844x2 | Lymph Nodes Cancer Data |
| pediatric | 1620x7 | Pediatric Cancer Survival Data |
| police | 2748x1 | Police Racial Bias Data |
| prostz | 6032x1 | Prostate Cancer Z-values |
| student_score | 22x5 | Student Score Data |
| supernova | 39x11 | Type Ia Supernova Data |
| vasoconstriction | 39x2 | Vasoconstriction Data |
The following dataset appears in data-raw/CASI-save.R
but is not (yet) included in the package:
| Dataset | Reason |
|---|---|
| SPAM | Variable names need cleanup; requires mapping from UCI Spambase documentation |
See data-raw/missing-datasets.md for details on
resolving this.
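The kind of name mapping described above can be sketched as follows. This is only an illustration under stated assumptions: it assumes the Spambase documentation lists one attribute per line in a `name: type` form, and the three sample lines, the toy values, and the object names `names_lines` and `spam_toy` are hypothetical stand-ins rather than the package's actual code or data.

```r
# Hypothetical excerpt of attribute lines in "name: type" form (assumed layout)
names_lines <- c(
  "word_freq_make:         continuous.",
  "word_freq_address:      continuous.",
  "capital_run_length_total: continuous."
)

# Keep the text before the first colon as the variable name
var_names <- sub(":.*$", "", names_lines)

# Toy data frame standing in for the unnamed SPAM columns (values are made up)
spam_toy <- data.frame(V1 = c(0, 0.21), V2 = c(0.64, 0.28), V3 = c(278, 1028))

# make.names() turns any awkward characters into syntactically valid R names
names(spam_toy) <- make.names(var_names, unique = TRUE)
names(spam_toy)
```

Names containing punctuation would be altered by `make.names()`, so the result should be checked against the documentation before use.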
These large datasets are referenced in the book but not included in the package due to size constraints. They can be downloaded directly from the sources listed below.
protein_kernel <- matrix(scan("https://hastie.su.domains/CASI_files/DATA/protein_kernel.txt", what=0), 1708, 1708)
protein_label <- scan("https://hastie.su.domains/CASI_files/DATA/protein_label.txt", what=0)
prostmat <- read.csv("https://hastie.su.domains/CASI_files/DATA/prostmat.csv")
(See data-raw/missing-datasets.md for renaming code.)
leukemia_big is the larger counterpart of leukemia_small:
leukemia_big <- read.csv("https://hastie.su.domains/CASI_files/DATA/leukemia_big.csv")

Some datasets had variables renamed for clarity:
| Dataset | Original | Renamed |
|---|---|---|
| butterfly | x, y | k, count |
| police | X2.411 | z |
| prostz | X1.47236666651029 | z |
| galaxy | Reshaped from wide to long format with mag, red, freq | |
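Names such as `X2.411` are what `read.csv()` produces when a file without a header row is read with the default `header = TRUE`: the first value is consumed as a column name and prefixed with `X`. Whether that is how the original names arose is an assumption. The sketch below uses made-up values (`raw`, `wide`) to show that effect, a clean rename, and one base-R way to go from a wide table to long format with columns `mag`, `red`, `freq`. It is not the package's actual processing code.

```r
# Toy stand-in for a headerless one-column file (values are illustrative only)
raw <- "2.411\n0.532\n-1.207"

# Default header = TRUE turns the first value into a column name
bad <- read.csv(text = raw)
names(bad)  # "X2.411"

# header = FALSE keeps every value; col.names supplies a readable name
good <- read.csv(text = raw, header = FALSE, col.names = "z")

# Toy wide table of counts: rows indexed by mag, columns by red (made-up labels)
wide <- matrix(1:6, nrow = 2,
               dimnames = list(mag = c("m1", "m2"), red = c("r1", "r2", "r3")))

# as.table() + as.data.frame() gives one row per (mag, red) cell
long <- as.data.frame(as.table(wide), responseName = "freq")
long
```

In `long`, `mag` and `red` are factors built from the dimension labels; converting them to numeric would be a further step if the labels encode numbers.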
No examples yet.
library(CASIdata)
## basic example code