| Title: | Name-Blind Variable-Role Detection by Data Signature |
| Version: | 0.1.0 |
| Description: | Deterministic, name-blind detection of variable roles (group, outcome, survival time and event, paired and agreement measurements, repeated measures, scale items, subject identifier, covariate) in tabular data. Roles are assigned from each column's information-theoretic signature – Shannon entropy, normalized mutual information, and distributional shape – rather than from column names, so renaming columns to 'col_1', 'col_2', ... does not change the result ("Data inspice, non nomen"). An optional, capped name-based hint and automatic header-row detection are also provided. No large language models and no external data transmission. Extracted from the 'MDStatR' biostatistics engine; see Boynukara (2026) <doi:10.5281/zenodo.20707791>. |
| License: | Apache License (== 2.0) |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Depends: | R (≥ 4.0.0) |
| Imports: | stats, utils |
| Suggests: | moments, diptest, stringdist, readxl, openxlsx, haven, testthat (≥ 3.0.0), knitr, rmarkdown, spelling |
| Config/testthat/edition: | 3 |
| VignetteBuilder: | knitr |
| URL: | https://github.com/canboynukara/rolescry |
| BugReports: | https://github.com/canboynukara/rolescry/issues |
| Language: | en-US |
| NeedsCompilation: | no |
| Packaged: | 2026-06-17 20:27:43 UTC; cboyn |
| Author: | Can Boynukara |
| Maintainer: | Can Boynukara <canboynukara1@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-06-22 16:20:02 UTC |
rolescry: Name-blind Variable-Role Detection
Description
Deterministic, name-blind detection of variable roles (group, outcome, survival time/event, paired, agreement, repeated measures, scale items, subject id, covariate) in tabular data, using information-theoretic signatures (Shannon entropy, normalized mutual information) and distributional shape rather than column names. The guiding principle is "Data inspice, non nomen" – inspect the data, not the name.
Details
The single public entry point is detect_roles(). Header-aware data
loading is available via read_data(). No LLMs are used and no data is
transmitted externally. Detection rests at least 90 percent on the
mathematical signature; the optional name bonus is capped at 10 percent
(see the name_bonus argument).
Extracted from the MDStatR biostatistics engine.
Author(s)
Maintainer: Can Boynukara <canboynukara1@gmail.com> (ORCID) [copyright holder]
Other contributors:
M. Yasir Ceyhan (ORCID) [contributor]
See Also
Useful links:
Report bugs at https://github.com/canboynukara/rolescry/issues
Normalized mutual information
Description
Computes the normalized mutual information (NMI) between two discrete
variables, a name-blind, information-theoretic measure of association in
[0, 1]. NMI is the mutual information divided by the smaller of the two
marginal Shannon entropies; it is 0 for independent variables and 1 for a
perfect (deterministic) association. Unlike a raw chi-squared statistic, it is
comparable across variables with different numbers of levels.
Usage
compute_nmi(x, y = NULL)
Arguments
x |
Either a two-way contingency table / matrix of counts, or a vector (factor, character, or numeric) of the first variable. |
y |
Optional. If |
Value
A single numeric in [0, 1]. Returns 0 for degenerate input
(fewer than two rows/columns, zero total, or near-zero marginal entropy).
Examples
set.seed(1)
g <- sample(c("A", "B", "C"), 200, replace = TRUE)
y <- ifelse(g == "A", "yes", sample(c("yes", "no"), 200, replace = TRUE))
compute_nmi(g, y) # > 0: g carries information about y
compute_nmi(g, sample(g)) # ~0: shuffled -> independent
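The definition given in the Description (mutual information divided by the smaller marginal Shannon entropy) can be reproduced in a few lines of base R. The sketch below is an illustration only: `nmi_sketch` is a hypothetical helper, not the package's internal code, and its degenerate-input threshold (`1e-12`) is an assumption.

```r
# Hypothetical re-implementation for illustration; not rolescry's source.
nmi_sketch <- function(x, y) {
  tab <- table(x, y)                       # joint counts
  n <- sum(tab)
  if (n == 0 || nrow(tab) < 2 || ncol(tab) < 2) return(0)

  p_xy <- tab / n                          # joint probabilities
  entropy <- function(p) {                 # Shannon entropy in bits
    p <- p[p > 0]
    -sum(p * log2(p))
  }
  h_x <- entropy(rowSums(p_xy))            # marginal entropy of x
  h_y <- entropy(colSums(p_xy))            # marginal entropy of y
  h_min <- min(h_x, h_y)
  if (h_min < 1e-12) return(0)             # assumed cut-off for "near-zero"

  # I(X;Y) = H(X) + H(Y) - H(X,Y)
  mi <- h_x + h_y - entropy(as.vector(p_xy))
  max(0, min(1, mi / h_min))               # clamp rounding error into [0, 1]
}

g <- rep(c("A", "B", "C"), times = 40)
nmi_sketch(g, g)                             # identical variables: 1
nmi_sketch(g, rep(c("u", "v"), each = 60))   # balanced, independent design: 0 up to rounding
```

Dividing by the smaller entropy means the value reaches 1 whenever the lower-entropy variable is fully determined by the other, even if the two variables have different numbers of levels.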
Detect the header row of a raw, unparsed table
Description
Given a raw table read with no header (every cell character), scores
each of the first rows with a 7-signal weighted heuristic (alphabetic ratio,
non-numeric ratio, uniqueness, normalized Shannon entropy, median string
length, alpha-vs-next-row transition, fill completeness) and returns the
most header-like row plus repaired, unique column names. Empty cells
are upward-filled (merged-cell repair) and any still-empty name becomes
col_<j>. Base + stats only; no file-format dependencies.
Usage
detect_header(raw, verbose = FALSE)
Arguments
raw |
A data.frame or matrix of the raw sheet, read with
|
verbose |
Logical; emit the chosen row via |
Value
A list with header_row (integer), score (numeric),
names (repaired character vector, length == ncol(raw)),
and all_scores.
See Also
Examples
raw <- data.frame(
V1 = c("age", "34", "51"),
V2 = c("sex", "M", "F"),
V3 = c("score", "8.1", "7.4"),
stringsAsFactors = FALSE
)
detect_header(raw)$names
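To give a feel for how row scoring and name repair might work, the sketch below uses only two of the seven signals named in the Description (alphabetic ratio and non-numeric ratio) with equal weights. The function names, the equal weighting, and the use of `make.unique()` for de-duplication are all assumptions made for illustration; the package's actual weights and repair rules are not stated here, and the upward fill of empty cells is not reproduced.

```r
# Hypothetical two-signal scorer; the real heuristic combines seven weighted signals.
score_rows_sketch <- function(raw, n_rows = 5) {
  raw <- as.matrix(raw)
  k <- min(n_rows, nrow(raw))
  vapply(seq_len(k), function(i) {
    cells <- trimws(raw[i, ])
    cells <- cells[!is.na(cells) & nzchar(cells)]
    if (length(cells) == 0) return(0)
    alpha <- mean(grepl("[[:alpha:]]", cells))                    # share of cells containing letters
    non_num <- mean(is.na(suppressWarnings(as.numeric(cells))))   # share of cells that do not parse as numbers
    (alpha + non_num) / 2                                         # assumed equal weights
  }, numeric(1))
}

# Hypothetical name repair: empty names become col_<j>, duplicates get a suffix.
repair_names_sketch <- function(nm) {
  nm <- trimws(nm)
  empty <- is.na(nm) | !nzchar(nm)
  nm[empty] <- paste0("col_", which(empty))
  make.unique(nm)
}

raw <- data.frame(
  V1 = c("age", "34", "51"),
  V2 = c("sex", "M", "F"),
  V3 = c("score", "8.1", "7.4"),
  stringsAsFactors = FALSE
)
scores <- score_rows_sketch(raw)
which.max(scores)                           # row 1 is the most header-like
repair_names_sketch(c("age", "", "age"))    # "age" "col_2" "age.1"
```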
Detect variable roles by data signature, not by name
Description
Inspects an already-loaded data frame and assigns each column (or group of
columns) to statistical roles – group variable, continuous/binary outcome,
survival time and event, paired and agreement measurement pairs, repeated
measures, scale items, subject id, and covariates – using only the data's
information-theoretic signature (Shannon entropy, distributional shape,
inter-column structure) and never the column names. Renaming columns to
col_1, col_2, ... does not change the result (the name-blindness, or
"turnusol" [Turkish for litmus], invariant).
Usage
detect_roles(data, name_bonus = NULL, verbose = FALSE)
## S3 method for class 'role_detection'
print(x, ...)
## S3 method for class 'role_detection'
summary(object, ...)
Arguments
data |
A data.frame. |
name_bonus |
Optional named list mapping role keys to character vectors
of case-insensitive keyword regex fragments, e.g.
|
verbose |
Logical; if |
x |
A role_detection object. |
... |
Ignored. |
object |
A role_detection object. |
Details
Detection is purely mathematical by default (name_bonus = NULL). When
a keyword dictionary is supplied via name_bonus, column names act only
as a small, capped tie-breaker (at most a +10 point nudge, i.e. <= 10 percent,
applied to candidate selection for the group, outcome, subject-id and
survival roles) – the reported confidence stays the mathematical signature.
See rolescry_default_name_bonus() for a ready-made dictionary.
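The separation between selection score and reported confidence can be sketched as follows. This is a toy model under stated assumptions: `select_candidate_sketch` is a hypothetical function, scores are taken to lie on a 0 to 100 scale, and a keyword hit earns a flat bonus equal to the cap. The package's actual scoring is not documented at this level.

```r
# Hypothetical selection step; function name, score scale and flat bonus are assumptions.
select_candidate_sketch <- function(signature_score, col_names,
                                    keywords = NULL, cap = 10) {
  bonus <- numeric(length(col_names))
  if (!is.null(keywords)) {
    pattern <- paste(keywords, collapse = "|")
    hit <- grepl(pattern, col_names, ignore.case = TRUE)  # case-insensitive regex match
    bonus <- ifelse(hit, cap, 0)                          # never more than `cap` points
  }
  best <- which.max(signature_score + bonus)              # names only influence selection
  list(column = col_names[best],
       confidence = signature_score[best])                # confidence excludes the bonus
}

cols <- c("site", "treatment_arm", "age")

# Near tie: the name bonus breaks it in favour of the keyword match.
select_candidate_sketch(c(62, 60, 35), cols)$column                               # "site"
select_candidate_sketch(c(62, 60, 35), cols, keywords = c("arm", "treat"))$column # "treatment_arm"

# Clear signature winner: the capped bonus cannot override it.
select_candidate_sketch(c(90, 60, 35), cols, keywords = c("arm", "treat"))$column # "site"
```

Because the bonus is bounded, a column can only win on its name when its signature score is already within the cap of the leader.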
Value
An S3 object of class "role_detection": a list with
- var_info
data.frame(column, type) – value-based column typing.
- roles
named list; each entry has
found, columns, score, max_score, pct, detected_by and a components score breakdown.
- value_types
named character vector of per-column value-type labels.
- potential_pairs
list of candidate continuous column pairs with paired and agreement scores.
- n_obs, n_var
dataset dimensions.
See Also
read_data(), compute_nmi(), rolescry_default_name_bonus()
Examples
set.seed(1)
d <- data.frame(
arm = rep(c(0, 1), each = 50),
pre = rnorm(100, 10, 2),
post = rnorm(100, 11, 2),
resp = rbinom(100, 1, 0.4)
)
res <- detect_roles(d)
res
res$roles$group_var$columns
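The name-blindness invariant follows from the fact that a value-based signature never reads `names(data)`. The sketch below computes one such signature, the Shannon entropy of each column in bits, and checks that renaming the columns leaves it unchanged. `entropy_signature_sketch` is a hypothetical helper for illustration and covers only one of the signals that detect_roles() is described as using.

```r
# Hypothetical single-signal signature; not the package's detection logic.
entropy_signature_sketch <- function(df) {
  vapply(df, function(col) {
    p <- as.numeric(table(col)) / sum(!is.na(col))  # empirical value frequencies
    p <- p[p > 0]
    -sum(p * log2(p))                               # Shannon entropy in bits
  }, numeric(1))
}

set.seed(1)
d <- data.frame(
  arm  = rep(c(0, 1), each = 50),
  pre  = rnorm(100, 10, 2),
  resp = rbinom(100, 1, 0.4)
)

sig <- entropy_signature_sketch(d)
renamed <- setNames(d, paste0("col_", seq_along(d)))
sig_renamed <- entropy_signature_sketch(renamed)

sig                                              # balanced two-level arm gives exactly 1 bit
identical(unname(sig), unname(sig_renamed))      # TRUE: only the labels differ
```

A balanced two-level column carries 1 bit, while a continuous column with all-distinct values approaches log2(n) bits, which is the kind of contrast a signature-based detector can exploit.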
Read a data file with automatic header detection
Description
Reads a tabular file into a data.frame, detecting the header row with
detect_header() (the data is first read with no header so the header row is
visible as data). Delimited text (.csv/.tsv) is read with base R and always
works; spreadsheet and statistical formats use optional packages and degrade
gracefully with an actionable error if the package is absent.
Usage
read_data(
path,
header = NULL,
sheet = NULL,
na_strings = c("", "NA", "N/A", "n/a", "na", "NULL", "null", "."),
verbose = FALSE
)
Arguments
path |
Path to the file. |
header |
Optional integer giving the 1-based header row to use directly,
bypassing detection. |
sheet |
Optional sheet name/index for Excel files. |
na_strings |
Character vector of tokens mapped to NA. |
verbose |
Logical; emit the detected header row via |
Details
Supported: .csv, .tsv/.tab (base);
.xlsx/.xls/.xlsm (Suggests: readxl or openxlsx);
.sav/.sas7bdat/.dta (Suggests: haven, header intrinsic);
.rds (base, returned as stored).
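One common way to make optional back-ends "degrade gracefully" is to check for them with `requireNamespace()` before use. The sketch below maps the extensions listed above to their candidate packages and raises an actionable error when none is installed. `backend_for_sketch` and its messages are hypothetical; whether read_data() dispatches exactly this way is an assumption.

```r
# Hypothetical dispatcher; names and error messages are illustrative only.
backend_for_sketch <- function(path) {
  ext <- tolower(tools::file_ext(path))
  # Candidate packages per extension; character(0) means base R is enough.
  candidates <- switch(ext,
    csv = , tsv = , tab = , rds = character(0),
    xlsx = , xls = , xlsm = c("readxl", "openxlsx"),
    sav = , sas7bdat = , dta = "haven",
    stop("Unsupported file extension: '.", ext, "'", call. = FALSE)
  )
  if (length(candidates) == 0) return("base")

  installed <- vapply(candidates, requireNamespace, logical(1), quietly = TRUE)
  if (!any(installed)) {
    stop("Reading '.", ext, "' files needs one of: ",
         paste(candidates, collapse = ", "),
         ". Install it with install.packages().", call. = FALSE)
  }
  candidates[installed][1]                 # first available back-end
}

backend_for_sketch("trial.csv")            # "base"
```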
Value
A data.frame with detected column names and per-column types
inferred via type.convert.
See Also
detect_header(), detect_roles()
Examples
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,sex,score", "34,M,8.1", "51,F,7.4"), tmp)
read_data(tmp)
file.remove(tmp)
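The "read with no header, then promote a row" flow described above can be sketched for a CSV file with base R alone. The helper name `promote_header_sketch` is hypothetical, and the header row is fixed by hand here; in the package that index would come from detect_header() or the header argument.

```r
# Hypothetical helper: promote a chosen row to column names, then infer types.
promote_header_sketch <- function(raw, header_row) {
  body <- raw[-seq_len(header_row), , drop = FALSE]          # rows below the header
  names(body) <- unlist(raw[header_row, ], use.names = FALSE)
  body[] <- lapply(body, utils::type.convert, as.is = TRUE)  # per-column type inference
  rownames(body) <- NULL
  body
}

tmp <- tempfile(fileext = ".csv")
writeLines(c("Study export,,", "age,sex,score", "34,M,8.1", "51,F,7.4"), tmp)

# Read every cell as character so the header row is still visible as data.
raw <- utils::read.csv(tmp, header = FALSE, colClasses = "character")

out <- promote_header_sketch(raw, header_row = 2L)  # row 2 chosen by hand for this sketch
str(out)
unlink(tmp)
```

Reading everything as character first keeps a title line such as "Study export" from being mistaken for column names, at the cost of a second pass to recover numeric types.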
Default name-bonus keyword dictionary
Description
Returns a ready-made, ASCII-English keyword dictionary suitable for the
name_bonus argument of detect_roles(). It externalizes the
hard-coded keyword lists that lived inside the original MDStatR engine
(group/treatment, outcome, survival-time and event, subject-id terms) into a
plain, inspectable, locale-neutral list.
Usage
rolescry_default_name_bonus()
Details
Passing this turns column names into a small, capped tie-breaker only
(<= 10 percent of the selection score); the mathematical signature
still dominates (>= 90 percent), satisfying the name-blindness contract.
Detection without it (name_bonus = NULL) is purely mathematical.
Value
A named list of character vectors (regex fragments), with keys
group_var, outcome_continuous, outcome_binary,
subject_id, time_variable, event_variable.
Examples
nb <- rolescry_default_name_bonus()
names(nb)
set.seed(1)
d <- data.frame(
treatment_arm = rep(c("A", "B"), each = 60),
biomarker = rnorm(120),
death = rbinom(120, 1, 0.3)
)
detect_roles(d, name_bonus = nb)$roles$group_var$columns