Reproducibility workflow: snapshot, manifest, SHA-256, Zenodo

Published tax research (PBO costings, Grattan reform papers, Tax Institute briefs) has a reproducibility bar that goes beyond “I called ato_individuals() and summed column X.” Reviewers need to verify that the data you used is exactly the data you say you used. ato provides four features to meet that bar:

  1. Snapshot pin : declare the intended vintage of the data.
  2. SHA-256 integrity : every cached file is hashed; drift warns.
  3. Session manifest : every fetch is recorded with URL, SHA, retrieval time, and snapshot pin.
  4. Zenodo DOI : mint a DOI for the manifest so a paper can cite the exact data snapshot.

Setup

library(ato)

ato_snapshot("2026-04-24")
ato_manifest_clear()

Fetch your datasets

ind <- ato_individuals_postcode(
  year = c("2020-21", "2021-22", "2022-23"),
  state = "NSW"
)

companies <- ato_companies(year = "2022-23", table = "industry")
tax_gap   <- ato_tax_gaps()

Each ato_tbl prints with the snapshot pin and SHA-256 digest in its provenance header.

Inspect the session manifest

man <- ato_manifest()
man[, c("title", "sha256", "retrieved", "snapshot_date")]

Export the manifest for your paper appendix

ato_manifest_write("appendix/ato_manifest.csv")
ato_manifest_write("appendix/ato_manifest.yaml")

Mint a DOI via Zenodo

A DOI makes “retrieved from data.gov.au on 2026-04-24” citable and immutable. Your paper then cites doi:10.5281/zenodo.XXXXXXXX instead of a URL that might rotate.

dep <- ato_deposit_zenodo(
  title = "ATO data snapshot for working paper v1",
  creators = list(list(name = "Author, A.", orcid = "0000-0000-0000-0000")),
  upload = FALSE  # dry run; inspect payload first
)
dep$payload$metadata$title

# When ready to actually deposit:
# Sys.setenv(ZENODO_TOKEN = "...your token...")
# dep <- ato_deposit_zenodo(upload = TRUE)
# dep$doi_prereserve

Citing a dataset with full provenance

ato_cite(ind, style = "bibtex", doi = "10.5281/zenodo.XXXXXXXX")

The BibTeX note field includes the snapshot date and first 12 hex characters of the SHA-256. That is the verifiable audit trail a reviewer would ask for.