---
title: "Database / indexing layer"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Database / indexing layer}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = FALSE)
```

```{r minimal-example, eval = TRUE}
# Minimal executable example — selectRecords() works entirely in memory
library(gmsp)
library(data.table)
master <- data.table(
  RecordID  = c("aabbccdd00112233", "aabbccdd00112233", "eeff00112233aabb"),
  OwnerID   = c("NGAW", "NGAW", "CESMD"),
  EventID   = c("20100227T063452Z", "20100227T063452Z", "20110311T054624Z"),
  StationID = c("ANTU", "ANTU", "MYG004"),
  DIR       = c("H1", "H2", "H1"),
  EventMagnitude = c(8.8, 8.8, 9.1),
  Repi      = c(90, 90, 140)
)
sel <- selectRecords(master[EventMagnitude > 8 & DIR == "H1"])
print(sel)
```

`gmsp` ships an optional layer for managing a local strong-motion
record archive. It is **separate** from the signal-processing core
(`AT2TS`, `TS2IMF`, `TSL2PS`, `getIntensity`) — you can use the core
without ever touching the indexing layer.

The indexing layer assumes that records are stored on disk in a fixed
directory structure. The base paths are yours to choose; functions that
touch disk take explicit `path`, `path.records`, or `path.index` arguments.

## Expected file layout

```
<recordsDir>/                                      ← you choose this
  <OwnerID>/                                       e.g. "NGAW", "CESMD", "ESM"
    <EventID>/                                     e.g. "20060803T030800Z"
      <StationID>/                                 e.g. "NTYB"
        raw.owner/                                 provider files as downloaded
          record.json                              owner-supplied metadata
          <component-files>                        .AT2 / .v2 / .ac / .tr / ...
        raw/                                       gmsp output of extractRecord()
          AT.<RecordID>.csv                        WIDE: provider OCID columns (scaled to mm)
          AT.<RecordID>.json                       DIR / OCID / NP / PGA / dt / Fs / Units

<indexDir>/                                        ← you choose this
  RawFileTable.<OwnerID>.csv                       provider file inventory
  RawRecordTable.<OwnerID>.csv                     one row per RecordID
  RawIntensityTable.<OwnerID>.csv                  per (RecordID, DIR), 20 IM scalars
  EventTable.<OwnerID>.csv                         event metadata
  StationTable.<OwnerID>.csv                       station metadata

<selectionDir>/                                    ← you choose this
  <name>.csv                                       writeSelection() output
  <name>.json                                      sidecar with audit metadata
```
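
As an illustration of the layout above, the paths for one record can be assembled with `file.path()`. The helper name `record_paths()` and the directory name `"records"` are hypothetical; they are not part of `gmsp`.

```r
# Hypothetical helper (not a gmsp function): builds the expected
# locations of one record's files under the records root.
record_paths <- function(records_dir, owner, event, station, record_id, kind = "AT") {
  station_dir <- file.path(records_dir, owner, event, station)
  list(
    raw_owner = file.path(station_dir, "raw.owner"),
    csv       = file.path(station_dir, "raw", sprintf("%s.%s.csv", kind, record_id)),
    json      = file.path(station_dir, "raw", sprintf("%s.%s.json", kind, record_id))
  )
}

p <- record_paths("records", "NGAW", "20060803T030800Z", "NTYB", "aabbccdd00112233")
p$csv
#> [1] "records/NGAW/20060803T030800Z/NTYB/raw/AT.aabbccdd00112233.csv"
```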

## Provider formats supported

| `OwnerID` | Format | Parser | Quantity | Notes |
|---|---|---|---|---|
| `NGAW` | AT2 | `readAT2()` | AT | PEER NGA-West2 (4-line header, NPTS/DT) |
| `CESMD` | V2 / V2c | `readV2()` | AT | multi-channel V2 or single-channel V2c |
| `NWZ` | V2A | `readV2A()` | AT | NWZ-flavoured V2 |
| `GSC` | TR (A/B/C/Z) | `readTR()` | AT | Geological Survey of Canada |
| `IGP` | ACA / LIS | `readAC()` | AT | Instituto Geofísico del Perú |
| `UCR` | ACB | `readAC()` | AT | Universidad de Costa Rica |
| Generic | two-col | `readTwoCol()` | AT | (t, s) ASCII columns; used by CAL, CENA, etc. |
| `ISEE` | ISEE | `readISEE()` | VT | Micromate / ISEE blasting seismograph (mm/s velocity, MicL dropped) |

Each parser returns a LONG `data.table(t, OCID, s)` for one component
file. `parseRecord()` is the dispatcher that consults `.OWNER_FORMAT`
and calls the right parser for the owner.
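
The dispatch idea can be sketched with a named lookup. The internal structure of `.OWNER_FORMAT` is not described here, so the table `owner_parser`, the function `parse_one()`, and the stub parsers below are assumptions for illustration only; the stubs do not read any real format.

```r
library(data.table)

# Stub parsers (hypothetical): each ignores the file and returns a
# LONG table with the columns (t, OCID, s).
stub_parser <- function(ocid) {
  function(file) data.table(t = c(0, 0.01), OCID = ocid, s = c(0, 1))
}

# Assumed owner -> parser lookup, standing in for the package's dispatcher.
owner_parser <- list(NGAW = stub_parser("HNE"), CESMD = stub_parser("CH1"))

parse_one <- function(owner, file) {
  parser <- owner_parser[[owner]]
  if (is.null(parser)) stop("no parser registered for owner: ", owner)
  parser(file)
}

long <- parse_one("NGAW", "dummy.AT2")
names(long)
```

An unknown owner fails early with an explicit error rather than falling through to a default parser.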

## Extraction pipeline

```
parseRecord()       ── reads raw.owner/* via the owner's parser
   │                   returns LONG (t, OCID, s) for all components
   ▼
mapComponents()     ── derives DIR labels H1 / H2 / UP from provider OCIDs
   │                   H1/H2 are derived processing directions
   │                   extractRecord() uses rotate = FALSE
   │                   returns NULL for arrays or 2-component records
   ▼
alignComponents()   ── pads (or truncates) to equal NP across components
   │
   ▼
extractRecord()     ── scales to canonical mm via .parseUnits + .getSF
                       writes raw/<KIND>.<RecordID>.csv + <KIND>.<RecordID>.json
                       CSV columns remain provider OCID values; the JSON
                       sidecar stores the DIR -> OCID mapping.
                       KIND ∈ {AT, VT, DT} -- derived from the Units
                       suffix by .parseKind(), or forced by the
                       kind = "VT" argument (e.g. for blasting
                       records whose Units may be missing).
                       Sidecar peak field is named accordingly:
                       PGA (KIND=AT) / PGV (KIND=VT) / PGD (KIND=DT).
                       RecordID = first 16 hex chars of md5(CSV).
```

`extractRecord()` is the orchestrator; parsers and `mapComponents()` are
public so they can be reused or audited. Public calls use `parseRecord(.x, path)`
and `extractRecord(.x, path)`, where `.x` is the one-record master subset
and `path` is the records root.
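
To make the alignment step concrete, the sketch below pads two components of unequal length to a common `NP` and reshapes LONG to WIDE. Zero-padding at the tail is an assumption made here; the actual rule in `alignComponents()` (pad or truncate) may differ. `pad_to_common_np()` and the OCID labels are hypothetical.

```r
library(data.table)

dt <- 0.01  # sampling interval in seconds
long <- rbind(
  data.table(t = (0:4) * dt, OCID = "HNE", s = c(1, 2, 3, 4, 5)),
  data.table(t = (0:2) * dt, OCID = "HNN", s = c(9, 8, 7))
)

pad_to_common_np <- function(long, dt) {
  np <- long[, .N, by = OCID]            # samples per component
  np_max <- max(np$N)
  x <- copy(long)[, i := seq_len(.N) - 1L, by = OCID]
  # Full (OCID, sample index) grid; missing samples appear as NA after the join.
  grid <- CJ(OCID = np$OCID, i = seq_len(np_max) - 1L)
  out <- x[grid, on = c("OCID", "i")]
  out[is.na(s), s := 0]                  # assumed rule: zero-pad the tail
  out[, t := i * dt]                     # rebuild a common time axis
  list(wide = dcast(out, t ~ OCID, value.var = "s"),
       NP = np_max, pad = np_max - min(np$N))
}

res <- pad_to_common_np(long, dt)
res$wide
```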

## Indexing tables

After `extractRecord()` has produced `raw/` outputs for some records,
the indexing functions scan the records tree and emit per-owner CSVs
to `<indexDir>/`:

* `buildRawFileTable()` — provider-file inventory (one row per
  `ComponentID × FileID`); reads `raw.owner/record.json` or
  `raw.owner.tar.gz` (post-archive safe).
* `buildRawRecordTable()` — one row per `RecordID`
  (`NP = max(post-align)`, `pad = max NP − min NP`, `Fs`).
* `buildRawIntensityTable()` — calls `getRawIntensities()` per
  station; emits three rows per record (one per `DIR`), each
  carrying the 20 AT-derivable scalars from `getIntensity()`.
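
The scan these builders perform can be pictured as a recursive file listing followed by path parsing. The sketch below creates a throwaway tree under `tempdir()` and recovers `(OwnerID, EventID, StationID, RecordID)` from the sidecar paths. It illustrates the layout only and is not the package's implementation; `scan_sidecars()` is a hypothetical name.

```r
library(data.table)

# Build a one-record demo tree that follows the expected layout.
root <- file.path(tempdir(), "records-demo")
sidecar <- file.path(root, "NGAW", "20060803T030800Z", "NTYB", "raw",
                     "AT.aabbccdd00112233.json")
dir.create(dirname(sidecar), recursive = TRUE, showWarnings = FALSE)
writeLines("{}", sidecar)

scan_sidecars <- function(root) {
  # `pattern` is matched against file names, not full paths.
  rel <- list.files(root, pattern = "^(AT|VT|DT)\\.[0-9a-f]{16}\\.json$",
                    recursive = TRUE)
  if (!length(rel)) return(data.table())
  parts <- tstrsplit(rel, "/", fixed = TRUE)
  data.table(
    OwnerID   = parts[[1]],
    EventID   = parts[[2]],
    StationID = parts[[3]],
    KIND      = sub("\\..*$", "", parts[[5]]),
    RecordID  = sub("^[A-Z]+\\.([0-9a-f]{16})\\.json$", "\\1", parts[[5]])
  )
}

idx <- scan_sidecars(root)
idx
```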

The provider-flatfile + USGS catalog join (`buildEventTable()`) is
under development and ships in `inst/dev/`; it is not yet part of the
exported API.

## Master record catalog

`buildMaster()` joins, per owner:

* `RawRecordTable.<O>.csv` (record list),
* `EventTable.<O>.csv` (event scalars, merged via `fcoalesce` with
  source precedence `*.owner` > `*.USGS` > `*.ISC`),
* `StationTable.<O>.csv` (station scalars including Vs30),

and emits a `data.table` keyed by `(RecordID, DIR)`. It adds:

* `Repi` — epicentral distance (haversine, km),
* `Rhyp` — hypocentral distance, $\sqrt{\mathrm{Repi}^2
  + \mathrm{EventDepth}^2}$ (km).
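
The two ingredients above, source precedence and the distance columns, can be sketched with plain `data.table`. The suffixed column names (`EventDepth.owner` and so on), the coordinate column names, the Earth radius of 6371 km, and the helper `haversine_km()` are assumptions for illustration; `buildMaster()` may differ in detail.

```r
library(data.table)

# Great-circle distance in km (haversine formula); inputs in degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, R = 6371) {
  rad <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * R * asin(pmin(1, sqrt(a)))
}

M <- data.table(
  EventDepth.owner = c(NA, 35),
  EventDepth.USGS  = c(22.9, 30),
  EventDepth.ISC   = c(25, NA),
  EventLat = c(0, 0), EventLon = c(0, 0),
  StationLat = c(0, 1), StationLon = c(1, 0)
)

# First non-missing value wins, in the order owner > USGS > ISC.
M[, EventDepth := fcoalesce(EventDepth.owner, EventDepth.USGS, EventDepth.ISC)]
M[, Repi := haversine_km(EventLat, EventLon, StationLat, StationLon)]
M[, Rhyp := sqrt(Repi^2 + EventDepth^2)]
M[, .(EventDepth, Repi, Rhyp)]
```

Because `Rhyp` is built from `Repi` and a real-valued depth, `Rhyp >= Repi` holds by construction for every row.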

After `buildMaster()`, filter the master and pass the subset to
`selectRecords()` to produce a
`(RecordID, OwnerID, EventID, StationID)` selection. That selection is
the input contract for two consumers: the `readTS()` family, where
`readAT()`, `readVT()`, and `readDT()` are KIND-specific wrappers around
`readTS(.x, path, kind = ...)`, and `writeSelection()`, which persists
the selection to disk for orchestration.
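
The shape of a selection can be pictured as the distinct identifier tuples of the filtered master. The stand-in below mimics that shape with `unique()`; it is not the implementation of `selectRecords()`, which may do more (validation, ordering).

```r
library(data.table)

master <- data.table(
  RecordID  = c("aabbccdd00112233", "aabbccdd00112233", "eeff00112233aabb"),
  OwnerID   = c("NGAW", "NGAW", "CESMD"),
  EventID   = c("20100227T063452Z", "20100227T063452Z", "20110311T054624Z"),
  StationID = c("ANTU", "ANTU", "MYG004"),
  DIR       = c("H1", "H2", "H1"),
  EventMagnitude = c(8.8, 8.8, 9.1)
)

# Stand-in for the selection contract: one row per record, four id columns.
id_cols <- c("RecordID", "OwnerID", "EventID", "StationID")
selection_like <- master[EventMagnitude > 8, unique(.SD), .SDcols = id_cols]
selection_like
```

The two `DIR` rows of the first record collapse to a single row, since a selection identifies records rather than components.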

## Composing with the processing core

The natural composition for acceleration records is:

```r
M   <- buildMaster(path = "<your index path>")
Selection <- selectRecords(M[EventMagnitude > 7 & Repi < 100 & DIR == "H1"])
TS  <- readAT(.x = Selection, path = "<your records path>")
ATS <- TS[, AT2TS(.SD, units.source = "mm", Fmax = 25),
          by = .(RecordID, OwnerID, EventID, StationID)]
```

The output of `readAT()` is a wide table keyed by
`(RecordID, OwnerID, EventID, StationID, t)` with one column per
provider `OCID`. `AT2TS()` consumes it per record. The shape is
identical for `readVT()` and `readDT()`; pair them with `VT2TS()` /
`DT2TS()`. Blasting records (e.g. ISEE) typically flow through
`readVT()` + `VT2TS()`.
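
The `by =` pattern above hands each record's rows to the processing function as `.SD`. The sketch below reproduces that shape with a toy wide table and a stand-in function, `peak_abs()` (hypothetical), in place of `AT2TS()`; the OCID column names and sample values are made up.

```r
library(data.table)

# Toy wide table in the shape described above: id columns, t, one column per OCID.
TS <- data.table(
  RecordID  = rep(c("aabbccdd00112233", "eeff00112233aabb"), each = 3),
  OwnerID   = rep(c("NGAW", "CESMD"), each = 3),
  EventID   = rep(c("20100227T063452Z", "20110311T054624Z"), each = 3),
  StationID = rep(c("ANTU", "MYG004"), each = 3),
  t   = rep(c(0, 0.01, 0.02), times = 2),
  HNE = c(1, -4, 2, 0.5, 0.2, -0.1),
  HNN = c(0, 3, -1, 0.3, -0.6, 0.4)
)

# Stand-in for the processing step: peak absolute value per OCID column.
peak_abs <- function(sd) {
  ocid <- setdiff(names(sd), "t")
  data.table(OCID = ocid,
             peak = vapply(ocid, function(n) max(abs(sd[[n]])), numeric(1)))
}

# .SD holds only the non-grouping columns (t and the OCID columns) of one record.
peaks <- TS[, peak_abs(.SD), by = .(RecordID, OwnerID, EventID, StationID)]
peaks
```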

## Audit helpers

* `auditSite(M)` — flags rows with missing or out-of-range
  `StationVs30`.
* `auditDistances(M)` — flags missing or out-of-range `lat/lon`,
  negative depths, large `Repi`, and geometric impossibilities
  (`Rhyp < Repi`).
* `auditParsers(.x = M, owner = "NGAW", path = ...)` — dry-runs
  `parseRecord()` for each `(EventID, StationID)` of one owner and
  reports OK / FAIL / WARN with a reason.
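
The kind of checks listed above can be expressed as logical flag columns. The sketch below is an illustration, not the audit helpers themselves: the Vs30 bounds (0 to 3000 m/s) and the flag names are assumptions chosen for the example.

```r
library(data.table)

M <- data.table(
  RecordID    = c("r1", "r2", "r3"),
  StationVs30 = c(760, NA, -5),
  EventDepth  = c(10, -2, 30),
  Repi        = c(50, 20, 40),
  Rhyp        = c(51, 20.1, 35)
)

flags <- M[, .(
  RecordID,
  vs30_missing      = is.na(StationVs30),
  # Assumed plausible range for Vs30; the package's bounds may differ.
  vs30_out_of_range = !is.na(StationVs30) & (StationVs30 <= 0 | StationVs30 > 3000),
  depth_negative    = !is.na(EventDepth) & EventDepth < 0,
  rhyp_lt_repi      = !is.na(Rhyp) & !is.na(Repi) & Rhyp < Repi
)]

# Keep only the rows that trip at least one check.
flagged <- flags[vs30_missing | vs30_out_of_range | depth_negative | rhyp_lt_repi]
flagged
```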

## Maintenance

`archiveRawOwner(path)` compresses `raw.owner/` to
`raw.owner.tar.gz` after extraction has succeeded, verifies the
archive is readable, and only then unlinks the original.
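
The compress, verify, then unlink order can be sketched with base R's `utils::tar()` and `utils::untar()`. This is a minimal illustration run against a throwaway directory under `tempdir()`; `archive_dir_sketch()` is a hypothetical name and the verification rule (every original file must appear in the archive listing) is an assumption, not necessarily what `archiveRawOwner()` checks.

```r
archive_dir_sketch <- function(station_dir, name = "raw.owner") {
  old <- setwd(station_dir)
  on.exit(setwd(old), add = TRUE)
  tarfile <- paste0(name, ".tar.gz")
  utils::tar(tarfile, files = name, compression = "gzip", tar = "internal")
  # Verify before deleting: every original file must be listed in the archive.
  listed <- sub("^\\./", "", utils::untar(tarfile, list = TRUE, tar = "internal"))
  expected <- file.path(name, list.files(name, recursive = TRUE))
  if (!all(expected %in% listed)) stop("archive verification failed; nothing removed")
  unlink(name, recursive = TRUE)
  invisible(file.path(station_dir, tarfile))
}

# Demo on a temporary station directory.
station_dir <- file.path(tempdir(), "archive-demo")
dir.create(file.path(station_dir, "raw.owner"), recursive = TRUE, showWarnings = FALSE)
writeLines("{}", file.path(station_dir, "raw.owner", "record.json"))
archive_dir_sketch(station_dir)
list.files(station_dir)
```

Deleting only after a successful listing means a failed or truncated archive leaves the original directory untouched.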

## Notes

* The package does **not** download data. Placing raw provider files
  in `raw.owner/` is the user's responsibility. Examples under
  `examples/maintenance/` in the source repository show one ingestion
  pattern (USGS catalog matching, staging / promote / rollback,
  etc.).
* `RecordID` is a 16-character hex hash: the first 16 characters of
  `openssl::md5` applied to the WIDE CSV body. It is stable across
  re-extraction of the same record.
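
The stability property can be illustrated with a content hash of the written CSV. The sketch below uses base R's `tools::md5sum()` as a stand-in for `openssl::md5` to stay dependency-free, and hashes the bytes of a file written by `fwrite()`; exactly what counts as the "CSV body" in `extractRecord()` is an assumption here, and `record_id_sketch()` is a hypothetical name.

```r
library(data.table)

record_id_sketch <- function(wide) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp), add = TRUE)
  fwrite(wide, tmp)
  # MD5 of the file contents, truncated to the first 16 hex characters.
  substr(unname(tools::md5sum(tmp)), 1, 16)
}

wide <- data.table(t = c(0, 0.01, 0.02), HNE = c(0.1, -0.4, 0.2))
id1 <- record_id_sketch(wide)
id2 <- record_id_sketch(copy(wide))  # same content, written again
identical(id1, id2)
```

Because the identifier depends only on the content, writing the same table twice yields the same value, while any change to the samples yields a different one.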
