Data Quality Assessment

Matthias Templ1

Barbara Templ2

2026-03-04

Vignette 2 of 4 for the pep725 R package. The most current version is available on GitHub and CRAN.

Introduction

Data quality assessment is a critical but often overlooked step in phenological analysis. Before calculating trends, normals, or climate sensitivities, you need to understand what’s in your data and make informed decisions about unusual observations.

The Data Quality Challenge

Long-term phenological datasets such as PEP725 are compiled from many observers over decades. This richness enables large-scale analyses, but it also introduces data quality challenges that need to be assessed:

Issue Type Examples Consequence if Ignored
Recording errors Typographical errors (DOY 150 instead of 105), incorrect year Outliers bias statistics
Identification errors Confusion between morphologically similar species Inconsistent or mixed time series
Protocol changes Changes in the definition of phenological phases over time Artificial shifts or trends
Biological anomalies Irregular phenological events as a result of climate extremes May represent real biological signals

The following sections introduce practical tools in the pep725 package for diagnosing and handling these issues in a transparent and reproducible way.

What You’ll Learn

Section Topic Key Functions
Part 1 Detecting statistical outliers pep_flag_outliers(), pep_plot_outliers()
Part 2 Temporal coverage and completeness pep_completeness()
Part 3 Phase presence validation pep_check_phases()
Part 4 Integrated workflow Combining all approaches

Prerequisites

This vignette assumes you are familiar with the basics of loading and filtering PEP725 data, as covered in vignette("getting-started", package = "pep725").

Setup

library(pep725)
library(data.table)

# Download the synthetic dataset
pep <- pep_download()

# For this vignette, we'll focus on flowering phases
flowering <- pep[phase_id %in% c(60, 65)]
cat("Flowering observations:", nrow(flowering), "\n")
#> Flowering observations: 1310094
cat("Species:", length(unique(flowering$species)), "\n")
#> Species: 36
cat("Year range:", min(flowering$year), "-", max(flowering$year), "\n")
#> Year range: 1868 - 2025

Part 1: Outlier Detection

Why Detect Outliers?

An outlier is an observation that differs substantially from the expected pattern. In phenological data, outliers may indicate:

  1. Recording errors that should be excluded from analysis
    • A typo: DOY 250 instead of DOY 150 (roughly 3 months difference)
    • Wrong year entered: observation assigned to incorrect season
  2. Unusual weather events worth investigating
    • Very warm winter causing early flowering
    • Late frost delaying spring phenology
  3. Second flowering or other abnormal phenological events
    • Plants flowering again in autumn/winter after spring flowering
    • Increasingly observed with climate change

The challenge: You can’t simply delete all outliers. Some are errors, but others are scientifically valuable observations of unusual events.

Statistical Methods for Outlier Detection

The pep_flag_outliers() function provides four statistical methods for identifying potential outliers. Each method has different assumptions and strengths:

Method 1: 30-Day Rule (Simple Threshold)

The simplest approach: flag any observation more than 30 days from the median for that species/phase combination.

# Flag outliers using the default 30-day rule
outliers <- pep_flag_outliers(
  pep = flowering,
  method = "30day",
  by = c("species", "phase_id")
)

print(outliers)
#> Phenological Outlier Detection Results
#> ============================================= 
#> Method: 30day 
#> Threshold: 30 days 
#> Grouping: species, phase_id 
#> 
#> Total observations: 1,310,094
#> Outliers flagged: 26573 (2.03%)
#>   - Early outliers: 15011
#>   - Late outliers: 11562
#> 
#> Deviation summary for outliers:
#>   Min: -167.0 days
#>   Max: 223.0 days
#>   Mean |deviation|: 38.4 days

How it works:

                     Distribution of flowering dates

        ← Early                                    Late →

   ----|----[=======|=======]----|-----
        ^           ^           ^
     -30 days    median      +30 days

   Anything outside the brackets is flagged as an outlier

Understanding the Output

The function adds new columns to your original data:

Column What It Contains How to Use It
is_outlier TRUE/FALSE flag Filter data: data[is_outlier == FALSE]
deviation Days from expected Prioritize investigation: larger = more extreme
expected_doy Reference DOY (median) Understand what “normal” looks like

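Method 2: MAD (Median Absolute Deviation) - Robust

The MAD method measures distance from the group median in units of the median absolute deviation (the median of |value - median|, conventionally rescaled by 1.4826 so that thresholds are comparable to standard deviations; the exact scaling used internally by the package may differ). Because both the center and the spread are medians, a handful of bad values cannot distort the reference the way they distort a mean or standard deviation. This is the method used in the Part 4 workflow; as a sketch, the call mirrors the other methods:

```r
# MAD method: flag if more than 3 MADs from the group median (robust)
outliers_mad <- pep_flag_outliers(flowering, method = "mad", threshold = 3)
```
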
Method 3: IQR (Interquartile Range) - Also Robust

The IQR method uses quartiles and is familiar from boxplot “whiskers”:

# IQR method: flag if outside 1.5 * IQR (standard boxplot rule)
outliers_iqr <- pep_flag_outliers(flowering, method = "iqr", threshold = 1.5)

How it works:

  • Q1 = 25th percentile, Q3 = 75th percentile
  • IQR = Q3 - Q1
  • Outlier if: value < Q1 - 1.5×IQR OR value > Q3 + 1.5×IQR
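
The fences can be computed directly with base R's quantile(); the DOY values below are illustrative, not package data:

```r
# Boxplot-style fences on a toy vector of flowering DOYs
doy <- c(110:129, 200)                 # 20 plausible dates plus one gross error
q   <- quantile(doy, c(0.25, 0.75))
iqr <- q[2] - q[1]
fences <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)
doy[doy < fences[1] | doy > fences[2]]   # only the gross error (200) is flagged
```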

Method 4: Z-Score - Sensitive but Less Robust

The z-score method assumes normally distributed data:

# Z-score method: flag if |z| > 3 (assumes normal distribution)
outliers_zscore <- pep_flag_outliers(flowering, method = "zscore", threshold = 3)

Caution: The z-score method is sensitive to existing outliers. If your data has many outliers, they will inflate the standard deviation, making the z-score threshold less effective at detecting them.
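
The inflation effect is easy to demonstrate with a toy series (illustrative numbers, not package code): a few repeated gross errors blow up the standard deviation enough that their own z-scores fall under the usual cutoff, while a median/MAD version still flags them.

```r
# 20 plausible DOYs plus three repeated gross errors
x  <- c(110:129, 200, 200, 200)
z  <- (x - mean(x)) / sd(x)         # classical z-score
max(abs(z))                         # about 2.5: the errors escape a |z| > 3 cutoff
rz <- (x - median(x)) / mad(x)      # robust analogue using median and MAD
max(abs(rz))                        # about 8.9: clearly flagged
```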

Choosing a Method

Method Best For Threshold Meaning
30day Simple, interpretable threshold Absolute days from median
mad Most situations (recommended) Number of MADs from median
iqr Familiar boxplot-style detection IQR multiplier
zscore Clean data, normal distribution Standard deviations

Summary Statistics

summary(outliers)
#> Phenological Outlier Summary
#> ============================================= 
#> 
#> Total observations: 1,310,094
#> Total outliers: 26573 (2.03%)
#> 
#> Groups with most outliers:
#>               species phase_id  n_obs n_outliers pct_outliers mean_dev
#>                <fctr>    <int>  <int>      <int>        <num>    <num>
#>  1: Solanum tuberosum       60 114245       6420         5.62     12.0
#>  2:  Prunus domestica       60 114641       2958         2.58     10.5
#>  3:  Prunus domestica       65  77393       2146         2.77     10.8
#>  4:   Malus domestica       60 205403       1957         0.95      9.1
#>  5:   Malus domestica       65 175706       1770         1.01      9.1
#>  6:    Prunus persica       60  48074       1515         3.15     11.0
#>  7:    Prunus cerasus       60 136588       1227         0.90      9.0
#>  8:    Prunus cerasus       65 112597       1042         0.93      9.0
#>  9:  Prunus armeniaca       60  22692        943         4.16     12.1
#> 10:    Pyrus communis       60  50335        940         1.87      9.5
#> 
#> Deviation distribution (all data):
#>  1%  5% 25% 50% 75% 95% 99% 
#> -31 -21  -8   0   8  20  30

Comparing Outlier Rates Across Species

Different species may have different outlier rates due to observation difficulty or data quality:

# Check outliers for grapevine
vine_flowering <- pep[species == "Vitis vinifera" & phase_id %in% c(60, 65)]

if (nrow(vine_flowering) > 0) {
  outliers_vine <- pep_flag_outliers(
    pep = vine_flowering,
    method = "30day",
    by = c("species", "phase_id")
  )

  cat("Outlier comparison:\n")
  cat("All flowering species: ", round(100 * mean(outliers$is_outlier), 2), "%\n")
  cat("Grapevine only:        ", round(100 * mean(outliers_vine$is_outlier), 2), "%\n")
} else {
  cat("No grapevine flowering data available for comparison.\n")
}
#> Outlier comparison:
#> All flowering species:  2.03 %
#> Grapevine only:         3.68 %

Visualizing Outliers

Visual inspection is essential for understanding outlier patterns. The pep_plot_outliers() function provides four visualization types:

1. Overview Plot

Shows the big picture: how many outliers, which species/phases are affected:

# Overview of outlier patterns
pep_plot_outliers(outliers, type = "overview")
#> Warning: Removed 16 rows containing non-finite outside the scale range
#> (`stat_bin()`).

What to look for:

  • Which species have the most outliers? (Might indicate data quality issues)
  • Are outliers concentrated in certain phases? (Might indicate definition problems)
  • What’s the overall outlier rate? (Typical: 1-5% for well-curated data)

2. Seasonal Distribution

When in the year do outliers occur?

# When do outliers occur in the year?
pep_plot_outliers(outliers, type = "seasonal")

Interpretation guide:

Outlier Timing Likely Explanation
Very early (DOY < 50) Possible data errors (flowering in January/February unlikely for most species)
Somewhat early Warm winters or Mediterranean locations
Somewhat late Cool springs or high-altitude stations
Very late (DOY > 250) Possible second flowering events - investigate!

3. Detailed Context

See outliers alongside normal observations:

# See outliers in context of all observations
pep_plot_outliers(outliers, type = "detail", n_top = 15)

This plot shows:

  • Full distribution of observations (gray)
  • Flagged outliers (highlighted)
  • The n_top parameter controls how many species/phases to show

4. Geographic Distribution

Where are outliers located?

# Where are outliers located?
pep_plot_outliers(outliers, type = "map")

What to look for:

  • Clustered outliers in one region? Might indicate local data quality issues
  • Outliers at network edges? Might be at environmental limits
  • Random scatter? Suggests individual observation errors


Part 2: Data Completeness

Why Check Completeness?

Data completeness refers to how many years of observations exist for each station/phase combination. This matters because:

  1. Trend analysis requires continuous data: Gaps can bias trend estimates
  2. Normals require representative coverage: WMO recommends 24+ years in a 30-year period
  3. Missing data isn’t random: Stations often drop out or are added systematically

Visualizing Completeness Issues

Station A: ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●  <- 100% complete
           1990                      2020

Station B: ●●●●●●●●●●○○○○○○○○○○●●●●●●●●●●  <- Gap in middle
           1990                      2020

Station C: ○○○○○○○○○○○○○○○○○○○○●●●●●●●●●●  <- Only recent data
           1990                      2020

Station D: ●●●●●●●●●●●●●●●●●●●●○○○○○○○○○○  <- Discontinued
           1990                      2020

Different completeness patterns have different implications:

Pattern Implications for Analysis
Full coverage Ideal - reliable trends and normals
Gap in middle Careful - may miss important changes
Only recent Cannot compare to historical period
Discontinued Cannot assess recent changes
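
A quick way to distinguish these patterns for a given station is to look at the gaps between its sorted observation years; a single large gap behaves very differently from many one-year gaps. A base-R sketch (hypothetical station years, not package output):

```r
# Longest gap for a hypothetical station resembling "Station B" above
years <- c(1990:1999, 2010:2020)            # data before and after a gap
gaps  <- diff(sort(unique(years))) - 1      # missing years between observations
max(gaps)                                   # 10 consecutive missing years
```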

Assessing Completeness

# Check completeness by station and phase
# Use year_range to focus on a specific period
completeness <- pep_completeness(
  pep = flowering,
  by = c("s_id", "phase_id"),
  year_range = c(1990, 2020)
)

print(completeness)
#> Phenological Data Completeness Assessment
#> -------------------------------------------------- 
#> Grouping by: s_id, phase_id
#> Year range: 1990-2020
#> Total groups: 8390
#> Mean completeness: 95.1%
#> Groups with >80% completeness: 7686 (91.6%)
#> -------------------------------------------------- 
#> 
#>       s_id phase_id n_obs n_stations n_years year_min year_max year_span
#>     <fctr>    <int> <int>      <int>   <int>    <int>    <int>     <int>
#>  1:      1       60    33          1      10     1990     1999        10
#>  2:      1       65    25          1       8     1990     1998         9
#>  3:      3       60    30          1      17     1990     2006        17
#>  4:      3       65    27          1      16     1991     2006        16
#>  5:      4       60    62          1      26     1990     2015        26
#>  6:      4       65    55          1      26     1990     2015        26
#>  7:      5       60    10          1       6     1990     1995         6
#>  8:      5       65     7          1       4     1991     1994         4
#>  9:      9       60     6          1       3     1990     1992         3
#> 10:      9       65     4          1       3     1990     1992         3
#> 11:     11       60    16          1       7     1990     1996         7
#> 12:     11       65    15          1       7     1990     1996         7
#> 13:     13       60     3          1       3     1990     1995         6
#> 14:     13       65     4          1       4     1993     1998         6
#> 15:     15       60    27          1      10     1994     2012        19
#>     completeness_pct median_doy iqr_doy
#>                <num>      <num>   <num>
#>  1:            100.0      129.0    13.0
#>  2:             88.9      137.0    12.0
#>  3:            100.0      116.0    13.8
#>  4:            100.0      125.0     8.0
#>  5:            100.0      119.0    17.0
#>  6:            100.0      123.0    12.5
#>  7:            100.0      116.0    13.5
#>  8:            100.0      129.0     6.5
#>  9:            100.0      121.5    21.0
#> 10:            100.0      126.0    22.5
#> 11:            100.0      121.5    10.0
#> 12:            100.0      134.0     7.0
#> 13:             50.0      134.0    11.0
#> 14:             66.7      133.0     2.8
#> 15:             52.6      123.0    11.0
#> 
#> ... and 8375 more rows

Understanding Completeness Metrics

The function calculates several metrics:

Metric What It Measures Interpretation
n_years Total years with data More = better
completeness_pct % of year span covered 50%+ for trends, 80%+ for normals
year_span Total span (year_max - year_min + 1) Context for completeness
year_min Earliest observation Needed for historical comparison
year_max Most recent observation Needed for current status
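
The relationship between these metrics can be reproduced by hand. Using the values from row 13 of the printout above (s_id 13, phase 60), completeness_pct appears to be 100 × n_years / year_span; this is an inferred relationship, consistent with the printed rows:

```r
# Row 13 of the completeness output: n_years = 3, years 1990-1995
n_years   <- 3
year_min  <- 1990
year_max  <- 1995
year_span <- year_max - year_min + 1        # 6
100 * n_years / year_span                   # 50, matching completeness_pct
```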

Summary Statistics

summary(completeness)
#> Phenological Data Completeness Summary
#> ================================================== 
#> 
#> Total groups: 8390
#> Total observations: 441,295
#> Total unique stations: 8,390
#> 
#> Completeness Distribution:
#>   0-20%:   33 (  0.4%) 
#>   20-40%:   99 (  1.2%) 
#>   40-60%:  224 (  2.7%) *
#>   60-80%:  413 (  4.9%) *
#>   80-100%: 7621 ( 90.8%) ******************
#> 
#> Year Coverage Statistics:
#>   Mean year span: 14.1 years
#>   Mean years with data: 13.2
#>   Earliest data: 1990
#>   Latest data: 2020
#> 
#> Observation Statistics:
#>   Mean observations per group: 52.6
#>   Median observations per group: 38
#>   Max observations: 346
#> 
#> Phenology Statistics:
#>   DOY range: 6 - 304

Filtering by Completeness

Use completeness information to select stations appropriate for your analysis:

# Get stations with good coverage (>= 70%)
good_coverage <- completeness[completeness_pct >= 70]
cat("Stations with >= 70% coverage:", nrow(good_coverage), "\n")
#> Stations with >= 70% coverage: 7893

# Use these for trend analysis
good_stations <- unique(good_coverage$s_id)
flowering_complete <- flowering[s_id %in% good_stations]
cat("Observations from complete stations:", nrow(flowering_complete), "\n")
#> Observations from complete stations: 935551

Completeness Thresholds for Different Analyses

Analysis Type Minimum Completeness Reasoning
Trend detection 50% (15+ years) Need enough points for trend fitting
Climate sensitivity 60% Need variability across climate conditions
30-year normals 80% (24+ years) WMO standard requirement
Station comparison Same time period Avoid bias from different eras

Visualizing Completeness

plot(completeness)


Part 3: Phase Presence Validation

Why Check Phase Presence?

Before starting an analysis, it’s important to verify that your data contains the phenological phases you need. The pep_check_phases() function validates that expected BBCH phase codes are present in your data.

Common questions this helps answer:

Checking Phase Presence

# Check if expected phases are present for apple
apple <- pep[species == "Malus domestica"]

phase_check <- pep_check_phases(
  pep = apple,
  expected = c(60, 65, 87)  # flowering, full flowering, fruit maturity
)

print(phase_check)
#> Phenological Phase Availability Check
#> --------------------------------------------- 
#> 
#> Expected phases: 60, 65, 87
#> Status: COMPLETE - all expected phases present
#> 
#> Phases found:
#>   Phase   0: 1 observations [extra]
#>   Phase   7: 33 observations [extra]
#>   Phase  10: 1,394 observations [extra]
#>   Phase  11: 700 observations [extra]
#>   Phase  15: 159 observations [extra]
#>   Phase  57: 26 observations [extra]
#>   Phase  59: 25 observations [extra]
#>   Phase  60: 205,403 observations
#>   Phase  61: 1,055 observations [extra]
#>   Phase  63: 80 observations [extra]
#>   Phase  65: 175,706 observations
#>   Phase  67: 29 observations [extra]
#>   Phase  69: 170,323 observations [extra]
#>   Phase  81: 151 observations [extra]
#>   Phase  85: 93 observations [extra]
#>   Phase  87: 253,169 observations
#>   Phase  89: 17 observations [extra]
#>   Phase  91: 80 observations [extra]
#>   Phase  93: 617 observations [extra]
#>   Phase  95: 76,453 observations [extra]
#>   Phase  97: 2,470 observations [extra]
#>   Phase 100: 225 observations [extra]
#>   Phase 200: 3 observations [extra]
#>   Phase 201: 517 observations [extra]
#>   Phase 203: 1 observations [extra]
#>   Phase 205: 7,912 observations [extra]
#>   Phase 209: 2,358 observations [extra]
#>   Phase 213: 1 observations [extra]
#>   Phase 380: 208 observations [extra]
#>   Phase 381: 7 observations [extra]
#>   Phase 385: 95 observations [extra]
#> 
#> Detailed coverage:
#>     phase_id  n_obs n_stations n_years year_min year_max
#>        <int>  <int>      <int>   <int>    <int>    <int>
#>  1:        0      1          1       1     2022     2022
#>  2:        7     33          5      11     2014     2025
#>  3:       10   1394        258      26     2000     2025
#>  4:       11    700        157      65     1961     2025
#>  5:       15    159         53      14     2009     2025
#>  6:       57     26          1       7     1942     1949
#>  7:       59     25          4      14     1949     2025
#>  8:       60 205403       7947     100     1926     2025
#>  9:       61   1055        192      69     1936     2025
#> 10:       63     80          5      39     1986     2025
#> 11:       65 175706       7272     125     1896     2025
#> 12:       67     29          5      10     2016     2025
#> 13:       69 170323       7192      86     1937     2025
#> 14:       81    151          5      54     1952     2021
#> 15:       85     93         19       8     2013     2020
#> 16:       87 253169       7504      86     1928     2024
#> 17:       89     17          5      10     2000     2024
#> 18:       91     80         26      13     2013     2025
#> 19:       93    617         76      55     1970     2024
#> 20:       95  76453       4120      78     1944     2024
#> 21:       97   2470        368      32     1966     2024
#> 22:      100    225         18      64     1951     2024
#> 23:      200      3          2       2     2016     2017
#> 24:      201    517         64      55     1970     2024
#> 25:      203      1          1       1     2016     2016
#> 26:      205   7912        779      79     1943     2025
#> 27:      209   2358        357      25     1966     2014
#> 28:      213      1          1       1     2012     2012
#> 29:      380    208         10      63     1961     2024
#> 30:      381      7          4       2     2023     2024
#> 31:      385     95         37      13     2013     2025
#>     phase_id  n_obs n_stations n_years year_min year_max
#>        <int>  <int>      <int>   <int>    <int>    <int>

Understanding the Output

The function returns information about phase coverage:

Output What It Shows
expected The phases you asked for
present Which expected phases were found
missing Which expected phases are absent
complete TRUE if all expected phases are present
n_obs Number of observations per phase

Checking Multiple Species

To check phase presence across multiple species at once:

# Check phases for multiple species
multi_check <- pep_check_phases_multi(
  pep = pep,
  species_list = c("Malus domestica", "Vitis vinifera"),
  expected = c(60, 65, 87)
)

print(multi_check)
#> Multi-Species Phase Availability Check
#> ================================================== 
#> Expected phases: 60, 65, 87
#> 
#> Species with all phases: 2 / 2 (100%)
#> -------------------------------------------------- 
#> 
#>            species complete n_expected n_present n_missing missing_phases
#>             <char>   <lgcl>      <int>     <int>     <int>         <char>
#> 1: Malus domestica     TRUE          3         3         0           <NA>
#> 2:  Vitis vinifera     TRUE          3         3         0           <NA>
#>    total_obs
#>        <int>
#> 1:    899311
#> 2:    146667

Common Phases to Check

Plant Type Common Phases BBCH Codes
Cereals Heading, Flowering, Harvest 60, 65, 100
Fruit trees Flowering, Full flowering, Fruit maturity 60, 65, 87
Deciduous trees Leaf unfolding, Flowering, Leaf fall 11, 60, 95
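
As a sketch, checking the deciduous-tree phases from the table would look like the following. The species name is hypothetical; substitute any deciduous species present in your download:

```r
# Hypothetical species; the guard avoids errors if it is absent from the data
birch <- pep[species == "Betula pendula"]
if (nrow(birch) > 0) {
  pep_check_phases(birch, expected = c(11, 60, 95))
} else {
  cat("No Betula pendula data available.\n")
}
```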

Part 4: Integrated Quality Workflow

Putting It All Together

Here’s a recommended workflow for data quality assessment that combines all the tools covered in this vignette:

# ══════════════════════════════════════════════════════════════════════════════
# STEP 1: Assess temporal completeness
# ══════════════════════════════════════════════════════════════════════════════
# Why: Incomplete stations can bias trend estimates and normals

completeness <- pep_completeness(flowering, by = c("s_id", "phase_id"))
good_stations <- completeness[completeness_pct >= 50, s_id]
fl_filtered <- flowering[s_id %in% good_stations]

cat("Kept", length(good_stations), "stations with >= 50% completeness\n")
#> Kept 15213 stations with >= 50% completeness

# ══════════════════════════════════════════════════════════════════════════════
# STEP 2: Flag statistical outliers
# ══════════════════════════════════════════════════════════════════════════════
# Why: Identify observations that deviate from expected patterns

outliers_wf <- pep_flag_outliers(fl_filtered, method = "mad", threshold = 3)

cat("Flagged", sum(outliers_wf$is_outlier), "outliers",
    "(", round(100 * mean(outliers_wf$is_outlier), 1), "% )\n")
#> Flagged 20449 outliers ( 1.6 % )

# ══════════════════════════════════════════════════════════════════════════════
# STEP 3: Make informed decisions about exclusion
# ══════════════════════════════════════════════════════════════════════════════
# Key principle: Document your decisions!

# Option A: Strict cleaning (for normals calculation)
fl_strict <- outliers_wf[is_outlier == FALSE]

# Option B: Moderate cleaning (keep moderate outliers)
fl_moderate <- outliers_wf[is_outlier == FALSE | abs(deviation) < 60]

cat("Strict cleaning keeps:", nrow(fl_strict), "obs\n")
#> Strict cleaning keeps: 1252610 obs
cat("Moderate cleaning keeps:", nrow(fl_moderate), "obs\n")
#> Moderate cleaning keeps: 1272698 obs

# ══════════════════════════════════════════════════════════════════════════════
# STEP 4: Proceed with analysis on cleaned data
# ══════════════════════════════════════════════════════════════════════════════

normals <- pheno_normals(fl_moderate, period = 1991:2020, min_years = 5)
#> Note: 44 group(s) have fewer than 5 years of data and return NA values.
cat("Normals calculated for", sum(!is.na(normals$mean_doy)), "groups\n")
#> Normals calculated for 202 groups

Documentation Template

When publishing research, document your quality control decisions:

Data Quality Control:
1. Completeness filtering: Excluded stations with < 50% temporal coverage
   (reduced dataset from N to M stations)

2. Outlier detection: Used MAD method with threshold = 3
   (flagged X observations as outliers, Y% of total)

3. Outlier disposition:
   - Excluded: Z observations with deviations > 60 days (likely errors)
   - Retained: W observations identified as abnormal events (e.g., second flowering)

4. Phase validation: Checked presence of required BBCH phases in [species list]
   (identified V cases for manual review)

Best Practices Summary

For Outlier Detection

  1. Use robust methods: MAD or IQR methods handle multiple outliers better than z-scores. The MAD method is generally recommended.

  2. Check seasonally: Late-season outliers (DOY > 250) may be biologically meaningful events (e.g. second flowering) rather than errors.

  3. Verify extremes: For observations with very large deviations (> 60 days), try to check original data sources or contact data providers.

  4. Document decisions: Record which outliers you exclude and why. This is essential for reproducibility and peer review.

  5. Don’t delete automatically: Human judgment is needed to distinguish errors from genuine unusual events.

For Abnormal Event Detection

  1. Consider species biology: Some species (e.g., certain Rosaceae) are more prone to second flowering than others.

  2. Check geographic patterns: If multiple nearby stations report late events in the same year, this suggests a real regional phenomenon.

  3. Look at weather context: Link abnormal events to weather extremes - drought is a common trigger.

  4. Distinguish from errors: Very isolated observations (single station, single year) need verification before treating them as genuine events such as second flowering.

  5. Keep for separate analysis: Abnormal phenological events are scientifically interesting (e.g. as indicators of climate change impacts) - don’t just delete them!

For Completeness Assessment

  1. Set appropriate thresholds: Use 80%+ coverage for calculating normals, 50%+ for trend analysis.

  2. Consider gap patterns: Many small gaps are usually better than one long gap that might coincide with important climate changes.

  3. Check temporal bias: Ensure you’re not comparing stations with data only from early periods to stations with only recent data.

  4. Document station selection: Report how many stations met your completeness criteria and how this affected your sample.

For Phase Presence Checking

  1. Check early: Run pep_check_phases() at the start of your analysis to verify required phases are available in your data.

  2. Know required phases: Different analyses need different phases - know which BBCH codes you need before starting.

  3. Check across species: Use pep_check_phases_multi() to verify data availability for all species you plan to analyze.


Summary

This vignette covered data quality tools for phenological analysis:

Function Purpose Key Output
pep_flag_outliers() Identify unusual observations is_outlier flag, deviation in days
pep_plot_outliers() Visualize outlier patterns Four plot types: overview, seasonal, detail, map
pep_completeness() Assess temporal coverage Completeness %, gaps, year range
pep_check_phases() Check phase presence Missing phases, observation counts

Key Take-Home Messages

  1. Data quality assessment is not optional - it’s a critical step that affects the validity of all downstream analyses.

  2. Not all outliers are errors - some represent genuine biological phenomena like second flowering that deserve further study.

  3. Document your decisions - quality control choices should be transparent and reproducible.

  4. Use robust methods - the MAD method is recommended for phenological data because it handles multiple outliers well.

  5. Visual inspection matters - plots often reveal patterns that summary statistics miss.

Next Steps

Explore the other vignettes for complementary analyses:

vignette("getting-started", package = "pep725")
vignette("phenological-analysis", package = "pep725")
vignette("spatial-patterns", package = "pep725")

Session Info

sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Tahoe 26.0.1
#> 
#> Matrix products: default
#> BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Europe/Zurich
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pep725_1.0.0        ggplot2_4.0.1       data.table_1.18.2.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6        jsonlite_2.0.0      dplyr_1.2.0        
#>  [4] compiler_4.5.2      Rcpp_1.1.1          tidyselect_1.2.1   
#>  [7] tidyr_1.3.2         jquerylib_0.1.4     scales_1.4.0       
#> [10] yaml_2.3.12         fastmap_1.2.0       R6_2.6.1           
#> [13] labeling_0.4.3      generics_0.1.4      patchwork_1.3.2    
#> [16] classInt_0.4-11     robustbase_0.99-7   sf_1.0-24          
#> [19] knitr_1.51          tibble_3.3.1        units_1.0-0        
#> [22] DBI_1.2.3           bslib_0.10.0        pillar_1.11.1      
#> [25] RColorBrewer_1.1-3  rlang_1.1.7         cachem_1.1.0       
#> [28] xfun_0.56           sass_0.4.10         S7_0.2.1           
#> [31] otel_0.2.0          rnaturalearth_1.2.0 viridisLite_0.4.2  
#> [34] cli_3.6.5           withr_3.0.2         magrittr_2.0.4     
#> [37] class_7.3-23        digest_0.6.39       grid_4.5.2         
#> [40] rstudioapi_0.18.0   lifecycle_1.0.5     DEoptimR_1.1-4     
#> [43] vctrs_0.7.1         KernSmooth_2.23-26  proxy_0.4-29       
#> [46] evaluate_1.0.5      glue_1.8.0          farver_2.1.2       
#> [49] e1071_1.7-17        rmarkdown_2.30      purrr_1.2.1        
#> [52] tools_4.5.2         pkgconfig_2.0.3     htmltools_0.5.9

  1. University of Applied Sciences Northwestern Switzerland (FHNW)↩︎

  2. Swiss Federal Institute for Forest, Snow and Landscape Research (WSL)↩︎