Data Quality Assessment

Matthias Templ1

Barbara Templ2

2026-03-04

Vignette 2 of 4 for the pep725 R package. The most current version is available on GitHub and CRAN.

Introduction

Data quality assessment is a critical but often overlooked step in phenological analysis. Before calculating trends, normals, or climate sensitivities, you need to understand what’s in your data and make informed decisions about unusual observations.

The Data Quality Challenge

Long-term phenological datasets such as PEP725 are compiled from many observers over decades. This richness enables large-scale analyses, but it also introduces data quality challenges that need to be assessed:

Issue Type Examples Consequence if Ignored
Recording errors Typographical errors (DOY 150 instead of 105), incorrect year Outliers bias statistics
Identification errors Confusion between morphologically similar species Inconsistent or mixed time series
Protocol changes Changes in the definition of phenological phases over time Artificial shifts or trends
Biological anomalies Irregular phenological events as a result of climate extremes May represent real biological signals

The following sections introduce practical tools in the pep725 package for diagnosing and handling these issues in a transparent and reproducible way.

What You’ll Learn

Section Topic Key Functions
Part 1 Detecting statistical outliers pep_flag_outliers(), pep_plot_outliers()
Part 2 Temporal coverage and completeness pep_completeness()
Part 3 Phase presence validation pep_check_phases()
Part 4 Integrated workflow Combining all approaches

Prerequisites

This vignette assumes you are familiar with the basics of loading and filtering PEP725 data, as covered in vignette("getting-started", package = "pep725").

Setup

library(pep725)
library(data.table)

# Download the synthetic dataset
pep <- pep_download()

# For this vignette, we'll focus on flowering phases
flowering <- pep[phase_id %in% c(60, 65)]
cat("Flowering observations:", nrow(flowering), "\n")
#> Flowering observations: 1310094
cat("Species:", length(unique(flowering$species)), "\n")
#> Species: 36
cat("Year range:", min(flowering$year), "-", max(flowering$year), "\n")
#> Year range: 1868 - 2025

Part 1: Outlier Detection

Why Detect Outliers?

An outlier is an observation that differs substantially from the expected pattern. In phenological data, outliers may indicate:

  1. Recording errors that should be excluded from analysis
    • A typo: DOY 250 instead of DOY 150 (roughly 3 months difference)
    • Wrong year entered: observation assigned to incorrect season
  2. Unusual weather events worth investigating
    • Very warm winter causing early flowering
    • Late frost delaying spring phenology
  3. Second flowering or other abnormal phenological events
    • Plants flowering again in autumn/winter after spring flowering
    • Increasingly observed with climate change

The challenge: You can’t simply delete all outliers. Some are errors, but others are scientifically valuable observations of unusual events.

Statistical Methods for Outlier Detection

The pep_flag_outliers() function provides four statistical methods for identifying potential outliers. Each method has different assumptions and strengths:

Method 1: 30-Day Rule (Simple Threshold)

The simplest approach: flag any observation more than 30 days from the median for that species/phase combination.

# Flag outliers using the default 30-day rule
outliers <- pep_flag_outliers(
  pep = flowering,
  method = "30day",
  by = c("species", "phase_id")
)

print(outliers)
#> Phenological Outlier Detection Results
#> ============================================= 
#> Method: 30day 
#> Threshold: 30 days 
#> Grouping: species, phase_id 
#> 
#> Total observations: 1,310,094
#> Outliers flagged: 26573 (2.03%)
#>   - Early outliers: 15011
#>   - Late outliers: 11562
#> 
#> Deviation summary for outliers:
#>   Min: -167.0 days
#>   Max: 223.0 days
#>   Mean |deviation|: 38.4 days

How it works:

                     Distribution of flowering dates

        ← Early                                    Late →

   ----|----[=======|=======]----|-----
        ^           ^           ^
     -30 days    median      +30 days

   Anything outside the brackets is flagged as an outlier

Understanding the Output

The function adds new columns to your original data:

Column What It Contains How to Use It
is_outlier TRUE/FALSE flag Filter data: data[is_outlier == FALSE]
deviation Days from expected Prioritize investigation: larger = more extreme
expected_doy Reference DOY (median) Understand what “normal” looks like

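Method 2: MAD (Median Absolute Deviation) - Robust

The MAD method measures distance from the group median in units of the median absolute deviation (the median of |value - median|, conventionally rescaled by 1.4826 so that thresholds are comparable to standard deviations; the exact scaling used internally by the package may differ). Because both the center and the spread are medians, a handful of bad values cannot distort the reference the way they distort a mean or standard deviation. This is the method used in the Part 4 workflow; as a sketch, the call mirrors the other methods:

```r
# MAD method: flag if more than 3 MADs from the group median (robust)
outliers_mad <- pep_flag_outliers(flowering, method = "mad", threshold = 3)
```
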
Method 3: IQR (Interquartile Range) - Also Robust

The IQR method uses quartiles and is familiar from boxplot “whiskers”:

# IQR method: flag if outside 1.5 * IQR (standard boxplot rule)
outliers_iqr <- pep_flag_outliers(flowering, method = "iqr", threshold = 1.5)

How it works:

  • Q1 = 25th percentile, Q3 = 75th percentile
  • IQR = Q3 - Q1
  • Outlier if: value < Q1 - 1.5×IQR OR value > Q3 + 1.5×IQR
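
The fences can be computed directly with base R's quantile(); the DOY values below are illustrative, not package data:

```r
# Boxplot-style fences on a toy vector of flowering DOYs
doy <- c(110:129, 200)                 # 20 plausible dates plus one gross error
q   <- quantile(doy, c(0.25, 0.75))
iqr <- q[2] - q[1]
fences <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)
doy[doy < fences[1] | doy > fences[2]]   # only the gross error (200) is flagged
```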

Method 4: Z-Score - Sensitive but Less Robust

The z-score method assumes normally distributed data:

# Z-score method: flag if |z| > 3 (assumes normal distribution)
outliers_zscore <- pep_flag_outliers(flowering, method = "zscore", threshold = 3)

Caution: The z-score method is sensitive to existing outliers. If your data has many outliers, they will inflate the standard deviation, making the z-score threshold less effective at detecting them.
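
The inflation effect is easy to demonstrate with a toy series (illustrative numbers, not package code): a few repeated gross errors blow up the standard deviation enough that their own z-scores fall under the usual cutoff, while a median/MAD version still flags them.

```r
# 20 plausible DOYs plus three repeated gross errors
x  <- c(110:129, 200, 200, 200)
z  <- (x - mean(x)) / sd(x)         # classical z-score
max(abs(z))                         # about 2.5: the errors escape a |z| > 3 cutoff
rz <- (x - median(x)) / mad(x)      # robust analogue using median and MAD
max(abs(rz))                        # about 8.9: clearly flagged
```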

Choosing a Method

Method Best For Threshold Meaning
30day Simple, interpretable threshold Absolute days from median
mad Most situations (recommended) Number of MADs from median
iqr Familiar boxplot-style detection IQR multiplier
zscore Clean data, normal distribution Standard deviations

Summary Statistics

summary(outliers)
#> Phenological Outlier Summary
#> ============================================= 
#> 
#> Total observations: 1,310,094
#> Total outliers: 26573 (2.03%)
#> 
#> Groups with most outliers:
#>               species phase_id  n_obs n_outliers pct_outliers mean_dev
#>                <fctr>    <int>  <int>      <int>        <num>    <num>
#>  1: Solanum tuberosum       60 114245       6420         5.62     12.0
#>  2:  Prunus domestica       60 114641       2958         2.58     10.5
#>  3:  Prunus domestica       65  77393       2146         2.77     10.8
#>  4:   Malus domestica       60 205403       1957         0.95      9.1
#>  5:   Malus domestica       65 175706       1770         1.01      9.1
#>  6:    Prunus persica       60  48074       1515         3.15     11.0
#>  7:    Prunus cerasus       60 136588       1227         0.90      9.0
#>  8:    Prunus cerasus       65 112597       1042         0.93      9.0
#>  9:  Prunus armeniaca       60  22692        943         4.16     12.1
#> 10:    Pyrus communis       60  50335        940         1.87      9.5
#> 
#> Deviation distribution (all data):
#>  1%  5% 25% 50% 75% 95% 99% 
#> -31 -21  -8   0   8  20  30

Comparing Outlier Rates Across Species

Different species may have different outlier rates due to observation difficulty or data quality:

# Check outliers for grapevine
vine_flowering <- pep[species == "Vitis vinifera" & phase_id %in% c(60, 65)]

if (nrow(vine_flowering) > 0) {
  outliers_vine <- pep_flag_outliers(
    pep = vine_flowering,
    method = "30day",
    by = c("species", "phase_id")
  )

  cat("Outlier comparison:\n")
  cat("All flowering species: ", round(100 * mean(outliers$is_outlier), 2), "%\n")
  cat("Grapevine only:        ", round(100 * mean(outliers_vine$is_outlier), 2), "%\n")
} else {
  cat("No grapevine flowering data available for comparison.\n")
}
#> Outlier comparison:
#> All flowering species:  2.03 %
#> Grapevine only:         3.68 %

Visualizing Outliers

Visual inspection is essential for understanding outlier patterns. The pep_plot_outliers() function provides four visualization types:

1. Overview Plot

Shows the big picture: how many outliers, which species/phases are affected:

# Overview of outlier patterns
pep_plot_outliers(outliers, type = "overview")
#> Warning: Removed 16 rows containing non-finite outside the scale range
#> (`stat_bin()`).

What to look for:

  • Which species have the most outliers? (Might indicate data quality issues)
  • Are outliers concentrated in certain phases? (Might indicate definition problems)
  • What’s the overall outlier rate? (Typical: 1-5% for well-curated data)

2. Seasonal Distribution

When in the year do outliers occur?

# When do outliers occur in the year?
pep_plot_outliers(outliers, type = "seasonal")

Interpretation guide:

Outlier Timing Likely Explanation
Very early (DOY < 50) Possible data errors (flowering in January/February unlikely for most species)
Somewhat early Warm winters or Mediterranean locations
Somewhat late Cool springs or high-altitude stations
Very late (DOY > 250) Possible second flowering events - investigate!

3. Detailed Context

See outliers alongside normal observations:

# See outliers in context of all observations
pep_plot_outliers(outliers, type = "detail", n_top = 15)

This plot shows:

  • Full distribution of observations (gray)
  • Flagged outliers (highlighted)
  • The n_top parameter controls how many species/phases to show

4. Geographic Distribution

Where are outliers located?

# Where are outliers located?
pep_plot_outliers(outliers, type = "map")

What to look for:

  • Clustered outliers in one region? Might indicate local data quality issues
  • Outliers at network edges? Might be at environmental limits
  • Random scatter? Suggests individual observation errors


Part 2: Data Completeness

Why Check Completeness?

Data completeness refers to how many years of observations exist for each station/phase combination. This matters because:

  1. Trend analysis requires continuous data: Gaps can bias trend estimates
  2. Normals require representative coverage: WMO recommends 24+ years in a 30-year period
  3. Missing data isn’t random: Stations often drop out or are added systematically

Visualizing Completeness Issues

Station A: ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●  <- 100% complete
           1990                      2020

Station B: ●●●●●●●●●●○○○○○○○○○○●●●●●●●●●●  <- Gap in middle
           1990                      2020

Station C: ○○○○○○○○○○○○○○○○○○○○●●●●●●●●●●  <- Only recent data
           1990                      2020

Station D: ●●●●●●●●●●●●●●●●●●●●○○○○○○○○○○  <- Discontinued
           1990                      2020

Different completeness patterns have different implications:

Pattern Implications for Analysis
Full coverage Ideal - reliable trends and normals
Gap in middle Careful - may miss important changes
Only recent Cannot compare to historical period
Discontinued Cannot assess recent changes
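
A quick way to distinguish these patterns for a given station is to look at the gaps between its sorted observation years; a single large gap behaves very differently from many one-year gaps. A base-R sketch (hypothetical station years, not package output):

```r
# Longest gap for a hypothetical station resembling "Station B" above
years <- c(1990:1999, 2010:2020)            # data before and after a gap
gaps  <- diff(sort(unique(years))) - 1      # missing years between observations
max(gaps)                                   # 10 consecutive missing years
```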

Assessing Completeness

# Check completeness by station and phase
# Use year_range to focus on a specific period
completeness <- pep_completeness(
  pep = flowering,
  by = c("s_id", "phase_id"),
  year_range = c(1990, 2020)
)

print(completeness)
#> Phenological Data Completeness Assessment
#> -------------------------------------------------- 
#> Grouping by: s_id, phase_id
#> Year range: 1990-2020
#> Total groups: 8390
#> Mean completeness: 95.1%
#> Groups with >80% completeness: 7686 (91.6%)
#> -------------------------------------------------- 
#> 
#>       s_id phase_id n_obs n_stations n_years year_min year_max year_span
#>     <fctr>    <int> <int>      <int>   <int>    <int>    <int>     <int>
#>  1:      1       60    33          1      10     1990     1999        10
#>  2:      1       65    25          1       8     1990     1998         9
#>  3:      3       60    30          1      17     1990     2006        17
#>  4:      3       65    27          1      16     1991     2006        16
#>  5:      4       60    62          1      26     1990     2015        26
#>  6:      4       65    55          1      26     1990     2015        26
#>  7:      5       60    10          1       6     1990     1995         6
#>  8:      5       65     7          1       4     1991     1994         4
#>  9:      9       60     6          1       3     1990     1992         3
#> 10:      9       65     4          1       3     1990     1992         3
#> 11:     11       60    16          1       7     1990     1996         7
#> 12:     11       65    15          1       7     1990     1996         7
#> 13:     13       60     3          1       3     1990     1995         6
#> 14:     13       65     4          1       4     1993     1998         6
#> 15:     15       60    27          1      10     1994     2012        19
#>     completeness_pct median_doy iqr_doy
#>                <num>      <num>   <num>
#>  1:            100.0      129.0    13.0
#>  2:             88.9      137.0    12.0
#>  3:            100.0      116.0    13.8
#>  4:            100.0      125.0     8.0
#>  5:            100.0      119.0    17.0
#>  6:            100.0      123.0    12.5
#>  7:            100.0      116.0    13.5
#>  8:            100.0      129.0     6.5
#>  9:            100.0      121.5    21.0
#> 10:            100.0      126.0    22.5
#> 11:            100.0      121.5    10.0
#> 12:            100.0      134.0     7.0
#> 13:             50.0      134.0    11.0
#> 14:             66.7      133.0     2.8
#> 15:             52.6      123.0    11.0
#> 
#> ... and 8375 more rows

Understanding Completeness Metrics

The function calculates several metrics:

Metric What It Measures Interpretation
n_years Total years with data More = better
completeness_pct % of year span covered 50%+ for trends, 80%+ for normals
year_span Total span (year_max - year_min + 1) Context for completeness
year_min Earliest observation Needed for historical comparison
year_max Most recent observation Needed for current status
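
The relationship between these metrics can be reproduced by hand. Using the values from row 13 of the printout above (s_id 13, phase 60), completeness_pct appears to be 100 × n_years / year_span; this is an inferred relationship, consistent with the printed rows:

```r
# Row 13 of the completeness output: n_years = 3, years 1990-1995
n_years   <- 3
year_min  <- 1990
year_max  <- 1995
year_span <- year_max - year_min + 1        # 6
100 * n_years / year_span                   # 50, matching completeness_pct
```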

Summary Statistics

summary(completeness)
#> Phenological Data Completeness Summary
#> ================================================== 
#> 
#> Total groups: 8390
#> Total observations: 441,295
#> Total unique stations: 8,390
#> 
#> Completeness Distribution:
#>   0-20%:   33 (  0.4%) 
#>   20-40%:   99 (  1.2%) 
#>   40-60%:  224 (  2.7%) *
#>   60-80%:  413 (  4.9%) *
#>   80-100%: 7621 ( 90.8%) ******************
#> 
#> Year Coverage Statistics:
#>   Mean year span: 14.1 years
#>   Mean years with data: 13.2
#>   Earliest data: 1990
#>   Latest data: 2020
#> 
#> Observation Statistics:
#>   Mean observations per group: 52.6
#>   Median observations per group: 38
#>   Max observations: 346
#> 
#> Phenology Statistics:
#>   DOY range: 6 - 304

Filtering by Completeness

Use completeness information to select stations appropriate for your analysis:

# Get stations with good coverage (>= 70%)
good_coverage <- completeness[completeness_pct >= 70]
cat("Stations with >= 70% coverage:", nrow(good_coverage), "\n")
#> Stations with >= 70% coverage: 7893

# Use these for trend analysis
good_stations <- unique(good_coverage$s_id)
flowering_complete <- flowering[s_id %in% good_stations]
cat("Observations from complete stations:", nrow(flowering_complete), "\n")
#> Observations from complete stations: 935551

Completeness Thresholds for Different Analyses

Analysis Type Minimum Completeness Reasoning
Trend detection 50% (15+ years) Need enough points for trend fitting
Climate sensitivity 60% Need variability across climate conditions
30-year normals 80% (24+ years) WMO standard requirement
Station comparison Same time period Avoid bias from different eras

Visualizing Completeness

plot(completeness)


Part 3: Phase Presence Validation

Why Check Phase Presence?

Before starting an analysis, it’s important to verify that your data contains the phenological phases you need. The pep_check_phases() function validates that expected BBCH phase codes are present in your data.

Common questions this helps answer:

Checking Phase Presence

# Check if expected phases are present for apple
apple <- pep[species == "Malus domestica"]

phase_check <- pep_check_phases(
  pep = apple,
  expected = c(60, 65, 87)  # flowering, full flowering, fruit maturity
)

print(phase_check)
#> Phenological Phase Availability Check
#> --------------------------------------------- 
#> 
#> Expected phases: 60, 65, 87
#> Status: COMPLETE - all expected phases present
#> 
#> Phases found:
#>   Phase   0: 1 observations [extra]
#>   Phase   7: 33 observations [extra]
#>   Phase  10: 1,394 observations [extra]
#>   Phase  11: 700 observations [extra]
#>   Phase  15: 159 observations [extra]
#>   Phase  57: 26 observations [extra]
#>   Phase  59: 25 observations [extra]
#>   Phase  60: 205,403 observations
#>   Phase  61: 1,055 observations [extra]
#>   Phase  63: 80 observations [extra]
#>   Phase  65: 175,706 observations
#>   Phase  67: 29 observations [extra]
#>   Phase  69: 170,323 observations [extra]
#>   Phase  81: 151 observations [extra]
#>   Phase  85: 93 observations [extra]
#>   Phase  87: 253,169 observations
#>   Phase  89: 17 observations [extra]
#>   Phase  91: 80 observations [extra]
#>   Phase  93: 617 observations [extra]
#>   Phase  95: 76,453 observations [extra]
#>   Phase  97: 2,470 observations [extra]
#>   Phase 100: 225 observations [extra]
#>   Phase 200: 3 observations [extra]
#>   Phase 201: 517 observations [extra]
#>   Phase 203: 1 observations [extra]
#>   Phase 205: 7,912 observations [extra]
#>   Phase 209: 2,358 observations [extra]
#>   Phase 213: 1 observations [extra]
#>   Phase 380: 208 observations [extra]
#>   Phase 381: 7 observations [extra]
#>   Phase 385: 95 observations [extra]
#> 
#> Detailed coverage:
#>     phase_id  n_obs n_stations n_years year_min year_max
#>        <int>  <int>      <int>   <int>    <int>    <int>
#>  1:        0      1          1       1     2022     2022
#>  2:        7     33          5      11     2014     2025
#>  3:       10   1394        258      26     2000     2025
#>  4:       11    700        157      65     1961     2025
#>  5:       15    159         53      14     2009     2025
#>  6:       57     26          1       7     1942     1949
#>  7:       59     25          4      14     1949     2025
#>  8:       60 205403       7947     100     1926     2025
#>  9:       61   1055        192      69     1936     2025
#> 10:       63     80          5      39     1986     2025
#> 11:       65 175706       7272     125     1896     2025
#> 12:       67     29          5      10     2016     2025
#> 13:       69 170323       7192      86     1937     2025
#> 14:       81    151          5      54     1952     2021
#> 15:       85     93         19       8     2013     2020
#> 16:       87 253169       7504      86     1928     2024
#> 17:       89     17          5      10     2000     2024
#> 18:       91     80         26      13     2013     2025
#> 19:       93    617         76      55     1970     2024
#> 20:       95  76453       4120      78     1944     2024
#> 21:       97   2470        368      32     1966     2024
#> 22:      100    225         18      64     1951     2024
#> 23:      200      3          2       2     2016     2017
#> 24:      201    517         64      55     1970     2024
#> 25:      203      1          1       1     2016     2016
#> 26:      205   7912        779      79     1943     2025
#> 27:      209   2358        357      25     1966     2014
#> 28:      213      1          1       1     2012     2012
#> 29:      380    208         10      63     1961     2024
#> 30:      381      7          4       2     2023     2024
#> 31:      385     95         37      13     2013     2025
#>     phase_id  n_obs n_stations n_years year_min year_max
#>        <int>  <int>      <int>   <int>    <int>    <int>

Understanding the Output

The function returns information about phase coverage:

Output What It Shows
expected The phases you asked for
present Which expected phases were found
missing Which expected phases are absent
complete TRUE if all expected phases are present
n_obs Number of observations per phase

Checking Multiple Species

To check phase presence across multiple species at once:

# Check phases for multiple species
multi_check <- pep_check_phases_multi(
  pep = pep,
  species_list = c("Malus domestica", "Vitis vinifera"),
  expected = c(60, 65, 87)
)

print(multi_check)
#> Multi-Species Phase Availability Check
#> ================================================== 
#> Expected phases: 60, 65, 87
#> 
#> Species with all phases: 2 / 2 (100%)
#> -------------------------------------------------- 
#> 
#>            species complete n_expected n_present n_missing missing_phases
#>             <char>   <lgcl>      <int>     <int>     <int>         <char>
#> 1: Malus domestica     TRUE          3         3         0           <NA>
#> 2:  Vitis vinifera     TRUE          3         3         0           <NA>
#>    total_obs
#>        <int>
#> 1:    899311
#> 2:    146667

Common Phases to Check

Plant Type Common Phases BBCH Codes
Cereals Heading, Flowering, Harvest 60, 65, 100
Fruit trees Flowering, Full flowering, Fruit maturity 60, 65, 87
Deciduous trees Leaf unfolding, Flowering, Leaf fall 11, 60, 95
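
As a sketch, checking the deciduous-tree phases from the table would look like the following. The species name is hypothetical; substitute any deciduous species present in your download:

```r
# Hypothetical species; the guard avoids errors if it is absent from the data
birch <- pep[species == "Betula pendula"]
if (nrow(birch) > 0) {
  pep_check_phases(birch, expected = c(11, 60, 95))
} else {
  cat("No Betula pendula data available.\n")
}
```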

Part 4: Integrated Quality Workflow

Putting It All Together

Here’s a recommended workflow for data quality assessment that combines all the tools covered in this vignette:

# ══════════════════════════════════════════════════════════════════════════════
# STEP 1: Assess temporal completeness
# ══════════════════════════════════════════════════════════════════════════════
# Why: Incomplete stations can bias trend estimates and normals

completeness <- pep_completeness(flowering, by = c("s_id", "phase_id"))
good_stations <- completeness[completeness_pct >= 50, s_id]
fl_filtered <- flowering[s_id %in% good_stations]

cat("Kept", length(good_stations), "stations with >= 50% completeness\n")
#> Kept 15213 stations with >= 50% completeness

# ══════════════════════════════════════════════════════════════════════════════
# STEP 2: Flag statistical outliers
# ══════════════════════════════════════════════════════════════════════════════
# Why: Identify observations that deviate from expected patterns

outliers_wf <- pep_flag_outliers(fl_filtered, method = "mad", threshold = 3)

cat("Flagged", sum(outliers_wf$is_outlier), "outliers",
    "(", round(100 * mean(outliers_wf$is_outlier), 1), "% )\n")
#> Flagged 20449 outliers ( 1.6 % )

# ══════════════════════════════════════════════════════════════════════════════
# STEP 3: Make informed decisions about exclusion
# ══════════════════════════════════════════════════════════════════════════════
# Key principle: Document your decisions!

# Option A: Strict cleaning (for normals calculation)
fl_strict <- outliers_wf[is_outlier == FALSE]

# Option B: Moderate cleaning (keep moderate outliers)
fl_moderate <- outliers_wf[is_outlier == FALSE | abs(deviation) < 60]

cat("Strict cleaning keeps:", nrow(fl_strict), "obs\n")
#> Strict cleaning keeps: 1252610 obs
cat("Moderate cleaning keeps:", nrow(fl_moderate), "obs\n")
#> Moderate cleaning keeps: 1272698 obs

# ══════════════════════════════════════════════════════════════════════════════
# STEP 4: Proceed with analysis on cleaned data
# ══════════════════════════════════════════════════════════════════════════════

normals <- pheno_normals(fl_moderate, period = 1991:2020, min_years = 5)
#> Note: 44 group(s) have fewer than 5 years of data and return NA values.
cat("Normals calculated for", sum(!is.na(normals$mean_doy)), "groups\n")
#> Normals calculated for 202 groups

Documentation Template

When publishing research, document your quality control decisions:

Data Quality Control:
1. Completeness filtering: Excluded stations with < 50% temporal coverage
   (reduced dataset from N to M stations)

2. Outlier detection: Used MAD method with threshold = 3
   (flagged X observations as outliers, Y% of total)

3. Outlier disposition:
   - Excluded: Z observations with deviations > 60 days (likely errors)
   - Retained: W observations identified as abnormal events (e.g., second flowering)

4. Phase validation: Checked presence of required BBCH phases in [species list]
   (identified V cases for manual review)

Best Practices Summary

For Outlier Detection

  1. Use robust methods: MAD or IQR methods handle multiple outliers better than z-scores. The MAD method is generally recommended.

  2. Check seasonally: Late-season outliers (DOY > 250) may be biologically meaningful events (e.g. second flowering) rather than errors.

  3. Verify extremes: For observations with very large deviations (> 60 days), try to check original data sources or contact data providers.

  4. Document decisions: Record which outliers you exclude and why. This is essential for reproducibility and peer review.

  5. Don’t delete automatically: Human judgment is needed to distinguish errors from genuine unusual events.

For Abnormal Event Detection

  1. Consider species biology: Some species (e.g., certain Rosaceae) are more prone to second flowering than others.

  2. Check geographic patterns: If multiple nearby stations report late events in the same year, this suggests a real regional phenomenon.

  3. Look at weather context: Link abnormal events to weather extremes - drought is a common trigger.

  4. Distinguish from errors: Very isolated observations (single station, single year) need verification before treating them as genuine events such as second flowering.

  5. Keep for separate analysis: Abnormal phenological events are scientifically interesting (e.g. as indicators of climate change impacts) - don’t just delete them!

For Completeness Assessment

  1. Set appropriate thresholds: Use 80%+ coverage for calculating normals, 50%+ for trend analysis.

  2. Consider gap patterns: Many small gaps are usually better than one long gap that might coincide with important climate changes.

  3. Check temporal bias: Ensure you’re not comparing stations with data only from early periods to stations with only recent data.

  4. Document station selection: Report how many stations met your completeness criteria and how this affected your sample.

For Phase Presence Checking

  1. Check early: Run pep_check_phases() at the start of your analysis to verify required phases are available in your data.

  2. Know required phases: Different analyses need different phases - know which BBCH codes you need before starting.

  3. Check across species: Use pep_check_phases_multi() to verify data availability for all species you plan to analyze.


Summary

This vignette covered data quality tools for phenological analysis:

Function Purpose Key Output
pep_flag_outliers() Identify unusual observations is_outlier flag, deviation in days
pep_plot_outliers() Visualize outlier patterns Four plot types: overview, seasonal, detail, map
pep_completeness() Assess temporal coverage Completeness %, gaps, year range
pep_check_phases() Check phase presence Missing phases, observation counts

Key Take-Home Messages

  1. Data quality assessment is not optional - it’s a critical step that affects the validity of all downstream analyses.

  2. Not all outliers are errors - some represent genuine biological phenomena like second flowering that deserve further study.

  3. Document your decisions - quality control choices should be transparent and reproducible.

  4. Use robust methods - the MAD method is recommended for phenological data because it handles multiple outliers well.

  5. Visual inspection matters - plots often reveal patterns that summary statistics miss.

Next Steps

Explore the other vignettes for complementary analyses:

vignette("getting-started", package = "pep725")
vignette("phenological-analysis", package = "pep725")
vignette("spatial-patterns", package = "pep725")

Session Info

sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Tahoe 26.0.1
#> 
#> Matrix products: default
#> BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Europe/Zurich
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pep725_1.0.0        ggplot2_4.0.1       data.table_1.18.2.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6        jsonlite_2.0.0      dplyr_1.2.0        
#>  [4] compiler_4.5.2      Rcpp_1.1.1          tidyselect_1.2.1   
#>  [7] tidyr_1.3.2         jquerylib_0.1.4     scales_1.4.0       
#> [10] yaml_2.3.12         fastmap_1.2.0       R6_2.6.1           
#> [13] labeling_0.4.3      generics_0.1.4      patchwork_1.3.2    
#> [16] classInt_0.4-11     robustbase_0.99-7   sf_1.0-24          
#> [19] knitr_1.51          tibble_3.3.1        units_1.0-0        
#> [22] DBI_1.2.3           bslib_0.10.0        pillar_1.11.1      
#> [25] RColorBrewer_1.1-3  rlang_1.1.7         cachem_1.1.0       
#> [28] xfun_0.56           sass_0.4.10         S7_0.2.1           
#> [31] otel_0.2.0          rnaturalearth_1.2.0 viridisLite_0.4.2  
#> [34] cli_3.6.5           withr_3.0.2         magrittr_2.0.4     
#> [37] class_7.3-23        digest_0.6.39       grid_4.5.2         
#> [40] rstudioapi_0.18.0   lifecycle_1.0.5     DEoptimR_1.1-4     
#> [43] vctrs_0.7.1         KernSmooth_2.23-26  proxy_0.4-29       
#> [46] evaluate_1.0.5      glue_1.8.0          farver_2.1.2       
#> [49] e1071_1.7-17        rmarkdown_2.30      purrr_1.2.1        
#> [52] tools_4.5.2         pkgconfig_2.0.3     htmltools_0.5.9

  1. University of Applied Sciences Northwestern Switzerland (FHNW)↩︎

  2. Swiss Federal Institute for Forest, Snow and Landscape Research (WSL)↩︎