ZooGoN 4.1.0

Bug fixes and data quality improvements

  • preprocess_surveys(): Added an explicit AphiaID remapping table that replaces 16 unaccepted WoRMS taxon identifiers with their accepted equivalents before any downstream processing. Previously, unaccepted IDs from KoboToolbox submissions were passed through unchanged and resolved only later (or not at all).

  • preprocess_surveys(): NA AphiaIDs now raise an informative error listing the affected event_id values instead of silently dropping those rows. This makes data-quality issues in incoming survey submissions immediately visible.

  • ingest_legacy_84_15() and ingest_legacy_16_20(): Improved resolution of unaccepted WoRMS names. The functions now retain valid_AphiaID and valid_name from the WoRMS match and explicitly overwrite AphiaID, scientificName, and lsid with accepted values for any taxon whose status is "unaccepted". Deduplication is then performed by slice_min(AphiaID) within each (sample, taxon, life-stage) group, replacing the previous strategy of sorting by AphiaID and taking the first row.
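The remapping, NA check, and deduplication described above can be sketched as follows. This is a minimal illustration, not the package's internal code: the column names, the two remap entries (the real table has 16), and the grouping by event_id (the real grouping is by sample, taxon, and life stage) are all simplified.

```r
library(dplyr)

# Illustrative remap table: unaccepted AphiaID -> accepted AphiaID
aphia_remap <- c(`149619` = 104464, `345678` = 104465)

records <- tibble(
  event_id  = c("E1", "E2", "E2"),
  AphiaID   = c(149619, 104465, 104466),
  abundance = c(10, 5, 7)
)

# Replace unaccepted IDs with their accepted equivalents
records <- records |>
  mutate(AphiaID = unname(
    coalesce(aphia_remap[as.character(AphiaID)], AphiaID)
  ))

# Fail loudly on missing AphiaIDs instead of silently dropping rows
if (anyNA(records$AphiaID)) {
  stop("NA AphiaID for event_id(s): ",
       paste(records$event_id[is.na(records$AphiaID)], collapse = ", "))
}

# Keep one row per group via slice_min(AphiaID)
deduped <- records |>
  group_by(event_id) |>
  slice_min(AphiaID, n = 1, with_ties = FALSE) |>
  ungroup()
```

Unlike sorting and taking the first row, `slice_min()` makes the tie-breaking rule explicit in the code.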

ZooGoN 4.0.0

Breaking changes

New functions

  • ingest_legacy_84_15(): downloads, validates, and harmonises the 1984–2015 historical zooplankton dataset, including WoRMS taxonomic lookups and life-stage code standardisation. Outputs McZoo_84-15 to SharePoint.
  • ingest_legacy_16_20(): same workflow as above for the 2016–2020 legacy dataset. Outputs McZoo_16-20 to SharePoint.
  • upload_sharepoint_file(): uploads an arbitrary local file to SharePoint (complements the existing upload_sharepoint_df() for data frames).

Code organisation

  • Survey ingestion logic moved to R/kobo-ingestion.R.
  • Survey preprocessing logic moved to R/kobo-processing.R.
  • Legacy data ingestion logic consolidated in R/legacy-data-ingestion.R.

Documentation

  • Vignettes updated to reflect renamed functions and new legacy-ingestion steps.
  • Added plain-language introductions to both vignettes for a non-technical (biology/ecology) audience.
  • pkgdown reference index reorganised: added “Legacy Data Ingestion”, “Pipeline & Reporting” sections; all renamed functions updated.

ZooGoN 3.1.0

Fully Automated Pipeline

This release completes the end-to-end data pipeline from field survey collection to Darwin Core Archive publication.

New Functions

  • raw_to_tidy(): Merges legacy datasets (1984-2020) with preprocessed ongoing surveys into a single analysis-ready dataset. Uploads the merged data in both CSV and Parquet formats to SharePoint.

Pipeline & Automation

  • Complete 6-step GitHub Actions pipeline: build container → ingest surveys → preprocess → merge tidy data → Darwin Core conversion → build archive.
  • raw_to_dc() now reads from the tidy data output (produced by raw_to_tidy()) instead of merging legacy files internally. Results are uploaded to SharePoint as versioned RDS.
  • dc_to_archive() no longer takes arguments — it downloads Darwin Core tables from SharePoint automatically.
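The step ordering above maps naturally onto chained GitHub Actions jobs, roughly like the following sketch. Job names are illustrative only; the real workflow file may differ.

```yaml
# Illustrative job graph for the 6-step pipeline
jobs:
  build-container: {}
  ingest-surveys:      { needs: build-container }
  preprocess-surveys:  { needs: ingest-surveys }
  merge-tidy-data:     { needs: preprocess-surveys }
  darwin-core:         { needs: merge-tidy-data }
  build-archive:       { needs: darwin-core }
```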

Storage Improvements

  • Large file upload support: Files larger than 4 MB are uploaded via Microsoft Graph API resumable upload sessions with chunked transfer (10 MB chunks).
  • RDS format support in upload_sharepoint_df(), download_sharepoint_file(), and related helpers.
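In a Graph API resumable upload session, each chunk request carries a Content-Range header describing the byte window it covers. The boundary arithmetic can be sketched as below; the helper name is illustrative, not a package export.

```r
# Compute the Content-Range header for each chunk of a resumable upload
# (10 MB chunks, as used for files over 4 MB). Illustrative helper only.
chunk_ranges <- function(file_size, chunk_size = 10 * 1024^2) {
  starts <- seq(0, file_size - 1, by = chunk_size)
  ends   <- pmin(starts + chunk_size - 1, file_size - 1)
  sprintf("bytes %d-%d/%d", starts, ends, file_size)
}

chunk_ranges(25 * 1024^2)  # a 25 MB file is sent in three chunks
```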

Production Hardening

  • Fixed Dockerfile.prod syntax (install2.r does not accept version specs) and added a library validation step.
  • Bumped the minimum R requirement to R (>= 4.1), needed for the native pipe |>.
  • Fixed scalar logical operators (|| instead of |) in KoboToolbox validation.
  • Internal helpers (flatten_row, flatten_field, rename_child) are no longer exported.
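The `||` fix matters because `|` is vectorized while `||` is strictly scalar: a vectorized comparison inside `if()` can yield a multi-element condition, which recent R versions treat as an error. A minimal illustration:

```r
a <- c(TRUE, FALSE)
b <- FALSE

a | b      # vectorized: returns c(TRUE, FALSE), length 2 -- unsafe in if()
a[1] || b  # scalar: returns a single TRUE, safe as an if() condition
```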

Documentation

  • Updated README, data-processing vignette, and darwin-core vignette to reflect the full pipeline workflow.
  • Reorganised pkgdown reference index with a “Data Processing” section.

ZooGoN 3.0.0

  • Added automated GBIF publishing helpers: register_gbif_dataset() for production use and register_gbif_dataset_test() for the GBIF-Test demo flow; both take a public DwC-A URL and handle dataset registration.
  • Simplified Darwin Core export: dc_to_archive() now builds the DwC-A + EML from raw_to_dc() output and uploads the archive to SharePoint.
  • Clarified EML generation (get_metadata()) and cleaned up pkgdown reference sections.

ZooGoN 2.0.0

Automation & Cloud Integration

This release introduces cloud storage connectivity and survey data ingestion capabilities for the ZooGoN package. The focus is on building infrastructure for automated workflows, with new functions for SharePoint integration, KoboToolbox data retrieval, and streamlined Darwin Core conversion.

New Functions

Cloud Storage Functions
  • upload_sharepoint_df(): Upload data frames to Microsoft SharePoint document libraries
    • Automatic file versioning with timestamps
    • Support for CSV, TSV, and Parquet formats
    • Authentication via Microsoft Graph API
  • download_sharepoint_file(): Download files from SharePoint
    • Retrieve latest versioned files by prefix
    • Support for multiple file formats

Survey Data Ingestion
  • ingest_surveys(): Retrieve survey data from KoboToolbox and upload to SharePoint
    • Connects to KoboToolbox API to download survey submissions
    • Flattens nested JSON survey data into tabular format
    • Validates data integrity and checks for duplicates
    • Uploads processed data to cloud storage
  • get_kobo_data(): Low-level function to retrieve data from KoboToolbox API
    • Handles pagination for large datasets
    • Supports JSON and XML formats

Darwin Core Conversion
  • raw_to_dc(): Convert preprocessed legacy data to Darwin Core format
    • Processes parquet files with WoRMS-validated taxonomic data
    • Creates Event, Occurrence, and eMoF (Extended Measurement or Fact) tables
    • Applies BODC NERC Vocabulary standards
    • Currently processes McZoo_84-13.parquet (1984-2013 data)
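The flattening idea behind the survey ingestion can be sketched in base R. Field names here are made up, and this is not the package's actual implementation: the internal helpers also preserve column types, which a plain `unlist()` does not (everything is coerced to character).

```r
# A nested KoboToolbox-style submission (simplified, illustrative fields)
submission <- list(
  id      = 42,
  station = list(name = "MC", depth_m = 50),
  counts  = list(copepods = 120, chaetognaths = 3)
)

# Flatten to a single row: unlist() joins nested names with "."
row <- as.data.frame(t(unlist(submission)), stringsAsFactors = FALSE)
names(row)
#> "id" "station.name" "station.depth_m" "counts.copepods" "counts.chaetognaths"
```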

Infrastructure Changes

Configuration System
  • New YAML-based configuration (inst/config.yml) for managing:
    • SharePoint connection details and credentials
    • KoboToolbox API settings
    • Data bucket organization (aut_bucket, hot_bucket)
    • File naming conventions
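A config.yml of this shape might look like the following. The structure and most field names are illustrative; only aut_bucket and hot_bucket are named in these release notes.

```yaml
# inst/config.yml -- illustrative structure only
default:
  sharepoint:
    site_url: "https://example.sharepoint.com/sites/zoogon"
    drive: "Documents"
  kobo:
    server: "https://kf.kobotoolbox.org"
  buckets:
    aut_bucket: "automated"
    hot_bucket: "hot"
```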

Code Organization
  • Modular file structure:
    • R/storage.R: SharePoint upload/download operations
    • R/ingestion-surveys.R: KoboToolbox data retrieval
    • R/preprocessing-surveys.R: Survey data transformation
    • R/processing.R: Darwin Core conversion

New Dependencies

  • httr2: HTTP API interactions
  • arrow: Parquet file handling
  • config: Configuration management
  • logger: Logging functionality
  • purrr: Functional programming utilities

Documentation

  • Updated function documentation with roxygen2
  • Revised data-processing.Rmd vignette to document raw_to_dc() workflow
  • Updated raw_to_dc.Rd manual page

Breaking Changes

  • Removed process_lter_data() function (replaced by raw_to_dc())
  • Requires inst/config.yml configuration file with credentials
  • raw_to_dc() expects preprocessed parquet files instead of raw Excel files

Notes

  • This release establishes the foundation for automated data workflows
  • Survey data ingestion is functional but preprocessing steps are still in development
  • Darwin Core conversion currently handles legacy data (1984-2013); additional time periods will be added in future releases

ZooGoN 1.0.0

Major Release - Comprehensive Data Processing Pipeline

This major release introduces a complete data processing workflow for the LTER-MareChiara zooplankton dataset, transforming ZooGoN from a basic taxonomic standardization tool into a comprehensive biodiversity data processing package.

New Features

  • process_lter_data() Function: Complete integrated workflow from raw Excel files to Darwin Core format
  • WoRMS Taxonomic Validation: Optional integration with World Register of Marine Species for taxonomic accuracy
  • Flexible Output Options: Support for both R list objects and direct CSV export
  • Enhanced Error Handling: Comprehensive validation and graceful error recovery
  • Geographic Metadata Integration: Automatic LTER-MareChiara station coordinates and sampling information
  • Processing Metadata: Detailed workflow information and data quality metrics

Technical Improvements

  • Enhanced Dependencies: Added readr and worrms packages for improved functionality
  • Comprehensive Documentation: Extensive function documentation with examples and parameter descriptions
  • Quality Control: Built-in data validation and processing verification
  • Performance Optimization: Efficient handling of large taxonomic datasets (40 years of data)

Data Processing Capabilities

  • Complete Workflow: Single function handles entire Excel → Darwin Core pipeline
  • Taxonomic Standardization: Enhanced extract_genus_species() integration with WoRMS validation
  • Darwin Core Extensions: Automatic generation of Event, Occurrence, and eMoF tables
  • Temporal Data Handling: Robust Excel date conversion and temporal metadata management
  • Sample Identification: Automated mapping between dates and standardized sample IDs
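Excel stores dates as serial day numbers, and the conventional R conversion uses the "1899-12-30" origin, which absorbs Excel's fictitious 1900-02-29. A minimal example of the conversion mentioned above:

```r
# Excel serial 44197 corresponds to 2021-01-01
excel_serial <- 44197
as.Date(excel_serial, origin = "1899-12-30")
#> "2021-01-01"
```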

Documentation Updates

  • Enhanced Vignettes: Updated data processing workflow to demonstrate process_lter_data()
  • README Improvements: Showcases new comprehensive processing capabilities
  • CLAUDE.md Updates: Reflects new package architecture and main functions
  • Function Examples: Comprehensive usage examples for all processing scenarios

Breaking Changes

  • Package Version: Updated to 1.0.0 to reflect major functionality additions
  • Enhanced Dependencies: New required packages for full functionality

Performance

  • Large Dataset Support: Efficiently processes 1,506 samples with 148+ taxa
  • Memory Management: Optimized for 40-year temporal datasets
  • Optional Processing: WoRMS validation can be disabled for faster execution

ZooGoN 0.1.0

Initial Release

This is the initial release of ZooGoN as part of the DTO-BioFlow project (Gulf of Naples - 40 Years of Zooplankton Biodiversity Assessment).

Features

  • Taxonomic Standardization: Core extract_genus_species() function for standardizing Mediterranean zooplankton taxonomic names
  • Darwin Core Compliance: Full support for converting LTER-MareChiara data to Darwin Core Archive format
  • FAIR Data Principles: Ensures Findable, Accessible, Interoperable, Reusable data output
  • EMODnet Biology Integration: Quality-controlled data ready for European marine biodiversity infrastructure

Dataset Coverage

  • Temporal Scope: 1984-2024 (40 years of continuous monitoring)
  • Sample Size: 1,506 zooplankton samples from Gulf of Naples
  • Taxonomic Diversity: 148 copepod species + 61 other zooplankton taxa
  • Geographic Coverage: LTER-MareChiara station (40.81°N, 14.25°E), Tyrrhenian Sea

Documentation

  • Comprehensive package documentation with pkgdown website
  • Data processing workflow vignette
  • Darwin Core integration guide
  • EMODnet Biology compliance documentation

Project Context

This package represents the technical “workbase” for the ZOOGoN-40Y project, funded by a €60,000 DTO-BioFlow FSTP grant under the EU Horizon Mission “Restore our Ocean & Waters by 2030”. The work is conducted by Stazione Zoologica Anton Dohrn (Naples, Italy) as part of the broader Digital Twin of the Ocean initiative.

Standards Compliance

  • Darwin Core Archive: International biodiversity data standard
  • WoRMS Integration: World Register of Marine Species validation
  • BODC NERC Vocabulary: Standardized measurement terminology
  • ISO19115 Metadata: International metadata standards
  • Creative Commons Licensing: CC-BY-NC for open science

Acknowledgments

This work builds on 40 years of dedicated zooplankton monitoring at the LTER-MareChiara research station and contributes to the European Digital Twin of the Ocean for enhanced marine ecosystem understanding and management.

ZooGoN 0.0.1

  • Initial commit