# ZooGoN 4.1.0

## Bug fixes and data quality improvements
- `preprocess_surveys()`: Added an explicit AphiaID remapping table that replaces 16 unaccepted WoRMS taxon identifiers with their accepted equivalents before any downstream processing. Previously, unaccepted IDs from KoboToolbox submissions were passed through unchanged and resolved only later (or not at all).
- `preprocess_surveys()`: NA AphiaIDs now raise an informative error listing the affected `event_id` values instead of silently dropping those rows. This makes data-quality issues in incoming survey submissions immediately visible.
- `ingest_legacy_84_15()` and `ingest_legacy_16_20()`: Improved resolution of unaccepted WoRMS names. The functions now retain `valid_AphiaID` and `valid_name` from the WoRMS match and explicitly overwrite `AphiaID`, `scientificName`, and `lsid` with accepted values for any taxon whose status is `"unaccepted"`. Deduplication is then performed by `slice_min(AphiaID)` within each (sample, taxon, life-stage) group, replacing the previous strategy of sorting by `AphiaID` and taking the first row.
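The remapping and deduplication steps can be sketched in base R as follows; the package itself uses dplyr's `slice_min()`, and all column names, taxon names, and AphiaID values below are illustrative assumptions, not ZooGoN internals:

```r
# Base-R sketch of the remapping + deduplication logic; the package
# itself uses dplyr's slice_min(). All IDs and names are illustrative.
remap <- data.frame(
  unaccepted = c(104464L, 149619L),   # hypothetical unaccepted AphiaIDs
  accepted   = c(104465L, 104108L)    # their accepted equivalents
)

occ <- data.frame(
  event_id       = c("e1", "e1", "e2"),
  scientificName = c("Temora stylifera", "Temora stylifera", "Oithona similis"),
  life_stage     = c("adult", "adult", "copepodite"),
  AphiaID        = c(104464L, 104465L, 149619L)
)

# Fail loudly on NA AphiaIDs, listing the affected event_id values
if (anyNA(occ$AphiaID)) {
  stop("NA AphiaID for event_id: ",
       paste(unique(occ$event_id[is.na(occ$AphiaID)]), collapse = ", "))
}

# Replace unaccepted IDs with their accepted equivalents
hit <- match(occ$AphiaID, remap$unaccepted)
occ$AphiaID <- ifelse(is.na(hit), occ$AphiaID, remap$accepted[hit])

# Keep the minimum AphiaID per (sample, taxon, life-stage) group
occ <- occ[order(occ$AphiaID), ]
occ <- occ[!duplicated(occ[c("event_id", "scientificName", "life_stage")]), ]
```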
# ZooGoN 4.0.0

## Breaking changes
- `raw_to_tidy()` renamed to `format_to_tidy()`.
- `raw_to_dc()` renamed to `format_to_dc()`.
- `dc_to_archive()` renamed to `format_to_DC_archive()`.
## New functions
- `ingest_legacy_84_15()`: downloads, validates, and harmonises the 1984–2015 historical zooplankton dataset, including WoRMS taxonomic lookups and life-stage code standardisation. Outputs `McZoo_84-15` to SharePoint.
- `ingest_legacy_16_20()`: same workflow as above for the 2016–2020 legacy dataset. Outputs `McZoo_16-20` to SharePoint.
- `upload_sharepoint_file()`: uploads an arbitrary local file to SharePoint (complements the existing `upload_sharepoint_df()` for data frames).
## Code organisation
- Survey ingestion logic moved to `R/kobo-ingestion.R`.
- Survey preprocessing logic moved to `R/kobo-processing.R`.
- Legacy data ingestion logic consolidated in `R/legacy-data-ingestion.R`.
## Documentation
- Vignettes updated to reflect renamed functions and new legacy-ingestion steps.
- Added plain-language introductions to both vignettes for a non-technical (biology/ecology) audience.
- pkgdown reference index reorganised: added “Legacy Data Ingestion”, “Pipeline & Reporting” sections; all renamed functions updated.
# ZooGoN 3.1.0

## Fully Automated Pipeline
This release completes the end-to-end data pipeline from field survey collection to Darwin Core Archive publication.
## New Functions
- `raw_to_tidy()`: Merges legacy datasets (1984-2020) with preprocessed ongoing surveys into a single analysis-ready dataset. Uploads the merged data in both CSV and Parquet formats to SharePoint.
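A minimal sketch of this merge-and-export step, assuming illustrative data frames and file names; the SharePoint upload and the `{arrow}` Parquet write are only indicated in comments:

```r
# Sketch of the merge step; data frames and file names are illustrative.
legacy_84_15 <- data.frame(event_id = c("L1", "L2"), ind_m3 = c(10, 4))
legacy_16_20 <- data.frame(event_id = "L3", ind_m3 = 7)
surveys      <- data.frame(event_id = "S1", ind_m3 = 2)

tidy <- rbind(legacy_84_15, legacy_16_20, surveys)

# One output per format (SharePoint upload omitted here)
csv_path <- file.path(tempdir(), "McZoo_tidy.csv")
write.csv(tidy, csv_path, row.names = FALSE)
# arrow::write_parquet(tidy, sub("\\.csv$", ".parquet", csv_path))  # needs {arrow}
```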
## Pipeline & Automation
- Complete 6-step GitHub Actions pipeline: build container → ingest surveys → preprocess → merge tidy data → Darwin Core conversion → build archive.
- `raw_to_dc()` now reads from the tidy data output (produced by `raw_to_tidy()`) instead of merging legacy files internally. Results are uploaded to SharePoint as versioned RDS.
- `dc_to_archive()` no longer takes arguments; it downloads the Darwin Core tables from SharePoint automatically.
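The six pipeline steps map naturally onto a GitHub Actions `needs:` chain. The sketch below shows only the dependency structure, with hypothetical job names and all job bodies omitted; the real workflow file differs:

```yaml
# Hypothetical sketch: dependency chain only, job bodies omitted.
jobs:
  build-container: {}                            # 1. build container
  ingest-surveys:  { needs: build-container }    # 2. ingest surveys
  preprocess:      { needs: ingest-surveys }     # 3. preprocess
  merge-tidy:      { needs: preprocess }         # 4. merge tidy data
  darwin-core:     { needs: merge-tidy }         # 5. Darwin Core conversion
  build-archive:   { needs: darwin-core }        # 6. build archive
```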
## Storage Improvements
- Large file upload support: Files larger than 4 MB are uploaded via Microsoft Graph API resumable upload sessions with chunked transfer (10 MB chunks).
- RDS format support in `upload_sharepoint_df()`, `download_sharepoint_file()`, and related helpers.
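The chunking arithmetic behind the resumable uploads can be sketched as follows; only the 10 MB chunk size comes from the release notes, and the Graph API upload-session calls themselves are omitted:

```r
# Chunking arithmetic for resumable uploads; only the 10 MB chunk size
# comes from the release notes, everything else is illustrative.
chunk_size <- 10 * 1024^2          # 10 MB
file_size  <- 25 * 1024^2          # e.g. a 25 MB file

starts <- seq(0, file_size - 1, by = chunk_size)
ends   <- pmin(starts + chunk_size - 1, file_size - 1)

# One Content-Range header per PUT in the upload session
ranges <- sprintf("bytes %.0f-%.0f/%.0f", starts, ends, file_size)
```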
## Production Hardening
- Fixed `Dockerfile.prod` syntax (`install2.r` does not accept version specs) and added a library validation step.
- Bumped the minimum R version to `R (>= 4.1)` (required for the native pipe `|>`).
- Fixed scalar logical operators (`||` instead of `|`) in KoboToolbox validation.
- Internal helpers (`flatten_row`, `flatten_field`, `rename_child`) are no longer exported.
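A quick illustration of why the `||` fix matters: `|` is vectorised and returns a vector, while `||` is scalar and short-circuits, which is what `if ()` conditions in validation code require. Toy example only:

```r
# `|` is vectorised; `||` is scalar and short-circuits. Toy example only.
x <- c(TRUE, FALSE)

vec <- x | c(TRUE, TRUE)                        # length-2 vector: TRUE TRUE
# if (vec) ... would be an error: the condition must have length one

ok <- length(x) == 2 || stop("never reached")   # RHS never evaluated
```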
# ZooGoN 3.0.0
- Added automated GBIF publishing helpers: `register_gbif_dataset()` for production use and `register_gbif_dataset_test()` for the GBIF-Test demo flow; both take a public DwC-A URL and handle dataset registration.
- Simplified Darwin Core export: `dc_to_archive()` now builds the DwC-A + EML from `raw_to_dc()` output and uploads the archive to SharePoint.
- Clarified EML generation (`get_metadata()`) and cleaned up pkgdown reference sections.
# ZooGoN 2.0.0

## Automation & Cloud Integration
This release introduces cloud storage connectivity and survey data ingestion capabilities for the ZooGoN package. The focus is on building infrastructure for automated workflows, with new functions for SharePoint integration, KoboToolbox data retrieval, and streamlined Darwin Core conversion.
## New Functions

### Cloud Storage Functions
- `upload_sharepoint_df()`: Upload data frames to Microsoft SharePoint document libraries
  - Automatic file versioning with timestamps
  - Support for CSV, TSV, and Parquet formats
  - Authentication via Microsoft Graph API
- `download_sharepoint_file()`: Download files from SharePoint
  - Retrieve latest versioned files by prefix
  - Support for multiple file formats
### Survey Data Ingestion
- `ingest_surveys()`: Retrieve survey data from KoboToolbox and upload to SharePoint
  - Connects to the KoboToolbox API to download survey submissions
  - Flattens nested JSON survey data into tabular format
  - Validates data integrity and checks for duplicates
  - Uploads processed data to cloud storage
- `get_kobo_data()`: Low-level function to retrieve data from the KoboToolbox API
  - Handles pagination for large datasets
  - Supports JSON and XML formats
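The JSON-flattening step can be sketched as a small recursive helper; the field names below are invented for illustration and do not reflect the real KoboToolbox form:

```r
# Toy recursive flattener for a nested KoboToolbox-style submission;
# field names are invented for illustration.
submission <- list(
  `_id`   = 101L,
  station = "MC",
  sample  = list(depth_m = 50, net = "WP2")
)

flatten_submission <- function(x, prefix = "") {
  out <- list()
  for (nm in names(x)) {
    key <- paste0(prefix, nm)
    if (is.list(x[[nm]])) {
      # nested group: recurse with a slash-separated prefix
      out <- c(out, flatten_submission(x[[nm]], paste0(key, "/")))
    } else {
      out[[key]] <- x[[nm]]
    }
  }
  out
}

flat <- as.data.frame(flatten_submission(submission), check.names = FALSE)
# flat has one row and columns `_id`, station, sample/depth_m, sample/net
```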
### Darwin Core Conversion
- `raw_to_dc()`: Convert preprocessed legacy data to Darwin Core format
  - Processes Parquet files with WoRMS-validated taxonomic data
  - Creates Event, Occurrence, and eMoF (Extended Measurement or Fact) tables
  - Applies BODC NERC Vocabulary standards
  - Currently processes `McZoo_84-13.parquet` (1984-2013 data)
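The eMoF construction amounts to pivoting per-occurrence measurements into a long table. This sketch uses illustrative measurement types and units and omits the BODC NERC vocabulary URIs:

```r
# Pivot one occurrence's measurements into eMoF long format;
# measurement types/units are illustrative and NERC URIs are omitted.
occ <- data.frame(
  occurrenceID = "occ-1",
  abundance    = 12.5,   # assumed unit: individuals per m^3
  biovolume    = 0.8
)

emof <- data.frame(
  occurrenceID     = occ$occurrenceID,
  measurementType  = c("abundance", "biovolume"),
  measurementValue = c(occ$abundance, occ$biovolume)
)
```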
## Infrastructure Changes

### New Dependencies
- `httr2`: HTTP API interactions
- `arrow`: Parquet file handling
- `config`: Configuration management
- `logger`: Logging functionality
- `purrr`: Functional programming utilities
## Documentation
- Updated function documentation with roxygen2
- Revised `data-processing.Rmd` vignette to document the `raw_to_dc()` workflow
- Updated `raw_to_dc.Rd` manual page
# ZooGoN 1.0.0

## Major Release: Comprehensive Data Processing Pipeline
This major release introduces a complete data processing workflow for the LTER-MareChiara zooplankton dataset, transforming ZooGoN from a basic taxonomic standardization tool into a comprehensive biodiversity data processing package.
## New Features
- `process_lter_data()`: Complete integrated workflow from raw Excel files to Darwin Core format
- WoRMS Taxonomic Validation: Optional integration with the World Register of Marine Species for taxonomic accuracy
- Flexible Output Options: Support for both R list objects and direct CSV export
- Enhanced Error Handling: Comprehensive validation and graceful error recovery
- Geographic Metadata Integration: Automatic LTER-MareChiara station coordinates and sampling information
- Processing Metadata: Detailed workflow information and data quality metrics
## Technical Improvements
- Enhanced Dependencies: Added the `readr` and `worrms` packages for improved functionality
- Comprehensive Documentation: Extensive function documentation with examples and parameter descriptions
- Quality Control: Built-in data validation and processing verification
- Performance Optimization: Efficient handling of large taxonomic datasets (40 years of data)
## Data Processing Capabilities
- Complete Workflow: Single function handles entire Excel → Darwin Core pipeline
- Taxonomic Standardization: Enhanced `extract_genus_species()` integration with WoRMS validation
- Darwin Core Extensions: Automatic generation of Event, Occurrence, and eMoF tables
- Temporal Data Handling: Robust Excel date conversion and temporal metadata management
- Sample Identification: Automated mapping between dates and standardized sample IDs
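Excel date conversion typically reduces to the serial-number origin trick shown below. This is a sketch, not the package's actual implementation; the origin `"1899-12-30"` aligns R with Windows Excel, which treats 1900 as a leap year:

```r
# Excel stores dates as serial day counts; origin "1899-12-30" makes the
# conversion line up with Windows Excel's 1900 leap-year quirk.
# Sketch only; not the package's actual code.
excel_to_date <- function(serial) as.Date(serial, origin = "1899-12-30")

excel_to_date(30682)   # "1984-01-01"
```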
## Documentation Updates
- Enhanced Vignettes: Updated the data processing workflow vignette to demonstrate `process_lter_data()`
- README Improvements: Showcases the new comprehensive processing capabilities
- CLAUDE.md Updates: Reflects new package architecture and main functions
- Function Examples: Comprehensive usage examples for all processing scenarios
# ZooGoN 0.1.0

## Initial Release
This is the initial release of ZooGoN as part of the DTO-BioFlow project (Gulf of Naples - 40 Years of Zooplankton Biodiversity Assessment).
## Features
- Taxonomic Standardization: Core `extract_genus_species()` function for standardizing Mediterranean zooplankton taxonomic names
- Darwin Core Compliance: Full support for converting LTER-MareChiara data to Darwin Core Archive format
- FAIR Data Principles: Ensures Findable, Accessible, Interoperable, Reusable data output
- EMODnet Biology Integration: Quality-controlled data ready for European marine biodiversity infrastructure
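A minimal sketch of what a genus/species extraction like `extract_genus_species()` might do for a clean binomial; the real function presumably handles qualifiers such as `sp.` and `cf.` as well:

```r
# Hypothetical split of a clean binomial into genus and species; the
# real extract_genus_species() presumably handles more cases.
split_binomial <- function(name) {
  parts <- strsplit(trimws(name), "\\s+")[[1]]
  list(
    genus   = parts[1],
    species = if (length(parts) > 1) parts[2] else NA_character_
  )
}

split_binomial("Centropages typicus")
```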
## Dataset Coverage
- Temporal Scope: 1984-2024 (40 years of continuous monitoring)
- Sample Size: 1,506 zooplankton samples from the Gulf of Naples
- Taxonomic Diversity: 148 copepod species + 61 other zooplankton taxa
- Geographic Coverage: LTER-MareChiara station (40.81° N, 14.25° E), Tyrrhenian Sea
## Documentation
- Comprehensive package documentation with a `pkgdown` website
- Data processing workflow vignette
- Darwin Core integration guide
- EMODnet Biology compliance documentation
## Project Context
This package represents the technical “workbase” for the ZOOGoN-40Y project, funded by a €60,000 DTO-BioFlow FSTP grant under the EU Horizon Mission “Restore our Ocean & Waters by 2030”. The work is conducted by Stazione Zoologica Anton Dohrn (Naples, Italy) as part of the broader Digital Twin of the Ocean initiative.
## Standards Compliance
- Darwin Core Archive: International biodiversity data standard
- WoRMS Integration: World Register of Marine Species validation
- BODC NERC Vocabulary: Standardized measurement terminology
- ISO19115 Metadata: International metadata standards
- Creative Commons Licensing: CC-BY-NC for open science