Darwin Core Output for LTER-MareChiara Legacy Data
Source: vignettes/darwin-core.Rmd

Overview
Darwin Core is the international data standard required by OBIS, EMODnet Biology, and GBIF to publish biodiversity occurrence records. It structures data into three linked tables — sampling events, species occurrences, and associated measurements — allowing any repository worldwide to ingest and cross-reference the records. Converting the LTER-MareChiara dataset to this format is the step that makes it ready for submission to these repositories.
This vignette shows how the package builds those Darwin Core tables
from the LTER-MareChiara zooplankton dataset. The conversion is handled
by format_to_dc(), which downloads the merged tidy dataset
from SharePoint (produced by format_to_tidy()), adds the
required Darwin Core fields, and returns the three tables as an R list.
The function does not upload anything — that is handled by
format_to_DC_archive(), which calls
format_to_dc() internally and then assembles and uploads
the Darwin Core Archive zip. The function uses data that already
contains WoRMS LSIDs and does not perform additional taxonomic
queries.
Prerequisites
- inst/config.yml configured with SharePoint credentials and bucket names.
- The merged dataset must exist in SharePoint (produced by running format_to_tidy() earlier in the pipeline). This dataset covers 1984-2024 and contains preprocessed records with WoRMS LSIDs.
- Optional: set verbose = FALSE to silence log messages.
Expected Input Structure
format_to_dc() expects the tidy dataset to contain at
least:
- eventID: sampling event identifier (for example "mc_1").
- eventDate: sampling date (Date column).
- scientificname: WoRMS validated taxon name.
- lsid: WoRMS Life Science Identifier.
- Abundance: abundance measurement (individuals per cubic meter).
- lifeStage: life stage codes (f, m, j, fm, fmj).
Any extra columns are carried through and pivoted into the measurement table.
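As a sketch, a minimal input satisfying these requirements might look like the tibble below. The rows, the abundance values, and the LSID are illustrative placeholders only; real data come from format_to_tidy() via SharePoint.

```r
library(tibble)

# Illustrative rows only -- the LSID below is a placeholder, not a real WoRMS identifier
tidy_example <- tribble(
  ~eventID, ~eventDate,            ~scientificname,  ~lsid,                                       ~Abundance, ~lifeStage,
  "mc_1",   as.Date("1984-01-26"), "Acartia clausi", "urn:lsid:marinespecies.org:taxname:000000", 120.5,      "f",
  "mc_1",   as.Date("1984-01-26"), "Acartia clausi", "urn:lsid:marinespecies.org:taxname:000000",  80.0,      "m"
)
```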
Running the conversion
```r
# Convert tidy data to Darwin Core format
# (builds the tables in memory; nothing is uploaded)
format_to_dc()
```

The function downloads the merged dataset from SharePoint and builds the three Darwin Core tables (Event, Occurrence, eMoF) in memory. It does not upload anything to SharePoint.
To build a Darwin Core Archive and upload it to SharePoint, call
format_to_DC_archive(). This function calls
format_to_dc() internally, so in the automated pipeline it
is the only function you need to run:
```r
# Build DwC-A zip (runs format_to_dc() internally) and upload to SharePoint
format_to_DC_archive()
```

Darwin Core tables
Event
format_to_dc() builds one row per unique event with
fixed station metadata.
```r
# After downloading the DC tables from SharePoint:
# event_table %>%
#   select(eventID, eventDate, decimalLatitude, decimalLongitude, samplingProtocol) %>%
#   head()
```

Columns include:
- eventID, eventDate
- decimalLatitude = 40.81, decimalLongitude = 14.25
- locality, country, stateProvince, waterBody
- maximumDepthInMeters = 50, minimumDepthInMeters = 0
- samplingProtocol, sampleSizeValue (= 1), sampleSizeUnit (= "sample")
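A hedged sketch of how such an event table could be derived from the tidy data, using the fixed station metadata listed above. `tidy_data` stands in for the merged dataset that format_to_dc() downloads; this is an illustration, not the package's actual implementation.

```r
library(dplyr)

# One row per unique sampling event, with the fixed station constants
# from the column list above attached to every row
event_sketch <- tidy_data %>%
  distinct(eventID, eventDate) %>%
  mutate(
    decimalLatitude      = 40.81,
    decimalLongitude     = 14.25,
    minimumDepthInMeters = 0,
    maximumDepthInMeters = 50,
    sampleSizeValue      = 1,
    sampleSizeUnit       = "sample"
  )
```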
Occurrence
Occurrences are derived from the tidy data with an automatically
generated occurrenceID and presence flag.
- occurrenceStatus is set to "present" when Abundance > 0, otherwise "absent".
- scientificName and scientificNameID come directly from the input (no new validation is run).
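The derivation above can be sketched as follows. The occurrenceID scheme shown here is an assumption for illustration; the package generates its own identifiers.

```r
library(dplyr)

# Sketch: derive occurrence fields from the tidy data.
# The paste()-based occurrenceID is illustrative, not the package's scheme.
occurrence_sketch <- tidy_data %>%
  mutate(
    occurrenceID     = paste(eventID, row_number(), sep = "_"),
    occurrenceStatus = if_else(Abundance > 0, "present", "absent"),
    scientificName   = scientificname,   # carried through unchanged
    scientificNameID = lsid              # no new WoRMS validation
  )
```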
eMoF (Extended Measurement or Fact)
eMoF is the Darwin Core extension used to store quantitative measurements alongside each occurrence record. For the LTER-MareChiara dataset this includes abundance (individuals per m³), sex, and life stage. Each measurement is linked to a controlled-vocabulary identifier so that the values can be interpreted unambiguously by any international biodiversity database.
Mapping logic implemented in format_to_dc():
- Measurements with values f, m, fm, fmj are labelled as measurementType = "sex" with measurementTypeID = http://vocab.nerc.ac.uk/collection/P01/current/ENTSEX01/.
- Measurements with value j are labelled measurementType = "lifeStage" with measurementTypeID = http://vocab.nerc.ac.uk/collection/P01/current/LSTAGE01/.
- Abundance keeps measurementType = "Abundance" with measurementTypeID = https://vocab.nerc.ac.uk/collection/S06/current/S0600002/ and measurementUnitID = http://vocab.nerc.ac.uk/collection/P06/current/UPMM/.
- measurementValueID is set for the coded values above (S10/S11 URIs); other measurements keep NA.
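The value-to-type branching above can be sketched with a case_when(); this is a simplified illustration of the mapping logic, not the code inside format_to_dc().

```r
library(dplyr)

# Sketch of the eMoF type mapping described above:
# sex codes -> "sex", the juvenile code -> "lifeStage",
# everything else is treated as an abundance value
label_measurement <- function(value) {
  case_when(
    value %in% c("f", "m", "fm", "fmj") ~ "sex",
    value == "j"                        ~ "lifeStage",
    TRUE                                ~ "Abundance"
  )
}

label_measurement(c("f", "j", "12.5"))
#> "sex" "lifeStage" "Abundance"
```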
Processing summary
format_to_dc() prints a processing summary on
completion: the number of sampling events, occurrences, and measurements
produced, the date range covered, and the number of unique taxa
included.
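For orientation, a similar summary can be recomputed from the returned list itself. The element names event, occurrence, and emof are assumed here for illustration; check the actual names of the list returned by format_to_dc().

```r
dc <- format_to_dc()

# Element names below are an assumption about the returned list
message(
  nrow(dc$event), " events, ",
  nrow(dc$occurrence), " occurrences, ",
  nrow(dc$emof), " measurements; ",
  dplyr::n_distinct(dc$occurrence$scientificName), " unique taxa"
)
```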
Publishing (optional)
If you need to publish the archive to GBIF:
- Production: supply your real organization/installation keys and a public DwC-A URL to register_gbif_dataset().
- Test: use the GBIF-Test demo helper register_gbif_dataset_test() with the demo credentials and a public DwC-A URL to check the flow safely.