
feat: add Darwin Core Archive (DwC-A) export format #1131

Draft
mihow wants to merge 15 commits into main from feat/dwca-export

@mihow (Collaborator) commented Feb 11, 2026

Summary

  • Add a new dwca export format that produces a GBIF-compatible Darwin Core Archive (ZIP containing event.txt, occurrence.txt, meta.xml, eml.xml)
  • Uses Event Core architecture: events are the core entity, occurrences are an extension linked via eventID
  • Field mappings, metadata generators, and taxonomy helpers live in a new ami/exports/dwca.py module
  • Registered via the existing export registry — available immediately through the exports API

This is the foundation for publishing AMI biodiversity data to GBIF. The initial implementation produces static archive downloads. Future work will add richer metadata, multimedia extensions, and eventually automated publishing via IPT integration.

Design decisions

  • Direct TSV writing instead of DRF serializers — DwC fields are flat extractions, not nested API representations. Uses queryset.iterator(chunk_size=500) for memory efficiency.
  • Taxonomy from parents_json — Walks the pre-computed parent list to extract kingdom/phylum/class/order/family/genus, avoiding N+1 queries.
  • meta.xml generated from field definitions — The same field list used for TSV columns drives meta.xml generation, keeping them in sync.
  • basisOfRecord = "MachineObservation" for all records.
  • URN format IDs: urn:ami:event:{slug}:{id}, urn:ami:occurrence:{slug}:{id}.
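
The field-definition approach in the design decisions above can be sketched roughly as follows. This is a minimal illustration and not the code in ami/exports/dwca.py: `EVENT_FIELDS`, the `write_tsv` signature, and the `SimpleNamespace` stand-ins for events are assumptions. Only the `(term URI, header, getter)` tuple shape, the `lambda o, slug` getter signature, and the URN format come from this PR.

```python
import csv
import io
from types import SimpleNamespace

DWC = "http://rs.tdwg.org/dwc/terms/"

# Hypothetical field list: (term URI, column header, getter(obj, project_slug)).
EVENT_FIELDS = [
    (DWC + "eventID", "eventID", lambda e, slug: f"urn:ami:event:{slug}:{e.id}"),
    (DWC + "eventDate", "eventDate", lambda e, slug: e.date or ""),
]


def write_tsv(out, fields, records, slug):
    """Write a header row plus one tab-separated row per record; return the row count."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([header for _, header, _ in fields])
    count = 0
    # In Django, `records` would be queryset.iterator(chunk_size=500).
    for record in records:
        writer.writerow([getter(record, slug) for _, _, getter in fields])
        count += 1
    return count


events = [SimpleNamespace(id=1, date="2026-02-10"), SimpleNamespace(id=2, date=None)]
buffer = io.StringIO()
row_count = write_tsv(buffer, EVENT_FIELDS, events, "demo-project")
tsv_text = buffer.getvalue()
```

Because the header row and the row values both come from the same list, adding a field in one place changes both; the same list can also be passed to the meta.xml generator.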

Changes

| File | Change |
|------|--------|
| ami/exports/dwca.py | New: DwC field mappings, TSV writer, meta.xml/eml.xml generators, taxonomy helpers |
| ami/exports/format_types.py | Add DwCAExporter(BaseExporter) class |
| ami/exports/registry.py | Register "dwca" format |
| ami/exports/tests.py | 9 tests: ZIP structure, headers, row counts, meta.xml structure, referential integrity, taxonomy extraction, EML validity |

Test plan

Known issues to fix before merge

  1. Occurrences without events produce empty coreid — GBIF will reject orphaned extension rows. Need .filter(event__isnull=False) on the occurrence queryset.
  2. Occurrences without determinations produce empty scientificName — GBIF treats this as required. Need .filter(determination__isnull=False).
  3. individualCount semantics are wrong: detections_count is the number of bounding boxes across frames, not individuals. Each occurrence is one individual in AMI's model. Should emit 1 or omit the field.
  4. vernacularName lambda has ambiguous operator precedence — Works by accident but should be parenthesized.
  5. Temp files never cleaned up — event.txt, occurrence.txt, and the zip temp file leak on the worker.
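
For known issue 5, one possible cleanup shape is a try/finally around the packaging step. The sketch below uses only the standard library; `build_archive` and its `{arcname: text}` argument are hypothetical and do not mirror the exporter's real interface.

```python
import os
import tempfile
import zipfile


def build_archive(files):
    """Zip the given {arcname: text} mapping and remove intermediate temp files.

    Returns (zip_path, intermediate_paths). The caller owns zip_path and should
    delete it once the archive has been uploaded or served.
    """
    intermediate_paths = []
    zip_fd, zip_path = tempfile.mkstemp(suffix=".zip")
    os.close(zip_fd)
    try:
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
            for arcname, text in files.items():
                fd, path = tempfile.mkstemp(suffix=".txt")
                # Track the path before writing so it is cleaned up even on failure.
                intermediate_paths.append(path)
                with os.fdopen(fd, "w", encoding="utf-8") as handle:
                    handle.write(text)
                archive.write(path, arcname=arcname)
    except Exception:
        os.remove(zip_path)  # do not leave a partial archive behind
        raise
    finally:
        for path in intermediate_paths:
            if os.path.exists(path):
                os.remove(path)
    return zip_path, intermediate_paths
```

The intermediate TSV files are removed on both success and failure, while the ZIP is removed here only on failure; whether the worker deletes it after upload is a separate decision.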

Near follow-up (before real GBIF submission)

  • Apply project default filters to the occurrence queryset — without this, low-confidence ML determinations get published. Biggest data quality risk.
  • Add license field on events — GBIF requires a dataset license for reuse terms.
  • Add identifiedBy / dateIdentified — provenance for who/what made the determination.
  • Add associatedMedia — detection image URLs. For an image-based platform this is the primary evidence.
  • Runtime validation before packaging — check for missing required fields, orphaned references.
  • Multimedia extension file (multimedia.txt)
  • Species / taxa extension file
  • Upgrade EML to 2.2.0 — current code uses 2.1.1, GBIF recommends 2.2.0.
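
The "runtime validation before packaging" item could start as small as the check below. It is a sketch under assumptions: rows are represented as plain dicts keyed by column header, and `validate_archive_rows` is a hypothetical helper, not part of this PR.

```python
def validate_archive_rows(event_rows, occurrence_rows):
    """Return a list of problems; an empty list means the rows look packageable."""
    problems = []
    event_ids = {row["eventID"] for row in event_rows if row.get("eventID")}
    for index, row in enumerate(occurrence_rows, start=1):
        if not row.get("scientificName"):
            problems.append(f"occurrence row {index}: empty scientificName")
        event_id = row.get("eventID")
        if not event_id:
            problems.append(f"occurrence row {index}: empty eventID")
        elif event_id not in event_ids:
            # Extension row points at an event that is not in the core file.
            problems.append(f"occurrence row {index}: eventID {event_id!r} not in event.txt")
    return problems


events = [{"eventID": "urn:ami:event:demo:1"}]
occurrences = [
    {"eventID": "urn:ami:event:demo:1", "scientificName": "Noctua pronuba"},
    {"eventID": "", "scientificName": "Noctua pronuba"},
    {"eventID": "urn:ami:event:demo:9", "scientificName": ""},
]
problems = validate_archive_rows(events, occurrences)
```

Returning a list rather than raising on the first problem lets the export job report every issue at once.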

Eventual follow-up

  • EML geographic/temporal coverage computed from actual data (bounding box, date range)
  • country, stateProvince, locality on events (requires reverse geocoding or Site model fields)
  • coordinateUncertaintyInMeters
  • institutionCode, collectionCode (project-level settings)
  • scientificNameAuthorship from Taxon.author
  • eventType field
  • GBIF Data Validator automated integration test
  • IPT server integration / acting as IPT endpoint for GBIF crawling

Summary by CodeRabbit

  • New Features

    • Added Darwin Core Archive (DwC-A) export format for standardized biodiversity data sharing.
    • Exports now package event and occurrence records with complete metadata files into a single ZIP archive.
  • Documentation

    • Added DwC-A format reference and export framework technical documentation.

mihow and others added 4 commits February 11, 2026 11:45
Add planning document for Darwin Core Archive export format, export
framework technical reference, DwC-A format reference with field
mappings, and downloaded DwC terms quick reference from TDWG.

Co-Authored-By: Claude <noreply@anthropic.com>
Add Event Core + Occurrence Extension DwC-A exporter that produces a
GBIF-compatible ZIP containing event.txt, occurrence.txt, meta.xml,
and eml.xml. Events are the core entity with occurrences linked via
eventID foreign key.

Key design decisions:
- Direct TSV writing with iterator(chunk_size=500) instead of DRF serializers
- Taxonomy hierarchy extracted from parents_json to avoid N+1 queries
- meta.xml generated from the same field definitions used for TSV columns
- basisOfRecord = "MachineObservation" for all records
- URN format IDs: urn:ami:event:{slug}:{id}, urn:ami:occurrence:{slug}:{id}

Co-Authored-By: Claude <noreply@anthropic.com>
Test ZIP structure, event/occurrence headers and row counts, meta.xml
core/extension structure, referential integrity between events and
occurrences, taxonomy hierarchy extraction from parents_json, specific
epithet parsing, and EML metadata validity.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
netlify bot commented Feb 11, 2026

Deploy Preview for antenna-preview canceled.

| Name | Link |
|------|------|
| 🔨 Latest commit | c8aadb7 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/antenna-preview/deploys/698d322b1918830008eff4fe |

netlify bot commented Feb 11, 2026

Deploy Preview for antenna-ssec canceled.

| Name | Link |
|------|------|
| 🔨 Latest commit | c8aadb7 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/antenna-ssec/deploys/698d322b1bf45a0008d103f1 |

coderabbitai bot (Contributor) commented Feb 11, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

This pull request introduces a complete Darwin Core Archive (DwC-A) export format for AMI data. It adds a new exporter class, field mapping definitions, helper utilities for data transformation, metadata generation, registry integration, comprehensive test coverage, and technical documentation.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Planning & Framework Docs: .agents/planning/dwca-export-plan.md, docs/claude/export-framework.md | High-level planning document for DwC-A feature and technical reference for the Export Framework architecture, registry pattern, and runtime flow. |
| DwC-A Core Implementation: ami/exports/dwca.py | Implements DwC field mappings for Event and Occurrence, helper functions for date/coordinate/taxonomy formatting, TSV export logic, meta.xml and eml.xml generation, and ZIP archiving. |
| Exporter Integration: ami/exports/format_types.py, ami/exports/registry.py | Adds DwCAExporter class with get_queryset, get_events_queryset, and export methods; registers "dwca" format in ExportRegistry (note: DwCAExporter appears duplicated in format_types.py). |
| Test Suite: ami/exports/tests.py | Comprehensive DwCAExportTest covering exporter registration, ZIP contents validation, header correctness, row counts, referential integrity, taxonomy extraction, and EML metadata. |
| DwC-A Reference Documentation: docs/claude/dwca-format-reference.md | Detailed DwC-A format specification including archive structure, meta.xml/EML schemas, DwC term mappings, and validation rules. |

Sequence Diagram

sequenceDiagram
    participant Client
    participant API as DataExport API
    participant Exporter as DwCAExporter
    participant DB as Database
    participant Files as File System
    participant Storage as ZIP Archive

    Client->>API: POST export (format=dwca)
    API->>Exporter: initialize & export()
    Exporter->>DB: get_events_queryset()
    DB-->>Exporter: Event records
    Exporter->>Files: write_tsv(event.txt)
    Files-->>Exporter: event.txt path
    Exporter->>DB: get_queryset()
    DB-->>Exporter: Occurrence records
    Exporter->>Files: write_tsv(occurrence.txt)
    Files-->>Exporter: occurrence.txt path
    Exporter->>Exporter: generate_meta_xml()
    Exporter->>Exporter: generate_eml_xml()
    Exporter->>Storage: create_dwca_zip(event.txt, occurrence.txt, meta.xml, eml.xml)
    Storage-->>Exporter: dwca.zip path
    Exporter-->>API: return ZIP path
    API-->>Client: export complete with download URL

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A Darwin's dream in archives bright,
Events and occurrences packed just right,
Metadata nestled in XML's fold,
DwC-A stories ready to be told! 📦

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 79.31%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Title check | ✅ Passed | The PR title 'feat: add Darwin Core Archive (DwC-A) export format' clearly and concisely summarizes the main change: introducing a new export format for DwC-A archives. |
| Description check | ✅ Passed | PR description follows the template structure with Summary, Design decisions, Changes table, Test plan, Known issues, and Near/Eventual follow-ups. All required sections are present and substantive. |


Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow marked this pull request as ready for review February 11, 2026 20:34
Copilot AI review requested due to automatic review settings February 11, 2026 20:34
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 12

🤖 Fix all issues with AI agents
In `@ami/exports/dwca.py`:
- Around line 132-142: The _format_duration function must guard against negative
durations: check event.start and event.end and compute delta = event.end -
event.start, then if delta.total_seconds() < 0 return an empty string (or
optionally log a warning) instead of proceeding; update the logic around
total_seconds / divmod so negative values are not formatted (use the check on
delta.total_seconds() before computing hours/minutes and then format only for
non-negative durations).
- Around line 62-63: The taxonRank mapping lambda can raise AttributeError when
o.determination.rank is None; update the DWC + "taxonRank" tuple's lambda (the
function referencing o.determination.rank.lower()) to guard by checking both
o.determination and o.determination.rank before calling .lower(), returning an
empty string (or appropriate default) when rank is falsy.
- Line 87: The mapping that emits DWC + "individualCount" currently uses
getattr(o, "detections_count", 0) which is semantically wrong; change the tuple
entry for DWC + "individualCount" so it always emits the string "1" (i.e.,
replace the lambda that references detections_count with one that returns "1")
or remove the entire tuple if you prefer omission; update the tuple containing
DWC + "individualCount" and the lambda (the anonymous function referencing
getattr(o, "detections_count", 0)) accordingly.
- Around line 75-79: The ternary expression in the tuple entry with key
"vernacularName" is ambiguous; update the lambda (lambda o, slug:
o.determination.common_name_en or "" if o.determination else "") to add
parentheses around the ternary's true branch so it reads like
(o.determination.common_name_en or "") if o.determination else "" to make the
intent explicit and easier to read.
- Around line 309-312: The EML export is inserting the owner's email into the
<surName> element (project.owner.email) which leaks PII and is semantically
wrong; update the block that creates creator -> individualName -> surName to
avoid using project.owner.email: prefer using a non-PII display name (e.g.,
project.owner.get_display_name() or project.owner.name) if available, and if no
safe display name exists omit creating the individualName/surName elements
entirely (i.e., only create creator when a valid non-email name/contact is
present). Ensure references to project.owner, creator, individualName, and
surName are updated accordingly.
- Around line 288-290: Change the naive timestamp to a timezone-aware one:
replace the use of datetime.datetime.now() when assigning now (used alongside
project_slug/slugify) with Django's timezone-aware function (timezone.now()),
and add the appropriate import from django.utils import timezone at the top of
the module so now is timezone-aware when used in exports/dwca.py.

In `@ami/exports/format_types.py`:
- Around line 196-208: The Occurrence queryset in get_queryset currently
includes occurrences lacking related event or determination, producing empty
coreid/scientificName; update get_queryset (the Occurrence.objects.valid() chain
in the get_queryset method) to filter out occurrences with null event and null
determination (e.g., .filter(event__isnull=False, determination__isnull=False))
before the select_related/.with_identifications() calls so only occurrences with
both event and determination are exported.
- Around line 237-264: The temp files created for the DwC-A export
(event_file.name, occ_file.name and the produced zip_path) are never cleaned up;
wrap the export sequence in a try/finally around the code that calls
get_events_queryset, write_tsv, generate_meta_xml, generate_eml_xml and
create_dwca_zip so that in the finally block you remove event_file.name,
occ_file.name and (on error or success) delete any intermediate temp zip if
appropriate; ensure update_export_stats and the return of zip_path still occur
on success but that file removal happens only for the intermediate files (or
conditionally for the zip if you decide not to persist it).

In `@ami/exports/tests.py`:
- Around line 477-502: The test test_taxonomy_hierarchy_extraction can pass
vacuously when taxon.parents_json is empty; update the test to guarantee
parents_json is populated before calling _get_rank_from_parents: after obtaining
taxon (from occurrence.determination) and calling
taxon.save(update_calculated_fields=True) and taxon.refresh_from_db(), add an
assertion that taxon.parents_json is truthy (or else set up explicit test data/a
fixture so parents_json contains a parent chain) so the loop over ranks actually
executes and _get_rank_from_parents is exercised.
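
The vacuous-pass risk described for test_taxonomy_hierarchy_extraction can be illustrated with a stand-in helper. The structure of parents_json is an assumption here (a list of dicts with "rank" and "name" keys), and `rank_from_parents` is a hypothetical simplification of `_get_rank_from_parents`, whose real signature is not shown in this thread.

```python
def rank_from_parents(parents, rank):
    """Return the name of the first parent with the given rank, or "" if absent."""
    for parent in parents or []:
        if str(parent.get("rank", "")).lower() == rank:
            return parent.get("name", "")
    return ""


def check_hierarchy(parents):
    """Loop-style check, as in the test; returns how many ranks were actually verified."""
    checked = 0
    for parent in parents:
        assert rank_from_parents(parents, parent["rank"].lower()) == parent["name"]
        checked += 1
    return checked


populated = [
    {"rank": "ORDER", "name": "Lepidoptera"},
    {"rank": "FAMILY", "name": "Noctuidae"},
]
checked_when_empty = check_hierarchy([])
checked_when_populated = check_hierarchy(populated)
```

With an empty list the loop body never runs, so the check "passes" without exercising the helper, which is why an explicit truthiness assertion on parents_json before the loop is worthwhile.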

In `@docs/claude/dwca-format-reference.md`:
- Around line 94-113: The documentation shows EML 2.2.0 namespace
("https://eml.ecoinformatics.org/eml-2.2.0") but the implementation currently
emits EML 2.1.1 (literal "eml://ecoinformatics.org/eml-2.1.1"); either make the
docs reflect 2.1.1 or update the implementation to 2.2.0 so they match. To fix:
search the codebase for the literal "eml://ecoinformatics.org/eml-2.1.1" (and
any code that constructs the eml:eml root or schemaLocation) and change the
namespace and schemaLocation to the 2.2.0 values used in the doc, or
alternatively update the doc snippet to the 2.1.1 namespace and schemaLocation;
ensure the change is applied to the EML root construction logic (the
function/method that builds the eml:eml element) and the documentation fragment
so both use the same namespace string.

In `@docs/claude/export-framework.md`:
- Line 13: Update the documentation table entry that currently lists
JSONExporter and CSVExporter to also include the new DwCAExporter: edit the row
showing `ami/exports/format_types.py` so it mentions Concrete exporters
`JSONExporter`, `CSVExporter`, and `DwCAExporter` (use the exact class name
DwCAExporter to match the implementation).
- Around line 68-71: Update the example output of
ExportRegistry.get_supported_formats() to include the newly registered "dwca"
format so the snippet accurately reflects current registrations; adjust the
example list returned by ExportRegistry.get_supported_formats() (used alongside
ExportRegistry.register and ExportRegistry.get_exporter in the snippet) to
include "dwca" among the formats (e.g., ["occurrences_api_json",
"occurrences_simple_csv", "dwca"]).
🧹 Nitpick comments (3)
docs/claude/dwca-format-reference.md (1)

9-16: Add a language identifier to the fenced code block.

The linter flags this code block for missing a language specifier (MD040). Since it's a file tree diagram, use ```text.

ami/exports/tests.py (1)

329-341: Each test re-runs the full export pipeline independently.

Every test method calls _run_export(), which creates a DataExport, runs the entire export pipeline, and writes to storage. For 7 tests that call this helper, that's significant overhead. Consider using setUpClass or caching the export result in setUp to run the export once and share the ZIP across structural validation tests.

This is fine for correctness — just a test performance observation.

ami/exports/dwca.py (1)

286-286: Unused events_queryset parameter.

events_queryset is accepted but never used in generate_eml_xml. Either remove it or use it to populate temporal/geographic coverage metadata (the latter is listed as a follow-up). For now, removing the parameter or marking it for future use avoids confusion.

Proposed fix — remove unused parameter
-def generate_eml_xml(project, events_queryset=None) -> str:
+def generate_eml_xml(project) -> str:

This also requires updating the call site in ami/exports/format_types.py (line 258):

-        eml_xml = generate_eml_xml(self.project, events_qs)
+        eml_xml = generate_eml_xml(self.project)

Comment on lines 75 to 79
    (
        DWC + "vernacularName",
        "vernacularName",
        lambda o, slug: o.determination.common_name_en or "" if o.determination else "",
    ),

🛠️ Refactor suggestion | 🟠 Major

vernacularName lambda: add parentheses for clarity (known issue #4).

While Python's operator precedence makes this technically correct (or binds tighter than the ternary), the expression is confusing to read and a common source of bugs. Adding explicit parentheses removes ambiguity.

Proposed fix
     (
         DWC + "vernacularName",
         "vernacularName",
-        lambda o, slug: o.determination.common_name_en or "" if o.determination else "",
+        lambda o, slug: (o.determination.common_name_en or "") if o.determination else "",
     ),
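
The precedence claim can be checked mechanically by comparing parse trees. In this small standalone sketch, `name` and `det` are placeholder identifiers standing in for `o.determination.common_name_en` and `o.determination`.

```python
import ast

unparenthesized = 'name or "" if det else ""'
parenthesized = '(name or "") if det else ""'
misread = 'name or ("" if det else "")'


def tree(source):
    """Dump the expression's AST; parentheses themselves leave no trace in the tree."""
    return ast.dump(ast.parse(source, mode="eval"))


# The original expression already parses as the parenthesized fix...
same_as_fix = tree(unparenthesized) == tree(parenthesized)
# ...and not as the reading a hurried reviewer might assume.
same_as_misread = tree(unparenthesized) == tree(misread)
```

So the proposed change is purely a readability fix; it does not alter behavior.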
🧰 Tools
🪛 Ruff (0.15.0)

[warning] 78-78: Unused lambda argument: slug

(ARG005)


Comment on lines 132 to 142
def _format_duration(event) -> str:
    """Format event duration as human-readable string."""
    if not event.start or not event.end:
        return ""
    delta = event.end - event.start
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, _ = divmod(remainder, 60)
    if hours > 0:
        return f"{hours}h {minutes}m"
    return f"{minutes}m"

⚠️ Potential issue | 🟡 Minor

Negative durations are not guarded against.

If event.end < event.start (data inconsistency), total_seconds will be negative. Because Python's divmod floors toward negative infinity, hours becomes negative while the remainder stays non-negative, so a span of minus 1h 30m is formatted as "30m": a plausible-looking but wrong value. Consider returning an empty string or logging a warning for negative durations.

Proposed fix
 def _format_duration(event) -> str:
     """Format event duration as human-readable string."""
     if not event.start or not event.end:
         return ""
     delta = event.end - event.start
     total_seconds = int(delta.total_seconds())
+    if total_seconds < 0:
+        return ""
     hours, remainder = divmod(total_seconds, 3600)
     minutes, _ = divmod(remainder, 60)
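
To see what the unguarded function actually returns for a reversed interval, here is a standalone re-creation of the logic. The two function names are hypothetical, and they take start/end directly instead of an event object. Python's divmod floors toward negative infinity, so the hours value goes negative while the remainder stays non-negative.

```python
import datetime


def format_duration_unguarded(start, end):
    """Same arithmetic as the original function, without the negative guard."""
    total_seconds = int((end - start).total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, _ = divmod(remainder, 60)
    return f"{hours}h {minutes}m" if hours > 0 else f"{minutes}m"


def format_duration_guarded(start, end):
    """Variant with the proposed guard: negative spans produce an empty string."""
    total_seconds = int((end - start).total_seconds())
    if total_seconds < 0:
        return ""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, _ = divmod(remainder, 60)
    return f"{hours}h {minutes}m" if hours > 0 else f"{minutes}m"


start = datetime.datetime(2026, 2, 11, 22, 0)
reversed_end = start - datetime.timedelta(hours=1, minutes=30)
normal_end = start + datetime.timedelta(hours=1, minutes=30)

floor_parts = divmod(-5400, 3600)  # hours and remainder for minus 1h 30m
unguarded_result = format_duration_unguarded(start, reversed_end)
guarded_result = format_duration_guarded(start, reversed_end)
normal_result = format_duration_guarded(start, normal_end)
```

The unguarded version silently emits a positive-looking duration for a reversed interval, which is harder to spot downstream than an empty value.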

Comment on lines 288 to 290

project_slug = slugify(project.name)
now = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")

⚠️ Potential issue | 🟡 Minor

Use timezone-aware datetime.

datetime.datetime.now() produces a naive datetime. In a Django project with USE_TZ=True, this is inconsistent and could produce incorrect timestamps. Use Django's timezone.now().

Proposed fix
+    from django.utils import timezone
+
     project_slug = slugify(project.name)
-    now = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
+    now = timezone.now().strftime("%Y-%m-%dT%H:%M:%S")
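
The naive/aware distinction can be shown with the standard library alone. Here `datetime.now(timezone.utc)` stands in for Django's `timezone.now()`, which returns an aware datetime when USE_TZ=True. Note that the format string used in the PR has no `%z`, so the formatted value carries no offset either way; using `isoformat` is one option if the offset should be visible.

```python
import datetime

naive = datetime.datetime.now()                       # no tzinfo: local wall-clock time
aware = datetime.datetime.now(datetime.timezone.utc)  # tzinfo attached: UTC

fmt = "%Y-%m-%dT%H:%M:%S"  # the format string from the PR
without_offset = aware.strftime(fmt)
with_offset = aware.isoformat(timespec="seconds")
```

With an aware datetime the string at least represents a known zone (UTC here), even when the offset itself is not printed.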

Comment on lines 94 to 113
```xml
<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd"
packageId="urn:ami:dataset:{project_id}" system="AMI">
<dataset>
<title>{project.name}</title>
<creator>
<organizationName>{project.owner or institution}</organizationName>
</creator>
<abstract>
<para>{project.description}</para>
</abstract>
<intellectualRights>
<para>License information here</para>
</intellectualRights>
</dataset>
</eml:eml>
```

⚠️ Potential issue | 🟡 Minor

EML version mismatch between documentation and implementation.

This reference doc shows EML 2.2.0 namespace (https://eml.ecoinformatics.org/eml-2.2.0), but the actual implementation in ami/exports/dwca.py (line 293) uses EML 2.1.1 (eml://ecoinformatics.org/eml-2.1.1). The planning doc acknowledges upgrading to 2.2.0 as a follow-up, but this reference should either match the current implementation or clearly note the discrepancy.
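
A check like the one below could catch this kind of drift in a test. It is a sketch: `eml_namespace` is a hypothetical helper and the sample document is made up, while the two namespace strings are the ones quoted in this comment.

```python
import xml.etree.ElementTree as ET

EML_2_1_1 = "eml://ecoinformatics.org/eml-2.1.1"
EML_2_2_0 = "https://eml.ecoinformatics.org/eml-2.2.0"


def eml_namespace(xml_text):
    """Return the namespace URI of the document's root element, or "" if none."""
    root = ET.fromstring(xml_text)
    # ElementTree renders qualified names as "{namespace}localname".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return ""


sample = (
    f'<eml:eml xmlns:eml="{EML_2_1_1}" packageId="urn:ami:dataset:1" system="AMI">'
    "<dataset><title>Demo</title></dataset>"
    "</eml:eml>"
)
detected = eml_namespace(sample)
```

Asserting the detected namespace against a single shared constant would keep the generator, the tests, and the reference doc pointing at the same version.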


Comment on lines 68 to 71
ExportRegistry.register("format_name")(ExporterClass)
ExportRegistry.get_exporter("format_name") # → ExporterClass
ExportRegistry.get_supported_formats() # → ["occurrences_api_json", "occurrences_simple_csv"]
```

⚠️ Potential issue | 🟡 Minor

Example output of get_supported_formats() is incomplete.

The example on line 70 should include "dwca" to reflect the new registration.

Proposed fix
-ExportRegistry.get_supported_formats()       # → ["occurrences_api_json", "occurrences_simple_csv"]
+ExportRegistry.get_supported_formats()       # → ["occurrences_api_json", "occurrences_simple_csv", "dwca"]

Copilot AI (Contributor) left a comment

Pull request overview

Adds a new export format (dwca) to generate a GBIF-compatible Darwin Core Archive (DwC-A) ZIP using an Event core (event.txt) with an Occurrence extension (occurrence.txt), along with generated meta.xml and eml.xml.

Changes:

  • Introduces ami/exports/dwca.py with DwC field mappings, TSV writing, and meta/eml XML generation.
  • Adds DwCAExporter to the existing export framework and registers it under "dwca".
  • Adds a new DwC-A-focused test suite validating ZIP contents and basic structure.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 14 comments.

| File | Description |
|------|-------------|
| ami/exports/dwca.py | New DwC-A field definitions and archive/metadata generators. |
| ami/exports/format_types.py | Adds DwCAExporter implementation to orchestrate TSV generation + ZIP packaging. |
| ami/exports/registry.py | Registers the new "dwca" export format. |
| ami/exports/tests.py | Adds end-to-end tests covering ZIP structure, TSV headers/rows, and XML parsing. |
| docs/claude/export-framework.md | New internal technical reference for the export framework. |
| docs/claude/dwca-format-reference.md | New DwC-A format reference documentation. |
| .agents/planning/dwca-export-plan.md | Planning/design notes and known issues list for DwC-A export. |


|------|---------|
| `ami/exports/base.py` | `BaseExporter` ABC - all exporters inherit from this |
| `ami/exports/registry.py` | `ExportRegistry` - maps format strings to exporter classes |
| `ami/exports/format_types.py` | Concrete exporters: `JSONExporter`, `CSVExporter` |
Copilot AI Feb 11, 2026

The “Concrete exporters” bullet lists only JSONExporter and CSVExporter, but this PR adds DwCAExporter in ami/exports/format_types.py. Update this reference so the technical doc matches the supported exporters.

Suggested change
| `ami/exports/format_types.py` | Concrete exporters: `JSONExporter`, `CSVExporter` |
| `ami/exports/format_types.py` | Concrete exporters: `JSONExporter`, `CSVExporter`, `DwCAExporter` |

Copilot uses AI. Check for mistakes.
Comment on lines 196 to 208
def get_queryset(self):
    """Return the occurrence queryset (used by BaseExporter for record count)."""
    return (
        Occurrence.objects.valid()  # type: ignore[union-attr]
        .filter(project=self.project)
        .select_related(
            "determination",
            "event",
            "deployment",
        )
        .with_detections_count()
        .with_identifications()
    )
Copilot AI Feb 11, 2026

DwC-A occurrence exports can currently include rows whose event or determination is missing, which produces empty coreid/eventID and scientificName values in occurrence.txt (the GBIF validator rejects these). Filter the queryset to exclude occurrences where either the event or the determination is null, and adjust the expected counts in tests accordingly.

    progress_callback=lambda n: self.update_job_progress(n),
)
logger.info(f"DwC-A: wrote {occ_count} occurrences")

Copilot AI Feb 11, 2026

Job progress for DwC-A exports is only updated every 500 occurrences; for smaller exports progress may remain at 0% even though the job completes. Consider calling update_job_progress(occ_count) once at the end (guarding against total_records == 0 to avoid division by zero in BaseExporter.update_job_progress()).

Suggested change
# Ensure final progress update, even if the callback was never invoked.
# Guard against total_records == 0 to avoid division by zero in BaseExporter.update_job_progress().
if getattr(self, "total_records", 0):
    self.update_job_progress(occ_count)
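
To illustrate why the guard matters, here is a minimal sketch using a hypothetical `ProgressTracker` class as a stand-in for the exporter. The internals of `BaseExporter.update_job_progress()` are not shown in this thread, so the `processed / total_records` formula and the `finish` helper are assumptions made only for the example.

```python
class ProgressTracker:
    """Hypothetical stand-in for an exporter that reports fractional progress."""

    def __init__(self, total_records: int) -> None:
        self.total_records = total_records
        self.progress = 0.0

    def update_job_progress(self, processed: int) -> None:
        # Assumed formula: raises ZeroDivisionError when total_records == 0.
        self.progress = processed / self.total_records

    def finish(self, processed: int) -> None:
        # Final update so exports smaller than the callback interval
        # still report completion; skipped entirely for empty exports.
        if self.total_records:
            self.update_job_progress(processed)


small = ProgressTracker(total_records=42)
small.finish(42)

empty = ProgressTracker(total_records=0)
empty.finish(0)  # guarded: no division is attempted
```

With the guard in place, a 42-record export ends at full progress even though a 500-record callback interval would never have fired, and an empty export leaves progress untouched instead of raising.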

Comment on lines 240 to 250
# Column 0 is the id (eventID)
id_elem = ET.SubElement(core, "id")
id_elem.set("index", "0")

for i, (term_uri, header, _) in enumerate(event_fields):
    if i == 0:
        continue  # Already declared as <id>
    field = ET.SubElement(core, "field")
    field.set("index", str(i))
    field.set("term", term_uri)

Copilot AI Feb 11, 2026

In meta.xml generation, column 0 is declared as <id index="0"/> but the corresponding <field index="0" term="...eventID"/> mapping is skipped. Many DwC-A consumers/validators expect the ID column to also be mapped to the Darwin Core term (as in the DwC Text Guide examples). Include a <field> mapping for index 0 instead of skipping it.
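
One possible shape for that change is sketched below with the standard-library `xml.etree.ElementTree`. The `EVENT_FIELDS` list and the `build_core` helper are hypothetical and use two-item tuples for brevity; the PR's actual field definitions carry a third element and more terms.

```python
import xml.etree.ElementTree as ET

DWC = "http://rs.tdwg.org/dwc/terms/"

# Hypothetical (term_uri, header) pairs; the PR's real field list is longer.
EVENT_FIELDS = [
    (DWC + "eventID", "eventID"),
    (DWC + "eventDate", "eventDate"),
    (DWC + "samplingProtocol", "samplingProtocol"),
]


def build_core(fields):
    """Build a <core> element that declares <id> and maps every column."""
    core = ET.Element("core", {"rowType": DWC + "Event"})
    ET.SubElement(core, "id", {"index": "0"})
    for i, (term_uri, _header) in enumerate(fields):
        # No `continue` for i == 0: the id column also gets its term mapping.
        ET.SubElement(core, "field", {"index": str(i), "term": term_uri})
    return core


core = build_core(EVENT_FIELDS)
xml_text = ET.tostring(core, encoding="unicode")
```

The resulting `<core>` contains both `<id index="0"/>` and a `<field index="0">` entry whose term is the eventID URI, so a consumer can resolve the ID column either way.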

mihow and others added 2 commits February 11, 2026 13:33
- Guard taxonRank against None rank (AttributeError on .lower())
- Guard _get_rank_from_parents against None det_rank
- Fix vernacularName ternary precedence with explicit parentheses
- Change individualCount to emit "1" (each occurrence = 1 individual,
  not detections_count which counts bounding boxes)
- Guard _format_duration against negative durations

Co-Authored-By: Claude <noreply@anthropic.com>
- Filter out occurrences with null event or determination from DwC-A
  export queryset (GBIF rejects empty coreid/scientificName)
- Replace project.owner.email with project.owner.name in EML creator
  element to avoid leaking PII in downloadable archives
- Only emit individualName when owner has a name set

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow marked this pull request as draft February 11, 2026 21:51
mihow and others added 8 commits February 11, 2026 15:26
- Wrap DwC-A export in try/finally to clean up intermediate temp files
- Use timezone.now() instead of naive datetime.datetime.now()
- Use full EML schemaLocation URL for GBIF validation compatibility
- Remove unused events_queryset parameter from generate_eml_xml
- Simplify progress_callback lambda to direct method reference

Co-Authored-By: Claude <noreply@anthropic.com>
- Update occurrence row count test to match null-filtered queryset
- Assert parents_json is populated in taxonomy hierarchy test
- Use settings.MEDIA_URL instead of hardcoded "/media/" in test helper
- Add DwCAExporter to export-framework.md file table and registry example
- Fix EML version in dwca-format-reference.md to match implementation (2.1.1)

Co-Authored-By: Claude <noreply@anthropic.com>
…pdate

- Map all columns (including index 0) to DwC term URIs in meta.xml
  so GBIF validators can resolve both <id>/<coreid> and <field> entries
- Change fieldsEnclosedBy from empty string to double-quote character
  for stricter parser compatibility
- Add final progress update after TSV writing so small exports (<500
  records) report completion instead of staying at 0%

Co-Authored-By: Claude <noreply@anthropic.com>
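
For context on the `fieldsEnclosedBy` change: once meta.xml declares a double-quote enclosure, values containing tabs or quotes should be written quoted so that a parser honouring the declaration reads them back intact. The sketch below uses Python's `csv` module as an assumption; the PR's actual TSV writer is not shown here, and the sample row is invented.

```python
import csv
import io

rows = [
    ["eventID", "locality"],
    # Invented value containing both a quote and a tab.
    ["urn:ami:event:demo:1", 'Trap "A"\tnorth side'],
]

buffer = io.StringIO()
writer = csv.writer(
    buffer,
    delimiter="\t",
    quotechar='"',
    quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
    lineterminator="\n",
)
writer.writerows(rows)

# Read back with the same dialect that meta.xml would advertise.
parsed = list(csv.reader(io.StringIO(buffer.getvalue()), delimiter="\t", quotechar='"'))
```

Reading the buffer back with the same delimiter and quote character returns the original rows, embedded tab included.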
Run the export pipeline once in setUpClass and share the ZIP across all
structural validation tests instead of re-running per test method. This
reduces test time from ~7 export runs to 1.

Co-Authored-By: Claude <noreply@anthropic.com>
…occurrences in DwC-A

Remove get_filter_backends() override that returned [], allowing DwCAExporter
to inherit BaseExporter's OccurrenceCollectionFilter. Update get_events_queryset()
to derive events from self.queryset instead of fetching all project events,
preventing orphaned events when collection_id filtering is active.

Co-Authored-By: Claude <noreply@anthropic.com>
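
The idea behind deriving events from the filtered occurrences can be sketched in plain Python. The `Occ` record and `events_for` helper are hypothetical and stand in for the Django querysets, which are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Occ:
    """Hypothetical minimal occurrence: its own id and the id of its event."""

    id: int
    event_id: int


def events_for(occurrences, all_event_ids):
    """Keep only events referenced by at least one exported occurrence."""
    referenced = {o.event_id for o in occurrences}
    return sorted(e for e in all_event_ids if e in referenced)


project_events = [10, 11, 12]
# Occurrences left after a filter such as collection_id has been applied.
filtered_occurrences = [Occ(1, 10), Occ(2, 10), Occ(3, 12)]

exported_events = events_for(filtered_occurrences, project_events)
orphans = set(exported_events) - {o.event_id for o in filtered_occurrences}
```

Event 11 has no surviving occurrence, so it is left out of the event file, and every exported event is referenced by at least one occurrence row.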
…ssertion

Add test_dwca_export_with_collection_filter that verifies filtered exports
produce correct occurrence/event counts and referential integrity. Update
test_event_headers_and_row_count to expect events derived from occurrences
rather than all project events.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Add API methods, request/response format, filter system, DwC-A specifics,
job integration, and file lifecycle details. Remove separate export-system.md.

Co-Authored-By: Claude <noreply@anthropic.com>