feat: add Darwin Core Archive (DwC-A) export format #1131
Add planning document for Darwin Core Archive export format, export framework technical reference, DwC-A format reference with field mappings, and downloaded DwC terms quick reference from TDWG. Co-Authored-By: Claude <noreply@anthropic.com>
Add Event Core + Occurrence Extension DwC-A exporter that produces a
GBIF-compatible ZIP containing event.txt, occurrence.txt, meta.xml,
and eml.xml. Events are the core entity with occurrences linked via
eventID foreign key.
Key design decisions:
- Direct TSV writing with iterator(chunk_size=500) instead of DRF serializers
- Taxonomy hierarchy extracted from parents_json to avoid N+1 queries
- meta.xml generated from the same field definitions used for TSV columns
- basisOfRecord = "MachineObservation" for all records
- URN format IDs: urn:ami:event:{slug}:{id}, urn:ami:occurrence:{slug}:{id}
Co-Authored-By: Claude <noreply@anthropic.com>
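The "direct TSV writing" and "same field definitions" decisions above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `EVENT_FIELDS`, `write_tsv`, and the sample records are hypothetical, and plain objects stand in for a Django queryset.

```python
import csv
import io
from types import SimpleNamespace

DWC = "http://rs.tdwg.org/dwc/terms/"

# Hypothetical field definitions: (term URI, column header, getter).
EVENT_FIELDS = [
    (DWC + "eventID", "eventID", lambda e, slug: f"urn:ami:event:{slug}:{e.id}"),
    (DWC + "eventDate", "eventDate", lambda e, slug: e.date or ""),
]


def write_tsv(out, fields, records, slug):
    """Write a header row plus one row per record; return the row count."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([header for _, header, _ in fields])
    count = 0
    # With Django, `records` would be queryset.iterator(chunk_size=500).
    for record in records:
        writer.writerow([getter(record, slug) for _, _, getter in fields])
        count += 1
    return count


buffer = io.StringIO()
events = [SimpleNamespace(id=1, date="2024-06-01"), SimpleNamespace(id=2, date=None)]
rows_written = write_tsv(buffer, EVENT_FIELDS, events, "demo")
tsv_text = buffer.getvalue()
```

Because the header row and the data rows are both derived from the same list, a column added to the list appears in both places without further changes.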
Test ZIP structure, event/occurrence headers and row counts, meta.xml core/extension structure, referential integrity between events and occurrences, taxonomy hierarchy extraction from parents_json, specific epithet parsing, and EML metadata validity. Co-Authored-By: Claude <noreply@anthropic.com>
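As a rough illustration of the taxonomy extraction and specific epithet parsing these tests cover, the sketch below assumes that `parents_json` is a list of entries with `rank` and `name` keys. That shape, and the helper names `rank_from_parents` and `specific_epithet`, are assumptions made for the example and may not match the PR.

```python
def rank_from_parents(parents, rank):
    """Return the name of the first parent whose rank matches, else ""."""
    wanted = rank.lower()
    for parent in parents or []:
        if str(parent.get("rank", "")).lower() == wanted:
            return parent.get("name", "")
    return ""


def specific_epithet(scientific_name):
    """Return the second word of a binomial name, or "" if there is none."""
    parts = (scientific_name or "").split()
    return parts[1] if len(parts) >= 2 else ""


# Assumed shape of a pre-computed parent chain.
parents = [
    {"rank": "ORDER", "name": "Lepidoptera"},
    {"rank": "FAMILY", "name": "Noctuidae"},
    {"rank": "GENUS", "name": "Noctua"},
]
family = rank_from_parents(parents, "family")
epithet = specific_epithet("Noctua pronuba")
```

Reading ranks from the stored parent list needs no extra query per occurrence, which is the N+1 avoidance the commit message refers to.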
Co-Authored-By: Claude <noreply@anthropic.com>
📝 Walkthrough
This pull request introduces a complete Darwin Core Archive (DwC-A) export format for AMI data. It adds a new exporter class, field mapping definitions, helper utilities for data transformation, metadata generation, registry integration, comprehensive test coverage, and technical documentation.
Sequence Diagram
sequenceDiagram
participant Client
participant API as DataExport API
participant Exporter as DwCAExporter
participant DB as Database
participant Files as File System
participant Storage as ZIP Archive
Client->>API: POST export (format=dwca)
API->>Exporter: initialize & export()
Exporter->>DB: get_events_queryset()
DB-->>Exporter: Event records
Exporter->>Files: write_tsv(event.txt)
Files-->>Exporter: event.txt path
Exporter->>DB: get_queryset()
DB-->>Exporter: Occurrence records
Exporter->>Files: write_tsv(occurrence.txt)
Files-->>Exporter: occurrence.txt path
Exporter->>Exporter: generate_meta_xml()
Exporter->>Exporter: generate_eml_xml()
Exporter->>Storage: create_dwca_zip(event.txt, occurrence.txt, meta.xml, eml.xml)
Storage-->>Exporter: dwca.zip path
Exporter-->>API: return ZIP path
API-->>Client: export complete with download URL
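The final packaging step in the diagram could look roughly like this with the standard library. `create_archive` is a hypothetical stand-in for the PR's `create_dwca_zip`, whose actual signature is not shown here, and the member contents are placeholders.

```python
import os
import tempfile
import zipfile


def create_archive(zip_path, members):
    """Write each (archive name, text) pair into a compressed ZIP file."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, text in members.items():
            archive.writestr(name, text)
    return zip_path


members = {
    "event.txt": "eventID\n",
    "occurrence.txt": "occurrenceID\teventID\n",
    "meta.xml": "<archive/>",
    "eml.xml": "<eml/>",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = create_archive(os.path.join(tmp_dir, "dwca.zip"), members)
    with zipfile.ZipFile(path) as archive:
        archive_names = sorted(archive.namelist())
```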
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Co-Authored-By: Claude <noreply@anthropic.com>
Actionable comments posted: 12
🤖 Fix all issues with AI agents
In `@ami/exports/dwca.py`:
- Around line 132-142: The _format_duration function must guard against negative
durations: check event.start and event.end and compute delta = event.end -
event.start, then if delta.total_seconds() < 0 return an empty string (or
optionally log a warning) instead of proceeding; update the logic around
total_seconds / divmod so negative values are not formatted (use the check on
delta.total_seconds() before computing hours/minutes and then format only for
non-negative durations).
- Around line 62-63: The taxonRank mapping lambda can raise AttributeError when
o.determination.rank is None; update the DWC + "taxonRank" tuple's lambda (the
function referencing o.determination.rank.lower()) to guard by checking both
o.determination and o.determination.rank before calling .lower(), returning an
empty string (or appropriate default) when rank is falsy.
- Line 87: The mapping that emits DWC + "individualCount" currently uses
getattr(o, "detections_count", 0) which is semantically wrong; change the tuple
entry for DWC + "individualCount" so it always emits the string "1" (i.e.,
replace the lambda that references detections_count with one that returns "1")
or remove the entire tuple if you prefer omission; update the tuple containing
DWC + "individualCount" and the lambda (the anonymous function referencing
getattr(o, "detections_count", 0)) accordingly.
- Around line 75-79: The ternary expression in the tuple entry with key
"vernacularName" is ambiguous; update the lambda (lambda o, slug:
o.determination.common_name_en or "" if o.determination else "") to add
parentheses around the ternary's true branch so it reads like
(o.determination.common_name_en or "") if o.determination else "" to make the
intent explicit and easier to read.
- Around line 309-312: The EML export is inserting the owner's email into the
<surName> element (project.owner.email) which leaks PII and is semantically
wrong; update the block that creates creator -> individualName -> surName to
avoid using project.owner.email: prefer using a non-PII display name (e.g.,
project.owner.get_display_name() or project.owner.name) if available, and if no
safe display name exists omit creating the individualName/surName elements
entirely (i.e., only create creator when a valid non-email name/contact is
present). Ensure references to project.owner, creator, individualName, and
surName are updated accordingly.
- Around line 288-290: Change the naive timestamp to a timezone-aware one:
replace the use of datetime.datetime.now() when assigning now (used alongside
project_slug/slugify) with Django's timezone-aware function (timezone.now()),
and add the appropriate import from django.utils import timezone at the top of
the module so now is timezone-aware when used in exports/dwca.py.
In `@ami/exports/format_types.py`:
- Around line 196-208: The Occurrence queryset in get_queryset currently
includes occurrences lacking a related event or determination, producing empty
coreid/scientificName; update get_queryset (the Occurrence.objects.valid() chain
in the get_queryset method) to filter out occurrences with a null event or a null
determination (e.g., .filter(event__isnull=False, determination__isnull=False))
before the select_related/.with_identifications() calls so only occurrences with
both event and determination are exported.
- Around line 237-264: The temp files created for the DwC-A export
(event_file.name, occ_file.name and the produced zip_path) are never cleaned up;
wrap the export sequence in a try/finally around the code that calls
get_events_queryset, write_tsv, generate_meta_xml, generate_eml_xml and
create_dwca_zip so that in the finally block you remove event_file.name,
occ_file.name and (on error or success) delete any intermediate temp zip if
appropriate; ensure update_export_stats and the return of zip_path still occur
on success but that file removal happens only for the intermediate files (or
conditionally for the zip if you decide not to persist it).
In `@ami/exports/tests.py`:
- Around line 477-502: The test test_taxonomy_hierarchy_extraction can pass
vacuously when taxon.parents_json is empty; update the test to guarantee
parents_json is populated before calling _get_rank_from_parents: after obtaining
taxon (from occurrence.determination) and calling
taxon.save(update_calculated_fields=True) and taxon.refresh_from_db(), add an
assertion that taxon.parents_json is truthy (or else set up explicit test data/a
fixture so parents_json contains a parent chain) so the loop over ranks actually
executes and _get_rank_from_parents is exercised.
In `@docs/claude/dwca-format-reference.md`:
- Around line 94-113: The documentation shows EML 2.2.0 namespace
("https://eml.ecoinformatics.org/eml-2.2.0") but the implementation currently
emits EML 2.1.1 (literal "eml://ecoinformatics.org/eml-2.1.1"); either make the
docs reflect 2.1.1 or update the implementation to 2.2.0 so they match. To fix:
search the codebase for the literal "eml://ecoinformatics.org/eml-2.1.1" (and
any code that constructs the eml:eml root or schemaLocation) and change the
namespace and schemaLocation to the 2.2.0 values used in the doc, or
alternatively update the doc snippet to the 2.1.1 namespace and schemaLocation;
ensure the change is applied to the EML root construction logic (the
function/method that builds the eml:eml element) and the documentation fragment
so both use the same namespace string.
In `@docs/claude/export-framework.md`:
- Line 13: Update the documentation table entry that currently lists
JSONExporter and CSVExporter to also include the new DwCAExporter: edit the row
showing `ami/exports/format_types.py` so it mentions Concrete exporters
`JSONExporter`, `CSVExporter`, and `DwCAExporter` (use the exact class name
DwCAExporter to match the implementation).
- Around line 68-71: Update the example output of
ExportRegistry.get_supported_formats() to include the newly registered "dwca"
format so the snippet accurately reflects current registrations; adjust the
example list returned by ExportRegistry.get_supported_formats() (used alongside
ExportRegistry.register and ExportRegistry.get_exporter in the snippet) to
include "dwca" among the formats (e.g., ["occurrences_api_json",
"occurrences_simple_csv", "dwca"]).
🧹 Nitpick comments (3)
docs/claude/dwca-format-reference.md (1)
9-16: Add a language identifier to the fenced code block.
The linter flags this code block for missing a language specifier (MD040). Since it's a file tree diagram, use `text` as the language.
ami/exports/tests.py (1)
329-341: Each test re-runs the full export pipeline independently.
Every test method calls `_run_export()`, which creates a `DataExport`, runs the entire export pipeline, and writes to storage. For 7 tests that call this helper, that's significant overhead. Consider using `setUpClass` or caching the export result in `setUp` to run the export once and share the ZIP across structural validation tests.
This is fine for correctness; it is only a test performance observation.
ami/exports/dwca.py (1)
286-286: Unused `events_queryset` parameter.
`events_queryset` is accepted but never used in `generate_eml_xml`. Either remove it or use it to populate temporal/geographic coverage metadata (the latter is listed as a follow-up). For now, removing the parameter or marking it for future use avoids confusion.
Proposed fix: remove unused parameter
-def generate_eml_xml(project, events_queryset=None) -> str:
+def generate_eml_xml(project) -> str:
This also requires updating the call site in `ami/exports/format_types.py` (line 258):
-        eml_xml = generate_eml_xml(self.project, events_qs)
+        eml_xml = generate_eml_xml(self.project)
    (
        DWC + "vernacularName",
        "vernacularName",
        lambda o, slug: o.determination.common_name_en or "" if o.determination else "",
    ),
🛠️ Refactor suggestion | 🟠 Major
vernacularName lambda: add parentheses for clarity (known issue #4).
While Python's operator precedence makes this technically correct (or binds tighter than the ternary), the expression is confusing to read and a common source of bugs. Adding explicit parentheses removes ambiguity.
Proposed fix
     (
         DWC + "vernacularName",
         "vernacularName",
-        lambda o, slug: o.determination.common_name_en or "" if o.determination else "",
+        lambda o, slug: (o.determination.common_name_en or "") if o.determination else "",
     ),
🧰 Tools
🪛 Ruff (0.15.0)
[warning] 78-78: Unused lambda argument: slug (ARG005)
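To make the precedence point concrete, here is a small self-contained check that the parenthesized and unparenthesized forms behave the same. The `SimpleNamespace` objects are stand-ins for occurrence and determination records, not the project's models.

```python
from types import SimpleNamespace

# Unparenthesized form, as in the original lambda.
implicit = lambda o: o.determination.common_name_en or "" if o.determination else ""
# Parenthesized form, as in the proposed fix.
explicit = lambda o: (o.determination.common_name_en or "") if o.determination else ""

with_name = SimpleNamespace(
    determination=SimpleNamespace(common_name_en="Large Yellow Underwing")
)
without_name = SimpleNamespace(determination=SimpleNamespace(common_name_en=None))
undetermined = SimpleNamespace(determination=None)

cases = (with_name, without_name, undetermined)
implicit_results = [implicit(o) for o in cases]
explicit_results = [explicit(o) for o in cases]
```

Both lists come out identical because the conditional expression has lower precedence than `or`, so the parentheses only document what the parser already does.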
def _format_duration(event) -> str:
    """Format event duration as human-readable string."""
    if not event.start or not event.end:
        return ""
    delta = event.end - event.start
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, _ = divmod(remainder, 60)
    if hours > 0:
        return f"{hours}h {minutes}m"
    return f"{minutes}m"
Negative durations are not guarded against.
If event.end < event.start (data inconsistency), total_seconds will be negative. Because Python's divmod floors toward negative infinity, the result is a misleading string rather than an obviously negative one: a delta of -1h 30m gives hours = -2 with a 30-minute remainder, so the function returns "30m". Consider returning an empty string or logging a warning for negative durations.
Proposed fix
 def _format_duration(event) -> str:
     """Format event duration as human-readable string."""
     if not event.start or not event.end:
         return ""
     delta = event.end - event.start
     total_seconds = int(delta.total_seconds())
+    if total_seconds < 0:
+        return ""
     hours, remainder = divmod(total_seconds, 3600)
     minutes, _ = divmod(remainder, 60)
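A standalone version of the guarded function can be exercised without Django. The `SimpleNamespace` events below are stand-ins for the real Event model, and `format_duration` mirrors the suggestion above rather than quoting the merged code.

```python
from datetime import datetime, timedelta
from types import SimpleNamespace


def format_duration(event) -> str:
    """Format an event's duration, returning "" for missing or reversed times."""
    if not event.start or not event.end:
        return ""
    total_seconds = int((event.end - event.start).total_seconds())
    if total_seconds < 0:
        return ""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, _ = divmod(remainder, 60)
    return f"{hours}h {minutes}m" if hours > 0 else f"{minutes}m"


start = datetime(2024, 6, 1, 22, 0)
forward = SimpleNamespace(start=start, end=start + timedelta(hours=1, minutes=30))
backward = SimpleNamespace(start=start, end=start - timedelta(hours=1, minutes=30))

forward_text = format_duration(forward)
backward_text = format_duration(backward)
# Without the guard, floor division would turn -5400 seconds into (-2, 1800).
unguarded_parts = divmod(-5400, 3600)
```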
ami/exports/dwca.py
Outdated
    project_slug = slugify(project.name)
    now = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
Use timezone-aware datetime.
datetime.datetime.now() produces a naive datetime. In a Django project with USE_TZ=True, this is inconsistent and could produce incorrect timestamps. Use Django's timezone.now().
Proposed fix
+    from django.utils import timezone
+
     project_slug = slugify(project.name)
-    now = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
+    now = timezone.now().strftime("%Y-%m-%dT%H:%M:%S")
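The difference between naive and aware timestamps can be seen with the standard library alone. This sketch uses `datetime.now(timezone.utc)` as a stand-in for Django's `timezone.now()`, which returns an aware UTC datetime when `USE_TZ=True`. Using `isoformat` here is an illustrative choice and not something the PR does.

```python
from datetime import datetime, timezone

naive = datetime.now()               # no tzinfo attached
aware = datetime.now(timezone.utc)   # tzinfo is UTC

# A format string without %z drops the offset even for an aware value.
without_offset = aware.strftime("%Y-%m-%dT%H:%M:%S")
# isoformat() keeps the offset, so the timestamp is unambiguous.
with_offset = aware.isoformat(timespec="seconds")
```

If the offset matters to downstream consumers of `eml.xml`, emitting it explicitly may be worth considering.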
```xml
<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0 https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd"
         packageId="urn:ami:dataset:{project_id}" system="AMI">
  <dataset>
    <title>{project.name}</title>
    <creator>
      <organizationName>{project.owner or institution}</organizationName>
    </creator>
    <abstract>
      <para>{project.description}</para>
    </abstract>
    <intellectualRights>
      <para>License information here</para>
    </intellectualRights>
  </dataset>
</eml:eml>
```
EML version mismatch between documentation and implementation.
This reference doc shows EML 2.2.0 namespace (https://eml.ecoinformatics.org/eml-2.2.0), but the actual implementation in ami/exports/dwca.py (line 293) uses EML 2.1.1 (eml://ecoinformatics.org/eml-2.1.1). The planning doc acknowledges upgrading to 2.2.0 as a follow-up, but this reference should either match the current implementation or clearly note the discrepancy.
ExportRegistry.register("format_name")(ExporterClass)
ExportRegistry.get_exporter("format_name")  # → ExporterClass
ExportRegistry.get_supported_formats()  # → ["occurrences_api_json", "occurrences_simple_csv"]
```
Example output of get_supported_formats() is incomplete.
The example on line 70 should include "dwca" to reflect the new registration.
Proposed fix
-ExportRegistry.get_supported_formats()  # → ["occurrences_api_json", "occurrences_simple_csv"]
+ExportRegistry.get_supported_formats()  # → ["occurrences_api_json", "occurrences_simple_csv", "dwca"]
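For readers unfamiliar with the pattern, a registry with this calling convention can be sketched in a few lines. `SketchRegistry` and the placeholder exporter classes are hypothetical and only mirror the three calls shown in the doc snippet; the project's `ExportRegistry` may be implemented differently.

```python
class SketchRegistry:
    """Map format names to exporter classes (illustrative only)."""

    _registry = {}

    @classmethod
    def register(cls, format_name):
        def decorator(exporter_class):
            cls._registry[format_name] = exporter_class
            return exporter_class

        return decorator

    @classmethod
    def get_exporter(cls, format_name):
        return cls._registry.get(format_name)

    @classmethod
    def get_supported_formats(cls):
        return list(cls._registry)


class PlaceholderCSVExporter:
    pass


class PlaceholderDwCAExporter:
    pass


# Same call shape as the doc snippet: register(name)(cls).
SketchRegistry.register("occurrences_simple_csv")(PlaceholderCSVExporter)
SketchRegistry.register("dwca")(PlaceholderDwCAExporter)

formats = SketchRegistry.get_supported_formats()
```

With a dict-backed registry like this one, the list of supported formats grows as soon as a new exporter is registered, which is why the documented example output needs to include "dwca".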
Pull request overview
Adds a new export format (dwca) to generate a GBIF-compatible Darwin Core Archive (DwC-A) ZIP using an Event core (event.txt) with an Occurrence extension (occurrence.txt), along with generated meta.xml and eml.xml.
Changes:
- Introduces `ami/exports/dwca.py` with DwC field mappings, TSV writing, and meta/eml XML generation.
- Adds `DwCAExporter` to the existing export framework and registers it under "dwca".
- Adds a new DwC-A-focused test suite validating ZIP contents and basic structure.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| `ami/exports/dwca.py` | New DwC-A field definitions and archive/metadata generators. |
| `ami/exports/format_types.py` | Adds DwCAExporter implementation to orchestrate TSV generation + ZIP packaging. |
| `ami/exports/registry.py` | Registers the new "dwca" export format. |
| `ami/exports/tests.py` | Adds end-to-end tests covering ZIP structure, TSV headers/rows, and XML parsing. |
| `docs/claude/export-framework.md` | New internal technical reference for the export framework. |
| `docs/claude/dwca-format-reference.md` | New DwC-A format reference documentation. |
| `.agents/planning/dwca-export-plan.md` | Planning/design notes and known issues list for DwC-A export. |
docs/claude/export-framework.md
Outdated
|------|---------|
| `ami/exports/base.py` | `BaseExporter` ABC - all exporters inherit from this |
| `ami/exports/registry.py` | `ExportRegistry` - maps format strings to exporter classes |
| `ami/exports/format_types.py` | Concrete exporters: `JSONExporter`, `CSVExporter` |
The “Concrete exporters” bullet lists only JSONExporter and CSVExporter, but this PR adds DwCAExporter in ami/exports/format_types.py. Update this reference so the technical doc matches the supported exporters.
-| `ami/exports/format_types.py` | Concrete exporters: `JSONExporter`, `CSVExporter` |
+| `ami/exports/format_types.py` | Concrete exporters: `JSONExporter`, `CSVExporter`, `DwCAExporter` |
    def get_queryset(self):
        """Return the occurrence queryset (used by BaseExporter for record count)."""
        return (
            Occurrence.objects.valid()  # type: ignore[union-attr]
            .filter(project=self.project)
            .select_related(
                "determination",
                "event",
                "deployment",
            )
            .with_detections_count()
            .with_identifications()
        )
DwC-A occurrence exports can currently include rows with a missing event or determination, which produces empty coreid/eventID and scientificName values in occurrence.txt (the GBIF validator rejects these). Filter the queryset to exclude occurrences with a null event or a null determination (and adjust the expected counts in tests accordingly).
ami/exports/format_types.py
Outdated
                progress_callback=lambda n: self.update_job_progress(n),
            )
            logger.info(f"DwC-A: wrote {occ_count} occurrences")
Job progress for DwC-A exports is only updated every 500 occurrences; for smaller exports progress may remain at 0% even though the job completes. Consider calling update_job_progress(occ_count) once at the end (guarding against total_records == 0 to avoid division by zero in BaseExporter.update_job_progress()).
            # Ensure final progress update, even if the callback was never invoked.
            # Guard against total_records == 0 to avoid division by zero in BaseExporter.update_job_progress().
            if getattr(self, "total_records", 0):
                self.update_job_progress(occ_count)
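The effect of a chunked progress callback on small exports can be reproduced in isolation. `SketchJob` and `write_rows` below are hypothetical stand-ins for the exporter and its TSV writer. The only assumption carried over from the comment is that progress is reported every 500 records and computed as processed / total.

```python
class SketchJob:
    """Track progress as a fraction of total_records (illustrative only)."""

    def __init__(self, total_records):
        self.total_records = total_records
        self.progress = 0.0

    def update_progress(self, processed):
        self.progress = processed / self.total_records


def write_rows(records, chunk_size=500, progress_callback=None):
    """Consume records, reporting progress once per full chunk."""
    count = 0
    for _ in records:
        count += 1
        if progress_callback and count % chunk_size == 0:
            progress_callback(count)
    return count


job = SketchJob(total_records=120)
written = write_rows(range(120), progress_callback=job.update_progress)
progress_before_final = job.progress  # callback never fired for 120 rows

# Final update, skipped when there is nothing to divide by.
if job.total_records:
    job.update_progress(written)
progress_after_final = job.progress
```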
    # Column 0 is the id (eventID)
    id_elem = ET.SubElement(core, "id")
    id_elem.set("index", "0")

    for i, (term_uri, header, _) in enumerate(event_fields):
        if i == 0:
            continue  # Already declared as <id>
        field = ET.SubElement(core, "field")
        field.set("index", str(i))
        field.set("term", term_uri)
In meta.xml generation, column 0 is declared as <id index="0"/> but the corresponding <field index="0" term="...eventID"/> mapping is skipped. Many DwC-A consumers/validators expect the ID column to also be mapped to the Darwin Core term (as in the DwC Text Guide examples). Include a <field> mapping for index 0 instead of skipping it.
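A version of the core descriptor that declares `<id index="0"/>` and also maps column 0 to its term might look like the sketch below. `build_core` and the three-entry field list are hypothetical, and the attribute set is a plausible subset rather than the PR's exact output.

```python
import xml.etree.ElementTree as ET

DWC = "http://rs.tdwg.org/dwc/terms/"

# Hypothetical (term URI, header) pairs; column 0 is the event identifier.
FIELDS = [
    (DWC + "eventID", "eventID"),
    (DWC + "eventDate", "eventDate"),
    (DWC + "samplingProtocol", "samplingProtocol"),
]


def build_core(fields, location="event.txt"):
    """Build a <core> element with an <id> plus a <field> for every column."""
    core = ET.Element(
        "core",
        {
            "rowType": DWC + "Event",
            "fieldsTerminatedBy": "\\t",
            "linesTerminatedBy": "\\n",
            "ignoreHeaderLines": "1",
            "encoding": "UTF-8",
        },
    )
    files = ET.SubElement(core, "files")
    ET.SubElement(files, "location").text = location
    ET.SubElement(core, "id", {"index": "0"})
    # No column is skipped: index 0 gets both <id> and a term mapping.
    for index, (term_uri, _header) in enumerate(fields):
        ET.SubElement(core, "field", {"index": str(index), "term": term_uri})
    return core


core = build_core(FIELDS)
field_indexes = [field.get("index") for field in core.findall("field")]
first_term = core.findall("field")[0].get("term")
xml_text = ET.tostring(core, encoding="unicode")
```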
- Guard taxonRank against None rank (AttributeError on .lower())
- Guard _get_rank_from_parents against None det_rank
- Fix vernacularName ternary precedence with explicit parentheses
- Change individualCount to emit "1" (each occurrence = 1 individual, not detections_count which counts bounding boxes)
- Guard _format_duration against negative durations
Co-Authored-By: Claude <noreply@anthropic.com>
- Filter out occurrences with null event or determination from DwC-A export queryset (GBIF rejects empty coreid/scientificName)
- Replace project.owner.email with project.owner.name in EML creator element to avoid leaking PII in downloadable archives
- Only emit individualName when owner has a name set
Co-Authored-By: Claude <noreply@anthropic.com>
- Wrap DwC-A export in try/finally to clean up intermediate temp files
- Use timezone.now() instead of naive datetime.datetime.now()
- Use full EML schemaLocation URL for GBIF validation compatibility
- Remove unused events_queryset parameter from generate_eml_xml
- Simplify progress_callback lambda to direct method reference
Co-Authored-By: Claude <noreply@anthropic.com>
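The try/finally cleanup mentioned in the commit above follows a common pattern, sketched here with the standard library. `export_with_cleanup` and `fake_build` are hypothetical. The real exporter writes TSV content and builds a ZIP where this example writes placeholder headers and returns a string.

```python
import os
import tempfile


def export_with_cleanup(build_archive):
    """Create two intermediate files, build the archive, then always clean up."""
    event_file = tempfile.NamedTemporaryFile("w", suffix="_event.txt", delete=False)
    occ_file = tempfile.NamedTemporaryFile("w", suffix="_occurrence.txt", delete=False)
    try:
        event_file.write("eventID\n")
        occ_file.write("occurrenceID\n")
        event_file.close()
        occ_file.close()
        return build_archive(event_file.name, occ_file.name)
    finally:
        # Runs on success and on error; the returned archive is left alone.
        for handle in (event_file, occ_file):
            handle.close()
            if os.path.exists(handle.name):
                os.remove(handle.name)


seen_paths = []
existed_during_build = []


def fake_build(event_path, occ_path):
    """Record the intermediate paths and pretend to produce a ZIP."""
    seen_paths.extend([event_path, occ_path])
    existed_during_build.extend(os.path.exists(p) for p in (event_path, occ_path))
    return "dwca.zip"


result = export_with_cleanup(fake_build)
remaining = [p for p in seen_paths if os.path.exists(p)]
```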
- Update occurrence row count test to match null-filtered queryset
- Assert parents_json is populated in taxonomy hierarchy test
- Use settings.MEDIA_URL instead of hardcoded "/media/" in test helper
- Add DwCAExporter to export-framework.md file table and registry example
- Fix EML version in dwca-format-reference.md to match implementation (2.1.1)
Co-Authored-By: Claude <noreply@anthropic.com>
…pdate
- Map all columns (including index 0) to DwC term URIs in meta.xml so GBIF validators can resolve both <id>/<coreid> and <field> entries
- Change fieldsEnclosedBy from empty string to double-quote character for stricter parser compatibility
- Add final progress update after TSV writing so small exports (<500 records) report completion instead of staying at 0%
Co-Authored-By: Claude <noreply@anthropic.com>
Run the export pipeline once in setUpClass and share the ZIP across all structural validation tests instead of re-running per test method. This reduces test time from ~7 export runs to 1. Co-Authored-By: Claude <noreply@anthropic.com>
…occurrences in DwC-A Remove get_filter_backends() override that returned [], allowing DwCAExporter to inherit BaseExporter's OccurrenceCollectionFilter. Update get_events_queryset() to derive events from self.queryset instead of fetching all project events, preventing orphaned events when collection_id filtering is active. Co-Authored-By: Claude <noreply@anthropic.com>
…ssertion Add test_dwca_export_with_collection_filter that verifies filtered exports produce correct occurrence/event counts and referential integrity. Update test_event_headers_and_row_count to expect events derived from occurrences rather than all project events. Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Add API methods, request/response format, filter system, DwC-A specifics, job integration, and file lifecycle details. Remove separate export-system.md. Co-Authored-By: Claude <noreply@anthropic.com>
Summary
- Adds a `dwca` export format that produces a GBIF-compatible Darwin Core Archive (ZIP containing `event.txt`, `occurrence.txt`, `meta.xml`, `eml.xml`)
- Events are the core entity, with occurrences linked via `eventID`
- New `ami/exports/dwca.py` module

This is the foundation for publishing AMI biodiversity data to GBIF. The initial implementation produces static archive downloads. Future work will add richer metadata, multimedia extensions, and eventually automated publishing via IPT integration.

Design decisions
- Direct TSV writing with `queryset.iterator(chunk_size=500)` for memory efficiency.
- Taxonomy from `parents_json`: walks the pre-computed parent list to extract kingdom/phylum/class/order/family/genus, avoiding N+1 queries.
- `meta.xml` generated from field definitions: the same field list used for TSV columns drives meta.xml generation, keeping them in sync.
- `basisOfRecord = "MachineObservation"` for all records.
- URN-format IDs: `urn:ami:event:{slug}:{id}`, `urn:ami:occurrence:{slug}:{id}`.

Changes
- `ami/exports/dwca.py`
- `ami/exports/format_types.py`: `DwCAExporter(BaseExporter)` class
- `ami/exports/registry.py`: `"dwca"` format
- `ami/exports/tests.py`

Test plan

Known issues to fix before merge
- Occurrences without an event produce an empty `coreid`: GBIF will reject orphaned extension rows. Need `.filter(event__isnull=False)` on the occurrence queryset.
- Occurrences without a determination produce an empty `scientificName`: GBIF treats this as required. Need `.filter(determination__isnull=False)`.
- `individualCount` semantics are wrong: `detections_count` is the number of bounding boxes across frames, not individuals. Each occurrence is one individual in AMI's model. Should emit `1` or omit the field.
- `vernacularName` lambda has ambiguous operator precedence: it works by accident but should be parenthesized.

Near follow-up (before real GBIF submission)
- `license` field on events: GBIF requires a dataset license for reuse terms.
- `identifiedBy` / `dateIdentified`: provenance for who/what made the determination.
- `associatedMedia`: detection image URLs. For an image-based platform this is the primary evidence.
- Multimedia extension (`multimedia.txt`)

Eventual follow-up
- `country`, `stateProvince`, `locality` on events (requires reverse geocoding or Site model fields)
- `coordinateUncertaintyInMeters`
- `institutionCode`, `collectionCode` (project-level settings)
- `scientificNameAuthorship` from `Taxon.author`
- `eventType` field