DM-54070: Add support for APDB record updates #33

Merged
JeremyMcCormick merged 35 commits into main from
tickets/DM-54070
Apr 14, 2026

Conversation

Contributor

JeremyMcCormick commented Feb 11, 2026

Overview

This is a major update to the repository, adding support for propagating APDB update records to the PPDB in BigQuery. The commit history was completely rebuilt and consolidated, so it may be a useful point of reference. In particular, 9e056c4 and a78ee70 contain the majority of the changes implementing the new functionality.

As discussed previously, applying record updates using the typical RDBMS pattern of SQL UPDATE statements would be infeasible given BigQuery's limitations in this area. In particular, the documentation on quotas indicates that there are limits on the number of UPDATE statements which may be executed per day. Since a single replica chunk may contain millions of individual update records, applying them one statement at a time would far exceed that quota. So a different approach was needed, where the updates are batched together and applied using a single MERGE statement.

The process for applying the updates implemented on this branch is as follows:

  1. During replication, the update records are serialized to a JSON file in the local replica chunk directory.
  2. The uploader process of the replication application uploads this JSON file to cloud storage alongside the parquet files.
  3. During the promotion step, the JSON files containing the updates are read in from cloud storage for the relevant chunks and the data is copied into an "expanded updates table", which represents the data generically as a combination of table name, field name, record ID, and new value, along with other necessary information like the replica chunk ID.
  4. The records in the expanded updates table are "deduplicated" so that only the most recent update for a given combination of table name, field name, and record ID is included in the batch. The most recent update is selected by ordering on replica chunk, update time, and update order, in that order. (Strictly speaking, this is not deduplication; the records are not duplicates, but successive updates to the same record, of which only the latest should be kept.)
  5. The deduplicated records are merged in during the promotion step, after the temporary table combining the current production table with the new records has been created. The updates are applied to this combined table, so it does not matter whether the target records were in the staging or production tables.
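The selection in step 4 can be sketched in plain Python (the row shape and names here are hypothetical; the actual implementation performs this selection with a BigQuery query over the expanded updates table):

```python
from collections import namedtuple

# Hypothetical row shape for illustration; the real schema is defined
# by the expanded_update_record module in this PR.
Update = namedtuple(
    "Update",
    "table field record_id chunk_id update_time update_order value",
)


def latest_updates(updates):
    """Keep only the most recent update per (table, field, record_id).

    "Most recent" is decided by replica chunk, then update time, then
    update order, matching the ordering described in step 4.
    """
    best = {}
    for u in updates:
        key = (u.table, u.field, u.record_id)
        rank = (u.chunk_id, u.update_time, u.update_order)
        # Keep this update only if it outranks the one seen so far.
        if key not in best or rank > best[key][0]:
            best[key] = (rank, u)
    return [u for _, u in best.values()]
```

For example, given two updates to the same `DiaObject` field from chunks 10 and 11, only the chunk-11 value survives.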

Implementing this process required some major changes and additions to the existing infrastructure, outlined below.

Major Changes

  • A new updates package was added under bigquery, containing the following modules:
    • update_records - Pydantic model for packaging a set of update records for a single replica chunk
    • expanded_update_record - expanded update record representing the changes in a common form within the initial BigQuery table
    • updates_table - encapsulation of the initial target BQ table which contains the update information (also contains functionality for "deduplicating" the update records, described above)
    • updates_merger - tool for merging the records in the updates table into the target APDB tables (DiaObject, etc.)
    • updates_manager - manages the overall process of applying the updates during promotion
  • Existing classes and modules in the bigquery package were modified to support the new functionality:
    • ppdb_bigquery
      • Added support for reading Postgres password from Google Secrets Manager
      • Added support for handling update records in the store method by serializing them to a local JSON file in the chunk directory
      • Ported several methods related to chunk "promotion" from the db module in dax_ppdb_gcp to this class so that they are more easily accessible
  • A few core framework classes, or portions of them, were ported from lsst-dm/dax_ppdbx_gcp into this repository.
    • Functionality for interacting with the SQL database for replica chunk management was copied from the db module into ppdb_bigquery.
    • The replica_chunk_promoter module for promoting replica chunk data from staging to production was copied into the bigquery package, modified, and renamed as chunk_promoter.
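As a rough sketch of the expanded representation and the expansion step (field names are hypothetical; the real expanded_update_record module defines a Pydantic model, and update_record_expander does the expansion):

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ExpandedUpdateRecord:
    """One row of the expanded updates table: a single field change,
    expressed generically as (table, field, record ID, new value) plus
    the replica chunk it arrived in. Names are illustrative only."""

    table_name: str
    field_name: str
    record_id: int
    new_value: Any
    replica_chunk_id: int


def expand_update(table_name, record_id, chunk_id, changed_fields):
    """Expand one logical update (a mapping of changed fields to new
    values) into one ExpandedUpdateRecord per field."""
    return [
        ExpandedUpdateRecord(table_name, field, record_id, value, chunk_id)
        for field, value in changed_fields.items()
    ]
```

Representing every update as one row per changed field is what lets a generic table hold updates for DiaObject, DiaSource, and DiaForcedSource alike.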

Minor Changes

  • Dependencies were reorganized and simplified.
    • The google-cloud-bigquery package was added as a primary dependency. The repository now depends heavily on this library after the updates in this ticket, so making it optional does not make sense for dependency management.
    • The lsst-dax-ppdbx-gcp library was also made a required dependency. It is similarly a core dependency now and does not make sense to have as optional.
    • Checks for formerly optional dependencies were stripped from the codebase, in particular, from the ppdb_bigquery module and all test modules.
  • Tests now support usage of Google Cloud services.
    • A check was added to skip tests where a Google Cloud environment is required but not available.
    • Certain tests will now (lightly) use cloud services such as Google Cloud Storage and BigQuery, where appropriate.
      • This was implemented carefully to avoid using existing names of production objects such as datasets or buckets. The tests typically create and then tear down these objects themselves.
  • The config module was renamed to ppdb_config, because the primary class it defines is PpdbConfig (follows DM standards for module naming).
  • Added a new sql_resource module for loading SQL from a package resource
    • New SQL files for the updates were placed under a resources directory in the Python source tree
  • Moved test schema from tests directory into resources directory where it is more easily accessible
  • Reorganized some of the test utilities in the tests package to make certain methods available, e.g., for generating test data
  • See full commit history for a few additional, minor changes which were not listed.
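The new sql_resource module presumably follows the standard importlib.resources pattern for loading package data; a minimal sketch (the lsst.dax.ppdb package path and file names in the comment are assumptions based on the file list below):

```python
from importlib import resources


def load_resource_text(package: str, name: str) -> str:
    """Read a text file shipped as package data.

    A helper like the sql_resource module added in this PR would be
    called with something like ("lsst.dax.ppdb.config.sql",
    "merge_diaobject_updates.sql") -- names assumed, not confirmed.
    """
    return resources.files(package).joinpath(name).read_text()
```

Shipping SQL as package data keeps the MERGE statements editable as plain files while remaining importable from an installed wheel.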

Known Issues

These are known issues that will be resolved on separate tickets.

  • There is a lot of duplication and overlap of the database-related (SQL/Postgres) methods and functionality in the ppdb_bigquery module. Some of this was preexisting and other overlapping methods were introduced in this PR. This can be cleaned up and consolidated (DM-54522).
  • Determining which records are updated in the cloud pipeline is currently done by reading the chunk manifests from cloud storage during replication. It would be better if a field in the PpdbReplicaChunk table, perhaps an update_count field, tracked whether a chunk has updates, so that this was unnecessary. I will make this improvement on this PR.
  • The tests and test coverage in this package could use some improvements, though I did add test cases for the new modules in the updates package, as well as a few core, existing classes like chunk_uploader and ppdb_bigquery (DM-54536).
    • Test coverage should be much better now than it was before but still could be improved.
    • We could use some better utilities/classes for generating test data. Some exist, but there is overlap, and this in general could be improved and consolidated. The tests for the new updates support could use a more extensive set of records, e.g., for testing the "deduplication" process, etc.
    • There aren't any existing tools for creating a BigQuery dataset from a Felis YAML schema, which would be particularly useful for standing up test datasets (DM-49220).
    • The tests package in the Python source tree has some new modules, though they are a bit disorganized and miscellaneous. These should be cleaned up and consolidated. It is also possible that some of the classes should just be included into the test modules, even if there is minor duplication as a result.
    • End-to-end testing on the full replication process is not implemented in this repository, and is difficult to do without some additional functionality which doesn't exist yet. I will likely create a follow-up ticket for deploying and testing the new "updates" functionality to the cloud pipeline and RSP. It is beyond the scope of this repository's functionality to fully test that ingestion pipeline at this time.
    • Some usage of cloud services needs to be cleaned up so that resources are torn down after they are used. All of the resources created are marked so that they will be automatically deleted eventually, but ideally they should be deleted by the test case itself.
  • There are a large number of miscellaneous issues related to improvements to this repository's Python codebase (DM-54522).
    • These issues/bullets will likely be broken out into separate, more manageable child tickets which include a few related issues together.

Additional Notes

This ticket includes some refactoring, porting of classes from other repositories, etc. that I now realize was excessive to include and should have been done on separate tickets. I will keep this in mind for the future and try to make changes on ticket branches more targeted, in particular for ease of review and testing. The commit history of this PR should be useful for disambiguating some of this work, though beb2eab includes refactoring alongside updates supporting the new functionality (these would have been difficult to separate when I rebuilt the commit history).

Disclosure on LLM Usage

I used Copilot extensively during the development of this ticket, in particular, for the following:

  • Creation of test cases for new modules
  • Design and implementation of the algorithm and class structure for applying updates
  • Code refactoring, cleanup, and bug-fixing

I reviewed all of the AI-generated code multiple times and made many changes to it. The test cases, in particular, could use further attention and consolidation (see above).


Copilot AI left a comment


Pull request overview

Adds end-to-end support for exporting, uploading, deduplicating, and merging APDB update records into PPDB BigQuery tables, along with supporting SQL resources and test coverage updates.

Changes:

  • Introduces BigQuery “updates” subsystem (expand → load → deduplicate → merge) with SQL MERGE resources.
  • Extends chunk export/upload metadata to track update-record presence and GCS location (gcs_uri), and adds promotable-chunk SQL/query utilities.
  • Refactors tests/config to use a resource-based test schema path and adds multiple BigQuery/GCS integration tests.

Reviewed changes

Copilot reviewed 36 out of 41 changed files in this pull request and generated 10 comments.

File Description
tests/test_updates_table.py BigQuery UpdatesTable integration tests
tests/test_updates_merger.py BigQuery MERGE integration tests
tests/test_updates_manager.py End-to-end updates manager test
tests/test_update_records.py UpdateRecords JSON + GCS uploader tests
tests/test_update_record_expander.py Unit tests for update expansion
tests/test_ppdb_sql.py Switch tests to schema resource URI
tests/test_ppdb_bigquery.py Adds BigQuery test case scaffolding
requirements.txt Removes ppdbx-gcp from base reqs
python/lsst/dax/ppdb/tests/config/__init__.py Test config package init
python/lsst/dax/ppdb/tests/_updates.py Shared synthetic update-record fixtures
python/lsst/dax/ppdb/tests/_ppdb.py Adds test schema URI + mixin refactor
python/lsst/dax/ppdb/tests/_bigquery.py BigQuery test mixins + uploader stub helpers
python/lsst/dax/ppdb/sql/_ppdb_sql.py Engine creation API change for DB init
python/lsst/dax/ppdb/sql/_ppdb_sql_base.py Refactors engine/connect-args helpers
python/lsst/dax/ppdb/ppdb.py Imports config from new module
python/lsst/dax/ppdb/ppdb_config.py New pydantic-based config loader
python/lsst/dax/ppdb/config/sql/select_promotable_chunks.sql New SQL for promotable chunk selection
python/lsst/dax/ppdb/config/sql/merge_diasource_updates.sql New MERGE SQL for DiaSource updates
python/lsst/dax/ppdb/config/sql/merge_diaobject_updates.sql New MERGE SQL for DiaObject updates
python/lsst/dax/ppdb/config/sql/merge_diaforcedsource_updates.sql New MERGE SQL for DiaForcedSource updates
python/lsst/dax/ppdb/config/schemas/test_apdb_schema.yaml Adds test schema as package data
python/lsst/dax/ppdb/bigquery/updates/updates_table.py Creates/loads/dedups updates table
python/lsst/dax/ppdb/bigquery/updates/updates_merger.py Merger classes for applying updates
python/lsst/dax/ppdb/bigquery/updates/updates_manager.py Orchestrates download/expand/load/merge
python/lsst/dax/ppdb/bigquery/updates/update_records.py Pydantic model for update records JSON
python/lsst/dax/ppdb/bigquery/updates/update_record_expander.py Expands logical updates into field updates
python/lsst/dax/ppdb/bigquery/updates/expanded_update_record.py Model for a single expanded update row
python/lsst/dax/ppdb/bigquery/updates/__init__.py Exports updates public API
python/lsst/dax/ppdb/bigquery/sql_resource.py Loads SQL from package resources
python/lsst/dax/ppdb/bigquery/replica_chunk_promoter.py Promotion workflow for staged chunks
python/lsst/dax/ppdb/bigquery/query_runner.py Utility for running/logging BQ jobs
python/lsst/dax/ppdb/bigquery/ppdb_replica_chunk_extended.py Adds gcs_uri to chunk metadata
python/lsst/dax/ppdb/bigquery/ppdb_bigquery.py Writes update_records.json; adds promotable-chunks query
python/lsst/dax/ppdb/bigquery/manifest.py Tracks whether updates are included
python/lsst/dax/ppdb/bigquery/chunk_uploader.py Uploads update_records.json; stores gcs_uri
python/lsst/dax/ppdb/bigquery/__init__.py Exports ChunkUploader
python/lsst/dax/ppdb/_factory.py Updates config import location
python/lsst/dax/ppdb/__init__.py Re-exports new config module
pyproject.toml Adds package data + gcp extra dep
docker/Dockerfile.replication Adds build deps for Python packages
.gitignore Ignores .scratch directory
Comments suppressed due to low confidence (1)

python/lsst/dax/ppdb/tests/_bigquery.py:33

  • This module imports google.cloud.storage unconditionally. Because _bigquery.py is used by multiple test cases/mixins, this can break test collection in environments where optional GCP dependencies aren’t installed. Wrap these imports in try/except (similar to other tests) and skip/disable GCP-dependent helpers when unavailable.
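The guarded-import pattern suggested here is standard; a sketch (the test-case name is hypothetical, not the actual class in tests/_bigquery.py):

```python
import unittest

# Guard the optional GCP import so test collection succeeds even when
# the google-cloud-storage package is not installed.
try:
    from google.cloud import storage  # noqa: F401

    HAVE_GCS = True
except ImportError:
    HAVE_GCS = False


@unittest.skipUnless(HAVE_GCS, "google-cloud-storage is not installed")
class GcsDependentTestCase(unittest.TestCase):
    """Hypothetical GCS-dependent test case; skipped entirely when the
    optional dependency is missing."""

    def test_gcs_available(self):
        self.assertTrue(HAVE_GCS)
```

With this arrangement the module imports cleanly everywhere, and only the GCP-dependent tests are skipped rather than breaking collection.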


JeremyMcCormick force-pushed the tickets/DM-54070 branch 3 times, most recently from fc6daf7 to 0519130 on March 19, 2026 22:47
JeremyMcCormick force-pushed the tickets/DM-54070 branch 3 times, most recently from 6c7a5d3 to 4e54fc1 on March 21, 2026 02:52
JeremyMcCormick force-pushed the tickets/DM-54070 branch 8 times, most recently from 039bc08 to 0be5f74 on March 31, 2026 00:59
JeremyMcCormick marked this pull request as ready for review on March 31, 2026 21:01
Collaborator

andy-slac left a comment


I checked what I could, but it is too much to read in one go. Anyway, I left a bunch of comments and suggestions. My main issues:

  • The code is structured in a way that makes it hard to extend with additional update types. SQL code looks particularly fragile.
  • Error handling needs more attention; I think there are a lot of low-level exceptions that could leak through everywhere.
  • Dumping update records to JSON will make it harder for other people to reuse those files; I think Parquet will be easier to read.
  • You want to extend the dax_apdb API instead of doing getattr on hardcoded field names of update records.

@lsst lsst deleted a comment from andy-slac Apr 7, 2026
The `test_ppdbBigQuery` module was also renamed to
`test_ppdb_bigquery` following snake_case conventions.
This was previously referred to as "deduplication," which was
misleading, because the records do not represent duplicates. They are
updates on the same combination of `(table, field, record_id)` where
only the latest one should be kept.
Instead of a boolean flag, an update count is included in the manifest,
which aligns with the record counts for the table files.
This is used for easily determining which replica chunks have
associated updates during processing.
JeremyMcCormick force-pushed the tickets/DM-54070 branch 2 times, most recently from 00c56b6 to 3402ff2 on April 9, 2026 17:57
This adds startup and shutdown messages for the replication and adjusts
some message levels so that the log is not so cluttered. We generally
want to see messages when the replication activates and processes
replica chunks. Printing information about wait intervals, or when
no chunks are available to replicate, is generally not very interesting
for normal operations, and these messages tend to bury others. So these
types of messages have either been removed or set to DEBUG level.
JeremyMcCormick force-pushed the tickets/DM-54070 branch 2 times, most recently from 22ea446 to e69428e on April 10, 2026 22:20
Contributor Author

JeremyMcCormick commented Apr 13, 2026

Hi, @andy-slac.

This should be ready for a re-review now.

I rebased a significant amount of work onto previous commits, but once that became too awkward, I added new commits. All commits after 7ccf56f represent new work primarily based on your first review.

I also went ahead and implemented the change mentioned in #33 (comment) on this branch after your review.

I checked what I could, but it is too much to read in one go. Anyway, I left a bunch of comments and suggestions. My main issues:

  • The code is structured in a way that makes it hard to extend with additional update types. SQL code looks particularly fragile.

This was discussed on the DM-54070 Jira ticket, and DM-54636 has been created to track future work. I am not planning to rewrite these classes on this ticket branch, as the amount of testing and validation required for a new implementation is significant.

  • Error handling needs more attention; I think there are a lot of low-level exceptions that could leak through everywhere.

The error handling should be in better shape now. I added a few module-specific exception types and more granular error-handling to the updates_manager and chunk_promoter modules, since these are the primary entry points for executing the updates process. Additional improvements in this area can be made on subsequent tickets.
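Module-specific exception types of this kind commonly follow the wrap-and-chain pattern; a sketch with entirely hypothetical names (the actual exception classes in updates_manager and chunk_promoter are not shown in this thread):

```python
class UpdatesError(Exception):
    """Base class for errors raised by the updates subsystem."""


class UpdatesMergeError(UpdatesError):
    """Raised when merging update records into a target table fails."""


def merge_chunk(chunk_id: int) -> None:
    """Wrap a low-level failure in a module-level exception so callers
    see a single, documented error type instead of a leaking internal
    exception."""
    try:
        # Stand-in for a BigQuery job that fails at a low level.
        raise RuntimeError("BigQuery job failed")
    except RuntimeError as exc:
        raise UpdatesMergeError(f"merge failed for chunk {chunk_id}") from exc
```

Chaining with `raise ... from exc` preserves the low-level traceback while giving entry-point callers one exception family to catch.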

  • Dumping update records to JSON will make it harder for other people to reuse those files; I think Parquet will be easier to read.

I thought this was a great idea, and I've changed the implementation to use Parquet instead of JSON files for the update records. I stripped out the original JSON functionality, as it was unused after this change.

  • You want to extend the dax_apdb API instead of doing getattr on hardcoded field names of update records.

This was implemented on a dax_apdb branch, which I will merge before this one.

This also makes a few minor updates to logging in several of the
classes within the `updates` package.
Collaborator

andy-slac left a comment


Looks OK, but I think it still misses validation of the resulting updates.

The `record_payload` method now returns only fields that were actually
updated, so this should be unneeded.
Contributor Author

JeremyMcCormick commented Apr 14, 2026

Thank you for the re-review, @andy-slac.

Looks OK, but I think it still misses validation of the resulting updates.

I have created DM-54650 to track this.

JeremyMcCormick merged commit c41e29e into main on Apr 14, 2026
18 checks passed