DM-54070: Add support for APDB record updates (#33)
Pull request overview
Adds end-to-end support for exporting, uploading, deduplicating, and merging APDB update records into PPDB BigQuery tables, along with supporting SQL resources and test coverage updates.
Changes:
- Introduces BigQuery “updates” subsystem (expand → load → deduplicate → merge) with SQL MERGE resources.
- Extends chunk export/upload metadata to track update-record presence and GCS location (`gcs_uri`), and adds promotable-chunk SQL/query utilities.
- Refactors tests/config to use a resource-based test schema path and adds multiple BigQuery/GCS integration tests.
Reviewed changes
Copilot reviewed 36 out of 41 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| tests/test_updates_table.py | BigQuery UpdatesTable integration tests |
| tests/test_updates_merger.py | BigQuery MERGE integration tests |
| tests/test_updates_manager.py | End-to-end updates manager test |
| tests/test_update_records.py | UpdateRecords JSON + GCS uploader tests |
| tests/test_update_record_expander.py | Unit tests for update expansion |
| tests/test_ppdb_sql.py | Switch tests to schema resource URI |
| tests/test_ppdb_bigquery.py | Adds BigQuery test case scaffolding |
| requirements.txt | Removes ppdbx-gcp from base reqs |
| python/lsst/dax/ppdb/tests/config/__init__.py | Test config package init |
| python/lsst/dax/ppdb/tests/_updates.py | Shared synthetic update-record fixtures |
| python/lsst/dax/ppdb/tests/_ppdb.py | Adds test schema URI + mixin refactor |
| python/lsst/dax/ppdb/tests/_bigquery.py | BigQuery test mixins + uploader stub helpers |
| python/lsst/dax/ppdb/sql/_ppdb_sql.py | Engine creation API change for DB init |
| python/lsst/dax/ppdb/sql/_ppdb_sql_base.py | Refactors engine/connect-args helpers |
| python/lsst/dax/ppdb/ppdb.py | Imports config from new module |
| python/lsst/dax/ppdb/ppdb_config.py | New pydantic-based config loader |
| python/lsst/dax/ppdb/config/sql/select_promotable_chunks.sql | New SQL for promotable chunk selection |
| python/lsst/dax/ppdb/config/sql/merge_diasource_updates.sql | New MERGE SQL for DiaSource updates |
| python/lsst/dax/ppdb/config/sql/merge_diaobject_updates.sql | New MERGE SQL for DiaObject updates |
| python/lsst/dax/ppdb/config/sql/merge_diaforcedsource_updates.sql | New MERGE SQL for DiaForcedSource updates |
| python/lsst/dax/ppdb/config/schemas/test_apdb_schema.yaml | Adds test schema as package data |
| python/lsst/dax/ppdb/bigquery/updates/updates_table.py | Creates/loads/dedups updates table |
| python/lsst/dax/ppdb/bigquery/updates/updates_merger.py | Merger classes for applying updates |
| python/lsst/dax/ppdb/bigquery/updates/updates_manager.py | Orchestrates download/expand/load/merge |
| python/lsst/dax/ppdb/bigquery/updates/update_records.py | Pydantic model for update records JSON |
| python/lsst/dax/ppdb/bigquery/updates/update_record_expander.py | Expands logical updates into field updates |
| python/lsst/dax/ppdb/bigquery/updates/expanded_update_record.py | Model for a single expanded update row |
| python/lsst/dax/ppdb/bigquery/updates/__init__.py | Exports updates public API |
| python/lsst/dax/ppdb/bigquery/sql_resource.py | Loads SQL from package resources |
| python/lsst/dax/ppdb/bigquery/replica_chunk_promoter.py | Promotion workflow for staged chunks |
| python/lsst/dax/ppdb/bigquery/query_runner.py | Utility for running/logging BQ jobs |
| python/lsst/dax/ppdb/bigquery/ppdb_replica_chunk_extended.py | Adds gcs_uri to chunk metadata |
| python/lsst/dax/ppdb/bigquery/ppdb_bigquery.py | Writes update_records.json; adds promotable-chunks query |
| python/lsst/dax/ppdb/bigquery/manifest.py | Tracks whether updates are included |
| python/lsst/dax/ppdb/bigquery/chunk_uploader.py | Uploads update_records.json; stores gcs_uri |
| python/lsst/dax/ppdb/bigquery/__init__.py | Exports ChunkUploader |
| python/lsst/dax/ppdb/_factory.py | Updates config import location |
| python/lsst/dax/ppdb/__init__.py | Re-exports new config module |
| pyproject.toml | Adds package data + gcp extra dep |
| docker/Dockerfile.replication | Adds build deps for Python packages |
| .gitignore | Ignores .scratch directory |
Comments suppressed due to low confidence (1)
python/lsst/dax/ppdb/tests/_bigquery.py:33
- This module imports `google.cloud.storage` unconditionally. Because `_bigquery.py` is used by multiple test cases/mixins, this can break test collection in environments where optional GCP dependencies aren't installed. Wrap these imports in `try`/`except` (similar to other tests) and skip/disable GCP-dependent helpers when unavailable.
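The guarded-import pattern suggested here might look like the following sketch. The flag `HAVE_GCS` and helper `require_gcs` are illustrative names, not identifiers from the codebase:

```python
try:
    from google.cloud import storage  # optional GCP dependency
    HAVE_GCS = True
except ImportError:
    storage = None
    HAVE_GCS = False

def require_gcs() -> None:
    """Fail with a clear message when GCS-dependent test helpers are used
    in an environment without the optional dependency installed."""
    if not HAVE_GCS:
        raise RuntimeError("google-cloud-storage is not installed; GCS helpers are disabled")
```

Test mixins can then call `require_gcs()` (or check `HAVE_GCS` in a skip condition) instead of failing at import time during test collection.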
andy-slac left a comment:
I checked what I could, but it is too much to read in one go. Anyway, I left a bunch of comments and suggestions. My main issues:
- The code is structured in a way that makes it hard to extend with additional update types. SQL code looks particularly fragile.
- Error handling needs more attention; I think there are many places where low-level exceptions can leak through.
- Dumping update records to JSON will make it harder for other people to reuse those files, I think Parquet will be easier to read.
- You want to extend the dax_apdb API instead of doing `getattr` on hardcoded field names of update records.
The `test_ppdbBigQuery` module was also renamed to `test_ppdb_bigquery`, following snake_case conventions.
This was previously referred to as "deduplication," which was misleading, because the records do not represent duplicates. They are updates on the same combination of `(table, field, record_id)` where only the latest one should be kept.
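A minimal sketch of this "keep only the latest" rule, using illustrative field names (`table`, `field`, `record_id`, `timestamp`) rather than the actual schema from this PR:

```python
def latest_updates(records: list[dict]) -> list[dict]:
    """Keep only the most recent update for each (table, field, record_id)."""
    latest: dict[tuple, dict] = {}
    for rec in records:
        key = (rec["table"], rec["field"], rec["record_id"])
        # A later timestamp wins; earlier updates to the same key are discarded.
        if key not in latest or rec["timestamp"] > latest[key]["timestamp"]:
            latest[key] = rec
    return list(latest.values())
```

In the actual implementation this selection happens in SQL over the staged BigQuery table, but the rule is the same: the records are not duplicates, only superseded versions of the same logical update.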
This also cleans up the test GCS bucket in the teardown of the test case.
Instead of a boolean flag, an update count is included in the manifest, which aligns with the record counts for the table files.
This is used for easily determining which replica chunks have associated updates during processing.
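A rough sketch of a manifest carrying an update count alongside the per-table row counts; the class and field names here are assumptions, not the PR's actual manifest model:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkManifest:
    """Illustrative manifest for one replica chunk."""
    chunk_id: int
    table_row_counts: dict[str, int] = field(default_factory=dict)
    update_count: int = 0  # replaces the earlier boolean "has updates" flag

    @property
    def has_updates(self) -> bool:
        # The boolean view is still cheap to derive when needed.
        return self.update_count > 0
```

Storing a count rather than a flag keeps the manifest consistent with the table-file record counts and costs nothing when only a yes/no answer is needed.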
This adds startup and shutdown messages for replication and adjusts some message levels so that the log is not so cluttered. We generally want to see messages when replication activates and processes replica chunks. Information about wait intervals, or messages when no chunks are available to replicate, is generally not very interesting for normal operations and tends to bury other messages. These types of messages have either been removed or set to DEBUG level.
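The policy described above can be sketched with illustrative names: lifecycle messages stay at INFO, while per-iteration chatter is demoted to DEBUG:

```python
import logging

# Logger name and function are illustrative, not the actual replication code.
logger = logging.getLogger("ppdb.replication")

def run_once(chunks: list) -> int:
    """Process one polling iteration of the replication loop."""
    if not chunks:
        # Demoted from INFO: "no chunks available" only buries useful messages.
        logger.debug("No replica chunks available; waiting for next interval")
        return 0
    logger.info("Processing %d replica chunk(s)", len(chunks))
    return len(chunks)
```

Running at the default INFO level, operators then see only the messages that indicate real activity.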
Hi, @andy-slac. This should be ready for a re-review now. I rebased a significant amount of work onto previous commits, but once that became too awkward, I added new commits. All commits after 7ccf56f represent new work primarily based on your first review. I also went ahead and implemented the change mentioned in #33 (comment) on this branch after your review.
This was discussed on the DM-54070 Jira ticket, and DM-54636 has been created to track future work. I am not planning to rewrite these classes on this ticket branch, as the amount of testing and validation required for a new implementation is significant.
The error handling should be in better shape now. I added a few module-specific exception types and more granular error-handling to the
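Module-specific exception types of the kind mentioned might look like the following sketch; the class names are assumptions, not the actual ones added in this PR:

```python
class PpdbUpdateError(Exception):
    """Base class for errors raised while applying APDB update records."""

class UpdateLoadError(PpdbUpdateError):
    """Raised when expanded update records cannot be loaded into BigQuery."""

class UpdateMergeError(PpdbUpdateError):
    """Raised when merging staged updates into the target PPDB tables fails."""
```

A small hierarchy like this lets callers catch the base class for any update failure while still distinguishing the stage that failed, instead of letting low-level client exceptions propagate.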
I thought this was a great idea, and I've changed the implementation to use Parquet instead of JSON files for the update records. I stripped out the original JSON functionality, as it was unused after making this change.
This was implemented on a dax_apdb branch, which I will merge before this one.
This also makes a few minor updates to logging in several of the classes within the `updates` package.
andy-slac left a comment:
Looks OK, but I think it still misses validation of the resulting updates.
The `record_payload` method now only returns fields that were actually updated, so this should be unnecessary.
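As a rough illustration of that behavior (a stand-in, not the actual dax_apdb implementation), `record_payload` can be thought of as a diff that keeps only the changed fields:

```python
def record_payload(before: dict, after: dict) -> dict:
    """Return only the fields whose values actually changed."""
    return {k: v for k, v in after.items() if before.get(k) != v}
```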
Thank you for the re-review, @andy-slac.
I have created DM-54650 to track this.
Overview
This is a major update to the repository, adding support for propagating APDB update records to the PPDB in BigQuery. The commit history was completely rebuilt and consolidated, and using it as a point of reference may be helpful. In particular, 9e056c4 and a78ee70 contain the majority of the changes for implementing the new functionality.
As discussed previously, applying record updates using a typical RDBMS pattern of SQL `UPDATE` statements would be infeasible given BigQuery's limitations in this area. In particular, the documentation on quotas indicates that there are limits on the number of update statements which may be executed per day. Given that single replica chunks may contain millions of individual update records, executing this many queries would far exceed the quota. So a different approach was needed, where the updates are batched together and applied using a single `MERGE` statement.

The process for applying the updates implemented on this branch is as follows:
1. Export the update records for a replica chunk and upload them to GCS alongside the chunk data.
2. Expand each logical update into a common, per-field form.
3. Load the expanded records into an initial BigQuery updates table.
4. Keep only the latest update for each `(table, field, record_id)` combination.
5. Merge the surviving records into the target PPDB tables using `MERGE` statements.
Implementing this process required some major changes and additions to the existing infrastructure, outlined below.
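The batched approach can be sketched by staging updates and applying them with one generated `MERGE` statement. The table, column, and function names below are assumptions for illustration and do not reproduce the actual SQL resources in this PR:

```python
def build_merge_sql(target: str, staging: str, key: str, fields: list[str]) -> str:
    """Build a single MERGE that applies all staged field updates to a target table."""
    assignments = ", ".join(f"T.{f} = S.{f}" for f in fields)
    return (
        f"MERGE `{target}` T\n"
        f"USING `{staging}` S\n"
        f"ON T.{key} = S.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {assignments}"
    )
```

One such statement per target table replaces what would otherwise be millions of individual `UPDATE` statements, keeping the job count well within BigQuery's DML quotas.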
Major Changes
- An `updates` package was added under `bigquery`, containing the following modules:
  - `update_records` - Pydantic model for packaging a set of update records for a single replica chunk
  - `expanded_update_record` - expanded update record representing the changes in a common form within the initial BigQuery table
  - `updates_table` - encapsulation of the initial target BQ table which contains the update information (also contains functionality for "deduplicating" the update records, described above)
  - `updates_merger` - tool for merging the records in the updates table into the target APDB tables (DiaObject, etc.)
  - `updates_manager` - manages the overall process of applying the updates during promotion
- Several modules in the `bigquery` package were modified to support the new functionality:
  - `ppdb_bigquery` - update records are now written in the `store` method by serializing them to a local JSON file in the chunk directory; functionality was also moved from the `db` module in `dax_ppdb_gcp` to this class so that it is more easily accessible
- Some code was ported from `lsst-dm/dax_ppdbx_gcp` into this repository:
  - the `db` module was merged into `ppdb_bigquery`
  - the `replica_chunk_promoter` module for promoting replica chunk data from staging to production was copied into the `bigquery` package, modified, and renamed as `chunk_promoter`

Minor Changes
- The `google-cloud-bigquery` package was added as a primary dependency. The repository now depends heavily on this library after the updates in this ticket, so making it optional does not make sense for dependency management.
- The `lsst-dax-ppdbx-gcp` library was also made a required dependency. It is similarly a core dependency now and does not make sense to have as optional; it is used by the `ppdb_bigquery` module and all test modules.
- The `config` module was renamed to `ppdb_config`, because the primary class it defines is `PpdbConfig` (this follows DM standards for module naming).
- A `sql_resource` module was added for loading SQL from a package resource.
- A `resources` directory was added in the Python source tree, and the test schema was moved from the `tests` directory into the `resources` directory, where it is more easily accessible.
- The `tests` package was updated to make certain methods available, e.g., for generating test data.

Known Issues
These are known issues that will be resolved on separate tickets.
- There is duplicated functionality in the `ppdb_bigquery` module. Some of this was preexisting, and other overlapping methods were introduced in this PR. This can be cleaned up and consolidated (DM-54522).
- Determining which records are updated in the cloud pipeline is currently done by reading the chunk manifests from cloud storage during replication. It would be better if instead there was a field in the `PpdbReplicaChunk` table which tracked whether the chunk has updates, perhaps an `update_count` field, so that this was unnecessary. (I may include this in DM-54522, or it could be a separate ticket.)
- Further cleanup is needed in the `updates` package, as well as in a few core, existing classes like `chunk_uploader` and `ppdb_bigquery` (DM-54536).
- The `tests` package in the Python source tree has some new modules, though they are a bit disorganized and miscellaneous. These should be cleaned up and consolidated. It is also possible that some of the classes should just be included in the test modules, even if there is minor duplication as a result.

Additional Notes
This ticket includes some refactoring, porting of classes from other repositories, and other changes that I now realize were excessive to include and should have been done on separate tickets. I will keep this in mind for the future and try to make changes on ticket branches more targeted, in particular for ease of review and testing. The commit history of this PR should be useful for disambiguating some of this work, though beb2eab includes refactoring alongside updates supporting the new functionality. (These would have been difficult to separate out when I rebuilt the commit history.)
Disclosure on LLM Usage
I used Copilot extensively during the development of this ticket, in particular, for the following:
I reviewed all of the AI-generated code multiple times and made many changes to it. The test cases, in particular, could use further attention and consolidation (see above).