
fix: add shutdown methods to executors#925

Open
jbusecke wants to merge 30 commits into main from executor-cleaning

Conversation

@jbusecke
Collaborator

@jbusecke jbusecke commented Mar 12, 2026

@TomNicholas and I have been mulling over a complex native zarr ingestion job for a few days now. We were ingesting many batches of large (~1TB) native zarr stores, and saw a steady increase in memory, which indicated that 'something' was holding onto memory between batches. This PR adds tests to catch this behavior, plus a fix for the lithops executor that did fix our problem for now.

jbusecke and others added 2 commits March 12, 2026 12:26
Add explicit shutdown() to SerialExecutor and DaskDelayedExecutor that
clears tracked futures. Enhance LithopsEagerFunctionExecutor.shutdown()
to clear cached _call_output on ResponseFutures before closing, preventing
memory accumulation across repeated map() calls. Add parametrized tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
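The shutdown pattern this commit describes (track futures during execution, drop the references on shutdown so results can be garbage collected) can be sketched as follows. This is a minimal illustration, not virtualizarr's actual `SerialExecutor`; the class body here is an assumption:

```python
import concurrent.futures


class SerialExecutor:
    """Hypothetical sketch: runs tasks eagerly and tracks their futures."""

    def __init__(self):
        self._futures = []

    def submit(self, fn, *args, **kwargs):
        # Execute immediately and wrap the result in an already-resolved future
        future = concurrent.futures.Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        self._futures.append(future)
        return future

    def shutdown(self, wait: bool = True, *, cancel_futures: bool = False) -> None:
        # Drop references to completed futures so their cached results
        # can be garbage collected between batches
        self._futures.clear()


executor = SerialExecutor()
f = executor.submit(lambda x: x * 2, 21)
print(f.result())  # 42
executor.shutdown()
print(len(executor._futures))  # 0
```

The signature mirrors `concurrent.futures.Executor.shutdown`, so callers can treat all executors uniformly.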
@codecov

codecov bot commented Mar 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.30%. Comparing base (2b68ec1) to head (887edac).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #925      +/-   ##
==========================================
+ Coverage   89.23%   89.30%   +0.07%     
==========================================
  Files          33       33              
  Lines        2025     2039      +14     
==========================================
+ Hits         1807     1821      +14     
  Misses        218      218              
Files with missing lines | Coverage | Δ
virtualizarr/parallel.py | 93.02% <100.00%> | +3.36% ⬆️

Comment thread virtualizarr/parallel.py Outdated
Comment thread virtualizarr/tests/test_parallel.py Outdated
Comment thread virtualizarr/tests/test_parallel.py Outdated


@pytest.mark.parametrize("executor_cls", ALL_EXECUTORS)
class TestExecutorMemory:
Member

I'm not sure if either of these tests will be reliable enough - curious about @chuckwondo's thoughts.

@jbusecke jbusecke marked this pull request as ready for review March 12, 2026 19:50
Comment thread virtualizarr/parallel.py Outdated
Comment on lines +391 to +394
# Lithops registers self.clean as an atexit handler (executors.py __init__),
# which prevents the FunctionExecutor from ever being garbage collected.
# Unregister it so the executor can be freed after shutdown.
atexit.unregister(self.lithops_client.clean)
Member

This is absolutely wild and deserves raising upstream
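A minimal reproduction of the pattern being discussed: registering a bound method with `atexit` keeps the owning instance alive, because the registry holds a strong reference to the method (and hence to `self`). The class below is a hypothetical stand-in for lithops' `FunctionExecutor`, not its real code:

```python
import atexit
import weakref


class FunctionExecutor:
    """Hypothetical stand-in: registers its own cleanup at construction,
    the way lithops registers self.clean in __init__."""

    def __init__(self):
        atexit.register(self.clean)

    def clean(self):
        pass


ex = FunctionExecutor()
ref = weakref.ref(ex)
del ex
# The atexit registry still holds a bound method referencing the instance,
# so it is never garbage collected:
print(ref() is None)  # False

# Unregistering removes that reference (bound methods compare equal if
# they share the same instance and function), freeing the instance:
atexit.unregister(ref().clean)
print(ref() is None)  # True
```

This relies on CPython's reference counting freeing the instance as soon as the last strong reference is gone.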

Collaborator

Probably so.

Member

see #926

Comment thread virtualizarr/tests/test_parallel.py Outdated
Comment thread virtualizarr/tests/test_parallel.py Outdated


@pytest.mark.parametrize("executor_cls", ALL_EXECUTORS)
class TestExecutorMemory:
Collaborator

I'm not sure these really belong here. They seem like tests that should occur upstream. If we see memory leaks in the upstream executors, we should probably be opening bugs against the appropriate repositories, no?

Member

I agree, see #926 for discussion of what we should do or not do to clean up

Collaborator Author

@jbusecke jbusecke Mar 12, 2026

I think you are right in principle, but I would propose keeping this around as at least an optional test, given the significant work that was needed to get to the bottom of this.

Member

Yeah @chuckwondo I see these tests as hopefully-temporary, but unfortunately important.

@jbusecke
Collaborator Author

Ok, Tom and I actually worked on an alternative approach where we change the lithops config to set lithops.data_cleaner to false (it is true by default and triggers the atexit registration). Combined with the added .shutdown() method on the lithops executor, this solves the problem in #926 and seems a bunch nicer than the original approach.

I have limited this to when the backend is localhost so that we leave the serverless behavior untouched for now. We could easily extend this if a user finds this error with other backends.
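The approach described above can be sketched as a small config adjustment. This is a hypothetical helper, not the PR's actual code; the key names follow lithops' config format, but how virtualizarr applies the change may differ:

```python
def adjust_config(config: dict) -> dict:
    """Disable lithops' atexit-based data cleaner, but only for the
    localhost backend, leaving serverless backends untouched."""
    config = dict(config)
    lithops_section = dict(config.get("lithops", {}))
    if lithops_section.get("backend") == "localhost":
        # data_cleaner=True (the default) makes FunctionExecutor.__init__
        # register self.clean with atexit, pinning the executor in memory
        lithops_section["data_cleaner"] = False
    config["lithops"] = lithops_section
    return config


cfg = adjust_config({"lithops": {"backend": "localhost"}})
print(cfg["lithops"]["data_cleaner"])  # False
```

Copying the dicts before mutating keeps the caller's original config untouched.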

Member

@TomNicholas TomNicholas left a comment

I think this is good, but before releasing it I want to:

  • confirm with @jbusecke that nothing else puzzling has come up wrt this rabbit hole,
  • raise an upstream issue on lithops in case we're missing something important here.

@TomNicholas TomNicholas mentioned this pull request Mar 16, 2026
The shutdown method was not clearing `lithops_client.futures` or freeing
output memory, causing test failures on Python 3.12 and 3.13.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Temporarily point lithops dep to jbusecke/lithops@fix-join-job-manager-localhostv2
to verify the upstream fix resolves the memory growth test on Linux CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…to exit

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…re measuring

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread virtualizarr/parallel.py
return iter(dask.compute(*delayed_tasks))

def shutdown(self, wait: bool = True, *, cancel_futures: bool = False) -> None:
self._futures.clear()
Collaborator

Can we just get rid of self._futures? It's not clear why we even hold the list of futures to begin with?

Collaborator Author

Um that is actually a good question @chuckwondo. @TomNicholas do you know why this was necessary in the case of dask?

Collaborator

Why is this the case for any of the executors?

@jbusecke
Collaborator Author

Ok, that was a frustrating day. The memory test seemed to have some sort of dependency on the running environment (it passed consistently on my mac, but never on CI). I have confirmed here that the actual end-to-end workflow which triggered this PR in the first place is indeed working as expected with this fix. I guess in the end I have come to agree with @chuckwondo that this behavior should not be tested in this package's context, and I have removed the test altogether. We are still checking that the futures are all cleared up (and I am asserting that the data_cleaner property is indeed set to False by our config modifications).

Added `.shutdown()` methods to custom executors to prevent unbounded memory growth in lithops.


Development

Successfully merging this pull request may close these issues.

Lithops FunctionExecutor memory leaks: atexit handler + unbounded futures list

3 participants