Description
Hopefully solving several points raised in #2223:
1. Containers not removed
- 11/02/2026: submission containers staying up forever
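One way to avoid leftover containers is to always pass `--rm` when starting the submission container, so Docker removes it as soon as it exits. A minimal sketch; the helper name and argument layout are assumptions, not the current compute_worker code:

```python
def submission_run_cmd(image, args):
    """Build the docker run command for a submission container.

    Passing --rm makes Docker remove the container as soon as it
    exits, so submission containers cannot stay up forever.
    (Hypothetical helper, not the current compute_worker code.)
    """
    return ["docker", "run", "--rm", image] + list(args)
```

For example, `submission_run_cmd("codalab/codalab-legacy:py39", ["python", "program.py"])` yields a command list that starts with `docker run --rm`.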
2. Wrong log when storage is full
When `docker pull` fails because the storage is full, we have no clear logs, and the submission then gets stuck in the "Running" state.
- Have the right error logs, and show them on the platform's UI
- Detect errors in `_get_container_image()`
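Error detection could amount to inspecting the `docker pull` stderr and mapping known failures to clear, user-facing messages. A sketch, assuming a hypothetical helper that `_get_container_image()` could call (the function name and message wording are not existing code):

```python
def classify_pull_error(stderr):
    """Map raw `docker pull` stderr to a clear, user-facing message.

    Hypothetical helper: detection like this could live near
    _get_container_image() so the platform UI shows the real cause
    (e.g. a full disk) instead of a generic failure.
    """
    if "no space left on device" in stderr.lower():
        return "Worker disk is full: docker pull failed with 'no space left on device'"
    return "docker pull failed: " + stderr.strip()
```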
3. Progress bar
Related: `show_progress()` and the progress bar add to the mess:
- Make `show_progress()` more robust (not treating missing keys as errors)
- Avoid printing multiple error lines like these (Compute worker - Improve status update and logs #2223):

```
2026-02-28 02:38:37.854 | ERROR | compute_worker:show_progress:137 - There was an error showing the progress bar
2026-02-28 02:38:37.854 | ERROR | compute_worker:show_progress:138 - 6
2026-02-28 02:38:37.955 | ERROR | compute_worker:show_progress:137 - There was an error showing the progress bar
2026-02-28 02:38:37.955 | ERROR | compute_worker:show_progress:138 - 1
```

4. Logs
- Sometimes no submission logs (Compute worker - Improve status update and logs #2223)
- Add logs at the start of submission container with metadata of the competition and submission
- Add a clear log in the compute worker container with the competition title when receiving a submission
- Similarly to other problems reported, sometimes we only have "Time limit exceeded" and no other logs (e.g. Problem with BEA 2019 Shared Task submissions #1994) (Compute worker - Improve status update and logs #2223)
- The docker pull progress bar should be shown during preparation
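Logging submission metadata at the start could look like the sketch below. The function name and message format are assumptions, not existing compute_worker code:

```python
import logging

logger = logging.getLogger("compute_worker")


def submission_banner(competition_title, submission_id, submitter):
    """Emit a clear banner with competition and submission metadata.

    Hypothetical helper: called when the worker receives a submission,
    and again at the top of the submission container's log, so every
    log stream starts with identifying metadata.
    """
    line = (
        f"Received submission {submission_id} for competition "
        f"'{competition_title}' (submitted by {submitter})"
    )
    logger.info(line)
    return line
```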
5. No space left
How should we manage the disks? Should we limit Docker image sizes?
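A simple guard could check free disk space before pulling a new image. A sketch; the path and threshold are assumptions, not current compute_worker behavior:

```python
import shutil


def disk_has_room(path="/var/lib/docker", min_free_gb=10.0):
    """Return True when the filesystem holding `path` has enough free space.

    Sketch: the worker could refuse new pulls (and emit a clear error)
    when free space drops below a threshold, instead of failing later
    with an opaque 'no space left on device'.
    """
    usage = shutil.disk_usage(path)
    return usage.free / 1e9 >= min_free_gb
```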
6. Submissions not marked as Failed
Submissions stuck in "Running" or "Scoring" status
- Submissions stuck in "Scoring" state instead of "Failed" when the compute worker crashes (Worker status to FAILED instead of SCORING or FINISHED in case of failure #2030, Compute worker - Improve status update and logs #2223)
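One way to avoid submissions stranded in a non-terminal state is to wrap the run in a try/except that always reports "Failed" before re-raising. A sketch with hypothetical callables (`run_submission` and `update_status` are assumptions, not the actual worker API):

```python
def run_with_terminal_status(run_submission, update_status):
    """Guarantee the submission ends in a terminal state.

    Sketch: whatever happens in-process, the submission is left in
    "Finished" or "Failed" rather than stuck in "Running"/"Scoring".
    """
    try:
        run_submission()
    except Exception as exc:
        update_status("Failed", detail=str(exc))
        raise
    update_status("Finished")
```

Note this only covers in-process exceptions; a hard crash of the whole worker still needs a server-side timeout or heartbeat to move the submission to "Failed".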
Related issues:
- Submissions stuck in "Running" or failing on compute worker ("non-zero return code") #2258 (grouped issue)
- submission in "Scoring" status for multiple hours on default queue #1184
- Similarly, it looks like the status gets stuck at "Preparing" when failing during this process.
Example failure during "Preparing":
```
[2025-09-18 11:25:05,234: ERROR/ForkPoolWorker-2] Task compute_worker_run[fd956bf5-3e2d-4168-ab48-f0896dc80993] raised unexpected: OSError(28, 'No space left on device')
Traceback (most recent call last):
[...]
OSError: [Errno 28] No space left on device
```

7. Duplication of submission files
8. To check
The log level is defined this way in `compute_worker.py`:

```python
configure_logging(
    os.environ.get("LOG_LEVEL", "INFO"), os.environ.get("SERIALIZED", "false")
)
```

Generally we want as many logs as possible, so we may want to default to the "DEBUG" log level.
8. Directory structure problem
9. Docker pull failing
- `docker pull` failing with:

```
Pull for image: codalab/codalab-legacy:py39 returned a non-zero exit code! Check if the docker image exists on docker hub.
```

Related issues:
- submission in "Scoring" status for multiple hours on default queue #1184
- Solution is always running #1278
- Submission stuck on scoring status (Twice) #1263

Solution:
- To have more logs, we need to update `compute_worker.py` so we print more logs in the logger (More logs when docker pull fails in compute_worker.py #1283).
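Printing more logs could mean capturing the pull's stderr and emitting it on failure instead of only the generic "non-zero exit code" line. A sketch; `pull_image` is a hypothetical helper (the `docker_cmd` parameter exists only so the failure path is easy to exercise), not the current `compute_worker.py` code:

```python
import logging
import subprocess

logger = logging.getLogger("compute_worker")


def pull_image(image, docker_cmd="docker"):
    """Run `docker pull` and log its stderr when it fails.

    Hypothetical helper: surfacing the captured stderr tells users
    whether the image is missing, the disk is full, etc., instead of
    only reporting a non-zero exit code.
    """
    proc = subprocess.run(
        [docker_cmd, "pull", image], capture_output=True, text=True
    )
    if proc.returncode != 0:
        logger.error(
            "Pull for image %s returned exit code %d: %s",
            image, proc.returncode, proc.stderr.strip(),
        )
        return False
    return True
```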
10. Logs at the wrong place
- Docker pull errors during scoring are written to the ingestion stderr instead of the scoring stderr (Docker pull in scoring #1204)
Solved by: Show error in scoring std_err #1214
11. No hostname in server status when status is "Preparing"
- The "Preparing" status means that the worker is downloading the necessary data and programs to run the submission. We should already have a hostname in the server status page during this phase, but it is not the case. (fixed in Worker status to FAILED instead of SCORING or FINISHED in case of failure #2030)
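Including the hostname from the very first status update could look like this sketch (the payload shape and function name are assumptions, not the actual status-update API):

```python
import socket


def status_payload(status):
    """Build a status-update payload that always carries the hostname.

    Hypothetical payload: including socket.gethostname() in every
    update means the server status page can show the worker hostname
    even during the early "Preparing" phase.
    """
    return {"status": status, "hostname": socket.gethostname()}
```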