Conversation
|
/run-security-scan |
alexcos20
left a comment
There was a problem hiding this comment.
AI automated code review (Gemini 3).
Overall risk: high
Summary:
This pull request introduces significant changes to the compute-to-data (C2D) image building process, allowing image builds to consume job resources and duration. Key enhancements include tracking build start/stop timestamps, incorporating build time into the overall job duration, enforcing build time limits, and dynamically allocating resources (CPU, RAM) for the build process. A new unit test for image building has been added. However, there is a critical security concern regarding the commented-out authentication check for free compute jobs.
Comments:
• [INFO][style] The phrase "most likley are very limited" could be rephrased for better grammar and formality, e.g., "are likely very limited" or "are typically very limited."
• [INFO][performance] The CPU affinity reuse logic (const existing = this.cpuAllocations.get(jobId)) is a good optimization to avoid re-calculating CPU sets if already allocated for a job. This is a nice addition, although somewhat independent of the primary build image changes.
• [WARNING][performance] Changing timeoutMs to job.maxJobDuration * 1000 means the image build process can now consume the entire allocated job duration. While this allows for more complex builds, it implicitly reduces the time available for the actual algorithm computation. If a job has a maxJobDuration of 60 seconds and the build takes 59 seconds, only 1 second remains for the algorithm. This is handled by the check on line 1606, but it's important to consider if this is the desired default behavior for all paid jobs, especially smaller ones. Perhaps a minimum algorithm duration guarantee or a separate maxBuildDuration might be considered in the future.
• [INFO][other] Adding the docker.getImage(job.containerImage).inspect() check after the build stream completes is a robust improvement. It ensures that even if the Docker build stream reports success, the image genuinely exists and can be used, preventing potential downstream issues.
• [ERROR][security] CRITICAL SECURITY CONCERN: The validateTokenOrSignature call for FreeComputeStartHandler has been commented out. This bypasses the authentication and authorization check for all free compute jobs. This means anyone could potentially request free compute jobs without any validation of their identity or nonce, which is a significant security vulnerability. This change needs immediate justification, or the code should be restored. Free jobs, even without payment, still require a validated consumerAddress to prevent abuse and track usage correctly. If this is intentional, it should be clearly documented as a major design decision with associated risks, but ideally, authentication should remain for all job types.
• [INFO][other] This new unit test file is a great addition, providing focused testing for the buildImage functionality and improving test coverage for these critical new features.
fix in 11806ae |
|
/run-security-scan |
alexcos20
left a comment
There was a problem hiding this comment.
AI automated code review (Gemini 3).
Overall risk: medium
Summary:
This pull request introduces significant enhancements to the C2D compute engine, primarily focusing on allowing and properly accounting for Docker image builds within compute jobs. It distinguishes between free and paid environments, ensuring that image building is only permitted in environments explicitly configured for it and that build time is accurately factored into job duration and resource consumption. The changes span documentation, type definitions, core compute engine logic, database persistence, configuration schemas, and include a dedicated unit test for the image build process's robustness.
Comments:
• [INFO][other] This change correctly incorporates the build duration into the overall job duration, which is crucial for accurate billing and job lifecycle management. This is a core part of the feature.
• [INFO][security] This allowImageBuild check is important. It prevents users from circumventing free environment restrictions by submitting Dockerfiles for image building where it's not allowed, which could lead to resource abuse. Good enforcement of policies.
• [INFO][performance] This new logic ensures that if the image build process itself consumes the entire maxJobDuration, the job is terminated early and marked as finished. This prevents jobs from running indefinitely on the build phase if it's too long, respecting the allocated time budget. This is a good failure mode handling.
• [INFO][other] Adjusting the job expiry to account for build duration ensures that the actual algorithm execution time adheres to the maxJobDuration minus the time spent building the image. This is vital for correct resource accounting and charging.
• [INFO][performance] Reusing existing CPU allocations for a job is a nice optimization. It avoids unnecessary re-allocations and ensures CPU affinity if a job's CPU request changes or is re-evaluated during its lifecycle, though for this specific change it's more about preventing re-allocation if the function is called multiple times for the same job. This improves efficiency and predictability.
• [INFO][other] Setting the timeoutMs for buildImage to job.maxJobDuration * 1000 means the image build itself is constrained by the overall job duration. This aligns the build phase with the job's resource budget, which is a key part of this feature.
• [INFO][performance] Dynamically setting Docker image build options (memory, cpuquota) based on the job's requested resources is a significant improvement. This ensures the build process itself operates within the allocated resource limits, preventing resource exhaustion on the node during image builds and reflecting the actual cost implications.
• [WARNING][bug] This added check after the build stream completes, which inspects the Docker image, is a crucial robustness improvement. The buildStream.emit('end') only indicates that the Docker daemon finished processing the stream, not necessarily that an image was successfully created and is available. Explicitly checking for the image prevents silent failures and ensures the job transitions correctly. Without this, a build could 'complete' but yield no usable image, leading to subsequent job failures.
Could we add a test case where the build stream succeeds but docker.getImage().inspect() fails? Oh, wait, the buildImage.test.ts already covers this with getImage: sinon.stub().returns({ inspect: sinon.stub().rejects(new Error('no such image')) }). Excellent!
• [INFO][other] The addition of a new dedicated unit test file for buildImage is excellent. It demonstrates good test coverage for the critical new functionality and robustness checks introduced in this PR. The tests specifically target the new docker.getImage().inspect() logic after a build, which is a key failure point.
• [INFO][other] Adding allowImageBuild to the schema with a default of false is a good practice. It ensures that the feature is opt-in and provides a safe default, preventing unexpected behavior on existing installations.
Closes #1294
Summary
This PR tightens Docker compute job accounting and image-build behavior:
maxJobDuration(same limit applies to build + run).docker.getImage(...).inspect(); if missing, the job is marked BuildImageFailed.allocateCpus()now reuses an existing allocation for a jobId rather than allocating new cores.Implementation notes
buildStartTimestamp/buildStopTimestamptoDBComputeJoband persisted them to sqlite internal blob.buildImage()now records build start/stop timestamps and uses per-job CPU/RAM build options.start > 0,stop >= start, numeric), and still appliesmaxJobDurationandminJobDurationrules.