
Support torch.distributed as alias for pytorch distribution type #44968

Open
Copilot wants to merge 4 commits into main from copilot/support-pytorch-torch-distributed

Conversation


Copilot AI commented Feb 3, 2026

AML Designer stores PyTorch distribution configs as type: torch.distributed for SDK v1.5 compatibility, but SDK v2 only recognizes type: pytorch, causing deserialization failures.

Changes

  • Constants: Added TORCH_DISTRIBUTED = "torch.distributed" to DistributionType
  • Entity mappings: Both pytorch and torch.distributed now map to PyTorchDistribution in DISTRIBUTION_TYPE_MAP and to RestDistributionType.PY_TORCH in SDK_TO_REST (sketched after this list)
  • Schema validation: PyTorchDistributionSchema accepts both values in allowed_values
  • Pipeline helpers: from_dict_to_rest_distribution() handles both types
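
A minimal sketch of the resulting constants and mapping, assuming names from the PR description and the file list in the review below; this is not the exact source:

from azure.ai.ml.entities import PyTorchDistribution

# Sketch only: shapes are assumed from the PR description.
class DistributionType:
    MPI = "mpi"
    TENSORFLOW = "tensorflow"
    PYTORCH = "pytorch"
    TORCH_DISTRIBUTED = "torch.distributed"  # legacy alias for SDK v1.5 / Designer payloads
    RAY = "ray"

# Both spellings resolve to the same entity class; SDK_TO_REST gains the same
# duplicate entry, with both keys mapping to RestDistributionType.PY_TORCH.
DISTRIBUTION_TYPE_MAP = {
    DistributionType.PYTORCH: PyTorchDistribution,
    DistributionType.TORCH_DISTRIBUTED: PyTorchDistribution,
    # mpi / tensorflow / ray entries unchanged
}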

Behavior

# Imports assumed from the PR's file list; DistributionConfiguration is internal.
from azure.ai.ml.entities import PyTorchDistribution
from azure.ai.ml.entities._job.distribution import DistributionConfiguration

# Service returns torch.distributed
service_response = {'type': 'torch.distributed', 'process_count_per_instance': 4}

# SDK deserializes to PyTorchDistribution
dist = DistributionConfiguration._from_rest_object(service_response)
assert isinstance(dist, PyTorchDistribution)
assert dist.type == "pytorch"  # Normalized internally

# SDK serializes back as pytorch
rest_obj = dist._to_rest_object()  # Maps to RestDistributionType.PY_TORCH

The normalization is unidirectional: torch.distributed → pytorch during deserialization. PyTorchDistribution always uses pytorch internally, maintaining consistency while accepting the legacy alias from service responses.
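
A hypothetical helper illustrating that one-way step; the function name and placement are assumptions, not the shipped code:

def _normalize_distribution_type(value: str) -> str:
    # Matching is case-insensitive, per the test coverage listed below.
    value = value.lower()
    if value == "torch.distributed":  # legacy alias from AML Designer / SDK v1.5
        return "pytorch"
    return value

assert _normalize_distribution_type("Torch.Distributed") == "pytorch"
assert _normalize_distribution_type("pytorch") == "pytorch"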

Testing

  • Command job schema round-trip with both types
  • Pipeline helper function validation (see the sketch after this list)
  • Case-insensitive type matching
  • YAML configuration with torch.distributed
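
For example, the pipeline-helper coverage could be written as a single parametrized test; the import path follows the file list in the review below, the dict shape follows the test excerpt further down, and the exact assertions are illustrative:

import pytest
from azure.ai.ml.entities._job.pipeline._pipeline_job_helpers import from_dict_to_rest_distribution

@pytest.mark.parametrize("dist_type", ["pytorch", "PyTorch", "torch.distributed", "TORCH.DISTRIBUTED"])
def test_pytorch_alias_accepted(dist_type):
    rest = from_dict_to_rest_distribution(
        {"distribution_type": dist_type, "process_count_per_instance": 2}
    )
    assert rest.process_count_per_instance == 2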


Original prompt

This section details the original issue to resolve.

<issue_title>Make AML SDK v2 support the type: pytorch and type: torch.distributed interchangeably</issue_title>
<issue_description>Incident-662922382 Details - IcM

Make AML SDK v2 support the type: pytorch and type: torch.distributed interchangeably.

DETAILS:

Going by the whole thread "Unable to use DistributedComponent in AML SDK v2":

  1. AML SDK v2 uses type: pytorch. AML Designer uses type: torch.distributed in order to have backwards compatibility with AML SDK v1.5.
  2. @Ming Gu has added support for using type: pytorch in AML Designer by making type: pytorch interchangeable with type: torch.distributed in the AML Designer backend.
  3. As my experiment with uploading a component of type: pytorch showed, AML Designer actually stores the value as type: torch.distributed to keep backwards compatibility with AML SDK v1.5.
  4. AML SDK v2 gets the value type: torch.distributed and fails, as it does not recognize type: torch.distributed.

We have 2 options.

  1. Make AML Designer store the value as type: pytorch and change AML SDK v1.5 to support type: pytorch. But this cannot happen, as AML SDK v1.5 is deprecated.
  2. Make AML SDK v2 support the type: pytorch and type: torch.distributed interchangeably.

We have to go with option 2.

I have an idea to do something like:


class DistributionType:
    MPI = "mpi"
    TENSORFLOW = "tensorflow"
    PYTORCH = ["pytorch", "torch.distributed"]
    RAY = "ray"

and for checking presence we can do something like:

if value in DistributionType.PYTORCH:
    ...
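
For reference, the shipped change (see Changes above) keeps string-valued constants and adds a separate TORCH_DISTRIBUTED alias instead: a list-valued PYTORCH constant would break any direct equality check such as value == DistributionType.PYTORCH.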

We have to decide whether this will be an internal change or whether the documentation will change as well. We have to consider the following places:

  1. CLI (v2) command component YAML schema - Azure Machine Learning | Microsoft Learn

  2. azureml-examples/sdk/python/jobs/single-step/pytorch/distributed-training/distributed-cifar10.ipynb at main · Azure/azureml-examples

Since this change requires understanding of our public-facing contract, adding Amit Chauhan to guide the change from the AML SDK v2 side.</issue_description>

<agent_instructions>Make the pull request and we will guide further.</agent_instructions>

Comments on the Issue (you are @copilot in this section)



@mohammadsheraj

Please go ahead and make the changes.

Copilot AI and others added 3 commits February 3, 2026 03:59
- Added TORCH_DISTRIBUTED constant as legacy alias in DistributionType
- Updated DISTRIBUTION_TYPE_MAP to map both pytorch and torch.distributed to PyTorchDistribution
- Updated PyTorchDistributionSchema to accept both values in allowed_values
- Added test YAML file with torch.distributed type
- Added unit test to verify both values work interchangeably

Co-authored-by: mohammadsheraj <221899694+mohammadsheraj@users.noreply.github.com>
- Modified from_dict_to_rest_distribution to accept both pytorch and torch.distributed
- Updated error message to include torch.distributed in valid types
- Added comprehensive unit tests for distribution helper functions

Co-authored-by: mohammadsheraj <221899694+mohammadsheraj@users.noreply.github.com>
- Added TORCH_DISTRIBUTED mapping to RestDistributionType.PY_TORCH
- Ensures proper REST API conversion for both pytorch and torch.distributed

Co-authored-by: mohammadsheraj <221899694+mohammadsheraj@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Make AML SDK v2 support pytorch and torch.distributed interchangeably" to "Support torch.distributed as alias for pytorch distribution type" on Feb 3, 2026
Copilot AI requested a review from mohammadsheraj February 3, 2026 04:07
@mohammadsheraj mohammadsheraj removed their assignment Feb 3, 2026
@mohammadsheraj mohammadsheraj removed their request for review February 3, 2026 04:18
@mohammadsheraj mohammadsheraj marked this pull request as ready for review February 3, 2026 04:20
Copilot AI review requested due to automatic review settings February 3, 2026 04:20

Copilot AI left a comment


Pull request overview

This PR adds support for torch.distributed as a legacy alias for the pytorch distribution type, enabling backwards compatibility with AML SDK v1.5 while maintaining the existing SDK v2 behavior.

Changes:

  • Added TORCH_DISTRIBUTED = "torch.distributed" constant to DistributionType
  • Updated entity mappings (SDK_TO_REST and DISTRIBUTION_TYPE_MAP) to handle both pytorch and torch.distributed types, mapping them to the same PyTorch classes
  • Modified schema validation to accept both values in PyTorchDistributionSchema.allowed_values
  • Updated pipeline helper from_dict_to_rest_distribution() to recognize both distribution types
  • Added comprehensive test coverage including YAML configuration, schema round-trips, and case-insensitive handling

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Summary per file:

  • sdk/ml/azure-ai-ml/azure/ai/ml/constants/_job/job.py: Adds TORCH_DISTRIBUTED constant for backwards compatibility
  • sdk/ml/azure-ai-ml/azure/ai/ml/entities/_job/distribution.py: Updates SDK_TO_REST and DISTRIBUTION_TYPE_MAP to map both types to PyTorch classes
  • sdk/ml/azure-ai-ml/azure/ai/ml/_schema/job/distribution.py: Modifies PyTorchDistributionSchema to accept both pytorch and torch.distributed in validation
  • sdk/ml/azure-ai-ml/azure/ai/ml/entities/_job/pipeline/_pipeline_job_helpers.py: Updates from_dict_to_rest_distribution() to handle both types and updates the error message
  • sdk/ml/azure-ai-ml/tests/test_configs/command_job/dist_job_pytorch_torch_distributed.yml: Adds a YAML test fixture using the torch.distributed type
  • sdk/ml/azure-ai-ml/tests/command_job/unittests/test_command_job_schema.py: Adds a test for interchangeability of the pytorch and torch.distributed types
  • sdk/ml/azure-ai-ml/tests/pipeline_job/unittests/test_distribution_helpers.py: Adds tests for the pipeline helper function handling both types

if target_type == "ray":
    return Ray(**distribution_dict)
- msg = "Distribution type must be pytorch, mpi, tensorflow or ray: {}".format(target_type)
+ msg = "Distribution type must be pytorch, torch.distributed, mpi, tensorflow or ray: {}".format(target_type)

Copilot AI Feb 3, 2026


The error message lists "torch.distributed" separately from "pytorch", which might confuse users since torch.distributed is documented as an internal alias for backwards compatibility. Consider updating the message to either:

  1. Only mention "pytorch" as the valid type and omit "torch.distributed" from user-facing error messages (since it's a legacy alias)
  2. Clarify that "torch.distributed" is a legacy alias

For consistency with the PR's intention to support the alias transparently, option 1 would be more user-friendly.

Suggested change
- msg = "Distribution type must be pytorch, torch.distributed, mpi, tensorflow or ray: {}".format(target_type)
+ msg = "Distribution type must be pytorch, mpi, tensorflow or ray: {}".format(target_type)

Comment on lines +101 to +105
# Verify roundtrip for torch.distributed
rest_obj = job_torch_distributed._to_rest_object()
reconstructed = CommandJob._load_from_rest(rest_obj)
assert isinstance(reconstructed.distribution, PyTorchDistribution)
assert reconstructed.distribution.process_count_per_instance == 4

Copilot AI Feb 3, 2026


The test validates that torch.distributed deserializes correctly and creates a PyTorchDistribution object. However, it doesn't verify that the resulting REST object uses the correct distribution type (RestDistributionType.PY_TORCH). Consider adding an assertion to check that the REST representation is correct, for example:

rest_obj = job_torch_distributed._to_rest_object()
assert rest_obj.properties.distribution.distribution_type == "PyTorch"

This would ensure that the SDK_TO_REST mapping is working correctly and that the alias normalization is complete.

Comment on lines +25 to +34
def test_from_dict_to_rest_distribution_torch_distributed(self):
    """Test that torch.distributed type is properly handled"""
    distribution_dict = {
        "distribution_type": "torch.distributed",
        "process_count_per_instance": 4
    }
    result = from_dict_to_rest_distribution(distribution_dict)
    assert result is not None
    assert hasattr(result, 'process_count_per_instance')
    assert result.process_count_per_instance == 4

Copilot AI Feb 3, 2026


The test validates that the from_dict_to_rest_distribution helper function accepts torch.distributed, but it doesn't verify that both pytorch and torch.distributed types result in the same REST object type. Consider adding an assertion to verify that both types produce equivalent REST objects:

# Compare results from both types
pytorch_dict = {"distribution_type": "pytorch", "process_count_per_instance": 4}
torch_dist_dict = {"distribution_type": "torch.distributed", "process_count_per_instance": 4}
pytorch_result = from_dict_to_rest_distribution(pytorch_dict)
torch_dist_result = from_dict_to_rest_distribution(torch_dist_dict)
assert type(pytorch_result) == type(torch_dist_result)

This would strengthen the test by verifying that the alias normalization produces equivalent results.

Comment on lines +101 to +105
# Verify roundtrip for torch.distributed
rest_obj = job_torch_distributed._to_rest_object()
reconstructed = CommandJob._load_from_rest(rest_obj)
assert isinstance(reconstructed.distribution, PyTorchDistribution)
assert reconstructed.distribution.process_count_per_instance == 4

Copilot AI Feb 3, 2026


The test validates round-trip serialization but doesn't verify what value the type field has after deserialization and re-serialization. According to the PR description, torch.distributed should be normalized to pytorch internally. Consider adding an assertion to verify the normalized type after round-trip:

rest_obj = job_torch_distributed._to_rest_object()
reconstructed = CommandJob._load_from_rest(rest_obj)
# Verify that the type is normalized to "pytorch" after round-trip
assert reconstructed.distribution.type == DistributionType.PYTORCH

This ensures that the internal normalization is working correctly and that torch.distributed is transparently converted to pytorch.



Development

Successfully merging this pull request may close these issues.

Make AML SDK v2 support the type: pytorch and type: torch.distributed interchangeably
