Skip to content

Added script for PyTorch issues debugging#137

Closed
kodxana wants to merge 3 commits intorunpod:mainfrom
kodxana:main
Closed

Added script for PyTorch issues debugging#137
kodxana wants to merge 3 commits intorunpod:mainfrom
kodxana:main

Conversation

@kodxana
Copy link
Copy Markdown

@kodxana kodxana commented Mar 11, 2024

Added new command: gpu-test
Curently it's including one of my script to test out PyTorch (PyTorch,go file)
Saves debug informations to /workspace/gpu_diagnostics.json

What script does:
Gather all informations about host including:

  • pod id
  • machine id
  • CUDA version
  • PyTorch Version
  • GPU model

Runs test on all attached GPU's to makre sure PyTorch fully utilizes CUDA
Logs error messages into .json file for tech support to review.

Copy link
Copy Markdown
Contributor

@ef0xa ef0xa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mostly looks very good. Have a lot of small comments on general performance and 'go/house style' that are just meant to be pointers. Feel free to ignore the stuff labeled OPTIONAL or MINOR: just keep them in mind for next time.

Comment thread cmd/diagnostic/PyTorch.go Outdated
}
}
`
data := map[string]interface{}{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MINOR: General go HTTP/REST performance tips:

  • use a struct here rather than map[string]any to avoid extraneous allocations.
  • don't make a map to iterate through the headers: just do req.Header.Set("Content-Type", "application/json")
  • don't make a new http.Client for each request: they're safe to share between invocations. if you don't care which client to use, use http.DefaultClient
  • either or both of the client or request context should have a timeout.

Comment thread cmd/diagnostic/PyTorch.go
"github.com/spf13/cobra"
)

func getPodMachineID(podID, apiKey string) string {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function should return an error.

Comment thread cmd/diagnostic/PyTorch.go Outdated
}
jsonData, _ := json.Marshal(data)

req, _ := http.NewRequest("POST", url, bytes.NewBuffer(jsonData))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check this error and return it.

Comment thread cmd/diagnostic/PyTorch.go
headers := map[string]string{
"Content-Type": "application/json",
}
query := `
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OPTIONAL: this can be a const.

Comment thread cmd/diagnostic/PyTorch.go Outdated

var result map[string]interface{}
json.NewDecoder(resp.Body).Decode(&result)
if pod, ok := result["data"].(map[string]interface{})["pod"].(map[string]interface{}); ok {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, don't use map[string]any: just make a struct type that's the shape you want. in this case, it would be

type Result struct { 
    Data struct  { 
        Pod struct { 
            MachineID string  `json:"machineId"` 
         } `json:"pod"`
    } `json:"data"`
 }

Comment thread cmd/diagnostic/PyTorch.go
driverVersion := driverVersionRegex.FindStringSubmatch(output)
gpuName := gpuNameRegex.FindStringSubmatch(output)

info := map[string]string{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(MINOR, OPTIONAL): see above re: structs vs maps

Comment thread cmd/diagnostic/PyTorch.go

client := &http.Client{}
resp, err := client.Do(req)
if err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use return fmt.Errorf with %w instead of just printing the error.

Comment thread cmd/diagnostic/PyTorch.go
return parseNvidiaSMIOutput(string(output))
}

func getSystemInfo() map[string]interface{} {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(OPTIONAL, MINOR): this function is tiny and only called once. inline it.

Comment thread cmd/diagnostic/PyTorch.go
return results
}

func saveInfoToFile(info map[string]interface{}, filename string) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(OPTIONAL, MINOR): this function is tiny and only called once. inline it.

Comment thread cmd/diagnostic/PyTorch.go
return info
}

func getNvidiaSMIInfo() map[string]string {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(OPTIONAL, MINOR): this function is tiny and only called once. inline it.

@ef0xa ef0xa self-assigned this Mar 12, 2024
@DireLines
Copy link
Copy Markdown
Contributor

What will happen if people try to run this on the client? Is there anything telling them that it is a diagnostic to be run specifically through web terminal on the pod? I think somebody will try to run it locally and be confused by the output if not

@kodxana
Copy link
Copy Markdown
Author

kodxana commented Mar 12, 2024

What will happen if people try to run this on the client? Is there anything telling them that it is a diagnostic to be run specifically through web terminal on the pod? I think somebody will try to run it locally and be confused by the output if not

Most of the output in json file would be empty though script will try to run PyTorch test.

@TimPietruskyRunPod
Copy link
Copy Markdown
Member

Closing — the CLI focuses on resource management rather than in-pod debugging. Users can run GPU diagnostics directly on their pods via SSH (runpodctl ssh info <pod-id>). The doctor command handles CLI-level diagnostics. Thanks for the contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants