retest: preparing to debug difference gke/ce by vsoch · Pull Request #78 · converged-computing/performance-study

vsoch · 2024-12-13T03:24:18Z

So far I have found differences in MTU and in whether the TIER1 network is used, both of which would influence bandwidth. I am preparing a size32 study directory in anticipation of testing this. We were never able to get COMPACT mode on compute engine, so I expect that will still be the case. I haven't found other differences yet, but I am still looking.
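To make the "MTU influences bandwidth" point concrete, here is a minimal sketch of the arithmetic, not a measurement from this study. It assumes plain TCP over IPv4 with 40 bytes of headers per packet and no options. The two MTU values in the usage loop are illustrative placeholders, since the thread does not state the values that were compared.

```python
import math

# Assumption: 20-byte IPv4 header + 20-byte TCP header, no options.
IP_TCP_HEADER_BYTES = 40


def payload_per_packet(mtu: int) -> int:
    """Bytes of application payload that fit in one packet at this MTU."""
    return mtu - IP_TCP_HEADER_BYTES


def packets_for_message(message_bytes: int, mtu: int) -> int:
    """Number of packets needed to carry a message of the given size."""
    return math.ceil(message_bytes / payload_per_packet(mtu))


def payload_efficiency(mtu: int) -> float:
    """Fraction of each full-size packet that is application payload."""
    return payload_per_packet(mtu) / mtu


if __name__ == "__main__":
    one_mib = 1024 * 1024
    # Illustrative MTUs only; substitute the values from your own networks.
    for mtu in (1460, 8896):
        print(mtu, packets_for_message(one_mib, mtu), round(payload_efficiency(mtu), 4))
```

A larger MTU mostly reduces the number of packets per message (and so the per-packet overhead), while the payload fraction per packet changes by only a few percent. That is consistent with MTU being worth checking without necessarily being the dominant factor.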


vsoch · 2025-02-23T20:06:26Z

This likely won't be merged, but I'll add the results (from when I ran them) for transparency. This thread is from December 15th 2024.

Some notes on retesting compute engine, following up on the notes I made above. First, we still can't get COMPACT. Even for 10 nodes, the instances spin indefinitely: provisioning reaches some timeout around 15-16 minutes and then starts again, and I think this would go on forever. These are c2d-standard-112.

I'm going to restart without COMPACT.

It's not looking any faster, but I'll wait until I've done 3 iterations at both sizes to say that for sure.
I confirmed the MTU, and that lammps (on a node) is using all the CPU, which suggests that the lower reported CPU utilization is still due to the network.
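One way to confirm the MTU from inside a node is to read it from sysfs. The sketch below is an assumption about how that check could be scripted, not the method used in this thread. It relies on the standard Linux layout `/sys/class/net/<interface>/mtu`; the function names and the `expected_mtu` parameter are hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def read_mtus(sysfs_net: str = "/sys/class/net") -> dict[str, int]:
    """Return {interface_name: mtu} for every interface found under sysfs_net."""
    mtus = {}
    for iface in sorted(Path(sysfs_net).iterdir()):
        mtu_file = iface / "mtu"
        # Skip entries that are not interfaces (no readable mtu file).
        if mtu_file.is_file():
            mtus[iface.name] = int(mtu_file.read_text().strip())
    return mtus


def interfaces_below(expected_mtu: int, mtus: dict[str, int], skip=("lo",)) -> list[str]:
    """Names of interfaces (ignoring those in skip) whose MTU is below expected_mtu."""
    return [name for name, mtu in mtus.items() if name not in skip and mtu < expected_mtu]


if __name__ == "__main__":
    for name, mtu in read_mtus().items():
        print(f"{name}: {mtu}")
```

Running this on every node and comparing the output is a quick way to catch a node where the higher MTU did not take effect.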

Results! This is for size 32.

CPU utilization is a tiny bit better compared to the lower MTU / non-premium setup. Surprisingly, it is actually better overall on the compute engine environments (first plot), despite the overall runtime being slower.
Matom steps per second is still better on GKE, and MTU/TIER-1 didn't seem to impact the compute engine environments (second plot).
Wall time is the tiniest bit lower, but it's nothing to write home about (third plot). It is still a minute or more slower than GKE, and I did fewer iterations.
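Because these comparisons rest on only a few iterations per environment, it can help to check whether a gap in wall time is larger than the run-to-run spread. The helper below is a crude heuristic sketch, not a statistical test and not something from the study's analysis code. Both function names are hypothetical, and no study data is included.

```python
from statistics import mean, stdev


def summarize(wall_times):
    """Return (mean, sample stdev) of wall times; stdev is 0.0 for a single run."""
    times = list(wall_times)
    spread = stdev(times) if len(times) > 1 else 0.0
    return mean(times), spread


def gap_exceeds_noise(times_a, times_b):
    """True if the difference in means is larger than the sum of both stdevs."""
    mean_a, sd_a = summarize(times_a)
    mean_b, sd_b = summarize(times_b)
    return abs(mean_a - mean_b) > (sd_a + sd_b)
```

Passing the per-iteration wall times for two environments gives a rough yes/no on whether the difference stands out from the noise. With three or fewer iterations the spread estimate is itself very uncertain, so treat the answer as a hint only.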

My early conclusions:

Bumping MTU and adding TIER-1 add complexity (and, for the latter, cost) and might give tiny improvements to CPU utilization / runtime, but TIER-1 is probably not worth the cost (though this is hugely app dependent). MTU is easy enough to bump.
I look at these plots and (largely) see no change that I find interesting.
Using cluster-toolkit seemed to break the NFS and added complexity.
The cause could still be COMPACT, which we can get in GKE but not here.

TLDR: I would not blame the difference between GKE and compute engine on MTU or TIER-1, at least for LAMMPS. I don't know what else we could look at that we did "wrong", because we can't get COMPACT or better resources. Anyway, I guess I burned a few hundred dollars and a big chunk of today, but it was worth a try. I had tiny hopes.

retest: preparing to debug difference gke/ce

c58ce91

So far I have found differences in MTU and using (or not using) TIER1 network, which would influence bandwidth. I am preparing a size32 study directory anticipating testing this. Signed-off-by: vsoch <vsoch@users.noreply.github.com>

vsoch added the wontfix label ("This will not be worked on") on Feb 23, 2025

vsoch wants to merge 1 commit into main from prepare-compute-engine-retest
