User Story
As a representative of the LHCb DIRAC community at Barcelona, I need a system that lets us group single-core jobs so that, when a group reaches one of our worker nodes, its jobs can be executed in parallel using as many cores as possible.
Feature Description
Currently we are underutilizing our resources, as only single-core jobs are being executed on worker nodes with 112 cores. This is harming performance greatly, as we could theoretically reach a 112x improvement.
This is because these worker nodes have no external connectivity, so we are unable to accept pilot jobs. On top of that, the main simulation program being executed is Gauss, which uses only a single core for its simulations.
This problem has existed for a while, but we need a solution now more than ever, and with the improvements made to the PushJobAgent, this is the perfect moment to address it.
The idea would be to create some kind of intermediary CE, sitting between the PushJobAgent and an AREX CE, that receives single-core jobs.
Those single-core jobs should be grouped or bundled into a single multiprocessor job that is sent to AREX as just one job.
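To make the bundling step concrete, here is a minimal sketch in plain Python. It does not use any DIRAC or AREX API: `SingleCoreJob`, `BundledJob`, and `bundle_jobs` are hypothetical names invented for illustration, and the bundle size is assumed to equal the number of cores on a worker node.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class SingleCoreJob:
    # Hypothetical stand-in for a matched single-core job.
    job_id: int
    command: List[str]


@dataclass
class BundledJob:
    # Hypothetical multiprocessor job wrapping several single-core jobs.
    jobs: List[SingleCoreJob] = field(default_factory=list)

    @property
    def processors(self) -> int:
        # One core is requested per wrapped single-core job.
        return len(self.jobs)


def bundle_jobs(jobs: Iterable[SingleCoreJob], cores_per_node: int) -> Iterator[BundledJob]:
    """Group single-core jobs into bundles of at most `cores_per_node` jobs."""
    if cores_per_node < 1:
        raise ValueError("cores_per_node must be at least 1")
    current: List[SingleCoreJob] = []
    for job in jobs:
        current.append(job)
        if len(current) == cores_per_node:
            yield BundledJob(current)
            current = []
    if current:
        # The last bundle may be smaller than a full node.
        yield BundledJob(current)
```

With 112-core nodes, each full bundle would be submitted as one job requesting 112 processors. A real implementation would probably also flush a partial bundle after a timeout, so that jobs do not wait indefinitely for a full node.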
Finally, the worker node should split the bundle back into its single-core jobs and execute them all at once, using all of the node's cores if possible.
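The worker-node side could look roughly like the following sketch, which uses only the Python standard library. `run_bundle` is a hypothetical helper, not an existing DIRAC function, and it assumes each single-core job can be expressed as a command line that runs in its own process.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def run_bundle(commands: Dict[int, List[str]], max_workers: int) -> Dict[int, int]:
    """Run every command as a separate process and return exit codes by job id.

    Threads are enough here: each one only waits for its own subprocess,
    so the real work is spread across cores by the operating system.
    """

    def run_one(command: List[str]) -> int:
        # check=False so that one failing job does not abort the others.
        return subprocess.run(command, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {job_id: pool.submit(run_one, cmd) for job_id, cmd in commands.items()}
        return {job_id: future.result() for job_id, future in futures.items()}


if __name__ == "__main__":
    # Illustrative payloads only; real jobs would be the bundled simulations.
    demo = {
        1: [sys.executable, "-c", "pass"],
        2: [sys.executable, "-c", "import sys; sys.exit(3)"],
    }
    print(run_bundle(demo, max_workers=2))
```

Returning an exit code per job id keeps the individual job statuses separable, which matters if the bundling is to stay transparent to the user.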
The solution must be easy for administrators to configure and transparent to the user.
Definition of Done
Alternatives Considered
Modifying the Matcher so it matches multiple jobs at a time instead of one by one could also work.
Related Issues
No response
Additional Context
No response