Merged (27 commits)
2 changes: 1 addition & 1 deletion CITATION
@@ -1,3 +1,3 @@
Please cite as:

Dune Collaboration: "DUNE Computing Tutorial" Version 2024.01
Dune Collaboration: "DUNE Computing Tutorial" Version 2025.01
18 changes: 9 additions & 9 deletions _episodes/02-submit-jobs-w-justin.md
@@ -1,27 +1,27 @@
---
title: Submit grid jobs with JustIn
title: New justIN Job Submission System
teaching: 20
exercises: 0
questions:
- How to submit realistic grid jobs with JustIn
- How to submit realistic grid jobs with justIN
objectives:
- Demonstrate use of [justIn](https://dunejustin.fnal.gov) for job submission with more complicated setups.
- Demonstrate use of [justIN](https://dunejustin.fnal.gov) for job submission with more complicated setups.
keypoints:
- Always, always, always prestage input datasets. No exceptions.
---

# PLEASE USE THE NEW [justIn](https://dunejustin.fnal.gov) SYSTEM INSTEAD OF POMS
# PLEASE USE THE NEW [justIN](https://dunejustin.fnal.gov) SYSTEM INSTEAD OF POMS

__A simple [justIn](https://dunejustin.fnal.gov) Tutorial is currently in docdb at: [JustIn Tutorial](https://docs.dunescience.org/cgi-bin/sso/RetrieveFile?docid=30145)__
__A simple [justIN](https://dunejustin.fnal.gov) Tutorial is currently in docdb at: [justIN Tutorial](https://docs.dunescience.org/cgi-bin/sso/RetrieveFile?docid=30145)__

A more detailed tutorial is available at:
[JustIn Docs](https://dunejustin.fnal.gov/docs/)
[justIN Docs](https://dunejustin.fnal.gov/docs/)

The [justIn](https://dunejustin.fnal.gov) system is described in detail at:
The [justIN](https://dunejustin.fnal.gov) system is described in detail at:

__[JustIn Home](https://dunejustin.fnal.gov/dashboard/)__
__[justIN Home](https://dunejustin.fnal.gov/dashboard/)__

__[JustIn Docs](https://dunejustin.fnal.gov/docs/)__
__[justIN Docs](https://dunejustin.fnal.gov/docs/)__


> ## Note: More documentation coming soon
46 changes: 21 additions & 25 deletions _episodes/07-grid-job-submission.md
@@ -1,5 +1,5 @@
---
title: Jobsub Grid Job Submission and Common Errors - still 2024 version
title: Jobsub Grid Job Submission and Common Errors (SPECIAL PURPOSE)
teaching: 65
exercises: 0
questions:
@@ -68,8 +68,8 @@ The past few months have seen significant changes in how DUNE (as well as other
First, log in to a `dunegpvm` machine. Then you will need to set up the job submission tools (`jobsub`). If you set up `dunesw` it will be included, but if not, you need to run:

~~~
mkdir -p /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_sep2025 # if you have not done this before
mkdir -p /pnfs/dune/scratch/users/${USER}/sep2025tutorial
mkdir -p /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_jan2026 # if you have not done this before
mkdir -p /pnfs/dune/scratch/users/${USER}/jan2026tutorial
~~~
{: .language-bash}

@@ -190,16 +190,16 @@ You will have to change the last line with your own submit file instead of the p
First, we should make a tarball. Here is what we can do (assuming you are starting from /exp/dune/app/users/username/):

```bash
cp /exp/dune/app/users/kherner/setupsep2025tutorial-grid.sh /exp/dune/app/users/${USER}/
cp /exp/dune/app/users/kherner/sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid /exp/dune/app/users/${USER}/sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid
cp /exp/dune/app/users/kherner/setupjan2026tutorial-grid.sh /exp/dune/app/users/${USER}/
cp /exp/dune/app/users/kherner/jan2026tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid /exp/dune/app/users/${USER}/jan2026tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid
```

Before we continue, let's examine these files a bit. We will source the first one in our job script, and it will set up the environment for us.

~~~
#!/bin/bash

DIRECTORY=sep2025tutorial
DIRECTORY=jan2026tutorial
# we cannot rely on "whoami" in a grid job. We have no idea what the local username will be.
# Use the GRID_USER environment variable instead (set automatically by jobsub).
USERNAME=${GRID_USER}
@@ -217,40 +217,38 @@ mrbslp


Now let's look at the difference between the setup-grid script and the plain setup script.
Assuming you are currently in the /exp/dune/app/users/username directory:
Assuming you are currently in the `/exp/dune/app/users/$USER` directory:

```bash
diff sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid
diff jan2026tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup jan2026tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid
```

~~~
< setenv MRB_TOP "/exp/dune/app/users/<username>/sep2025tutorial"
< setenv MRB_TOP_BUILD "/exp/dune/app/users/<username>/sep2025tutorial"
< setenv MRB_SOURCE "/exp/dune/app/users/<username>/sep2025tutorial/srcs"
< setenv MRB_INSTALL "/exp/dune/app/users/<username>/sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof"
< setenv MRB_TOP "/exp/dune/app/users/<username>/jan2026tutorial"
< setenv MRB_TOP_BUILD "/exp/dune/app/users/<username>/jan2026tutorial"
< setenv MRB_SOURCE "/exp/dune/app/users/<username>/jan2026tutorial/srcs"
< setenv MRB_INSTALL "/exp/dune/app/users/<username>/jan2026tutorial/localProducts_larsoft_v09_72_01_e20_prof"
---
> setenv MRB_TOP "${INPUT_TAR_DIR_LOCAL}/sep2025tutorial"
> setenv MRB_TOP_BUILD "${INPUT_TAR_DIR_LOCAL}/sep2025tutorial"
> setenv MRB_SOURCE "${INPUT_TAR_DIR_LOCAL}/sep2025tutorial/srcs"
> setenv MRB_INSTALL "${INPUT_TAR_DIR_LOCAL}/sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof"
> setenv MRB_TOP "${INPUT_TAR_DIR_LOCAL}/jan2026tutorial"
> setenv MRB_TOP_BUILD "${INPUT_TAR_DIR_LOCAL}/jan2026tutorial"
> setenv MRB_SOURCE "${INPUT_TAR_DIR_LOCAL}/jan2026tutorial/srcs"
> setenv MRB_INSTALL "${INPUT_TAR_DIR_LOCAL}/jan2026tutorial/localProducts_larsoft_v09_72_01_e20_prof"
~~~

As you can see, we have switched from the hard-coded directories to directories defined by environment variables; the `INPUT_TAR_DIR_LOCAL` variable will be set for us (see below).
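The pattern can be sketched outside of a real job; here `INPUT_TAR_DIR_LOCAL` is faked with a stand-in path (in a real job it is set by jobsub on the worker node) to show how the grid copy of `setup-grid` resolves its paths:

```shell
# Illustration only: fake the variable jobsub would set on a worker node,
# then resolve MRB_TOP the way the setup-grid script does.
INPUT_TAR_DIR_LOCAL=/srv/fake_worker_dir     # set by jobsub in a real job
MRB_TOP="${INPUT_TAR_DIR_LOCAL}/jan2026tutorial"
echo "$MRB_TOP"                              # /srv/fake_worker_dir/jan2026tutorial
```

Because nothing is hard-coded, the same setup script works wherever the tarball happens to land.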
Now, let's actually create our tar file. Again assuming you are in `/exp/dune/app/users/kherner/sep2025tutorial/`:
Now, let's actually create our tar file. Again assuming you are in `/exp/dune/app/users/kherner/jan2026tutorial/`:
```bash
tar --exclude '.git' -czf sep2025tutorial.tar.gz sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof sep2025tutorial/work setupsep2025tutorial-grid.sh
tar --exclude '.git' -czf jan2026tutorial.tar.gz jan2026tutorial/localProducts_larsoft_${DUNESW_VERSION}_${DUNESW_QUALIFIER} jan2026tutorial/work setupjan2026tutorial-grid.sh
```
Note how we have excluded the contents of ".git" directories in the various packages, since we don't need any of that in our jobs. It turns out that the .git directory can sometimes account for a substantial fraction of a package's size on disk!
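If you want to double-check that the exclusion worked, one quick sanity test (a sketch using throwaway demo files, not the tutorial directories) is to list the tarball contents and look for `.git` entries:

```shell
# Sketch with made-up demo files: build a tarball with --exclude '.git'
# and confirm no .git entries made it in.
mkdir -p demo/pkg/.git demo/pkg/src
touch demo/pkg/.git/config demo/pkg/src/code.cc
tar --exclude '.git' -czf demo.tar.gz demo
tar -tzf demo.tar.gz | grep '\.git' || echo "no .git files in tarball"
```

The same `tar -tzf | grep` check works on your real tarball before you submit.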

Then submit another job (in the following we keep the same submit file as above):

```bash
jobsub_submit -G dune --mail_always -N 1 --memory=2500MB --disk=2GB --expected-lifetime=3h --cpu=1 --tar_file_name=dropbox:///exp/dune/app/users/<username>/sep2025tutorial.tar.gz --singularity-image /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest --append_condor_requirements='(TARGET.HAS_Singularity==true&&TARGET.HAS_CVMFS_dune_opensciencegrid_org==true&&TARGET.HAS_CVMFS_larsoft_opensciencegrid_org==true&&TARGET.CVMFS_dune_opensciencegrid_org_REVISION>=1105&&TARGET.HAS_CVMFS_fifeuser1_opensciencegrid_org==true&&TARGET.HAS_CVMFS_fifeuser2_opensciencegrid_org==true&&TARGET.HAS_CVMFS_fifeuser3_opensciencegrid_org==true&&TARGET.HAS_CVMFS_fifeuser4_opensciencegrid_org==true)' -e GFAL_PLUGIN_DIR=/usr/lib64/gfal2-plugins -e GFAL_CONFIG_DIR=/etc/gfal2.d file:///exp/dune/app/users/kherner/run_sep2025tutorial.sh
```


You'll see this is very similar to the previous case, but there are some new options:

* `--tar_file_name=dropbox://` automatically **copies and untars** the given tarball into a directory on the worker node, accessed via the INPUT_TAR_DIR_LOCAL environment variable in the job. The value of INPUT_TAR_DIR_LOCAL is by default $CONDOR_DIR_INPUT/name_of_tar_file_without_extension, so if you have a tar file named e.g. sep2025tutorial.tar.gz, it would be $CONDOR_DIR_INPUT/sep2025tutorial.
* `--tar_file_name=dropbox://` automatically **copies and untars** the given tarball into a directory on the worker node, accessed via the INPUT_TAR_DIR_LOCAL environment variable in the job. The value of INPUT_TAR_DIR_LOCAL is by default $CONDOR_DIR_INPUT/name_of_tar_file_without_extension, so if you have a tar file named e.g. jan2026tutorial.tar.gz, it would be $CONDOR_DIR_INPUT/jan2026tutorial.
* Notice that the `--append_condor_requirements` line is longer now, because we also check for the fifeuser[1-4].opensciencegrid.org CVMFS repositories.
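The default value of `INPUT_TAR_DIR_LOCAL` described above can be sketched as a one-line derivation; the `/srv/input` path is a stand-in for whatever `$CONDOR_DIR_INPUT` actually is on the worker node:

```shell
# Stand-in value; on a real worker node jobsub sets CONDOR_DIR_INPUT itself.
CONDOR_DIR_INPUT=/srv/input
tarball=jan2026tutorial.tar.gz
INPUT_TAR_DIR_LOCAL="${CONDOR_DIR_INPUT}/${tarball%.tar.gz}"
echo "$INPUT_TAR_DIR_LOCAL"                  # /srv/input/jan2026tutorial
```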

The submission output will look something like this:
Expand All @@ -265,7 +263,7 @@ Could not locate uploaded file on RCDS. Will retry in 30 seconds.
Could not locate uploaded file on RCDS. Will retry in 30 seconds.
Found uploaded file on RCDS.
Transferring files to web sandbox...
Copying file:///nashome/k/kherner/.cache/jobsub_lite/js_2023_05_24_224713_9669e535-daf9-496f-8332-c6ec8a4238d9/run_sep2025tutorial.sh [DONE] after 0s
Copying file:///nashome/k/kherner/.cache/jobsub_lite/js_2023_05_24_224713_9669e535-daf9-496f-8332-c6ec8a4238d9/run_jan2026tutorial.sh [DONE] after 0s
Copying file:///nashome/k/kherner/.cache/jobsub_lite/js_2023_05_24_224713_9669e535-daf9-496f-8332-c6ec8a4238d9/simple.cmd [DONE] after 0s
Copying file:///nashome/k/kherner/.cache/jobsub_lite/js_2023_05_24_224713_9669e535-daf9-496f-8332-c6ec8a4238d9/simple.sh [DONE] after 0s
Submitting job(s).
@@ -566,8 +564,6 @@ Some more background material on these topics (including some examples of why ce

[Wiki page listing differences between jobsub_lite and legacy jobsub](https://fifewiki.fnal.gov/wiki/Differences_between_jobsub_lite_and_legacy_jobsub_client/server)

[DUNE Computing Tutorial:Advanced topics and best practices](DUNE_computing_tutorial_advanced_topics_20210129)

[2021 Intensity Frontier Summer School](https://indico.fnal.gov/event/49414)

[The Glidein-based Workflow Management System]( https://glideinwms.fnal.gov/doc.prd/index.html )
98 changes: 98 additions & 0 deletions _episodes/08-justin-job-submission.md
@@ -0,0 +1,98 @@
---
title: justIN Grid Job Submission (UNDER CONSTRUCTION)
teaching: 65
exercises: 0
questions:
- How to submit grid jobs?
objectives:
- Submit a basic batch job and understand what's happening behind the scenes
- Monitor the job and look at its outputs
- Review best practices for submitting jobs (including what NOT to do)
keypoints:
- When in doubt, ask! Understand that policies and procedures that seem annoying, overly complicated, or unnecessary (especially when compared to running an interactive test) are there to ensure efficient operation and scalability. They are also often the result of someone breaking something in the past, or of simpler approaches not scaling well.
- Send test jobs after creating new workflows or making changes to existing ones. If things don't work, don't blindly resubmit and expect things to magically work the next time.
- Only copy what you need in input tar files. In particular, avoid copying log files, .git directories, temporary files, etc. from interactive areas.
- Take care to follow best practices when setting up input and output file locations.
- Always, always, always prestage input datasets. No exceptions.
---

<!-- > ## Note:
> This section describes basic job submission. Large scale submission of jobs to read DUNE data files are described in the [next section]({{ site.baseurl }}/08-submit-jobs-w-justin/index.html). -->
<!--
#### Session Video

This session will be captured on video and placed here after the workshop for asynchronous study.
<!-- The session was video captured for your asynchronous review. -->
The video from the two day version of this training in May 2022 is provided [here](https://www.youtube.com/embed/QuDxkhq64Og) as a reference. -->

<!--
<center>
<iframe width="560" height="315" src="https://www.youtube.com/embed/QuDxkhq64Og" title="DUNE Computing Tutorial May 2022 Grid Job Submission and Common Errors" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</center>
-->





Once you have practiced basic justIN commands, please look at the instructions for running your own code below:



## First, learn the basics of justIN: submit a job

Go to [The justIN Tutorial](https://dunejustin.fnal.gov/docs/tutorials.dune.md)

and work up to ["run some hello world jobs"](https://dunejustin.fnal.gov/docs/tutorials.dune.md#run-some-hello-world-jobs)

> ## Quiz
>
> 1. What is your workflow ID?
>
{: .solution}

Then work through

- [View your workflow on the justIN web dashboard](https://dunejustin.fnal.gov/docs/tutorials.dune.md#view-your-workflow-on-the-justin-web-dashboard)
- [Jobs with inputs and outputs](https://dunejustin.fnal.gov/docs/tutorials.dune.md#jobs-with-inputs-and-outputs)
- [Fetching files from Rucio managed storage](https://dunejustin.fnal.gov/docs/tutorials.dune.md#fetching-files-from-rucio-managed-storage)
- (skip for now) Jobs using GPUs
- [Jobs writing to scratch](https://dunejustin.fnal.gov/docs/tutorials.dune.md#jobs-writing-to-scratch)





## Submit a job using the tarball containing custom code



First off, a very important point: for running analysis jobs, **you may not actually need to pass an input tarball**, especially if you are just using code from the base release and you don't actually modify any of it. In that case, it is much more efficient to use everything from the release and refrain from using a tarball.
All you need to do is set up any required software from CVMFS (e.g. dunetpc and/or protoduneana), and you are ready to go.
If you're just modifying a fcl file, for example, but no code, it's actually more efficient to copy just the fcl(s) you're changing to the scratch directory within the job, and edit them as part of your job script (copies of a fcl file in the current working directory have priority over others by default).
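As a concrete sketch of the fcl-only approach, a job script can copy a single fcl into the scratch working directory and tweak it in place. The file name and parameter below are invented for illustration (the stand-in fcl is created locally; in a real job it would come with the job or from the release):

```shell
# Illustration only: create a stand-in fcl, then edit it as the job script would.
printf 'nskip: 0\nnEvents: 10\n' > mymod.fcl
sed -i 's/^nEvents: .*/nEvents: 100/' mymod.fcl   # edit as part of the job script
grep '^nEvents:' mymod.fcl                        # nEvents: 100
# lar -c mymod.fcl ...                            # a fcl in the CWD wins by default
```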

Sometimes, though, we need to run some custom code that isn't in a release.
We need a way to efficiently get code into jobs without overwhelming our data transfer systems.
We have to make a few minor changes to the scripts you made in the previous tutorial section, generate a tarball, and invoke the proper jobsub options to get that into your job.
There are many ways of doing this, but by far the best is to use the Rapid Code Distribution Service (RCDS), as shown in our example.


### Temporary short version of an example for custom code

We're working on a longer version of this, but please look at these [instructions for running a justIN workflow using your own code]({{ site.baseurl }}/short_submission) for now.

### Cool justIN feature

justIN has a very useful interactive test command.

Here is a test from the short submission example.

~~~
{% include test_workflow.sh %}
~~~

It reads in a tarball from an area `$DUNEDATA` and writes output to a tmp area on your interactive machine. It works very well at emulating a grid job.

## Did your job work?

If not, please ask in #computing-questions on Slack.