From 004884fc235b872945d62676a2a20e93c70ba6ef Mon Sep 17 00:00:00 2001 From: Heidi Schellman <33669005+hschellman@users.noreply.github.com> Date: Fri, 29 Aug 2025 15:47:08 -0700 Subject: [PATCH 1/5] remove a lot of stuff that was in others --- _episodes/02-storage-spaces.md | 436 -------------- _episodes/02-submit-jobs-w-justin.md | 26 + _episodes/03-data-management.md | 454 -------------- _episodes/04-intro-art-larsoft.md | 698 --------------------- _episodes/05-end-of-basics.md | 41 -- _episodes/05.1-improve-code-efficiency.md | 361 ----------- _episodes/05.5-mrb.md | 38 -- _episodes/06-larsoft-modify-module.md | 699 ---------------------- _extras/ComputerSetup.md | 141 ----- _extras/InstallConda.md | 117 ---- _extras/Windows.md | 74 --- _extras/al9_setup.md | 58 -- _extras/putty.md | 40 -- _extras/sl7_setup.md | 79 --- setup.md | 683 +-------------------- 15 files changed, 30 insertions(+), 3915 deletions(-) delete mode 100644 _episodes/02-storage-spaces.md create mode 100644 _episodes/02-submit-jobs-w-justin.md delete mode 100644 _episodes/03-data-management.md delete mode 100644 _episodes/04-intro-art-larsoft.md delete mode 100644 _episodes/05-end-of-basics.md delete mode 100644 _episodes/05.1-improve-code-efficiency.md delete mode 100644 _episodes/05.5-mrb.md delete mode 100644 _episodes/06-larsoft-modify-module.md delete mode 100644 _extras/ComputerSetup.md delete mode 100644 _extras/InstallConda.md delete mode 100644 _extras/Windows.md delete mode 100644 _extras/al9_setup.md delete mode 100644 _extras/putty.md delete mode 100644 _extras/sl7_setup.md diff --git a/_episodes/02-storage-spaces.md b/_episodes/02-storage-spaces.md deleted file mode 100644 index 444f452..0000000 --- a/_episodes/02-storage-spaces.md +++ /dev/null @@ -1,436 +0,0 @@ ---- -title: Storage Spaces (2024) -teaching: 30 -exercises: 15 -questions: -- What are the types and roles of DUNE's data volumes? -- What are the commands and tools to handle data? 
-
-objectives:
-- Understanding the data volumes and their properties
-- Displaying volume information (total size, available size, mount point, device location)
-- Differentiating the commands used to handle data on grid-accessible versus interactive volumes
-keypoints:
-- Home directories are centrally managed by the Computing Division and are meant to store setup scripts; do NOT store certificates here.
-- Network attached storage (NAS) /dune/app is primarily for code development.
-- The NAS /dune/data is for storing ntuples and small datasets.
-- dCache volumes (tape, resilient, scratch, persistent) offer large storage with various retention lifetimes.
-- The tool suites ifdh and XRootD allow for accessing data with the appropriate transfer method and in a scalable way.
----
-
-## This is an updated version of the 2023 training
-
-
-
-### Workshop Storage Spaces Video from December 2024
-
-
- -
-
-
-## Introduction
-There are several types of storage volumes that you will encounter at Fermilab (or CERN):
-
-- local hard drives
-- network attached storage
-- large-scale, distributed storage
-- Rucio Storage Elements (RSE's) (a specific type of large-scale, distributed storage)
-- CERN Virtual Machine File System (CVMFS)
-
-Each has its own advantages and limitations, and knowing which one to use when isn't always straightforward or obvious. But with some amount of foresight, you can avoid some of the common pitfalls that have caught out other users.
-
-
-## Vocabulary
-
-**What is POSIX?** A volume with POSIX access (Portable Operating System Interface [Wikipedia](https://en.wikipedia.org/wiki/POSIX)) allows users to directly read, write and modify files using standard commands, e.g. using bash scripts or fopen(). In general, these are volumes mounted directly into the operating system.
-
-**What is meant by 'grid accessible'?** Volumes that are grid accessible require specific tool suites to handle data stored there. Grid access to a volume is NOT POSIX access. This will be explained in the following sections.
-
-**What is immutable?** A file that is immutable cannot be modified once it is written to the volume. It can only be read, moved, or deleted. This property is in general a restriction imposed by the storage volume on which the file is stored. Immutable volumes are not a good choice for code or other files you want to change. 
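To make the POSIX / immutable distinction concrete, here is a minimal shell sketch. It uses an ordinary temporary file as a stand-in for a file on a POSIX volume; the `ifdh cp` line in the comment is illustrative only, not a real path.

```bash
# On a POSIX volume you can create, append to, and edit a file in place:
f=$(mktemp)                      # stand-in for a file on a POSIX volume
echo "first line"  > "$f"        # create
echo "second line" >> "$f"       # append -- an in-place modification
sed -i 's/first/FIRST/' "$f"     # edit in place
contents=$(cat "$f")
echo "$contents"
rm -f "$f"

# On an immutable grid volume none of the edits above are possible once the
# file exists: you can only copy, read, or delete it whole, e.g. (illustrative):
#   ifdh cp myfile.txt /pnfs/dune/scratch/users/$USER/myfile.txt
```

Grid tool suites therefore offer whole-file operations (copy, remove), never in-place edits.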
-
-## Interactive storage volumes (mounted on dunegpvmXX.fnal.gov)
-
-**Home area** is similar to the user's local hard drive but network mounted
-* access speed to the volume is very high, on top of full POSIX access
-* network volumes are NOT safe to store certificates and tickets
-* important: users have a single home area at FNAL used for all experiments
-* not accessible from grid worker nodes
-* not for code development (home area is less than 2 GB)
-* at Fermilab, you need a valid Kerberos ticket in order to access files in your Home area
-* periodic snapshots are taken so you can recover deleted files (/nashome/.snapshot)
-* permissions are set so your collaborators cannot see files in your home area
-
-> ## Note: your home area is small and private
-> You want to use your home area for things that only you should see. If you want to share files with collaborators you need to put them in the /app/ or /data/ areas described below.
-{: .callout}
-
-**Locally mounted volumes** are physical disks, mounted directly on the computer
-* physically inside the computer node you are remotely accessing
-* mounted on the machine through the motherboard (not over network)
-* used as temporary storage for infrastructure services (e.g. /var, /tmp)
-* can be used to store certificates and tickets (these are saved there automatically with owner-read enabled and other permissions disabled)
-* usually very small and should not be used to store data files or for code development
-* files on these volumes are not backed up
-
-**Network Attached Storage (NAS)** elements behave similarly to a locally mounted volume.
-* functions similarly to services such as Dropbox or OneDrive
-* fast and stable POSIX access to these volumes
-* volumes available only on a limited number of computers or servers
-* not available on grid computing (FermiGrid, Open Science Grid, WLCG, HPC, etc.)
-* /exp/dune/app/.... 
has periodic snapshots in /exp/dune/app/..../.snap, but /exp/dune/data does NOT
-* easy to share files with colleagues using /exp/dune/data and /exp/dune/app
-
-## Grid-accessible storage volumes
-
-At Fermilab, an instance of dCache+Enstore is used for large-scale, distributed storage with capacity for more than 100 PB of storage and O(10000) connections. Whenever possible, these storage elements should be accessed over xrootd (see next section) as the mount points on interactive nodes are slow, unstable, and can cause the node to become unusable. Here are the different dCache volumes:
-
-**Persistent dCache**: the data in the file is actively available for reads at any time and will not be removed until manually deleted by the user.
-There is now a second persistent dCache volume that is dedicated to DUNE physics groups and managed by the respective conveners of those
-physics groups. https://wiki.dunescience.org/wiki/DUNE_Computing/Using_the_Physics_Groups_Persistent_Space_at_Fermilab gives more details on how to get
-access to these groups. In general, if you need to store more than 5 TB in persistent dCache you should be working with the Physics Groups areas.
-
-**Scratch dCache**: large volume shared across all experiments. When a new file is written to scratch space, old files are removed in order to make room for the newer file. Removal is based on a Least Recently Utilized (LRU) policy, and performed by an automated daemon.
-
-**Tape-backed dCache**: disk-based storage areas that have their contents mirrored to permanent storage on Enstore tape.
-Files are not always available for immediate read on disk, but need to be 'staged' from tape first ([see video of a tape storage robot](https://www.youtube.com/watch?v=kiNWOhl00Ao)). 
-
-**Resilient dCache**: handles custom user code for grid jobs, often in the form of a tarball. It is inappropriate to store any other files here (NO DATA OR NTUPLES). NOTE: DIRECT USAGE is being phased out; if the Rapid Code Distribution function in POMS/jobsub does not work for you, consult with the FIFE team for a solution.
-
-**Rucio Storage Elements**: Rucio Storage Elements (or RSEs) are storage elements provided by collaborating institutions for official DUNE datasets. Data stored in DUNE RSE's must be fully cataloged in the [metacat][metacat] catalog and is managed by the DUNE data management team. This is where you find the official data samples.
-
-**CVMFS**: the CERN Virtual Machine File System is a centrally managed storage area that is distributed over the network, and utilized to distribute common software and a limited set of reference files. CVMFS is mounted over the network, and can be utilized on grid nodes, interactive nodes, and personal desktops/laptops. It is read only, and the most common source for centrally maintained versions of experiment software libraries/executables. CVMFS is mounted at `/cvmfs/` and access is POSIX-like, but read only.
-
-> ## Note - When reading from dCache always use the root: syntax, not direct /pnfs
-> The Fermilab dCache areas have NFS mounts. These are for your convenience: they allow you to look at the directory structure and, for example, remove files. However, NFS access is slow, inconsistent, and can hang the machine if I/O-heavy processes use it. Always use the `xroot root://` ... syntax when reading/accessing files instead of `/pnfs/` directly. Once you have your dune environment set up, the `pnfs2xrootd` command can do the conversion to `root:` format for you (only for files at FNAL for now). 
-
-{: .callout}
-
-## Summary on storage spaces
-Full documentation: [Understanding Storage Volumes](https://cdcvs.fnal.gov/redmine/projects/fife/wiki/Understanding_storage_volumes)
-
-|-------------+------------------+----------+-------------+----------------+------------+--------------+-----------|
-| | Quota/Space | Retention Policy | Tape Backed? | Retention Lifetime on disk | Use for | Path | Grid Accessible |
-|-------------+------------------+----------+-------------+----------------+------------+--------------+-----------|
-| Persistent dCache | Yes(5)/~100 TB/exp | Managed by User/Exp | No | Until manually deleted | immutable files w/ long lifetime | /pnfs/dune/persistent | Yes |
-|-------------+------------------+----------+-------------+----------------+------------+--------------+-----------|
-| Persistent PhysGrp | Yes(50)/~500 TB/exp | Managed by PhysGrp | No | Until manually deleted | immutable files w/ long lifetime | /pnfs/dune/persistent/physicsgroups | Yes |
-|-------------+------------------+----------+-------------+----------------+------------+--------------+-----------|
-| Scratch dCache | No/no limit | LRU eviction - least recently used file deleted | No | Varies, ~30 days (*NOT* guaranteed) | immutable files w/ short lifetime | /pnfs/\/scratch | Yes |
-|-------------+------------------+----------+-------------+----------------+------------+--------------+-----------|
-| Tape-backed dCache | No/O(10) PB | LRU eviction (from disk) | Yes | Approx 30 days | Long-term archive | /pnfs/dune/... 
| Yes | -|-------------+------------------+----------+-------------+----------------+------------+--------------+-----------| -| NAS Data | Yes (~1 TB)/ 32+30 TB total | Managed by Experiment | No | Until manually deleted | Storing final analysis samples | /exp/dune/data | No | -|-------------+------------------+----------+-------------+----------------+------------+--------------+-----------| -| NAS App | Yes (~100 GB)/ ~15 TB total | Managed by Experiment | No | Until manually deleted | Storing and compiling software | /exp/dune/app | No | -|-------------+------------------+----------+-------------+----------------+------------+--------------+-----------| -| Home Area (NFS mount) | Yes (~10 GB) | Centrally Managed by CCD | No | Until manually deleted | Storing global environment scripts (All FNAL Exp) | /nashome/\/\| No | -|-------------+------------------+----------+-------------+----------------+------------+--------------+-----------| -| Rucio | 10 PB | Centrally Managed by DUNE | Yes | Each file has retention policy | Official DUNE Data samples | use rucio/justin to access| Yes | -|-------------+------------------+----------+-------------+----------------+------------+--------------+-----------| - - -![Storage Picture](../fig/Storage.png){: .image-with-shadow } - -## Monitoring and Usage -Remember that these volumes are not infinite, and monitoring your and the experiment's usage of these volumes is important to smooth access to data and simulation samples. 
To see your persistent usage visit [here](https://fifemon.fnal.gov/monitor/d/000000175/dcache-persistent-usage-by-vo?orgId=1&var-VO=dune) (bottom left):
-
-And to see the total volume usage at Rucio Storage Elements around the world:
-
-**Resource:** [DUNE Rucio Storage](https://dune.monitoring.edi.scotgrid.ac.uk/app/dashboards#/view/7eb1cea0-ca5e-11ea-b9a5-15b75a959b33?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-1d,to:now)))
-
-> ## Note - do not blindly copy files from personal machines to DUNE systems.
-> You may have files on your personal machine that contain personal information, licensed software or (god forbid) malware or pornography. Do not transfer any files from your personal machine to DUNE machines unless they are directly related to work on DUNE. You must be fully aware of any file's contents. We have seen it all and we do not want to.
-{: .callout}
-
-## Commands and tools
-This section will teach you the main tools and commands to display storage information and access data.
-
-### ifdh
-
-Another useful data handling command you will soon come across is ifdh. This stands for Intensity Frontier Data Handling. It is a tool suite that facilitates selecting the appropriate data transfer method from many possibilities while protecting shared resources from overload. You may see *ifdhc*, where *c* refers to *client*.
-
-> ## Note
-> ifdh is much more efficient than NFS file access. Please use it and/or xroot when accessing remote files.
-{: .challenge}
-
-Here is an example to copy a file. Refer to the [Mission Setup]({{ site.baseurl }}/setup.html) for setting up the `DUNELAR_VERSION`. 
-
-> ## Note
-> For now do this in the Apptainer
-{: .challenge}
-
-~~~
-/cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin/apptainer shell --shell=/bin/bash \
--B /cvmfs,/exp,/nashome,/pnfs/dune,/opt,/run/user,/etc/hostname,/etc/hosts,/etc/krb5.conf --ipc --pid \
-/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-dev-sl7:latest
-~~~
-{: .language-bash}
-
-Once in the Apptainer:
-~~~
-#source ~/dune_presetup_2024.sh
-#dune_setup
-source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
-kx509
-export ROLE=Analysis
-voms-proxy-init -rfc -noregen -voms=dune:/dune/Role=$ROLE -valid 120:00
-setup ifdhc
-export IFDH_TOKEN_ENABLE=0
-ifdh cp root://fndcadoor.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/physics/full-reconstructed/2023/mc/out1/MC_Winter2023_RITM1592444_reReco/54/05/35/65/NNBarAtm_hA_BR_dune10kt_1x2x6_54053565_607_20220331T192335Z_gen_g4_detsim_reco_65751406_0_20230125T150414Z_reReco.root /dev/null
-~~~
-{: .language-bash}
-
-Note: if the destination for an ifdh cp command is a directory instead of a filename with full path, you have to add the "-D" option to the command line.
-
-Prior to attempting the first exercise, please take a look at the full list of ifdh commands, to be able to complete the exercise. In particular: mkdir, cp, and rmdir.
-
-**Resource:** [ifdh commands](https://cdcvs.fnal.gov/redmine/projects/ifdhc/wiki/Ifdh_commands)
-
-> ## Exercise 1
-> Using the ifdh command, complete the following tasks:
-> * create a directory in your dCache scratch area (/pnfs/dune/scratch/users/${USER}/) called "DUNE_tutorial_2024"
-> * copy /exp/dune/app/users/${USER}/my_first_login.txt file to that directory
-> * copy the my_first_login.txt file from your dCache scratch directory (i.e. 
DUNE_tutorial_2024) to /dev/null -> * remove the directory DUNE_tutorial_2024 -> * create the directory DUNE_tutorial_2024_data_file -> Note, if the destination for an ifdh cp command is a directory instead of filename with full path, you have to add the "-D" option to the command line. Also, for a directory to be deleted, it must be empty. -> -> > ## Answer -> > ~~~ -> > ifdh mkdir /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_2024 -> > ifdh cp -D /exp/dune/app/users/${USER}/my_first_login.txt /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_2024 -> > ifdh cp /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_2024/my_first_login.txt /dev/null -> > ifdh rm /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_2024/my_first_login.txt -> > ifdh rmdir /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_2024 -> > ifdh mkdir /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_2024_data_file -> > ~~~ -> > {: .language-bash} -> {: .solution} -{: .challenge} - -### xrootd -The eXtended ROOT daemon is a software framework designed for accessing data from various architectures in a complete scalable way (in size and performance). - -XRootD is most suitable for read-only data access. -[XRootD Man pages](https://xrootd.slac.stanford.edu/docs.html) - -Issue the following command. Please look at the input and output of the command, and recognize that this is a listing of /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_2024. Try and understand how the translation between a NFS path and an xrootd URI could be done by hand if you needed to do so. - -~~~ -xrdfs root://fndca1.fnal.gov:1094/ ls /pnfs/fnal.gov/usr/dune/scratch/users/${USER}/ -~~~ -{: .language-bash} - -Note that you can do -~~~ -lar -c -~~~ -{: .language-bash} - -to stream into a larsoft module configured within the fhicl file. 
As well, it can be implemented in standalone C++ as - -~~~ -TFile * thefile = TFile::Open() -~~~ -{: .language-c++} - -or PyROOT code as - -~~~ -thefile = ROOT.TFile.Open() -~~~ -{: .language-python} - -### What is the right xroot path for a file. - -If a file is in `/pnfs/dune/tape_backed/dunepro/protodune-sp/reco-recalibrated/2021/detector/physics/PDSPProd4/00/00/51/41/np04_raw_run005141_0003_dl9_reco1_18127219_0_20210318T104440Z_reco2_51835174_0_20211231T143346Z.root` - -the command - -~~~ -pnfs2xrootd /pnfs/dune/tape_backed/dunepro/protodune-sp/reco-recalibrated/2021/detector/physics/PDSPProd4/00/00/51/41/np04_raw_run005141_0003_dl9_reco1_18127219_0_20210318T104440Z_reco2_51835174_0_20211231T143346Z.root -~~~ -{: .language-bash} - - -will return the correct xrootd uri: - -~~~ -root://fndca1.fnal.gov:1094//pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/reco-recalibrated/2021/detector/physics/PDSPProd4/00/00/51/41/np04_raw_run005141_0003_dl9_reco1_18127219_0_20210318T104440Z_reco2_51835174_0_20211231T143346Z.root -~~~ -{: .output} - -you can then - -~~~ -root -l -~~~ -{: .language-bash} - -to open the root file. - -This even works if the file is in Europe - which you cannot do with a direct /pnfs! (NOTE! not all storage elements accept tokens, so right now this will fail if you have a token in your environment! Times out over ~10 minutes.) - -~~~ -#Need to setup root executable in the environment first... 
-
-export DUNELAR_VERSION=v09_90_01d00
-export DUNELAR_QUALIFIER=e26:prof
-export UPS_OVERRIDE="-H Linux64bit+3.10-2.17"
-source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
-setup dunesw $DUNELAR_VERSION -q $DUNELAR_QUALIFIER
-
-root -l root://dune.dcache.nikhef.nl:1094/pnfs/nikhef.nl/data/dune/generic/rucio/usertests/b4/03/prod_beam_p1GeV_cosmics_protodunehd_20240405T005104Z_188961_006300_g4_stage1_g4_stage2_sce_E500_detsim_reco_20240426T232530Z_rerun_reco.root
-~~~
-{: .language-bash}
-
-See the next section on [data management]({{ site.baseurl }}/03-data-management) for instructions on finding files worldwide.
-
-> ## Note Files in /tape_backed/ may not be immediately accessible; those in /persistent/ and /scratch/ are.
-{: .callout}
-
-## Let's practice
-
-> ## Exercise 2
-> Using a combination of `ifdh` and `xrootd` commands discussed previously:
-> * Use `ifdh locateFile root` to find the directory for this file `PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_off_43352322_0_20210427T162252Z.root`
-> * Use `xrdcp` to copy that file to `/pnfs/dune/scratch/users/${USER}/DUNE_tutorial_2024_data_file`
-> * Using `xrdfs` and the `ls` option, count the number of files in the same directory as `PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_off_43352322_0_20210427T162252Z.root`
-{: .challenge}
-
-Note that redirecting the standard output of a command into the command `wc -l` will count the number of lines in the output text, e.g. 
`ls -alrth ~/ | wc -l`
-
-~~~
-ifdh locateFile PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_off_43352322_0_20210427T162252Z.root root
-xrdcp root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2021/mc/out1/PDSPProd4a/18/80/01/67/PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_off_43352322_0_20210427T162252Z.root root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/scratch/users/${USER}/DUNE_tutorial_2024_data_file/PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_off_43352322_0_20210427T162252Z.root
-xrdfs root://fndca1.fnal.gov:1094/ ls /pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2021/mc/out1/PDSPProd4a/18/80/01/67/ | wc -l
-~~~
-{: .language-bash}
-
-
-> ## Is my file available or stuck on tape?
-> /tape_backed/ storage at Fermilab is migrated to tape and may not be on disk.
-> You can check this by doing the following **in an AL9 window**
-> ~~~
-> gfal-xattr user.status
-> ~~~
-> {: .language-bash}
-> if it is on disk you get
-> ~~~
-> ONLINE
-> ~~~
-> {: .output}
-> if it is only on tape you get
-> ~~~
-> NEARLINE
-> ~~~
-> {: .output}
-> (This command doesn't work on SL7 so use an AL9 window)
-{: .challenge}
-
-
-### The df command
-
-Finding out what types of volumes are available on a node can be achieved with the command `df`. The `-h` is for _human readable format_. It will list a lot of information about each volume (total size, available size, mount point, device location).
-~~~
-df -h
-~~~
-{: .language-bash}
-
-> ## Exercise 3
-> From the output of the `df -h` command, identify:
-> 1. the home area
-> 2. the NAS storage spaces
-> 3. the different dCache volumes
-{: .challenge}
-
-## Quiz
-
-> ## Question 01
->
-> Which volumes are directly accessible (POSIX) from grid worker nodes?
->
->
->   A. /exp/dune/data
->   B. DUNE CVMFS repository
->   C. /pnfs/dune/scratch
->   D. /pnfs/dune/persistent
->   E. None of the Above
-> -> > ## Answer -> > The correct answer is B - DUNE CVMFS repository. -> > {: .output} -> {: .solution} -{: .challenge} - -> ## Question 02 -> -> Which data volume is the best location for the output of an analysis-user grid job? ->
->
->   A. dCache scratch (/pnfs/dune/scratch/users/${USER}/)
->   B. dCache persistent (/pnfs/dune/persistent/users/${USER}/)
->   C. Enstore tape (/pnfs/dune/tape_backed/users/${USER}/)
->   D. user’s home area (`~${USER}`)
->   E. NFS data volume (/dune/data or /dune/app)
-> -> > ## Answer -> > The correct answer is A, dCache scratch (/pnfs/dune/scratch/users/${USER}/). -> > {: .output} -> {: .solution} -{: .challenge} - -> ## Question 03 -> -> You have written a shell script that sets up your environment for both DUNE and another FNAL experiment. Where should you put it? ->
->
->   A. DUNE CVMFS repository
->   B. /pnfs/dune/scratch/
->   C. /exp/dune/app/
->   D. Your GPVM home area
->   E. Your laptop home area
-> -> > ## Answer -> > The correct answer is D - Your GPVM home area. -> > {: .output} -> {: .solution} -{: .challenge} - -> ## Question 04 -> -> What is the preferred way of reading a file interactively? ->
->
->   A. Read it across the nfs mount on the GPVM
->   B. Download the whole file to /tmp with xrdcp
->   C. Open it for streaming via xrootd
->   D. None of the above
-
->
-> > ## Answer
-> > The correct answer is C - Open it for streaming via xrootd. Use `pnfs2xrootd` to generate the streaming path.
-> > {: .output}
-> {: .solution}
-{: .challenge}
-
-## Useful links to bookmark
-
-* [ifdh commands (redmine)](https://cdcvs.fnal.gov/redmine/projects/ifdhc/wiki/Ifdh_commands)
-* [Understanding storage volumes (redmine)](https://cdcvs.fnal.gov/redmine/projects/fife/wiki/Understanding_storage_volumes)
-* How DUNE storage works: [pdf](https://dune-data.fnal.gov/tutorial/howitworks.pdf)
-
----
-
-{%include links.md%}
diff --git a/_episodes/02-submit-jobs-w-justin.md b/_episodes/02-submit-jobs-w-justin.md
new file mode 100644
index 0000000..f2e09d6
--- /dev/null
+++ b/_episodes/02-submit-jobs-w-justin.md
@@ -0,0 +1,26 @@
+---
+title: Submit grid jobs with JustIn
+teaching: 20
+exercises: 0
+questions:
+- How to submit realistic grid jobs with JustIn
+objectives:
+- Demonstrate use of JustIn for job submission with more complicated setups.
+keypoints:
+- Always, always, always prestage input datasets. No exceptions.
+---
+
+# PLEASE USE THE NEW JUSTIN SYSTEM INSTEAD OF POMS
+
+__The JustIn Tutorial is currently in docdb at: [JustIn Tutorial](https://docs.dunescience.org/cgi-bin/sso/RetrieveFile?docid=30145)__
+
+The JustIn system is described in detail at:
+
+__[JustIn Home](https://justin.dune.hep.ac.uk/dashboard/)__
+
+__[JustIn Docs](https://justin.dune.hep.ac.uk/docs/)__
+
+
+> ## Note More documentation coming soon
+{: .callout}
+
diff --git a/_episodes/03-data-management.md b/_episodes/03-data-management.md
deleted file mode 100644
index 22a2974..0000000
--- a/_episodes/03-data-management.md
+++ /dev/null
@@ -1,454 +0,0 @@
----
-title: Data Management (2024 updated for metacat/justin/rucio)
-teaching: 30
-exercises: 15
-questions:
-- What are the data management tools and software for DUNE?
-objectives:
-- Learn how to access data from DUNE Data Catalog. 
-- Learn a bit about the JustIN workflow system for submitting batch jobs.
-keypoints:
-- SAM and Rucio are data handling systems used by the DUNE collaboration to retrieve data.
-- Staging is a necessary step to make sure files are on disk in dCache (as opposed to only on tape).
-- Xrootd allows users to stream data files.
----
-
-#### Session Video
-
-
-
-The session video from December 10, 2024 was captured for your asynchronous review.
- -
-
-
-## Introduction
-
-### What we need to do to produce accurate physics results
-DUNE has a lot of data which is processed through a complicated chain of steps. We try to abide by [FAIR](https://www.go-fair.org/fair-principles/) (Findable, Accessible, Interoperable and Reusable) principles in our use of data.
-
-Our [DUNE Physics Analysis Review Procedures](https://docs.dunescience.org/cgi-bin/sso/RetrieveFile?docid=28237&filename=physics_analysis_review_v7.pdf) state that:
-
-1. Software must be documented, and committed to a repository accessible to the collaboration.
-  - The preferred location is any repository managed within the official DUNE GitHub page: https://github.com/DUNE.
-  - There should be sufficient instructions on how to reproduce the results included with the software. In particular, a good goal is that the working group conveners are able to remake plots, in case cosmetic changes need to be made. Software repositories should adhere to licensing and copyright guidelines detailed in DocDB-27141.
-
-2. Data and simulation samples must come from well-documented, reproducible production campaigns. For most analyses, input samples should be official, catalogued DUNE productions.
-
-
-
-### How we do it
-
-DUNE official data samples are produced using released code, cataloged with metadata that describes the processing chain, and stored so that they are accessible to collaborators.
-
-DUNE data is stored around the world and the storage elements are not always organized in a way that they can be easily inspected. For this purpose we use the [metacat][metacat] data catalog to describe the data and collections and the [rucio][rucio] file storage system to determine where replicas of files are. There is also a legacy SAM data access system that can be used for older files.
-
-### How can I help?
-
-If you want to access data, this module will help you find and examine it. 
-
-If you want to process data using the full power of DUNE computing, you should talk to the data management group about methods for cataloging any data files you plan to produce. This will allow you to use DUNE's collaborative storage capabilities to preserve and share your work with others and will be required for publication of results.
-
-## How to find and access official data
-
-### What is metacat?
-
-Metacat is a file catalog - it allows you to search for files that have particular attributes and understand their provenance, including details on all of their processing steps.
-It also allows for querying jointly the file catalog and the DUNE conditions database.
-
-You can find extensive documentation on metacat at:
-
-[General metacat documentation](https://metacat.readthedocs.io/en/latest/)
-
-[DUNE metacat examples](https://dune.github.io/DataCatalogDocs/index.html)
-
-### Find a file in metacat
-
-DUNE runs multiple experiments (far detectors, protodune-sp, protodune-dp, hd-protodune, vd-protodune, iceberg, coldboxes... ) and produces various kinds of data (mc/detector) and processes them through different phases.
-
-To find your data you need to specify at minimum
-
-- `core.run_type` (the experiment)
-- `core.file_type` (mc or detector)
-- `core.data_tier` (the level of processing: raw, full-reconstructed, root-tuple)
-
-and when searching for specific types of data
-
-- `core.data_stream` (physics, calibration, cosmics)
-- `core.runs[any]=` 
-
-Here is an example of a metacat query that gets you raw files from a recent 'hd-protodune' cosmics run. 
- -Note: there are example setups that do a full setup in the extras folder: - -- [SL7 setup]({{ site.baseurl }}/sl7_setup) -- [AL9 setup]({{ site.baseurl }}/al9_setup) - -First get metacat if you have not already done so - - -> ## SL7 -> ~~~ -> # If you have not already done a general SL7 software setup: -> source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh -> export DUNELAR_VERSION=v10_00_04d00 -> export DUNELAR_QUALIFIER=e26:prof -> setup dunesw $DUNELAR_VERSION -q $DUNELAR_QUALIFIER -> export METACAT_AUTH_SERVER_URL=https://metacat.fnal.gov:8143/auth/dune -> export METACAT_SERVER_URL=https://metacat.fnal.gov:9443/dune_meta_prod/app -> -> # then you can set up metacat and rucio -> setup metacat -> setup rucio -> ~~~ -> {: .language-bash} -{: .callout} - -> ## AL9 -> ~~~ -> source /cvmfs/larsoft.opensciencegrid.org/spack-packages/setup-env.sh -> spack load r-m-dd-config experiment=dune -> ~~~ -> {: .language-bash} -{: .callout} - -> ## For both -> ~~~ -> metacat auth login -m password $USER # use your services password to authenticate -> ~~~ -> {: .language-bash} -{: .callout} - ->### Note: other means of authentication ->Check out the [metacat documentation](https://metacat.readthedocs.io/en/latest/ui.html#user-authentication) for -kx509 and token authentication. -{: .callout} - -then do queries to find particular sets of files. -~~~ -metacat query "files from dune:all where core.file_type=detector \ - and core.run_type=hd-protodune and core.data_tier=raw \ - and core.data_stream=cosmics and core.runs[any]=27296 limit 2" -~~~ -{: .language-bash} - -should give you 2 files: - -~~~ -hd-protodune:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5 -hd-protodune:np04hd_raw_run027296_0000_dataflow0_datawriter_0_20240619T110330.hdf5 -~~~ -{: .output} - - -the string before the ':' is the namespace and the string after is the filename. 
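As a pure-shell illustration (no metacat needed; the DID is one returned by the query above), the namespace and filename can be separated at the first ':' with parameter expansion:

```bash
# A metacat DID has the form <namespace>:<filename>
did="hd-protodune:np04hd_raw_run027296_0000_dataflow0_datawriter_0_20240619T110330.hdf5"
namespace=${did%%:*}   # everything before the first ':'
filename=${did#*:}     # everything after the first ':'
echo "namespace: $namespace"
echo "filename:  $filename"
```

This split is handy when a tool wants the namespace and filename as separate arguments.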
- -You can find out more about your file by doing: - -~~~ -metacat file show -m -l hd-protodune:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5 -~~~ -{: .language-bash} - -which gives you a lot of information: - -~~~ -checksums: - adler32 : 6a191436 -created_timestamp : 2024-06-19 11:08:24.398197+00:00 -creator : dunepro -fid : 83302138 -name : np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5 -namespace : hd-protodune -retired : False -retired_by : None -retired_timestamp : None -size : 4232017188 -updated_by : None -updated_timestamp : 1718795304.398197 -metadata: - core.data_stream : cosmics - core.data_tier : raw - core.end_time : 1718795024.0 - core.event_count : 35 - core.events : [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63, 67, 71, 75, 79, 83, 87, 91, 95, 99, 103, 107, 111, 115, 119, 123, 127, 131, 135, 139] - core.file_content_status: good - core.file_format : hdf5 - core.file_type : detector - core.first_event_number: 3 - core.last_event_number: 139 - core.run_type : hd-protodune - core.runs : [27296] - core.runs_subruns : [2729600001] - core.start_time : 1718795010.0 - dune.daq_test : False - retention.class : physics - retention.status : active -children: - hd-protodune-det-reco:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330_reco_stage1_20240621T175057_keepup_hists.root (eywzUgkZRZ6llTsU) - hd-protodune-det-reco:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330_reco_stage1_reco_stage2_20240621T175057_keepup.root (GHSm3owITS20vn69) -~~~ -{: .output} - -look in the glossary to see what those fields mean. 
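If you ask for the same information as JSON (`metacat file show` also takes a `--json` option), the fields are easy to use from a script. A sketch follows; the dict literal is a hand-copied fragment of the metadata shown above, standing in for real parsed `--json` output.

```python
import json

# Hand-copied fragment of the metadata shown above, standing in for
# the parsed output of "metacat file show ... --json".
meta = json.loads("""
{
  "namespace": "hd-protodune",
  "name": "np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5",
  "size": 4232017188,
  "metadata": {
    "core.data_stream": "cosmics",
    "core.data_tier": "raw",
    "core.event_count": 35,
    "core.run_type": "hd-protodune",
    "core.runs": [27296]
  }
}
""")

md = meta["metadata"]
print(f'{meta["namespace"]}:{meta["name"]}')
print(f'run {md["core.runs"][0]}: {md["core.event_count"]} events, '
      f'{meta["size"] / 1e9:.2f} GB')
```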
- -### Find out how much raw data there is in a run using the summary option - -~~~ -metacat query -s "files from dune:all where core.file_type=detector \ - and core.run_type=hd-protodune and core.data_tier=raw \ - and core.data_stream=cosmics and core.runs[any]=27296" -~~~ -{: .language-bash} - -~~~ -Files: 963 -Total size: 4092539942264 (4.093 TB) -~~~ -{: .output} - -To look at all the files in that run you need to use XRootD - **DO NOT TRY TO COPY 4 TB to your local area!** - - - -### What is (was) SAM? -Sequential Access with Metadata (SAM) was a data handling system developed at Fermilab, designed to track locations of files and other file metadata. It has been replaced by the combination of MetaCat and Rucio. New files are no longer declared to SAM, and any SAM locations recorded after June of 2024 should be presumed to be wrong. SAM is still used in some legacy ProtoDUNE analyses. - - - -### What is Rucio? -Rucio is the next-generation Data Replica service and is part of DUNE's new Distributed Data Management (DDM) system that is currently in deployment. -Rucio has two functions: -1. A rule-based system to get files to Rucio Storage Elements around the world and keep them there. -2. Returning the "nearest" replica of any data file for either interactive or batch use. It is expected that most DUNE users will not regularly use direct Rucio commands, but rather wrapper scripts that call them indirectly. - -As of the date of the December 2024 tutorial: -- The Rucio client is available in CVMFS and Spack -- Most DUNE users are now enabled to use it. New users may not automatically be added. 
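The `Total size` line in the `-s` summary is the raw byte count followed by a decimal (SI) rendering. Reproducing that rendering yourself takes only a few lines; the `human_size` helper below is our own sketch, not a metacat utility.

```python
# Sketch: format a byte count in decimal (SI) units, matching the
# "Total size: 4092539942264 (4.093 TB)" style of metacat's -s summary.
# human_size is our own helper, not part of metacat.
def human_size(nbytes):
    for unit, scale in (("TB", 1e12), ("GB", 1e9), ("MB", 1e6), ("KB", 1e3)):
        if nbytes >= scale:
            return f"{nbytes / scale:.3f} {unit}"
    return f"{nbytes} B"

print(human_size(4092539942264))  # 4.093 TB
```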
- -### Let's find a file - -If you haven't already done this earlier in the setup: - -- On SL7 type `setup rucio` -- On AL9 type `spack load rucio-clients@33.3.0` # see above for r-m-dd-config which will always get the current version - -~~~ -# first get a kx509 proxy, then - -export RUCIO_ACCOUNT=$USER - - -rucio list-file-replicas hd-protodune:np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5 --pfns --protocols=root -~~~ -{: .language-bash} - -returns 3 locations: - -~~~ -root://dune.dcache.nikhef.nl:1094/pnfs/nikhef.nl/data/dune/generic/rucio/hd-protodune/e5/57/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5 -root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro//hd-protodune/raw/2024/detector/cosmics/None/00/02/72/96/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5 -root://eosctapublic.cern.ch:1094//eos/ctapublic/archive/neutplatform/protodune/rawdata/np04//hd-protodune/raw/2024/detector/cosmics/None/00/02/72/96/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5 -~~~ -{: .output} - - -These are the locations of the file on disk and tape. We can use them to copy the file to our local disk or to access the file via xroot. - -### Finding files by characteristics using metacat - -To list raw data files for a given run: -~~~ -metacat query "files where core.file_type=detector \ - and core.run_type='protodune-sp' and core.data_tier=raw \ - and core.data_stream=physics and core.runs[any] in (5141)" -~~~ -{: .language-bash} - -- `core.run_type` tells you which of the many DAQs this came from. -- `core.file_type` distinguishes detector data from mc. -- `core.data_tier` could be raw, full-reconstructed, root-tuple. Same data, different formats. - -~~~ -protodune-sp:np04_raw_run005141_0013_dl7.root -protodune-sp:np04_raw_run005141_0005_dl3.root -protodune-sp:np04_raw_run005141_0003_dl1.root -protodune-sp:np04_raw_run005141_0004_dl7.root -... 
-protodune-sp:np04_raw_run005141_0009_dl7.root -protodune-sp:np04_raw_run005141_0014_dl11.root -protodune-sp:np04_raw_run005141_0007_dl6.root -protodune-sp:np04_raw_run005141_0011_dl8.root -~~~ -{: .output} - -Note the presence of both a *namespace* and a *filename*. - -What about some files from a reconstructed version? -~~~ -metacat query "files from dune:all where core.file_type=detector \ - and core.run_type='protodune-sp' and core.data_tier=full-reconstructed \ - and core.data_stream=physics and core.runs[any] in (5141) and dune.campaign=PDSPProd4 limit 10" -~~~ -{: .language-bash} - -~~~ -pdsp_det_reco:np04_raw_run005141_0013_dl10_reco1_18127013_0_20210318T104043Z.root -pdsp_det_reco:np04_raw_run005141_0015_dl4_reco1_18126145_0_20210318T101646Z.root -pdsp_det_reco:np04_raw_run005141_0008_dl12_reco1_18127279_0_20210318T104635Z.root -pdsp_det_reco:np04_raw_run005141_0002_dl2_reco1_18126921_0_20210318T103516Z.root -pdsp_det_reco:np04_raw_run005141_0002_dl14_reco1_18126686_0_20210318T102955Z.root -pdsp_det_reco:np04_raw_run005141_0015_dl5_reco1_18126081_0_20210318T122619Z.root -pdsp_det_reco:np04_raw_run005141_0017_dl10_reco1_18126384_0_20210318T102231Z.root -pdsp_det_reco:np04_raw_run005141_0006_dl4_reco1_18127317_0_20210318T104702Z.root -pdsp_det_reco:np04_raw_run005141_0007_dl9_reco1_18126730_0_20210318T102939Z.root -pdsp_det_reco:np04_raw_run005141_0011_dl7_reco1_18127369_0_20210318T104844Z.root -~~~ -{: .output} - - -To see the total number (and size) of files that match a certain query expression, add the `-s` option to `metacat query`. - - - -See the metacat documentation [DataCatalogDocs][DataCatalogDocs] for more information about queries, and check out the glossary of common fields at [MetaCatGlossary][MetaCatGlossary]. - -## Accessing data for use in your analysis -To access data without copying it, `XRootD` is the tool to use. However, it will work only if the file is staged to disk. 
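Every file reference returned by metacat is a DID of the form `namespace:filename`. When scripting, split it at the first colon only; a minimal sketch (`split_did` is just an illustrative name):

```python
# Split a metacat DID ("namespace:filename") at the first colon.
# split_did is an illustrative helper name, not a metacat function.
def split_did(did):
    namespace, sep, filename = did.partition(":")
    if not sep:
        raise ValueError(f"not a namespace:filename DID: {did}")
    return namespace, filename

ns, fn = split_did("protodune-sp:np04_raw_run005141_0013_dl7.root")
print(ns)  # protodune-sp
print(fn)  # np04_raw_run005141_0013_dl7.root
```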
- -You can stream files worldwide if you have a DUNE VO certificate as described in the preparation part of this tutorial. - -To learn more about using Rucio and Metacat to run over large data samples, go here: - -> ## Full Justin/Rucio/Metacat Tutorial -> The [Justin/Rucio/Metacat Tutorial](https://docs.dunescience.org/cgi-bin/sso/RetrieveFile?docid=30145) -> and the [justin tutorial](https://justin.dune.hep.ac.uk/docs/tutorials.dune.md) -{: .challenge} - - -> ## Exercise 1 -> * Use `metacat query ....` to find a file from a particular experiment/run/processing stage. Look in [DataCatalogDocs](https://dune.github.io/DataCatalogDocs/index.html) for hints on constructing queries. -> * Use `metacat file show -m -l namespace:filename` to get metadata for this file. Note that `--json` gives the output in json format. -{: .challenge} - - -When we are analyzing large numbers of files in a group of batch jobs, we use a metacat dataset to describe the full set of files that we are going to analyze, and use the JustIn system to run over that dataset. Each job will then come up and ask metacat and rucio to give it the next file in the list, trying to find the nearest copy. For instance, if you are running at CERN and analyzing this file, it will automatically be taken from the CERN storage space EOS. - - -> ## Exercise 2 - explore in the GUI -> [The Metacat GUI](https://metacat.fnal.gov:9443/dune_meta_prod/app/auth/login) is a nice place to explore the data we have. -> -> You need to log in with your services (not kerberos) password. -> -> Do a datasets search of all namespaces for the word official in a dataset name. -> -> You can then click on sets to see what they contain. -{: .challenge} - -> ## Exercise 3 - explore a dataset -> Use metacat to find information about the dataset justin-tutorial:justin-tutorial-2024. -> How many files are in it, and what is the total size? (Use the `metacat dataset show` and `metacat dataset files` commands.) -> Use rucio to find one of the files in it. 
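Batch tools pick the "nearest" replica for you, as described above; interactively, you can apply your own preference when `rucio list-file-replicas` returns several PFNs. A sketch with a hand-written site order follows -- the preference list is an assumption for illustration, not DUNE policy, and the PFNs are the three replicas shown earlier.

```python
# Sketch: rank replica PFNs by a personal site-preference list.
# PREFERRED_SITES is illustrative only; unknown sites sort last.
PREFERRED_SITES = ["fnal.gov", "cern.ch", "nikhef.nl"]

def site_rank(pfn):
    for rank, site in enumerate(PREFERRED_SITES):
        if site in pfn:
            return rank
    return len(PREFERRED_SITES)

pfns = [
    "root://dune.dcache.nikhef.nl:1094/pnfs/nikhef.nl/data/dune/generic/rucio/hd-protodune/e5/57/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5",
    "root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro//hd-protodune/raw/2024/detector/cosmics/None/00/02/72/96/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5",
    "root://eosctapublic.cern.ch:1094//eos/ctapublic/archive/neutplatform/protodune/rawdata/np04//hd-protodune/raw/2024/detector/cosmics/None/00/02/72/96/np04hd_raw_run027296_0000_dataflow3_datawriter_0_20240619T110330.hdf5",
]
print(min(pfns, key=site_rank))  # the fndca1.fnal.gov replica
```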
-{: .challenge} - -**Resources**: - -- [DataCatalogDocs][DataCatalogDocs] -- The [Justin/Rucio/Metacat Tutorial](https://docs.dunescience.org/cgi-bin/sso/RetrieveFile?docid=30145) -- [justin tutorial](https://justin.dune.hep.ac.uk/docs/tutorials.dune.md) - - - - - - -## Quiz - -> ## Question 01 -> -> What is file metadata? ->
->
->   A. Information about how and when a file was made
->   B. Information about what type of data the file contains
->   C. Conditions such as liquid argon temperature while the file was being written
->   D. Both A and B
->   E. All of the above
->
-> -> > ## Answer -> > The correct answer is D - Both A and B. -> > {: .output} -> > Comment here -> {: .solution} -{: .challenge} - -> ## Question 02 -> -> How do we determine a DUNE data file location? ->
->
->   A. Do `ls -R` on /pnfs/dune and grep
->   B. Use `rucio list-file-replicas namespace:filename --pfns --protocols=root`
->   C. Ask the data management group
->   D. None of the above
->
-> -> > ## Answer -> > The correct answer is B - use `rucio list-file-replicas namespace:filename --pfns --protocols=root`. -> > {: .output} -> {: .solution} {: .challenge} - - -## Useful links to bookmark -* DataCatalog: [https://dune.github.io/DataCatalogDocs](https://dune.github.io/DataCatalogDocs/index.html) -* metacat: [https://metacat.readthedocs.io/en/latest/](https://metacat.readthedocs.io/en/latest/) -* rucio: [https://rucio.github.io/documentation/](https://rucio.github.io/documentation/) -* Pre-2024 Official dataset definitions: [dune-data.fnal.gov](https://dune-data.fnal.gov) -* [UPS reference manual](http://www.fnal.gov/docs/products/ups/ReferenceManual/) -* [UPS documentation (redmine)](https://cdcvs.fnal.gov/redmine/projects/ups/wiki) -* UPS qualifiers: [About Qualifiers (redmine)](https://cdcvs.fnal.gov/redmine/projects/cet-is-public/wiki/AboutQualifiers) -* [mrb reference guide (redmine)](https://cdcvs.fnal.gov/redmine/projects/mrb/wiki/MrbRefereceGuide) -* CVMFS on DUNE wiki: [Access files in CVMFS](https://wiki.dunescience.org/wiki/DUNE_Computing/Access_files_in_CVMFS) - -[Ifdh_commands]: https://cdcvs.fnal.gov/redmine/projects/ifdhc/wiki/Ifdh_commands -[xrootd-man-pages]: https://xrootd.slac.stanford.edu/docs.html -[Understanding-storage]: https://cdcvs.fnal.gov/redmine/projects/fife/wiki/Understanding_storage_volumes -[useful-samweb]: https://wiki.dunescience.org/wiki/Useful_ProtoDUNE_samweb_parameters -[dune-data-fnal]: https://dune-data.fnal.gov/ -[dune-data-fnal-how-works]: https://dune-data.fnal.gov/tutorial/howitworks.pdf -[sam-data-control]: https://wiki.dunescience.org/wiki/Using_the_SAM_Data_Catalog_to_find_data -[sam-longer]: https://dune.github.io/computing-basics/sam-by-schellman/index.html -[Spack documentation]: https://fifewiki.fnal.gov/wiki/Spack -[DataCatalogDocs]: https://dune.github.io/DataCatalogDocs/index.html -[MetaCatGlossary]: https://dune.github.io/DataCatalogDocs/glossary.html - -{%include links.md%} - diff --git a/_episodes/04-intro-art-larsoft.md b/_episodes/04-intro-art-larsoft.md deleted file mode 
100644 index 3e21257..0000000 --- a/_episodes/04-intro-art-larsoft.md +++ /dev/null @@ -1,698 +0,0 @@ ---- -title: Introduction to art and LArSoft (2024 - Apptainer version) -teaching: 50 -exercises: 0 -questions: -- Why do we need a complicated software framework? Can't I just write standalone code? -objectives: -- Learn what services the *art* framework provides. -- Learn how the LArSoft toolkit is organized and how to use it. -keypoints: -- Art provides the tools physicists in a large collaboration need in order to contribute software to a large, shared effort without getting in each others' way. -- Art helps us keep track of our data and job configuration, reducing the chances of producing mystery data whose origin no one knows. -- LArSoft is a set of simulation and reconstruction tools shared among the liquid-argon TPC collaborations. ---- - -#### Session Video - -The session video from December 10, 2024 was captured for your asynchronous review. - 
- -
- - - - - - - -## Advertisement -- February 2025 LArSoft workshop at CERN - -[https://indico.cern.ch/event/1461779/overview](https://indico.cern.ch/event/1461779/overview) - -This page is protected by a password. Dom Brailsford sent this password in an e-mail to the DUNE Collaboration on November 6, 2024. - -## Introduction to *art* - -*Art* is the framework for the offline software used to process LArTPC data from the far detector and the ProtoDUNEs. It was chosen not only because of the features it provides, but also because it allows DUNE to use and share algorithms developed for other LArTPC experiments, such as ArgoNeuT, LArIAT, MicroBooNE and ICARUS. The section below describes LArSoft, a shared software toolkit. Art is also used by the NOvA and mu2e experiments. The primary language for *art* and experiment-specific plug-ins is C++. - -The *art* wiki page is here: [https://cdcvs.fnal.gov/redmine/projects/art/wiki][art-wiki]. It contains important information on command-line utilities, how to configure an *art* job, how to define, read in and write out data products, and how and when to use *art* modules, services, and tools. - -*Art* features: - -1. Defines the event loop -2. Manages event data storage memory and prevents unintended overwrites -3. Input file interface -- allows ganging together input files -4. Schedules module execution -5. Defines a standard way to store data products in *art*-formatted ROOT files -6. Defines a format for associations between data products (for example, tracks have hits, and associations between tracks and hits can be made via art's association mechanism) -7. Provides a uniform job configuration interface -8. Stores job configuration information in *art*-formatted ROOT files. -9. Output file control -- lets you define output filenames based on parts of the input filename. -10. Message handling -11. Random number control -12. 
Exception handling - -The configuration storage is particularly useful if you receive a data file from a colleague, or find one in a data repository, and you want to know more about how it was produced and with what settings. - -### Getting set up to try the tools - -Log in to a `dunegpvm*.fnal.gov` machine and set up your environment (this script is defined in Exercise 5 of https://dune.github.io/computing-training-basics/setup.html). - -> ## Note -> For now, do this in the Apptainer. Because the container must be set up separately on the build nodes and the gpvms (the /pnfs mounts differ), and because you will want to keep your environment clean for use on other experiments, it is best to define aliases in your .profile, .bashrc, or whichever login script you use. A set of convenient aliases is: -{: .challenge} - -~~~ -alias dunesl7="/cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin/apptainer shell --shell=/bin/bash -B /cvmfs,/exp,/nashome,/pnfs/dune,/opt,/run/user,/etc/hostname,/etc/hosts,/etc/krb5.conf --ipc --pid /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-dev-sl7:latest" - -alias dunesl7build="/cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin/apptainer shell --shell=/bin/bash -B /cvmfs,/exp,/build,/nashome,/opt,/run/user,/etc/hostname,/etc/hosts,/etc/krb5.conf --ipc --pid /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-dev-sl7:latest" - -alias dunesetups="source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh" -~~~ -{: .language-bash} - -Then you can use the appropriate alias to start the SL7 container on either the build node or the gpvms. Starting a container gives you a very bare environment -- it does not source your .profile for you; you have to do that yourself. The examples below assume you put the aliases above in your .profile or in a script sourced by your .profile. I always set the prompt variable PS1 in my profile so I can tell that I've sourced it. 
- -~~~ -PS1="<`hostname`> "; export PS1 -~~~ -{: .language-bash} - -Then when you log in, you can type these commands to set up your environment in a container: -~~~ -dunesl7 -source .profile -dunesetups - -export DUNELAR_VERSION=v10_00_04d00 -export DUNELAR_QUALIFIER=e26:prof -setup dunesw $DUNELAR_VERSION -q $DUNELAR_QUALIFIER - -setup_fnal_security -~~~ -{: .language-bash} - -~~~ -# define a sample file -export SAMPLE_FILE=root://fndcadoor.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/physics/full-reconstructed/2023/mc/out1/MC_Winter2023_RITM1592444_reReco/54/05/35/65/NNBarAtm_hA_BR_dune10kt_1x2x6_54053565_607_20220331T192335Z_gen_g4_detsim_reco_65751406_0_20230125T150414Z_reReco.root -~~~ -{: .language-bash} - -The examples below will refer to files in `dCache` at Fermilab which can best be accessed via `xrootd`. - -**For those with no access to Fermilab computing resources but with a CERN account:** -Copies are stored in `/afs/cern.ch/work/t/tjunk/public/jan2023tutorialfiles/`. - -The follow-up of this tutorial provides help on how to find data and MC files in storage. - -You can list available versions of `dunesw` installed in `CVMFS` with this command: - -~~~ -ups list -aK+ dunesw -~~~ -{: .language-bash} - -The output is not sorted, although portions of it may look sorted. Do not depend on it being sorted. The string indicating the version is called the version tag (v09_72_01d00 here). The qualifiers are e26 and prof. Qualifiers can be entered in any order and are separated by colons. "e26" corresponds to a specific version of the GNU compiler -- v9.3.0. We also compile with `clang` -- the compiler qualifier for that is "c7". - -"prof" means "compiled with optimizations turned on." "debug" means "compiled with optimizations turned off". More information on qualifiers is [here][about-qualifiers]. - -In addition to the version and qualifiers, `UPS` products have "flavors". This refers to the operating system type and version. 
Older versions of DUNE software supported `SL6` and some versions of macOS. Currently only SL7 and the compatible CentOS 7 are supported. The flavor of a product is automatically selected to match your current operating system when you set up a product. If a product does not have a compatible flavor, you will get an error message. "Unflavored" products are ones that do not depend on the operating-system libraries. They are listed with a flavor of "NULL". - -There is a setup command provided by the operating system -- you usually don't want to use it (at least not when developing DUNE software). If you haven't yet sourced the `setup_dune.sh` script in `CVMFS` above but type `setup xyz` anyway, you will get the system setup command, which will ask you for the root password. Just `control-C` out of it, source the `setup_dune.sh` script, and try again. On AL9 and the SL7 container, there is no system setup command, so you will get "command not found" if you haven't yet set up UPS. - -UPS's setup command (find out where it lives with this command): - -~~~ -type setup -~~~ -{: .language-bash} - -will not only set up the product you specify (in the instructions above, dunesw), but also all dependent products with corresponding versions, so that you get a consistent software environment. You can get a list of everything that's set up with this command: - -~~~ - ups active -~~~ -{: .language-bash} - -It is often useful to pipe the output through grep to find a particular product. - -~~~ - ups active | grep geant4 -~~~ -{: .language-bash} - -for example, to see what version of geant4 you have set up. - -### *Art* command-line tools - -All of these command-line tools have online help. Invoke the help feature with the `--help` command-line option. Example: - -~~~ -config_dumper --help -~~~ -{: .language-bash} - -Documentation on art command-line tools is available on the [art wiki page][art-wiki]. 
- -#### config_dumper - -Configuration information for a file can be printed with config_dumper. - -~~~ -config_dumper -P <artrootfile> -~~~ -{: .language-bash} - -Try it out: -~~~ -config_dumper -P $SAMPLE_FILE -~~~ -{: .language-bash} - -The output is an executable `fcl` file, sent to stdout. We recommend redirecting the output to a file that you can look at in a text editor: - -Try it out: -~~~ -config_dumper -P $SAMPLE_FILE > tmp.fcl -~~~ -{: .language-bash} - -Your shell may be configured with `noclobber`, meaning that if you already have a file called `tmp.fcl`, the shell will refuse to overwrite it. Just `rm tmp.fcl` and try again. - -The `-P` option to `config_dumper` is needed to tell `config_dumper` to print out all processing configuration `fcl` parameters. The default behavior of `config_dumper` prints out only a subset of the configuration parameters, and is most notably missing art services configuration. - - -> ## Quiz -> -> Quiz questions from the output of the above run of `config_dumper`: -> -> 1. What generators were used? What physics processes are simulated in this file? -> 2. What geometry is used? (hint: look for "GDML" or "gdml") -> 3. What electron lifetime was assumed? -> 4. What is the readout window size? -> -{: .solution} - - -#### fhicl-dump - -You can parse a `FCL` file with `fhicl-dump`. - -Try it out: -~~~ -fhicl-dump protoDUNE_refactored_g4_stage2.fcl -~~~ -{: .language-bash} - -See the section below on `FCL` files for more information on what you're looking at. - -#### count_events - -Try it out: -~~~ -count_events $SAMPLE_FILE -~~~ -{: .language-bash} - - -#### product_sizes_dumper - -You can get a peek at what's inside an *art*ROOT file with `product_sizes_dumper`. - -Try it out: -~~~ -product_sizes_dumper -f 0 $SAMPLE_FILE -~~~ -{: .language-bash} - -It is also useful to redirect the output of this command to a file so you can look at it with a text editor and search for items of interest. 
This command lists the sizes of the `TBranches` in the `Events TTree` in the *art*ROOT file. There is one `TBranch` per data product, and the name of the `TBranch` is the data product name, an "s" is appended (even if the plural of the data product name doesn't make sense with just an "s" on the end), an underscore, then the module label that made the data product, an underscore, the instance name, an underscore, and the process name and a period. - - -Quiz questions, looking at the output from above. - -> ## Quiz -> Questions: -> 1. What is the name of the data product that takes up the most space in the file? -> 2. What the module label for this data product? -> 3. What is the module instance name for this data product? (This question is tricky. You have to count underscores here). -> 4. How many different modules produced simb::MCTruth data products? What are their module labels? -> 5. How many different modules produced recob::Hit data products? What are their module labels? -{: .solution} - -You can open up an *art*ROOT file with `ROOT` and browse the `TTrees` in it with a `TBrowser`. Not all `TBranches` and leaves can be inspected easily this way, but enough can that it can save a lot of time programming if you just want to know something simple about a file such as whether it contains a particular data product and how many there are. - -Try it out -~~~ -root $SAMPLE_FILE -~~~ -{: .language-bash} - -then at the `root` prompt, type: -~~~ -new TBrowser -~~~ -{: .language-bash} - -This will be faster with `VNC`. Navigate to the `Events TTree` in the file that is automatically opened, navigate to the `TBranch` with the Argon 39 MCTruths (it's near the bottom), click on the branch icon `simb::MCTruths_ar39__SinglesGen.obj`, and click on the `NParticles()` leaf (It's near the bottom. Yes, it has a red exclamation point on it, but go ahead and click on it). How many events are there? How many 39Ar decays are there per event on average? 
- -Header files for many data products are in [lardataobj](https://github.com/larsoft/lardataobj) and some are in [nusimdata](https://github.com/NuSoftHEP/nusimdata). - -*Art* is not constrained to using `ROOT` files -- we use HDF5-formatted files for some purposes. ROOT has nice browsing features for inspecting ROOT-formatted files; some HDF5 data visualization tools exist, but they assume that data are in particular formats. ROOT has the ability to display more general kinds of data (C++ classes), but it needs dictionaries for some of the more complicated ones. - -The *art* main executable program is a very short stub that interprets command-line options, reads in the configuration document (a `FHiCL` file which usually includes other `FHiCL` files), and loads shared libraries, initializes software components, and schedules execution of modules. Most code we are interested in is in the form of *art* plug-ins -- modules, services, and tools. The generic executable for invoking *art* is called `art`, but a LArSoft-customized one is called `lar`. No additional customization has yet been applied, so in fact the `lar` executable has identical functionality to the `art` executable. - -There is online help: - -~~~ - lar --help -~~~ -{: .language-bash} - -All programs in the art suite have a `--help` command-line option. - -Most *art* job invocations take the form - -~~~ -lar -n <nevents> -c fclfile.fcl artrootfile.root -~~~ -{: .language-bash} - -where the input file specification is just on the command line without a command-line option. Explicit examples follow below. The `-n <nevents>` is optional -- it specifies the number of events to process. If omitted, or if `<nevents>` is bigger than the number of events in the input file, the job processes all of the events in the input file. `-n <nevents>` is important for the generator stage. There's also a handy `--nskip <nskip>` argument if you'd like the job to start processing partway through the input file. 
You can steer the output with - -~~~ -lar -c fclfile.fcl artrootfile.root -o outputartrootfile.root -T outputhistofile.root -~~~ -{: .language-bash} - - -The `outputhistofile.root` file contains `ROOT` objects that have been declared with the `TFileService` service in user-supplied art plug-in code (i.e. your code). - -### Job configuration with FHiCL - -The Fermilab Hierarchical Configuration Language (FHiCL) is described here: [https://cdcvs.fnal.gov/redmine/documents/327][fhicl-described]. - -FHiCL is **not** a Turing-complete language: you cannot write an executable program in it. It is meant to declare values for named parameters to steer job execution and adjust algorithm parameters (such as the electron lifetime in the simulation and reconstruction). Look at `.fcl` files in installed job directories, like `$DUNESW_DIR/fcl`, for examples. `Fcl` files are sought in the directory search path `FHICL_FILE_PATH` when art starts up and when `#include` statements are processed. A fully-expanded `fcl` file with all the #include statements executed is referred to as a fhicl "document". - -Parameters may be defined more than once. The last instance of a parameter definition wins out over previous ones. This makes for a common idiom in changing one or two parameters in a fhicl document. The generic pattern for making a short fcl file that modifies a parameter is: - -~~~ -#include "fcl_file_that_does_almost_what_I_want.fcl" -block.subblock.parameter: new_value -~~~ -{: .source} - -To see what block and subblock a parameter is in, use `fhicl-dump` on the parent fcl file and look for the curly brackets. You can also use - -~~~ -lar -c fclfile.fcl --debug-config tmp.txt --annotate -~~~ -{: .language-bash} - -which is equivalent to `fhicl-dump` with the `--annotate` option and piping the output to tmp.txt. - -Entire blocks of parameters can be substituted in using `@local` and `@table` idioms. See the examples and documentation for guidance on how to use these. 
Generally they are defined in the PROLOG sections of fcl files. PROLOGs must precede all non-PROLOG definitions, and if their symbols are not subsequently used they do not get put in the final job configuration document (which gets stored with the data and thus may bloat it). This is useful if there are many alternate configurations for some module and only one is chosen at a time. - - -Try it out: -~~~ -fhicl-dump protoDUNE_refactored_g4_stage2.fcl > tmp.txt -~~~ -{: .language-bash} - -Look for the parameter `ModBoxA`. It is one of the Modified Box Model ionization parameters. See what block it is in. Here are the contents of a modified g4 stage 2 fcl file that modifies just that parameter: - -~~~ -#include "protoDUNE_refactored_g4_stage2.fcl" -services.LArG4Parameters.ModBoxA: 7.7E-1 -~~~ -{: .source} - -> ## Exercise -> Do a similar thing -- modify the stage 2 g4 fcl configuration to change the drift field from 486.7 V/cm to 500 V/cm. Hint -- you will find the drift field in an array of fields which also has the fields between wire planes listed. -{: .challenge} - - -### Types of Plug-Ins - -Plug-ins each have their own .so library which gets dynamically loaded by art when referenced by name in the fcl configuration. - -**Producer Modules** -A producer module is a software component that writes data products to the event memory. It is characterized by produces<> and consumes<> statements in the class constructor, and `art::Event::put()` calls in the `produce()` method. A producer must produce the data product collection it says it produces, even if it is empty, or *art* will throw an exception at runtime. `art::Event::put()` transfers ownership of memory (use std::move so as not to copy the data) from the module to the *art* event memory. Data in the *art* event memory will be written to the output file unless output commands in the fcl file tell art not to do that. Documentation on output commands can be found in the LArSoft wiki [here][larsoft-rerun-part-job]. 
Producer modules have methods that are called on begin job, begin run, begin subrun, and on each event, as well as at the end of processing, so you can initialize counters or histograms, and finish up summaries at the end. Source code must be in files of the form: `modulename_module.cc`, where `modulename` does not have any underscores in it. - -**Analyzer Modules** -Analyzer modules read data products from the event memory and produce histograms or TTrees, or other output. They are typically scheduled after the producer modules have been run. Analyzer modules have methods that are called on begin job, begin run, begin subrun, and on each event, as well as at the end of processing, so you can initialize counters or histograms, and finish up summaries at the end. Source code must be in files of the form: `modulename_module.cc`, where `modulename` does not have any underscores in it. - -**Source Modules** -Source modules read data from input files and reformat it as needed, in order to put the data in the *art* event data store. Most jobs use the art-provided RootInput source module which reads in art-formatted ROOT files. RootInput interacts well with the rest of the framework in that it provides lazy reading of TTree branches. When using the RootInput source, data are not actually fetched from the file into memory when the source executes, but only when GetHandle or GetValidHandle or other product get methods are called. This is useful for *art* jobs that only read a subset of the TBranches in an input file. Code for sources must be in files of the form: `modulename_source.cc`, where `modulename` does not have any underscores in it. -Monte Carlo generator jobs use the input source called EmptyEvent. - -**Services** -These are singleton classes that are globally visible within an *art* job. They can be FHiCL configured like modules, and they can schedule methods to be called on begin job, begin run, begin event, etc. 
They are meant to help supply configuration parameters like the drift velocity, or more complicated things like geometry functions, to modules that need them. Please do not use services as a back door for storing event data outside of the *art* event store. Source code must be in files of the form: `servicename_service.cc`, where `servicename` does not have any underscores in it.
-
-**Tools**
-Tools are FHiCL-configurable software components that, unlike services, are not singletons. They are meant to be swappable by FHiCL parameters which tell art which .so libraries to load up, configure, and call from user code. See the [Art Wiki Page][art-wiki-redmine] for more information on tools and other plug-ins.
-
-You can use cetskelgen to make empty skeletons of *art* plug-ins. See the art wiki for documentation, or use
-
-~~~
-cetskelgen --help
-~~~
-{: .language-bash}
-
-for instructions on how to invoke it.
-
-### Ordering of Plug-in Execution
-
-The constructors for each plug-in are called at job-start time, after the plug-in names have been discovered from the fcl configuration and the image activator has loaded the corresponding shared object libraries. Producer, analyzer and service plug-ins have BeginJob, BeginRun, BeginSubRun, EndSubRun, EndRun, EndJob methods where they can do things like book histograms, write out summary information, or clean up memory.
-
-When processing data, the input source always gets executed first, and it defines the run, subrun and event number of the trigger record being processed.
-The producers and filters in trigger_paths then get executed for each event. The analyzers and filters in end_paths then get executed. Analyzers cannot be added to trigger_paths, and producers cannot be added to end_paths. This ordering ensures that data products are all produced by the time they are needed to be analyzed, but it also drives up memory usage, because the products must stay in event memory until the analyzers have run.
-
-Services and tools are visible to other plug-ins at any stage of processing.
They are loaded dynamically from names in the fcl configurations, so a common error is to use a service in code without mentioning it in the job configuration. You will get an error asking you to configure the service, even if it is just an empty configuration with the service name and no parameters set.
-
-
-
-### Non-Plug-In Code
-
-You are welcome to write standard C++ code -- classes and C-style functions are no problem. In fact, to enhance the portability of code, the *art* team encourages the separation of algorithm code into non-framework-specific source files, and to call these functions or class methods from the *art* plug-ins. Typically, source files for standalone algorithm code have the extension .cxx while art plug-ins have .cc extensions. Most directories have a CMakeLists.txt file which has instructions for building the plug-ins, each of which is built into a .so library, and all other code gets built and put in a separate .so library.
-
-### Retrieving Data Products
-
-In a producer or analyzer module, data products can be retrieved from the art event store with `getHandle()` or `getValidHandle()` calls, or more rarely `getManyByType` or other calls. The arguments to these calls specify the module label and the instance of the data product. A typical `TBranch` name in the Events tree in an *art*ROOT file is
-
-~~~
-simb::MCParticles_largeant__G4Stage1
-~~~
-{: .source}
-
-Here, `simb::MCParticle` is the name of the class that defines the data product. The "s" after the data product name is added by *art* -- you have no choice in this even if the plural of your noun ought not to just add an "s". The underscore separates the data product name from the module name, "largeant". Another underscore separates the module name and the instance name, which in this example is the empty string -- there are two underscores together there. The last string is the process name and usually does not need to be specified in data product retrieval.
You can find the `TBranch` names by browsing an artroot file with `ROOT` and using a `TBrowser`, or by using `product_sizes_dumper -f 0`. - -### *Art* documentation - -There is a mailing list -- `art-users@fnal.gov` where users can ask questions and get help. - -There is a workbook for art available at [https://art.fnal.gov/art-workbook/][art-workbook] Look for the "versions" link in the menu on the left for the actual document. It is a few years old and is missing some pieces like how to write a producer module, but it does answer some questions. I recommend keeping a copy of it on your computer and using it to search for answers. - -There was an [art/LArSoft course in 2015][art-LArSoft-2015]. While it, too is a few years old, the examples are quite good and it serves as a useful reference. - -## Gallery - -Gallery is a lightweight tool that lets users read art-formatted root files and make plots without having to write and build art modules. It works well with interpreted and compiled ROOT macros, and is thus ideally suited for data exploration and fast turnaround of making plots. It lacks the ability to use art services, however, though some LArSoft services have been split into services and service providers. The service provider code is intended to be able to run outside of the art framework and linked into separate programs. - -Gallery also lacks the ability to write data products to an output file. You are of course free to open and write files of your own devising in your gallery programs. There are example gallery ROOT scripts in duneexamples/duneexamples/GalleryScripts. They are only in the git repository but do not get installed in the UPS product. - -More documentation: [https://art.fnal.gov/gallery/][art-more-documentation] - -## LArSoft - -### Introductory Documentation - -LArSoft's home page: [larsoft.org](https://larsoft.org) - -The LArSoft wiki is here: [larsoft-wiki](https://larsoft.github.io/LArSoftWiki/). 

-
-### Software structure
-
-The LArSoft toolkit is a set of software components that simulate and reconstruct LArTPC data; it also provides tools for accessing raw data from the experiments. LArSoft contains an interface to GEANT4 (art does not list GEANT4 as a dependency) and the GENIE generator. It contains geometry tools that are adapted for wire-based LArTPC detectors.
-
-LArSoft provides a collection of shared simulation, reconstruction, and analysis tools, with art interfaces. Often a useful algorithm is developed by one experimental collaboration, which then wants to share it with other LArTPC collaborations; this is how much of the software in LArSoft came to be. Interfaces and services have to be standardized for shared use. Things like the detector geometry and the dead channel list, for example, are detector-specific, but shared simulation and reconstruction algorithms need to be able to access information from these services, which are not defined until an experiment's software stack is set up and the lar program is invoked. LArSoft therefore uses plug-ins and class inheritance extensively to deal with these situations.
-
-A recent graph of the UPS products in a full stack starting with dunesw is available [here](https://wiki.dunescience.org/w/img_auth.php/0/07/Dunesw_v10_00_04d00_e26-prof_graph.pdf). You can see the LArSoft pieces under dunesw, as well as GEANT4, GENIE, ROOT, and a few others.
-
-### LArSoft Data Products
-
-A very good introduction to the data products, such as raw digits, calibrated waveforms, hits and tracks, that are created and used by LArSoft modules and usable by analyzers was given by Tingjun Yang at the [2019 ProtoDUNE analysis workshop](https://indico.fnal.gov/event/19133/contributions/50492/attachments/31462/38611/dataproducts.pdf).
-
-There are a number of data product dumper fcl files.
A non-exhaustive list of useful examples is given below:
-
-~~~
- dump_mctruth.fcl
- dump_mcparticles.fcl
- dump_simenergydeposits.fcl
- dump_simchannels.fcl
- dump_simphotons.fcl
- dump_rawdigits.fcl
- dump_wires.fcl
- dump_hits.fcl
- dump_clusters.fcl
- dump_tracks.fcl
- dump_pfparticles.fcl
- eventdump.fcl
- dump_lartpcdetector_channelmap.fcl
- dump_lartpcdetector_geometry.fcl
-~~~
-{: .language-bash}
-
-Some of these may require some configuration of input module labels so they can find the data products of interest. Try one of these yourself:
-
-~~~
-lar -n 1 -c dump_mctruth.fcl $SAMPLE_FILE
-~~~
-{: .language-bash}
-
-This command will make a file called `DumpMCTruth.log` which you can open in a text editor. Reminder: `MCTruth` particles are those made by the generator(s), and MCParticles are those made by GEANT4, except for those owned by the `MCTruth` data products. Due to the showering nature of LArTPCs, there are usually many more MCParticles than MCTruths.
-
-## Examples and current workflows
-
-The page with instructions on how to find and look at ProtoDUNE data has links to standard fcl configurations for simulating and reconstructing ProtoDUNE data: [https://wiki.dunescience.org/wiki/Look_at_ProtoDUNE_SP_data][look-at-protodune].
-
-Try it yourself! The workflow for ProtoDUNE-SP MC is given in the [Simulation Task Force web page](https://wiki.dunescience.org/wiki/ProtoDUNE-SP_Simulation_Task_Force).
- - -### Running on a dunegpvm machine at Fermilab - -~~~ - export USER=`whoami` - mkdir -p /exp/dune/data/users/$USER/tutorialtest - cd /exp/dune/data/users/$USER/tutorialtest - source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh - - export DUNELAR_VERSION=v10_00_04d00 - export DUNELAR_QUALIFIER=e26:prof - setup dunesw $DUNELAR_VERSION -q $DUNELAR_QUALIFIER - - TMPDIR=/tmp lar -n 1 -c mcc12_gen_protoDune_beam_cosmics_p1GeV.fcl -o gen.root - lar -n 1 -c protoDUNE_refactored_g4_stage1.fcl gen.root -o g4_stage1.root - lar -n 1 -c protoDUNE_refactored_g4_stage2_sce_datadriven.fcl g4_stage1.root -o g4_stage2.root - lar -n 1 -c protoDUNE_refactored_detsim_stage1.fcl g4_stage2.root -o detsim_stage1.root - lar -n 1 -c protoDUNE_refactored_detsim_stage2.fcl detsim_stage1.root -o detsim_stage2.root - lar -n 1 -c protoDUNE_refactored_reco_35ms_sce_datadriven_stage1.fcl detsim_stage2.root -o reco_stage1.root - lar -c eventdump.fcl reco_stage1.root >& eventdump_output.txt - config_dumper -P reco_stage1.root >& config_output.txt - product_sizes_dumper -f 0 reco_stage1.root >& productsizes.txt -~~~ -{: .language-bash} - -Note added November 22, 2023: The construct "TMPDIR=/tmp lar ..." defines the environment variable TMPDIR only for the duration of the subsequent command on the line. This is needed for the tutorial example because the mcc12 gen stage copies a 2.9 GB file (see below -- it's the one we had to copy over to CERN) to /var/tmp using ifdh's default temporary location. But the dunegpvm machines as of November 2023 seem to rarely have 2.9 GB of space in /var/tmp and you get a "no space left on device" error. The newer prod4 versions of the fcls point to a newer version of the beam particle generator that can stream this file using XRootD instead of copying it with ifdh. But the streaming flag is turned off by default in the prod4 fcl for the version of dunesw used in this tutorial, and so this is the minimal solution. 
Note for the next iteration: the Prod4 fcls are here: https://wiki.dunescience.org/wiki/ProtoDUNE-SP_Production_IV - -### Run the event display on your new Monte Carlo event -~~~ - lar -c evd_protoDUNE_data.fcl reco_stage1.root -~~~ -{: .language-bash} -and push the "Reconstructed" radio button at the bottom of the display. - -### Display decoded raw digits - -To look at some raw digits in the event display, you need to decode a DAQ file or find one that's already been decoded. The decoder fcl for ProtoDUNE-HD data taken in 2024 is run_pdhd_wibeth3_tpc_decoder.fcl. An event display of an example decoded file is -~~~ - lar -c evd_protoDUNE_data.fcl /exp/dune/data/users/trj/nov2024tutorial/np04hd_raw_run028707_0075_dataflow5_datawriter_0_20240815T154544_decode.root -~~~ -which is a file taken in August 2024. - -### Running at CERN - -This example puts all files in a subdirectory of your home directory. There is an input file for the ProtoDUNE-SP beamline simulation that is copied over and you need to point the generation job at it. The above sequence of commands will work at CERN if you have a Fermilab grid proxy, but not everyone signed up for the tutorial can get one of these yet, so we copied the necessary file over and adjusted a fcl file to point at it. It also runs faster with the local copy of the input file than the above workflow which copies it. - -The apptainer command is slightly different as the mounts are different. Here we assume you are logged into an lxplus node running Alma9. 
-
->#### Note
-> CERN Apptainer variant
-{: .callout}
-
-~~~
-/cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin/apptainer shell --shell=/bin/bash \
--B /cvmfs,/afs,/opt,/run/user,/etc/hostname,/etc/krb5.conf --ipc --pid \
-/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-dev-sl7:latest
-~~~
-{: .language-bash}
-
-Make a fcl file called `tmpgen.fcl` (the generation stage below uses this name) with these contents:
-
-~~~
-#include "mcc12_gen_protoDune_beam_cosmics_p1GeV.fcl"
-physics.producers.generator.FileName: "/afs/cern.ch/work/t/tjunk/public/may2023tutorialfiles/H4_v34b_1GeV_-27.7_10M_1.root"
-~~~
-{: .source}
-
-~~~
- cd ~
- mkdir 2024Tutorial
- cd 2024Tutorial
- source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
-
- export DUNELAR_VERSION=v10_00_04
- export LARSOFT_VERSION=${DUNELAR_VERSION}
- export DUNELAR_QUALIFIER=e26:prof
- setup dunesw $DUNELAR_VERSION -q $DUNELAR_QUALIFIER
-
- lar -n 1 -c tmpgen.fcl -o gen.root
- lar -n 1 -c protoDUNE_refactored_g4_stage1.fcl gen.root -o g4_stage1.root
- lar -n 1 -c protoDUNE_refactored_g4_stage2_sce_datadriven.fcl g4_stage1.root -o g4_stage2.root
- lar -n 1 -c protoDUNE_refactored_detsim_stage1.fcl g4_stage2.root -o detsim_stage1.root
- lar -n 1 -c protoDUNE_refactored_detsim_stage2.fcl detsim_stage1.root -o detsim_stage2.root
- lar -n 1 -c protoDUNE_refactored_reco_35ms_sce_datadriven_stage1.fcl detsim_stage2.root -o reco_stage1.root
- lar -c eventdump.fcl reco_stage1.root >& eventdump_output.txt
- config_dumper -P reco_stage1.root >& config_output.txt
- product_sizes_dumper -f 0 reco_stage1.root >& productsizes.txt
-~~~
-{: .language-bash}
-
-You can also browse the root files with a TBrowser or run other dumper fcl files on them.
The dump example commands above redirect their outputs to text files which you can open with a text editor or run grep on to look for things.
-
-You can run the event display with
-
-~~~
-lar -c evd_protoDUNE.fcl reco_stage1.root
-~~~
-{: .language-bash}
-
-but it will run very slowly over a tunneled X connection. A VNC session will be much faster. Tips: select the "Reconstructed" radio button at the bottom and click on "Unzoom Interest" on the left to see the reconstructed objects in the three views.
-
-
-## DUNE software documentation and how-to's
-
-The following legacy wiki page provides information on how to check out, build, and contribute to dune-specific larsoft plug-in code.
-
-[https://cdcvs.fnal.gov/redmine/projects/dunetpc/wiki][dunetpc-wiki]
-
-The follow-up part of this tutorial gives hands-on exercises for doing these things.
-
-### Contributing to LArSoft
-
-The LArSoft git repositories are hosted on GitHub and use a pull-request model. LArSoft's github link is [https://github.com/larsoft][github-link]. DUNE repositories, such as the dunesw stack, protoduneana and garsoft, are also on GitHub; at the moment they still allow users to push code directly, though that will not be the case for much longer.
-
-To work with pull requests, see the documentation at this link: [https://larsoft.github.io/LArSoftWiki/Developing_With_LArSoft][developing-with-larsoft]
-
-There are bi-weekly LArSoft coordination meetings [https://indico.fnal.gov/category/405/][larsoft-meetings] at which stakeholders, managers, and users discuss upcoming releases, plans, and new features to be added to LArSoft.
-
-## Useful tip: check out an inspection copy of larsoft
-
-A good old-fashioned `grep -r` or a find command can be effective if you are looking for an example of how to call something but do not know where such an example might live. The copies of LArSoft source in CVMFS lack the CMakeLists.txt files, so if that's what you're looking for to find examples, it's good to have a copy checked out.
Here's a script that checks out all the LArSoft source and DUNE LArSoft code but does not compile it. Warning: it deletes a directory called "inspect" in your app area. Make sure `/exp/dune/app/users/` exists first: - - -> ## Note -> Remember the Apptainer! You can use your dunesl7 alias defined at the top of this page. -{: .callout} - -~~~ - #!/bin/bash - USERNAME=`whoami` - source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh - cd /exp/dune/app/users/${USERNAME} - rm -rf inspect - mkdir inspect - cd inspect - mrb newDev - source /exp/dune/app/users/${USERNAME}/inspect/localProducts*/setup - cd srcs - mrb g larsoft_suite - mrb g larsoftobj_suite - mrb g larutils - mrb g larbatch - mrb g dune_suite - mrb g -d dune_raw_data dune-raw-data -~~~ -{: .language-bash} - -Putting it to use: A very common workflow in developing software is to look for an example of how to do something similar to what you want to do. Let's say you want to find some examples of how to use `FindManyP` -- it's an *art* method for retrieving associations between data products, and the art documentation isn't as good as the examples for learning how to use it. You can use a recursive grep through your checked-out version, or you can even look through the installed source in CVMFS. This example looks through the duneprototype product's source files for `FindManyP`: - -~~~ - cd $DUNEPROTOTYPES_DIR/source/duneprototypes - grep -r -i findmanyp * -~~~ -{: .language-bash} - -It is good to use the `-i` option to grep which tells it to ignore the difference between uppercase and lowercase string matches, in case you misremembered the case of what you are looking for. 
The list of matches is quite long -- you may want to pipe the output of that grep into another grep - -~~~ - grep -r -i findmanyp * | grep recob::Hit -~~~ -{: .language-bash} - -The checked-out versions of the software have the advantage of providing some files that don't get installed in CVMFS, notably CMakeLists.txt files and the UPS product_deps files, which you may want to examine when looking for examples of how to do things. - -## GArSoft - -GArSoft is another art-based software package, designed to simulate the ND-GAr near detector. Many components were copied from LArSoft and modified for the pixel-based TPC with an ECAL. You can find installed versions in CVMFS with the following command: - -~~~ -ups list -aK+ garsoft -~~~ -{: .language-bash} - -and you can check out the source and build it by following the instructions on the [GArSoft wiki](https://cdcvs.fnal.gov/redmine/projects/garsoft/wiki). - - -## Quiz - -> ## Question 01 -> -> Enter Question here ->
->
->
-> > ## Answer
-> > The correct answer is .
-> > {: .output}
-> > Comment here
-> {: .solution}
-{: .challenge}
-
-
-{%include links.md%}
-
-[about-qualifiers]: https://cdcvs.fnal.gov/redmine/projects/cet-is-public/wiki/AboutQualifiers
-[art-wiki]: https://cdcvs.fnal.gov/redmine/projects/art/wiki
-[larsoft-rerun-part-job]: https://larsoft.github.io/LArSoftWiki/Rerun_part_of_all_a_job_on_an_output_file_of_that_job
-[github-link]: https://github.com/larsoft
-[protodune-sim-task-force]: https://wiki.dunescience.org/wiki/ProtoDUNE-SP_Simulation_Task_Force
-[larsoft-meetings]: https://indico.fnal.gov/category/405/
-[developing-with-larsoft]: https://larsoft.github.io/LArSoftWiki/Developing_With_LArSoft
-[fhicl-described]: https://cdcvs.fnal.gov/redmine/documents/327
-[garsoft-wiki]: https://cdcvs.fnal.gov/redmine/projects/garsoft/wiki
-[art-wiki-redmine]: https://cdcvs.fnal.gov/redmine/projects/art/wiki#How-to-use-the-modularity-of-art
-[art-more-documentation]: https://art.fnal.gov/gallery/
-[using-larsoft]: https://cdcvs.fnal.gov/redmine/projects/larsoft/wiki/Using_LArSoft
-[larsoft-data-products]: https://indico.fnal.gov/event/19133/contributions/50492/attachments/31462/38611/dataproducts.pdf
-[dunetpc-wiki]: https://cdcvs.fnal.gov/redmine/projects/dunetpc/wiki
-[look-at-protodune]: https://wiki.dunescience.org/wiki/Look_at_ProtoDUNE_SP_data
-[art-LArSoft-2015]: https://indico.fnal.gov/event/9928/timetable/?view=standard
-[art-workbook]: https://art.fnal.gov/art-workbook/
diff --git a/_episodes/05-end-of-basics.md b/_episodes/05-end-of-basics.md
deleted file mode 100644
index 20f3b64..0000000
--- a/_episodes/05-end-of-basics.md
+++ /dev/null
@@ -1,41 +0,0 @@
----
-title: End of the basics lesson - Continue on your own to learn how to build code and submit batch jobs
-teaching: 5
-exercises: 0
-questions:
-- How do I learn more?

-objectives:
-- Find out about more documentation
-- Find out how to ask for help from collaborators.
-keypoints:
-- There is more documentation!
-- People are here to help
----
-
-## You can ask questions here:
-
-- [DUNE Slack](https://app.slack.com/client/T03RN7KU3/C03RN7KV9) (collaborators only)
-  - Channels `#computing-training-basics` and `#larsoft-beginners` are a good start. `#user_grid_usage` is where you go to check if there are system problems.
-
-- [DUNE computing FAQ](https://github.com/orgs/DUNE/projects/19/views/1)
-
-- [Fermilab Helpdesk](https://fermi.servicenowservices.com/wp)
-
-- [List of DUNE computing tutorials](https://wiki.dunescience.org/wiki/Computing_tutorials) (collaborators only)
-
-- [HEP Software Foundation Training](https://hsf-training.org/training-center/)
-Learn about `bash`, `github`, `python`, `cmake`, `root` and many more HEP computing packages.
-
-## You can continue on with these additional modules.
-
-- [The LArSoft tutorial at CERN, February 3-7, 2025](https://indico.cern.ch/event/1461779/) password on the [tutorials page](https://wiki.dunescience.org/wiki/Computing_tutorials)
-
-- [Make your code more efficient]({{ site.baseurl }}/05.1-improve-code-efficiency.md)
-- [The MRB build system]({{ site.baseurl }}/05.5-mrb.md)
-- [How to modify and rebuild LArSoft Modules]({{ site.baseurl }}/06-larsoft-modify-module.md)
-- [Grid job submission]({{ site.baseurl }}/07-grid-job-submission.md)
-- [Grid job submission with justin]({{ site.baseurl }}/08-submit-jobs-w-justin.md)
-- [Debugging grid jobs]({{ site.baseurl }}/09-grid-batch-debug.md)
-
----
\ No newline at end of file
diff --git a/_episodes/05.1-improve-code-efficiency.md b/_episodes/05.1-improve-code-efficiency.md
deleted file mode 100644
index 33a141f..0000000
--- a/_episodes/05.1-improve-code-efficiency.md
+++ /dev/null
@@ -1,361 +0,0 @@
----
-title: Bonus episode -- Code-makeover on how to code for better efficiency
-teaching: 50
-exercises: 0
-questions:
-- How to write
the most efficient code?
-objectives:
-- Learn good tips and tools to improve your code.
-keypoints:
-- CPU, memory, and build time optimizations are possible when good code practices are followed.
----
-
-#### Session Video
-
-The session will be captured on video and placed here after the workshop for asynchronous study.
-
-#### Live Notes
-
-
-
-### Code Make-over
-
-**How to improve your code for better efficiency**
-
-DUNE simulation, reconstruction and analysis jobs take a lot of memory and CPU time. This owes to the large size of the Far Detector modules as well as the many channels in the Near Detectors. Reading out a large volume for a long time with high granularity creates a lot of data that needs to be stored and processed.
-
-### CPU optimization:
-
-**Run with the “prof” build when launching big jobs.** While both the "debug" and "prof" builds have debugging and profiling information included in the executables and shared libraries, the "prof" build has a high level of compiler optimization turned on while "debug" has optimizations disabled. Debugging with the "prof" build can be done, but it is more difficult because operations can be reordered and some variables get put in CPU registers instead of inspectable memory. The “debug” builds are generally much slower, by a factor of four or more. Often this difference is so stark that the time spent repeatedly waiting for a slow program to chug through the first trigger record in an interactive debugging session is more costly than the inconvenience of not being able to see some of the variables in the debugger. If you are not debugging, then there really is (almost) no reason to use the “debug” builds. If your program produces a different result when run with the debug build and the prof build (and it’s not just the random seed), then there is a bug to be investigated.
**Compile your interactive ROOT scripts instead of running them in the interpreter.** At the ROOT prompt, use .L myprogram.C++ (even though its filename is myprogram.C). Also .x myprogram.C++ will compile and then execute it. This will force a compile. .L myprogram.C+ will compile it only if necessary.
-
-**Run gprof or other profilers like valgrind's callgrind:** You might be surprised at what is actually taking all the time in your program. There is abundant documentation on the [web][gnu-manuals-gprof], and also the valgrind online documentation.
-There is no reason to profile a "debug" build, and there is no need to hand-optimize something the compiler will optimize anyway -- doing so may even hurt the optimality of the compiler-optimized version.
-
-**The Debugger can be used as a simple profiler:** If your program is horrendously slow (and/or it used to be fast), pausing it at any time is likely to pause it while it is doing its slow thing. Run your program in the debugger, pause it when you think it is doing its slow thing (i.e. after initialization), and look at the call stack. This technique can be handy because you can then inspect the values of variables that might give a clue if there's a bug making your program slow (e.g. looping over 10<sup>15</sup> wires in the Far Detector, which would indicate a bug, such as an uninitialized loop counter or an unsigned loop counter that is initialized with a negative value).
-
-**Don't perform calculations or do file i/o that will only later be ignored.** It's just a waste of time. If you need to pre-write some code because in future versions of your program the calculation is not ignored, comment it out, or put a test around it so it doesn't get executed when it is not needed.
-
-
-**Extract constant calculations out of loops.**
-
-<table>
-<tr>
-<th>Code Example (BAD)</th>
-<th>Code Example (GOOD)</th>
-</tr>
-<tr>
-<td>
-<pre>
-double sum = 0;
-for (size_t i=0; i&lt;result.size(); ++i)
-{
-  sum += result.at(i)/TMath::Sqrt(2.0);
-}
-</pre>
-</td>
-<td>
-<pre>
-double sum = 0;
-double f = TMath::Sqrt(0.5);
-for (size_t i=0; i&lt;result.size(); ++i)
-{
-  sum += result.at(i)*f;
-}
-</pre>
-</td>
-</tr>
-</table>
-
-The example above also takes advantage of the fact that floating-point multiplies generally have significantly less latency than floating-point divides (this is still true, even with modern CPUs).
-
-**Use sqrt():** Don’t use `pow()` or `TMath::Power` when a multiplication or `sqrt()` function can be used.
-<table>
-<tr>
-<th>Code Example (BAD)</th>
-<th>Code Example (GOOD)</th>
-</tr>
-<tr>
-<td>
-<pre>
-double r = TMath::Power( TMath::Power(x,2) + TMath::Power(y,2), 0.5);
-</pre>
-</td>
-<td>
-<pre>
-double r = TMath::Sqrt( x*x + y*y );
-</pre>
-</td>
-</tr>
-</table>
-
-The reason is that `TMath::Power` (or the C math library’s `pow()`) function must take the logarithm of one of its arguments, multiply it by the other argument, and exponentiate the result. Modern CPUs have a built-in `SQRT` instruction. Modern versions of `pow()` or `Power` may check the power argument for 2 and 0.5 and instead perform multiplies and `SQRT`, but don’t count on it.
-
-If the things you are squaring above are complicated expressions, use `TMath::Sq()` to eliminate the need for typing them out twice, creating temporary variables, or, worse, evaluating slow functions twice. The optimizer cannot optimize the second call to such a function because it may have side effects, like printing something out to the screen or updating some internal variable, and you may have intended for it to be called twice.
-<table>
-<tr>
-<th>Code Example (BAD)</th>
-<th>Code Example (GOOD)</th>
-</tr>
-<tr>
-<td>
-<pre>
-double r = TMath::Sqrt( slow_function_calculating_x()*
-                        slow_function_calculating_x() +
-                        slow_function_calculating_y()*
-                        slow_function_calculating_y() );
-</pre>
-</td>
-<td>
-<pre>
-double r = TMath::Sqrt( TMath::Sq(slow_function_calculating_x()) +
-                        TMath::Sq(slow_function_calculating_y()));
-</pre>
-</td>
-</tr>
-</table>
-
-**Don't call `sqrt()` if you don’t have to.**
-
-<table>
-<tr>
-<th>Code Example (BAD)</th>
-<th>Code Example (GOOD)</th>
-</tr>
-<tr>
-<td>
-<pre>
-if (TMath::Sqrt( x*x + y*y ) &lt; rcut )
-{
-  do_something();
-}
-</pre>
-</td>
-<td>
-<pre>
-double rcutsq = rcut*rcut;
-if (x*x + y*y &lt; rcutsq)
-{
-  do_something();
-}
-</pre>
-</td>
-</tr>
-</table>
-
-**Use binary search features in the STL rather than a step-by-step lookup.** A step-by-step (linear) lookup has to scan, on average, half of the container:
-
-~~~
-std::vector<double> my_vector;
-(fill my_vector with stuff)
-
-size_t indexfound = 0;
-bool found = false;
-for (size_t i=0; i<my_vector.size(); ++i)
-{
-  if (my_vector.at(i) == value_sought)
-  {
-    indexfound = i;
-    found = true;
-    break;
-  }
-}
-~~~
-{: .language-cpp}
-
-If `my_vector` is sorted, `std::binary_search` or `std::lower_bound` will locate `value_sought` in O(log n) steps instead of the O(n) scan above.
**Minimize conversions between int and float or double.**
-
-The up-conversion from int to float takes time, and the down-conversion from float to int loses precision and also takes time. Sometimes you want the precision loss, but sometimes it's a mistake. The same applies to double and float, as in the example below, where the intermediate `float` variable costs two conversions and drops precision:
-
-<table>
-<tr>
-<th>Code Example (BAD)</th>
-<th>Code Example (GOOD)</th>
-</tr>
-<tr>
-<td>
-<pre>
-double sum = 0;
-std::vector &lt;double&gt; results;
-(fill lots of results)
-for (size_t i=0; i&lt;results.size(); ++i)
-{
-  float rsq = results.at(i)*results.at(i);
-  sum += rsq;
-}
-</pre>
-</td>
-<td>
-<pre>
-double sum = 0;
-std::vector &lt;double&gt; results;
-(fill lots of results)
-for (size_t i=0; i&lt;results.size(); ++i)
-{
-  sum += TMath::Sq(results.at(i));
-}
-</pre>
-</td>
-</tr>
-</table>
-
-**Check for NaN and Inf.** While your program will still function if an intermediate result is `NaN` or `Inf` (and it may even produce valid output, especially if the `NaN` or `Inf` is irrelevant), processing `NaN`s and `Inf`s is slower than processing valid numbers. Letting a `NaN` or an `Inf` propagate through your calculations is almost never the right thing to do - check functions for domain validity (square roots of negative numbers, logarithms of zero or negative numbers, divide by zero, etc.) when you execute them and decide at that point what to do. If you have a lengthy computation and the end result is `NaN`, it is often ambiguous at what stage the computation failed.
-
-**Pass objects by reference.** Especially big ones. C and C++ call semantics specify that objects are passed by value by default, meaning that the called method gets a copy of the input. This is okay for scalar quantities like int and float, but not okay for a big vector, for example. The thing to note then is that the called method may modify the contents of the passed object, while an object passed by value can be expected not to be modified by the called method; pass by `const` reference if the callee must not modify the object.
-
-**Use references to receive returned objects created by methods.** That way they don't get copied. The example below is from the VD coldbox channel map. Bad, inefficient code courtesy of Tom Junk, and good code suggestion courtesy of Alessandro Thea. The `infotochanmap` object is a map of maps of maps: `std::unordered_map<int, std::unordered_map<int, std::unordered_map<int, int> > > infotochanmap;`
-
-Code Example (BAD):
-
-~~~
-int dune::VDColdboxChannelMapService::getOfflChanFromWIBConnectorInfo(int wib, int wibconnector, int cechan)
-{
-  int r = -1;
-  auto fm1 = infotochanmap.find(wib);
-  if (fm1 == infotochanmap.end()) return r;
-  auto m1 = fm1->second;   // copies the inner map on every lookup
-  auto fm2 = m1.find(wibconnector);
-  if (fm2 == m1.end()) return r;
-  auto m2 = fm2->second;   // copies again
-  auto fm3 = m2.find(cechan);
-  if (fm3 == m2.end()) return r;
-  r = fm3->second;
-  return r;
-}
-~~~
-{: .source}
-
-Code Example (GOOD):
-
-~~~
-int dune::VDColdboxChannelMapService::getOfflChanFromWIBConnectorInfo(int wib, int wibconnector, int cechan)
-{
-  int r = -1;
-  auto fm1 = infotochanmap.find(wib);
-  if (fm1 == infotochanmap.end()) return r;
-  auto& m1 = fm1->second;   // reference: no copy
-  auto fm2 = m1.find(wibconnector);
-  if (fm2 == m1.end()) return r;
-  auto& m2 = fm2->second;   // reference: no copy
-  auto fm3 = m2.find(cechan);
-  if (fm3 == m2.end()) return r;
-  r = fm3->second;
-  return r;
-}
-~~~
-{: .source}
-
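The earlier advice on checking for `NaN` and `Inf` can be sketched in the same spirit. This is an illustrative snippet, not DUNE code; `safe_log` is a hypothetical helper that validates its argument's domain at the call site instead of letting a `NaN` or `-Inf` propagate through the rest of the calculation:

```cpp
#include <cassert>
#include <cmath>
#include <optional>

// Validate the domain before calling std::log, instead of letting a
// NaN or -Inf propagate silently through later computations.
std::optional<double> safe_log(double x)
{
  if (!std::isfinite(x) || x <= 0.0) return std::nullopt;  // rejects NaN, +-Inf, and x <= 0
  return std::log(x);
}
```

The caller then decides what to do with an empty result at the point of failure, rather than discovering a `NaN` at the end of a long computation.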
-

**Minimize cloning TH1’s.** It is really slow.

**Minimize formatted I/O.** Formatting strings for output is CPU-consuming, even if they are never printed to the screen or output to your logfile. `MF_LOG_INFO` calls, for example, must prepare the string for printing even if it is configured not to output it.

**Avoid using caught exceptions as part of normal program operation.** While this isn't an efficiency issue or even a code readability issue, it is a problem when debugging programs. Most debuggers have a feature to set a breakpoint on thrown exceptions. This is sometimes necessary in order to track down a stubborn bug. Bugs that stop program execution, like segmentation faults, are sometimes easier to track down than caught exceptions (which often aren't even bugs, but sometimes they are). If many caught exceptions take place before the buggy one, then the breakpoint on thrown exceptions has limited value.

**Use sparse matrix tools where appropriate.** This also saves memory.

**Minimize database access operations.** Bundle the queries together in blocks if possible. Do not pull more information than is needed out of the database. Cache results so you don’t have to repeat the same data retrieval operation.

**Use `std::vector::reserve()` to size your vector right if you know in advance how big it will be.** `std::vector` will, if a `push_back()` grows it beyond its current capacity, allocate twice the memory of the existing vector and copy the contents of the old vector to the new memory. This operation will be repeated each time you start with a zero-size vector and push_back a lot of data.

**Factorize your program into parts that do I/O and parts that compute.** That way, if you don’t need to do one of them, you can switch it off without having to rewrite everything. Example: say you read data in from a file and make a histogram that you are sometimes interested in looking at but usually not.
The data reader should not always make the histogram by default; instead, the histogram-filling should go in a separate module that can be steered with fcl, so that the computations needed to fill the histogram can be skipped when it is not wanted.

## Memory optimization:

Use `valgrind`. Its default operation checks for memory leaks and invalid accesses. Search the output for the words “invalid” and “lost”. Valgrind is a `UPS` product you can set up along with everything else. It is set up as part of the dunesw stack.

~~~
setup valgrind
valgrind --leak-check=yes --suppressions=$ROOTSYS/etc/valgrind-root.supp myprog arg1 arg2
~~~
{: .source}

More information is available [here][valgrind-quickstart]. ROOT-specific suppressions are described [here][valgrind-root]. You can omit them, but your output file will be cluttered up with messages about things that ROOT does routinely that are not bugs.

Use `massif`. `massif` is a heap checker, a tool provided with `valgrind`; see documentation [here][valgrind-ms-manual].

**Free up memory after use.** Don’t hoard it after your module has exited.

**Don’t constantly re-allocate memory if you know you’re going to use it again right away.**

**Use STL containers instead of fixed-size arrays, to allow for growth in size.** Back in the bad old days (Fortran 77 and earlier), fixed-size arrays had to be declared at compile time, sized as big as they could possibly need to be, both wasting memory on average and creating artificial cutoffs on the sizes of problems that could be handled. This behavior is very easy to replicate in C++. Don’t do it.

**Be familiar with the STL containers and their access idioms.** These include `std::vector`, `std::map`, `std::unordered_map`, `std::set`, and `std::list`.

**Minimize the use of new and delete to reduce the chances of memory leaks.** If your program doesn’t leak memory now, that’s great, but years from now after maintenance has been transferred, someone might introduce a memory leak.
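A minimal sketch of the `reserve()` and new/delete advice above (the function names are illustrative, not from LArSoft): size the vector once up front, and let containers and smart pointers own memory so nothing needs a hand-written delete:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Reserve once: avoids the repeated allocate-and-copy cycles that
// push_back triggers as a vector grows from zero.
std::vector<int> make_squares(std::size_t n)
{
  std::vector<int> v;
  v.reserve(n);  // single allocation up front
  for (std::size_t i = 0; i < n; ++i) v.push_back(static_cast<int>(i * i));
  return v;
}

// RAII: the unique_ptr frees its buffer when it goes out of scope,
// so a forgotten delete can never turn into a leak.
std::unique_ptr<double[]> make_buffer(std::size_t n)
{
  return std::make_unique<double[]>(n);  // elements are zero-initialized
}
```

The same pattern scales up: if ownership always lives in a container or smart pointer, a future maintainer cannot introduce a leak by forgetting a delete.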
-

**Use move semantics to transfer data ownership without copying it.**

**Do not store an entire event’s worth of raw digits in memory all at once.** Find some way to process the data in pieces.

**Consider using more compact representations in memory.** A `float` takes half the space of a `double`. A `size_t` is 64 bits long (usually). Often that’s needed, but sometimes it’s overkill.

**Optimize the big uses and don’t spend a lot of time on things that don’t matter.** If you have one instance of a loop counter that’s a `size_t` and it loops over a million vector entries, each of which is an `int`, look at the entries of the vector, not the loop counter (which ought to be on the stack anyway).

**Rebin histograms.** Some histograms, say binned in channels x ticks or channels x frequency bins for a 2D FFT plot, can get very memory hungry.

## I/O optimization:

**Do as much calculation as you can per data element read.** You can spin over a TTree once per plot, or you can spin through the TTree once and make all the plots. ROOT compresses data by default on write and uncompresses it on read-in, so this is both an I/O and a CPU issue: minimize the data that are read.

**Read only the data you need.** ROOT's TTree access methods are set up to give you only the requested TBranches. If you use TTree::MakeClass to write a template analysis ROOT macro script, it will generate code that reads in _all_ TBranches and leaves. It is easy to trim out the extras to speed up your workflow.

**Saving compressed data reduces I/O time and storage needs.** Even though compressing data takes CPU, with a slow disk or network your workflow can in fact be faster if it trades CPU time for disk-read time.

**Stream data with xrootd.** You will wait less for your first event than if you copy the file, put less stress on the data storage elements, and have more reliable I/O with dCache.
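The move-semantics tip above can be illustrated with a minimal sketch (illustrative only, not DUNE code): `std::move` hands the vector's heap buffer to the new owner instead of copying every element:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Take ownership of a large buffer without an element-by-element copy:
// the rvalue reference lets us steal the source vector's heap storage.
std::vector<float> take_ownership(std::vector<float>&& data)
{
  return std::move(data);  // moves the buffer; the source is left valid but unspecified
}
```

For a million-entry vector this is a pointer swap rather than a million copies, which is exactly the saving you want when handing event data from one stage of a program to the next.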
-

## Build time optimization:

**Minimize the number of #included files.** If you don’t need an #include, don’t use it. It takes time to find these files in the search path and include them.

**Break up very large source files into pieces.** `g++`'s analysis and optimization steps take an amount of time that grows faster than linearly with the number of source lines.

**Use ninja instead of make.** Instructions are [here][ninjadocpageredmine].

## Workflow optimization:

**Pre-stage your datasets.** It takes a lot of time to wait for a tape (sometimes hours!). CPUs are accounted by wall-clock time, whether you're using them or not. So if your jobs are waiting for data, they will run slowly even if you optimized the CPU usage. Pre-stage your data!

**Run a test job.** If you have a bug, you will save time by not submitting large numbers of jobs that might not work.

**Write out your variables in your own analysis ntuples (TTrees).** You will likely have to run over the same MC and data events repeatedly, and the faster this is the better. You will have to adjust your cuts, tune your algorithms, estimate systematic uncertainties, train your deep-learning functions, debug your program, and tweak the appearance of your plots. If the data you need for these operations are available interactively, you will be able to perform these tasks faster. Choose a minimal set of variables to put in your ntuples to save on storage space.

**Write out histograms to ROOT files and decorate them in a separate script.** You may need to experiment many times with borders, spacing, ticks, fonts, colors, line widths, shading, labels, titles, legends, axis ranges, etc. Best not to have to re-compute the contents when you're doing this, so save the histograms to a file first and read them in to touch them up for presentation.
-

## Software readability and maintainability:

**Keep the test suite up to date.** dunesw and larsoft have many examples of unit tests and integration tests. A colleague's commit to your code, to a different piece of code, or even to a data file might break your code in unexpected, difficult-to-diagnose ways. The continuous integration (CI) system is there to catch such breakage, and even small changes in run time, memory consumption, and data product output.

**Keep your methods short.** If you have loaded up a lot of functionality in a method, it may become hard to reuse the components to do similar things. A long method is probably doing a lot of different things that can be given meaningful names.

**Update the comments when code changes.** Not many things are more confusing than an out-of-date comment that refers to how code used to work long ago.

**Update names when meaning changes.** As software evolves, the meaning of the variables may shift. It may be a quick fix to change the contents of a variable without changing its name, but some variables may then contain content that is the opposite of what the variable name implies. While the code will run, future maintainers will get confused.

**Use const frequently.** The const keyword prevents overwriting variables unintentionally. Constness is how *art* protects the data in its event memory. This mechanism is exposed to the user in that pointers to const memory must be declared as pointers to const, or you will get obscure error messages from the compiler. Const can also protect you from yourself and your colleagues when you know that the contents of a variable ought not to change.

**Use simple constructs even if they are more verbose.** Sometimes very clever, terse expressions get the job done, but they can be difficult for a human to understand if and when that person must make a change.
There is an [obfuscated C contest][obfuscated-C] if you want to see examples of difficult-to-read code (that may in fact be very efficient! But people's time is important, too).

**Always initialize variables when you declare them.** Compilers will warn about the use of uninitialized variables, so you will get used to doing this anyway. The initialization step takes a little time and it is not needed if the first use of the memory is to set the variable, which is why compilers do not automatically initialize variables.

**Minimize the scope of variables.** Often a variable will only have a meaningful value inside of a loop. You can declare variables as you use them. Old languages like Fortran 77 insisted that you declare all variables at the start of a program block. This is not true in C and C++. Declaring variables inside of blocks delimited by braces means they will go out of scope when the program exits the block, both freeing the memory and preventing you from referring to the variable after the loop is done and seeing only the last value it took. Sometimes this is the desired behaviour, though, and so this is not a blanket rule.

## Coding for Thread Safety

Modern CPUs often have many cores available. It is not unusual for a grid worker node to have as many as 64 cores on it, and 128 GB of RAM. Making use of the available hardware to maximize throughput is an important way to optimize our time and resources. DUNE jobs tend to be "embarrassingly parallel", in that they can be divided up into many small jobs that do not need to communicate with one another. Therefore, making use of all the cores on a grid node is usually as easy as breaking a task up into many small jobs and letting the grid schedulers work out what jobs run where. The issue, however, is effective memory usage.
If several small jobs share a lot of memory whose contents do not change (code libraries loaded into RAM, geometry description, calibration constants), then one can group the work together into a single job that uses multiple threads to get the work done faster. If the memory usage of a job is dominated by per-event data, then loading multiple events' worth of data in RAM in order to keep all the cores fed with data may not provide a noticeable improvement in the utilization of CPU time relative to memory time.

Sometimes multithreading has advantages within a trigger record. Data from different wires or APAs may be processed simultaneously. Software managers would like to make sure that the number of threads a program is allowed to spawn is controllable. Some grid sites do not have an automatic protection against a program that creates more threads than the CPUs it has requested. Instead, a human operator may notice that the load on a system is far greater than the number of cores, and track down and ban the offending job submitter (this has already happened on DUNE). If a program contains components, some of which manage their own threads, then it becomes hard to manage the total thread count in the program. Multithreaded *art* keeps track of the total thread count using TBB (Threading Building Blocks).

See this very thorough [presentation][knoepfel-thread-safety] by Kyle Knoepfel at the 2019 LArSoft [workshop][LArSoftWorkshop2019]. Several other talks at the workshop also focus on multi-threaded software. In short, if data are shared between threads and they are mutable, this is a recipe for race conditions and non-reproducible behavior of programs. Giving each thread a separate instance of each object is one way to contain possible race conditions. Alternatively, private and public class members which do not change or which have synchronized access methods can also help provide thread safety.
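A minimal sketch of the shared-mutable-data point above (illustrative only, not *art* or TBB code): each thread accumulates into its own local variable and takes a `std::mutex` only when combining results, so the shared total is never written by two threads at once and the answer is reproducible:

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Sum nthreads * per_thread increments using per-thread accumulators.
// Only the final combine step touches shared state, and it does so
// under a lock, so there is no race on `total`.
long counted_sum(int nthreads, long per_thread)
{
  long total = 0;
  std::mutex m;
  std::vector<std::thread> pool;
  for (int t = 0; t < nthreads; ++t) {
    pool.emplace_back([&] {
      long local = 0;  // private to this thread: no locking needed in the hot loop
      for (long i = 0; i < per_thread; ++i) ++local;
      std::lock_guard<std::mutex> lock(m);  // synchronized access to the shared total
      total += local;
    });
  }
  for (auto& th : pool) th.join();
  return total;
}
```

Incrementing `total` directly from every thread without the lock would compile and usually "work", but would occasionally lose updates, which is exactly the non-reproducible behavior described above.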
- - -[cpp-lower-bound]: https://en.cppreference.com/w/cpp/algorithm/lower_bound -[gnu-manuals-gprof]: https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_mono/gprof.html -[valgrind-quickstart]: https://www.valgrind.org/docs/manual/quick-start.html -[valgrind-ms-manual]: https://www.valgrind.org/docs/manual/ms-manual.html -[ninjadocpageredmine]: https://cdcvs.fnal.gov/redmine/projects/dunetpc/wiki/_Tutorial_#Using-the-ninja-build-system-instead-of-make -[valgrind-root]: https://root-forum.cern.ch/t/valgrind-and-root/28506 -[obfuscated-C]: https://www.ioccc.org/ -[knoepfel-thread-safety]: https://indico.fnal.gov/event/20453/contributions/57777/attachments/36182/44065/2019-LArSoftWorkshop-ThreadSafety.pdf -[LArSoftWorkshop2019]: https://indico.fnal.gov/event/20453/timetable/?view=standard - -{%include links.md%} diff --git a/_episodes/05.5-mrb.md b/_episodes/05.5-mrb.md deleted file mode 100644 index ff86bc7..0000000 --- a/_episodes/05.5-mrb.md +++ /dev/null @@ -1,38 +0,0 @@ ---- -title: Multi Repository Build (mrb) system (2024) -teaching: 10 -exercises: 0 -questions: -- How are different software versions handled? -objectives: -- Understand the roles of the tool mrb -keypoints: -- The multi-repository build (mrb) tool allows code modification in multiple repositories, which is relevant for a large project like LArSoft with different cases (end user and developers) demanding consistency between the builds. ---- - -## mrb -**What is mrb and why do we need it?** -Early on, the LArSoft team chose git and cmake as the software version manager and the build language, respectively, to keep up with industry standards and to take advantage of their new features. When we clone a git repository to a local copy and check out the code, we end up building it all. We would like LArSoft and DUNE code to be more modular, or at least the builds should reflect some of the inherent modularity of the code. 
-

Ideally, we would like to only have to recompile a fraction of the software stack when we make a change. The granularity of the build in LArSoft and other art-based projects is the repository. So LArSoft and DUNE have divided code up into multiple repositories (DUNE ought to divide more than it has, but there are a few repositories already with different purposes). Sometimes one needs to modify code in multiple repositories at the same time for a particular project. This is where mrb comes in.

**mrb** stands for "multi-repository build". mrb has features for cloning git repositories, setting up build and local products environments, building code, and checking for consistency (e.g. that there are not two modules with the same name or two fcl files with the same name). mrb builds UPS products -- when it installs the built code into the localProducts directory, it also makes the necessary UPS table files and .version directories. mrb also has a tool for making a tarball of a build product for distribution to the grid. The software build example later in this tutorial exercises some of the features of mrb.

| Command                  | Action                                              |
|--------------------------|-----------------------------------------------------|
| `mrb --help`             | prints list of all commands with brief descriptions |
| `mrb <command> --help`   | displays help for that command                      |
| `mrb gitCheckout`        | clone a repository into working area                |
| `mrbsetenv`              | set up build environment                            |
| `mrb build -jN`          | builds local code with N cores                      |
| `mrb b -jN`              | same as above                                       |
| `mrb install -jN`        | installs local code with N cores                    |
| `mrb i -jN`              | same as above (this will do a build also)           |
| `mrbslp`                 | set up all products in localProducts...             |
| `mrb z`                  | get rid of everything in build area                 |

Link to the [mrb reference guide](https://cdcvs.fnal.gov/redmine/projects/mrb/wiki/MrbRefereceGuide)

> ## Exercise 1
> There is no exercise 5.
mrb example exercises will be covered in a later session as any useful exercise with mrb takes more than 30 minutes on its own. Everyone gets 100% credit for this exercise! -{: .challenge} diff --git a/_episodes/06-larsoft-modify-module.md b/_episodes/06-larsoft-modify-module.md deleted file mode 100644 index 45b5ba5..0000000 --- a/_episodes/06-larsoft-modify-module.md +++ /dev/null @@ -1,699 +0,0 @@ ---- -title: Expert in the Room - LArSoft How to modify a module - in progress -teaching: 15 -exercises: 0 -questions: -- How do I check out, modify, and build DUNE code? -objectives: -- How to use mrb. -- Set up your environment. -- Download source code from DUNE's git repository. -- Build it. -- Run an example program. -- Modify the job configuration for the program. -- Modify the example module to make a custom histogram. -- Test the modified module. -- Stretch goal -- run the debugger. -key points: -- DUNE's software stack is built out of a tree of UPS products. -- You don't have to build all of the software to make modifications -- you can check out and build one or more products to achieve your goals. -- You can set up pre-built CVMFS versions of products you aren't developing, and UPS will check version consistency, though it is up to you to request the right versions. -- mrb is the tool DUNE uses to check out software from multiple repositories and build it in a single test release. -- mrb uses git and cmake, though aspects of both are exposed to the user. ---- - - - -## First learn a bit about the MRB system - -Link to the [mrb]({{ site.baseurl }}/05.5-mrb) episode - -## getting set up - -You will need *three* login sessions. These have different -environments set up. - -* Session #1 For editing code (and searching for code) -* Session #2 For building (compiling) the software -* Session #3 For running the programs - -## Session 1 - -Start up session #1, editing code, on one of the dunegpvm*.fnal.gov -interactive nodes. 
These scripts have also been tested on the -lxplus.cern.ch interactive nodes. - -> ## Note Remember the Apptainer! -> see below for special Apptainers for CERN and build machines. -{: .callout} - -Create two scripts in your home directory: - -`newDev2024Tutorial.sh` should have these contents: - -~~~ -#!/bin/bash -export DUNELAR_VERSION=v10_00_04d00 -export PROTODUNEANA_VERSION=$DUNELAR_VERSION -DUNELAR_QUALIFIER=e26:prof -DIRECTORY=2024tutorial -USERNAME=`whoami` -export WORKDIR=/exp/dune/app/users/${USERNAME} -if [ ! -d "$WORKDIR" ]; then - export WORKDIR=`echo ~` -fi - -source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh - -cd ${WORKDIR} -touch ${DIRECTORY} -rm -rf ${DIRECTORY} -mkdir ${DIRECTORY} -cd ${DIRECTORY} -mrb newDev -q ${DUNELAR_QUALIFIER} -source ${WORKDIR}/${DIRECTORY}/localProducts*/setup -mkdir work -cd srcs -mrb g -t ${PROTODUNEANA_VERSION} protoduneana - -cd ${MRB_BUILDDIR} -mrbsetenv -mrb i -j16 -~~~ -{: .language-bash} - -and `setup2024Tutorial.sh` should have these contents: - -~~~ -DIRECTORY=2024tutorial -USERNAME=`whoami` - -source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh -export WORKDIR=/exp/dune/app/users/${USERNAME} -if [ ! -d "$WORKDIR" ]; then - export WORKDIR=`echo ~` -fi - -cd $WORKDIR/$DIRECTORY -source localProducts*/setup -cd work -setup dunesw $DUNELAR_VERSION -q $DUNELAR_QUALIFIER -mrbslp -~~~ -{: .language-bash} - -Execute this command to make the first script executable. - -~~~ - chmod +x newDev2024Tutorial.sh -~~~ -{: .language-bash} - -It is not necessary to chmod the setup script. Problems writing -to your home directory? Check to see if your Kerberos ticket -has been forwarded. - -~~~ - klist -~~~ -{: .language-bash} - -## Session 2 - -Start up session #2 by logging in to one of the build nodes, -`dunebuild02.fnal.gov` or `dunebuild03.fnal.gov`. They have at least 16 cores -apiece and the dunegpvm's have only four, so builds run much faster -on them. 
If all tutorial users log on to the same one and try building all at once, the build nodes may become very slow or run out of memory. The `lxplus` nodes are generally big enough to build sufficiently quickly. The Fermilab build nodes should not be used to run programs (people need them to build code!)

Note -- interactive computers at Fermilab will print out how much RAM and swap, and how many CPU threads, the node has when you log in. In general, builds that launch more processes than a machine has threads will not run any faster, but they will use more memory. So the command "mrb i -j16" above is intended to be run on a build node with at least 16 threads and enough memory to support 16 simultaneous invocations of the C++ compiler, which may take up to 2 GB per invocation.

> ## Note you need a modified container on the build machines and at CERN as they don't mount /pnfs
> This is done to prevent people from running interactive jobs on the dedicated build machines.
{: .callout}

### FNAL build machines
~~~
# remove /pnfs/ for build machines
/cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin/apptainer shell --shell=/bin/bash \
-B /cvmfs,/exp,/nashome,/opt,/run/user,/etc/hostname,/etc/hosts,/etc/krb5.conf --ipc --pid \
/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-dev-sl7:latest
~~~
{: .language-bash}

### CERN
~~~
/cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin/apptainer shell --shell=/bin/bash \
-B /cvmfs,/afs,/opt,/run/user,/etc/hostname,/etc/krb5.conf --ipc --pid \
/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-dev-sl7:latest
~~~
{: .language-bash}

### Download source code and build it

On the build node, execute the `newDev` script:

~~~
 ./newDev2024Tutorial.sh
~~~
{: .language-bash}

Note that this script will *delete* the directory planned to store the source code and built code, and make a new directory, in order to start clean.
Be careful, then, not to execute this script after you have worked on the code, as it will wipe your work out and start fresh.

This build script will take a few minutes to check code out and compile it.

The `mrb g` command does a `git clone` of the specified repository with an optional tag and destination name. More information is available [here][dunetpc-wiki] and [here][mrb-reference-guide].

Some comments on the build command:

~~~
 mrb i -j16
~~~
{: .language-bash}

The `-j16` says how many concurrent processes to run. Set the number to no more than the number of cores on the computer you're running it on. A dunegpvm machine has four cores, and the two build nodes each have 16. Running more concurrent processes on a computer with a limited number of cores won't make the build finish any faster, but you may run out of memory. The dunegpvms do not have enough memory to run 16 instances of the C++ compiler at a time, and you may see the word `killed` in your error messages if you ask to run many more concurrent compile processes than the interactive computer can handle.

You can find the number of cores a machine has with

~~~
 cat /proc/cpuinfo
~~~
{: .language-bash}

The `mrb` system builds code in a directory distinct from the source code. Source code is in `$MRB_SOURCE` and built code is in `$MRB_BUILDDIR`. If the build succeeds (no error messages; compiler warnings are treated as errors and will stop the build, forcing you to fix the problem), then the built artifacts are put in `$MRB_TOP/localProducts*`. mrbslp directs ups to search in `$MRB_TOP/localProducts*` first for software and necessary components like `fcl` files. It is good to separate the build directory from the install directory, as a failed build will not prevent you from running the program from the last successful build. But you have to look at the error messages from the build step before running a program.
If you edit source code, make a mistake, and the build fails, the program may still run using the last version that compiled successfully. If you are wondering why your code changes are having no effect, this may be why. You can look in `$MRB_TOP/localProducts*` to see if new code has been added (look for the "lib" directory under the architecture-specific directory of your product).

Because you ran the `newDev2024Tutorial.sh` script instead of sourcing it, the environment it set up is not retained in the login session you ran it from. You will need to set up your environment again. You will need to do this when you log in anyway, so it is good to have that setup script. In session #2, type this:

~~~
 source setup2024Tutorial.sh
 cd $MRB_BUILDDIR
 mrbsetenv
~~~
{: .language-bash}

The shell command "source" instructs the command interpreter (bash) to read commands from the file `setup2024Tutorial.sh` as if they were typed at the terminal. This way, environment variables set up by the script stay set up.
Do the following in session #1, the source editing session:

~~~
 source setup2024Tutorial.sh
 cd $MRB_SOURCE
 mrbslp
~~~
{: .language-bash}

## Run your program

[YouTube Lecture Part 2](https://youtu.be/8-M2ZV-zNXs): Start up the session for running programs -- log in to a `dunegpvm` interactive computer for session #3:

~~~
 source setup2024Tutorial.sh
 mrbslp
 setup_fnal_security
~~~
{: .language-bash}

We need to locate an input file. Here are some tips for finding input data:

[https://wiki.dunescience.org/wiki/Look_at_ProtoDUNE_SP_data][dune-wiki-protodune-sp]

Data and MC files are typically on tape, but can be cached on disk so you don't have to wait a potentially long time for the file to be staged in.
Check to see if a sample file is in dCache or only on tape: - -~~~ -cache_state.py PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root -~~~ -{: .language-bash} - -Get the `xrootd` URL: - -~~~ -samweb get-file-access-url --schema=root PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root -~~~ -{: .language-bash} - -which should print the following URL: - -~~~ -root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2021/mc/out1/PDSPProd4a/18/80/06/50/PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root -~~~ -{: .language-bash} - -Now run the program with the input file accessed by that URL: - -~~~ -lar -c analyzer_job.fcl root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2021/mc/out1/PDSPProd4a/18/80/06/50/PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root -~~~ -{: .language-bash} - -CERN Users without access to Fermilab's `dCache`: -- example input files for this tutorial have been copied to `/afs/cern.ch/work/t/tjunk/public/2024tutorialfiles/`. - -After running the program, you should have an output file `tutorial_hist.root`. Note -- please do not -store large rootfiles in `/exp/dune/app`! The disk is rather small, and we'd like to -save it for applications, not data. But this file ought to be quite small. -Open it in root - -~~~ - root tutorial_hist.root -~~~ -{: .language-bash} - -and look at the histograms and trees with a `TBrowser`. It is empty! - -#### Adjust the program's job configuration - -In Session #1, the code editing session, - -~~~ - cd ${MRB_SOURCE}/protoduneana/protoduneana/TutorialExamples/ -~~~ -{: .language-bash} - -See that `analyzer_job.fcl` includes `clustercounter.fcl`. 
The `module_type` -line in that `fcl` file defines the name of the module to run, and -`ClusterCounter_module.cc` just prints out a message in its analyze() method -just prints out a line to stdout for each event, without making any -histograms or trees. - -Aside on module labels and types: A module label is used to identify -which modules to run in which order in a trigger path in an art job, and also -to label the output data products. The "module type" is the name of the source -file: `moduletype_module.cc` is the filename of the source code for a module -with class name moduletype. The build system preserves this and makes a shared object (`.so`) -library that art loads when it sees a particular module_type in the configuration document. -The reason there are two names here is so you -can run a module multiple times in a job, usually with different inputs. Underscores -are not allowed in module types or module labels because they are used in -contexts that separate fields with underscores. - -Let's do something more interesting than ClusterCounter_module's print -statement. - -Let's first experiment with the configuration to see if we can get -some output. In Session #3 (the running session), - -~~~ - fhicl-dump analyzer_job.fcl > tmp.txt -~~~ -{: .language-bash} - -and open tmp.txt in a text editor. You will see what blocks in there -contain the fcl parameters you need to adjust. -Make a new fcl file in the work directory -called `myana.fcl` with these contents: - -~~~ -#include "analyzer_job.fcl" - -physics.analyzers.clusterana.module_type: "ClusterCounter3" -~~~ -{: .language-bash} - -Try running it: - -~~~ - lar -c myana.fcl root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2021/mc/out1/PDSPProd4a/18/80/06/50/PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root -~~~ -{: .language-bash} - - -but you will get error messages about "product not found". 
-Inspection of `ClusterCounter3_module.cc` in Session #1 shows that it is -looking for input clusters. Let's see if we have any in the input file, -but with a different module label for the input data. - -Look at the contents of the input file: - -~~~ - product_sizes_dumper root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2021/mc/out1/PDSPProd4a/18/80/06/50/PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root | grep -i cluster -~~~ -{: .language-bash} - -There are clusters with module label "pandora" but not -`lineclusterdc` which you can find in the tmp.txt file above. Now edit `myana.fcl` to say - -~~~ -#include "analyzer_job.fcl" - -physics.analyzers.clusterana.module_type: "ClusterCounter3" -physics.analyzers.clusterana.ClusterModuleLabel: "pandora" -~~~ -{: .language-bash} - -and run it again: - -~~~ - lar -c myana.fcl root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2021/mc/out1/PDSPProd4a/18/80/06/50/PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root -~~~ -{: .language-bash} - - -Lots of information on job configuration via FHiCL is available at this [link][redmine-327] - -#### Editing the example module and building it - -[YouTube Lecture Part 3](https://youtu.be/S29HEzIoGwc): Now in session #1, edit `${MRB_SOURCE}/protoduneana/protoduneana/TutorialExamples/ClusterCounter3_module.cc` - -Add - -~~~ -#include "TH1F.h" -~~~ -{: .source} - -to the section with includes. - -Add a private data member - -~~~ -TH1F *fTutorialHisto; -~~~ -{: .source} - -to the class. Create the histogram in the `beginJob()` method: - -~~~ -fTutorialHisto = tfs->make("TutorialHisto","NClus",100,0,500); -~~~ - -Fill the histo in the `analyze()` method, after the loop over clusters: - -~~~ -fTutorialHisto->Fill(fNClusters); -~~~ -{: .source} - -Go to session #2 and build it. 
The current working directory should be the build directory: - -~~~ -make install -j16 -~~~ -{: .language-bash} - - -Note -- this is the quicker way to re-build a product. The `-j16` says to use 16 parallel processes, -which matches the number of cores on a build node. The command - -~~~ -mrb i -j16 -~~~ -{: .language-bash} - -first does a cmake step -- it looks through all the `CMakeLists.txt` files and processes them, -making makefiles. If you didn't edit a `CMakeLists.txt` file or add new modules or fcl files -or other code, a simple make can save you some time in running the single-threaded `cmake` step. - -Rerun your program in session #3 (the run session) - -~~~ - lar -c myana.fcl root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2021/mc/out1/PDSPProd4a/18/80/06/50/PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root -~~~ -{: .language-bash} - -Open the output file in a TBrowser: - -~~~ - root tutorial_hist.root -~~~ -{: .language-bash} - -and browse it to see your new histogram. You can also run on some data. - -~~~ - lar -c myana.fcl -T dataoutputfile.root root://fndca1.fnal.gov/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2020/detector/physics/PDSPProd4/00/00/53/87/np04_raw_run005387_0041_dl7_reco1_13832298_0_20201109T215042Z.root -~~~ -{: .language-bash} - -The `-T dataoutputfile.root` changes the output filename for the `TTrees` and -histograms to `dataoutputfile.root` so it doesn't clobber the one you made -for the MC output. - -This iteration of course is rather slow -- rebuilding and running on files in `dCache`. Far better, -if you are just changing histogram binning, for example, is to use the output TTree. -`TTree::MakeClass` is a very useful way to make a script that reads in the `TBranches` of a `TTree` on -a file. The workflow in this tutorial is also useful in case you decide to add more content -to the example `TTree`. 
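As noted in the aside earlier, one module type can run under several labels in the same job, each label carrying its own configuration. A hedged fcl sketch of that idea -- the two labels, the `ana` path name, and the second cluster label are invented for illustration, so check the `fhicl-dump` output for the real names before using it:

```shell
# Write a fcl overlay (hypothetical labels) that runs ClusterCounter3 twice.
cat > twocounters.fcl <<'EOF'
#include "analyzer_job.fcl"

physics.analyzers.countpandora: {
   module_type:        "ClusterCounter3"
   ClusterModuleLabel: "pandora"
}
physics.analyzers.countpandoradc: {
   module_type:        "ClusterCounter3"
   ClusterModuleLabel: "pandoradc"
}
physics.ana: [ countpandora, countpandoradc ]
EOF
grep -c '"ClusterCounter3"' twocounters.fcl   # one module type, two instances
```

Each label gets its own parameter block, which is the point of separating labels from types: the same compiled module can analyze two different input products in one job.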

#### Run your program in the debugger

##### gdb and ddd

As of January 2025, the Fermilab license for forge_tools ddt and map has expired and will not be renewed. To debug programs, we now have access to command-line gdb and ddd. Instructions for how to use both of these are available on the web. The version of gdb that comes with SL7 is quite old, but gdb is set up along with dunesw, so you get a version that can debug programs compiled with modern versions of gcc and clang. The GUI debugger ddd is installed both in the AL9 suite on the dunegpvms and in the default SL7 container. ddd uses gdb under the hood, but it provides convenience features for displaying data and setting breakpoints in the source window. There is an issue with assigning a pseudo-terminal in an SL7 container session that is fixed with a preloaded shared library. The command

~~~
 source /etc/profile.d/ddd.sh
~~~
{: .language-bash}
defines an alias for ddd that sets LD_PRELOAD before running the debugger GUI. Some of the advice below on using the forge_tools debugger, such as finding the appropriate version of the source and stepping through code to locate bugs, also applies to running ddd and gdb at the command line.


##### Old forge_tools ddt instructions

[YouTube Lecture Part 4](https://youtu.be/xcgVKmpKgfw): In session #3 (the running session)

~~~
 setup forge_tools

 ddt `which lar` -c myana.fcl root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/full-reconstructed/2021/mc/out1/PDSPProd4a/18/80/06/50/PDSPProd4a_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_18800650_2_20210414T012053Z.root
~~~
{: .language-bash}

Click the "Run" button in the window that pops up. The `which lar` is needed because ddt cannot find
executables in your path -- you have to specify their locations explicitly.

In session #1, look at `ClusterCounter3_module.cc` in a text editor that shows line numbers.

Find the line number that fills your new histogram. In the debugger window, select the "Breakpoints"
tab in the bottom window, and use the right-mouse button (sorry Mac users -- you may need to get an external
mouse if you are using VNC; `XQuartz` emulates a three-button mouse, I believe). Make sure the "line"
radio button is selected, and type `ClusterCounter3_module.cc` for the filename. Set the breakpoint
line at the line you want, for the histogram filling or some other place you find interesting. Click
Okay, and "Yes" to the dialog box that says ddt doesn't know about the source code yet but will try to
find it when it is loaded.

Click the right green arrow to start the program. Watch the program in the Input/Output section.
When the breakpoint is hit, you can browse the stack, inspect values (sometimes -- it works better when
compiled with debug), set more breakpoints, etc.

You will need Session #1 to search for code that ddt cannot find. Shared object libraries contain
information about the location of the source code when it was compiled. So debugging something you
just compiled usually results in a shared object that knows the location of the source, but installed
code in CVMFS points to locations on the Jenkins build nodes.

#### Looking for source code:

Your environment has lots of variables pointing at installed code. Look for variables like

~~~
 PROTODUNEANA_DIR
~~~

which points to a directory in `CVMFS`.

~~~
 ls $PROTODUNEANA_DIR/source

or $LARDATAOBJ_DIR/include
~~~

are good examples of places to look for code.

#### Checking out and committing code to the git repository

For protoduneana and dunesw, this [wiki page][dunetpc-wiki-tutorial] is quite good. LArSoft uses GitHub with a pull-request model.
See

[https://cdcvs.fnal.gov/redmine/projects/larsoft/wiki/Developing_With_LArSoft][redmine-dev-larsoft]

[https://cdcvs.fnal.gov/redmine/projects/larsoft/wiki/Working_with_GitHub][redmine-working-github]

### Some handy tools for working with search paths

Tom has written some scripts and made aliases for convenience -- finding files in search paths like FHICL_FILE_PATH or FW_SEARCH_PATH, and searching within those files for content. Have a look on the dunegpvms at /exp/dune/data/users/trj/texttools. There is a list of aliases in aliases.txt that can be run in your login script (such as .profile). Put the perl scripts and tkdiff and newtkdiff somewhere in your PATH. A common place to put your favorite convenience scripts is ${HOME}/bin, but make sure to add that to your PATH. The scripts tkdiff and newtkdiff are open-source graphical diff tools that run using TCL/TK.

## Common errors and recovery

#### Version mismatch between source code and installed products

When you perform an mrbsetenv or a mrbslp, sometimes you get a version mismatch. The most common reason for this is that you have set up an older version of the dependent products. `Dunesw` depends on `protoduneana`, which depends on `dunecore`, which depends on `larsoft`, which depends on *art*, ROOT, GEANT4, and many other products. This [picture][dunesw-dependency-tree] shows the software dependency tree for dunesw v09_72_01_d00. If the source code is newer than the installed products, the versions may mismatch. You can check out an older version of the source code (see the example above) with

~~~
 mrb g -t <tag> <repository>
~~~
{: .language-bash}

Alternatively, if you have already checked out some code, you can switch to a different tag using your local clone of the git repository.

~~~
 cd $MRB_SOURCE/<repository>
 git checkout <tag>
~~~
{: .language-bash}

Try `mrbsetenv` again after checking out a consistent version.
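If you are not sure which tags exist in your clone, `git tag` lists them before you check one out. A toy sketch -- the repository and the tag name here are fabricated for illustration; in a real session you would run the last two commands inside `$MRB_SOURCE/<repository>`:

```shell
# Fabricated demo repository standing in for a real $MRB_SOURCE checkout.
mkdir -p /tmp/demo_repo
cd /tmp/demo_repo
git init -q .
git -c user.email=demo@example.com -c user.name=demo commit -q --allow-empty -m "init"
git tag v09_72_01d00     # a release tag, invented for this demo
git tag                  # list the tags; then: git checkout v09_72_01d00
```

Picking a tag that matches the installed products you set up is what keeps `mrbsetenv` consistent.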

#### Telling what version is the right one

The versions of dependent products for a product you're building from source are listed in the file `$MRB_SOURCE/<product>/ups/product_deps`.

Sometimes you may want to know what the version number is of a product way down on the dependency tree so you can check out its source and edit it. Set up the product in a separate login session:

~~~
 source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
 setup dunesw $DUNELAR_VERSION -q $DUNELAR_QUALIFIER
 ups active
~~~
{: .language-bash}

It usually is a good idea to pipe the output through grep to find a particular product version. You can get dependency information with

~~~
 ups depend dunesw $DUNELAR_VERSION -q $DUNELAR_QUALIFIER
~~~
{: .language-bash}

Note: not all dependencies of dependent products are listed by this command. If a product is already listed, it sometimes is not listed a second time, even if two products in the tree depend on it. Some products are listed multiple times.

There is a script in duneutil called `dependency_tree.sh` which makes graphical displays of dependency trees.

#### Inconsistent build directory

The directory `$MRB_BUILDDIR` contains copies of built code before it gets installed to localProducts. If you change versions of the source or delete things, sometimes the build directory will have clutter in it that has to be removed.

~~~
 mrb z
~~~
{: .language-bash}

will delete the contents of `$MRB_BUILDDIR` and you will have to type `mrbsetenv` again.

~~~
 mrb zd
~~~
{: .language-bash}

will also delete the contents of localProducts. This can be useful if you are removing code and want to make sure the installed version is gone as well.

### Inconsistent environment

When you use UPS's setup command, a lot of variables get defined. For each product, a variable called `<PRODUCT>_DIR` is defined, which points to the location of the version and flavor of the product.
UPS has a command "unsetup" which often succeeds in undoing what setup does, but it is not perfect. It is possible to get a polluted environment in which inconsistent versions of packages are set up, and it is too hard to repair it one product at a time. Logging out, logging back in again, and setting up the session is often the best way to start fresh.

### The setup command is the wrong one

If you have not sourced the DUNE software setup script

~~~
 source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
~~~
{: .language-bash}

you will find that the setup command used instead is one provided by the operating system; it requires root privilege to execute and will ask you for the root password. If you get into this situation, do not type a password: press ctrl-c, source the setup_dune.sh script, and try again.

#### Compiler and linker warnings and errors

Common messages from the `g++` compiler are undeclared variables, uninitialized variables, mismatched parentheses or brackets, missing semicolons, checking unsigned variables to see if they are positive (yes, that's a warning!), and other things. mrb is set up to tell `g++` and clang to treat warnings as errors, so they will stop the build and you will have to fix them. Messages about undeclared variables, or about methods that aren't members of a class, often result from having forgotten the appropriate include file.

The linker has fewer ways to fail than the compiler. Usually the error message is "Undefined symbol". The compiler does not emit this message, so you always know this failure is in the link step. If you have an undefined symbol, one of three things may have gone wrong. 1) You may have mistyped it (usually this gets caught by the compiler because names are defined in header files). More likely, 2) you introduced a new dependency without updating the `CMakeLists.txt` file. Look in the `CMakeLists.txt` file that steers the building of the source code that has the problem.
Look at other `CMakeLists.txt` files in other directories for examples of how to refer to libraries. `MODULE_LIBRARIES` are linked with modules in the `ART_MAKE` blocks, and `LIB_LIBRARIES` are linked when building non-module libraries (free-floating source code, for algorithms). 3) You are writing new code and just haven't gotten around to finishing writing something you called.

#### Out of disk quota

Do not store data files on the app disk! Sometimes the app disk fills up nonetheless, and there is a quota of 100 GB per user on it. If you need more than that for several builds, you have some options. 1) Use `/exp/dune/data/users/<username>`. You have a 400 GB quota on this volume. It is slower than the app disk and can get even slower if many users are accessing it simultaneously or transferring large amounts of data to or from it. 2) Clean up some space on app. You may want to tar up an old release and store the tarball on the data volume or in `dCache` for later use.

#### Runtime errors

Segmentation faults: These do not throw errors that *art* can catch. They terminate the program immediately. Use the debugger to find out where they happened and why.

Exceptions that are caught: The `ddt` debugger has in its menu a set of standard breakpoints. You can instruct the debugger to stop any time an exception is thrown. A common exception is a vector accessed past its size using `at()`, but often these are hard to track down because they could be anywhere. Start your program with the debugger, but it is often a good idea to turn off the break-on-exception feature until after the geometry has been read in. Some of the XML parsing code throws a lot of exceptions that are later caught as part of its normal mode of operation, and if you hit a breakpoint on each of these and push the "go" button with your mouse each time, you could be there all day. Wait until the initialization is over, press "pause", and then turn on the breakpoints by exception.

If you miss, start the debugging session over again. Starting the session over is also a useful technique when you want to know what happened *before* a known error condition occurs. You may find yourself asking "how did it get in *that* condition?" Set a breakpoint that's earlier in the execution and restart the session. Keep backing up -- it's kind of like running the program in reverse, but it's very slow. Sometimes it's the only way.

Print statements are also quite useful for rare error conditions. If a piece of code fails infrequently, depending on the input data, a breakpoint is often not very useful because most of the time the code is fine and you need to catch the program in the act of misbehaving. A low-tech print statement, sometimes with a uniquely-identifying string so you can grep the output, lets you add logic to print only when things have gone bad; even if you print on each iteration, you can just look at the last bit of printout before a crash.

#### No authentication/permission

You will almost always need to have a valid Kerberos ticket in your session. Accessing your home directory on the Fermilab machines requires it. Find your tickets with the command

~~~
 klist
~~~
{: .language-bash}

By default, they last for 25 hours or so (a bit more than a day). You can refresh them for another 25 hours (up to
one week's worth of refreshes are allowed) with

~~~
 kinit -R
~~~
{: .language-bash}

If you have a valid ticket on one machine and want to refresh tickets on another, you can

~~~
k5push
~~~
{: .language-bash}

The safest way to get a new ticket on a machine is to kinit on your local computer (like your laptop) and log in again,
making sure to forward all tickets. In a pinch, you can run kinit on a dunegpvm and enter your Kerberos password, but this is discouraged, as bad actors can (and have!) installed keyloggers on shared systems and stolen passwords.
DO NOT KEEP PRIVATE, PERSONAL INFORMATION ON FERMILAB COMPUTERS! Things like bank account numbers, passwords, and social security numbers are definitely not to be stored on public, shared computers. Running `kinit -R` on a shared machine is fine.

You will need a grid proxy to submit jobs and access data in `dCache` via `xrootd` or `ifdh`.

~~~
 setup_fnal_security
~~~
{: .language-bash}

will use your valid Kerberos ticket to generate the necessary certificates and proxies.

#### Link to art/LArSoft tutorial May 2021


[https://wiki.dunescience.org/wiki/Presentation_of_LArSoft_May_2021][dune-larsoft-may21]


[dunetpc-wiki]: https://cdcvs.fnal.gov/redmine/projects/dunetpc/wiki/_Tutorial_
[mrb-reference-guide]: https://cdcvs.fnal.gov/redmine/projects/mrb/wiki/MrbRefereceGuide
[dune-wiki-protodune-sp]: https://wiki.dunescience.org/wiki/Look_at_ProtoDUNE_SP_data
[redmine-327]: https://cdcvs.fnal.gov/redmine/documents/327
[dunetpc-wiki-tutorial]: https://cdcvs.fnal.gov/redmine/projects/dunetpc/wiki/_Tutorial_
[redmine-dev-larsoft]: https://cdcvs.fnal.gov/redmine/projects/larsoft/wiki/Developing_With_LArSoft
[redmine-working-github]: https://cdcvs.fnal.gov/redmine/projects/larsoft/wiki/Working_with_GitHub
[dune-larsoft-may21]: https://wiki.dunescience.org/wiki/Presentation_of_LArSoft_May_2021
[dunesw-dependency-tree]: https://wiki.dunescience.org/w/img_auth.php/6/6f/Dunesw_v09_72_01_e20_prof_graph.pdf

{%include links.md%}


diff --git a/_extras/ComputerSetup.md b/_extras/ComputerSetup.md
deleted file mode 100644
index 7875178..0000000
--- a/_extras/ComputerSetup.md
+++ /dev/null
@@ -1,141 +0,0 @@
---
title: Basic Computer Setup
teaching: 30
exercises: 30
questions:
- How do I get my laptop or desktop set up to do scientific computing?
objectives:
- Learn how to use command line tools
- Install software you need to do scientific programming
keypoints:
- This will be useful for a lot of projects
- It is also something almost all
people who get paid to program are expected to know well ---- - -## 0. Back up your machine - -We are going to be messing with your operating system at some level so it is extremely wise to do a complete backup of your machine to an external drive right now. - -Also turn off automatic updates. Operating system updates can mess with your setup. Generally, back up before doing updates so you can revert if necessary. - -## 1. Open a unix terminal window - -First figure out how to open a terminal on your system. The Carpentries Shell Training has a [section that explains this][New Shell] - -This should be easy on Linux and MacOS but a bit more complicated in Windows. - - -On Linux use xterm, on MacOS go to Utilities and start a Terminal. - -On Windows it's a bit more complicated as the underlying operating system is not a unix variant. - -> ## We suggest using the [Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/about) (WSL). That page has download instructions. - - - -## 2. Learn how to use the Unix Shell - - - -There is a nice tutorial from the Carpentries at: [Unix Shell Basics][Unix Shell Basics]. - -It tells you how to start a terminal session in Windows, Mac OSX and Unix systems. - -Please do that [unix shell tutorial][Unix Shell Basics] to learn about the basic command line. - - -## 3. Install an x-windows emulator - -#### MacOS - -MacOS has a `Terminal` app in `Utilities` - -but you need to install [XQuartz][XQuartz] - -test it out by typing - -~~~ -xterm & -~~~ -{: .language-bash} - -You should get a terminal window. You can close it. - - -#### Unix - -Should already have a terminal - -test by doing - -~~~ -xterm & -~~~ -{: .language-bash} - -#### Windows - -See the information about [Windows]({{ site.baseurl }}/Windows.html) terminal connections. - -- To do Linux locally, many people like to run an instance of [Windows SubSystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/about). But it is non-trivial to set up. 

- Alternatively, if you have access to a remote linux system through your institution, you can use the Windows terminal/X-windows connections described in [Windows]({{ site.baseurl }}/Windows.html) to connect to that system and work there.

> ## Note
> You should now be ready for the [setup page]({{ site.baseurl }}/setup.html)
{: .callout}

## Extra - Get a compiler/code editor

Although you will mainly be using python to code to begin with, most HEP code is actually C++, and it is good to have access to a C++ compiler. A bonus is that you normally get a good editor as well.

#### OSX
Compiler: On OSX, you should install [Xcode][Xcode] from the [App store](https://www.apple.com/app-store/). It will take a lot of disk space. When you try to use it, it will ask you to install command line tools. Do so.

Editor: Even though Xcode is what you use to compile and has an editor, many people prefer the [Visual Studio Code](https://code.visualstudio.com) application from Microsoft for editing/testing code.

You can also use vim or emacs if you are old school.

#### Unix
- Compiler: your compiler will be gcc

- Editor: Heck - just use vim. Or emacs, or [VSCode][Visual Studio Code].
- -#### Windows -Likely you should load up the full [Visual Studio][Visual Studio] as it has a nice C++ compiler - - -### Useful Links - -[HSF Training Center][HSF Training Center] - -[Unix Shell Basics][Unix Shell Basics] - -[Git][Git] - -[Visual Studio Code][Visual Studio Code] - -[Visual Studio][Visual Studio] - -[GNU gcc][GNU gcc] - -[Xcode][Xcode] - -[XQuartz][XQuartz] - -[Windows Subsystem for Linux][Windows Subsystem for Linux] - -{%include links.md%} - -[New Shell]: https://swcarpentry.github.io/shell-novice/#open-a-new-shell -[HSF Training Center]: https://hsf-training.org/training-center/ -[Windows Subsystem for Linux]: https://learn.microsoft.com/en-us/windows/wsl/about -[Unix Shell Basics]: https://swcarpentry.github.io/shell-novice/ -[Git]: https://swcarpentry.github.io/git-novice -[Visual Studio Code]: https://code.visualstudio.com -[Visual Studio]:https://visualstudio.microsoft.com/vs/ -[GNU gcc]: https://gcc.gnu.org -[App Store]: https://www.apple.com/app-store/ -[Xcode]: https://developer.apple.com/xcode/ -[XQuartz]: https://www.xquartz.org diff --git a/_extras/InstallConda.md b/_extras/InstallConda.md deleted file mode 100644 index 5ca6f8a..0000000 --- a/_extras/InstallConda.md +++ /dev/null @@ -1,117 +0,0 @@ ---- -title: Use Conda to install root -teaching: 30 -exercises: 0 -questions: -- How do I get root and jupyter-lab for simple analysis -objectives: -- get miniconda set up on your machine -- install root and jupyter-lab -keypoints: -- useful mainly for simple tuple analysis ---- - -## Installing conda and root - -This is derived from the excellent [https://iscinumpy.gitlab.io/post/root-conda/](https://iscinumpy.gitlab.io/post/root-conda/) by Henry Schreiner - -Currently this has been tested on OSX and Linux distributions SL7 and AL9 - -## Download miniconda - -1. Do you have wget on your system? If not, get it - -2. 
Download the miniconda installer - -~~~ -# Download the Linux installer -wget -nv http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh -~~~ -{: .language-bash} - -~~~ -# Or download the macOS installer -wget -nv https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O miniconda.sh -~~~ -{: .language-bash} - -## Install conda - -~~~ -# Install Conda (same for macOS and Linux) -bash miniconda.sh -b -p $HOME/miniconda -~~~ -{: .language-bash} - -4. add this to your .bashrc or .profile to start it up when you log in - -~~~ -source $HOME/miniconda/etc/profile.d/conda.sh # Add to bashrc - similar files available for fish and csh -~~~ -{: .language-bash} - -## Making an environment with root in it - -~~~ -conda create -n my_root_env root -c conda-forge -~~~ -{: .language-bash} - -This will take a while. In the end you have an environment which contains root and a lot of other useful things - -You can repeat step 1 to make different conda enviroments. - -## Try out your new environment - -To activate a particular environment you do: - -~~~ -conda activate my_root_env -~~~ -{: .language-bash} - -First time you use your environment you can do - -~~~ -conda config --add --env channels conda-forge # only -~~~ -{: .language-bash} - -the config command tells conda to use conda-forge as a default. You should now have a conda environment with root in it. - -## Testing - -~~~ -(my_root_env) wngr405-mac3:utilities schellma$ root - ------------------------------------------------------------------ - | Welcome to ROOT 6.28/00 https://root.cern | - | (c) 1995-2022, The ROOT Team; conception: R. Brun, F. 
Rademakers |
 | Built for macosx64 on Mar 21 2023, 08:18:00 |
 | From tag , 3 February 2023 |
 | With |
 | Try '.help'/'.?', '.demo', '.license', '.credits', '.quit'/'.q' |
 ------------------------------------------------------------------

root [1] .q
~~~
{: .output}

To get out of root, type

~~~
root> .q
~~~
{: .language-bash}



Try this

~~~
root -l -q $ROOTSYS/tutorials/dataframe/df013_InspectAnalysis.C
~~~
{: .language-bash}

You should see a plot that updates.


diff --git a/_extras/Windows.md b/_extras/Windows.md
deleted file mode 100644
index ff6ea48..0000000
--- a/_extras/Windows.md
+++ /dev/null
@@ -1,74 +0,0 @@
---
title: Windows Setup
permalink: Windows.html
keypoints:
- Setting up Kerberos
- Getting a terminal and XWindows emulator
---

## Instructions for running remote terminal sessions on unix machines from Windows

> ## Note: You need administrator privileges to do some of this
> On Windows 11, type "Make Me " in the search area and it should pop up a "Make Me Administrator" window that allows you to be administrator for 30 minutes. You will need to do this again after 30 minutes. If something fails, it may be because the time ran out.
{: .callout}

## Kerberos Ticket Manager

Download the MIT Kerberos Ticket Manager MSI installer from [here](http://web.mit.edu/kerberos/dist/#kfw-4.1) ("MIT Kerberos for Windows 4.1" as of the time of writing). The installation is mostly automatic - if prompted for the type of installation, choosing "Typical" will be fine.

Once this is installed, you can launch it as an app to manage your Kerberos tickets. To prepare to log in to the FNAL cluster, press "Get Ticket", and you can authenticate with `[username]@FNAL.GOV` and the Kerberos password that you should already have set up. This replaces the `kinit` command that's used to obtain a Kerberos ticket in unix systems, and the graphical interface shows you your active tickets instead of using `klist` as in unix systems.
- -> ## Note: May need a krb5.conf file -> Some sites require a krb5.conf file to customize access. -Ask about site-specific configuration requirements. For Fermilab find it here [krb5conf](https://authentication.fnal.gov/krb5conf/) -{: .callout} - -## Terminal emulators - -### MobaXterm - -[MobaXterm](http://mobaxterm.mobatek.net/) is a replacement for Putty/Xming. I found it easier to set up and install. - -Just install from the website and get it to talk to MIT Kerberos (not the default) - -One thing to remember in all of this is that your username may be different on your personal machine than on the remote system. Remember to set it correctly. - -This was added in 2024 when we had trouble installing Xming - - - -### PuTTY/Xming - -#### PuTTY - - -This is an alternate application to use in Windows as a terminal and to SSH into other systems. Download the appropriate Windows installer (MSI file) for your system from [here](https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html) (probably the 64-bit x86). - -The default options for the installer should be fine. - -Once you start trying to use putty to access remote systems, you need to do some configuration. - -#### Xming with PuTTY - -This is used for X11 graphics forwarding. Download from [here](https://sourceforge.net/projects/xming/) (see also [official notes](http://www.straightrunning.com/XmingNotes/)). Run the installer (keeping all default options is fine). - -Once installed, run XLaunch to check the settings. Follow the XLaunch configurations specified [here](http://www.geo.mtu.edu/geoschem/docs/putty_install.html). - -Once this is done, in the future when you need to use X11 graphics forwarding, simply launch Xming and let it run in the background. - -#### Configuring PuTTY for remote use - -Once all the previous components are installed, open up PuTTY and configure as follows: - -1. Under Connection/SSH/X11, check "Enable X11 Forwarding", and set the X display location to "localhost:0.0" -2. 
Under Connection/SSH/Auth/GSSAPI, check "Allow GSSAPI credential delegation". Make sure the MIT Kerberos GSSAPI64.DLL (or GSSAPI32.DLL, if you're using the 32-bit version) is in the list of "Preference order for GSSAPI libraries". If it's not, use the "User-supplied GSSAPI library path" option to navigate to where you've installed the MIT Kerberos ticket manager and select this library.
3. Under Session, fill in the host name with [username]@dunegpvm[0-15].fnal.gov.
4. Save this configuration by typing a name in the box labeled "Saved Sessions" and pressing "Save". You can load this configuration in the future to reuse these settings.

## Done!

This should allow you to SSH to the remote unix cluster and follow the rest of the tutorial.

*Created on 20220508 using notes provided by Roger Huang.*

[Go back to Setup]({{ site.baseurl }}/setup.html)
diff --git a/_extras/al9_setup.md b/_extras/al9_setup.md
deleted file mode 100644
index 0eee419..0000000
--- a/_extras/al9_setup.md
+++ /dev/null
@@ -1,58 +0,0 @@
---
title: Example AL9 setup for a new session
permalink: al9_setup
keypoints:
- getting basic applications on Alma9
- getting authentication set up
---

You can store the code below as `myal9.sh` and run it every time you log in.

> ## Note - the full LArSoft suite doesn't work yet with spack
> If you want to use the full DUNE software suite, use the [Apptainer/SL7 method]({{ site.baseurl }}/sl7_setup) until we get that working.
-{: .callout} - -~~~ - -# use spack to get applications -source /cvmfs/larsoft.opensciencegrid.org/spack-packages/setup-env.sh - -# load metacat, rucio and sam and tell it you are on dune -spack load r-m-dd-config experiment=dune -spack load kx509 -export IFDH_CP_MAXRETRIES=0\0\0\0\0 # no retries -export RUCIO_ACCOUNT=$USER - -# access some disks -export DUNEDATA=/exp/dune/data/users/$USER -export DUNEAPP=/exp/dune/app/users/$USER -export PERSISTENT=/pnfs/dune/persistent/users/$USER -export SCRATCH=/pnfs/dune/scratch/users/$USER - -# do some authentication - -voms-proxy-destroy -kx509 -export EXPERIMENT=dune -export ROLE=Analysis -voms-proxy-init -rfc -noregen -voms dune:/dune/Role=$ROLE -valid 24:00 -export X509_USER_PROXY=/tmp/x509up_u`id -u` - -htgettoken -i dune --vaultserver htvaultprod.fnal.gov - -export BEARER_TOKEN_FILE=/run/user/`id -u`/bt_u`id -u` - -~~~ -{: .language-bash} - ------------------------- - -## setup specific versions of code here - -~~~ -spack load root@6.28.12 # recent with xrootd -spack load gcc@12.2.0 -spack load fife-utils@3.7.4 -~~~ -{: .language-bash} diff --git a/_extras/putty.md b/_extras/putty.md deleted file mode 100644 index 529d298..0000000 --- a/_extras/putty.md +++ /dev/null @@ -1,40 +0,0 @@ ---- -title: Putty Setup -permalink: putty.html ---- - - -### PuTTY - -This is the application we use in Windows to SSH into other systems. Download the appropriate Windows installer (MSI file) for your system from [here](https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html) (probably the 64-bit x86). - -The default options for the installer should be fine. - -## Kerberos Ticket Manager - -Download the MIT Kerberos Ticket Manager MSI installer from [here](http://web.mit.edu/kerberos/dist/#kfw-4.1) ("MIT Kerberos for Windows 4.1" as of the time of writing). The installation is mostly automatic - if prompted for the type of installation, choosing "Typical" will be fine. 
- -Once this is installed, you can launch it as an app to manage your Kerberos tickets. To prepare to login to the FNAL cluster, press "Get Ticket", and you can authenticate with `[username]@FNAL.GOV` and the Kerberos password that you should already have setup. This replaces the `kinit` command that's used to obtain a Kerberos ticket in unix systems, and the graphical interface shows you your active tickets instead of using `klist` as in unix systems. - -## Xming - -This is used for X11 graphics forwarding. Download from [here](https://sourceforge.net/projects/xming/) (see also [official notes](http://www.straightrunning.com/XmingNotes/)). Run the installer (keeping all default options is fine). - -Once installed, run XLaunch to check the settings. Follow the XLaunch configurations specified [here](http://www.geo.mtu.edu/geoschem/docs/putty_install.html). - -Once this is done, in the future when you need to use X11 graphics forwarding, simply launch Xming and let it run in the background. - -## Configuring PuTTY - -Once all the previous components are installed, open up PuTTY and configure as follows: - -1. Under Connection/SSH/X11, check "Enable X11 Forwarding", and set the X display location to "localhost:0.0" -2. Under Connection/SSH/Auth/GSSAPI, check "Allow GSSAPI credential delegation". Make sure the MIT Kerberos GSSAPI64.DLL (or GSSAPI32.DLL, if you're using the 32-bit version) is in the list of "Preference order for GSSAPI libraries". If it's not, use the "User-supplied GSSAPI library path" option to navigate to where you've installed the MIT Kerberos ticket manager and select this library. -3. Under Session, Fill in the host name with [username]@dunegpvm[0-15].fnal.gov (or whatever the plan is for this tutorial?). -4. Save this configuration by typing a name in the box labeled "Saved Sessions" and pressing "Save". You can load this configuration in the future to reuse these settings. 
- -This should allow you to SSH to the FNAL clusters and follow the rest of the tutorial. - -*Created on 20220508 using notes provided by Roger Huang.* - -[Go to Setup]({{ site.baseurl }}/setup.html) diff --git a/_extras/sl7_setup.md b/_extras/sl7_setup.md deleted file mode 100644 index c94b18f..0000000 --- a/_extras/sl7_setup.md +++ /dev/null @@ -1,79 +0,0 @@ ---- -title: Example SL7 setup for a new session -permalink: sl7_setup -keypoints: -- getting basic applications on SL7 -- getting authentication set ip ---- - -## launch the Apptainer - -( I put this command in a file called apptainer.sh so I don't have to retype all the time.) - -### FNAL - -~~~ -/cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin/apptainer shell --shell=/bin/bash \ --B /cvmfs,/exp,/nashome,/pnfs/dune,/opt,/run/user,/etc/hostname,/etc/hosts,/etc/krb5.conf --ipc --pid \ -/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-dev-sl7:latest -~~~ -{: .language-bash} - -### CERN - -~~~ -/cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin/apptainer shell --shell=/bin/bash\ --B /cvmfs,/afs,/opt,/run/user,/etc/hostname,/etc/krb5.conf --ipc --pid \ -/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-dev-sl7:latest -~~~ -{: .language-bash} - - -## then do the following - -you can store this as - -`mysl7.sh` and run it every time you log in. 
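For comparison only: the PuTTY options above correspond roughly to the OpenSSH `~/.ssh/config` stanza used on Linux/macOS elsewhere in this tutorial (PuTTY users do not need this file; it is shown just as a reference point):

```
Host *.fnal.gov
ForwardAgent yes
ForwardX11 yes
ForwardX11Trusted yes
GSSAPIAuthentication yes
GSSAPIDelegateCredentials yes
```

"Enable X11 Forwarding" plays the role of `ForwardX11`, and "Allow GSSAPI credential delegation" plays the role of `GSSAPIDelegateCredentials`.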
- -~~~ -# use ups to find programs - this only works on SL7 - -source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh -setup metacat -setup rucio - - -# do some data access setup -export IFDH_CP_MAXRETRIES=0\0\0\0\0 # no retries -export DATA_DISPATCHER_URL=https://metacat.fnal.gov:9443/dune/dd/data -export DATA_DISPATCHER_AUTH_URL=https://metacat.fnal.gov:8143/auth/dune -export METACAT_SERVER_URL=https://metacat.fnal.gov:9443/dune_meta_prod/app -export METACAT_AUTH_SERVER_URL=https://metacat.fnal.gov:8143/auth/dune -export RUCIO_ACCOUNT=$USER - -# access some disks -export DUNEDATA=/exp/dune/data/users/$USER -export DUNEAPP=/exp/dune/app/users/$USER -export PERSISTENT=/pnfs/dune/persistent/users/$USER -export SCRATCH=/pnfs/dune/scratch/users/$USER - -# do some authentication - -voms-proxy-destroy -kx509 -export EXPERIMENT=dune -export ROLE=Analysis -voms-proxy-init -rfc -noregen -voms dune:/dune/Role=$ROLE -valid 24:00 -export X509_USER_PROXY=/tmp/x509up_u`id -u` - -htgettoken -i dune --vaultserver htvaultprod.fnal.gov -export BEARER_TOKEN_FILE=/run/user/`id -u`/bt_u`id -u` - -# This you need to update yourself to get new versions of DUNE software - -export DUNELAR_VERSION=v10_00_04d00 -export DUNELAR_QUALIFIER=e26:prof - -setup -B dunesw ${DUNELAR_VERSION} -q ${DUNELAR_QUALIFIER} -~~~ -{: .language-bash} diff --git a/setup.md b/setup.md index 8954619..af4dc2e 100644 --- a/setup.md +++ b/setup.md @@ -31,687 +31,12 @@ You must be on the DUNE Collaboration member list and have a valid FNAL or CERN You should join the [DUNE Slack instance](https://dunescience.slack.com) and look in #computing-training-basics (see Mission Setup below) for help with this tutorial. To join, email dune-slack@fnal.gov -Windows users are invited to review the [Windows Setup page]({{ site.baseurl }}/Windows.html). +Windows users are invited to review the [Windows Setup page](https://dune.github.io/computing-basics/Windows.html). 
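Because the commands below only work under SL7, you might add a guard at the top of `mysl7.sh` so that sourcing it on the Alma9 host fails fast. This is an illustrative sketch, not part of the official setup; checking `/etc/redhat-release` is an assumption about the container image:

```shell
# Guard for the top of mysl7.sh: warn if we are not inside the SL7 container
# (for example, if the script is sourced on the Alma9 host by mistake).
if grep -qi "release 7" /etc/redhat-release 2>/dev/null; then
  echo "SL7 detected - OK to continue"
else
  echo "Not SL7 - launch the Apptainer container first"
fi
```

On the Alma9 host this prints the warning; inside the SL7 container it continues normally.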
-## Step 1: DUNE membership +## Please follow the setup instructions at: +[Tutorial setup](https://dune.github.io/computing-basics/setup) -To follow most of this training, you must be on the DUNE Collaboration member list. If you are not, talk to your supervisor or representative to get on it. - ->### Note: Other experiments may find the setup and first few modules useful. -> The first few modules on access and disk spaces should work for other Fermilab experiments if you substitute `dune --> other`. -{: .callout} - -## Step 2: Getting accounts - -### With FNAL -If you have a valid FNAL computing account with DUNE, go to step 3. - -If you have a valid FNAL computing account but not on DUNE yet (say you have access to another experiment's resources), you can ask for a DUNE-specific account using the Service Now [Update my Affiliation/Experiment/Collaboration membership Request](https://fermi.servicenowservices.com/nav_to.do?uri=%2Fcom.glideapp.servicecatalog_cat_item_view.do%3Fv%3D1%26sysparm_id%3D9a35be8d1b42a550746aa82fe54bcb6f%26sysparm_link_parent%3Da5a8218af15014008638c2db58a72314%26sysparm_catalog%3De0d08b13c3330100c8b837659bba8fb4%26sysparm_catalog_view%3Dcatalog_default%26sysparm_view%3Dcatalog_default) form. - -If you do not have any FNAL accounts yet, you need to contact your supervisor and/or Institutional Board representative to obtain a Fermilab User Account. More info: [https://get-connected.fnal.gov/users/access/](https://get-connected.fnal.gov/users/access/). This can take several weeks the first time. - -### With CERN -If you have a valid CERN account and access to CERN machines, you will be able to do many of the exercises as some data is available at CERN. The LArSoft tutorial has been designed to work from CERN. We strongly advise pursuing the FNAL computing account though. - -If you have trouble getting access, please reach out to the training team several days ahead of time. Some issues take some time to resolve. Please do not put this off. 
We cannot help you on the day of the tutorial itself, as we will be busy running it.
- -**What is it?** Kerberos is a computer-network authentication protocol that works on the basis of tickets. - -**Why does FNAL use Kerberos?** Fermilab uses Kerberos to implement strong authentication, so that no passwords go over the internet (if a hacker steals a ticket, it is only valid for a day). - -**How does it work?** Kerberos uses tickets to authenticate users. Tickets are made by the kinit command, which asks for your kerberos password (info on kerberos password [here][kerberos-password]). The kinit command reads the password, encrypts it and sends it to the Key Distribution Centre (KDC) at FNAL. The Kerberos configuration file, which lists the KDCs, is stored in a file named krb5.conf. On Linux and Mac, it is located here: - -~~~ -/etc/krb5.conf -~~~ -{: .source} - -If you do not have it, create it. A FNAL template is available [here][kerberos-template] for each OS (Linux, Mac, Windows). More explanations on this config file are available [here][kerberos-config] if you're curious. - -To log in to a machine, you need to have a valid kerberos ticket. You don't need to do this every time you login, only when your ticket is expired. Kerberos tickets last for 26 hours. To create your ticket: - -~~~ -kinit -f username@FNAL.GOV -~~~ -{: .language-bash} - -The -f option means your ticket is forwardable. A forwardable ticket is one that originated on computer A, but can be forwarded to computer B and will remain valid on computer B. Careful: you need to write FNAL.GOV in uppercase. 
- -To know if you have a valid ticket, type: - -~~~ -klist -~~~ -{: .source} - -Typical output of klist on macOS looks like this: - -~~~ -Mac-124243:~ trj$ klist -Ticket cache: FILE:/tmp/krb5cc_10143_xSCwboGiuY -Default principal: trj@FNAL.GOV - -Valid starting Expires Service principal -05/18/23 12:43:23 05/19/23 11:41:42 krbtgt/FNAL.GOV@FNAL.GOV - renew until 05/25/23 12:41:42 -05/18/23 15:13:22 05/19/23 11:41:42 nfs/homesrv01.fnal.gov@FNAL.GOV -~~~ -{: .output} - -Tickets are stored in /tmp and have file permissions so that only you can read and write them. -If your ticket has not expired yet but will soon, you can refresh it for another 26 hours by typing: - -~~~ -kinit -R -~~~ -{: .language-bash} - -Refreshing a ticket can be done for up to one week after it was initially issued. - -Running into issues? Users logging in from outside the Fermilab network may be behind Network Address Translation (NAT) routers. If so, you may need an "addressless" ticket. For this, add the option -A: - -~~~ -kinit -A -f username@FNAL.GOV -~~~ -{: .language-bash} - -Some users have reported problems with the Kerberos utilities provided by Macports and Anaconda. Macintosh users should use the system-provided Kerberos utilities -- such as /usr/bin/kinit. Use the command - -~~~ -which kinit -~~~ -{: .language-bash} - -to report the path of the kinit that's first in your $PATH search list. If you have another, non-working one, a suggestion is to make an alias like this one: - -~~~ -alias mykinit="/usr/bin/kinit -A -f myusername@FNAL.GOV" -~~~ -{: .language-bash} - -Then you can simply use: - -~~~ -mykinit -~~~ -{: .language-bash} - -If you need to remove a ticket (for example, you are logged in at CERN with one Kerberos account but want to log in to a Fermilab machine with your Fermilab account), you can use the command - -~~~ -kdestroy -~~~ -{: .language-bash} - -After executing this command, you will have to use kinit again to get a new ticket. 
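Because the realm is case-sensitive, a lowercase `fnal.gov` will not work. As a quick illustration of the rule (not part of the official setup), a shell check on a principal might look like this:

```shell
# The realm (the part after '@') of a Kerberos principal is case-sensitive.
principal="username@fnal.gov"   # deliberately wrong: lowercase realm
realm="${principal#*@}"         # strip everything up to and including '@'
if [ "$realm" = "FNAL.GOV" ]; then
  echo "realm OK"
else
  echo "realm should be FNAL.GOV, got: $realm"
fi
```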
If you have tickets from a non-working kinit, be sure to use the corresponding kdestroy to remove them; klist should then return an empty list, confirming a clean state before you run the system-provided kinit.
dunegpvm07, the command is: - -~~~ -ssh username@dunegpvmXX.fnal.gov -~~~ -{: .language-bash} - -where XX is a number from 01 to 15. -If you experience long delays in loading programs or graphical output, you can try connecting with VNC. More info: [Using VNC Connections on the dunegpvms][dunegpvm-vnc]. - -## 3. Get a clean shell -To run DUNE software, it is necessary to have a 'clean login'. What is meant by clean here? If you work on other experiment(s), you may have some environment variables defined (for NOvA, MINERvA, MicroBooNE). Theses may conflict with the DUNE environment ones. - -Two ways to clean your shell once on a DUNE machine: - -**Cleanup option 1:** Manual check and cleanup of your custom environment variables -To list all of your environment variables: - -~~~ -env -~~~ -{: .language-bash} - -The list may be long. If you want to know what is already setup (potentially conflicting later with DUNE), you can grep the environment variables for a specific experiment (here the -i option is 'case insensitive'). Here is an example to list all NOvA-specific variables: - -~~~ -env | grep -i nova -~~~ -{: .language-bash} - -Another useful command that will detect UPS products that have been set up is - -~~~ -ups active -~~~ -{: .language-bash} - -A "clean" response to the above command is: - -~~~ -bash: ups: command not found... -~~~ -{: .output} - -and if your environment has UPS products set up, the above command will list the ones you have. - -Once you identify environment variables that might conflict with your DUNE work, you can tweak your login scripts, like .bashrc, .profile, .shrc, .login etc., to temporarily comment out those (the "export" commands that are setting custom environment variables, or UPS's setup command). Note: files with names that begin with `.` are "hidden" in that they do not show up with a simple `ls` command. To see them, type `ls -a` which lists **a**ll files. 
- -**Cleanup option 2:** Back up your login scripts and go minimal at login (recommended) - -A simpler solution would be to rename your login scripts (for instance .bashrc as .bashrc_save and/or .profile as .profile_bkp) so that your setup at login will be minimal and you will get the cleanest shell. For this to take into effect, you will need to exit and reconnect through ssh. - -> ## Note: Avoid putting experiment specific setups in `.bashrc` or `.profile` -> You are going to be doing some setup for dune, which you will also need to do when you submit batch jobs. It is much easier to make a script `setup_dune.sh` which you execute every time you log in. Then you can duplicate the contents of that script in the script you use to run batch jobs on remote machines. It also makes it much easier for people to help you debug your setup. -{: .callout} - -## 4.1 Setting up DUNE software - Alma9 version - - - -We are moving to the Alma9 version of unix. Not all DUNE code has been ported yet but if you are doing basic root analysis work, try it out. - -Alma9 is the operating system you get when you log onto fnal unix or lxplus at CERN. - -Here is how you set up basic DUNE software on Alma 9. We are using the super-computer packaging system [Spack][Spack documentation] to give versioned access to code. - -1. login into a unix machine at FNAL or CERN - -2. Log into a gpvm or lxplus - -~~~ -# find a spack environment and set it up -source /cvmfs/larsoft.opensciencegrid.org/spack-packages/setup-env.sh -# get some basic things - -# use the command spack find to find packages you might want -# If you just type spack load ... you may be presented with a choice and will need to choose. 
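To see how such a grep-based check behaves, here is a self-contained demonstration. The variable name `NOVA_EXAMPLE_DIR` is made up purely for illustration; in practice you would grep for whatever experiment variables your own login scripts set:

```shell
# Simulate a leftover variable from another experiment (hypothetical name):
export NOVA_EXAMPLE_DIR=/fake/nova/path
env | grep -i nova_example              # the conflicting variable shows up
unset NOVA_EXAMPLE_DIR                  # remove it for this session only
env | grep -i nova_example || echo "clean shell"
```

Note that `unset` only clears the variable for the current session; to make the change permanent you still need to edit the login script that sets it.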
-# -spack load root@6.28.12 -spack load cmake@3.27.7 -spack load gcc@12.2.0 -spack load fife-utils@3.7.4 -# load metacat, rucio and sam and tell it you are on dune -spack load r-m-dd-config experiment=dune -spack load kx509 -export SAM_EXPERIMENT=dune -~~~ -{: .language-bash} - -> ## Optional -> > ## See if ROOT works -> > Try testing ROOT to make certain things are working -> > -> > ~~~ -> > root -l -q $ROOTSYS/share/doc/root/tutorials/dataframe/df013_InspectAnalysis.C -> > # (the spack version of root seems to bury the tutorials.) -> > ~~~ -> > {: .language-bash} -> > You should see a plot that updates and then terminates. -> {: .solution} -{: .callout} - - -### Caveats - -We don't have a full ability to rebuild DUNE Software packages such as LArSoft using Spack yet. We will be adding more functionality soon. Unless you are doing simple ROOT based analysis you will need to use the [SL7 Container](#SL7_setup) method for now. - - -## 4.2 Setting up DUNE software - Scientific Linux 7 version - -See [SL7_to_Alma9][SL7_to_Alma9] for more information - -To set up your environment in SL7, the commands are: - -Log into a DUNE machine running Alma9 - -Launch an SL7 container - -~~~ -/cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin/apptainer shell --shell=/bin/bash \ --B /cvmfs,/exp,/nashome,/pnfs/dune,/opt,/run/user,/etc/hostname,/etc/hosts,/etc/krb5.conf --ipc --pid \ -/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-dev-sl7:latest -~~~ -{: .language-bash} - -You will then be in a container which looks like: - -~~~ -Apptainer> -~~~ -{: .output} - -You can then set up DUNE's code - -~~~ -export UPS_OVERRIDE="-H Linux64bit+3.10-2.17" # makes certain you get the right UPS -source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh -~~~ -{: .language-bash} - -You should see in your terminal the following output: -~~~ -Setting up larsoft UPS area... /cvmfs/larsoft.opensciencegrid.org/products/ -Setting up DUNE UPS area... 
/cvmfs/dune.opensciencegrid.org/products/dune/ -~~~ -{: .output} - -> ## Optional -> > ### See if ROOT works -> > Try testing ROOT to make certain things are working -> > -> > ~~~ -> > setup root v6_28_12 -q e26:p3915:prof # sets up root for you -> > root -l -q $ROOTSYS/tutorials/dataframe/df013_InspectAnalysis.C -> > ~~~ -> > {: .language-bash} -> > You should see a plot that updates and then terminates. You may need to `export DISPLAY=0:0`. -> {: .solution} -{: .callout} - -### Caveats for later - -> ## Note: You cannot submit jobs from the Container -> You cannot submit jobs from the Container - you need to open a separate window. In that window do the minimal [Alma9](#AL9_setup) setup below and submit your jobs from that window. -> ->You may need to print your submit command to the screen or a file to do so if your submission is done from a script that uses ups. -{: .callout} - -> ## 4.3 Optional -> > ## See how you can make an alias so you don't have to type everything -> > You can store this in your (minimal) .bashrc or .profile if you want this alias to be available in all sessions. The alias will be defined but not executed. Only if you type the command `dune_setup7` yourself.> Not familiar with aliases? Read below. -> > -> > To create unix custom commands for yourself, we use 'aliases': -> > ~~~ -> > alias my_custom_commmand='the_long_command_you_want_to_alias_in_a_shorter_custom_name' -> > ~~~ -> > {: .source} -> > For DUNE setup, you can type for instance: -> > ~~~ -> > alias dune_setup7='source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh' -> > ~~~ -> > {: .language-bash} -> > or -> > ~~~ -> > alias dune_setup9='source /cvmfs/larsoft.opensciencegrid.org/spack-packages/setup-env.sh' -> > ~~~ -> > {: .language-bash} -> > -> > So next time you type: -> > ~~~ -> > dune_setup9 -> > ~~~ -> > {: .source} -> > Your terminal will execute the long command. This will work for your current session (if you disconnect, the alias won't exist anymore). 
-> {: .solution} -{: .callout} - - - - - - - - - -## 5. Exercise! (it's easy) -This exercise will help organizers see if you reached this step or need help. - -1) Start in your home area `cd ~` on the DUNE machine (normally CERN or FNAL) and create the file ```dune_presetup_2024.sh```. - - -Launch the *Apptainer* as described above in the [SL7 version](#SL7_setup) - -Write in it the following: -~~~ -export DUNELAR_VERSION=v10_00_04d00 -export DUNELAR_QUALIFIER=e26:prof - -export UPS_OVERRIDE="-H Linux64bit+3.10-2.17" -alias dune_setup7='source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh' -~~~ -{: .source} -When you start the training, you will have to source this file: -~~~ -source ~/dune_presetup_2024.sh -~~~ -{: .language-bash} -Then, to setup DUNE, use the created alias: -~~~ -dune_setup7 -setup dunesw $DUNELAR_VERSION -q $DUNELAR_QUALIFIER -~~~ -{: .language-bash} - -2) Create working directories in the `dune/app` and `pnfs/dune` areas (these will be explained during the training): -~~~ -mkdir -p /exp/dune/app/users/${USER} -mkdir -p /pnfs/dune/scratch/users/${USER} -mkdir -p /pnfs/dune/persistent/users/${USER} -~~~ -{: .language-bash} - -3) Print the date and add the output to a file named `my_first_login.txt`: -~~~ -date >& /exp/dune/app/users/${USER}/my_first_login.txt -~~~ -{: .language-bash} -4) With the above, we will check if you reach this point. However we want to tailor this tutorial to your preferences as much as possible. We will let you decide which animals you would like to see in future material, between: "puppy", "cat", "squirrel", "sloth", "unicorn pegasus llama" (or "prefer not to say" of course). Write your desired option on the second line of the file you just created above. - -> ## Note -> If you experience difficulties, please ask for help in the Slack channel [#computing-training-basics](https://dunescience.slack.com/archives/C02TJDHUQPR). Please mention in your message this is about the Setup step 5. Thanks! 
-{: .challenge} - -## 6. Getting setup for streaming and grid access -In addition to your kerberos access, you need to be in the DUNE VO (Virtual Organization) to access to global DUNE resources. This is necessary in particular to stream data and submit jobs to the grid. If you are on the DUNE collaboration list and have a Fermilab ID you should have been added automatically to the DUNE VO. - -To check if you are on the VO, two commands. The kx509 gets a certificate from your kerberos ticket. On a DUNE machine, type: -~~~ -kx509 -~~~ -{: .language-bash} - -~~~ -Authorizing ...... authorized -Fetching certificate ..... fetched -Storing certificate in /tmp/x509up_u55793 - -Your certificate is valid until: Wed Jan 27 18:03:55 2021 -~~~ -{: .output} - -To access the grid resources, you will need either need a proxy or a token. More information on proxy is available [here][proxy-info]. - - - -## How to authorize with the KX509/Proxy method - -On Alma9 you may need to do this first - -~~~ -spack load kx509 -~~~ -{: .language-bash} - -Requesting a proxy needs to be done once every 24 hours per login machine you’re using to identify yourself: - -~~~ -kx509 -export ROLE=Analysis -export X509_USER_PROXY=/tmp/x509up_dune_Analysis_`id -u` -voms-proxy-init -rfc -noregen -voms=dune:/dune/Role=$ROLE -valid 120:00 -~~~ -{: .language-bash} - -~~~ -Your identity: /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/ OU=People/CN=Claire David/CN=UID:cdavid -Contacting voms1.fnal.gov:15042 [/DC=org/DC=incommon/C=US/ST=Illinois/L=Batavia/O=Fermi Research Alliance/OU=Fermilab/CN=voms1.fnal.gov] "dune" Done -Creating proxy .................................... 
Done - -Your proxy is valid until Mon Jan 25 18:09:25 2021 -~~~ -{: .output} - -You should be able to access files now - -> {: .solution} -{: .callout} - -Report this by appending the output of `voms-proxy-info` to your first login file: -~~~ -voms-proxy-info >> /exp/dune/app/users/${USER}/my_first_login.txt -~~~ -{: .language-bash} - -With this done, you should be able to submit jobs and access remote DUNE storage systems via xroot. - - - -### Tokens method - -We are moving from proxies to tokens - these are a bit different. - -#### 1. Get your token - -~~~ - htgettoken -i dune --vaultserver htvaultprod.fnal.gov -~~~ -{: .language-bash} - -the first time you do this (and once a month thereafter), it will ask you to open a web browser and - -~~~ -Attempting OIDC authentication with https://htvaultprod.fnal.gov:8200 - -Complete the authentication at: - https://cilogon.org/device/?user_code=ABC-D1E-FGH -No web open command defined, please copy/paste the above to any web browser -Waiting for response in web browser -~~~ -{: .output} - -You will need to follow the instructions and copy and paste that link into your browser (can be any browser). There is a time limit on it so its best to do it right away. Choose Fermilab as the identity provider in the menu, even if your home institution is listed. After you hit log on with your service credentials, you'll get a message saying you approved the access request, and then after a short delay (may be several seconds) in the terminal you will see. - - -~~~ -Saving credkey to /nashome/u/username/.config/htgettoken/credkey-dune-default -Saving refresh token ... done -Attempting to get token from https://htvaultprod.fnal.gov:8200 ... succeeded -Storing bearer token in /tmp/bt_token_dune_Analysis_somenumber.othernumber -Storing condor credentials for dune -~~~ -{: .output} - -you only have to do the web thing once/month. - -#### 2. 
Tell the system where your token is - - -~~~ -export BEARER_TOKEN_FILE=/run/user/`id -u`/bt_u`id -u` -~~~ -{: .language-bash} - -the `id -u` just returns your numerical user ID - -With this done, you should be able to submit jobs and access remote DUNE storage systems via xroot. - - - -> ## Issues -> If you have issues here, please ask [#computing-training-basics](https://dunescience.slack.com/archives/C02TJDHUQPR) in Slack to get support. Please mention in your message it is the Step 6 of the setup. Thanks! -{: .challenge} - -> ## Success -> If you obtain the message starting with `Your proxy is valid until`... Congratulations! You are ready to go! -{: .keypoints} - - -## Set up on CERN machines - - - -> ## Warning: Some data access operations here still require a fermilab account and the Fermilab VO. We are working on a solution. -{: .callout} - -See [https://github.com/DUNE/data-mgmt-ops/wiki/Using-Rucio-to-find-Protodune-files-at-CERN](https://github.com/DUNE/data-mgmt-ops/wiki/Using-Rucio-to-find-Protodune-files-at-CERN) for instructions on getting full access to DUNE data via metacat/rucio from lxplus. - -### 1. Setup in Alma9 - -The directions above at: [AL9_setup](#AL9_setup) above should work directly at CERN, do those and proceed to step 3. - -### 2. Source the DUNE environment SL7 setup script -CERN access is mainly for ProtoDUNE collaborators. If you have a valid CERN ID and access to lxplus via ssh, you can setup your environment for this tutorial as follow: - -log into `lxplus.cern.ch` - -fire up the Apptainer as explained in [SL7 Setup](#SL7_setup) but with a slightly different version as mounts are different. 
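The path in the export below is built from your numeric user ID, which `id -u` prints. As an illustration, here is the same expansion spelled out step by step:

```shell
# `id -u` prints your numeric user ID; the token path embeds it twice.
uid=$(id -u)
export BEARER_TOKEN_FILE="/run/user/${uid}/bt_u${uid}"
echo "$BEARER_TOKEN_FILE"    # e.g. /run/user/12345/bt_u12345
```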
- -~~~ -/cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin/apptainer shell --shell=/bin/bash\ --B /cvmfs,/afs,/opt,/run/user,/etc/hostname --ipc --pid \ -/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-dev-sl7:latest -~~~ -{: .language-bash} - -You may have to add some mounts - here I added `/afs/` but removed `/nashome/`, `/exp/`, `/etc/krb5.conf` and `/pnfs/`. - -You should then be able to proceed with much of the tutorial thanks to the wonder that is [`/cvmfs/`]({{ site.baseurl }}/03.3-cvmfs.html). - -Set up the DUNE software - -~~~ -export UPS_OVERRIDE="-H Linux64bit+3.10-2.17" # makes certain you get the right UPS -source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh -setup kx509 -~~~ -{: .language-bash} - -~~~ -Setting up larsoft UPS area... /cvmfs/larsoft.opensciencegrid.org/products/ -Setting up DUNE UPS area... /cvmfs/dune.opensciencegrid.org/products/dune/ -~~~ -{: .output} - -### 3. Getting authentication for data access - -If you have a Fermilab account already, do this to get access the data catalog worldwide - -~~~ -kdestroy -kinit -f @FNAL.GOV -kx509 -export ROLE=Analysis -voms-proxy-init -rfc -noregen -voms=dune:/dune/Role=$ROLE -valid 120:00 -~~~ -{: .language-bash} - -~~~ -Checking if /tmp/x509up_u79129 can be reused ... yes -Your identity: /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Heidi n/CN=UID: -Contacting voms1.fnal.gov:15042 [/DC=org/DC=incommon/C=US/ST=Illinois/O=Fermi Research Alliance/CN=voms1.fnal.gov] "dune" Done -Creating proxy .......................................................................................... Done - -Your proxy is valid until Sat Aug 24 17:11:41 2024 -~~~ -{: .output} - - - -### 4. Access tutorial datasets -Normally, the datasets are accessible through the grid resource. But with your CERN account, you may not be part of the DUNE VO yet (more on this during the tutorial). We found a workaround: some datasets have been copied locally for you. 
You can check them here: -~~~ -ls /afs/cern.ch/work/t/tjunk/public/may2023tutorialfiles/ -~~~ -{: .language-bash} -~~~ -np04_raw_run005387_0019_dl5_reco_12900894_0_20181102T023521.root -PDSPProd4_protoDUNE_sp_reco_stage1_p1GeV_35ms_sce_datadriven_41094796_0_20210121T214555Z.root -~~~ -{: .output} - -### 5. Notify us -You should be good to go, and you might revisit [Indico event page][indico-event-page]. -If however you are experiencing issues, please contact us as soon as possible. Be sure to mention "Setup on CERN machines" if that is the case, and we will do our best to assist you. - -> ## Success -> If you can list the files above, you should be able to do most of the tutorial on LArSoft. -{: .keypoints} - -> ## Warning -> Connecting to CERN machines will not give you the best experience to understand storage spaces and data management. If you obtain a FNAL account in the future, you can however do the training through the recorded videos that will be made available after the event. -{: .checklist} - -> ## Issues -> If you have issues here, please go to the [#computing-training-basics](https://dunescience.slack.com/archives/C02TJDHUQPR)Slack channel to get support. Please note that you are on a CERN machine in your message. Thanks! 
-{: .discussion} +then come back and do this tutorial ### Useful Links From c4a183118ed2a7789ef5c5f1da283479d0ad2e80 Mon Sep 17 00:00:00 2001 From: Heidi Schellman <33669005+hschellman@users.noreply.github.com> Date: Fri, 29 Aug 2025 16:10:50 -0700 Subject: [PATCH 2/5] trim out duplicates --- _config.yml | 2 +- _episodes/01-introduction.md | 14 ++- _episodes/02-submit-jobs-w-justin.md | 12 +-- _episodes/03.2-UPS.md | 101 ------------------- _episodes/03.3-cvmfs.md | 51 ---------- _extras/Common-Error-Messages.md | 2 +- _extras/TutorialsMasterList.md | 4 +- _extras/introduction_metacat_rucio_justin.md | 18 ++-- index.md | 32 +++--- 9 files changed, 38 insertions(+), 198 deletions(-) delete mode 100644 _episodes/03.2-UPS.md delete mode 100644 _episodes/03.3-cvmfs.md diff --git a/_config.yml b/_config.yml index e8d125f..03065a6 100644 --- a/_config.yml +++ b/_config.yml @@ -11,7 +11,7 @@ carpentry: "dune" # Overall title for pages. -title: "Computing Basics for DUNE - Late 2024 edition" +title: "Batch Computing Basics for DUNE - 2025 transition edition" # Life cycle stage of the lesson # See this page for more details: https://cdh.carpentries.org/the-lesson-life-cycle.html diff --git a/_episodes/01-introduction.md b/_episodes/01-introduction.md index 1989d63..35410c3 100644 --- a/_episodes/01-introduction.md +++ b/_episodes/01-introduction.md @@ -28,9 +28,7 @@ The May 2023 DUNE computing training spanned two days: [Indico site](https://ind --> -This is a short 3 hour version of the basics. We will be adding/offering additional tutorials. An important one that is coming soon is: - -[The LArSoft tutorial at CERN, February 3-7, 2025](https://indico.cern.ch/event/1461779/) password on the [tutorials page](https://wiki.dunescience.org/wiki/Computing_tutorials) +This is a short 3 hour version of the basics. We will be adding/offering additional tutorials. 
[Also check out the longer list of DUNE computing tutorials](https://wiki.dunescience.org/wiki/Computing_tutorials) (collaborators only) @@ -48,7 +46,7 @@ A similar session from May 2022 was captured for your asynchronous review. ## Basic setup reminder -You should have gone through the [setup sequence]({{ site.baseurl }}/setup) +You should have gone through the [setup sequence](https://dune.github.io/computing-basics/setup) As a reminder you need to choose between running on sl7 in a container or al9. You do NOT want to mix them. @@ -62,11 +60,11 @@ source mysetup7.sh Here are some example scripts that do most of the setups explained in this tutorial. You need to store these in your home area, source them every time you log in, and possibly update them as code versions evolve. -- [SL7 setup]({{ site.baseurl }}/sl7_setup) +- [SL7 setup](https://dune.github.io/computing-basics/sl7_setup) -- [AL9 setup]({{ site.baseurl }}/al9_setup) +- [AL9 setup](https://dune.github.io/computing-basics/al9_setup) -> ## If you run into problems, check out the [Common Error Messages]({{ site.baseurl }}/ErrorMessages) page and the [FAQ page](https://github.com/orgs/DUNE/projects/19/) +> ## If you run into problems, check out the [Common Error Messages](https://dune.github.io/computing-basics/ErrorMessages) page and the [FAQ page](https://github.com/orgs/DUNE/projects/19/) > if that doesn't help, use Slack to ask us about the problem - there is always a new one cropping up. {: .challenge} @@ -95,7 +93,7 @@ Here are some example scripts that do most of the setups explained in this tutor There will be live documents linked from [Indico][indico-event-link] for each [Zoom][zoom-link] session. You can write questions there, anonymously or not, and experts will reply. The chat on Zoom can quickly saturate so this is a more convenient solution and proved very successful at the previous training. We will collect all questions and release a Q&A after the event. 
--> -You must be on the DUNE Collaboration member list and have a valid FNAL or CERN account. See the old [Indico Requirement page][indico-event-requirements] for more information. Windows users are invited to review the [Windows Setup page]({{ site.baseurl }}/Windows.html). +You must be on the DUNE Collaboration member list and have a valid FNAL or CERN account. See the old [Indico Requirement page][indico-event-requirements] for more information. Windows users are invited to review the [Windows Setup page](https://dune.github.io/computing-basics/Windows.html). You should join the DUNE Slack instance and look in [#computing-training-basics](https://dunescience.slack.com/archives/C02TJDHUQPR) for help with this tutorial diff --git a/_episodes/02-submit-jobs-w-justin.md b/_episodes/02-submit-jobs-w-justin.md index f2e09d6..afe401c 100644 --- a/_episodes/02-submit-jobs-w-justin.md +++ b/_episodes/02-submit-jobs-w-justin.md @@ -5,20 +5,20 @@ exercises: 0 questions: - How to submit realistic grid jobs with JustIn objectives: -- Demonstrate use of JustIn for job submission with more complicated setups. +- Demonstrate use of [justIn](https://dunejustin.fnal.gov) for job submission with more complicated setups. keypoints: - Always, always, always prestage input datasets. No exceptions. 
---
-# PLEASE USE THE NEW JUSTIN SYSTEM INSTEAD OF POMS
+# PLEASE USE THE NEW [justIn](https://dunejustin.fnal.gov) SYSTEM INSTEAD OF POMS

-__The JustIn Tutorial is currently in docdb at: [JustIn Tutorial](https://docs.dunescience.org/cgi-bin/sso/RetrieveFile?docid=30145)__
+__The [justIn](https://dunejustin.fnal.gov) Tutorial is currently in docdb at: [JustIn Tutorial](https://docs.dunescience.org/cgi-bin/sso/RetrieveFile?docid=30145)__

-The JustIn system is describe in detail at:
+The [justIn](https://dunejustin.fnal.gov) system is described in detail at:

-__[JustIn Home](https://justin.dune.hep.ac.uk/dashboard/)__
+__[JustIn Home](https://dunejustin.fnal.gov/dashboard/)__

-__[JustIn Docs](https://justin.dune.hep.ac.uk/docs/)__
+__[JustIn Docs](https://dunejustin.fnal.gov/docs/)__

 > ## Note
 More documentation coming soon
diff --git a/_episodes/03.2-UPS.md b/_episodes/03.2-UPS.md
deleted file mode 100644
index 9b031cc..0000000
--- a/_episodes/03.2-UPS.md
+++ /dev/null
@@ -1,101 +0,0 @@
----
-title: The old UPS code management system
-teaching: 15
-exercises: 5
-questions:
-- How are different software versions handled?
-objectives:
-- Understand the roles of the tools UPS (and Spack)
-keypoints:
-- The Unix Product Setup (UPS) is a tool to ensure consistency between different software versions and reproducibility.
-- CVMFS distributes software and related files without installing them on the target computer (using a VM, Virtual Machine).
----
-## What is UPS and why do we need it?
-
-> ## Note
-> UPS is going away and only works on SL7 but we do not yet have a fully functional replacement.
-> You need to be in the Apptainer to use it.
-> UPS is being replaced by a new [spack][Spack Documentation] system for Alma9. We will be adding a Spack tutorial soon but for now, you need to use SL7/UPS to use the full DUNE code stack.
->
-> Go back and look at the [SL7/Apptainer]({{ site.baseurl }}setup.html#SL7_setup) instructions to get an SL7 container for this section.
-{: .challenge} - -An important requirement for making valid physics results is computational reproducibility. You need to be able to repeat the same calculations on the data and MC and get the same answers every time. You may be asked to produce a slightly different version of a plot for example, and the data that goes into it has to be the same every time you run the program. - -This requirement is in tension with a rapidly-developing software environment, where many collaborators are constantly improving software and adding new features. We therefore require strict version control; the workflows must be stable and not constantly changing due to updates. - -DUNE must provide installed binaries and associated files for every version of the software that anyone could be using. Users must then specify which version they want to run before they run it. All software dependencies must be set up with consistent versions in order for the whole stack to run and run reproducibly. - -The Unix Product Setup (UPS) is a tool to handle the software product setup operation. - -UPS is set up when you setup DUNE: - -Launch the Apptainer and then: - -~~~ - source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh - export DUNELAR_VERSION=v10_00_04d00 - export DUNELAR_QUALIFIER=e26:prof - setup dunesw $DUNELAR_VERSION -q $DUNELAR_QUALIFIER -~~~ -{: .language-bash} - - -`dunesw`: product name
-`$DUNELAR_VERSION` version tag
-`$DUNELAR_QUALIFIER` are "qualifiers". Qualifiers are separated with colons and may be specified in any order. The `e26` qualifier refers to a specific version of the gcc compiler suite, and `prof` means select the installed product that has been compiled with optimizations turned on. An alternative to `prof` is the `debug` qualifier. All builds of LArSoft and dunesw are compiled with debug symbols turned on, but the "debug" builds are made with optimizations turned off. Both kinds of software can be debugged, but it is easier to debug the debug builds (code executes in the proper order and variables aren't optimized away so they can be inspected).
-
-Another specifier of a product install is the "flavor". This refers to the operating system the program was compiled for. These days we only support SL7, but in the past we used to also support SL6 and various versions of macOS. The flavor is automatically selected when you set up a product using setup (unless you override it which is usually a bad idea). Some product are "unflavored" because they do not contain anything that depends on the operating system. Examples are products that only contain data files or text files.
-
-Setting up a UPS product defines many environment variables. Most products have an environment variable of the form `<product>_DIR`, where `<product>` is the name of the UPS product in all capital letters. This is the top-level directory and can be used when searching for installed source code or fcl files for example. `<product>_FQ_DIR` is the one that specifies a particular qualifier and flavor. There is also `<product>_VERSION`, and many products have `<product>_INC` and `<product>_LIB`.
-
-> ## Exercise 3
-> * show all the versions of dunesw that are currently available by using the `ups list -aK+ dunesw` command
-> * pick one version and substitute that for DUNELAR_VERSION and DUNELAR_QUALIFIER above and set up dunesw
-{: .callout}
-
-Many products modify the following search path variables, prepending their pieces when set up.
These search paths are needed by _art_ jobs. - -`PATH`: colon-separated list of directories the shell uses when searching for programs to execute when you type their names at the command line. The command `which` tells you which version of a program is found first in the PATH search list. Example: -~~~ -which lar -~~~ -{: .language-bash} - -will tell you where the lar command you would execute is if you were to type `lar` at the command prompt. -The other paths are needed by _art_ for finding plug-in libraries, fcl files, and other components, like gdml files. -`CET_PLUGIN_PATH` -`LD_LIBRARY_PATH` -`FHICL_FILE_PATH` -`FW_SEARCH_PATH` - -Also the PYTHONPATH describes where Python modules will be loaded from. - -Try - -~~~ -which root -~~~ -{: .language-bash} -to see the version of root that dunesw sets up. Try it out! - - -### UPS basic commands - -| Command | Action | -|------------------------------------------------|------------------------------------------------------------------| -| `ups list -aK+ dunesw` | List the versions and flavors of dunesw that exist on this node | -| `ups active` | Displays what has been setup | -| `ups depend dunesw v10_00_04d00 -q e20:prof` | Displays the dependencies for this version of dunesw | - - -> ## Exercise 4 -> * show all the dependencies of dunesw by using "ups depend dunesw $DUNELAR_VERSION -q $DUNELAR_QUALIFIER" -{: .challenge} ->## UPS Documentation Links -> -> * [UPS reference manual](http://www.fnal.gov/docs/products/ups/ReferenceManual/) -> * [UPS documentation](https://cdcvs.fnal.gov/redmine/projects/ups/wiki) -> * [UPS qualifiers](https://cdcvs.fnal.gov/redmine/projects/cet-is-public/wiki/AboutQualifiers) -{: .callout} - diff --git a/_episodes/03.3-cvmfs.md b/_episodes/03.3-cvmfs.md deleted file mode 100644 index 1cf90e5..0000000 --- a/_episodes/03.3-cvmfs.md +++ /dev/null @@ -1,51 +0,0 @@ ---- -title: CVMFS distributed file system -teaching: 10 -exercises: 0 -questions: -- What is cvmfs -objectives: -- Understand 
the roles of the CVMFS. -keypoints: -- CVMFS distributes software and related files without installing them on the target computer (using a VM, Virtual Machine). ---- - - -## CVMFS -**What is CVMFS and why do we need it?** -DUNE has a need to distribute precompiled code to many different computers that collaborators may use. Installed products are needed for four things: -1. Running programs interactively -2. Running programs on grid nodes -3. Linking programs to installed libraries -4. Inspection of source code and data files - -Results must be reproducible, so identical code and associated files must be distributed everywhere. DUNE does not own any batch resources -- we use CPU time on computers that participating institutions donate to the Open Science Grid. We are not allowed to install our software on these computers and must return them to their original state when our programs finish running so they are ready for the next job from another collaboration. - -CVMFS is a perfect tool for distributing software and related files. It stands for CernVM File System (VM is Virtual Machine). Local caches are provided on each target computer, and files are accessed via the `/cvmfs` mount point. DUNE software is in the directory `/cvmfs/dune.opensciencegrid.org`, and LArSoft code is in `/cvmfs/larsoft.opensciencegrid.org`. These directories are auto-mounted and need to be visible when one executes `ls /cvmfs` for the first time. Some software is also in /cvmfs/fermilab.opensciencegrid.org. - -CVMFS also provides a de-duplication feature. If a given file is the same in all 100 releases of dunesw, it is only cached and transmitted once, not independently for every release. So it considerably decreases the size of code that has to be transferred. - -When a file is accessed in `/cvmfs`, a daemon on the target computer wakes up and determines if the file is in the local cache, and delivers it if it is. 
If not, the daemon contacts the CVMFS repository server responsible for the directory, and fetches the file into local cache. In this sense, it works a lot like AFS. But it is a read-only filesystem on the target computers, and files must be published on special CVMFS publishing servers. Files may also be cached in a layer between the CVMFS host and the target node in a squid server, which helps facilities with many batch workers reduce the network load in fetching many copies of the same file, possibly over an international connection. Directories under /cvmfs may initially not show up if you type `ls /cvmfs`. Instead, accessing them the first time will automatically mount the appropriate volume, at least under Linux. CMVFS clients also exist for macOS, and there, the volumes may need to be listed explicitly when starting CVMFS on a mac. - -CVMFS also has a feature known as "Stashcache" or "xCache". Files that are in /cvmfs/dune.osgstorage.org are not actually transmitted -in their entirety, only pointers to them are, and then they are fetched from one of several regional cache servers or in the case of DUNE from Fermilab dCache directly. DUNE uses this to distribute photon library files, for instance. - -CVMFS is by its nature read-all so code is readable by anyone in the world with a CVMFS client. CVMFS clients are available for download to desktops or laptops. Sensitive code can not be stored in CVMFS. 
- -More information on CVMFS is available [here](https://wiki.dunescience.org/wiki/DUNE_Computing/Access_files_in_CVMFS) - -> ## Exercise 6 -> * `cd /cvmfs` and do an `ls` at top level -> * What do you see--do you see the four subdirectories (dune.opensciencegrid.org, larsoft.opensciencegrid.org, fermilab.opensciencegrid.org, and dune.osgstorage.org) -> * cd dune.osgstorage.org/pnfs/fnal.gov/usr/dune/persistent/stash/PhotonPropagation/LibraryData -{: .challenge} - -## Useful links to bookmark - -* CVMFS on DUNE wiki: [Access files in CVMFS](https://wiki.dunescience.org/wiki/DUNE_Computing/Access_files_in_CVMFS) - -[Ifdh_commands]: https://cdcvs.fnal.gov/redmine/projects/ifdhc/wiki/Ifdh_commands -[xrootd-man-pages]: https://xrootd.slac.stanford.edu/docs.html -[Understanding-storage]: https://cdcvs.fnal.gov/redmine/projects/fife/wiki/Understanding_storage_volumes -[DataCatalogDocs]: https://dune.github.io/DataCatalogDocs/index.html -[MetaCatGlossary]: https://dune.github.io/DataCatalogDocs/glossary.html diff --git a/_extras/Common-Error-Messages.md b/_extras/Common-Error-Messages.md index ef725e3..692e189 100644 --- a/_extras/Common-Error-Messages.md +++ b/_extras/Common-Error-Messages.md @@ -13,7 +13,7 @@ keypoints: - #### `bash: setup: command not found` setup is a UPS command. You need to be running in the Apptainer and setup the DUNE ups system - check out the instructions in [SL7 setup] - ({{ site.baseurl }}/sl7_setup) + (https://dune.github.io/computing-basics/sl7_setup) - #### `SyntaxError: future feature annotations is not defined` diff --git a/_extras/TutorialsMasterList.md b/_extras/TutorialsMasterList.md index 6a5e196..3d53bf9 100644 --- a/_extras/TutorialsMasterList.md +++ b/_extras/TutorialsMasterList.md @@ -4,8 +4,8 @@ ## [Computing Basics](https://dune.github.io/computing-basics/) A starting point for new DUNE people. How to log in, access disk and run basic jobs at FNAL and CERN. 
-## [The Justin Workflow System](https://justin.dune.hep.ac.uk/docs/) -How to use the Justin Workflow system to submit jobs and access data. [Metacat/Rucio/Justin Tutorial](https://docs.dunescience.org/cgi-bin/sso/RetrieveFile?docid=30145) has step by step instructions. +## [The Justin Workflow System](https://dunejustin.fnal.gov/docs/) +How to use the [justIn](https://dunejustin.fnal.gov) Workflow system to submit jobs and access data. [Metacat/Rucio/Justin Tutorial](https://docs.dunescience.org/cgi-bin/sso/RetrieveFile?docid=30145) has step by step instructions. ## [LArTPC Reconstruction Training](https://indico.ph.ed.ac.uk/event/268/) Annual training on LArTPC reconstruction and simulation methods. diff --git a/_extras/introduction_metacat_rucio_justin.md b/_extras/introduction_metacat_rucio_justin.md index f7f9476..2e68e84 100644 --- a/_extras/introduction_metacat_rucio_justin.md +++ b/_extras/introduction_metacat_rucio_justin.md @@ -5,20 +5,20 @@ exercises: 0 questions: - How to submit realistic grid jobs with JustIn objectives: -- Demonstrate use of JustIn for job submission with more complicated setups. +- Demonstrate use of [justIn](https://dunejustin.fnal.gov) for job submission with more complicated setups. keypoints: - Always, always, always prestage input datasets. No exceptions. 
--- -# PLEASE USE THE NEW JUSTIN SYSTEM INSTEAD OF POMS +# PLEASE USE THE NEW [justIn](https://dunejustin.fnal.gov) SYSTEM INSTEAD OF POMS -__The JustIn Tutorial is currently in docdb at: [JustIn Tutorial](https://docs.dunescience.org/cgi-bin/sso/RetrieveFile?docid=30145)__ +__The [justIn](https://dunejustin.fnal.gov) Tutorial is currently in docdb at: [JustIn Tutorial](https://docs.dunescience.org/cgi-bin/sso/RetrieveFile?docid=30145)__ -The JustIn system is described in detail at: +The [justIn](https://dunejustin.fnal.gov) system is described in detail at: -__[JustIn Home](https://justin.dune.hep.ac.uk/dashboard/)__ +__[JustIn Home](https://dunejustin.fnal.gov/dashboard/)__ -__[JustIn Docs](https://justin.dune.hep.ac.uk/docs/)__ +__[JustIn Docs](https://dunejustin.fnal.gov/docs/)__ > ## Note More documentation coming soon @@ -209,7 +209,7 @@ A given file can also be searched and visualized from the WEB interface - is the new workflow system replacing POMS - It can be used to process several input files by submitting batch jobs on the grid -- justIN is a workflow system that processes data by satisfying the requirements of data location/data catalog, rapid code distribution service +- [justIn](https://dunejustin.fnal.gov) is a workflow system that processes data by satisfying the requirements of data location/data catalog, rapid code distribution service and job submission to the grid. justIN ties together: @@ -371,7 +371,7 @@ The scary preload is to allow `xroot` to read `hdf5` files. 
$ USERF=$USER
$ FNALURL='https://fndcadoor.fnal.gov:2880/dune/scratch/users'
$ justin simple-workflow --mql "files from fardet-hd:fardet-hd__fd_mc_2023a_reco2__full-reconstructed__v09_81_00d02__standard_reco2_dune10kt_nu_1x2x6__prodgenie_nu_dune10kt_1x2x6__out1__validation skip 5 limit 5 ordered" --jobscript submit_ana.jobscript --rss-mb 4000 --output-pattern '*_ana_*.root:$FNALURL/$USERF'

-'You can look at your job status by using justIN dashboard https://justin.dune.hep.ac.uk/dashboard/?method=list-workflows
+You can look at your job status by using the [justIn](https://dunejustin.fnal.gov) dashboard https://dunejustin.fnal.gov/dashboard/?method=list-workflows



@@ -541,7 +541,7 @@ justintime
 Links
 Metacat WEB interface: https://metacat.fnal.gov:9443/dune_meta_prod/app/auth/login
-justIN: https://justin.dune.hep.ac.uk/docs/
+justIN: https://dunejustin.fnal.gov/docs/
 Slack channels:
 #workflow
diff --git a/index.md b/index.md
index 2f94ddf..5918721 100644
--- a/index.md
+++ b/index.md
@@ -10,23 +10,21 @@ latitude: "45"
 longitude: "-1"
 humandate: "2024"
 humantime: "asynchronous"
-startdate: "2024-05-20"
-enddate: "2024-12-01"
-instructor: ["Heidi Schellman","Dave Demuth","Michael Kirby","Steve Timm","Tom Junk","Ken Herner"]
+startdate: "2025-09-08"
+enddate: "2025-09-11"
+instructor: ["Heidi Schellman","Dave Demuth","Michael Kirby","Steve Timm","Tom Junk","Ken Herner","Aaron Higuera"]
 helper: ["mentor1", "mentor2"]
 email: ["schellmh@oregonstate.edu","dmdemuth@gmail.com","mkirby@bnl.gov","timm@fnal.gov","junk@fnal.gov","herner@fnal.gov"]
-collaborative_notes: "2024-05-24-dune"
+collaborative_notes: "2025-09-11-dune"
 eventbrite:
 ---

-This tutorial will teach you the basics of DUNE Computing.
+This tutorial will teach you the basics of DUNE batch computing.

-Instructors will engage students with hands-on lessons focused in three areas:
+Instructors will engage students with hands-on lessons focused in two areas:

-0. Basics of logging on, getting accounts, disk spaces
-1. Data storage and management,
-2. Introduction to LArSoft
-3.
How to find futher training materials for DUNE and HEP software +1. The [justIn](https://dunejustin.fnal.gov) batch system +2. The jobsub batch system Mentors will answer your questions and provide technical support. @@ -40,32 +38,28 @@ Mentors will answer your questions and provide technical support. > We recommend that participants to go through [The Unix Shell](https://swcarpentry.github.io/shell-novice/), if new to the unix command line (also known as terminal or shell). > 2. A computer set up so that you can log into a remote unix system at FNAL or CERN. > This will include getting DUNE computing accounts at FNAL or CERN. -> See [Setup]({{ page.root }}/setup.html) to do this pre-class setup. +> See [Setup](http://dune.github.io/computing-basics/setup.html) to do this pre-class setup. {: .prereq} By the end of this workshop, participants will know how to: -* Utilize data volumes at FNAL. -* Understand good data management practices. -* Provide a basic overview of art and LArSoft to a new researcher. +* submit simple jobs using the [justIn](https://dunejustin.fnal.gov) batch system +* know some simple methods for debugging batch jobs There are additional materials provided that explain how to: * [Develop configuration files to control batch jobs]({{ site.baseurl }}/07-grid-job-submission) -* [Use the Justin system to process data]({{ site.baseurl }}/08-submit-jobs-w-justin) -* [Modify LArSoft modules]({{ site.baseurl }}/06-larsoft-modify-module) +* [Use the [justIn](https://dunejustin.fnal.gov) system to process data]({{ site.baseurl }}/02-submit-jobs-w-justin) + You will need to be a DUNE Collaborator (listed member), and have a valid FNAL or CERN computing account to join the tutorial. Contact your DUNE group leader for assistance. > ## Getting Started > -> First step: follow the directions in the "[Setup]( -> {{ page.root }}/setup.html)". 
Once you follow the instructions; we give you an easy exercise +> First step: follow the directions in the "[Setup](https://dune.github.io/computing-basics/setup.html)". Once you follow the instructions; we give you an easy exercise > to make sure you are good to go. {: .callout} -Then we will proceed through the episodes - the live tutorial currently goes to episode 5. - Ask questions on [Slack](https://dunescience.slack.com/archives/C02TJDHUQPR) anytime or - during the live lessons - on the [livedoc](https://docs.google.com/document/d/1QNK-hKPqLIVaecRyg9q4QZOHNwAZgq32oHVuboG_AvQ/edit?usp=sharing). From 3ac4bd0ba9eeaf6470207479c0f50e67794abb8b Mon Sep 17 00:00:00 2001 From: Heidi Schellman <33669005+hschellman@users.noreply.github.com> Date: Fri, 29 Aug 2025 16:34:38 -0700 Subject: [PATCH 3/5] more cleanup --- _episodes/07-grid-job-submission.md | 25 +- obsolete/05-improve-code-efficiency.md | 360 ------------------- obsolete/sam-by-schellman.md | 475 ------------------------- 3 files changed, 24 insertions(+), 836 deletions(-) delete mode 100644 obsolete/05-improve-code-efficiency.md delete mode 100644 obsolete/sam-by-schellman.md diff --git a/_episodes/07-grid-job-submission.md b/_episodes/07-grid-job-submission.md index 0c7e1b9..44c52f7 100644 --- a/_episodes/07-grid-job-submission.md +++ b/_episodes/07-grid-job-submission.md @@ -368,6 +368,29 @@ Since the workflow was causing a systemwide disruption we immediately held all o DUNE has also created a a global glideinWMS pool similar to the CMS Global Pool that is intended to serve as a single point through which multiple job submission systems (e.g. HTCondor schedulers at sites outside of Fermilab) can have access to the same resources. Jobs using the global pool still run in the exactly the same way as those that don't. We plan to move more and more work over to the global pool in 2023 and priority access to the FermiGrid quota will eventually be given to jobs submitted to the global pool. 
To switch to the global pool with jobsub, it's simply a matter of adding `--global-pool dune` as an option to your submission command. The only practical difference is that your jobs will come back with IDs of the form NNNNNNN.N@dunegpschedd0X.fnal.gov instead of NNNNNNN.N@jobsub0X.fnal.gov. Again, everything else is identical, so feel free to test it out.
+
+## Making subsets of metacat datasets
+
+Running across a very large number of files puts you at risk of system issues. It is often much nicer to run over several smaller subsets.
+Many official metacat definitions are large data collections defined only by their properties and not really suitable for a single job.
+
+You can do the following: submit your jobs using the skip and limit options. Here 'namespace:official_dataset' stands in for the official dataset.
+
+See [the basics tutorial](https://dune.github.io/computing-basics/03-data-management/index.html#official-datasets-) for information on official datasets.
+
+~~~
+query="files from namespace:official_dataset skip 0 limit 1000"
+query="files from namespace:official_dataset skip 1000 limit 1000"
+query="files from namespace:official_dataset skip 2000 limit 1000"
+....
+~~~
+{: .language-bash}
+
+
+
+
 ## Verify Your Learning:
diff --git a/obsolete/05-improve-code-efficiency.md b/obsolete/05-improve-code-efficiency.md
deleted file mode 100644
index 7fa2a80..0000000
--- a/obsolete/05-improve-code-efficiency.md
+++ /dev/null
@@ -1,360 +0,0 @@
----
-title: Code-makeover on how to code for better efficiency
-teaching: 50
-exercises: 0
-questions:
-- How to write the most efficient code?
-objectives:
-- Learn good tips and tools to improve your code.
-keypoints:
-- CPU, memory, and build time optimizations are possible when good code practices are followed.
----
-
-#### Session Video
-
-The session will be captured on video a placed here after the workshop for asynchronous study.
- -#### Live Notes - -Participants are encouraged to monitor and utilize the [Livedoc for May. 2023](https://docs.google.com/document/d/19XMQqQ0YV2AtR5OdJJkXoDkuRLWv30BnHY9C5N92uYs/edit?usp=sharing) to ask questions and learn. For reference, the [Livedoc from Jan. 2023](https://docs.google.com/document/d/1sgRQPQn1OCMEUHAk28bTPhZoySdT5NUSDnW07aL-iQU/edit?usp=sharing) is provided. - -### Code Make-over - -**How to improve your code for better efficiency** - -DUNE simulation, reconstruction and analysis jobs take a lot of memory and CPU time. This owes to the large size of the Far Detector modules as well as the many channels in the Near Detectors. Reading out a large volume for a long time with high granularity creates a lot of data that needs to be stored and processed. - -### CPU optimization: - -**Run with the “prof” build when launching big jobs.** While both the "debug" and "prof" builds have debugging and profiling information included in the executables and shared libraries, the "prof" build has a high level of compiler optimization turned on while "debug" has optimizations disabled. Debugging with the "prof" build can be done, but it is more difficult because operations can be reordered and some variables get put in CPU registers instead of inspectable memory. The “debug” builds are generally much slower, by a factor of four or more. Often this difference is so stark that the time spent repeatedly waiting for a slow program to chug through the first trigger record in an interactive debugging session is more costly than the inconvenience of not being able to see some of the variables in the debugger. If you are not debugging, then there really is (almost) no reason to use the “debug” builds. If your program produces a different result when run with the debug build and the prof build (and it’s not just the random seed), then there is a bug to be investigated. 
-
-**Compile your interactive ROOT scripts instead of running them in the interpreter** At the ROOT prompt, use .L myprogram.C++ (even though its filename is myprogram.C). Also .x myprogram.C++ will compile and then execute it. This will force a compile. .L myprogram.C+ will compile it only if necessary.
-
-**Run gprof or other profilers like valgrind's callgrind:** You might be surprised at what is actually taking all the time in your program. There is abundant documentation on the [web][gnu-manuals-gprof], and also the valgrind online documentation.
-There is no reason to profile a "debug" build and there is no need to hand-optimize something the compiler will optimize anyway, and which may even hurt the optimality of the compiler-optimized version.
-
-**The Debugger can be used as a simple profiler:** If your program is horrendously slow (and/or it used to be fast), pausing it at any time is likely to pause it while it is doing its slow thing. Run your program in the debugger, pause it when you think it is doing its slow thing (i.e. after initialization), and look at the call stack. This technique can be handy because you can then inspect the values of variables that might give a clue if there's a bug making your program slow (e.g. looping over 10<sup>15</sup> wires in the Far Detector, which would indicate a bug, such as an uninitialized loop counter or an unsigned loop counter that is initialized with a negative value).
-
-**Don't perform calculations or do file i/o that will only later be ignored.** It's just a waste of time. If you need to pre-write some code because in future versions of your program the calculation is not ignored, comment it out, or put a test around it so it doesn't get executed when it is not needed.
-
-**Extract constant calculations out of loops.**
-
-<table>
-<tr>
-<th> Code Example (BAD) </th>
-<th> Code Example (GOOD) </th>
-</tr>
-<tr>
-<td>
-<pre>
-double sum = 0;
-for (size_t i=0; i&lt;result.size(); ++i)
-{
-  sum += result.at(i)/TMath::Sqrt(2.0);
-}
-</pre>
-</td>
-<td>
-<pre>
-double sum = 0;
-double f = TMath::Sqrt(0.5);
-for (size_t i=0; i&lt;result.size(); ++i)
-{
-  sum += result.at(i)*f;
-}
-</pre>
-</td>
-</tr>
-</table>
-
-The example above also takes advantage of the fact that floating-point multiplies generally have significantly less latency than floating-point divides (this is still true, even with modern CPUs).
-
-**Use sqrt():** Don't use `pow()` or `TMath::Power` when a multiplication or `sqrt()` function can be used.
-
-<table>
-<tr>
-<th> Code Example (BAD) </th>
-<th> Code Example (GOOD) </th>
-</tr>
-<tr>
-<td>
-<pre>
-double r = TMath::Power( TMath::Power(x,2) + TMath::Power(y,2), 0.5);
-</pre>
-</td>
-<td>
-<pre>
-double r = TMath::Sqrt( x*x + y*y );
-</pre>
-</td>
-</tr>
-</table>
-The reason is that `TMath::Power` (or the C math library's `pow()`) function must take the logarithm of one of its arguments, multiply it by the other argument, and exponentiate the result. Modern CPUs have a built-in `SQRT` instruction. Modern versions of `pow()` or `Power` may check the power argument for 2 and 0.5 and instead perform multiplies and `SQRT`, but don't count on it.
-
-If the things you are squaring above are complicated expressions, use `TMath::Sq()` to eliminate the need for typing them out twice or creating temporary variables. Or worse, evaluating slow functions twice. The optimizer cannot optimize the second call to that function because it may have side effects like printing something out to the screen or updating some internal variable and you may have intended for it to be called twice.
-
-<table>
-<tr>
-<th> Code Example (BAD) </th>
-<th> Code Example (GOOD) </th>
-</tr>
-<tr>
-<td>
-<pre>
-double r = TMath::Sqrt( slow_function_calculating_x()*
-                        slow_function_calculating_x() +
-                        slow_function_calculating_y()*
-                        slow_function_calculating_y() );
-</pre>
-</td>
-<td>
-<pre>
-double r = TMath::Sqrt( TMath::Sq(slow_function_calculating_x()) +
-                        TMath::Sq(slow_function_calculating_y()));
-</pre>
-</td>
-</tr>
-</table>
- -**Don't call `sqrt()` if you don’t have to.** - -
Code Example (BAD):

~~~
if (TMath::Sqrt( x*x + y*y ) < rcut )
{
  do_something();
}
~~~
{: .source}

Code Example (GOOD):

~~~
double rcutsq = rcut*rcut;
if (x*x + y*y < rcutsq)
{
  do_something();
}
~~~
{: .source}
**Use binary search features in the STL rather than a step-by-step lookup.**

Code Example (BAD):

~~~
std::vector<double> my_vector;
(fill my_vector with stuff)

size_t indexfound = 0;
bool found = false;
for (size_t i=0; i<my_vector.size(); ++i)
{
  if (my_vector.at(i) == desired_value)
  {
    indexfound = i;
    found = true;
    break;
  }
}
~~~
{: .source}

Code Example (GOOD): if `my_vector` is kept sorted, [`std::lower_bound`][cpp-lower-bound] finds the element in logarithmic rather than linear time:

~~~
std::vector<double> my_vector;
(fill my_vector with stuff, keeping it sorted)

auto it = std::lower_bound(my_vector.begin(), my_vector.end(), desired_value);
bool found = (it != my_vector.end() && *it == desired_value);
size_t indexfound = it - my_vector.begin();
~~~
{: .source}
Code Example (BAD):

~~~
double sum = 0;
std::vector <double> results;
(fill lots of results)
for (size_t i=0; i<results.size(); ++i)
{
  float rsq = results.at(i)*results.at(i);
  sum += rsq;
}
~~~
{: .source}

Code Example (GOOD):

~~~
double sum = 0;
std::vector <double> results;
(fill lots of results)
for (size_t i=0; i<results.size(); ++i)
{
  sum += TMath::Sq(results.at(i));
}
~~~
{: .source}
**Minimize conversions between int and float or double**

The up-conversion from int to float takes time, and the down-conversion from float to int loses precision and also takes time. Sometimes you want the precision loss, but sometimes it's a mistake.

**Check for NaN and Inf.** While your program will still function if an intermediate result is `NaN` or `Inf` (and it may even produce valid output, especially if the `NaN` or `Inf` is irrelevant), processing `NaN`s and `Inf`s is slower than processing valid numbers. Letting a `NaN` or an `Inf` propagate through your calculations is almost never the right thing to do: check functions for domain validity (square roots of negative numbers, logarithms of zero or negative numbers, divide by zero, etc.) when you execute them and decide at that point what to do. If you have a lengthy computation and the end result is `NaN`, it is often ambiguous at what stage the computation failed.

**Pass objects by reference.** Especially big ones. C and C++ call semantics specify that objects are passed by value by default, meaning that the called method gets a copy of the input. This is okay for scalar quantities like int and float, but not okay for a big vector, for example. The thing to note then is that the called method may modify the contents of the passed object, while an object passed by value can be expected not to be modified by the called method.

**Use references to receive returned objects created by methods** That way they don't get copied. The example below is from the VD coldbox channel map. Bad, inefficient code courtesy of Tom Junk, and good code suggestion courtesy of Alessandro Thea. The infotochanmap object is a map of maps of maps: `std::unordered_map<int, std::unordered_map<int, std::unordered_map<int, int>>> infotochanmap;`
Code Example (BAD):

~~~
int dune::VDColdboxChannelMapService::getOfflChanFromWIBConnectorInfo(int wib, int wibconnector, int cechan)
{
  int r = -1;
  auto fm1 = infotochanmap.find(wib);
  if (fm1 == infotochanmap.end()) return r;
  auto m1 = fm1->second;
  auto fm2 = m1.find(wibconnector);
  if (fm2 == m1.end()) return r;
  auto m2 = fm2->second;
  auto fm3 = m2.find(cechan);
  if (fm3 == m2.end()) return r;
  r = fm3->second;
  return r;
}
~~~
{: .source}

Code Example (GOOD):

~~~
int dune::VDColdboxChannelMapService::getOfflChanFromWIBConnectorInfo(int wib, int wibconnector, int cechan)
{
  int r = -1;
  auto fm1 = infotochanmap.find(wib);
  if (fm1 == infotochanmap.end()) return r;
  auto& m1 = fm1->second;
  auto fm2 = m1.find(wibconnector);
  if (fm2 == m1.end()) return r;
  auto& m2 = fm2->second;
  auto fm3 = m2.find(cechan);
  if (fm3 == m2.end()) return r;
  r = fm3->second;
  return r;
}
~~~
{: .source}
**Minimize cloning TH1's.** It is really slow.

**Minimize formatted I/O.** Formatting strings for output is CPU-consuming, even if they are never printed to the screen or output to your logfile. `MF_LOG_INFO` calls for example must prepare the string for printing even if it is configured not to output it.

**Avoid using caught exceptions as part of normal program operation** While this isn't an efficiency issue or even a code readability issue, it is a problem when debugging programs. Most debuggers have a feature to set a breakpoint on thrown exceptions. This is sometimes necessary to use in order to track down a stubborn bug. Bugs that stop program execution, like segmentation faults, are sometimes easier to track down than caught exceptions (which often aren't even bugs, but sometimes they are). If many caught exceptions take place before the buggy one, then the breakpoint on thrown exceptions has limited value.

**Use sparse matrix tools where appropriate.** This also saves memory.

**Minimize database access operations.** Bundle the queries together in blocks if possible. Do not pull more information than is needed out of the database. Cache results so you don't have to repeat the same data retrieval operation.

**Use `std::vector::reserve()` in order to size your vector right if you know in advance how big it will be.** `std::vector` will, if you `push_back()` to expand it beyond its current size in memory, allocate twice the memory of the existing vector and copy the contents of the old vector to the new memory. This operation will be repeated each time you start with a zero-size vector and push_back a lot of data.

**Factorize your program into parts that do i/o and compute.** That way, if you don't need to do one of them, you can switch it off without having to rewrite everything. Example: Say you read data in from a file and make a histogram that you are sometimes interested in looking at but usually not.
The data reader should not always make the histogram by default but it should be put in a separate module which can be steered with fcl so the computations needed to calculate the items to fill the histogram can be saved. - -## Memory optimization: - -Use `valgrind`. Its default operation checks for memory leaks and invalid accesses. Search the output for the words “invalid” and “lost”. Valgrind is a `UPS` product you can set up along with everything else. It is set up as part of the dunesw stack. - -~~~ -setup valgrind -valgrind --leak-check=yes --suppressions=$ROOTSYS/etc/valgrind-root.supp myprog arg1 arg2 -~~~ -{: .source} - -More information is available [here][valgrind-quickstart]. ROOT-specific suppressions are described [here][valgrind-root]. You can omit them, but your output file will be cluttered up with messages about things that ROOT does routinely that are not bugs. - -Use `massif`. `massif` is a heap checker, a tool provided with `valgrind`; see documentation [here][valgrind-ms-manual]. - -**Free up memory after use.** Don’t hoard it after your module’s exited. - -**Don’t constantly re-allocate memory if you know you’re going to use it again right away.** - -**Use STL containers instead of fixed-size arrays, to allow for growth in size.** Back in the bad old days (Fortran 77 and earlier), fixed-size arrays had to be declared at compile time that were as big as they possibly could be, both wasting memory on average and creating artificial cutoffs on the sizes of problems that could be handled. This behavior is very easy to replicate in C++. Don’t do it. - -**Be familiar with the structure and access idioms.** These include `std::vector`, `std::map`, `std::unordered_map`, `std::set`, `std::list`. - -**Minimize the use of new and delete to reduce the chances of memory leaks.** If your program doesn’t leak memory now, that’s great, but years from now after maintenance has been transferred, someone might introduce a memory leak. 
**Use move semantics to transfer data ownership without copying it.**

**Do not store an entire event's worth of raw digits in memory all at once.** Find some way to process the data in pieces.

**Consider using more compact representations in memory.** A `float` takes half the space of a `double`. A `size_t` is 64 bits long (usually). Often that's needed, but sometimes it's overkill.

**Optimize the big uses and don't spend a lot of time on things that don't matter.** If you have one instance of a loop counter that's a `size_t` and it loops over a million vector entries, each of which is an `int`, look at the entries of the vector, not the loop counter (which ought to be on the stack anyway).

**Rebin histograms.** Some histograms, say binned in channels x ticks or channels x frequency bins for a 2D FFT plot, can get very memory hungry.

## I/O optimization:

**Do as much calculation as you can per data element read.** You can spin over a TTree once per plot, or you can spin through the TTree once and make all the plots. ROOT compresses data by default on write and uncompresses it on read-in, so this is both an I/O and a CPU issue: minimize the data that are read.

**Read only the data you need** ROOT's TTree access methods are set up to give you only the requested TBranches. If you use TTree::MakeClass to write a template analysis ROOT macro script, it will generate code that reads in _all_ TBranches and leaves. It is easy to trim out the extras to speed up your workflow.

**Saving compressed data reduces I/O time and storage needs.** Even though compressing data takes CPU, a slow disk or network can make your workflow faster overall when it trades CPU time for reduced read time.

**Stream data with xrootd** You will wait less for your first event than if you copy the file, put less stress on the data storage elements, and have more reliable i/o with dCache.
## Build time optimization:

**Minimize the number of #included files.** If you don't need an #include, don't use it. It takes time to find these files in the search path and include them.

**Break up very large source files into pieces.** `g++`'s analysis and optimization steps take an amount of time that grows faster than linearly with the number of source lines.

**Use ninja instead of make** Instructions are [here][ninjadocpageredmine].

## Workflow optimization:

**Pre-stage your datasets** It takes a lot of time to wait for a tape (sometimes hours!). CPUs are accounted by wall-clock time, whether you're using them or not. So if your jobs are waiting for data, they will run slowly even if you optimized the CPU usage. Pre-stage your data!

**Run a test job** If you have a bug, you will save time by not submitting large numbers of jobs that might not work.

**Write out your variables in your own analysis ntuples (TTrees)** You will likely have to run over the same MC and data events repeatedly, and the faster this is the better. You will have to adjust your cuts, tune your algorithms, estimate systematic uncertainties, train your deep-learning functions, debug your program, and tweak the appearance of your plots. Ideally, if the data you need to do these operations is available interactively, you will be able to perform these tasks faster. Choose a minimal set of variables to put in your ntuples to save on storage space.

**Write out histograms to ROOTfiles and decorate them in a separate script** You may need to experiment many times with borders, spacing, ticks, fonts, colors, line widths, shading, labels, titles, legends, axis ranges, etc. Best not to have to re-compute the contents when you're doing this, so save the histograms to a file first and read it in to touch it up for presentation.
## Software readability and maintainability:

**Keep the test suite up to date** dunesw and larsoft have many examples of unit tests and integration tests. A colleague's commit to your code, or even to a different piece of code or a data file, might break your code in unexpected, difficult-to-diagnose ways. The continuous integration (CI) system is there to catch such breakage, and even small changes in run time, memory consumption, and data product output.

**Keep your methods short** If you have loaded up a lot of functionality in a method, it may become hard to reuse the components to do similar things. A long method is probably doing a lot of different things that can be given meaningful names.

**Update the comments when code changes** Not many things are more confusing than an out-of-date comment that refers to how code used to work long ago.

**Update names when meaning changes** As software evolves, the meaning of the variables may shift. It may be a quick fix to change the contents of a variable without changing its name, but some variables may then contain contents that are the opposite of what the variable name implies. While the code will run, future maintainers will get confused.

**Use const frequently** The const keyword prevents overwriting variables unintentionally. Constness is how *art* protects the data in its event memory. This mechanism is exposed to the user in that pointers to const memory must be declared as pointers to const, or you will get obscure error messages from the compiler. Const can also protect you from yourself and your colleagues when you know that the contents of a variable ought not to change.

**Use simple constructs even if they are more verbose** Sometimes very clever, terse expressions get the job done, but they can be difficult for a human to understand if and when that person must make a change.
There is an [obfuscated C contest][obfuscated-C] if you want to see examples of difficult-to-read code (that may in fact be very efficient! But people time is important, too).

**Always initialize variables when you declare them** Compilers will warn about the use of uninitialized variables, so you will get used to doing this anyway. The initialization step takes a little time and it is not needed if the first use of the memory is to set the variable, which is why compilers do not automatically initialize variables.

**Minimize the scope of variables** Often a variable will only have a meaningful value inside of a loop. You can declare variables as you use them. Old languages like Fortran 77 insisted that you declare all variables at the start of a program block. This is not true in C and C++. Declaring variables inside of blocks delimited by braces means they will go out of scope when the program exits the block, both freeing the memory and preventing you from referring to the variable after the loop is done and only considering the last value it took. Sometimes this is the desired behaviour, though, and so this is not a blanket rule.

## Coding for Thread Safety

Modern CPUs often have many cores available. It is not unusual for a grid worker node to have as many as 64 cores on it, and 128 GB of RAM. Making use of the available hardware to maximize throughput is an important way to optimize our time and resources. DUNE jobs tend to be "embarrassingly parallel", in that they can be divided up into many small jobs that do not need to communicate with one another. Therefore, making use of all the cores on a grid node is usually as easy as breaking a task up into many small jobs and letting the grid schedulers work out what jobs run where. The issue, however, is effective memory usage.
If several small jobs share a lot of memory whose contents do not change (code libraries loaded into RAM, geometry description, calibration constants), then one can group the work together into a single job that uses multiple threads to get the work done faster. If the memory usage of a job is dominated by per-event data, then loading multiple events' worth of data in RAM in order to keep all the cores fed with data may not provide a noticeable improvement in the utilization of CPU time relative to memory time.

Sometimes multithreading has advantages within a trigger record. Data from different wires or APAs may be processed simultaneously. One thing software managers would like to keep controllable is the number of threads a program is allowed to spawn. Some grid sites do not have an automatic protection against a program that creates more threads than the CPUs it has requested. Instead, a human operator may notice that the load on a system is far greater than the number of cores, and track down and ban the offending job submitter (this has already happened on DUNE). If a program contains components, some of which manage their own threads, then it becomes hard to manage the total thread count in a program. Multithreaded *art* keeps track of the total thread count using TBB, or Thread Building Blocks.

See this very thorough [presentation][knoepfel-thread-safety] by Kyle Knoepfel at the 2019 LArSoft [workshop][LArSoftWorkshop2019]. Several other talks at the workshop also focus on multi-threaded software. In short, if data are shared between threads and they are mutable, this is a recipe for race conditions and non-reproducible behavior of programs. Giving each thread a separate instance of each object is one way to contain possible race conditions. Alternately, private and public class members which do not change or which have synchronized access methods can also help provide thread safety.
[cpp-lower-bound]: https://en.cppreference.com/w/cpp/algorithm/lower_bound
[gnu-manuals-gprof]: https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_mono/gprof.html
[valgrind-quickstart]: https://www.valgrind.org/docs/manual/quick-start.html
[valgrind-ms-manual]: https://www.valgrind.org/docs/manual/ms-manual.html
[ninjadocpageredmine]: https://cdcvs.fnal.gov/redmine/projects/dunetpc/wiki/_Tutorial_#Using-the-ninja-build-system-instead-of-make
[valgrind-root]: https://root-forum.cern.ch/t/valgrind-and-root/28506
[obfuscated-C]: https://www.ioccc.org/
[knoepfel-thread-safety]: https://indico.fnal.gov/event/20453/contributions/57777/attachments/36182/44065/2019-LArSoftWorkshop-ThreadSafety.pdf
[LArSoftWorkshop2019]: https://indico.fnal.gov/event/20453/timetable/?view=standard

{%include links.md%}
diff --git a/obsolete/sam-by-schellman.md b/obsolete/sam-by-schellman.md
deleted file mode 100644
index 156981f..0000000
--- a/obsolete/sam-by-schellman.md
+++ /dev/null
@@ -1,475 +0,0 @@
---
title: SAM by Schellman
teaching: 5
exercises: 0
questions:
- What event information can be queried for a given data file?
objectives:
- Learn about the utility of SAM
- Practice selected SAM commands
keypoints:
- SAM is a data catalog originally designed for the D0 and CDF experiments at FNAL and is now used widely by HEP experiments.
---

## Notes on the SAM data catalog system

These notes are provided as an ancillary resource on the topic of DUNE data management by Dr. Heidi Schellman, 1-28-2020, and updated Dec. 2021.

## Introduction

SAM is a data catalog originally designed for the D0 and CDF high energy physics experiments at Fermilab. It is now used by most of the Intensity Frontier experiments at Fermilab.

The most important objects cataloged in SAM are individual **files** and collections of files called **datasets**.
Data files themselves are not stored in SAM; their metadata is, and that metadata allows you to search for and find the actual physical files. SAM also provides mechanisms for initiating and tracking file delivery through **projects**.

### General considerations

SAM was designed to ensure that large scale data-processing was done completely and accurately, which leads to some features not always present in a generic catalog but very desirable if one wishes high standards of reproducibility and documentation in data analysis.

For example, at the time of the original design, the main storage medium was 8mm tapes using consumer-grade drives. Drive and tape failure rates were > 1%. Several SAM design concepts, notably luminosity blocks and parentage tracking, were introduced to allow accurate tracking of files and their associated normalization in a high error-rate environment.

The design goals were:

- Description of the contents of data collections to allow later retrieval
- Tracking of object and collection parentage and description of processing transformations to document the full provenance of any data **object** and ensure accurate normalization
- Grouping of objects and collections into larger “**datasets**” based on their characteristics
- Storing physical location of objects
- Tracking of the processing of collections to allow reprocessing on failure and avoid double processing
- Methods (“projects”) for delivering and tracking collections in multi-processing jobs
- Preservation of data about processing/storage for debugging/reporting

The first three goals relate to content and characteristics while the last four relate to data storage and processing tools.

### Specifics

1. The current SAM implementation uses the file as the basic unit of information. **Metadata** is associated with the file name. Filenames must be unique in the system. This prevents duplication of data in a sample, as a second copy cannot be cataloged.
This makes renaming a file very unwise. A very common practice is to include some of the metadata in the filename, both to make it easier to identify and to ensure uniqueness.

2. **Metadata** for a file can include file locations but does not have to. A file can have no location at all, or many. When you move or remove a file with an associated SAM location, you need to update the location information.

3. SAM does not move files. It provides location information for a process to use in streaming or copying a file using its own methods. Temporary locations (such as on a grid node) need not be reported to SAM. Permanently storing or removing files requires both moving/removing the file itself and updating its metadata to reflect that location, and is generally left up to special packages such as the Fermilab FTS (File Transfer Service) and SAM projects.

4. Files which are stored on disk or tape are expected to have appropriate file sizes and checksums. One can have duplicate instances of a file in different locations, but they must all be identical. If one reprocesses an input file, the output may well be subtly different (for example, dates stored in the file itself can change the checksum). SAM should force you to choose which version, old or new, is acceptable. It will not let you catalog both with the same filename. As a result, if you get a named file out of SAM, you can be reasonably certain you got the right copy.

5. Files with duplicate content but different names can be problematic. The reprocessed file mentioned in part 4, if renamed, could cause significant problems if it were allowed into the data sample along with the originals, as a job processing all files might get both copies. This is one of the major reasons for the checksums and unique filenames. There is a temptation to put, for example, timestamps in filenames to generate unique names, but that removes a protection against duplication.

6.
Files can have **parents** and **children** and, effectively, birth certificates that can tell you how they were made. An example would be a set of raw data files RAWnnn processed with code X to produce a single reconstructed file RECO. One can tell SAM that RAWnnn are the parents of RECO processed with version x of code X. If one later finds another RAWnnn file that was missed in processing, SAM can tell you it has not been processed yet with X (i.e., it has no children associated with version x of X) and you can then choose to process that file. This use case often occurs when a production job fails without reporting back or succeeds but the copy back or catalog action fails. -Note: The D0 experiment required that all official processing going into SAM be done with tagged releases and fixed parameter sets to increase reproducibility and the tags for that information were included in the metadata. Calibration databases were harder to timestamp so some variability was still possible if calibrations were updated. - -7. SAM supports several types of data description fields: - -Values are standard across all implementations like run_type, file_size … - -Parameters are defined by the experiment for example MC.Genieversion - -Values are common to almost all HEP experiments and are optimized for efficient queries. SAM also allows definition of “parameters” (by administrators) as they are needed. This allows the schema to be modified easily as needs arrive. - -8. Metadata can also contain “spill” or luminosity block information that allows a file to point to specific data taking periods with smaller granularity than a run or subrun. When files are merged, this spill information is also merged. - -9. SAM currently does not contain a means of easily determining which file a given event is in. If a daq system is writing multiple streams, an event from a given subrun could be in any stream. Adding an event database would be a useful feature. 
- -All of these features are intended to assure that your data are well described and can be found. As SAM stores full location information, this means any SAM-visible location. In addition, if parentage information is provided, you can determine and reproduce the full provenance of any file. - -## Datasets and projects - -### Datasets - -In addition to the files themselves, SAM allows you to define datasets. - -A *SAM* dataset is not a fixed list of files but a query against the SAM database. An example would be “data_tier reconstructed and run_number 2001 and version v10” which would be all files from run 2001 that are reconstructed data produced by version v10. This dataset is dynamic. If one finds a missing file from run 2001 and reconstructs it with v10, the dataset will grow. There are also dataset snapshots that are derived from datasets and capture the exact files in the dataset when the snapshot was made. -Note: most other data catalogs assume a “dataset” is a fixed list of files. This is a “snapshot” in SAM. - -**samweb** – `samweb` is the command line and python API that allows queries of the SAM metadata, creation of datasets and tools to track and deliver information to batch jobs. - -samweb can be acquired from ups via -~~~ -setup samweb_client -~~~ -{: .language-bash} - -Or installed locally via - -```bash -git clone http://cdcvs.fnal.gov/projects/sam-web-client -``` - -You then need to do something like: - -~~~ -export PATH=$HOME/sam-web-client/bin:${PATH} -export PYTHONPATH=$HOME/sam-web-client/python:${PYTHONPATH} -export SAM_EXPERIMENT=dune -~~~ -{: .language-bash} - -## Projects - -SAM also supports access tracking mechanisms called projects and consumers. These are generally implemented for you by grid processing scripts. Your job is to choose a dataset and then ask the processing system to launch a project for that dataset. - -A project is effectively a processing campaign across a dataset which is owned by the SAM system. 
At launch a snapshot is generated and then the files in the snapshot are delivered to a set of consumers. The project maintains an internal record of the status of the files and consumers. Each grid process can instantiate a consumer which is attached to the project. Those consumers then request “files” from the project and, when done processing, tell the project of their status. - -The original SAM implementation actually delivered the files to local hard drives. Modern SAM delivers the location information and expects the consumer to find the optimal delivery method. This is a pull model, where the consuming process requests the next file rather than having the file assigned to it. This makes the system more robust on distributed systems. - -See running projects [here](http://samweb.fnal.gov:8480/station_monitor/dune/stations/dune/projects). - -## Accessing the database in read mode - -Checking the database does not require special privileges but storing files and running projects modifies the database and requires authentication to the right experimental group. `kx509` authentication and membership in the experiment VO are needed. - -Administrative actions like adding new values are restricted to a small set of superusers for each experiment. - -### Suggestions for configuring SAM (for admins) - -First of all, it really is nice to have filenames and dataset names that tell you what’s in the box, although not required. The D0 and MINERvA conventions have been to use “_” underscores between useful key strings. As a result, D0 and MINERvA tried not to use “\_” in metadata entries to allow cleaner parsing. “-“ is used if needed in the metadata. - -D0 also appended processing information to filenames as they moved through the system to assure that files run through different sequences had unique identifiers. 
- -Example: A Monte Carlo simulation file generated with version v3 and then reconstructed with v5 might look like - -~~~ -SIM_MC_020000_0000_simv3.root would be a parent of RECO_MC_020000_0000_simv3_recov5.root -~~~ -{: .output} - -Data files are all children of the raw data while simulation files sometimes have more complicated ancestry, with both unique generated events and overlay events from data as parents. - -## Setting up SAM metadata (For admins) - -This needs to be done once, and very carefully, early in the experiment. It can grow but thinking hard at the beginning saves a lot of pain later. - -You need to define data_tiers. These represent the different types of data that you produced through your processing chain. Examples would be `raw`, `pedsup`, `calibrated`, `reconstructed`, `thumbnail`, `mc-generated`, `mc-geant`, `mc-overlaid`. - -`run_type` can be used to support multiple DAQ instances. - -`data_stream` is often used for trigger subsamples that you may wish to split data into (for example pedestal vs data runs). - -Generally, you want to store data from a given data_tier with other data from that tier to facilitate fast sequential access. - -## Applications - -It is useful, but not required to also define applications which are triads of “appfamily”, “appname” and “version”. Those are used to figure out what changed X to Y. There are also places to store the machine the application ran on and the start and end time for the job. - -The query: -~~~ -samweb list-files "data_tier raw and not isparentof: (data_tier reconstructed and appname reco and version 7)" -~~~ -{: .language-bash} - -Should, in principle, list raw data files not yet processed by version 7 of reco to produce files of tier reconstructed. You would use this to find lost files in your reconstruction after a power outage. 
- -It is good practice to also store the name of the head application configuration file for processing but this does not have a standard “value.” - -### Example metadata from DUNE - -Here are some examples of querying sam to get file information - -~~~ -$ samweb get-metadata np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root –json -~~~ -{: .language-bash} -~~~ -{ - "file_name": "np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root", - "file_id": 7352771, - "create_date": "2018-10-29T14:59:42+00:00", - "user": "dunepro", - "update_date": "2018-11-28T17:07:30+00:00", - "update_user": "schellma", - "file_size": 14264091111, - "checksum": [ - "enstore:1390300706", - "adler32:e8bf4e23" - ], - "content_status": "good", - "file_type": "detector", - "file_format": "artroot", - "data_tier": "full-reconstructed", - "application": { - "family": "art", - "name": "reco", - "version": "v07_08_00_03" - }, - "event_count": 108, - "first_event": 21391, - "last_event": 22802, - "start_time": "2018-10-28T17:34:58+00:00", - "end_time": "2018-10-29T14:55:42+00:00", - "data_stream": "physics", - "beam.momentum": 7.0, - "data_quality.online_good_run_list": 1, - "detector.hv_value": 180, - "DUNE_data.acCouple": 0, - "DUNE_data.calibpulsemode": 0, - "DUNE_data.DAQConfigName": "np04_WibsReal_Ssps_BeamTrig_00021", - "DUNE_data.detector_config": "cob2_rce01:cob2_rce02:cob2.. 
4 more lines of text", - "DUNE_data.febaselineHigh": 2, - "DUNE_data.fegain": 2, - "DUNE_data.feleak10x": 0, - "DUNE_data.feleakHigh": 1, - "DUNE_data.feshapingtime": 2, - "DUNE_data.inconsistent_hw_config": 0, - "DUNE_data.is_fake_data": 0, - "runs": [ - [ - 5141, - 1, - "protodune-sp" - ] - ], - "parents": [ - { - "file_name": "np04_raw_run005141_0015_dl10.root", - "file_id": 6607417 - } - ] -} -~~~ -{: .output} - -~~~ -$ samweb get-file-access-url np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root -~~~ -{: .language-bash} - -~~~ -gsiftp://eospublicftp.cern.ch/eos/experiment/neutplatform/protodune/rawdata/np04/output/detector/full-reconstructed/07/35/27/71/np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root -gsiftp://fndca1.fnal.gov:2811/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune/np04/beam/output/detector/full-reconstructed/07/35/27/71/np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root -~~~ -{: .output} - -```bash -$ samweb file-lineage children np04_raw_run005141_0015_dl10.root -``` -~~~ -np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root -~~~ -{: .output} - -```bash -$ samweb file-lineage parents -``` -~~~ -np04_raw_run005141_0015_dl10_reco_12736632_0_20181028T182951.root -np04_raw_run005141_0015_dl10.root -~~~ -{: .output} - -## Merging and splitting (for experts) - -Parentage works pretty well if one is merging files but splitting them can become problematic as it makes the parentage structure pretty complex. -SAM will let you merge files with different attributes if you don’t check carefully. Generally, it is a good idea not to merge files from different data tiers and certainly not from different data_types. Merging across major processing versions should also be avoided. 

### Example: Execute samweb Commands

Documentation is available [here](https://cdcvs.fnal.gov/redmine/projects/sam/wiki/User_Guide_for_SAM) and [here](https://cdcvs.fnal.gov/redmine/projects/sam-main/wiki/Updated_dimensions_syntax).

This exercise will get you started accessing data files that have been defined to the DUNE Data Catalog. After logging in to a DUNE interactive node and creating the directories above, execute the following commands once per session:

```bash
setup sam_web_client # (or set up your standalone version)
export SAM_EXPERIMENT=dune
```

Then, if curious about a file:

```bash
samweb locate-file np04_raw_run005141_0001_dl7.root
```

This will give you output that looks like:

~~~
rucio:protodune-sp
enstore:/pnfs/dune/tape_backed/dunepro/protodune/np04/beam/detector/None/raw/06/60/59/05(596@vr0072m8)
castor:/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/59/05
cern-eos:/eos/experiment/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/59/05
~~~
{: .output}

These are the locations of the file on disk and tape. We can use this to copy the file from tape to our local disk.
Better yet, you can use xrootd to access the file without copying it if it is staged to disk.
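If you want to feed one of these locations to another command, you can pull it out of saved `samweb locate-file` output with ordinary shell tools. This is just a sketch: `locations.txt` below is a scratch copy of the sample output above, and the trailing `(...)` tape-volume tag is stripped because it is not part of the path:

```bash
# Save the locate-file output shown above (normally you would do:
#   samweb locate-file np04_raw_run005141_0001_dl7.root > locations.txt)
cat > locations.txt <<'EOF'
rucio:protodune-sp
enstore:/pnfs/dune/tape_backed/dunepro/protodune/np04/beam/detector/None/raw/06/60/59/05(596@vr0072m8)
castor:/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/59/05
cern-eos:/eos/experiment/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/59/05
EOF

# Keep the enstore line, then drop its "enstore:" prefix and the "(volume)" tag.
enstore_path=$(grep '^enstore:' locations.txt | sed -e 's/^enstore://' -e 's/(.*)$//')
echo "$enstore_path"
# prints: /pnfs/dune/tape_backed/dunepro/protodune/np04/beam/detector/None/raw/06/60/59/05
```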
Find the xrootd URI via:

```bash
samweb get-file-access-url np04_raw_run005141_0001_dl7.root --schema=root
```
~~~
root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune/np04/beam/detector/None/raw/06/60/59/05/np04_raw_run005141_0001_dl7.root
root://castorpublic.cern.ch//castor/cern.ch/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/59/05/np04_raw_run005141_0001_dl7.root
root://eospublic.cern.ch//eos/experiment/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/59/05/np04_raw_run005141_0001_dl7.root
~~~
{: .output}

You can restrict the result to a particular location with the `--location` argument (enstore, castor, cern-eos):

```bash
samweb get-file-access-url np04_raw_run005141_0001_dl7.root --schema=root --location=enstore
```
~~~
root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune/np04/beam/detector/None/raw/06/60/59/05/np04_raw_run005141_0001_dl7.root
~~~
{: .output}

```bash
samweb get-file-access-url np04_raw_run005141_0001_dl7.root --schema=root --location=cern-eos
```
~~~
root://eospublic.cern.ch//eos/experiment/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/59/05/np04_raw_run005141_0001_dl7.root
~~~
{: .output}

To get SAM metadata for a file for which you know the name:

```bash
samweb get-metadata np04_raw_run005141_0001_dl7.root
```

Add the `--json` option to get the output in JSON format.

To list raw data files for a given run:

```bash
samweb list-files "run_number 5141 and run_type protodune-sp and data_tier raw"
```

What about a reconstructed version?

```bash
samweb list-files "run_number 5141 and run_type protodune-sp and data_tier full-reconstructed and version (v07_08_00_03,v07_08_00_04)"
```

This gives a list of files from the first production, like:

~~~
np04_raw_run005141_0001_dl7_reco_12736115_0_20181028T165152.root
~~~
{: .output}

We also group reconstruction versions into campaigns, like PDSPProd4:

```bash
samweb list-files "run_number 5141 and run_type protodune-sp and data_tier full-reconstructed and DUNE.campaign PDSPProd4"
```

This gives more recent files, like:

~~~
np04_raw_run005141_0009_dl1_reco1_18126423_0_20210318T102429Z.root
~~~
{: .output}

samweb allows you to select on many parameters.

Useful ProtoDUNE samweb parameters can be found [here](https://dune-data.fnal.gov) and [here](https://wiki.dunescience.org/wiki/ProtoDUNE-SP_datasets);
these list some official dataset definitions.

You can make your own samweb dataset definitions. First, make certain a definition does not already exist that satisfies your needs by checking the official pages above.

Then check to see what you will get:

```bash
samweb list-files "data_tier full-reconstructed and DUNE.campaign PDSPProd4 and data_stream cosmics and run_type protodune-sp and detector.hv_value 180" --summary
```
```bash
samweb create-definition $USER-PDSPProd4_good_cosmics "data_tier full-reconstructed and DUNE.campaign PDSPProd4 and data_stream cosmics and run_type protodune-sp and detector.hv_value 180"
```

Note that your username is required at the start of the definition name - this prevents user-defined samples from being confused with official ones.

### Prestaging

At CERN, files are on either EOS or CASTOR. At FNAL they can be in tape-backed dCache, which may mean they are on tape and need to be prestaged to disk before access.
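When deciding whether a sample needs prestaging, you can parse a saved locality report rather than eyeballing it. This is only a sketch: `stage_report.txt` stands in for captured `sam_validate_dataset --locality ... --stage_status` output, and the numbers are illustrative:

```bash
# stage_report.txt stands in for a captured sam_validate_dataset report.
cat > stage_report.txt <<'EOF'
 Staging status for: file np04_raw_run005141_0015_dl10.root
 Total Files: 1
 Percent files on disk: 0%
ONLINE: 0
NEARLINE: 1
EOF

# Pull out the "Percent files on disk" number and decide whether to prestage.
on_disk=$(grep 'Percent files on disk' stage_report.txt | sed 's/[^0-9]//g')
if [ "$on_disk" -lt 100 ]; then
  echo "prestage needed (${on_disk}% on disk)"
else
  echo "fully staged"
fi
# prints: prestage needed (0% on disk)
```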
First set up `fife_utils`:

```bash
setup fife_utils # a new version we requested
```

### Check to see if a file is on tape or disk

```bash
sam_validate_dataset --locality --file np04_raw_run005141_0015_dl10.root --location=/pnfs/ --stage_status
```
~~~
 Staging status for: file np04_raw_run005141_0015_dl10.root
 Total Files: 1
 Tapes spanned: 1
 Percent files on disk: 0%
Percent bytes online DCache: 0%
locality counts:
ONLINE: 0
NEARLINE: 1
NEARLINE_size: 8276312581
~~~
{: .output}

Oops - this one is not on disk.

A sample that is available on disk returns `ONLINE_AND_NEARLINE` counts instead:

```bash
sam_validate_dataset --locality --name=schellma-1GeVMC-test --stage_status --location=/pnfs/
```
~~~
 Staging status for: defname:schellma-1GeVMC-test
 Total Files: 140
 Tapes spanned: 10
 Percent files on disk: 100%
Percent bytes online DCache: 100%
locality counts:
ONLINE: 0
ONLINE_AND_NEARLINE: 140
ONLINE_AND_NEARLINE_size: 270720752891
~~~
{: .output}

No `ONLINE_AND_NEARLINE` count means you need to prestage that file. Unfortunately, prestaging requires a definition. Let's find some for run 5141. Your physics group should already have some defined.

The official ProtoDUNE dataset definitions are [here](https://wiki.dunescience.org/wiki/ProtoDUNE-SP_datasets).

This definition is simulation for 10% of the total sample:

```bash
samweb describe-definition PDSPProd4a_MC_1GeV_reco1_sce_datadriven_v1_00
```

~~~
Definition Name: PDSPProd4a_MC_1GeV_reco1_sce_datadriven_v1_00
 Definition Id: 635109
 Creation Date: 2021-08-02T16:57:20+00:00
 Username: dunepro
 Group: dune
 Dimensions: run_type 'protodune-sp' and file_type mc and data_tier 'full-reconstructed' and dune.campaign PDSPProd4a and dune_mc.beam_energy 1 and
dune_mc.space_charge yes and dune_mc.generators beam_cosmics and version v09_17_01 and run_number in 18800650,.....
~~~
{: .output}

```bash
samweb list-files "defname:PDSPProd4a_MC_1GeV_reco1_sce_datadriven_v1_00" --summary
```
~~~
File count: 5025
Total size: 9683195368818
Event count: 50250
~~~
{: .output}

```bash
samweb prestage-dataset --def=PDSPProd4a_MC_1GeV_reco1_sce_datadriven_v1_00 --parallel=10
```

would prestage the full dataset, and you can check on the status by going [here](http://samweb.fnal.gov:8480/station_monitor/dune/stations/dune/projects) and scrolling down to find your prestage link.

### At CERN

You can find local copies of files at CERN for interactive use.

```bash
samweb list-file-locations --defname=runset-5141-raw-180kV-7GeV-v0 --schema=root --filter_path=castor
```

gives you:

~~~
root://castorpublic.cern.ch//castor/cern.ch/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/74/16/np04_raw_run005141_0015_dl3.root castor:/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/74/16 np04_raw_run005141_0015_dl3.root 8289321123
root://castorpublic.cern.ch//castor/cern.ch/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/74/17/np04_raw_run005141_0015_dl10.root castor:/neutplatform/protodune/rawdata/np04/detector/None/raw/06/60/74/17 np04_raw_run005141_0015_dl10.root 8276312581
~~~
{: .output}


{%include links.md%}
From 25241f72e6b0bbdf7290056e30593f471ae5606b Mon Sep 17 00:00:00 2001
From: Heidi Schellman <33669005+hschellman@users.noreply.github.com>
Date: Fri, 29 Aug 2025 16:38:11 -0700
Subject: [PATCH 4/5] more cleanup

---
 _episodes/07-grid-job-submission.md | 38 ++++++++++++++---------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/_episodes/07-grid-job-submission.md b/_episodes/07-grid-job-submission.md
index 44c52f7..f3f4d32 100644
--- a/_episodes/07-grid-job-submission.md
+++ b/_episodes/07-grid-job-submission.md
@@ -66,8 +66,8 @@ The past few months have seen significant changes in how DUNE (as well as other 
First, log in to a `dunegpvm`
machine . Then you will need to set up the job submission tools (`jobsub`). If you set up `dunesw` it will be included, but if not, you need to do ```bash -mkdir -p /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_may2023 # if you have not done this before -mkdir -p /pnfs/dune/scratch/users/${USER}/may2023tutorial +mkdir -p /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_sep2025 # if you have not done this before +mkdir -p /pnfs/dune/scratch/users/${USER}/sep2025tutorial ``` Having done that, let us submit a prepared script: @@ -183,8 +183,8 @@ You will have to change the last line with your own submit file instead of the p First, we should make a tarball. Here is what we can do (assuming you are starting from /exp/dune/app/users/username/): ```bash -cp /exp/dune/app/users/kherner/setupmay2023tutorial-grid.sh /exp/dune/app/users/${USER}/ -cp /exp/dune/app/users/kherner/may2023tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid /exp/dune/app/users/${USER}/may2023tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid +cp /exp/dune/app/users/kherner/setupsep2025tutorial-grid.sh /exp/dune/app/users/${USER}/ +cp /exp/dune/app/users/kherner/sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid /exp/dune/app/users/${USER}/sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid ``` Before we continue, let's examine these files a bit. We will source the first one in our job script, and it will set up the environment for us. @@ -192,7 +192,7 @@ Before we continue, let's examine these files a bit. We will source the first on ~~~ #!/bin/bash -DIRECTORY=may2023tutorial +DIRECTORY=sep2025tutorial # we cannot rely on "whoami" in a grid job. We have no idea what the local username will be. # Use the GRID_USER environment variable instead (set automatically by jobsub). 
USERNAME=${GRID_USER} @@ -213,37 +213,37 @@ Now let's look at the difference between the setup-grid script and the plain set Assuming you are currently in the /exp/dune/app/users/username directory: ```bash -diff may2023tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup may2023tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid +diff sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof/setup-grid ``` ~~~ -< setenv MRB_TOP "/exp/dune/app/users//may2023tutorial" -< setenv MRB_TOP_BUILD "/exp/dune/app/users//may2023tutorial" -< setenv MRB_SOURCE "/exp/dune/app/users//may2023tutorial/srcs" -< setenv MRB_INSTALL "/exp/dune/app/users//may2023tutorial/localProducts_larsoft_v09_72_01_e20_prof" +< setenv MRB_TOP "/exp/dune/app/users//sep2025tutorial" +< setenv MRB_TOP_BUILD "/exp/dune/app/users//sep2025tutorial" +< setenv MRB_SOURCE "/exp/dune/app/users//sep2025tutorial/srcs" +< setenv MRB_INSTALL "/exp/dune/app/users//sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof" --- -> setenv MRB_TOP "${INPUT_TAR_DIR_LOCAL}/may2023tutorial" -> setenv MRB_TOP_BUILD "${INPUT_TAR_DIR_LOCAL}/may2023tutorial" -> setenv MRB_SOURCE "${INPUT_TAR_DIR_LOCAL}/may2023tutorial/srcs" -> setenv MRB_INSTALL "${INPUT_TAR_DIR_LOCAL}/may2023tutorial/localProducts_larsoft_v09_72_01_e20_prof" +> setenv MRB_TOP "${INPUT_TAR_DIR_LOCAL}/sep2025tutorial" +> setenv MRB_TOP_BUILD "${INPUT_TAR_DIR_LOCAL}/sep2025tutorial" +> setenv MRB_SOURCE "${INPUT_TAR_DIR_LOCAL}/sep2025tutorial/srcs" +> setenv MRB_INSTALL "${INPUT_TAR_DIR_LOCAL}/sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof" ~~~ As you can see, we have switched from the hard-coded directories to directories defined by environment variables; the `INPUT_TAR_DIR_LOCAL` variable will be set for us (see below). -Now, let's actually create our tar file. Again assuming you are in `/exp/dune/app/users/kherner/may2023tutorial/`: +Now, let's actually create our tar file. 
Again assuming you are in `/exp/dune/app/users/kherner/sep2025tutorial/`: ```bash -tar --exclude '.git' -czf may2023tutorial.tar.gz may2023tutorial/localProducts_larsoft_v09_72_01_e20_prof may2023tutorial/work setupmay2023tutorial-grid.sh +tar --exclude '.git' -czf sep2025tutorial.tar.gz sep2025tutorial/localProducts_larsoft_v09_72_01_e20_prof sep2025tutorial/work setupsep2025tutorial-grid.sh ``` Note how we have excluded the contents of ".git" directories in the various packages, since we don't need any of that in our jobs. It turns out that the .git directory can sometimes account for a substantial fraction of a package's size on disk! Then submit another job (in the following we keep the same submit file as above): ```bash -jobsub_submit -G dune --mail_always -N 1 --memory=2500MB --disk=2GB --expected-lifetime=3h --cpu=1 --tar_file_name=dropbox:///exp/dune/app/users//may2023tutorial.tar.gz --singularity-image /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest --append_condor_requirements='(TARGET.HAS_Singularity==true&&TARGET.HAS_CVMFS_dune_opensciencegrid_org==true&&TARGET.HAS_CVMFS_larsoft_opensciencegrid_org==true&&TARGET.CVMFS_dune_opensciencegrid_org_REVISION>=1105&&TARGET.HAS_CVMFS_fifeuser1_opensciencegrid_org==true&&TARGET.HAS_CVMFS_fifeuser2_opensciencegrid_org==true&&TARGET.HAS_CVMFS_fifeuser3_opensciencegrid_org==true&&TARGET.HAS_CVMFS_fifeuser4_opensciencegrid_org==true)' -e GFAL_PLUGIN_DIR=/usr/lib64/gfal2-plugins -e GFAL_CONFIG_DIR=/etc/gfal2.d file:///exp/dune/app/users/kherner/run_may2023tutorial.sh +jobsub_submit -G dune --mail_always -N 1 --memory=2500MB --disk=2GB --expected-lifetime=3h --cpu=1 --tar_file_name=dropbox:///exp/dune/app/users//sep2025tutorial.tar.gz --singularity-image /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest 
--append_condor_requirements='(TARGET.HAS_Singularity==true&&TARGET.HAS_CVMFS_dune_opensciencegrid_org==true&&TARGET.HAS_CVMFS_larsoft_opensciencegrid_org==true&&TARGET.CVMFS_dune_opensciencegrid_org_REVISION>=1105&&TARGET.HAS_CVMFS_fifeuser1_opensciencegrid_org==true&&TARGET.HAS_CVMFS_fifeuser2_opensciencegrid_org==true&&TARGET.HAS_CVMFS_fifeuser3_opensciencegrid_org==true&&TARGET.HAS_CVMFS_fifeuser4_opensciencegrid_org==true)' -e GFAL_PLUGIN_DIR=/usr/lib64/gfal2-plugins -e GFAL_CONFIG_DIR=/etc/gfal2.d file:///exp/dune/app/users/kherner/run_sep2025tutorial.sh ``` You'll see this is very similar to the previous case, but there are some new options: -* `--tar_file_name=dropbox://` automatically **copies and untars** the given tarball into a directory on the worker node, accessed via the INPUT_TAR_DIR_LOCAL environment variable in the job. The value of INPUT_TAR_DIR_LOCAL is by default $CONDOR_DIR_INPUT/name_of_tar_file_without_extension, so if you have a tar file named e.g. may2023tutorial.tar.gz, it would be $CONDOR_DIR_INPUT/may2023tutorial. +* `--tar_file_name=dropbox://` automatically **copies and untars** the given tarball into a directory on the worker node, accessed via the INPUT_TAR_DIR_LOCAL environment variable in the job. The value of INPUT_TAR_DIR_LOCAL is by default $CONDOR_DIR_INPUT/name_of_tar_file_without_extension, so if you have a tar file named e.g. sep2025tutorial.tar.gz, it would be $CONDOR_DIR_INPUT/sep2025tutorial. * Notice that the `--append_condor_requirements` line is longer now, because we also check for the fifeuser[1-4]. opensciencegrid.org CVMFS repositories. The submission output will look something like this: @@ -258,7 +258,7 @@ Could not locate uploaded file on RCDS. Will retry in 30 seconds. Could not locate uploaded file on RCDS. Will retry in 30 seconds. Found uploaded file on RCDS. Transferring files to web sandbox... 
-Copying file:///nashome/k/kherner/.cache/jobsub_lite/js_2023_05_24_224713_9669e535-daf9-496f-8332-c6ec8a4238d9/run_may2023tutorial.sh [DONE] after 0s +Copying file:///nashome/k/kherner/.cache/jobsub_lite/js_2023_05_24_224713_9669e535-daf9-496f-8332-c6ec8a4238d9/run_sep2025tutorial.sh [DONE] after 0s Copying file:///nashome/k/kherner/.cache/jobsub_lite/js_2023_05_24_224713_9669e535-daf9-496f-8332-c6ec8a4238d9/simple.cmd [DONE] after 0s Copying file:///nashome/k/kherner/.cache/jobsub_lite/js_2023_05_24_224713_9669e535-daf9-496f-8332-c6ec8a4238d9/simple.sh [DONE] after 0s Submitting job(s). From 084565e9fd90ba3ddf9aebfa7a3ccda10dbd7bc6 Mon Sep 17 00:00:00 2001 From: Heidi Schellman <33669005+hschellman@users.noreply.github.com> Date: Wed, 3 Sep 2025 17:23:09 -0700 Subject: [PATCH 5/5] modernize and remove old episodes --- _episodes/02-submit-jobs-w-justin.md | 7 +++++-- _episodes/07-grid-job-submission.md | 15 +++++++++++---- index.md | 5 +++-- 3 files changed, 19 insertions(+), 8 deletions(-) diff --git a/_episodes/02-submit-jobs-w-justin.md b/_episodes/02-submit-jobs-w-justin.md index afe401c..a31fe63 100644 --- a/_episodes/02-submit-jobs-w-justin.md +++ b/_episodes/02-submit-jobs-w-justin.md @@ -12,9 +12,12 @@ keypoints: # PLEASE USE THE NEW [justIn](https://dunejustin.fnal.gov) SYSTEM INSTEAD OF POMS -__The [justIn](https://dunejustin.fnal.gov) Tutorial is currently in docdb at: [JustIn Tutorial](https://docs.dunescience.org/cgi-bin/sso/RetrieveFile?docid=30145)__ +__A simple [justIn](https://dunejustin.fnal.gov) Tutorial is currently in docdb at: [JustIn Tutorial](https://docs.dunescience.org/cgi-bin/sso/RetrieveFile?docid=30145)__ -The [justIn](https://dunejustin.fnal.gov) system is describe in detail at: +A more detailed tutorial is available at: +[JustIn Docs](https://dunejustin.fnal.gov/docs/) + +The [justIn](https://dunejustin.fnal.gov) system is described in detail at: __[JustIn Home](https://dunejustin.fnal.gov/dashboard/)__ diff --git 
a/_episodes/07-grid-job-submission.md b/_episodes/07-grid-job-submission.md
index f3f4d32..51e4f96 100644
--- a/_episodes/07-grid-job-submission.md
+++ b/_episodes/07-grid-job-submission.md
@@ -1,5 +1,5 @@
---
-title: Grid Job Submission and Common Errors
+title: Jobsub Grid Job Submission and Common Errors - still 2024 version
teaching: 65
exercises: 0
questions:
@@ -44,7 +44,9 @@ This lesson (07-grid-job-submission.md) was imported from the [Jan. 2023 lesson]

Quiz blocks are added at the bottom of this page, and invite your review, modify, review, and additional comments.

-The official timetable for this training event is on the [Indico site](https://indico.fnal.gov/event/59762/timetable/#20230524).
+


## Notes on changes in the 2023/2024 versions

@@ -65,15 +67,18 @@ The past few months have seen significant changes in how DUNE (as well as other 

First, log in to a `dunegpvm` machine . Then you will need to set up the job submission tools (`jobsub`). If you set up `dunesw` it will be included, but if not, you need to do

-```bash
+~~~
mkdir -p /pnfs/dune/scratch/users/${USER}/DUNE_tutorial_sep2025 # if you have not done this before
mkdir -p /pnfs/dune/scratch/users/${USER}/sep2025tutorial
-```
+~~~
+{: .language-bash}
+

Having done that, let us submit a prepared script:

~~~
jobsub_submit -G dune --mail_always -N 1 --memory=1000MB --disk=1GB --cpu=1 --expected-lifetime=1h --singularity-image /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest --append_condor_requirements='(TARGET.HAS_Singularity==true&&TARGET.HAS_CVMFS_dune_opensciencegrid_org==true&&TARGET.HAS_CVMFS_larsoft_opensciencegrid_org==true&&TARGET.CVMFS_dune_opensciencegrid_org_REVISION>=1105)' -e GFAL_PLUGIN_DIR=/usr/lib64/gfal2-plugins -e GFAL_CONFIG_DIR=/etc/gfal2.d file:///exp/dune/app/users/kherner/submission_test_singularity.sh
~~~
+{: .language-bash}

If all goes well you should see something like this:

@@ -112,6 +117,7 @@ Complete the authentication at:
No web open command defined,
please copy/paste the above to any web browser
Waiting for response in web browser
~~~
+{: .output}

The user code will be different of course. In this particular case, you do want to follow the instructions and copy and paste the link into your browser (can be any browser). There is a time limit on it so its best to do it right away. Always choose Fermilab as the identity provider in the menu, even if your home institution is listed. After you hit log on, you'll get a message saying you approved the access request, and then after a short delay (may be several seconds) in the terminal you will see

@@ -125,6 +131,7 @@ Submitting job(s)
.
1 job(s) submitted to cluster 57110235.
~~~
+{: .output}

Now, let's look at some of these options in more detail.

diff --git a/index.md b/index.md
index 5918721..e00e8bf 100644
--- a/index.md
+++ b/index.md
@@ -48,8 +48,9 @@ By the end of this workshop, participants will know how to:

There are additional materials provided that explain how to:

-* [Develop configuration files to control batch jobs]({{ site.baseurl }}/07-grid-job-submission)
-* [Use the [justIn](https://dunejustin.fnal.gov) system to process data]({{ site.baseurl }}/02-submit-jobs-w-justin)
+* Use the [justIn](https://dunejustin.fnal.gov) system to process data
+* [Develop configuration files to control jobsub batch jobs]({{ site.baseurl }}/07-grid-job-submission)
+

You will need to be a DUNE Collaborator (listed member), and have a valid FNAL or CERN computing account to join the tutorial. Contact your DUNE group leader for assistance.