.. _running on cerns htcondor:

Tip: Running on CERN's HTCondor
===============================

`CERN HTCondor`_ provides infrastructure to facilitate complex computations by
means of distributed parallel processing. This page describes the procedure for
running NA64sw as a batch job on HTCondor.

.. _`CERN HTCondor`: https://batchdocs.web.cern.ch/

Generally, steering parallel jobs on a batch system with distributed nodes is a
complicated task in itself because of the variety of possible scenarios one may
imagine for multi-staged processing tasks. Such a task is usually handled by a
workload-management system (WMS) like Pegasus_, Panda_, etc.

.. _Pegasus: https://pegasus.isi.edu/
.. _Panda: https://panda-wms.readthedocs.io/en/latest/

However, so far NA64's needs do not imply frequent use of multi-staged
scenarios. On this page we provide just a simple recipe for running a few
pipeline processes, which is enough to perform an NA64sw pipeline analysis on a
few chunks or a few runs (a typical need).

Set-up
------

It is possible to run jobs without any WMS, using just a CERN IT account and a
public build of NA64sw.

Let's consider an alignment task: it typically requires quite a few collected
tracks to provide a representative picture, since certain detectors are not
well illuminated, so at least a few chunks have to be processed. Running the
pipeline application locally would be tedious.

Roughly, the following workflow is assumed:

1. We start with a certain ``placements.txt``, pipeline settings in a
   ``run.yaml`` file and a list of files we would like to process as an input.
   We would also like to tag the submitted jobs to keep everything related to
   this task in one place.
2. We create a job submission with a *submission script* that tells HTCondor
   what to run and how.
3. Jobs run and generate output asynchronously. At some point they finish and
   we can harvest the output from the *output dir*.

The alignment task is just an example; it should help you grasp the idea.

Workspace dir
~~~~~~~~~~~~~

To start, let's consider a directory that we will further refer to as the
*workspace dir* (not to be confused with the AFS Workspace dir!). A *workspace
dir* for job submission will serve as a place to store submission files, job
logs and other service information. It must be created on one of CERN's shared
filesystems, like ``/afs``, because all the started remote jobs must be able to
access this share. One can safely use the home or *AFS workspace* dir for that
purpose:

.. code-block:: shell

    $ cd ~
    $ mkdir alignment_workspace
    $ cd alignment_workspace

The job-submission script will reside in this brand-new directory. We will also
put here the job logs and the documents that we are going to change across the
submitted tasks.

.. warning::

    Currently the CERN infrastructure *does not* support submission from
    ``/eos`` shares, so we are restricted to ``/afs`` only.

In the alignment task we are going to change the ``placements.txt`` file that
contains the geometrical information of the setup. This is done by overriding
the calibration information with a custom ``placements.txt`` file and a custom
calibrations config (the ``-c calibrations.yaml`` option of the app). Copy the
files to the *workspace dir*:

.. code-block:: shell

    $ cp $NA64SW_PREFIX/share/na64sw/calibrations.yaml .
    $ cp $NA64SW_PREFIX/var/src/na64sw/presets/calibrations/override/2022/placements.txt.bck placements.txt

.. note::

    We are using the ``$NA64SW_PREFIX`` environment variable here, assuming
    that you have ``source``'d the ``.../this-env.sh`` script from one of the
    public builds.
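For instance, using the public build referenced later in the submission script
(substitute the path of whatever build you actually use), setting up the
environment on lxplus may look like:

.. code-block:: shell

    $ source /afs/cern.ch/work/r/rdusaev/public/na64/sw/LCG_101/x86_64-centos7-gcc11-opt/this-env.sh
    $ echo $NA64SW_PREFIX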
We will customize the ``calibrations.yaml`` a bit later.

Output dir
~~~~~~~~~~

Choose some directory for the output data. Typically we have to store the
``processed.root`` file; for the alignment task we will also need to save the
``alignment.dat`` file produced by the ``CollectAlignmentData`` handler. It is
better to keep this data on ``/eos``, as this share is large enough, optimized
for medium- and large-sized files and pretty easy to access (via CERNBox, for
instance). We will further refer to this dir as the *output dir*.

Submission script
-----------------

We will use the Bourne shell here, but this is not mandatory, of course.
Consider the following script as a useful snippet that has to be customized to
match your *workspace* and *output* dirs. It is assumed that the submission
script is run from the *workspace dir* -- create a file called ``submit.sh`` in
your *workspace dir* and copy/paste this content:

.. code-block:: bash

    #!/bin/bash
    # Batch submission script for the NA64sw alignment procedure.
    #
    # Will run the `na64sw-pipe' app on HTCondor on every file provided in
    # `in.txt' using the given runtime config. Expects two command line
    # arguments: a tag that shall uniquely identify this batch procedure and a
    # path to the placements.txt file to use.
    TAG=$1
    PLACEMENTS=$2
    # Variables to customize
    # - path to NA64sw build (this-env.sh script must be available in this dir)
    NA64SWPRFX=/afs/cern.ch/work/r/rdusaev/public/na64/sw/LCG_101/x86_64-centos7-gcc11-opt
    # - runtime config to use
    RCFG=$(readlink -f run.yaml)
    # - list of input files; must contain one full path per line
    FILESLIST=$(readlink -f ./in.txt)
    # - where to put results (processed.root + alignment.dat)
    OUTDIR=/eos/user/r/rdusaev/autosync/na64/alignment.out/$TAG
    # - where to put the logs and submission file
    SUBMITTED_DIR=./submitted/$TAG

    mkdir -p $OUTDIR
    mkdir -p $SUBMITTED_DIR

    # The job script expects these files under fixed names in the output dir
    cp $PLACEMENTS $OUTDIR/placements.txt
    cp calibrations.yaml $OUTDIR
    cp $RCFG $OUTDIR/run.yaml

    NJOBS=$(cat in.txt | wc -l)

    cat <<-EOF > $SUBMITTED_DIR/job.sub
    universe = vanilla
    should_transfer_files = NO
    transfer_output_files =
    max_transfer_output_mb = 2048
    +JobFlavour = "microcentury"
    output = $SUBMITTED_DIR/\$(Process).out.txt
    error = $SUBMITTED_DIR/\$(Process).err.txt
    executable = `readlink -f run-alignment-single.sh`
    log = $SUBMITTED_DIR/htcondor-log.txt
    environment = "HTCONDOR_JOBINDEX=\$(Process)"
    arguments = $NA64SWPRFX $FILESLIST $OUTDIR
    queue $NJOBS
    EOF

    condor_submit -batch-name na64al-$TAG $SUBMITTED_DIR/job.sub

The script expects two arguments to be provided on the command line: a *tag*
that will uniquely identify the task to run and a path to the
``placements.txt`` file in use. The *tag* is assumed to be a meaningful,
human-readable string without spaces that is then used to create directories
specific to the task. For instance, if I align MM03 in the 2022-mu run, I will
do it iteratively, and it would be convenient to have the output ordered as
``022mu-MM03-01``, ``022mu-MM03-02``, ``022mu-MM03-03``, etc. When I switch to,
say, ``ST01``, I would like to have directories tagged as ``022mu-ST01-01``,
``022mu-ST01-02``, etc.

As input the script uses the ``in.txt`` file containing absolute paths to the
experiment's data -- one path per line. Such a file can be produced with,
e.g.:

.. code-block:: shell

    $ ls /eos/experiment/na64/data/cdr/cdr01*-006030.dat > in.txt

Choose the run number you would like to process, or concatenate a few runs into
one file with ``>>``, as shown below.
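For example, to process the chunks of two runs in a single submission one may
append a second listing (the second run number here is purely illustrative):

.. code-block:: shell

    $ ls /eos/experiment/na64/data/cdr/cdr01*-006030.dat >  in.txt
    $ ls /eos/experiment/na64/data/cdr/cdr01*-006031.dat >> in.txt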
The script creates sub-directories named after the *tag* in your *output dir*
and *workspace dir*. The point of splitting the whole thing into two
directories is that you are generally not interested in whatever is produced in
the *workspace dir* -- this is service information that one would like to
delete as soon as possible, while the *output dir* is for our glorious
analysis, which we would like to keep forever and bequeath to posterity.

Job script and job environment
------------------------------

The last part of the script (a heredoc_) generates a file with a peculiar
syntax specific to the HTCondor system, called a ClassAd_. It is a set of
job-submission instructions that defines things like:

* what executable or script is to be run in parallel (``executable = ...``);
* what the arguments of this executable or script are (``arguments = ...``).

There are more parameters in the ClassAd related to the job configuration; some
of them may be much more important than appears at first glance (like
``+JobFlavour``).

.. _heredoc: https://tldp.org/LDP/abs/html/here-docs.html
.. _ClassAd: https://htcondor.readthedocs.io/en/latest/classads/classad-mechanism.html

We set ``executable`` to the absolute path of ``run-alignment-single.sh``
(obtained with ``readlink -f``). This is another ("*job*") script, the one that
will actually run on the HTCondor node. Its purpose is to set up the
environment, take one line from ``in.txt``, forward execution to the pipeline
executable and move the output files to the *output dir*.

The point is that when HTCondor runs a job, the process operates in a quite
restrictive environment: the node provides a limited amount of CPU, RAM and
disk space, and even network transfers and bandwidth are limited. On the other
hand, ``na64sw-pipe`` is a general-purpose application configured primarily
from the command line, so we want it to run within some shell environment.

Example content of ``run-alignment-single.sh``:

.. code-block:: bash

    #!/bin/bash

    # Protects against running the script outside of an HTCondor job
    if [ -z "${HTCONDOR_JOBINDEX+x}" ] ; then
        echo "HTCONDOR_JOBINDEX variable is not set."
        exit 2
    fi

    # 1st arg must be the NA64sw prefix, the second is the files list, the
    # third is the destination dir. These are usually provided by the
    # submission file.
    NA64SWPRFX=$1
    FILESLIST=$2
    OUTDIR=$3

    # Set up the environment
    source $NA64SWPRFX/this-env.sh
    echo "Using environment from $NA64SW_PREFIX"

    cp $OUTDIR/placements.txt .
    cp $OUTDIR/calibrations.yaml .
    cp $OUTDIR/run.yaml .

    # Get the filename to operate with
    NLINE=$((HTCONDOR_JOBINDEX+1))
    SUBJFILE=$(sed "${NLINE}q;d" $FILESLIST)
    echo "Job #${NLINE}, file ${SUBJFILE}"

    # Run the pipeline app
    $NA64SWPRFX/bin/na64sw-pipe -r run.yaml -c calibrations.yaml \
        -m genfit-handlers -EGenFit2EvDisplay.enable=no \
        -EROOTSetupGeometry.magnetsCfg=$NA64SWPRFX/var/src/na64sw/extensions/handlers-genfit/presets/magnets-2022.yaml \
        -N 15000 \
        --event-buffer-size-kb 4096 \
        $SUBJFILE

    # Delete the copies of the input files to not clutter the submission dir
    # (otherwise, HTCondor may copy them back for every job).
    rm -f placements.txt calibrations.yaml run.yaml

    # If there is an `alignment.dat' file available, compress and copy it
    if [ -f alignment.dat ] ; then
        gzip alignment.dat
        cp alignment.dat.gz $OUTDIR/alignment-$((HTCONDOR_JOBINDEX+1)).dat.gz
    fi

    # Copy the `processed.root' output to the destination dir
    cp processed.root $OUTDIR/processed-$((HTCONDOR_JOBINDEX+1)).root
    rm -f processed.root alignment.dat.gz

    echo Done.

So the script does what we have mentioned previously and also performs some
housekeeping for the demonstration: the ``alignment.dat`` output files are
compressed with ``gzip``.
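Since the script refuses to run when ``HTCONDOR_JOBINDEX`` is not set, one can
also test it interactively before submitting by setting the variable by hand. A
minimal sketch, assuming the *workspace dir* already holds ``placements.txt``,
``calibrations.yaml`` and ``run.yaml``, and using a hypothetical ``test`` tag
together with the example paths from the submission script above:

.. code-block:: shell

    $ mkdir -p /eos/user/r/rdusaev/autosync/na64/alignment.out/test
    $ cp placements.txt calibrations.yaml run.yaml /eos/user/r/rdusaev/autosync/na64/alignment.out/test/
    $ HTCONDOR_JOBINDEX=0 ./run-alignment-single.sh \
          /afs/cern.ch/work/r/rdusaev/public/na64/sw/LCG_101/x86_64-centos7-gcc11-opt \
          $(readlink -f in.txt) \
          /eos/user/r/rdusaev/autosync/na64/alignment.out/test

Job index ``0`` corresponds to the first line of ``in.txt``; the output is
written to the given (here, test) *output dir* just as for a batch job.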
Note that the job script also copies ``placements.txt``, ``calibrations.yaml``
and ``run.yaml`` to its cwd (``.``) -- this is the job-execution dir, local to
the node, where the process actually keeps all its files. One can also
customize the number of events being read and other invocation parameters
within this script.

Conclusion
----------

So, the detailed workflow is:

* Having the following input:

  - A pipeline run-config (``run.yaml``, kept locally in the *workspace dir*)
  - A geometry file (``placements.txt``)
  - A list of chunks to process (``in.txt``)
  - A *tag* for the set of jobs being submitted

* One calls ``submit.sh`` from the *workspace dir*, providing the *tag* and the
  path to the ``placements.txt`` file
* ``submit.sh`` creates the tagged dirs (a service one in the *workspace dir*
  and one for the expected output in the *output dir*), generates the ClassAd
  file and submits it to HTCondor for parallel processing:

  .. code-block:: shell

      $ ./submit.sh 2022mu-MM03-01 some/where/my-placements.txt

* Jobs will then start asynchronously and do whatever is written in
  ``run-alignment-single.sh`` (copy files, run the pipeline, copy the output,
  clean up, etc.). You can inspect what is running with the ``condor_q``
  command on an lxplus node.
* Once all the jobs are done, you will end up with a bunch of
  ``processed-<N>.root`` and ``alignment-<N>.dat.gz`` files in the tagged
  *output dir*.

Typically it takes 10-20 minutes to perform all the processing, providing you
with an amount of data incomparable with what local running could give.

To merge the ``processed-<N>.root`` files into a single one with summed
histograms, consider the ``hadd`` utility from ROOT. Reading ``gzip``'ped
files is supported by the Python scripts of NA64sw.
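For instance, merging all the chunks of one tagged submission could look like
this (the path follows the ``OUTDIR`` convention from the submission script
above; adjust it to your own *output dir*):

.. code-block:: shell

    $ hadd merged.root \
          /eos/user/r/rdusaev/autosync/na64/alignment.out/2022mu-MM03-01/processed-*.root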