Tier3s at Compute Canada

This is only available to ATLAS-Canada members with Compute Canada accounts.

Documentation

Support

  • Remember to provide as much information as possible (cut and paste the commands and their output) and describe how to reproduce the issue starting from login. Also check whether the problem exists for you on lxplus.
  • ATLAS users should first request help from the DAST shifters (mail to hn-atlas-dist-analysis-help@cern.ch) who can diagnose issues.
  • If needed, and the problem is confirmed to be site related, the DAST shifters may then ask you to contact the Tier3 admins (Compute Canada support); they may also provide you with relevant information to pass on.

Access

Currently, only Cedar is set up for ATLAS-Canada Tier-3 usage. You need a Compute Canada (CC) username and password to log in. Your CC account is usually sponsored by an academic staff member (your professor), and you belong to your sponsor's group. To get an ATLAS environment, follow these steps:

  • Interactive login (or host) machine: to access, ssh to the cluster you want to use.
    • ssh cedar.computecanada.ca (for Cedar)
    • ALERT! If accessing the cluster for the first time, please see the First Time Configurations section below for what to set up after login.
    • This is where you will do any editing and batch job submissions. You will need the next step (a container) to run ATLAS jobs or tools.
  • Container: then type the following command to get a container that is ATLAS-ready.
    • setupATLAS -c slc6 or setupATLAS -c centos7 to get an slc6 or a centos7 container. ALERT! (As of Aug 29, 2018, we strongly recommend the slc6 container.)
    • For more information about containers and setupATLAS -c, please read the Documentation. An example session is sketched after this list.
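
A minimal example of a session (the username is a placeholder, and this assumes your $HOME/.bashrc is already set up as described in First Time Configurations below):

ssh <your CC username>@cedar.computecanada.ca
setupATLAS -c slc6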

It is recommended that you do as much of your editing work on the host machine before going into the container (or open another terminal session on the host machine for editing etc.). Once in the container, you will lose much of the host environment; e.g. emacs will not be available, since these containers are light-weight.

Resource Intensive Interactive Work

ALERT! The first question you should seriously ask yourself is why you need to do this interactively when you could submit to the batch system. However, we recognize that some users may have a legitimate need, hence these supplemental instructions.

The interactive login machines, and containers started from them, should be considered similar to lxplus; that is, they are meant for light work such as compiling and running a quick test job and for submitting to the batch system. If you need to do resource-intensive interactive work, you will need to get an allocation before starting the container. Note the intermediate salloc step in the instructions below:

  • Interactive login (or Host) machine (See above)
  • After login, type the salloc command for your allocation. This will give you a machine for your own use (a sketch of the full sequence follows this list).
    • To see what allocations you have, type showCCaccount
    • If you have only one allocation, you can simply type salloc without an --account option.
    • If you have more than one allocation, it will list them and exit with an error, asking you to specify the --account option.
    • Tip: id -nG will show your groups, which include some of the accounts you can use for salloc.
  • Container (see above)
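
A minimal sketch of the sequence (the account name, time, and memory values are illustrative assumptions; adjust them to your own allocation and needs):

ssh <your CC username>@cedar.computecanada.ca
salloc --account=def-<your sponsor> --time=3:00:00 --mem=4000M --cpus-per-task=1
setupATLAS -c slc6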

First Time Configurations

ALERT! You need to do this only once for each cluster.

  • Install your grid certificates in your $HOME/.globus directory
  • (Optional, but more secure and lets you log in without typing a password) install your passwordless ssh login keys in $HOME/.ssh. A sketch of both items is given after this list.
  • $HOME/.bashrc should contain this:
export RUCIO_ACCOUNT=<your lxplus username>
source /project/atlas/Tier3/AtlasUserSiteSetup.sh
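
As a rough sketch of the first two items, run from your own desktop or laptop (the certificate file names and locations are assumptions; adjust them to where your files actually live):

ssh <your CC username>@cedar.computecanada.ca 'mkdir -p .globus'
scp ~/.globus/usercert.pem ~/.globus/userkey.pem <your CC username>@cedar.computecanada.ca:.globus/
ssh <your CC username>@cedar.computecanada.ca 'chmod 644 .globus/usercert.pem && chmod 400 .globus/userkey.pem'
ssh-keygen -t rsa -b 4096
ssh-copy-id <your CC username>@cedar.computecanada.ca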

Compute Canada Work Directories

Your $HOME directory has project and scratch sub-directories, which are really symbolic links to network-mounted external directories. In your container, these are also available at the same locations under $HOME.

On Cedar, the /project space is backed up. The /scratch space is not backed up, but there are no policies (yet) in place regarding how long you can keep your data on /scratch. So if you have lots of data files to work with and they can be replicated if lost, it is better to put them on /scratch.
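
For example, to see where these links point and how much space you are using (diskusage_report is the Compute Canada quota-reporting tool; if it is not available on your cluster, the ls line alone still shows the link targets):

ls -ld $HOME/project* $HOME/scratch
diskusage_report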

Access to Tier2 Storage

ALERT! This is available only at Cedar (SFU) at the moment.

Introduction

The ComputeCanada ATLAS Tier3 sites co-located with a Tier2 are:

  • SFU (Cedar)
  • Waterloo (Graham)
If you are using a Tier3 at one of these sites, you can directly access only that site's Tier2 storage from your Tier3.

These can be:

  • DATADISK:
    • CA-SFU-T2_DATADISK or CA-WATERLOO-T2_DATADISK
    • You have only read access.
    • Files here are managed by ATLAS and have a relatively long lifetime compared to SCRATCHDISK.
  • SCRATCHDISK:
    • CA-SFU-T2_SCRATCHDISK or CA-WATERLOO-T2_SCRATCHDISK
    • You should only read from here; never upload files or replicate data to it, as this is volatile space and doing so can interfere with Tier2 operations.
    • Files here are managed by ATLAS and are volatile (short lifetime of about 2 weeks, or until cleanup is needed).
  • LOCALGROUPDISK:
    • CA-SFU-T2_LOCALGROUPDISK or CA-WATERLOO-T2_LOCALGROUPDISK
    • You can request rucio data replication to these locations for datasets not already at the site. You will need approval if you move large amounts of data, and note that you have a quota (check it with rucio list-account-usage $RUCIO_ACCOUNT). A sketch of the relevant rucio commands follows this list.
    • You can also upload your own datasets to LOCALGROUPDISK, and they will be available to anyone in ATLAS (for details, run rucio upload --help, rucio add-dataset --help, and rucio attach --help).
    • Files on LOCALGROUPDISK are not cleaned up automatically; deleting them needs your approval (or remove the rule yourself if you replicated the data; see rucio delete-rule --help).
    • You can see what replication rules you have from the web-ui or from rucio list-rules --account $RUCIO_ACCOUNT.
    • To replicate to an ATLAS-Canada Tier2 site LOCALGROUPDISK, you need to have the voms role /atlas/ca already approved.
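
As a sketch, a typical LOCALGROUPDISK workflow with rucio (the scope, dataset name, and rule id are placeholders) looks like:

lsetup rucio
voms-proxy-init -voms atlas
rucio list-account-usage $RUCIO_ACCOUNT
rucio list-rules --account $RUCIO_ACCOUNT
rucio add-rule <scope>:<dataset name> 1 CA-SFU-T2_LOCALGROUPDISK
rucio delete-rule <rule id>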

For more details on rucio, replication, and the disk areas, please see the Twiki.

Instructions

Here are some instructions on how to access files that are on the Tier2 storage from a ComputeCanada Tier3 machine. All of these steps should be done after setupATLAS -c ...:

  • Find the dataset replica at the ComputeCanada site (if it is not there, you will need to replicate it):
lsetup rucio
(or lsetup "rucio -w" if other tools are also set up)
voms-proxy-init -voms atlas

eg
rucio list-dataset-replicas group.phys-higgs:group.phys-higgs.hhskim.mc12_8TeV.161574.PowPyth8_AU2CT10_ggH110_tautauhh.e1217_s1469_s1470_r3542_r3549_p1344.v29

  • List the files at the site (note the --rse option, whose value is based on the previous results, and the --pfns option). Technically, you can skip to the next step if you are sure the files are at the site and the rucio command below will work.
eg
rucio list-file-replicas group.phys-higgs:group.phys-higgs.hhskim.mc12_8TeV.161574.PowPyth8_AU2CT10_ggH110_tautauhh.e1217_s1469_s1470_r3542_r3549_p1344.v29 --rse CA-SFU-T2_LOCALGROUPDISK --pfns

  • Convert to local paths if the previous command worked (i.e. prepend getLocalDataPath to the previous command):
eg
getLocalDataPath rucio list-file-replicas group.phys-higgs:group.phys-higgs.hhskim.mc12_8TeV.161574.PowPyth8_AU2CT10_ggH110_tautauhh.e1217_s1469_s1470_r3542_r3549_p1344.v29 --rse CA-SFU-T2_LOCALGROUPDISK --pfns

  • The output from the above will be local file paths, which you can access or open directly from your container (see the example below).
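
For example, you could save those paths into a file list for your analysis job to read (the output file name is arbitrary):

getLocalDataPath rucio list-file-replicas <scope>:<dataset name> --rse CA-SFU-T2_LOCALGROUPDISK --pfns > myInputFiles.txt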

Batch Submissions

Please see this page for how to submit batch jobs from inside a container.

Note that there are tools which can help generate your batch script.

ALERT! Batch job submissions should be done without the local interactive environment passed on to the batch job. For SLURM, this means you submit with this option: sbatch --export=NONE .... See this page for details.
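
A rough sketch of such a batch script follows (the account, time, and memory values are illustrative assumptions, and the setupATLAS -c ... -r form for running a script non-interactively inside the container is taken from the container documentation; check the linked pages for the currently supported recipe):

#!/bin/bash
#SBATCH --account=def-<your sponsor>
#SBATCH --time=3:00:00
#SBATCH --mem=2000M
setupATLAS -c slc6 -r myPayload.sh

Here myPayload.sh is your own script containing the actual work; submit the batch script with sbatch --export=NONE myBatchScript.sh.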

Reference: Compute Canada Slurm Documentation

Known Issues

Description | Reported Information | Status
After salloc, acm fails to compile, e.g. setupATLAS; diagnostics; toolTest -m user acm | CC admins need to investigate | Unresolved
xcache fails to connect from a WN obtained by salloc | CC admins need to investigate | Unresolved

Troubleshooting

  • If you have issues, first check that you are running inside a container; Compute Canada machines are NOT ATLAS-ready, which is why you need to run inside a container. A quick check is sketched below.
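
One quick way to verify (this assumes setupATLAS -c uses Singularity, which sets container-specific environment variables; empty output suggests you are still on the bare host):

echo $SINGULARITY_NAME
cat /etc/redhat-release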