EPP Grid - Tutorial: Grid for Local Users


Tutorial: Grid for Local Users

Before starting, note that Grid jobs have turnaround times of 5 minutes or more, and longer still if a resource is full or network problems occur. The Grid is therefore best suited to long-running jobs or very large numbers of jobs.

Grid Tools Setup

To use the Grid toolkit (called Globus) you need to modify your environment on the EPP machines. If you use a C shell variant (tcsh or csh), add the following lines to the end of your .tcshrc or .cshrc file:
if ( -r /usr/local/grid/globus/general_cshrc ) then
    source /usr/local/grid/globus/general_cshrc
endif

If you use a Bourne shell variant (sh, bash, ksh, zsh), add the following lines to your .bashrc, .shrc, .kshrc, or .zshrc file:

# "." is the portable form of "source" and works in every Bourne-type shell
if [ -r /usr/local/grid/globus/general_cshrc ]; then
    . /usr/local/grid/globus/general_cshrc
fi

Verify that the tools are working by logging back in and running the following command:

globusrun -version
The command should report a version number such as 3.6.
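
If the command is not found, the setup file has probably not been sourced. The following is a minimal sketch for Bourne-type shells that checks whether the Globus commands used in this tutorial are on your PATH. The function name have_tool is hypothetical and is not part of Globus.

```shell
# Report whether a command is available on the PATH.
# have_tool is a hypothetical helper, not part of Globus.
have_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found (check that the Globus setup file is sourced)"
        return 1
    fi
}

# Check the Globus commands used in this tutorial.
for tool in globusrun grid-proxy-init grid-cert-info; do
    have_tool "$tool" || true
done
```

Any command reported as not found means the environment changes above have not taken effect in the current login session.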

Getting Accounts

Obtaining access to Grid resources involves three steps:
  1. Getting a Grid certificate.
  2. Getting an account on one or more resources.
  3. Mapping your certificate to that account.
NOTE: More advanced Grids use "Virtual Organisations" to manage access for groups of users, rather than mapping each certificate individually.

Certificates

You must first obtain a certificate to identify you on the Grid. A certificate can be obtained from the APAC Certificate Authority (which is actually at VPAC).

Click here to apply for a certificate.

Make sure you have typed the following into the form:

  Name:          Firstname Lastname (same in both places)
  Organisation:  The University of Melbourne (same in both places)
  Email:         (double check this)
  Role:          Grid User

You will shortly receive an email asking you to download your certificate and key over the web. Place both the certificate and the key in the ~/.globus directory under your home directory. The files must be named ~/.globus/usercert.pem and ~/.globus/userkey.pem respectively.

NOTE: If you have a PKCS12 certificate file (e.g. something.p12), you can convert it using the pkcs2pem command:

mkdir -p ~/.globus
cd ~/.globus
pkcs2pem pathto/mycert.p12 user
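
Once both files are in place, it is worth confirming that they exist and that the private key is not accessible by anyone else, since Globus tools generally refuse a key that group or others can access. The following is a minimal sketch for Bourne-type shells; check_globus_dir is a hypothetical helper, and the suggested chmod 400 is an assumption based on common Globus practice rather than something stated by this site.

```shell
# Check that a Globus credential directory holds both files and that the
# private key is accessible by its owner only.
# check_globus_dir is a hypothetical helper; pass the directory to inspect
# (defaults to ~/.globus).
check_globus_dir() {
    gdir=${1:-"$HOME/.globus"}
    for f in usercert.pem userkey.pem; do
        if [ ! -f "$gdir/$f" ]; then
            echo "missing: $gdir/$f"
            return 1
        fi
    done
    # Characters 5-10 of the ls mode string are the group and other bits.
    perms=$(ls -ld "$gdir/userkey.pem" | cut -c5-10)
    if [ "$perms" != "------" ]; then
        echo "userkey.pem is accessible by group or others; run: chmod 400 $gdir/userkey.pem"
        return 1
    fi
    echo "ok"
}
```

Running check_globus_dir with no argument inspects ~/.globus and prints ok when both files are present and the key is private.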

Accounts

Initially your account on the EPP machines will be enough. However, you can also apply for accounts at the Melbourne Uni Advanced Research Computing centre and at VPAC.
See Computer Help - Access to Facilities.

Mapping Certs to Accounts

When asking a resource owner to map your certificate to an account you must specify the subject of your certificate. You can obtain this by running the following command:
grid-cert-info -subject

For the Melbourne EPP resources, email your certificate subject and phone number to Lyle Winton (winton@physics.unimelb.edu.au).

On VPAC, once you have an account, you can manually map your certificate subject to it.
Click here and enter your VPAC username and password.
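
For context, resources running Globus typically record this mapping in a grid-mapfile, in which each line pairs a quoted certificate subject with a local account name. The sketch below only formats such a line to show what the resource owner does with your subject. The function mapfile_entry, the subject and the username are all made-up examples, and whether a given resource uses exactly this file is an assumption.

```shell
# Format one grid-mapfile style line: "certificate subject" local_account
# mapfile_entry is a hypothetical helper for illustration only.
mapfile_entry() {
    printf '"%s" %s\n' "$1" "$2"
}

# Example with a made-up subject and username.
mapfile_entry "/C=AU/O=Example/OU=Example Unit/CN=Firstname Lastname" jbloggs
```

The quotes matter because certificate subjects usually contain spaces.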

Data Management

For the moment, you just need to realise that on a Grid your jobs can run anywhere, but your files and data are not available everywhere. Most jobs will therefore need to stage in input files (config, data) and then stage out output files (logs, hbooks, root files, data).

Creating a Job

Running a job on the Grid is difficult unless you have the right tools. We've built such a tool, called gqsched. It allows you to run an ordinary script across the Grid with only a few modifications for staging files. The tool requires little knowledge to get started, but is fully featured for advanced users. In this section we will construct three simple scripts: the first is just a dummy script, the second is a real job with staging, and the third runs multiple jobs from one script.

The following script, which could be written in any scripting language you like, prints the hostname, prints the environment, sleeps for 60 seconds, prints the date, then exits. Nothing to it!

#!/usr/bin/tcsh
echo Starting...
hostname
env
sleep 60
date
echo Stopped.

The second script demonstrates the need to stage files to and from (in and out of) the remote Grid resource. Staging is specified with directives of the form #:STAGEIN and #:STAGEOUT, with multiple files separated by semicolons. In the following example the local files recon.conf, particle.conf, event.conf and data.mdst will be staged in to whichever remote resource runs the job, and myoutput.hbook will be staged back.

#!/usr/bin/tcsh
#:STAGEIN recon.conf ; particle.conf ; event.conf ; data.mdst
basf << EOF
path create main
path add_module main my_ana
initialize
histogram define myoutput.hbook
process_event data.mdst
terminate
EOF
#:STAGEOUT myoutput.hbook
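
Before submitting, it can help to list exactly which files a script asks to stage. The following is a minimal sketch for Bourne-type shells that extracts the file names from the directives shown above. list_staged is a hypothetical helper that only illustrates the directive syntax; it is not how gqsched itself parses scripts.

```shell
# List the files named in #:STAGEIN or #:STAGEOUT directives of a script.
# list_staged is a hypothetical helper, not part of gqsched.
#   $1 = directive name (STAGEIN or STAGEOUT)
#   $2 = script file
list_staged() {
    # Keep only the directive lines, split on ";", then trim whitespace
    # and drop empty entries.
    sed -n "s/^#:$1[[:space:]]\{1,\}//p" "$2" |
        tr ';' '\n' |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}
```

For example, list_staged STAGEIN myscript.csh prints one input file per line, which can then be checked against the contents of the current directory.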

The third script demonstrates running the same script over a number of input files. This is done using a "parameter sweep", in which the environment variable $MYINPUT (i.e. the parameter) takes each value in turn. A separate job is created for each value, and these jobs are run in parallel. In the case below, one job is created for each file matching mydata*.mdst.

#!/usr/bin/tcsh
#:PARAM MYINPUT FILE mydata*.mdst
#:STAGEIN recon.conf ; particle.conf ; event.conf
#:STAGEIN $MYINPUT
basf << EOF
path create main
path add_module main my_ana
initialize
histogram define myoutput-$JOBID.hbook
process_event $MYINPUT
terminate
EOF
#:STAGEOUT myoutput-$JOBID.hbook
NOTE: $JOBID is used to prevent filename clashes in the output. It is set to a different number for every job.
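
To see how many jobs a sweep will create before submitting it, you can expand the same file pattern locally. The following is a minimal sketch for Bourne-type shells; sweep_preview is a hypothetical helper, and numbering the jobs from 1 is an assumption made for illustration that may not match the $JOBID values gqsched actually assigns.

```shell
# Print one line per job that a sweep over a set of files would create.
# sweep_preview is a hypothetical helper; the numbering from 1 is an
# assumption and may differ from the JOBID that gqsched assigns.
sweep_preview() {
    n=0
    for f in "$@"; do
        [ -e "$f" ] || continue   # skip a pattern that matched nothing
        n=$((n + 1))
        echo "job $n: MYINPUT=$f output=myoutput-$n.hbook"
    done
}

# Example: sweep_preview mydata*.mdst
```

If the preview prints nothing, the pattern matches no files in the current directory and the sweep would have no values to run over.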

Submitting a Job

Once you've got your script, submission is simple:
grid-proxy-init
gqsched myscript.csh

Running grid-proxy-init signs you on to the Grid with your certificate. This "sign on" lasts for 12 hours but can be renewed or extended. If you know your job will run for longer than 12 hours, you can request a longer lifetime. For example, for 48 hours:

grid-proxy-init -valid 48:00
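
Before a long submission you may want to confirm that your current sign on will outlast the job. The following is a minimal sketch for Bourne-type shells; proxy_covers is a hypothetical helper that compares a remaining lifetime in seconds with an expected run time in hours. The commented example assumes the Globus command grid-proxy-info -timeleft is available and prints the remaining lifetime in seconds.

```shell
# Succeed if a sign on with the given number of seconds remaining outlasts
# a job expected to run for the given number of hours.
# proxy_covers is a hypothetical helper, not part of Globus.
#   $1 = seconds remaining, $2 = expected run time in hours
proxy_covers() {
    [ "$1" -ge $(( $2 * 3600 )) ]
}

# Example (requires the Globus tools to be set up):
#   proxy_covers "$(grid-proxy-info -timeleft)" 24 || grid-proxy-init -valid 48:00
```

The example renews the sign on for 48 hours only when fewer than 24 hours remain.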

The gqsched tool periodically reports the status of your jobs (by job ID) and should not be stopped until all jobs are complete.

Getting Output and Results

Once the jobs are complete, the standard output and error from each job are returned to your local directory as myscript.csh.o1, myscript.csh.e1, myscript.csh.o2, myscript.csh.e2, and so on. Any staged output should also be returned to the local directory.
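
With many jobs, a quick way to spot problems is to look for error files that are not empty. The following is a minimal sketch for Bourne-type shells; nonempty_errors is a hypothetical helper that relies only on the file naming described above. A non-empty error file does not necessarily mean the job failed, since some programs write warnings to standard error.

```shell
# List the per-job error files that are not empty, so that jobs which
# wrote to standard error stand out.
# nonempty_errors is a hypothetical helper; $1 is the script name used
# as the prefix of the returned files (for example myscript.csh).
nonempty_errors() {
    for f in "$1".e*; do
        if [ -s "$f" ]; then
            echo "$f"
        fi
    done
    return 0
}

# Example: nonempty_errors myscript.csh
```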

Troubleshooting

  • To get a more detailed status of what jobs are doing use the -resource-status option.

  • If a job cannot be submitted or is terminated more than 3 times it will be marked as failed. Once gqsched is finished you can force specific jobs to rerun using the -rerun option.

  • You can use the -remote-debug option to turn on remote process debugging. This should tell you everything that is going on for each job at each resource. The debug output is returned in both the standard output and error files for each job. If no debug output is returned, the resource may have a connection problem.

  • You can use the -postproc option to validate the output of jobs as they return. This can then be used to identify failures and to resubmit the job. gqsched will try to avoid failing resources.

  • If gqsched dies or is killed for some reason, you should run gqsched-killall-jobs to clean up any detached jobs and remote working directories. This command usually takes a while.

Revision:  r5 - 12 Dec 2005 - LyleWinton
Authorised by:  Geoff Taylor (G.Taylor @ physics.unimelb.edu.au)
Copyright © 2000-2009 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.