LCG Technical Workshop

KEK Japan, November 17 - 18, 2005


Document style conventions.

Throughout this workshop, commands that need to be executed by you will be bounded by a box of this color.
These will apply to everyone irrespective of what Grid component is being installed.

Information will be written in this font.

Commands that can be executed at the shell are in this font.

References.

LCG Installation Manual
LCG Site Testing Manual
LCG Network Ports Table
Disk Pool Manager (DPM) Administration Guide
DPM Testing - GridPP wiki
LCG File Catalog (LFC) Administrators Guide
LFC Testing - GridPP wiki



Introduction.

The purpose of this workshop is to install all of the components of the LCG Grid middleware and build a standalone, working Grid.

Workshop participants will work in small groups to install and configure multiple components of the middleware per group. We will then test the installation and examine some of its features. Following this, participants from interested sites will work to configure a compute element interface to their local cluster in order to create a federation of computing resources. Data management will be via the Storage Resource Broker (SRB). Belle MC will be demonstrated.

At the completion of the workshop, attendees will have the skills to configure any of the middleware components.





README.

In the /root directory of each Grid node is the directory "workshop". This directory contains the material necessary for this workshop.


packages: An archive of Scientific Linux and LCG middleware packages.
certificates: Host certificates and private keys for each of the nodes. Use the certificate / key applicable to your node.
yaim-configs: The YAIM configuration files. There are two sets of files in this directory, for building Grid1 and Grid2. Use the files appropriate to the system you are a part of.
doc: Some useful documentation. Specifically, the LCG2-UserGuide (recommended reading) and, for system administrators, the LCG2-PortTable.





Section 1 - Preparing the node.


Each of the workstations used in this workshop has a base installation of Scientific Linux 3.0.4. In each case, a minimal system was installed. Before we install the middleware, the nodes need to be updated to the latest version of Scientific Linux - namely, 3.0.5.

Aside: Although we will set up the nodes so that they update from the CERN repository, in this case, we will create a package archive locally in order to save time downloading packages.

The standard method is to use the package manager APT (Advanced Packaging Tool) - although in principle, yum could also be used. We will use APT as this is what the LCG middleware installer uses.

1.1 Is APT installed?

$ rpm -qa | grep apt

If it is installed you will see at least:

apt-0.5.15cnc6-8.SL.cern

The version may be different.

If it is not installed:

$ rpm -ivh http://linuxsoft.cern.ch/cern/slc30X/i386/SL/RPMS/apt-0.5.15cnc6-8.SL.cern.i386.rpm

1.2 Edit the APT sources file.

Although CERN now uses the standard Scientific Linux distribution, historically there have been differences between CERN Scientific Linux and the distribution of Scientific Linux coming from Fermilab. To be sure we have a CERN compatible version, we will update the system from CERN's Scientific Linux repository.

$ cd /etc/apt/sources.list.d

Create a new file cern.list and add the following lines to it.

$ vi cern.list


rpm http://linuxsoft.cern.ch  cern/slc305/i386/apt  os updates extras
rpm-src http://linuxsoft.cern.ch  cern/slc305/i386/apt  os updates extras
rpm http://grid-deployment.web.cern.ch/grid-deployment/gis apt/LCG-2_6_0/sl3/en/i386 lcg_sl3 lcg_sl3.updates

Edit the file sl.list and comment out any lines in it.

$ vi sl.list

I do this because I don't want to update my nodes from a Scientific Linux repository - only from the CERN repositories.
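If you would rather not edit sl.list by hand, the same result can be had with sed. The following is only a sketch: the function name comment_out_all is made up for this example, and it keeps a backup copy with a .bak suffix.

```shell
#!/bin/bash
# Comment out every active line (non-blank and not already a comment) in a file.
# A backup of the original is kept next to it with a .bak suffix.
comment_out_all() {
    local file="$1"
    cp "$file" "$file.bak" || return 1
    # Prefix a '#' to lines whose first non-space character is not '#'.
    sed 's/^\([[:space:]]*[^#[:space:]]\)/#\1/' "$file.bak" > "$file"
}

# Example (run as root on the node):
#   comment_out_all /etc/apt/sources.list.d/sl.list
```

Lines that are already commented out, and blank lines, are left untouched.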

1.3 Update the system.

To save time downloading from CERN, copy all of the packages from the archive directory to the APT package archive.

$ cp /root/workshop/packages/*.rpm /var/cache/apt/archives/
$ apt-get update
$ apt-get dist-upgrade

1.4 Install Java.

Due to licensing restrictions, Java cannot be distributed via the CERN APT repositories. So, it will need to be downloaded from the Java website and installed separately. Java needs to be installed before the middleware is installed in order to satisfy the middleware dependencies. Further, it should be installed from an RPM package (not a tarball distribution) so that APT dependency resolution can occur. The middleware will not install if Java is not installed from an RPM package.

For this workshop, the Java RPM is in the APT archive.

Is Java installed?

$ ls /usr/java

Remember: It has to be installed from the RPM package. If this is the case, then on Red Hat-like systems it will be in /usr/java.

If it is not installed, then:

$ rpm -ivh /var/cache/apt/archives/j2sdk-1_4_2_08-linux-i586.rpm

1.5 Configure network time synchronisation.

Network time has been configured for you. However, if you were to configure it, you would do the following.
Install ntpd and ntpdate


Network time server is:

ntp01.local.kek.jp: 172.30.32.100

Edit /etc/ntp.conf and add the following lines:

restrict 172.30.32.100  mask 255.255.255.255 nomodify notrap noquery
server 172.30.32.100

Edit /etc/ntp/step-tickers and add the following line:

172.30.32.100

Restart the service.

service ntpd restart

NTP is known to lose time periodically. To combat this, create a script in /etc/cron.daily called "sync-time.sh" and add the following lines to it.

#!/bin/bash

service ntpd stop
ntpdate ntp.unimelb.edu.au
ntpdate ntp.unimelb.edu.au
ntpdate ntp.unimelb.edu.au
service ntpd start

Run the script - you should see something like:

Shutting down ntpd:                                                  [  OK  ]
11 May 02:22:58 ntpdate[25023]: step time server 128.250.5.101 offset 3.704193 sec
11 May 02:22:58 ntpdate[25038]: adjust time server 128.250.5.101 offset 0.000013 sec
11 May 02:22:59 ntpdate[25039]: adjust time server 128.250.5.101 offset -0.000001 sec
ntpd: Synchronizing with time server:                      [  OK  ]
Starting ntpd:
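To check the result without reading the log by eye, the offsets can be pulled out of the ntpdate lines. This is a sketch only: the function names are invented for this example, and the 0.5 second limit in the usage line is an arbitrary choice, not an LCG requirement.

```shell
#!/bin/bash
# Print the clock offset (in seconds) from each ntpdate line read on stdin.
ntpdate_offsets() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "offset") print $(i + 1) }'
}

# Succeed if the absolute value of an offset is below a limit (both in seconds).
offset_within() {
    awk -v off="$1" -v lim="$2" 'BEGIN { if (off < 0) off = -off; exit !(off < lim) }'
}

# Example: check the last offset reported by the sync script.
#   last=$(/etc/cron.daily/sync-time.sh 2>&1 | ntpdate_offsets | tail -n 1)
#   offset_within "$last" 0.5 && echo "clock is in sync"
```

In the sample output above, the first offset (3.704193) is the correction that was applied, and the later ones show that the clock has converged.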

To set the system time to UTC - add the following lines to /etc/sysconfig/clock

ZONE="UTC"
UTC=true
ARC=false

cp /usr/share/zoneinfo/UTC /etc/localtime

hwclock --systohc --utc

Check that the hardware clock is set to UTC.

hwclock

You should see something like:

Thu 11 May 2006 02:23:38 UTC  -0.918646 seconds

Section 2 - Download the LCG installer and prepare for installation.

2.1 Download the installer.

The current version of the YAIM installer is 2.6.0-9.
Is it installed?

$ rpm -qa | grep yaim



If it is not installed, then:

$ apt-get update
$ apt-get install lcg-yaim

YAIM is installed in /opt/lcg/yaim. In this branch of the filesystem there are 3 directories:

examples: This is where the site configuration files are located. In here you will find 3 files: site-info.def, users.conf and wn-list.conf. These files should be customised for your site.
functions: This is where the functions to configure the components are located.
scripts: This is where the main installation and configuration scripts are located. This directory contains 4 files:

install_node is used to install an LCG component
configure_node is used to configure that component
run_function is used to run a specific YAIM configuration function
node-info.def defines which functions are necessary to configure the different node types

2.2 Configure the site information files.

The YAIM (Yet Another Installation Method) installer is a script based method which is centered on a site configuration file. For any site, large or small, the same configuration file is used to configure all of the different node types. For this workshop, the necessary site configuration files have already been customised. These can be found in:

/root/workshop/yaim-configs/


Since we have so many computers available, and so that everyone gets a chance to install a reasonably complicated component, we will build two testbeds. To do this, there are two sets of configuration files. Choose the set to use based on which Grid you have been assigned to.

For Grid1:

keksite-info-Grid1.def: The main site configuration file.
kekusers.conf: The list of generic grid user accounts to create on each node.
kekwn-list-Grid1.conf: The list of worker nodes that Torque will be configured for.

For Grid2:

keksite-info-Grid2.def: The main site configuration file.
kekusers.conf: The list of generic grid user accounts to create on each node.
kekwn-list-Grid2.conf: The list of worker nodes that Torque will be configured for.


$ cp /root/workshop/yaim-configs/kek*Grid1* /opt/lcg/yaim/examples

OR

$ cp /root/workshop/yaim-configs/kek*Grid2* /opt/lcg/yaim/examples

$ cd /opt/lcg/yaim/examples/

$ vi kek*

Read the files.

If you are not sure what any of the variables mean or how they are used, ask and we will discuss it as a group.


Section 3 - Install and configure the middleware component.

3.1 Installing the components.

Since we are only installing a few components, and since some require more effort than others, each group will install several of the available middleware components.


Node Type | Installation Target | Configuration Target | Meta-package Description
User Interface | lcg-UI | UI | User Interface component - access to the Grid
Resource Broker / BDII | lcg-RB lcg-BDII | RB BDII | Resource Broker component
Storage Element (Disk Pool Manager) | LCG-SE_dpm_mysql | SE_dpm_mysql | Storage Element component - Disk Pool Manager
Storage Element (Classic) | lcg-SE_classic | SE_classic | Storage Element component - classic (Disk)
Compute Element (with Torque Server) | lcg-CE_torque | CE_torque | Compute Element component including Torque (PBS) LRMS server
Worker node (with Torque client) | lcg-WN_torque | WN_torque | Worker Node component including Torque (PBS) LRMS clients
Mon-box | lcg-MON | MON | RGMA based monitoring system collector server

3.2 Generic Installation procedure.


QUESTION:
Should all of the node types have host certificates installed on them?
If not, which ones shouldn't have them?
What about CRLs - all nodes or only some?

Why?



Execute the installer script.

$ cd /opt/lcg/yaim/scripts
$ ./install_node ../examples/keksite-info-Grid?.def <Installation Target>

Install the host certificate and private key.

$ cp /root/workshop/certificates/host-certificates/<NODE-NAME>/hostcert.pem /etc/grid-security/
$ cp /root/workshop/certificates/host-certificates/<NODE-NAME>/hostkey.pem /etc/grid-security/
$ chmod 444 /etc/grid-security/hostcert.pem
$ chmod 400 /etc/grid-security/hostkey.pem

Install the Certificate Authority certificate bundle (the CA we are using is not trusted by LCG).

$ cp /root/workshop/CA-bundle/ca-bundle.tgz /etc/grid-security/certificates
$ tar -zxvf /etc/grid-security/certificates/ca-bundle.tgz -C /etc/grid-security/certificates

And write a script to automatically update the CRL (to find out where to get the CRL, look in the CA certificate):

openssl x509 -in /etc/grid-security/certificates/$CACERT -noout -text

In the section "X509v3 extensions:" there will be a sub-section "X509v3 CRL Distribution Points:" which lists the URI for collecting the CRL for that particular Certificate Authority (CA).
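The URI can also be extracted with a small filter instead of reading the text dump by eye. This is a sketch under assumptions: the function name crl_uris is made up for this example, and it relies on the layout that openssl normally prints, where the URI lines directly follow the "X509v3 CRL Distribution Points:" heading.

```shell
#!/bin/bash
# Read "openssl x509 -noout -text" output on stdin and print the URI(s)
# listed under the "X509v3 CRL Distribution Points" extension.
crl_uris() {
    awk '
        /X509v3 CRL Distribution Points/ { in_cdp = 1; next }
        !in_cdp { next }
        /URI:/ { sub(/^.*URI:/, ""); sub(/[[:space:]]+$/, ""); print; next }
        /Full Name:/ || /^[[:space:]]*$/ { next }
        { in_cdp = 0 }   # any other line means the next extension has started
    '
}

# Example:
#   openssl x509 -in /etc/grid-security/certificates/$CACERT -noout -text | crl_uris
```

If the CA certificate has no CRL distribution point extension, nothing is printed and the URI has to be obtained from the CA directly.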

cat > /root/bin/update-crl.sh << 'EOF'
#!/bin/bash

cd /etc/grid-security/certificates
wget -q -O cacrl.crl "[URI of certificate revocation list]"

# Optional: if the CRL is in DER format, convert it to PEM format.
openssl crl -inform DER -in cacrl.crl -outform PEM -out 21bf4d92.r0

rm cacrl.crl
EOF

chmod u+x /root/bin/update-crl.sh

Then add a crontab entry for it:

crontab -e
0 4 * * * /root/bin/update-crl.sh

Install the VOMS configuration files

$ cp /root/workshop/VO/vomsdir/apdg-dg13.cc.kek.jp /etc/grid-security/vomsdir/
$ cp /root/workshop/VO/vomses/apdg-dg13.cc.kek.jp /opt/edg/etc/vomses/

Execute the configuration script

$ ./configure_node ../examples/keksite-info-Grid?.def <Configuration Target>




Generic checks (for each node type)

1. Were the generic Grid user accounts created?

$ ls -la /home

2. Are the DNs in the grid-mapfile mapped to existing UNIX user accounts?

$ vi /etc/grid-security/grid-mapfile

Look for mapped account names which don't match the existing UNIX user accounts
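On a node with many mappings this comparison can be scripted. The sketch below assumes the usual grid-mapfile layout of a quoted DN followed by an account name, and assumes that a leading dot (for example .apdg) denotes a pool of numbered accounts such as apdg001. The function name unmapped_accounts is invented for this example.

```shell
#!/bin/bash
# Print every account name used in a grid-mapfile that has no matching
# entry in the passwd file (default: /etc/passwd).
unmapped_accounts() {
    local mapfile="$1" passwd="${2:-/etc/passwd}"
    # The mapped name is the first word after the closing quote of the DN.
    sed -n 's/^[[:space:]]*"[^"]*"[[:space:]]*\([^[:space:]]*\).*$/\1/p' "$mapfile" |
    sort -u |
    while read -r name; do
        [ -n "$name" ] || continue
        case "$name" in
            .*) # Pool prefix: ".apdg" should match apdg001, apdg002, ...
                grep -q "^${name#.}[0-9][0-9]*:" "$passwd" || echo "$name" ;;
            *)  grep -q "^${name}:" "$passwd" || echo "$name" ;;
        esac
    done
}

# Example:
#   unmapped_accounts /etc/grid-security/grid-mapfile
```

No output means that every mapped name has a corresponding UNIX account.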

3. Is the Certificate Authority certificate valid?

$ openssl verify -CAfile /etc/grid-security/certificates/$CACERT /etc/grid-security/certificates/$CACERT

4. Is the Certificate Revocation list valid? Is it current?

$ openssl crl -noout -CAfile /etc/grid-security/certificates/$CACERT -in /etc/grid-security/certificates/$CACERT-CRL
$ openssl crl -noout -CAfile /etc/grid-security/certificates/$CACERT -in /etc/grid-security/certificates/$CACERT-CRL -nextupdate
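The -nextupdate option prints a line of the form "nextUpdate=Nov 17 00:00:00 2005 GMT". Whether that time is still in the future can be tested as sketched below. The function name crl_is_current is made up for this example, and the sketch assumes GNU date (for the -d option), as shipped with Scientific Linux.

```shell
#!/bin/bash
# Succeed if the nextUpdate time printed by "openssl crl -nextupdate"
# is still in the future. Returns 2 if the date cannot be parsed.
crl_is_current() {
    local next="${1#nextUpdate=}" next_s now_s
    next_s=$(date -u -d "$next" +%s 2>/dev/null) || return 2
    now_s=$(date -u +%s)
    [ "$next_s" -gt "$now_s" ]
}

# Example:
#   crl_is_current "$(openssl crl -noout -nextupdate \
#       -in /etc/grid-security/certificates/$CACERT-CRL)" || echo "CRL has expired"
```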

5. Does the host certificate match the host key?

$ openssl x509 -in /etc/grid-security/hostcert.pem -noout -modulus
$ openssl rsa -in /etc/grid-security/hostkey.pem -noout -modulus

Do they match?
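The two moduli are long hexadecimal strings, so it is easier to let the shell compare them. A minimal sketch, assuming an RSA host key as in the commands above; the function name cert_matches_key is invented for this example.

```shell
#!/bin/bash
# Succeed if the certificate and the RSA private key have the same modulus.
# Returns 2 if either file cannot be read or parsed.
cert_matches_key() {
    local cert="$1" key="$2" cert_mod key_mod
    cert_mod=$(openssl x509 -in "$cert" -noout -modulus 2>/dev/null)
    key_mod=$(openssl rsa -in "$key" -noout -modulus 2>/dev/null)
    # An empty modulus means the file could not be read or parsed.
    [ -n "$cert_mod" ] && [ -n "$key_mod" ] || return 2
    [ "$cert_mod" = "$key_mod" ]
}

# Example:
#   cert_matches_key /etc/grid-security/hostcert.pem /etc/grid-security/hostkey.pem \
#       && echo "certificate and key match"
```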

6. Is the host certificate valid?

$ openssl verify -CAfile /etc/grid-security/certificates/21bf4d92.0 $HOSTCERT

7. Verify the consistency of the host key.

$ openssl rsa -in $HOSTKEY -noout -check


Note: A large proportion of errors usually arise from problems with host certificates, host keys, CA certificates and/or certificate revocation lists. Unfortunately, the actual error you see as a user is not normally recognisable as an error arising from an expired host certificate or an invalid certificate revocation list.

You have been warned!


3.3 User Interface.

Use the netstat command to see what services are listening on which ports. There should only be ssh.

lcgui $ netstat -tnlp

Important files / locations.

In /opt/edg/etc there should be a directory (of the same name) for each of the VOs your site supports (e.g. apdg). In that directory there will be a configuration file (edg_wl_ui.conf) that details which Network Server this UI will submit jobs to, where the job logging will go and which myproxy server to use.

lcgui $ cd /opt/edg/etc/
lcgui $ ls -la
lcgui $ ls -la apdg

If the user does not specify information like the rank of the job, the requirements (ie a production system), logging level etc. then the global configuration is used.

lcgui $ ls -la edg_wl_ui_cmd_var.conf

Configure the firewall on the node.

lcgui $ iptables -F
lcgui $ iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
lcgui $ iptables -A INPUT -s `hostname` -d `hostname` -j ACCEPT
lcgui $ iptables -A INPUT -s localhost.localdomain -d localhost.localdomain -j ACCEPT
lcgui $ iptables -A INPUT -p tcp --dport 22 -j ACCEPT
lcgui $ iptables -A INPUT -s ntp01.local.kek.jp -p udp --dport 123 -j ACCEPT
lcgui $ iptables -A INPUT -j DROP
lcgui $ service iptables save


3.4 Storage Element (Disk Pool Manager MySQL).


Use the netstat command to see what services are listening on which ports.

lcgse $ netstat -tnlp

You should see:

5001: rfiod
5010: dpnsdaemon: Disk Pool Name Server daemon
5015: dpm: Disk Pool Manager
2811: ftpd: GridFTP server
8443: srmv1: Storage Resource Manager Version 1
8444: srmv2: Storage Resource Manager Version 2
3306: mysqld: MySQL Database Server
22: sshd
2135: slapd: Globus GRIS
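Comparing the netstat output against a list like this one can be automated. The sketch below reads netstat -tnlp output and prints any expected port that is not listening. The function name missing_ports is made up for this example, and the sketch assumes the usual netstat column layout (local address in column 4, state in column 6).

```shell
#!/bin/bash
# Read "netstat -tnlp" output on stdin and print each expected port
# (given as arguments) that has no listening TCP socket.
missing_ports() {
    local listening port
    # The port is whatever follows the last colon of the local address.
    listening=$(awk '$1 ~ /^tcp/ && $6 == "LISTEN" { n = split($4, a, ":"); print a[n] }')
    for port in "$@"; do
        printf '%s\n' "$listening" | grep -qx "$port" || echo "$port"
    done
}

# Example for the DPM Storage Element:
#   netstat -tnlp | missing_ports 5001 5010 5015 2811 8443 8444 3306 22 2135
```

No output means that all of the expected services are listening. The same function can be reused on the other node types with their own port lists.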

Important files / locations.

The Configuration files are located in /etc/sysconfig.

lcgse $ ls -l /etc/sysconfig/dp*

Ensure /etc/grid-security/gridmapdir is writable by the user dpmmgr and that there exists a dpmmgr directory containing dpmcert.pem and dpmkey.pem.

lcgse $ ls -l /etc/grid-security/
lcgse $ ls -l /etc/grid-security/dpmmgr

DPM and Name Server (NS) Configuration files.

lcgse $ ls -l /opt/lcg/etc/NSCONFIG
lcgse $ ls -l /opt/lcg/etc/DPMCONFIG

Configure the firewall on the node.

lcgse $ iptables -F
lcgse $ iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
lcgse $ iptables -A INPUT -s `hostname` -d `hostname` -j ACCEPT
lcgse $ iptables -A INPUT -s localhost.localdomain -d localhost.localdomain -j ACCEPT
lcgse $ iptables -A INPUT -p tcp --dport 22 -j ACCEPT
lcgse $ iptables -A INPUT -s ntp01.local.kek.jp -p udp --dport 123 -j ACCEPT
lcgse $ iptables -A INPUT -p tcp --dport 5001 -j ACCEPT
lcgse $ iptables -A INPUT -p tcp --dport 5010 -j ACCEPT
lcgse $ iptables -A INPUT -p tcp --dport 5015 -j ACCEPT
lcgse $ iptables -A INPUT -p tcp --dport 8443 -j ACCEPT
lcgse $ iptables -A INPUT -p tcp --dport 8444 -j ACCEPT
lcgse $ iptables -A INPUT -p tcp --dport 3306 -j ACCEPT
lcgse $ iptables -A INPUT -p tcp --dport 2811 -j ACCEPT
lcgse $ iptables -A INPUT -p tcp --dport 20000:25000 -j ACCEPT
lcgse $ iptables -A INPUT -s <Your-CE> -p tcp --dport 2135 -j ACCEPT
lcgse $ iptables -A INPUT -j DROP
lcgse $ service iptables save
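The firewall recipes for the different node types share the same opening and closing rules and differ mainly in the list of open TCP ports. The sketch below prints (but does not execute) the commands for a given port list, so that they can be reviewed first. The function name print_firewall_rules is invented for this example, and rules that are restricted to one source host (such as port 2135 from the CE) are not covered and still have to be added by hand before the final DROP rule.

```shell
#!/bin/bash
# Print the iptables commands for a node that accepts ssh, NTP traffic from
# one time server, and the listed TCP ports or port ranges.
print_firewall_rules() {
    local ntp_host="$1" port
    shift
    echo 'iptables -F'
    echo 'iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT'
    echo 'iptables -A INPUT -s `hostname` -d `hostname` -j ACCEPT'
    echo 'iptables -A INPUT -s localhost.localdomain -d localhost.localdomain -j ACCEPT'
    echo 'iptables -A INPUT -p tcp --dport 22 -j ACCEPT'
    echo "iptables -A INPUT -s $ntp_host -p udp --dport 123 -j ACCEPT"
    for port in "$@"; do
        echo "iptables -A INPUT -p tcp --dport $port -j ACCEPT"
    done
    echo 'iptables -A INPUT -j DROP'
}

# Example: write the rules for a classic SE to a file for review.
#   print_firewall_rules ntp01.local.kek.jp 5001 2811 20000:25000 > /root/firewall.sh
```

After reviewing and completing the generated file, run it with sh and then save the rules with "service iptables save" as above.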

3.5 Storage Element (Classic - disk).


Set rfiod to start at boot time

lcgse $ chkconfig --add rfiod

Use the netstat command to see what services are listening on which ports.

lcgse $ netstat -tnlp

You should see:

2135: slapd: GRIS (Grid Resource Information System)
5001: rfiod
2811: ftpd: ftp daemon

Configure the firewall on the node.

lcgse $ iptables -F
lcgse $ iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
lcgse $ iptables -A INPUT -s `hostname` -d `hostname` -j ACCEPT
lcgse $ iptables -A INPUT -s localhost.localdomain -d localhost.localdomain -j ACCEPT
lcgse $ iptables -A INPUT -p tcp --dport 22 -j ACCEPT
lcgse $ iptables -A INPUT -s ntp01.local.kek.jp -p udp --dport 123 -j ACCEPT
lcgse $ iptables -A INPUT -p tcp --dport 5001 -j ACCEPT
lcgse $ iptables -A INPUT -p tcp --dport 2811 -j ACCEPT
lcgse $ iptables -A INPUT -s <Your-CE> -p tcp --dport 2135 -j ACCEPT
lcgse $ iptables -A INPUT -p tcp --dport 20000:25000 -j ACCEPT
lcgse $ iptables -A INPUT -j DROP
lcgse $ service iptables save


3.6 Compute Element (with Torque Server).


Use the netstat command to see what services are listening on which ports.

lcgce $ netstat -tnlp

You should see:

2119: edg-gatekeeper: globus gatekeeper service
9002: edg-wl-logd: workload manager logging daemon
40559,40560: maui: maui scheduler
15004: maui: maui scheduler
15001: pbs_server: Local Resource Management System
2135: slapd: GRIS (Grid Resource Information System)
2170: bdii-fwd: Site BDII (equivalent to a GIIS: Grid Information Index System)
2171, 2172, 2173: slapd
2811: ftpd: ftp daemon

Important files / locations.

Globus

/etc/globus.conf:  gatekeeper configuration
/etc/sysconfig/globus: main globus configuration file - specifies port range for GRAM service
/opt/globus/etc/grid-services:  jobmanager definitions
/opt/globus/lib/perl/Globus/GRAM/JobManager: PERL jobmanagers
/var/log/globus-gatekeeper.log: gatekeeper log file

BDII

/opt/bdii/etc/bdii-update.conf:  this is the configuration file for the site bdii service.
/opt/lcg/libexec/* :  scripts to collect dynamic information for the information system
/opt/lcg/var/gip/* : location of the Generic Information Provider (GIP) info.


PBS and Maui

/var/spool/pbs: PBS configuration directory
/var/spool/pbs/server_name: the name of the PBS server this compute element will submit jobs to.

/var/spool/maui: Maui configuration directory
/var/spool/maui/maui.cfg: the maui configuration file - cluster scheduling policy

/var/spool/pbs/server_logs/*: pbs log files

Hint: 
1. If your site already has a PBS cluster and you want to "Grid enable" it, just configure a normal LCG CE and then change the name listed in the server_name file. Make sure the PBS server you are submitting to will accept jobs from the CE.
2. Notes describing how to configure a single CE so that it can submit to multiple clusters can be found here.

Configure the firewall on the node.

lcgce $ iptables -F
lcgce $ iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
lcgce $ iptables -A INPUT -s `hostname` -d `hostname` -j ACCEPT
lcgce $ iptables -A INPUT -s localhost.localdomain -d localhost.localdomain -j ACCEPT
lcgce $ iptables -A INPUT -p tcp --dport 22 -j ACCEPT
lcgce $ iptables -A INPUT -s ntp01.local.kek.jp -p udp --dport 123 -j ACCEPT
lcgce $ iptables -A INPUT -p tcp --dport 9002 -j ACCEPT
lcgce $ iptables -A INPUT -p tcp --dport 2119 -j ACCEPT
lcgce $ iptables -A INPUT -p tcp --dport 15001 -j ACCEPT
lcgce $ iptables -A INPUT -p tcp --dport 15004 -j ACCEPT
lcgce $ iptables -A INPUT -p tcp --dport 2170 -j ACCEPT
lcgce $ iptables -A INPUT -p tcp --dport 2811 -j ACCEPT
lcgce $ iptables -A INPUT -p tcp --dport 20000:25000 -j ACCEPT
lcgce $ iptables -A INPUT -j DROP
lcgce $ service iptables save

3.7 Mon-box.


Use the netstat command to see what services are listening on which ports.

lcgmon $ netstat -tnlp

You should see:

8005, 8009, 8080, 8088, 8443: java
3306: mysql
2135: slapd: GRIS (Grid Resource Information System)
2136: slapd: GridICE-MDS - fabric monitoring mds
12409, 12411: edg-fmon-server: GridICE-MDS fabric monitoring mds
22: sshd

Important files / locations.

lcgmon $ ls -l /var/fmonServer

/var/lib/tomcat5/webapps/R-GMA:   R-GMA (Relational-Grid Monitoring Architecture) Tomcat webapp

Configure the firewall on the node.

lcgmon $ iptables -F
lcgmon $ iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
lcgmon $ iptables -A INPUT -s `hostname` -d `hostname` -j ACCEPT
lcgmon $ iptables -A INPUT -s localhost.localdomain -d localhost.localdomain -j ACCEPT
lcgmon $ iptables -A INPUT -p tcp --dport 22 -j ACCEPT
lcgmon $ iptables -A INPUT -s ntp01.local.kek.jp -p udp --dport 123 -j ACCEPT
lcgmon $ iptables -A INPUT -s <YOUR-CE> -p tcp --dport 2135 -j ACCEPT
lcgmon $ iptables -A INPUT -s <YOUR-CE> -p tcp --dport 2136 -j ACCEPT
lcgmon $ iptables -A INPUT -p tcp --dport 8080 -j ACCEPT
lcgmon $ iptables -A INPUT -j DROP
lcgmon $ service iptables save


Note: For the real LCG Grid, the R-GMA registry and schema server are hosted at Rutherford Appleton Labs in the UK. For the MON box to be set up correctly, it needs to contact that server. However, that server has a default policy of deny for unknown hosts. Accordingly, for this workshop, a gLite registry and schema server has been installed and configured.


3.8 Resource Broker / BDII.

Use the netstat command to see what services are listening on which ports.

lcgrb $ netstat -tnlp

You should see:

9000, 9001: edg-wl-bkserved: used for logging and bookkeeping
9002: edg-wl-logd: workload manager logging daemon
port can vary: condor_master
two services, ports can vary: condor_schedd
2135: slapd: GRIS (Grid Resource Information System)
2170: bdii-fwd: Site BDII (equivalent to a GIIS: Grid Information Index System)
2171, 2172, 2173: slapd
2811: ftpd: ftp daemon
7772: edg-wl-ns_daemon: network server
3306: mysql: database
22: sshd

Important files / locations.

Sandboxes, logfiles, general info can be found in /var/edgwl

lcgrb $ cd /var/edgwl
lcgrb $ ls -la

You should see: SandboxDir, jobcontrol, logmonitor, networkserver, proxyrenewal, workload_manager

Look for the log files in these directories.

Configure the firewall on the node.

lcgrb $ iptables -F
lcgrb $ iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
lcgrb $ iptables -A INPUT -s `hostname` -d `hostname` -j ACCEPT
lcgrb $ iptables -A INPUT -s localhost.localdomain -d localhost.localdomain -j ACCEPT
lcgrb $ iptables -A INPUT -p tcp --dport 22 -j ACCEPT
lcgrb $ iptables -A INPUT -s ntp01.local.kek.jp -p udp --dport 123 -j ACCEPT
lcgrb $ iptables -A INPUT -p tcp --dport 9000:9002 -j ACCEPT
lcgrb $ iptables -A INPUT -p tcp --dport 2170 -j ACCEPT
lcgrb $ iptables -A INPUT -s <YOUR-CE> -p tcp --dport 2135 -j ACCEPT
lcgrb $ iptables -A INPUT -p tcp --dport 2811 -j ACCEPT
lcgrb $ iptables -A INPUT -p tcp --dport 7772 -j ACCEPT
lcgrb $ iptables -A INPUT -p tcp --dport 20000:25000 -j ACCEPT
lcgrb $ iptables -A INPUT -j DROP
lcgrb $ service iptables save

Hint:
If you don't want anyone outside your institution using your Resource Broker, block port 7772 (Network Server) at the firewall.


3.9 Worker Node (with Torque Client).

Use the netstat command to see what services are listening on which ports. There should only be ssh and pbs_mom.

lcgwn $ netstat -tnlp | grep ssh
lcgwn $ netstat -tnlp | grep pbs

Make sure these are the only services running

lcgwn $ netstat -tnlp

Use pbsnodes to see if the node was configured correctly.

lcgwn $ pbsnodes -a

lcgwn.kek.jp
     state = free
     np = 1
     properties = lcgpro
     ntype = cluster
     status = arch=linux,uname=Linux lcgwn.kek.jp 2.4.30-xenU #8 Thu Sep 22 16:53:30 EST 2005 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=1116,totmem=1310712kb,availmem=1064452kb,physmem=262144kb,ncpus=1,loadave=0.12,rectime=1131255736

Important files / locations.

/var/spool/pbs/mom_logs/*

Configure the firewall on the node.

lcgwn $ iptables -F
lcgwn $ iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
lcgwn $ iptables -A INPUT -s `hostname` -d `hostname` -j ACCEPT
lcgwn $ iptables -A INPUT -s localhost.localdomain -d localhost.localdomain -j ACCEPT
lcgwn $ iptables -A INPUT -p tcp --dport 22 -j ACCEPT
lcgwn $ iptables -A INPUT -s ntp01.local.kek.jp -p udp --dport 123 -j ACCEPT
lcgwn $ iptables -A INPUT -j DROP
lcgwn $ service iptables save




Section 4 - Testing.

4.1 Proxy Creation.


Log on to the user interface using your guest account.

Install your user certificate and private key

$ mkdir .globus
$ cd .globus
$ su -

Enter the root password

$ cp /root/workshop/certificates/user-certificates/<USER-NAME>/usercert.pem ~<USER-NAME>/.globus/
$ cp /root/workshop/certificates/user-certificates/<USER-NAME>/userkey.pem ~<USER-NAME>/.globus/
$ cd ~<USER-NAME>/.globus/
$ chown <USER-NAME>:<USER-NAME> *
$ chmod 444 usercert.pem
$ chmod 400 userkey.pem

Exit from root

For a normal globus proxy certificate:

$ grid-proxy-init

For a VOMS proxy certificate:

$ voms-proxy-init -voms apdg

Troubleshooting:

1. If you see an error like:

Error: VERR_NOSOCKET Failed. Failed to contact servers for apdg.

Then:

The VOMS server and/or the User Interface are not time synchronised, or
the CRL on the UI and/or the VOMS server is out of date.



4.2 Run A Simple Job


Check the connectivity.

$ globus-job-run lcgce.kek.jp /bin/pwd

If this works, try:

$ globus-job-submit lcgce.kek.jp:2119/jobmanager-lcgpbs -queue apdg /bin/hostname
The command prints a job contact string (referred to below as $GLOBUS_ID).

Check the status of the job with:

$ globus-job-status $GLOBUS_ID

or

$ watch --interval=10 "globus-job-status $GLOBUS_ID"
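Instead of watching the status by eye, a small polling loop can wait for the job to finish. This is a sketch only: the function name poll_job is made up for this example, and it assumes that the status command prints a single word and that DONE and FAILED are the final states. Check the output of globus-job-status on your own system before relying on it.

```shell
#!/bin/bash
# Run a status command every INTERVAL seconds until it prints DONE or FAILED,
# or until MAX_TRIES polls have been made. Prints the last status seen and
# succeeds only if that status is DONE.
poll_job() {
    local interval="$1" max_tries="$2" status="" i=0
    shift 2
    while [ "$i" -lt "$max_tries" ]; do
        status=$("$@")
        case "$status" in
            DONE|FAILED) break ;;
        esac
        i=$((i + 1))
        # Do not sleep after the final attempt.
        [ "$i" -lt "$max_tries" ] && sleep "$interval"
    done
    echo "$status"
    [ "$status" = "DONE" ]
}

# Example: poll every 10 seconds, for at most 60 polls.
#   poll_job 10 60 globus-job-status "$GLOBUS_ID"
```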

When it reports done you can collect the output and error with the following commands.

$ globus-job-get-output $GLOBUS_ID
$ globus-job-get-output -err $GLOBUS_ID


Troubleshooting:

1. On the UI you see an error like:

init.c:499: globus_gss_assist_init_sec_context_async: Error during context initialization
init_sec_context.c:171: gss_init_sec_context: SSLv3 handshake problems
globus_i_gsi_gss_utils.c:888: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials
globus_i_gsi_gss_utils.c:854: globus_i_gsi_gss_handshake: SSLv3 handshake problems: Couldn't do ssl handshake
OpenSSL Error: s3_pkt.c:1046: in library: SSL routines, function SSL3_READ_BYTES: sslv3 alert certificate expired (error code 7)

In the globus-gatekeeper log you see:

accept_sec_context.c:170: gss_accept_sec_context: SSLv3 handshake problems
globus_i_gsi_gss_utils.c:881: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials
globus_i_gsi_gss_utils.c:854: globus_i_gsi_gss_handshake: SSLv3 handshake problems: Couldn't do ssl handshake
OpenSSL Error: s3_srvr.c:1816: in library: SSL routines, function SSL3_GET_CLIENT_CERTIFICATE: no certificate returned
globus_gsi_callback.c:351: globus_i_gsi_callback_handshake_callback: Could not verify credential
globus_gsi_callback.c:477: globus_i_gsi_callback_cred_verify: Could not verify credential
globus_gsi_callback.c:769: globus_i_gsi_callback_check_revoked: Invalid CRL: The available CRL has expiredFailure: GSS failed Major:000a0000 Minor:00000007 Token:00000000

Indicating that the CRL on the CE has expired.

2. If the job seems to hang, check the connectivity between the CE and WN. Is the firewall blocking the connection?

3. If you get the error: "GRAM Job submission failed because the job manager failed to open stderr (error code 74)"

Then modify the firewall on the UI to include the line:

iptables -I INPUT 5 -s lcgce.kek.jp -j ACCEPT

This is happening because Globus is trying to return the output back to the UI but the firewall is blocking the connection.



4.3 Is the UI correctly configured to access the RB?

Create a file in your home directory called "testJob.jdl" and add the following lines to it.

Executable = "testJob.sh";
StdOutput = "testJob.out";
StdError = "testJob.err";
InputSandbox = {"./testJob.sh"};
OutputSandbox = {"testJob.out","testJob.err"};
#Requirements = other.GlueCEUniqueID == "lcg-ce.kek.jp:2119/jobmanager-lcgpbs-apdg";

Create a second file called "testJob.sh" and add the following lines to it.

#!/bin/bash
date
hostname
echo "****************************************"
echo "env | sort"
echo "****************************************"
env | sort
echo "****************************************"
echo "mount"
echo "****************************************"
mount
echo "****************************************"
echo "rpm -q -a | sort"
echo "****************************************"
/bin/rpm -q -a | sort

sleep 20
date



To see which sites can run your job.

$ edg-job-list-match testJob.jdl


Selected Virtual Organisation name (from --vo option): apdg
Connecting to host lcgrb.kek.jp, port 7772

***************************************************************************
                         COMPUTING ELEMENT IDs LIST
 The following CE(s) matching your job requirements have been found:

                   *CEId*
 lcgce.cc.kek.jp:2119/jobmanager-lcgpbs-apdg
 
***************************************************************************

Create a local configuration file

$ cp /opt/edg/etc/edg_wl_ui_cmd_var.conf ./my-defaults.conf

Edit the file:

RetryCount = 1;
ErrorStorage =  "$HOME/jobError/";
OutputStorage = "$HOME/jobOutput/";


Submit the job to the resource broker

$ edg-job-submit -c my-defaults.conf testJob.jdl

Selected Virtual Organisation name (from proxy certificate extension): apdg
Connecting to host lcgrb.cc.kek.jp, port 7772
Logging to host lcgrb.cc.kek.jp, port 9002


*********************************************************************************************
                               JOB SUBMIT OUTCOME
 The job has been successfully submitted to the Network Server.
 Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:

 - https://lcgrb.cc.kek.jp:9000/cWqvS9GsZXlh8Kd9434YeA


*********************************************************************************************
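The job identifier is needed by every later command, so it is convenient to capture it in a shell variable rather than copying it by hand. A minimal sketch, assuming the output layout shown above (the identifier on a line of its own, introduced by a dash); the function name extract_job_id is invented for this example.

```shell
#!/bin/bash
# Print the first job identifier (an https URL on a line starting with "-")
# found in edg-job-submit output read from stdin.
extract_job_id() {
    sed -n 's|^[[:space:]]*-[[:space:]]*\(https://[^[:space:]]*\).*$|\1|p' | head -n 1
}

# Example:
#   JOB_ID=$(edg-job-submit -c my-defaults.conf testJob.jdl | extract_job_id)
#   edg-job-status "$JOB_ID"
```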

Check the status of the job

$ edg-job-status https://lcgrb.cc.kek.jp:9000/cWqvS9GsZXlh8Kd9434YeA

or

$ watch --interval=10 "edg-job-status https://lcgrb.cc.kek.jp:9000/cWqvS9GsZXlh8Kd9434YeA"

*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://lcgrb.cc.kek.jp:9000/cWqvS9GsZXlh8Kd9434YeA
Current Status:     Scheduled
Status Reason:      Job successfully submitted to Globus
Destination:          lcgce.cc.kek.jp:2119/jobmanager-lcgpbs-apdg
reached on:           Wed Nov  9 06:51:14 2005
*************************************************************

*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://lcgrb.cc.kek.jp:9000/cWqvS9GsZXlh8Kd9434YeA
Current Status:     Running
Status Reason:      Job successfully submitted to Globus
Destination:          lcgce.cc.kek.jp:2119/jobmanager-lcgpbs-apdg
reached on:           Wed Nov  9 06:54:43 2005
*************************************************************

Get the logging information of the job

$ edg-job-get-logging-info https://lcgrb.cc.kek.jp:9000/cWqvS9GsZXlh8Kd9434YeA

You can increase the verbosity of the logging info by specifying -v 1 or -v 2

Get the output from the job

$ edg-job-get-output -c my-defaults.conf https://lcgrb.cc.kek.jp:9000/cWqvS9GsZXlh8Kd9434YeA


Troubleshooting:

1. If you get an error like:

edg-job-list-match --vo apdg testJob.jdl

Selected Virtual Organisation name (from --vo option): apdg
**** Error: API_NATIVE_ERROR ****
Error while calling the "NSClient::multi" native api
AuthenticationException: Failed to establish security context...

**** Error: UI_NO_NS_CONTACT ****
Unable to contact any Network Server


Then the problem is likely to be that you are not getting mapped to a local account correctly. Edit /etc/grid-security/grid-mapfile to contain a line like:

"/C=AU/O=APAC-GRID/OU=The University of Melbourne/CN=Marco La Rosa" .apdg

(the distinguished name from your certificate: grid-cert-info -subject)


2. In the logging info, if you get an error like:

Event: Done
- exit_code             =    1
- host                     =    lcgrb.cc.kek.jp
- level                     =    SYSTEM
- priority                =    asynchronous
- reason                 =    Cannot read JobWrapper output, both from Condor and from Maradona.
- seqcode               =    UI=000003:NS=0000000003:WM=000008:BH=0000000000:JSS=000006:LM=000013:LRMS=000000:APP=000000
- source                  =    LogMonitor
- src_instance        =    unique
- status_code         =    FAILED
- timestamp            =    Thu Nov 10 01:06:55 2005
- user                      =    /C=AU/O=APAC-GRID/OU=The University of Melbourne/CN=Marco La Rosa

Check the connectivity between the CE and WN.

Is the firewall blocking the WN from connecting to the CE (necessary for PBS)?

Remove the firewall on the CE.

lcgce $ iptables -F

Remove the firewall on the WN.

lcgwn $ iptables -F

Retry the job submission.

If this is the problem, then an easy solution is to allow all access to / from the CE and WN.

On the CE.

lcgce $ iptables -I INPUT 5 -s lcgwn.cc.kek.jp -d lcgce.cc.kek.jp -j ACCEPT

On the WN.

lcgwn $ iptables -I INPUT 5 -s lcgce.cc.kek.jp -d lcgwn.cc.kek.jp -j ACCEPT




Is it an ssh problem?

On the CE.

lcgce $ su - apdg001
lcgce $ ssh lcgwn

Can you ssh to lcgwn as the apdg001 user without a password? If not, ssh has not been configured correctly.

On the WN.

lcgwn $ su - apdg001
lcgwn $ ssh lcgce

Can you ssh to lcgce as the apdg001 user without a password? If not, ssh has not been configured correctly.

On both the CE and WN, check that the WN is listed in the file /opt/lcg/yaim/examples/kekwn-list.conf and re-run the configuration script.

lcgce $ ./configure_node ../examples/keksite-info.def CE_torque
lcgwn $ ./configure_node ../examples/keksite-info.def WN_torque


Are the nodes time synchronised?

On the CE.

lcgce $ date

On the WN.

lcgwn $ date

Are the host certificate and CRL valid?


4.4 Disk Pool Manager Storage Element / Data Management


Log in to the User Interface and issue the following commands.

$ export DPNS_HOST=<YOUR-STORAGE-ELEMENT>
$ export DPM_HOST=<YOUR-STORAGE-ELEMENT>

Check that the DPM Name Server is working.

$ dpns-ls -l /grid
$ dpns-ls -l /grid/apdg
$ dpns-mkdir /grid/apdg/<YOUR-USERNAME>
$ dpns-ls -l /grid/apdg
$ dpns-rm -rf  /grid/apdg/<YOUR-USERNAME>

$ dpns-ls -l /dpm/cc.kek.jp/home/apdg
$ dpns-mkdir /dpm/cc.kek.jp/home/apdg/<YOUR-USERNAME>
$ dpns-ls -l /dpm/cc.kek.jp/home/apdg/
$ dpns-rm -rf  /dpm/cc.kek.jp/home/apdg/<YOUR-USERNAME>

Are the user ID and group ID as expected? Why or Why not?

Check that GridFTP works

$ dpns-mkdir /dpm/cc.kek.jp/home/apdg/<YOUR-USERNAME>
$ globus-url-copy file:/etc/group gsiftp://<YOUR-STORAGE-ELEMENT>/dpm/cc.kek.jp/home/apdg/<YOUR-USERNAME>/group


Section 5 - General Information.

5.1 Globus Job Managers.

Globus job managers open up to 4 network ports per job. These ports correspond to STDIN, STDOUT, STDERR and a control port. This is why the default configuration of Globus specifies a port range of 20,000 - 25,000.

Submit a job and then execute the following command on the CE.

$ watch --interval=10 "netstat -tnlp"

If you use globus-job-submit, then STDIN may not be opened, in which case only 3 ports will be open per job.
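The port range therefore puts an upper limit on the number of jobs that can hold ports at the same time. As a rough worked example (the function name max_jobs is made up, and the figure ignores any other use of ports in the range):

```shell
#!/bin/bash
# Number of jobs that fit in an inclusive port range when each job
# needs a fixed number of ports.
max_jobs() {
    local low="$1" high="$2" per_job="$3"
    echo $(( (high - low + 1) / per_job ))
}

max_jobs 20000 25000 4   # 5001 ports / 4 per job -> 1250
max_jobs 20000 25000 3   # 5001 ports / 3 per job -> 1667
```

If a site expects more concurrent jobs than this, the port range in the Globus configuration and the matching firewall rule both need to be widened.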