|
Throughout this workshop, commands that you need to execute are shown in a box like this one.
These apply to everyone, irrespective of which Grid component is being installed. Commands to be executed at the shell are prefixed with a $ prompt. |
| packages | An archive of Scientific Linux and LCG middleware packages. |
| certificates | Host certificates and private keys for each of the nodes. Use the certificate / key applicable to your node. |
| yaim-configs | The YAIM configuration files. There are two sets of files in this directory, for building Grid1 and Grid2. Use the files appropriate to the system you are a part of. |
| doc | Some useful documentation. Specifically, the LCG2-UserGuide (recommended reading) and for system administrators, the LCG2-PortTable. |
| $ rpm -qa | grep apt
If it is installed, you will see at least:
apt-0.5.15cnc6-8.SL.cern
(the version may be different). If it is not installed:
$ rpm -ivh http://linuxsoft.cern.ch/cern/slc30X/i386/SL/RPMS/apt-0.5.15cnc6-8.SL.cern.i386.rpm |
| $ cd /etc/apt/sources.list.d
Create a new file cern.list and add the following lines to it.
$ vi cern.list
rpm http://linuxsoft.cern.ch cern/slc305/i386/apt os updates extras
rpm-src http://linuxsoft.cern.ch cern/slc305/i386/apt os updates extras
rpm http://grid-deployment.web.cern.ch/grid-deployment/gis apt/LCG-2_6_0/sl3/en/i386 lcg_sl3 lcg_sl3.updates
Edit the file sl.list and comment out every line in it.
$ vi sl.list
I do this because I want to update my nodes only from the CERN repositories, not from a Scientific Linux repository. |
| $ cp /root/workshop/packages/*.rpm /var/cache/apt/archives/
$ apt-get update
$ apt-get dist-upgrade |
| Is Java installed?
$ ls /usr/java
Remember: it has to be installed from the RPM package. If it was, then on Red Hat-like systems it will be in /usr/java. If it is not installed:
$ rpm -ivh /var/cache/apt/archives/j2sdk-1_4_2_08-linux-i586.rpm |
| Install ntpd and ntpdate.
The network time server is ntp01.local.kek.jp (172.30.32.100).
Edit /etc/ntp.conf and add the following lines:
restrict 172.30.32.100 mask 255.255.255.255 nomodify notrap noquery
server 172.30.32.100
Edit /etc/ntp/step-tickers and add the following line:
172.30.32.100
Restart the service.
service ntpd restart
NTP is known to lose time periodically. To combat this, create a script in /etc/cron.daily called "sync-time.sh" and add the following lines to it.
#!/bin/bash
service ntpd stop
ntpdate ntp.unimelb.edu.au
ntpdate ntp.unimelb.edu.au
ntpdate ntp.unimelb.edu.au
service ntpd start
Run the script - you should see something like:
Shutting down ntpd: [ OK ]
11 May 02:22:58 ntpdate[25023]: step time server 128.250.5.101 offset 3.704193 sec
11 May 02:22:58 ntpdate[25038]: adjust time server 128.250.5.101 offset 0.000013 sec
11 May 02:22:59 ntpdate[25039]: adjust time server 128.250.5.101 offset -0.000001 sec
ntpd: Synchronizing with time server: [ OK ]
Starting ntpd:
To set the system time to UTC, add the following lines to /etc/sysconfig/clock
ZONE="UTC"
UTC=true
ARC=false
Then run:
cp /usr/share/zoneinfo/UTC /etc/localtime
hwclock --systohc --utc
Check that the hardware clock is set to UTC.
hwclock
You should see something like:
Thu 11 May 2006 02:23:38 UTC -0.918646 seconds |
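The offsets reported by ntpdate are worth checking after each run of the script. The sketch below is a minimal example that assumes the log line format shown in the box above; the helper names `ntp_offset` and `offset_exceeds` and the 1-second threshold are illustrative choices, not part of the workshop material.

```shell
#!/bin/bash
# Extract the offset (in seconds) from one line of ntpdate output.
# Assumes the format shown above: "... offset <value> sec".
ntp_offset() {
    echo "$1" | sed -n 's/.* offset \(-\{0,1\}[0-9.]*\) sec.*/\1/p'
}

# Succeed if the absolute value of offset $1 is larger than threshold $2.
offset_exceeds() {
    awk -v o="$1" -v t="$2" 'BEGIN { if (o < 0) o = -o; exit !(o > t) }'
}

# Sample line copied from the output above.
line='11 May 02:22:58 ntpdate[25023]: step time server 128.250.5.101 offset 3.704193 sec'
off=$(ntp_offset "$line")
if offset_exceeds "$off" 1; then
    echo "clock was off by ${off}s; check ntpd"
fi
```

A large offset on the first ntpdate call followed by tiny ones, as in the sample output, suggests the clock had drifted since the previous run.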
| Is it installed?
$ rpm -qa | grep yaim
If it is not installed:
$ apt-get update
$ apt-get install lcg-yaim |
| examples | This is where the site configuration files are located. In here you will find 3 files: site-info.def, users.conf and wn-list.conf. These files should be customised for your site. |
| functions | This is where the functions to configure the components are located. |
| scripts | This is where the main installation and configuration scripts are located. This directory contains 4 files: install_node (installs an LCG component), configure_node (configures that component), run_function (runs a specific YAIM configuration function) and node-info.def (defines which functions are necessary to configure the different node types). |
| keksite-info-Grid1.def | The main site configuration file. |
| kekusers.conf | The list of generic grid user accounts to create on each node. |
| kekwn-list-Grid1.conf | The list of worker nodes that Torque will be configured for. |
| keksite-info-Grid2.def | The main site configuration file. |
| kekusers.conf | The list of generic user accounts to create on each node. |
| kekwn-list-Grid2.conf | The list of worker nodes that Torque will be configured for. |
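| $ cp /root/workshop/yaim-configs/kek*Grid1* /opt/lcg/yaim/examples
OR
$ cp /root/workshop/yaim-configs/kek*Grid2* /opt/lcg/yaim/examples
The patterns above do not match kekusers.conf, which is shared by both grids, so copy it as well.
$ cp /root/workshop/yaim-configs/kekusers.conf /opt/lcg/yaim/examples
$ cd /opt/lcg/yaim/examples/
$ vi kek*
Read the files. If you are not sure what any of the variables mean or how they are used, ask and we will discuss it as a group. |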
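While reading the configuration files it can help to confirm that the variables you care about actually have values. The following is a rough sketch that assumes the site configuration file consists of plain shell `VAR=value` assignments; the helper `check_site_vars` is hypothetical, and the variable names in the usage note are only examples that may not match your file.

```shell
#!/bin/bash
# Report which of the named variables are empty or unset in a site
# configuration file. Assumes the file contains shell VAR=value lines.
# The file is sourced in a subshell so the calling shell is not modified.
# Note: sourcing executes the file, so only use this on files you trust.
check_site_vars() {
    (
        file=$1
        shift
        . "$file" || exit 2
        missing=0
        for name in "$@"; do
            # Look up the value of the variable whose name is in $name.
            eval "value=\${$name}"
            if [ -z "$value" ]; then
                echo "not set: $name"
                missing=1
            fi
        done
        exit "$missing"
    )
}
```

For example, `check_site_vars /opt/lcg/yaim/examples/keksite-info-Grid1.def CE_HOST RB_HOST` would print one line per empty variable and return a non-zero status if any are missing.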
| Node Type | Installation Target | Configuration Target | Meta-package Description |
| User Interface | lcg-UI | UI | User Interface component - access to the Grid |
| Resource Broker / BDII | lcg-RB lcg-BDII | RB BDII | Resource Broker component |
| Storage Element (Disk Pool Manager) | lcg-SE_dpm_mysql | SE_dpm_mysql | Storage Element component - Disk Pool Manager. |
| Storage Element (Classic) | lcg-SE_classic | SE_classic | Storage Element component - classic (Disk) |
| Compute Element (with Torque Server) | lcg-CE_torque | CE_torque | Compute Element component including Torque (PBS) LRMS server |
| Worker node (with Torque client) | lcg-WN_torque | WN_torque | Worker Node component including Torque (PBS) LRMS clients |
| Mon-box | lcg-MON | MON | RGMA based monitoring system collector server |
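Each row of the table pairs an installation target with a configuration target. The sketch below only prints the two YAIM commands for a node type instead of running them. The helper `yaim_commands` is hypothetical, the /opt/lcg/yaim/scripts path is inferred from the directory layout described earlier, and the argument order of install_node is assumed to match the configure_node usage shown in the troubleshooting section.

```shell
#!/bin/bash
# Print (do not run) the YAIM install and configure commands for a node type.
# $1 = site configuration file
# $2 = installation target(s), $3 = configuration target(s), as in the table
yaim_commands() {
    echo "/opt/lcg/yaim/scripts/install_node $1 $2"
    echo "/opt/lcg/yaim/scripts/configure_node $1 $3"
}

# Example: a Compute Element with Torque server on Grid1.
yaim_commands /opt/lcg/yaim/examples/keksite-info-Grid1.def lcg-CE_torque CE_torque
```

For the combined Resource Broker / BDII node, quote both targets, for example `yaim_commands site-info.def "lcg-RB lcg-BDII" "RB BDII"`.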
3.6 Compute Element (with Torque Server).
Use the netstat command to see what services are listening on which ports.
lcgce $ netstat -tnlp
You should see:
2119: edg-gatekeeper: globus gatekeeper service
9002: edg-wl-logd: workload manager logging daemon
40559, 40560: maui: maui scheduler
15004: maui: maui scheduler
15001: pbs_server: Local Resource Management System
2135: slapd: GRIS (Grid Resource Information System)
2170: bdii-fwd: Site BDII (equivalent to a GIIS: Grid Information Index System)
2171, 2172, 2173: slapd
2811: ftpd: ftp daemon
Important files / locations.
Globus
/etc/globus.conf: gatekeeper configuration
/etc/sysconfig/globus: main globus configuration file - specifies port range for GRAM service
/opt/globus/etc/grid-services: jobmanager definitions
/opt/globus/lib/perl/Globus/GRAM/JobManager: Perl jobmanagers
/var/log/globus-gatekeeper.log: gatekeeper log file
BDII
/opt/bdii/etc/bdii-update.conf: the configuration file for the site BDII service
/opt/lcg/libexec/*: scripts to collect dynamic information for the information system
/opt/lcg/var/gip/*: location of the Generic Information Provider (GIP) info
PBS and Maui
/var/spool/pbs: PBS configuration directory
/var/spool/pbs/server_name: the name of the PBS server this compute element will submit jobs to
/var/spool/maui: Maui configuration directory
/var/spool/maui/maui.cfg: the maui configuration file - cluster scheduling policy
/var/spool/pbs/server_logs/*: PBS log files
Hint:
1. If your site already has a PBS cluster and you want to "Grid enable" it, just configure a normal LCG CE and then change the name listed in the server_name file. Make sure the PBS server you are submitting to will accept jobs from the CE.
2. Notes describing how to configure a single CE so that it can submit to multiple clusters can be found here.
Configure the firewall on the node.
lcgce $ iptables -F
lcgce $ iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
lcgce $ iptables -A INPUT -s `hostname` -d `hostname` -j ACCEPT
lcgce $ iptables -A INPUT -s localhost.localdomain -d localhost.localdomain -j ACCEPT
lcgce $ iptables -A INPUT -p tcp --dport 22 -j ACCEPT
lcgce $ iptables -A INPUT -s ntp01.local.kek.jp -p udp --dport 123 -j ACCEPT
lcgce $ iptables -A INPUT -p tcp --dport 9002 -j ACCEPT
lcgce $ iptables -A INPUT -p tcp --dport 2119 -j ACCEPT
lcgce $ iptables -A INPUT -p tcp --dport 15001 -j ACCEPT
lcgce $ iptables -A INPUT -p tcp --dport 15004 -j ACCEPT
lcgce $ iptables -A INPUT -p tcp --dport 2170 -j ACCEPT
lcgce $ iptables -A INPUT -p tcp --dport 2811 -j ACCEPT
lcgce $ iptables -A INPUT -p tcp --dport 20000:25000 -j ACCEPT
lcgce $ iptables -A INPUT -j DROP
lcgce $ service iptables save |
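The per-port ACCEPT rules all follow one pattern, so they can be generated from a port list. This is a minimal sketch, not part of the workshop material: the helper `accept_tcp_ports` is hypothetical, and it only prints the commands so that they can be reviewed before anything is applied.

```shell
#!/bin/bash
# Print one ACCEPT rule per TCP port (or port range) given as an argument.
# Nothing is applied to the firewall; the output is meant to be reviewed first.
accept_tcp_ports() {
    for port in "$@"; do
        echo "iptables -A INPUT -p tcp --dport $port -j ACCEPT"
    done
}

# TCP ports taken from the CE rule list above.
accept_tcp_ports 22 9002 2119 15001 15004 2170 2811 20000:25000
```

After applying rules, `iptables -L INPUT -n --line-numbers` lists the active INPUT chain with rule numbers, which is useful for checking the order.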
| Install your user certificate and private key
$ mkdir .globus
$ cd .globus
$ su -
Enter the root password
$ cp /root/workshop/certificates/user-certificates/<USER-NAME>/usercert.pem ~<USER-NAME>/.globus/
$ cp /root/workshop/certificates/user-certificates/<USER-NAME>/userkey.pem ~<USER-NAME>/.globus/
$ cd ~<USER-NAME>/.globus/
$ chown <USER-NAME>:<USER-NAME> *
$ chmod 444 usercert.pem
$ chmod 400 userkey.pem
Exit from root
For a normal globus proxy certificate:
$ grid-proxy-init
For a VOMS proxy certificate:
$ voms-proxy-init -voms apdg |
|
Check the connectivity.
$ globus-job-run lcgce.kek.jp /bin/pwd
If this works, try:
$ globus-job-submit lcgce.kek.jp:2119/jobmanager-lcgpbs -queue apdg /bin/hostname
GLOBUS_ID
The command prints a job contact string (GLOBUS_ID above); the commands below assume it is stored in the shell variable GLOBUS_ID.
Check the status of the job with:
$ globus-job-status $GLOBUS_ID
or
$ watch --interval=10 "globus-job-status $GLOBUS_ID"
When it reports done, you can collect the output and error with the following commands.
$ globus-job-get-output $GLOBUS_ID
$ globus-job-get-output -err $GLOBUS_ID |
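Instead of watching the status by hand, the polling can be scripted. The sketch below is an illustration only: the helper `wait_for_done` is hypothetical, and it assumes that the status command prints the single word DONE when the job has finished.

```shell
#!/bin/bash
# Poll a status command until it prints DONE, or give up after max_tries.
# $1 = seconds between polls, $2 = maximum number of polls,
# remaining arguments = the status command to run.
wait_for_done() {
    interval=$1
    max_tries=$2
    shift 2
    i=0
    while [ "$i" -lt "$max_tries" ]; do
        status=$("$@")
        echo "status: $status"
        if [ "$status" = "DONE" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}
```

A typical call would be `wait_for_done 10 60 globus-job-status "$GLOBUS_ID"`, followed by the globus-job-get-output commands once it returns successfully.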
#!/bin/bash
# Simple test job: report when and where it ran, then dump the environment,
# the mounted filesystems and the installed packages of the worker node.
date
hostname
echo "****************************************"
echo "env | sort"
echo "****************************************"
env | sort
echo "****************************************"
echo "mount"
echo "****************************************"
mount
echo "****************************************"
echo "rpm -q -a | sort"
echo "****************************************"
/bin/rpm -q -a | sort
sleep 20
date
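The commands that follow refer to a job description file, testJob.jdl. A minimal sketch of such a file is written by the helper below. The helper name `write_test_jdl`, the script name testJob.sh and the output file names are assumptions made for illustration; adjust them to match how you saved the script above.

```shell
#!/bin/bash
# Write a minimal job description to the file named in $1.
# Assumes the test script above was saved as testJob.sh (a hypothetical name).
write_test_jdl() {
    cat > "$1" <<'EOF'
Executable = "testJob.sh";
StdOutput = "testJob.out";
StdError = "testJob.err";
InputSandbox = {"testJob.sh"};
OutputSandbox = {"testJob.out", "testJob.err"};
EOF
}
```

Running `write_test_jdl testJob.jdl` in the directory that holds the script produces a file that ships the script to the worker node and returns its standard output and error.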
To see which sites can run your job.
$ edg-job-list-match testJob.jdl
Selected Virtual Organisation name (from --vo option): apdg
Connecting to host lcgrb.kek.jp, port 7772
***************************************************************************
COMPUTING ELEMENT IDs LIST
The following CE(s) matching your job requirements have been found:
*CEId*
lcgce.cc.kek.jp:2119/jobmanager-lcgpbs-apdg
***************************************************************************
Create a local configuration file
$ cp /opt/edg/etc/edg_wl_ui_cmd_var.conf ./my-defaults.conf
Edit the file:
RetryCount = 1;
ErrorStorage = "$HOME/jobError/";
OutputStorage = "$HOME/jobOutput/";
Submit the job to the resource broker
$ edg-job-submit -c my-defaults.conf testJob.jdl
Selected Virtual Organisation name (from proxy certificate extension): apdg
Connecting to host lcgrb.cc.kek.jp, port 7772
Logging to host lcgrb.cc.kek.jp, port 9002
*********************************************************************************************
JOB SUBMIT OUTCOME
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status.
Your job identifier (edg_jobId) is:
- https://lcgrb.cc.kek.jp:9000/cWqvS9GsZXlh8Kd9434YeA
*********************************************************************************************
Check the status of the job
$ edg-job-status https://lcgrb.cc.kek.jp:9000/cWqvS9GsZXlh8Kd9434YeA
or
$ watch --interval=10 "edg-job-status https://lcgrb.cc.kek.jp:9000/cWqvS9GsZXlh8Kd9434YeA"
*************************************************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://lcgrb.cc.kek.jp:9000/cWqvS9GsZXlh8Kd9434YeA
Current Status: Scheduled
Status Reason: Job successfully submitted to Globus
Destination: lcgce.cc.kek.jp:2119/jobmanager-lcgpbs-apdg
reached on: Wed Nov 9 06:51:14 2005
*************************************************************
*************************************************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://lcgrb.cc.kek.jp:9000/cWqvS9GsZXlh8Kd9434YeA
Current Status: Running
Status Reason: Job successfully submitted to Globus
Destination: lcgce.cc.kek.jp:2119/jobmanager-lcgpbs-apdg
reached on: Wed Nov 9 06:54:43 2005
*************************************************************
Get the logging information of the job
$ edg-job-get-logging-info https://lcgrb.cc.kek.jp:9000/cWqvS9GsZXlh8Kd9434YeA
You can increase the verbosity of the logging info by specifying -v 1 or -v 2
Get the output from the job
$ edg-job-get-output -c my-defaults.conf https://lcgrb.cc.kek.jp:9000/cWqvS9GsZXlh8Kd9434YeA |
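When the status is needed in a script, the "Current Status" field can be pulled out of the bookkeeping output. A minimal sketch, assuming the output layout shown above; the helper name `current_status` is hypothetical.

```shell
#!/bin/bash
# Print the value of the first "Current Status:" field found in
# edg-job-status output read from standard input.
# Assumes the "Current Status: <value>" layout shown above.
current_status() {
    awk -F': *' '/Current Status:/ { sub(/[ \t]+$/, "", $2); print $2; exit }'
}
```

Typical use would be `edg-job-status <jobId> | current_status`, which for the two sample reports above would print Scheduled and Running respectively.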
| Is the firewall blocking the WN from connecting to the CE (necessary for PBS)?
Remove the firewall on the CE.
lcgce $ iptables -F
Remove the firewall on the WN.
lcgwn $ iptables -F
Retry the job submission. If this is the problem, then an easy solution is to allow all access to / from the CE and WN. Restore the saved rules first (service iptables restart), because -I INPUT 5 fails if the chain has fewer than four rules.
On the CE.
lcgce $ iptables -I INPUT 5 -s lcgwn.cc.kek.jp -d lcgce.cc.kek.jp -j ACCEPT
On the WN.
lcgwn $ iptables -I INPUT 5 -s lcgce.cc.kek.jp -d lcgwn.cc.kek.jp -j ACCEPT |
| Is it an ssh problem?
On the CE.
lcgce $ su - apdg001
lcgce $ ssh lcgwn
Can you ssh to lcgwn as apdg001 without a password? If not, ssh has not been configured correctly.
On the WN.
lcgwn $ su - apdg001
lcgwn $ ssh lcgce
Can you ssh to lcgce as apdg001 without a password? If not, ssh has not been configured correctly.
On both the CE and WN, check that the WN is listed in the file /opt/lcg/yaim/examples/kekwn-list.conf and re-run the configuration script from the scripts directory.
lcgce $ ./configure_node ../examples/keksite-info.def CE_torque
lcgwn $ ./configure_node ../examples/keksite-info.def WN_torque |
| Are the nodes time synchronised?
On the CE.
lcgce $ date
On the WN.
lcgwn $ date |
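Comparing two date strings by eye is error-prone, so the difference can be computed from epoch seconds instead. This is a small sketch; the helper `clock_skew` is hypothetical, and the usage note assumes that `date +%s` is available on both nodes and that you can ssh from the CE to the WN.

```shell
#!/bin/bash
# Print the absolute difference, in seconds, between two epoch timestamps.
clock_skew() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))
    fi
    echo "$diff"
}
```

On the CE, something like `clock_skew "$(date +%s)" "$(ssh lcgwn date +%s)"` would report the skew, including a second or so of ssh latency.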
| Are the host certificate and CRL valid? |
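One way to check certificate validity is with openssl. The sketch below covers only the certificate, not the CRL. The helper `cert_valid` is hypothetical, and the path /etc/grid-security/hostcert.pem in the usage note is an assumption about where the host certificate was installed.

```shell
#!/bin/bash
# Succeed if the certificate in file $1 is still valid $2 seconds from now.
# $2 defaults to 0, meaning "valid right now".
cert_valid() {
    openssl x509 -in "$1" -noout -checkend "${2:-0}" >/dev/null
}
```

For example, `cert_valid /etc/grid-security/hostcert.pem 86400 || echo "host certificate expires within a day"`. To see the expiry date itself, `openssl x509 -in <file> -noout -enddate` prints it.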
| Log in to the User Interface and issue the following commands.
$ export DPNS_HOST=<YOUR-STORAGE-ELEMENT>
$ export DPM_HOST=<YOUR-STORAGE-ELEMENT>
Check that the DPM Name Server is working.
$ dpns-ls -l /grid
$ dpns-ls -l /grid/apdg
$ dpns-mkdir /grid/apdg/<YOUR-USERNAME>
$ dpns-ls -l /grid/apdg
$ dpns-rm -rf /grid/apdg/<YOUR-USERNAME>
$ dpns-ls -l /dpm/cc.kek.jp/home/apdg
$ dpns-mkdir /dpm/cc.kek.jp/home/apdg/<YOUR-USERNAME>
$ dpns-ls -l /dpm/cc.kek.jp/home/apdg/
$ dpns-rm -rf /dpm/cc.kek.jp/home/apdg/<YOUR-USERNAME>
Are the user ID and group ID as expected? Why or why not?
Check that GridFTP works.
$ dpns-mkdir /dpm/cc.kek.jp/home/apdg/<YOUR-USERNAME>
$ globus-url-copy file:/etc/group gsiftp://<YOUR-STORAGE-ELEMENT>/dpm/cc.kek.jp/home/apdg/<YOUR-USERNAME>/group |
| Submit a job and then execute the following command on the CE.
$ watch --interval=10 "netstat -tnlp"
If you use globus-job-submit, then STDIN may not be opened, in which case only 3 ports will be open per job. |
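To put a number on how many ports a job opens, the listening sockets can be counted before and after submission. A minimal sketch, assuming the usual `netstat -tnlp` column layout in which the sixth field is the socket state; the helper name `count_listening` is hypothetical.

```shell
#!/bin/bash
# Count the sockets in LISTEN state in netstat -tnlp output read from
# standard input. Assumes the state is the sixth whitespace-separated field.
count_listening() {
    awk '$6 == "LISTEN" { n++ } END { print n + 0 }'
}
```

Running `netstat -tnlp | count_listening` on the CE before and during a job gives two counts whose difference is the number of ports opened for that job.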