EPP Grid - HOW-TO Configure an LCG CE Host to Submit to Multiple PBS Clusters




Introduction


NB: This how-to assumes your clusters use PBS as the local resource management system (LRMS).

The standard installation of an LCG compute element implicitly assumes that it will be the front-end to one cluster. This is fine if you wish to build (and can resource) one CE per cluster at your site. However, if you wish to use one CE to submit to all of your clusters, then read on...

The basic idea is to create a separate Globus job manager for each cluster, with each one submitting to that cluster's PBS server.

For this example, we have two clusters: brecca and edda.

Globus / LCG - PBS Job Manager


  • Create the necessary job managers

cd /opt/globus/lib/perl/Globus/GRAM/JobManager

cp pbs.pm pbsbrecca.pm

cp pbs.pm pbsedda.pm
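With more than two clusters, the copies can be scripted. The following is a minimal sketch; copy_jobmanagers is a hypothetical helper name, not part of Globus or LCG, and it assumes the stock pbs.pm lives in the directory you pass in.

```shell
# Sketch: copy the stock pbs.pm once per cluster name.
# copy_jobmanagers is a hypothetical helper, not a Globus/LCG tool.
copy_jobmanagers() {
    jm_dir=$1
    shift
    for cluster in "$@"; do
        # -p keeps the mode and timestamps of the stock module
        cp -p "$jm_dir/pbs.pm" "$jm_dir/pbs$cluster.pm" || return 1
    done
}

# Example (on the CE):
# copy_jobmanagers /opt/globus/lib/perl/Globus/GRAM/JobManager brecca edda
```

Each copy still needs the package name, CPU count and qsub changes described next.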

Save the following diff to a file called pbsbrecca.p1.

--- pbs.pm      2005-10-10 16:15:19.000000000 +1000
+++ pbsbrecca.pm        2005-10-10 18:56:18.000000000 +1000
@@ -6,7 +6,7 @@
 use Config;

 # NOTE: This package name must match the name of the .pm file!!
-package Globus::GRAM::JobManager::pbs;
+package Globus::GRAM::JobManager::pbsbrecca;

 @ISA = qw(Globus::GRAM::JobManager);

@@ -19,7 +19,7 @@
     $qstat =  '/usr/bin/qstat';
     $qdel = '/usr/bin/qdel';
     $cluster = 1;
-    $cpu_per_node = 1;
+    $cpu_per_node = 2;
     $remote_shell = '/usr/bin/ssh';
 }

@@ -384,6 +384,9 @@
         $errfile = "2>>" . $description->logfile();
     }

+    # MLR 10/10/05
+    $qsub = "/usr/bin/qsub -q ".$description->queue()."\@brecca-m.vpac.org";
+
     $self->nfssync( $pbs_job_script_name );
     $self->log("submitting job -- $qsub < $pbs_job_script_name $errfile");
     chomp($job_id = `$qsub < $pbs_job_script_name $errfile`);

Note the new $qsub line: it submits directly to <queue>@brecca-m.vpac.org, so the job goes to brecca's PBS server rather than to the CE's default server.
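The patched line relies on the queue@server destination form that qsub -q accepts. As a small illustration of the string being built, here is a sketch in shell; pbs_destination is a hypothetical helper, and the queue name dque is taken from the sample output at the end of this page.

```shell
# Sketch: build the queue@server destination that the patched
# job manager passes to "qsub -q".
# pbs_destination is a hypothetical helper used for illustration only.
pbs_destination() {
    queue=$1
    server=$2
    printf '%s@%s\n' "$queue" "$server"
}

# Example:
# qsub -q "$(pbs_destination dque brecca-m.vpac.org)" job.sh
```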

Apply the patch to the pbsbrecca.pm file.

patch -Np0 pbsbrecca.pm pbsbrecca.p1

Save the following diff to a file called pbsedda.p1.

--- pbs.pm      2005-10-10 16:15:19.000000000 +1000
+++ pbsedda.pm  2005-10-10 16:15:43.000000000 +1000
@@ -6,7 +6,7 @@
 use Config;

 # NOTE: This package name must match the name of the .pm file!!
-package Globus::GRAM::JobManager::pbs;
+package Globus::GRAM::JobManager::pbsedda;

 @ISA = qw(Globus::GRAM::JobManager);

@@ -19,7 +19,7 @@
     $qstat =  '/usr/bin/qstat';
     $qdel = '/usr/bin/qdel';
     $cluster = 1;
-    $cpu_per_node = 1;
+    $cpu_per_node = 4;
     $remote_shell = '/usr/bin/ssh';
 }

@@ -384,6 +384,10 @@
         $errfile = "2>>" . $description->logfile();
     }

+
+    # MLR 10/10/05
+    $qsub =  "/usr/bin/qsub -q ".$description->queue()."\@edda-m.vpac.org";
+
     $self->nfssync( $pbs_job_script_name );
     $self->log("submitting job -- $qsub < $pbs_job_script_name $errfile");
     chomp($job_id = `$qsub < $pbs_job_script_name $errfile`);

Apply the patch to the pbsedda.pm file.

patch -Np0 pbsedda.pm pbsedda.p1
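As the comment in pbs.pm warns, the package name must match the name of the .pm file. A quick way to confirm that both patches took effect is sketched below; check_package_name is a hypothetical helper, not something shipped with Globus.

```shell
# Sketch: verify that a job manager module declares a package
# named after its own file (pbsbrecca.pm -> ...::pbsbrecca).
# check_package_name is a hypothetical helper.
check_package_name() {
    pm_file=$1
    pm_name=$(basename "$pm_file" .pm)
    grep -q "^package Globus::GRAM::JobManager::$pm_name;" "$pm_file"
}

# Example:
# for f in pbsbrecca.pm pbsedda.pm; do
#     check_package_name "$f" || echo "package name mismatch in $f"
# done
```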

Globus Gatekeeper


Ok, so we now have two new job managers that we need to make the Globus gatekeeper aware of.

cd /etc

Save the following diff output to a file called globus.conf.p1.

--- /etc/globus.conf.orig       2005-10-10 16:18:41.000000000 +1000
+++ /etc/globus.conf    2005-10-10 19:04:24.000000000 +1000
@@ -33,7 +33,7 @@
 globus_gatekeeper=/opt/edg/sbin/edg-gatekeeper
 extra_options=\"-lcas_db_file lcas.db -lcas_etc_dir /opt/edg/etc/lcas/ -lcasmod_dir /opt/edg/lib/lcas/ -lcmaps_db_file lcmaps.db -lcmaps_etc_dir /opt/edg/etc/lcmaps -lcmapsmod_dir /opt/edg/lib/lcmaps\"
 logfile=/var/log/globus-gatekeeper.log
-jobmanagers="fork pbs"
+jobmanagers="fork pbs pbsbrecca pbsedda"

 [gatekeeper/fork]
 type=fork
@@ -41,3 +41,13 @@

 [gatekeeper/pbs]
 type=pbs
+
+[gatekeeper/pbsbrecca]
+type=pbsbrecca
+job_manager=globus-job-manager
+machine_type=i686
+
+[gatekeeper/pbsedda]
+type=pbsedda
+job_manager=globus-job-manager
+machine_type=power64

And patch the globus.conf file with it.

patch -b -Np0 globus.conf globus.conf.p1

Restart the service.

service globus-gatekeeper restart
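If the gatekeeper does not pick up a new job manager, it is worth checking that globus.conf both lists the name in jobmanagers and has a matching [gatekeeper/...] section. A minimal sketch, assuming the file layout shown in the diff above; check_gatekeeper_conf is a hypothetical helper.

```shell
# Sketch: confirm a job manager is listed in jobmanagers="..."
# and has its own [gatekeeper/<name>] section.
# check_gatekeeper_conf is a hypothetical helper.
check_gatekeeper_conf() {
    conf=$1
    jm=$2
    # Split the jobmanagers list into one name per line and
    # require an exact match, so "pbs" does not match "pbsedda".
    grep '^jobmanagers=' "$conf" | cut -d= -f2 | tr -d '"' \
        | tr ' ' '\n' | grep -qxF "$jm" || return 1
    grep -qxF "[gatekeeper/$jm]" "$conf"
}

# Example:
# check_gatekeeper_conf /etc/globus.conf pbsbrecca || echo "pbsbrecca not configured"
```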

Almost there. Each new job manager also needs its own copy of the PBS .rvf file.

cd /opt/globus/share/globus_gram_job_manager/

cp pbs.rvf pbsbrecca.rvf

cp pbs.rvf pbsedda.rvf

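These .rvf files are Globus GRAM RSL validation files: they list the RSL attributes that a job manager type accepts, along with any defaults. The job manager loads the file named after its type, which is why each new type gets its own copy of pbs.rvf.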

So you know...

Under LCG-2, restarting the gatekeeper resulted in the creation of jobmanager-pbsbrecca and jobmanager-pbsedda in /opt/globus/etc/grid-services. These files look like:

> cat jobmanager-pbsbrecca

stderr_log,local_cred - /opt/globus/libexec/globus-job-manager globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type pbsbrecca -rdn jobmanager-pbsbrecca -machine-type i686 -publish-jobs 

> cat jobmanager-pbsedda

stderr_log,local_cred - /opt/globus/libexec/globus-job-manager globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type pbsedda -rdn jobmanager-pbsedda -machine-type power64 -publish-jobs

Now we also have these two job managers available to us.
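To confirm that each generated grid-services file points at the right job manager type, the -type argument can be compared with the file name. This is only a sketch based on the two files shown above; jobmanager_type and check_grid_service are hypothetical helper names.

```shell
# Sketch: extract the value of "-type" from a grid-services file
# and compare it with the file name (jobmanager-<type>).
# Both functions are hypothetical helpers.
jobmanager_type() {
    # " -type " needs a leading space, so "-machine-type" is not matched
    sed -n 's/.* -type \([^ ]*\).*/\1/p' "$1"
}

check_grid_service() {
    gs_file=$1
    expected=$(basename "$gs_file" | sed 's/^jobmanager-//')
    actual=$(jobmanager_type "$gs_file")
    [ "$expected" = "$actual" ]
}

# Example:
# check_grid_service /opt/globus/etc/grid-services/jobmanager-pbsedda
```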

Dynamic Information Providers


Now we need to modify the dynamic information providers to query the correct servers.

cd /opt/lcg/libexec/

rm lcg-info-dynamic-ce

Save the following in a file called lcg-info-dynamic-ce.

#!/bin/sh
/opt/lcg/libexec/lcg-info-dynamic-pbs /opt/lcg/var/gip/lcg-info-generic.conf brecca-m.vpac.org
/opt/lcg/libexec/lcg-info-dynamic-pbs /opt/lcg/var/gip/lcg-info-generic.conf edda-m.vpac.org

chmod +x lcg-info-dynamic-ce
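The wrapper simply calls lcg-info-dynamic-pbs once per PBS server. If you would rather generate it than type it, something like the following sketch would do; write_dynamic_ce is a hypothetical helper, and the paths are the ones used above.

```shell
# Sketch: write a wrapper that runs lcg-info-dynamic-pbs once
# for each PBS server given on the command line.
# write_dynamic_ce is a hypothetical helper.
write_dynamic_ce() {
    out=$1
    shift
    {
        echo '#!/bin/sh'
        for pbs_host in "$@"; do
            echo "/opt/lcg/libexec/lcg-info-dynamic-pbs /opt/lcg/var/gip/lcg-info-generic.conf $pbs_host"
        done
    } > "$out"
    chmod +x "$out"
}

# Example:
# write_dynamic_ce /opt/lcg/libexec/lcg-info-dynamic-ce brecca-m.vpac.org edda-m.vpac.org
```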

Save the following diff to lcg-info-dynamic-pbs.p1.

--- lcg-info-dynamic-pbs        2005-10-11 07:46:16.000000000 +1000
+++ lcg-info-dynamic-pbs.new    2005-10-11 07:46:53.000000000 +1000
@@ -23,6 +23,7 @@
 my $state;
 my $num_pro;
 my $Status;
+my $whichCluster;

 # Reads the configuration file
 if ($ARGV[0]) {
@@ -117,6 +118,21 @@
 close QSTAT;

 for(@dn){
+
+    # we need to match the $pbshost variable to the dn - if they don't match
+    #  we don't write it
+
+    if ($pbsHost=~/edda/) {
+        $whichCluster="edda";
+    }
+    else {
+        $whichCluster="brecca";
+    }
+
+    if(not $_ =~ $whichCluster) {
+        next;
+    }
+
     push @output, $_;
     $queue=$_;
     $queue=~s/,.*//;

And apply the patch to lcg-info-dynamic-pbs.

patch -b -Np0 lcg-info-dynamic-pbs lcg-info-dynamic-pbs.p1
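The patch hard-codes the two cluster names: any PBS host whose name does not contain "edda" is treated as brecca, and only DNs containing the chosen cluster name are published. The same selection logic, sketched in shell for illustration; filter_dns_for_host is a hypothetical helper, and the DN format in the example is made up.

```shell
# Sketch of the selection logic added to lcg-info-dynamic-pbs:
# choose a cluster name from the PBS host, then keep only the
# DNs (read from stdin) that mention that cluster.
# filter_dns_for_host is a hypothetical helper.
filter_dns_for_host() {
    case $1 in
        *edda*) which_cluster=edda ;;
        *)      which_cluster=brecca ;;
    esac
    # grep exits non-zero when nothing matches; treat that as "no output"
    grep -- "$which_cluster" || true
}

# Example (hypothetical DN):
# echo 'dn: GlueCEUniqueID=nglcg.vpac.org:2119/jobmanager-pbsedda-dque' \
#     | filter_dns_for_host edda-m.vpac.org
```

A site with a third cluster would need another branch in the patch as well as in this sketch.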

Static Information


Almost there... now we need to recreate the static information file. Luckily, we only need to do this once (this makes our job, and the hack, a lot easier).

cd /opt/lcg/var/gip

Back up the static ldif file the first time through. On later runs, restore it from the backup so the edits below always start from the original.

if [ -e lcg-info-static.ldif.orig ]; then
  cp lcg-info-static.ldif.orig lcg-info-static.ldif
else
  cp lcg-info-static.ldif lcg-info-static.ldif.orig
fi

From the original static ldif file create one for brecca and one for edda.

cp lcg-info-static.ldif brecca-tmp.ldif

cp lcg-info-static.ldif edda-tmp.ldif

Modify brecca's ldif file

# CHANGE "dn: GlueSiteUniqueID" TO brecca - this needs to be unique for each cluster
sed 's/dn: GlueSiteUniqueID=vpac,mds-vo-name=local,/dn: GlueSiteUniqueID=brecca,mds-vo-name=local,/' brecca-tmp.ldif > brecca.ldif
mv brecca.ldif brecca-tmp.ldif

# CHANGE ALL OCCURRENCES OF "jobmanager-pbs" --> "jobmanager-pbsbrecca"
sed 's/jobmanager-pbs/jobmanager-pbsbrecca/g' brecca-tmp.ldif > brecca.ldif
mv brecca.ldif brecca-tmp.ldif

# CHANGE THE RELEVANT STATIC INFORMATION TO SUIT THE CLUSTER
sed 's/GlueHostBenchmarkSI00: 1500/GlueHostBenchmarkSI00: 1000/' brecca-tmp.ldif > brecca.ldif
mv brecca.ldif brecca-tmp.ldif

sed 's/GlueHostMainMemoryRAMSize: 256/GlueHostMainMemoryRAMSize: 1000/' brecca-tmp.ldif > brecca.ldif
mv brecca.ldif brecca-tmp.ldif

sed 's/GlueHostMainMemoryVirtualSize: 512/GlueHostMainMemoryVirtualSize: 2000/' brecca-tmp.ldif > brecca.ldif
mv brecca.ldif brecca-tmp.ldif

sed 's/GlueHostNetworkAdapterInboundIP: FALSE/GlueHostNetworkAdapterInboundIP: FALSE/' brecca-tmp.ldif > brecca.ldif
mv brecca.ldif brecca-tmp.ldif

sed 's/GlueHostNetworkAdapterOutboundIP: TRUE/GlueHostNetworkAdapterOutboundIP: FALSE/' brecca-tmp.ldif > brecca.ldif
mv brecca.ldif brecca-tmp.ldif

sed 's/GlueHostOperatingSystemName: ScientificLinux/GlueHostOperatingSystemName: RedHat/' brecca-tmp.ldif > brecca.ldif
mv brecca.ldif brecca-tmp.ldif

sed 's/GlueHostOperatingSystemRelease: 3.0.5/GlueHostOperatingSystemRelease: 7.3/' brecca-tmp.ldif > brecca.ldif
mv brecca.ldif brecca-tmp.ldif

sed 's/GlueHostOperatingSystemVersion: 3/GlueHostOperatingSystemVersion: 7/' brecca-tmp.ldif > brecca.ldif
mv brecca.ldif brecca-tmp.ldif

sed 's/GlueHostProcessorClockSpeed: 3200/GlueHostProcessorClockSpeed: 2800/' brecca-tmp.ldif > brecca.ldif
mv brecca.ldif brecca-tmp.ldif

sed 's/GlueHostProcessorModel: PIV/GlueHostProcessorModel: Xeon/' brecca-tmp.ldif > brecca.ldif
mv brecca.ldif brecca-tmp.ldif

sed 's/GlueHostProcessorVendor: intel/GlueHostProcessorVendor: intel/' brecca-tmp.ldif > brecca.ldif
mv brecca.ldif brecca-tmp.ldif

Modify edda's ldif file

# CHANGE "dn: GlueSiteUniqueID" TO edda - this needs to be unique for each cluster
sed 's/dn: GlueSiteUniqueID=vpac,mds-vo-name=local,/dn: GlueSiteUniqueID=edda,mds-vo-name=local,/' edda-tmp.ldif > edda.ldif
mv edda.ldif edda-tmp.ldif

# CHANGE ALL OCCURRENCES OF "jobmanager-pbs" --> "jobmanager-pbsedda"
sed 's/jobmanager-pbs/jobmanager-pbsedda/g' edda-tmp.ldif > edda.ldif
mv edda.ldif edda-tmp.ldif

# CHANGE THE RELEVANT STATIC INFORMATION TO SUIT THE CLUSTER
sed 's/GlueHostBenchmarkSI00: 1500/GlueHostBenchmarkSI00: 1400/' edda-tmp.ldif > edda.ldif
mv edda.ldif edda-tmp.ldif

sed 's/GlueHostMainMemoryRAMSize: 256/GlueHostMainMemoryRAMSize: 8000/' edda-tmp.ldif > edda.ldif
mv edda.ldif edda-tmp.ldif

sed 's/GlueHostMainMemoryVirtualSize: 512/GlueHostMainMemoryVirtualSize: 16000/' edda-tmp.ldif > edda.ldif
mv edda.ldif edda-tmp.ldif

sed 's/GlueHostNetworkAdapterInboundIP: FALSE/GlueHostNetworkAdapterInboundIP: FALSE/' edda-tmp.ldif > edda.ldif
mv edda.ldif edda-tmp.ldif

sed 's/GlueHostNetworkAdapterOutboundIP: TRUE/GlueHostNetworkAdapterOutboundIP: FALSE/' edda-tmp.ldif > edda.ldif
mv edda.ldif edda-tmp.ldif

sed 's/GlueHostOperatingSystemName: ScientificLinux/GlueHostOperatingSystemName: SLES/' edda-tmp.ldif > edda.ldif
mv edda.ldif edda-tmp.ldif

sed 's/GlueHostOperatingSystemRelease: 3.0.5/GlueHostOperatingSystemRelease: 9/' edda-tmp.ldif > edda.ldif
mv edda.ldif edda-tmp.ldif

sed 's/GlueHostOperatingSystemVersion: 3/GlueHostOperatingSystemVersion: 9/' edda-tmp.ldif > edda.ldif
mv edda.ldif edda-tmp.ldif

sed 's/GlueHostProcessorClockSpeed: 3200/GlueHostProcessorClockSpeed: 1656/' edda-tmp.ldif > edda.ldif
mv edda.ldif edda-tmp.ldif

sed 's/GlueHostProcessorModel: PIV/GlueHostProcessorModel: Power5/' edda-tmp.ldif > edda.ldif
mv edda.ldif edda-tmp.ldif

sed 's/GlueHostProcessorVendor: intel/GlueHostProcessorVendor: ibm/' edda-tmp.ldif > edda.ldif
mv edda.ldif edda-tmp.ldif
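The sed/mv pairs above can be collapsed into a single sed call with several -e expressions, which avoids the intermediate files. A minimal sketch under that assumption; rewrite_ldif is a hypothetical helper, and the site ID "vpac" is the one used in the commands above.

```shell
# Sketch: produce a per-cluster ldif in one pass.
# Usage: rewrite_ldif SRC DST CLUSTER [extra sed arguments...]
# rewrite_ldif is a hypothetical helper.
rewrite_ldif() {
    src=$1
    dst=$2
    cluster=$3
    shift 3
    # The first two expressions are common to every cluster;
    # cluster-specific attribute changes are passed in as "$@".
    sed -e "s/dn: GlueSiteUniqueID=vpac,mds-vo-name=local,/dn: GlueSiteUniqueID=$cluster,mds-vo-name=local,/" \
        -e "s/jobmanager-pbs/jobmanager-pbs$cluster/g" \
        "$@" "$src" > "$dst"
}

# Example:
# rewrite_ldif lcg-info-static.ldif edda-tmp.ldif edda \
#     -e 's/GlueHostProcessorModel: PIV/GlueHostProcessorModel: Power5/' \
#     -e 's/GlueHostProcessorVendor: intel/GlueHostProcessorVendor: ibm/'
```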

Remove the original static ldif file.

rm lcg-info-static.ldif

Create a new static ldif file from the brecca and edda files.

cat brecca-tmp.ldif > lcg-info-static.ldif

cat edda-tmp.ldif >> lcg-info-static.ldif

Clean up

rm brecca-tmp.ldif edda-tmp.ldif
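Before relying on the new static file, a rough check is to make sure every cluster appears in it and that no entry still uses the plain jobmanager-pbs name. The sketch below assumes queue entries were originally named jobmanager-pbs-<queue>, as the sample output further down suggests; check_static_ldif is a hypothetical helper.

```shell
# Sketch: sanity-check the combined static ldif.
# Usage: check_static_ldif FILE CLUSTER...
# check_static_ldif is a hypothetical helper; the
# "jobmanager-pbs-<queue>" naming is an assumption.
check_static_ldif() {
    ldif=$1
    shift
    # A leftover "jobmanager-pbs-" means an entry was not renamed
    if grep -q 'jobmanager-pbs-' "$ldif"; then
        return 1
    fi
    for cluster in "$@"; do
        grep -q "jobmanager-pbs$cluster" "$ldif" || return 1
    done
}

# Example:
# check_static_ldif /opt/lcg/var/gip/lcg-info-static.ldif brecca edda
```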

Fingers crossed, lcg-infosites should now give you output like:

****************************************************************
These are the related data for belle: (in terms of queues and CPUs)
****************************************************************

#CPU    Free    Total Jobs      Running Waiting ComputingElement
----------------------------------------------------------
 144      11     107             36       71    nglcg.vpac.org:2119/jobmanager-pbsedda-dque
 144      11       0              0        0    nglcg.vpac.org:2119/jobmanager-pbsedda-grid
 144      11       0              0        0    nglcg.vpac.org:2119/jobmanager-pbsedda-lque
 144      11       0              0        0    nglcg.vpac.org:2119/jobmanager-pbsedda-sque
 178      38      96             59       37    nglcg.vpac.org:2119/jobmanager-pbsbrecca-dque
 178      38       0              0        0    nglcg.vpac.org:2119/jobmanager-pbsbrecca-grid
 178      38      11             11        0    nglcg.vpac.org:2119/jobmanager-pbsbrecca-lque
 178      38       5              0        5    nglcg.vpac.org:2119/jobmanager-pbsbrecca-sque
 144      11       0              0        0    nglcg.vpac.org:2119/jobmanager-pbsedda-testing

...
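To see at a glance how jobs are spread across the clusters, the lcg-infosites table can be summed per job manager. This is a sketch that assumes the six-column layout shown above; summarise_ces is a hypothetical helper.

```shell
# Sketch: read lcg-infosites output on stdin and print, per cluster,
# the total running and waiting jobs across all of its queues.
# summarise_ces is a hypothetical helper.
summarise_ces() {
    awk 'NF == 6 && $6 ~ /jobmanager-pbs/ {
        # Pull the cluster name out of ".../jobmanager-pbs<cluster>-<queue>"
        if (!match($6, /jobmanager-pbs[^-]+/)) next
        cluster = substr($6, RSTART + 14, RLENGTH - 14)
        running[cluster] += $4
        waiting[cluster] += $5
    }
    END {
        for (c in running) print c, running[c], waiting[c]
    }' | sort
}

# Example:
# lcg-infosites ... | summarise_ces
```

Each output line has the form "cluster running waiting".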

Revision:  r3 - 24 Oct 2005 - MarcoLaRosa
Authorised by:  Geoff Taylor (G.Taylor @ physics.unimelb.edu.au)