EPP Grid - LCG Grid Middleware Deployment



LCG Grid Middleware Deployment


Overview

This document details the steps required to install and configure an LCG node to act as a front end to an existing computing cluster. This configuration assumes that the compute cluster to which the front end is attached uses PBS (job queueing system) / MAUI (job scheduling) and has the user filesystems (/home) NFS mounted from the server/management node (PBS server node) to all of the compute nodes.

LCG middleware is certified to run on Scientific Linux 3 (SL3). For this installation we have chosen Scientific Linux 3.0.4, which is based on Red Hat Enterprise Linux 3, and upgraded it to 3.0.5 as described in the notes below.

Prior to detailing the installation procedure, it is worthwhile outlining briefly the architecture of the LCG middleware. The LCG middleware is defined in terms of components. These components include Worker Nodes (WN - in traditional HPC parlance these equate to compute nodes), a Compute Element (CE - which in traditional HPC parlance equates to a management / login node; i.e. cluster front-end), a Storage Element (SE), a Resource Broker (RB), a Monitoring Node (MON), an Information Index (BDII) and a User Interface (UI).

In the context of the LHC Computing Grid, each of these components would be installed on a different physical computer (example). In this cartoon, each component of the middleware (blue boxes) is installed on a different physical computer - a computer per 'service'. However, this is not essential.

In the configuration discussed here, an LCG front-end containing CE, SE and UI components is configured and attached to an existing cluster (example). In this example, the user interface, compute element and storage element components are combined on an 'LCG front-end' node which is then attached to / directed at existing resources.

Note: It is known that CE and RB components can NOT coexist on the same physical machine.

The LCG installation guide can be found here.

Operating system installation and configuration

Install Scientific Linux 3.0.5

Perform a base server install only. Package management is handled by apt-get, so any LCG middleware dependencies are resolved during installation.

Notes:
(1) Do not install SL from the 3.0.3 ISOs downloaded from SL.org; they do not work. Use either 3.0.2 or 3.0.4. Upgrading to 3.0.5 then simply involves bringing the system up to the latest version with apt-get:

apt-get update

apt-get dist-upgrade

(2) The SL servers are *hugely* overloaded. I have found that CERN's repositories respond much better. To use them, add the following lines to your sources list:

rpm http://linuxsoft.cern.ch cern/slc305/i386/apt os updates extras
rpm-src http://linuxsoft.cern.ch cern/slc305/i386/apt os updates extras

Furthermore, discussion on the LCG-rollout mailing list (http://www.listserv.rl.ac.uk/cgi-bin/wa.exe?SUBED1=lcg-rollout&A=1) has revealed that there are differences between standard and CERN Scientific Linux. This is something to be aware of when using CERN's repositories.
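The repository lines above can be added by hand. As a minimal sketch, assuming you would rather script the change so that it can be re-run safely, the helper below appends each line only if it is not already present. The function names are ours and purely illustrative.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
# Usage: add_line_once FILE LINE
add_line_once() {
    file=$1
    line=$2
    # Create the file if it is missing so that grep has something to read.
    [ -e "$file" ] || : > "$file"
    # -F: fixed string, -x: match the whole line, -q: quiet.
    grep -Fxq -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Add the two CERN repository lines quoted above to an apt sources file.
# Usage: add_cern_repos FILE
add_cern_repos() {
    add_line_once "$1" "rpm http://linuxsoft.cern.ch cern/slc305/i386/apt os updates extras"
    add_line_once "$1" "rpm-src http://linuxsoft.cern.ch cern/slc305/i386/apt os updates extras"
}
```

For example, run add_cern_repos against your sources list as root, then run apt-get update. Running it a second time leaves the file unchanged.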

Configuring SSH to allow HostbasedAuthentication between nodes

Given that our intention is to attach the LCG node to an existing cluster (and assuming that inter-node communication is configured via SSH HostbasedAuthentication), only the configuration of SSH on the LCG node will be detailed here.

  • In the ssh_config file on the LCG node (/etc/ssh/ssh_config), add the following lines:

HostbasedAuthentication yes
EnableSSHKeysign yes

(In most implementations of OpenSSH, enabling host-based authentication automatically enables SSH keysign. However, on SL3 it was necessary to set the EnableSSHKeysign option explicitly.)

  • In the sshd_config file on the LCG node (/etc/ssh/sshd_config), add the following lines:

Port 22
HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_dsa_key
StrictModes yes
RSAAuthentication yes
RhostsAuthentication no
IgnoreRhosts yes
RhostsRSAAuthentication no
HostbasedAuthentication yes
IgnoreUserKnownHosts yes
PasswordAuthentication yes
PermitEmptyPasswords no
X11Forwarding no
KeepAlive yes

  • Create a shosts.equiv file in the /etc/ssh directory

This file should have the format:

machine1
machine1.fully.qualified.name
machine2
machine2.fully.qualified.name

etc...

  • Create a ssh_known_hosts file in the /etc/ssh directory

This file should have the format:

machine1,machine1.fully.qualified.name,IP address ssh-rsa rsa-key
machine2,machine2.fully.qualified.name,IP address ssh-rsa rsa-key
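For more than a handful of nodes, typing these two files by hand is error-prone. The following is a rough sketch, not a tested site tool, of how the entries could be generated in the formats shown above. The function names, and the host names, domain and key used with them, are placeholders.

```shell
#!/bin/sh
# Print shosts.equiv entries (short name, then fully qualified name)
# for each short host name given.
# Usage: shosts_entries DOMAIN HOST...
shosts_entries() {
    domain=$1
    shift
    for host in "$@"; do
        printf '%s\n%s.%s\n' "$host" "$host" "$domain"
    done
}

# Print one ssh_known_hosts line in the format shown above:
#   short,fqdn,ip keytype key
# Usage: known_hosts_line HOST DOMAIN IP KEYTYPE KEY
known_hosts_line() {
    printf '%s,%s.%s,%s %s %s\n' "$1" "$1" "$2" "$3" "$4" "$5"
}
```

For instance, shosts_entries ph.unimelb.edu.au machine1 machine2 prints four lines suitable for /etc/ssh/shosts.equiv. The key type and key passed to known_hosts_line are the first two fields of the public half of each node's host key (the .pub file next to /etc/ssh/ssh_host_rsa_key).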

Configure tcp-wrappers

In the hosts.allow file add the following line:

ALL: *.ph.unimelb.edu.au

In the hosts.deny file add the following line:

ALL: ALL

Configure the firewall - IPTables

THIS IS ONLY A GUIDE

The LCG middleware (and all other components from which it is derived) provides services which require access both internally (e.g. to the computing cluster) and externally (e.g. to the global LHC grid). Clearly, the LCG front-end will act as a gateway between the outside world and the resources to which it is attached.

Here is an example iptables setup script. Modify it as needed.

NB: Network security policy will be site-dependent. These notes refer to an isolated test deployment of the LCG middleware.
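As a starting point only, the sketch below prints (it does not apply) a default-deny set of INPUT rules for a list of TCP ports that you supply. It is not the site's own script, and the function name is ours. Which ports to open depends on the services you run and on your site policy.

```shell
#!/bin/sh
# Print a minimal INPUT rule set: accept loopback traffic, established
# connections and the listed TCP ports, then drop everything else.
# Ports may be single numbers or iptables ranges such as 20000:25000.
# Usage: emit_input_rules PORT...
emit_input_rules() {
    echo "iptables -A INPUT -i lo -j ACCEPT"
    echo "iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT"
    for port in "$@"; do
        echo "iptables -A INPUT -p tcp --dport $port -j ACCEPT"
    done
    # Set the default policy last so that a remote session is not cut
    # off before the accept rules are in place.
    echo "iptables -P INPUT DROP"
}
```

For example, emit_input_rules 22 2119 2811 prints rules for SSH plus the ports conventionally used by the Globus gatekeeper (2119) and GridFTP (2811). Those two port numbers are assumptions here and should be checked against your own configuration. Review the output before running it as root.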

Configure network time synchronisation

In this case we use the University network time server.

In the ntp.conf file in /etc, add the following lines:

restrict 128.250.5.101 mask 255.255.255.255 nomodify notrap noquery
server ntp.unimelb.edu.au

In the step-tickers file in /etc/ntp, add the following lines:

128.250.5.101

Restart the service:

service ntpd restart

LCG middleware installation and configuration

Download and install the LCG installer package - YAIM (Yet Another Installation Method)

Go to http://www.cern.ch/grid-deployment/gis/yaim/ and download the latest version of the installer (at the time of this writing it is lcg-yaim-2.6.0-8.noarch.rpm).

wget http://www.cern.ch/grid-deployment/gis/yaim/lcg-yaim-2.6.0-8.noarch.rpm

rpm -ivh lcg-yaim-2.6.0-8.noarch.rpm

or

add the following line to /etc/apt/sources.list.d/cern.list:

rpm http://grid-deployment.web.cern.ch/grid-deployment/gis apt/LCG-2_6_0/sl3/en/i386 lcg_sl3 lcg_sl3.updates

and then

apt-get install lcg-yaim

YAIM is a script-based installation method. It is installed in /opt/lcg/yaim. Site information is defined in the file site-info.def, the generic grid user accounts (that will be created) are defined in the file users.conf, and the list of worker nodes (compute nodes in traditional parlance) of the cluster is defined in the file wn-list.conf.

Note: Field explanations are provided in the configuration files
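Before running the installer it can save time to confirm that all three configuration files are in place. A minimal sketch, assuming the three files sit together in one directory (the function name is ours):

```shell
#!/bin/sh
# Report which of the three YAIM configuration files named above are
# missing from (or unreadable in) a directory. Prints one file name per
# line and returns non-zero if any is missing.
# Usage: check_yaim_config DIR
check_yaim_config() {
    dir=$1
    status=0
    for name in site-info.def users.conf wn-list.conf; do
        if [ ! -r "$dir/$name" ]; then
            echo "$name"
            status=1
        fi
    done
    return $status
}
```

For example, check_yaim_config /opt/lcg/yaim/examples checks the directory from which site-info.def is read later in this document. Whether users.conf and wn-list.conf live in that same directory on your system is an assumption to verify.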

Before commencing the installation

Download and install j2sdk1.4.2_08 (get it from http://java.sun.com).

At a shell (bash/sh):
export JAVA_HOME=/usr/java/j2sdk1.4.2_08
export PATH=${PATH}:/opt/condor/bin

A brief explanation: Java is used extensively. However, due to its licensing, it cannot be distributed through the repositories.
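A wrongly set JAVA_HOME tends to surface only partway through the installation. As a small sanity check, assuming a usable JDK has an executable bin/java beneath its top-level directory, something like the following could be run first (the function name is ours):

```shell
#!/bin/sh
# Succeed only if DIR is non-empty and contains an executable bin/java.
# Usage: check_java_home DIR
check_java_home() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}
```

For example: check_java_home "$JAVA_HOME" || echo "JAVA_HOME does not point at a Java installation" >&2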

Installation of LCG components

The LCG middleware is subdivided into components. For this installation method, all of the components will be installed on one physical machine.

Change to the yaim directory (/opt/lcg/yaim/scripts):

cd /opt/lcg/yaim/scripts

./install_node ../examples/site-info.def lcg-CE-torque lcg-SECLASSIC lcg-UI

This script:
- updates the packages already installed
- downloads and installs any packages required by the middleware which are not yet installed
- downloads and installs the packages relating to the component of the middleware being installed
- downloads and installs the Certificate Authority bundles

Install host certificates / Certificate Authority (CA) bundles

Install the hostcert.pem and hostkey.pem files in /etc/grid-security/

cp hostcert.pem /etc/grid-security/

cp hostkey.pem /etc/grid-security/
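File permissions matter here: Globus tools conventionally expect the host certificate to be world-readable (mode 644) and the private key to be readable by its owner only (mode 400). A sketch that copies both files and sets those modes, with the destination directory passed in as a parameter and an illustrative function name, might look like this:

```shell
#!/bin/sh
# Copy a host certificate and key into a grid-security directory and
# set the conventional permissions: 644 for the certificate, 400 for
# the private key.
# Usage: install_host_credentials CERT KEY DEST_DIR
install_host_credentials() {
    cert=$1
    key=$2
    dest=$3
    mkdir -p "$dest" || return 1
    cp -f "$cert" "$dest/hostcert.pem" || return 1
    cp -f "$key" "$dest/hostkey.pem" || return 1
    chmod 644 "$dest/hostcert.pem" || return 1
    chmod 400 "$dest/hostkey.pem"
}
```

For example, as root: install_host_credentials hostcert.pem hostkey.pem /etc/grid-security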

In this case, because we will initially be disconnected from the LCG global grid, we need to manually install the certificate bundle of the authority that issued these certificates (A2G).

see: http://www.vpac.org/a2g

Configure the components

./configure_node ../examples/site-info.def CE_torque classic_SE UI

Completing the configuration of LCG node

  • 'Pointing' the LCG node at an existing PBS server

In the server_name file in /var/spool/pbs/, change the name of the machine to the fully qualified domain name of the node which runs the PBS server of the cluster.

  • PBS mom config file

Although the LCG node will be attached to an existing cluster, it is still necessary to tell it what methods it can use to move data. Create a file named config in the directory /var/spool/pbs/mom_priv/ containing the following lines:

$clienthost server_name
$usecp *.ph.unimelb.edu.au:/home/ /home/
$usecp *.ph.unimelb.edu.au:/epp/ /home/

(NB: The setup of PBS is beyond the scope of this document. More information can be found via man pbs)

In our setup, user accounts are NFS mounted from the server (roberts:/export/home/) to the LCG node at (grickle:/epp/home). Hence the lines above which state that PBS can use cp to move data from /home/ to /home/ and from /epp/ to /home/

  • make sure /opt is exported from the LCG head

This is necessary so that all of the worker nodes in the cluster can access the globus hierarchy (in actual fact, we are exporting everything in LCG - if it's needed, it is there!)

cp -r /etc/profile.d /opt/

This dir contains shell initialisation scripts that need to be accessible to all of the grid accounts.

cp -r /etc/grid-security/certificates /opt/globus/share/

This dir needs to be available to all of the worker nodes so that they can check the authenticity of the proxy accompanying the job.

Aside: Most clusters have the worker nodes on a private subnet and do not allow them to access the outside world. Generally, any data that a computation needs is staged in via PBS. In LCG, it is assumed that the grid gateways and the worker nodes do not have a shared filesystem. Accordingly, the lcgpbs job manager uses gsiftp to transfer data to the worker node. If this is not acceptable, then the standard globus pbs jobmanager can be used in its place as it does not do any 'gsiftp magic'. However, consider that if a computation operates on a large amount of data, it makes more sense for each worker node to get its own data directly rather than PBS 'streaming' the data to each active node.

  • in the generic grid user's .bashrc

LCG_INIT_SCRIPTS="/opt/profile.d/edg.sh /opt/profile.d/globus.sh /opt/profile.d/lcg.sh /opt/profile.d/lcgenv.sh"
for i in $LCG_INIT_SCRIPTS
do
    if [ -r "$i" ]; then
        . "$i"
    fi
done
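Because the loop above silently skips any script it cannot read, a missing NFS mount on a worker node can go unnoticed. A possible check, using the same four script names and an illustrative function name, is sketched below:

```shell
#!/bin/sh
# Print every LCG init script, from the list used in .bashrc above,
# that is not readable under DIR. Returns non-zero if any is missing.
# Usage: missing_init_scripts DIR
missing_init_scripts() {
    dir=$1
    status=0
    for name in edg.sh globus.sh lcg.sh lcgenv.sh; do
        if [ ! -r "$dir/$name" ]; then
            echo "$dir/$name"
            status=1
        fi
    done
    return $status
}
```

Running missing_init_scripts /opt/profile.d on a worker node should print nothing if /opt is exported and mounted correctly.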

Important LCG config files

  • /opt/edg/etc/edg_wl_ui_cmd_var.conf

In this file we can set the number of job-resubmissions in case of failure, logging level and specify a default virtual organisation

  • /opt/edg/etc/edg_wl_ui_gui_var.conf

Same as above

  • /opt/edg/etc/edg-mkgridmap.conf

Contains the LDAP URIs from which '/etc/grid-security/grid-mapfile' is generated. In the first instance, these are just commented out.

  • /opt/globus/etc/grid-services

Where the default jobmanager is defined: jobmanager-torque

Revision:  r20 - 09 Sep 2005 - MarcoLaRosa
Authorised by:  Geoff Taylor (G.Taylor @ physics.unimelb.edu.au)