EPP Grid - Post job file processing errors - Torque / Maui / gLite3


Ref: https://savannah.cern.ch/bugs/?7874
Thanks to: Steve Traylen and Charles Loomis.

Periodically I get errors because PBS cannot copy back the .OU and .ER files. The submithelper.pl script writes to stdout that it could not stage some files back to the gatekeeper (via globus-url-copy). The actual problem is that the directory the files are copied back to has already been cleaned up by the grid-monitor processes from the RB, either before stage-out starts or before it completes (I am not sure which).

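The hack around this is to set keep_completed in the PBS server config, which keeps finished jobs in the queue in state C (completed) for the given number of seconds: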

qmgr -c 'set server keep_completed = 300'

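and patch the pbs.pm and lcgpbs.pm jobmanagers as follows, so that they report a job in state C as done. In each diff the lines marked < are the ones added to the original file: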

pbs.pm:

diff pbs.pm pbs.pm.orig
452,455d451
<         elsif(/C/)
<         {
<             $state = Globus::GRAM::JobState::DONE;
<         }

lcgpbs.pm:

diff lcgpbs.pm lcgpbs.pm.orig
275,278d274
<             elsif(/C/)
<             {
<                 $state = Globus::GRAM::JobState::DONE;
<             }
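
After applying the workaround it may be worth confirming that the keep_completed setting actually took effect. The following is a minimal sketch, not part of the original fix: the helper name keep_completed_value is hypothetical, and it assumes that `qmgr -c 'print server'` emits the setting as a line of the form `set server keep_completed = 300`.

```shell
# Hypothetical helper: reads `qmgr -c 'print server'` output on stdin and
# prints the value of keep_completed, or nothing if the attribute is unset.
# Assumes lines of the form: set server keep_completed = 300
keep_completed_value() {
    awk '$1 == "set" && $2 == "server" && $3 == "keep_completed" { print $NF }'
}

# Usage on the PBS server host (requires qmgr):
#   qmgr -c 'print server' | keep_completed_value
```

Reading from stdin keeps the helper usable against a saved copy of the server config as well as against the live server.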

A trace of a job through the PBS server and mom logs shows something like the following:

[root@charm-mgt server_logs]# grep 266863.charm-mgt.localnet 20070124
01/24/2007 01:18:49;0010;PBS_Server;Job;266863.charm-mgt.localnet;Exit_status=0 resources_used.cput=01:03:26 resources_used.mem=554836kb resources_used.vmem=758136kb resources_used.walltime=01:50:35
01/24/2007 01:18:57;000d;PBS_Server;Job;266863.charm-mgt.localnet;Post job file processing error; job 266863.charm-mgt.localnet on host pnet19/0
01/24/2007 01:18:57;0100;PBS_Server;Job;266863.charm-mgt.localnet;dequeuing from atlas, state COMPLETE

And on the node:

01/24/2007 01:18:49;0080;   pbs_mom;Job;266863.charm-mgt.localnet;scan_for_terminated: job 266863.charm-mgt.localnet task 1 terminated, sid 3142
01/24/2007 01:18:49;0008;   pbs_mom;Job;266863.charm-mgt.localnet;job was terminated
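
To see how many jobs are affected, the server log can be scanned for the error message shown in the server trace above. This is a rough sketch under stated assumptions: the function name jobs_with_stageout_errors is hypothetical, and it assumes every server log line uses the semicolon-separated layout seen in the trace (timestamp, event code, daemon, class, job ID, message).

```shell
# Hypothetical helper: reads PBS server log lines on stdin and prints the
# unique IDs of jobs that logged a "Post job file processing error".
# Assumed field layout: timestamp;event code;daemon;class;job id;message
jobs_with_stageout_errors() {
    awk -F';' '$6 ~ /^Post job file processing error/ { print $5 }' | sort -u
}

# Usage from the server_logs directory:
#   jobs_with_stageout_errors < 20070124
```

Each job ID in the output can then be passed to grep, as in the trace above, to pull the full history of that job.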

Revision:  r2 - 24 Jan 2007 - MarcoLaRosa
Authorised by:  Geoff Taylor (G.Taylor @ physics.unimelb.edu.au)