Post job file processing errors - Torque / Maui / gLite3

Ref: https://savannah.cern.ch/bugs/?7874

Thanks to: Steve Traylen and Charles Loomis.

Periodically I get errors when PBS cannot copy back the .OU and .ER files. The submithelper.pl script reports on stdout that it could not stage some files back to the gatekeeper (via globus-url-copy). The actual problem is that the directory the files are copied back to has already been cleaned up by the grid-monitor processes from the RB, either before stage-out starts or before it completes (I am not sure which).

The workaround is to set the following in the PBS server config:

qmgr -c 'set server keep_completed = 300'

and to patch the pbs.pm and lcgpbs.pm jobmanagers as shown in the diffs below. With keep_completed set, finished jobs stay in the queue in state C for 300 seconds; the patches make the jobmanagers report that state as DONE.

pbs.pm:
diff pbs.pm pbs.pm.orig
452,455d451
< elsif(/C/)
< {
< $state = Globus::GRAM::JobState::DONE;
< }
lcgpbs.pm:
diff lcgpbs.pm lcgpbs.pm.orig
275,278d274
< elsif(/C/)
< {
< $state = Globus::GRAM::JobState::DONE;
< }
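One way to confirm that the keep_completed setting took effect is to read it back from the server configuration. The sketch below assumes that `qmgr -c 'print server'` prints attributes as `set server <name> = <value>` lines; the function name `keep_completed_value` is hypothetical and only used for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads `qmgr -c 'print server'` output on stdin and
# prints the keep_completed value, or nothing if the attribute is not set.
keep_completed_value() {
  awk '$1 == "set" && $2 == "server" && $3 == "keep_completed" { print $5 }'
}

# Usage on the PBS server host (not run here, since it needs qmgr):
#   qmgr -c 'print server' | keep_completed_value
```

If the function prints nothing, the attribute is not set on the server.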
A trace of a job through the PBS server and mom logs shows something like the following:
[root@charm-mgt server_logs]# grep 266863.charm-mgt.localnet 20070124
01/24/2007 01:18:49;0010;PBS_Server;Job;266863.charm-mgt.localnet;Exit_status=0 resources_used.cput=01:03:26 resources_used.mem=554836kb resources_used.vmem=758136kb resources_used.walltime=01:50:35
01/24/2007 01:18:57;000d;PBS_Server;Job;266863.charm-mgt.localnet;Post job file processing error; job 266863.charm-mgt.localnet on host pnet19/0
01/24/2007 01:18:57;0100;PBS_Server;Job;266863.charm-mgt.localnet;dequeuing from atlas, state COMPLETE

And on the node:

01/24/2007 01:18:49;0080; pbs_mom;Job;266863.charm-mgt.localnet;scan_for_terminated: job 266863.charm-mgt.localnet task 1 terminated, sid 3142
01/24/2007 01:18:49;0008; pbs_mom;Job;266863.charm-mgt.localnet;job was terminated
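To find every affected job in a day's server log rather than grepping for one job ID, the semicolon-separated fields seen in the trace can be used directly. This is a minimal sketch that assumes all server log lines follow the layout shown in the trace (timestamp; event code; daemon; object type; job ID; message). The function name `post_job_errors` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads PBS server log lines on stdin and prints the
# unique IDs of jobs that logged a "Post job file processing error".
# Field 5 is the job ID and field 6 the start of the message, as in the trace.
post_job_errors() {
  awk -F';' '$6 ~ /^Post job file processing error/ { print $5 }' | sort -u
}

# Usage, from the server_logs directory:
#   post_job_errors < 20070124
```

Each job ID is printed once, even if the error was logged several times for the same job.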