CE overload and jobs not running

December 1, 2006

Site started failing SAM tests. Investigation revealed:
1. A high number of grid monitor processes submitted from the RBs to monitor the jobs - even for jobs which were not running.
2. Submission directly through PBS as an atlas user worked fine.
3. A reboot did not fix the problem - as soon as the gatekeeper was back online, the RBs re-submitted the monitor processes.
4. The monitor processes were not timing out.

What I've done:
- stop the gatekeeper
- remove all of the pool account .gass_cache directories (about 7GB!)
- remove the large .lcgjm directories
- clean out all stage directories in pool accounts (about 5GB!)
- clean out /var/tmp/
- clean out /opt/globus/tmp/
- clean out /opt/globus/tmp/gram_job_state/
- restart gatekeeper
- the monitor processes are back - now to see if they die

Interestingly, while waiting for the running biomed jobs to complete, and while there were no dteam, ops or atlas jobs running, there were dteam, ops and atlas monitor processes. But these processes were only for dteam009, ops002 and atlas064. The same monitor processes are back.

December 9, 2006

- su'ed to the ops002 account and created a simple PBS script which runs /bin/hostname.
- qsub'ed the script - it works and spits out pnet37 (the host it happens to go to).
- Created a proxy on lcg-compute using the hostcert. Unable to do this from aulcg-rb (where my cert and grid tools are) because the host is down. Reason unknown - haven't been in to the office yet.
  UPDATE 11/12/06: The host is not down. Someone has taken the power cord to the switch it's plugged in to.
- Changed the proxy to be owned by ops002 and then su'ed to the ops account.
- Added the cert DN to the grid-mapfile and mapped it to the ops pool account.
- globus-job-submit lcg-compute.hpc.unimelb.edu.au:2119/jobmanager-lcg-pbs -q ops /bin/hostname
- Job returned no output.
- Checked the mail failure directory /var/spool/mqueue and found messages containing:
    PBS Job Id: 172818.charm-mgt.hpc.unimelb.edu.au
    Job Name: STDIN
    File stage in failed, see below.
    Job will be retried later, please investigate and correct problem.
    Unable to copy file ops004@lcg-compute.hpc.unimelb.edu.au:/imports/home/ops004/.lcgjm/globus-cache-export.P21514/globus-cache-export.P21514.gpg to globus-cache-export.P21514.gpg
    >>> error from copy
    Permission denied (publickey,password,hostbased).
    >>> end error output
- When SLC updated SSH to 4.3, having '!!' in the password field caused things to break (technically, the account is locked). So, I changed the password file to 'x' (as per other people's recommendations), and remade the NIS maps (cd /var/yp && make).
- Tried the job submission again, still no output. Files in the mail spooler failure dir had the same info.
- Generated an RSA pub/private key pair for ops004 and made an authorized_keys file for ops004:
    ssh-keygen -b 2048 -t rsa -N '' -C ops004@charm-mgt -f /home/ops004/.ssh/id_rsa
    cat id_rsa.pub > authorized_keys
    chmod u-wx,u+r,go-rwx authorized_keys
    chmod u-wx,go-wx,go+r id*
- Tried the job submission again - SUCCESS!!!! It seems to be an auth problem.
- Changed the shadow field back to '!!', remade the NIS maps.
- Tried the job submission again - still works.
- Removed the authorized_keys file.
- Tried the job submission again - failed! Definitely an auth issue.
- Replaced the authorized_keys file.
- Tried the job submission again - success.
- Checked the mom config file on pnet37 (where these jobs happen to be going) and it says "usecp" - not "usescp" - but, somewhere along the line, SSH is involved.
- Went back to the ops002 account and realised I had already keygen'ed in it, so I removed the authorized_keys file and qsub'ed the script - it still works. Seems like hostbased auth is working when directly logged in.
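
The key setup that made grid-submitted jobs work can be repeated for other pool accounts. The sketch below is a hypothetical helper (the function name and the /home/opsNNN paths for accounts other than ops004 are assumptions, not taken from the site configuration) that only prints the commands for review. It is not the procedure recorded above word for word: it creates ~/.ssh explicitly and keeps the private key and authorized_keys owner-only, since OpenSSH normally ignores a private key that other users can read.

```shell
#!/bin/sh
# Hypothetical helper: print, but do not run, the key-setup commands for one
# pool account so they can be reviewed before being executed as that account.
pool_key_commands() {
    account=$1    # pool account name, e.g. ops003
    home_dir=$2   # that account's home directory (assumed layout)
    key_host=$3   # label for the key comment; the log used charm-mgt
    cat <<EOF
mkdir -p ${home_dir}/.ssh && chmod 700 ${home_dir}/.ssh
ssh-keygen -b 2048 -t rsa -N '' -C ${account}@${key_host} -f ${home_dir}/.ssh/id_rsa
cat ${home_dir}/.ssh/id_rsa.pub > ${home_dir}/.ssh/authorized_keys
chmod 600 ${home_dir}/.ssh/id_rsa ${home_dir}/.ssh/authorized_keys
chmod 644 ${home_dir}/.ssh/id_rsa.pub
EOF
}

# Example: show the commands for the three ops accounts named in the log.
for n in 002 003 004; do
    pool_key_commands "ops${n}" "/home/ops${n}" charm-mgt
done
```

The printed commands are meant to be run as the pool account itself, so that the key files end up owned by that account.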
So:
- hostbased auth works because I can su to a normal user, qsub a script and it works perfectly. However, if a job comes in through the grid, it isn't happy. Seems that we require a pub/priv key pair in the account and an authorized_keys file for this stuff to work through the Grid.
- The testing jobs from the Polish site come from Rafal Lichwala - and he happens to get mapped to ops003 on this system. So, keygen'ed him some keys for that account and set up the authorized_keys file. Now to run a SAM test and see what happens.
- THE SAM TEST WORKED!!!!
- current SSH version: 3.9p1-8.SL.3.20 (upgraded from 3.6... something)
- This version was installed on Nov 16 4:10 UTC (rpm -qi openssh openssh-server openssh-clients)
- date stamps on ssh_config and sshd_config:
    charm-mgt (pbs server node): Nov 23 4:52 UTC
    lcg-compute (glite CE): Nov 17 3:33 and 4:55 UTC
    pnet01 (glite WN): Nov 17 3:37 UTC
    pnet37 (glite WN): Nov 17 1:28 UTC
- SAM TESTS STARTED FAILING: NOVEMBER 29 14:12 UTC
- Just checked the SAM tests page and there have been a few successful tests - I think this is because of the keygen stuff in the accounts which Piotr (the SAM tester) has been mapped to. Need to keep an eye on this to be sure.

December 10, 2006

- SAM tests executing successfully since fixes yesterday.
- Have also noticed two ATLAS user jobs on the system as well as ATLAS SAM tests.
- CE still has a lot of grid-monitor processes relating to non-existent jobs. These processes do not seem to be dying. They also reappear after stopping the gatekeeper, killing all monitor processes and cleaning up the globus state directories (/opt/globus/tmp and /opt/globus/tmp/gram_job_state).

December 11, 2006

- Since December 10, 5 am UTC, dteam SAM tests have been failing. The same tests run as ops have worked fine.
Looking at the mail messages which have failed to send (on charm-mgt) we see:
    PBS Job Id: 174637.charm-mgt.hpc.unimelb.edu.au
    Job Name: STDIN
    File stage in failed, see below.
    Job will be retried later, please investigate and correct problem.
    Unable to copy file dteam009@lcg-compute.hpc.unimelb.edu.au:/imports/home/dteam009/.lcgjm/globus-cache-export.B14044/globus-cache-export.B14044.gpg to globus-cache-export.B14044.gpg
    >>> error from copy
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
    Someone could be eavesdropping on you right now (man-in-the-middle attack)!
    It is also possible that the RSA host key has just been changed.
    The fingerprint for the RSA key sent by the remote host is
    05:22:5a:f5:7b:bf:84:db:e8:6a:f4:7b:ca:56:b9:40.
    Please contact your system administrator.
    Add correct host key in /home/dteam009/.ssh/known_hosts to get rid of this message.
    Offending key in /etc/ssh/ssh_known_hosts:4
    RSA host key for lcg-compute.hpc.unimelb.edu.au has changed and you have requested strict checking.
    Host key verification failed.
    >>> end error output
- The above message is seen for both dteam009 and bio001. No other account has this problem.
- Again, nothing has changed on the system. Further, it only seems to happen for jobs using the dteam009 and bio001 accounts.
- Am able to ssh unchallenged charm-mgt -> lcg-compute -> pnet37 -> charm-mgt as dteam009 user.
- Am able to ssh unchallenged charm-mgt -> lcg-compute -> pnet37 -> charm-mgt as bio001 user.
- The key for lcg-compute in the ssh_known_hosts file on charm-mgt, lcg-compute and pnet37 is correct. The ssh_known_hosts file on those hosts has not been modified since November 17.
- The error message says "you have requested strict checking". None of the hosts have SSH configured for StrictHostKeyChecking.
- chkrootkit run on charm-mgt, pnet01, pnet37, and lcg-compute. Nothing detected.
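
Both failures above were diagnosed from undelivered PBS mail, and the question each time was which accounts were affected and with what error. A minimal sketch of a filter that answers it is given below. It assumes each message has the line layout shown in the excerpts above (a "PBS Job Id:" line, an "Unable to copy file account@host:..." line, and the copy error between the ">>> error from copy" and ">>> end error output" markers); the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: read PBS stage-in failure mail on stdin and print one
# line per failure: "<job id> <account> <last non-empty line of the copy error>".
summarize_stagein_failures() {
    awk '
        /^PBS Job Id:/          { job = $4 }
        /^Unable to copy file/  { split($5, parts, "@"); account = parts[1] }
        /^>>> error from copy/  { in_err = 1; last = ""; next }
        /^>>> end error output/ { if (in_err) print job, account, last
                                  in_err = 0; next }
        in_err && NF            { last = $0 }
    '
}

# Possible usage on the PBS server (assumes the queue files are readable and
# that the message text in them is not re-wrapped):
#   cat /var/spool/mqueue/* | summarize_stagein_failures | awk '{print $2}' | sort | uniq -c
```

Keeping only the last line of the error block drops the long host-key warning banner while retaining the deciding message, such as "Permission denied (publickey,password,hostbased)." or "Host key verification failed."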