Start of topic | Skip to actions
lcg-cp command hanging* lcg-cp command hanging on both Australia-ATLAS and Australia-UNIMELB-LCG2
New core network chomps connections after 1 hourAs part of the new core network, all connections to the internet go through new border routers and new firewall blades:traceroute to cern.ch (137.138.28.241), 30 hops max, 38 byte packets 1 128.250.150.66 (128.250.150.66) 2.234 ms 0.450 ms 0.485 ms 2 rtr3-000035 (128.250.5.3) 0.377 ms 0.311 ms 0.289 ms 3 172.19.1.133 (172.19.1.133) 0.452 ms 0.341 ms 0.315 ms 4 172.18.66.164 (172.18.66.164) 0.787 ms 0.705 ms 0.690 ms 5 br-266-te-2-2.unimelb.net.au (172.18.66.254) 0.794 ms 0.909 ms 0.776 ms 6 br-151-te-2-3.unimelb.net.au (172.18.66.233) 0.982 ms 0.801 ms 0.807 ms ---snip---The interesting nodes here are br-266... and br-151... : the new border routers. From here we hit firewall blades which have a default timeout of 1 hour for idle connections (Thanx to Brian Little, Network Ops). I've spoken to Brian about this and he will put in a request to the engineering team to have this limit upped to 12 hours - at my request. Hopefully this will not happen with the research LAN which will bypass the firewall blades. Atlas sgm jobThe problem was first noticed with software installation jobs on Australia-ATLAS (SL4.4/glite3.1). Basically, the lcg-cp command which downloads the tarball never seems to detach. Then noticed that the same problem happened on Australia-UNIMELB-LCG2 (SL307/glite3.0). Tried a download of the tarball myself on a WN on Au-ATLAS and it worked. As it happened, it seems Alessandro has replicated the particular tarball to our local storage koala.unimelb.edu.au. When I tried, the command picked up the site local replica and it completed successfully. Then I tried the same download using a replica in Italy. Note: Put an email out to rollout - got a message from Steve Traylen suggesting it might be the old globus tcp port range problem. GLOBUS_TCP_RANGE is set on Australia-ATLAS but not on Australia-UNIMELB-LCG2 yet the command hangs on both. see: http://goc.grid.sinica.edu.tw/gocwiki/gridftp_works_only_once_within_a_minute_or_sorunning on Australia-ATLAS - single threaded transfer* Using single threaded transfer - command (lcg-cp) does not detach successfully.
[dtm050@agc1 ~]$ /opt/lcg/bin/lcg-cp -v --vo dteam srm://grid-cert-03.roma1.infn.it/dpm/roma1.infn.it/home/atlas/install/lcg/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz file:/home/dtm050/atlas-production-patch-13.0.30.2-1-i686_slc3_gcc323-opt.tar.gz
Source URL: srm://grid-cert-03.roma1.infn.it/dpm/roma1.infn.it/home/atlas/install/lcg/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz
File size: 2569485639
Source URL for copy: gsiftp://atlas-storage-02.roma1.infn.it/atlas-storage-02.roma1.infn.it:/data2/atlas/2007-08-15/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz.1755640.0
Destination URL: file:/home/dtm050/atlas-production-patch-13.0.30.2-1-i686_slc3_gcc323-opt.tar.gz
# streams: 1
# set timeout to 0 (seconds)
2274361344 bytes 189.15 KB/sec avg 512.01 KB/sec inst
and an strace...
write(2, "\r 2569485639 bytes", 20) = 20
write(2, " 144.53 KB/sec avg", 21) = 21
write(2, " 0.00 KB/sec inst", 22) = 22
gettimeofday({1193204775, 255737}, NULL) = 0
gettimeofday({1193204775, 255800}, NULL) = 0
gettimeofday({1193204775, 255862}, NULL) = 0
gettimeofday({1193204775, 255920}, NULL) = 0
select(10, [4 8], [], NULL, {0, 999817}) = 0 (Timeout)
gettimeofday({1193204776, 255369}, NULL) = 0
gettimeofday({1193204776, 255433}, NULL) = 0
select(10, [4 8], [], NULL, {0, 304}) = 0 (Timeout)
gettimeofday({1193204776, 256363}, NULL) = 0
write(2, "\r 2569485639 bytes", 20) = 20
write(2, " 144.53 KB/sec avg", 21) = 21
write(2, " 0.00 KB/sec inst", 22) = 22
gettimeofday({1193204776, 256818}, NULL) = 0
gettimeofday({1193204776, 256881}, NULL) = 0
gettimeofday({1193204776, 256942}, NULL) = 0
gettimeofday({1193204776, 257023}, NULL) = 0
select(10, [4 8], [], NULL, {0, 224752}) = 0 (Timeout)
gettimeofday({1193204776, 481358}, NULL) = 0
gettimeofday({1193204776, 481422}, NULL) = 0
select(10, [4 8], [], NULL, {0, 353}) = 0 (Timeout)
gettimeofday({1193204776, 482358}, NULL) = 0
gettimeofday({1193204776, 482424}, NULL) = 0
time(NULL) = 1193204776
gettimeofday({1193204776, 482538}, NULL) = 0
gettimeofday({1193204776, 482597}, NULL) = 0
select(10, [4 8], [], NULL, {0, 774221}) = 0 (Timeout)
gettimeofday({1193204777, 258353}, NULL) = 0
write(2, "\r 2569485639 bytes", 20) = 20
write(2, " 144.51 KB/sec avg", 21) = 21
write(2, " 0.00 KB/sec inst", 22) = 22
gettimeofday({1193204777, 258651}, NULL) = 0
gettimeofday({1193204777, 258713}, NULL) = 0
gettimeofday({1193204777, 258773}, NULL) = 0
gettimeofday({1193204777, 258832}, NULL) = 0
select(10, [4 8], [], NULL, {0, 999819} <unfinished ...>
Process 25733 detached
running on Australia-ATLAS - parallel threaded transfer - 2 threads - download from Italy: atlas-storage-02.roma1.infn.it
[dtm050@agc1 ~]$ lcg-cp -n 2 -v --vo dteam srm://grid-cert-03.roma1.infn.it/dpm/roma1.infn.it/home/atlas/install/lcg/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz file:/home/dtm050/atlas-production-patch-13.0.30.2-1-i686_slc3_gcc323-opt.tar.gz Source URL: srm://grid-cert-03.roma1.infn.it/dpm/roma1.infn.it/home/atlas/install/lcg/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz File size: 2569485639 Source URL for copy: gsiftp://atlas-storage-02.roma1.infn.it/atlas-storage-02.roma1.infn.it:/data2/atlas/2007-08-15/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz.1755640.0 Destination URL: file:/home/dtm050/atlas-production-patch-13.0.30.2-1-i686_slc3_gcc323-opt.tar.gz # streams: 2 2569485639 bytes 352.23 KB/sec avg 0.00 KB/sec inst running on Australia-ATLAS - parallel threaded transfer - 2 threads - download from UK: fal-pygrid-24.lancs.ac.uk* Using 2 threads the command (lcg-cp) does not detach successfully - this host is a dCache[dtm050@agc1 ~]$ lcg-cp -n 2 -v --vo atlas gsiftp://fal-pygrid-24.lancs.ac.uk:2811//pnfs/lancs.ac.uk/data/atlas/sc3tests/test-tarfile-1 file:/home/dtm050/atlas-production-patch-13.0.30.2-1-i686_slc3_gcc323-opt.tar.gz Source URL: gsiftp://fal-pygrid-24.lancs.ac.uk:2811//pnfs/lancs.ac.uk/data/atlas/sc3tests/test-tarfile-1 File size: 2569485639 Source URL for copy: gsiftp://fal-pygrid-24.lancs.ac.uk:2811//pnfs/lancs.ac.uk/data/atlas/sc3tests/test-tarfile-1 Destination URL: file:/home/dtm050/atlas-production-patch-13.0.30.2-1-i686_slc3_gcc323-opt.tar.gz # streams: 2 # set timeout to 0 (seconds) 2569485639 bytes 374.52 KB/sec avg 0.00 KB/sec running on Australia-ATLAS - globus-url-copy - parallel threaded transfer - 2 threads - download from Italy: atlas-storage-02.roma1.infn.it* Using 2 threads the command (globus-url-copy) does not detach successfully[dtm050@agc1 ~]$ globus-url-copy -p 2 -vb gsiftp://atlas-storage-02.roma1.infn.it/atlas-storage-02.roma1.infn.it:/data2/atlas/2007-08-15/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz.1755640.0 file:/home/dtm050/guc-2thread-download Source: gsiftp://atlas-storage-02.roma1.infn.it/atlas-storage-02.roma1.infn.it:/data2/atlas/2007-08-15/ Dest: file:/home/dtm050/ atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz.1755640.0 -> guc-2thread-download 2569485639 bytes 0.26 MB/sec avg 0.00 MB/sec inst running on Australia-ATLAS - parallel threaded transfer - 5 threads
[dtm050@agc1 ~]$ lcg-cp -n 5 -v --vo dteam srm://grid-cert-03.roma1.infn.it/dpm/roma1.infn.it/home/atlas/install/lcg/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz file:/home/dtm050/atlas-production-patch-13.0.30.2-1-i686_slc3_gcc323-opt.tar.gz Source URL: srm://grid-cert-03.roma1.infn.it/dpm/roma1.infn.it/home/atlas/install/lcg/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz File size: 2569485639 Source URL for copy: gsiftp://atlas-storage-02.roma1.infn.it/atlas-storage-02.roma1.infn.it:/data2/atlas/2007-08-15/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz.1755640.0 Destination URL: file:/home/dtm050/atlas-production-patch-13.0.30.2-1-i686_slc3_gcc323-opt.tar.gz # streams: 5 # set timeout to 0 (seconds) 2569338880 bytes 936.24 KB/sec avg 928.00 KB/sec inst Transfer took 2680820 ms running on Australia-ATLAS - parallel threaded transfer - 10 threads
[dtm050@agc1 ~]$ lcg-cp -n 10 -v --vo dteam srm://grid-cert-03.roma1.infn.it/dpm/roma1.infn.it/home/atlas/install/lcg/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz file:/home/dtm050/atlas-production-patch-13.0.30.2-1-i686_slc3_gcc323-opt.tar.gz Source URL: srm://grid-cert-03.roma1.infn.it/dpm/roma1.infn.it/home/atlas/install/lcg/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz File size: 2569485639 Source URL for copy: gsiftp://atlas-storage-02.roma1.infn.it/atlas-storage-02.roma1.infn.it:/data2/atlas/2007-08-15/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz.1755640.0 Destination URL: file:/home/dtm050/atlas-production-patch-13.0.30.2-1-i686_slc3_gcc323-opt.tar.gz # streams: 10 # set timeout to 0 (seconds) 2565799936 bytes 1867.11 KB/sec avg 1855.33 KB/sec inst Transfer took 1344650 ms running on Australia-ATLAS - parallel threaded transfer - 20 threads
[dtm050@agc1 ~]$ lcg-cp -n 20 -v --vo dteam srm://grid-cert-03.roma1.infn.it/dpm/roma1.infn.it/home/atlas/install/lcg/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz file:/home/dtm050/atlas-production-patch-13.0.30.2-1-i686_slc3_gcc323-opt.tar.gz Source URL: srm://grid-cert-03.roma1.infn.it/dpm/roma1.infn.it/home/atlas/install/lcg/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz File size: 2569485639 Source URL for copy: gsiftp://atlas-storage-02.roma1.infn.it/atlas-storage-02.roma1.infn.it:/data2/atlas/2007-08-15/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz.1755640.0 Destination URL: file:/home/dtm050/atlas-production-patch-13.0.30.2-1-i686_slc3_gcc323-opt.tar.gz # streams: 20 # set timeout to 0 (seconds) 2568109383 bytes 3551.96 KB/sec avg 3256.93 KB/sec inst Transfer took 706440 ms running on Australia-UNIMELB-LCG2 - single threaded transfer* Using single threaded transfer - command (lcg-cp) does not detach successfully.[dtm050@pnet39 dtm050]$ echo $GLOBUS_TCP_PORT_RANGE [dtm050@pnet39 dtm050]$ /opt/lcg/bin/lcg-cp -v --vo dteam srm://grid-cert-03.roma1.infn.it/dpm/roma1.infn.it/home/atlas/install/lcg/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz file:/home/dtm050/atlas-production-patch-13.0.30.2-1-i686_slc3_gcc323-opt.tar.gz Source URL: srm://grid-cert-03.roma1.infn.it/dpm/roma1.infn.it/home/atlas/install/lcg/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz File size: 2569485639 Source URL for copy: gsiftp://atlas-storage-02.roma1.infn.it/atlas-storage-02.roma1.infn.it:/data2/atlas/2007-08-15/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz.1755640.0 Destination URL: file:/home/dtm050/atlas-production-patch-13.0.30.2-1-i686_slc3_gcc323-opt.tar.gz # streams: 1 # set timeout to 0 (seconds) 1140850688 bytes 94.19 KB/sec avg 512.00 KB/sec instSo - the file downloads very slowly for a long time - and seems ok the whole time. It just doesn't disconnect. running on lxplus and downloading the file from Australia - single threaded transfer* Using single threaded transfer - command (lcg-cp) does not detach successfully. All times following are in CEST (Central European)Job started at 05:06 Download stalled at 08:36 Still running at 9:10 - but no traffic Killed at 9:10 [gmoloney@lxplus235]/tmp/gmoloney% lcg-cp -v --vo atlas srm://koala.unimelb.edu.au/dpm/unimelb.edu.au/home/atlas/install/lcg/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz file:$PWD/atlas-production-patch-13.0.30.2-1-i686_slc3_gcc323-opt.tar.gz Source URL: srm://koala.unimelb.edu.au/dpm/unimelb.edu.au/home/atlas/install/lcg/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz File size: 2569485639 Source URL for copy: gsiftp://ds01.hpc.unimelb.edu.au/ds01.hpc.unimelb.edu.au:/gridstorage/pool2/atlas/2007-10-15/atlas-production-13.0.20-1-i686_slc3_gcc323-opt.tar.gz.789002.0 Destination URL: file:/tmp/gmoloney/atlas-production-patch-13.0.30.2-1-i686_slc3_gcc323-opt.tar.gz # streams: 1 # set timeout to 0 (seconds) total 0 2569485639 bytes 174.23 KB/sec avg 0.00 KB/sec inst Atlas user job on Australia-UNIMELB-LCG2I noticed that an ATLAS user job had also hung on an lcg-cp.atl002 31631 16119 0 Oct23 ? 00:00:11 lcg-cp --verbose --vo atlas lfn:/grid/atlas/dq2/streamtest/streamtest.004924.inclJet.merge.AOD.v12000605_tid009429/streamtest.004924.inclJet.merge.AOD.v12000605_tid009429._00001.pool.root.3 file:///imports/home/atl002/globus-tmp.pnet13.15573.0/WMS_pnet13_016047_https_3a_2f_2frb106.cern.ch_3a9000_2fOMvCvJUneb5FkOUk8NEU_5fA/TopQuarkAODtoROOTuple/grid/streamtest.004924.inclJet.merge.AOD.v12000605_tid009429._00001.pool.root.3 [root@pnet13 root]# ll /imports/home/atl002/globus-tmp.pnet13.15573.0/WMS_pnet13_016047_https_3a_2f_2frb106.cern.ch_3a9000_2fOMvCvJUneb5FkOUk8NEU_5fA/TopQuarkAODtoROOTuple/grid/streamtest.004924.inclJet.merge.AOD.v12000605_tid009429._00001.pool.root.3 -rw-r--r-- 1 atl002 2000 692096538 Oct 23 18:10 streamtest.004924.inclJet.merge.AOD.v12000605_tid009429._00001.pool.root.3 sleep about 5 minutes [root@pnet13 root]# ll /imports/home/atl002/globus-tmp.pnet13.15573.0/WMS_pnet13_016047_https_3a_2f_2frb106.cern.ch_3a9000_2fOMvCvJUneb5FkOUk8NEU_5fA/TopQuarkAODtoROOTuple/grid/streamtest.004924.inclJet.merge.AOD.v12000605_tid009429._00001.pool.root.3 -rw-r--r-- 1 atl002 2000 692096538 Oct 23 18:10 streamtest.004924.inclJet.merge.AOD.v12000605_tid009429._00001.pool.root.3 and about 1 day later [root@pnet13 root]# ll /imports/home/atl002/globus-tmp.pnet13.15573.0/WMS_pnet13_016047_https_3a_2f_2frb106.cern.ch_3a9000_2fOMvCvJUneb5FkOUk8NEU_5fA/TopQuarkAODtoROOTuple/grid/streamtest.004924.inclJet.merge.AOD.v12000605_tid009429._00001.pool.root.3 -rw-r--r-- 1 atl002 2000 692096538 Oct 23 18:10 streamtest.004924.inclJet.merge.AOD.v12000605_tid009429._00001.pool.root.3Note: file size not changing
[root@pnet13 root]# pstree -alpuG
--snip--
├─pbs_mom,1865
│ └─sh,15439,atl002
│ └─346604.char.SC,15569 /var/spool/pbs/mom_priv/jobs/346604.char.SC
│ └─bootstrap.K1557,15573 -w /tmp/bootstrap.K15570 /home/atl002/ lcg-compute.hpc.unimelb.edu.au /home/atl002/.globus/.gass_cache/local/md5/81/e49a51112b3f6e7b6b7
│ ├─bootstrap.K1557,15579 -w /tmp/bootstrap.K15570 /home/atl002/ lcg-compute.hpc.unimelb.edu.au /home/atl002/.globus/.gass_cache/local/md5/81/e49a51112b3f6e7b6b7
│ ├─bootstrap.K1557,15834 -w /tmp/bootstrap.K15570 /home/atl002/ lcg-compute.hpc.unimelb.edu.au /home/atl002/.globus/.gass_cache/local/md5/81/e49a51112b3f6e7b6b7
│ └─sh,16045 -c if\040[\040-x\040${LCG_LOCATION:-/opt/lcg}/libexec/jobwrapper\040];\040then\040${LCG_LOCATION:-/opt/lcg}/libexec/jobwrapper\040/imports/home/atl002/globus-tmp.pnet1
│ └─jobwrapper,16046 /opt/lcg/libexec/jobwrapper /imports/home/atl002/globus-tmp.pnet13.15573.0/globus-tmp.pnet13.15573.0/local/md5/81/e49a51112b3f6e7b6b796889239
│ └─bash,16047 /imports/home/atl002/globus-tmp.pnet13.15573.0/globus-tmp.pnet13.15573.0/local/md5/81/e49a51112b3f6e7b6b796889239093/md5/d9/2b631755206e03784fa4
│ └─bash,16118 /imports/home/atl002/globus-tmp.pnet13.15573.0/globus-tmp.pnet13.15573.0/local/md5/81/e49a51112b3f6e7b6b796889239093/md5/d9/2b631755206e03784fa4
│ ├─perl,16121 -e \012\040\040\040\040while\040(1)\040{\012\040\040\040\040\040\040$time_left\040=\040`grid-proxy-info\040-timeleft\0402>\040/dev/null`\040||\0400;\012\040\040\040\040\040\040last\040if\040($time_left\040<=\0400);\012\040\040\040\040\040\040sleep($time_left);\012
│ └─sh,16119 ./streamtest.004924.inclJet.merge.AOD.v12000605_tid009429._00001.pool.root.3.sh
│ └─lcg-cp,31631 --verbose --vo atlas lfn:/grid/atlas/dq2/streamtest/streamtest.004924.inclJet.merge.AOD.v12000605_tid009429/streamtest.004924.inclJet.merge.AO
[root@pnet13 root]# lsof -p 16119 COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME sh 16119 atl002 cwd DIR 0,11 4096 1933405 /imports/home/atl002/globus-tmp.pnet13.15573.0/WMS_pnet13_016047_https_3a_2f_2frb106.cern.ch_3a9000_2fOMvCvJUneb5FkOUk8NEU_5fA/TopQuarkAODtoROOTuple/grid (mgtnode:/home) sh 16119 atl002 rtd DIR 3,1 4096 2 / sh 16119 atl002 txt REG 3,1 585396 384130 /bin/bash sh 16119 atl002 mem REG 3,1 1572460 144105 /lib/tls/libc-2.3.2.so sh 16119 atl002 mem REG 3,1 11784 512174 /lib/libtermcap.so.2.0.8 sh 16119 atl002 mem REG 3,1 14812 512109 /lib/libdl-2.3.2.so sh 16119 atl002 mem REG 3,1 106888 512096 /lib/ld-2.3.2.so sh 16119 atl002 mem REG 3,1 21436 544302 /usr/lib/gconv/gconv-modules.cache sh 16119 atl002 mem REG 3,1 32148976 274032 /usr/lib/locale/locale-archive sh 16119 atl002 0r CHR 1,3 33739 /dev/null sh 16119 atl002 1w REG 0,11 422370 10895457 /imports/home/atl002/globus-tmp.pnet13.15573.0/WMS_pnet13_016047_https_3a_2f_2frb106.cern.ch_3a9000_2fOMvCvJUneb5FkOUk8NEU_5fA/streamtest.004924.inclJet.merge.AOD.v12000605_tid009429._00001.pool.root.3.kt.e.stdout (mgtnode:/home) sh 16119 atl002 2w REG 0,11 1969242 10895458 /imports/home/atl002/globus-tmp.pnet13.15573.0/WMS_pnet13_016047_https_3a_2f_2frb106.cern.ch_3a9000_2fOMvCvJUneb5FkOUk8NEU_5fA/streamtest.004924.inclJet.merge.AOD.v12000605_tid009429._00001.pool.root.3.kt.e.stderr (mgtnode:/home) sh 16119 atl002 16w FIFO 0,5 159172390 pipe sh 16119 atl002 255r REG 0,11 1632 10895449 /imports/home/atl002/globus-tmp.pnet13.15573.0/WMS_pnet13_016047_https_3a_2f_2frb106.cern.ch_3a9000_2fOMvCvJUneb5FkOUk8NEU_5fA/streamtest.004924.inclJet.merge.AOD.v12000605_tid009429._00001.pool.root.3.sh (mgtnode:/home) [root@pnet13 root]# strace -p 16119 Process 16119 attached - interrupt to quit waitpid(-1, <unfinished ...> Process 16119 detached | |