Bug #7597
hang in rados/test.sh
Status: Closed
Description
Looking into a particular hung job:
$ cat /a/teuthology-2014-03-02_01:10:05-ceph-deploy-firefly-distro-basic-vps/114221/pid
17147
$ sudo strace -s 1024 -p 17147
Process 17147 attached - interrupt to quit
read(27, "o\360\2522E\2548c", 8) = 8
times({tms_utime=55678, tms_stime=5002, tms_cutime=144, tms_cstime=62}) = 1819307113
sendto(22, "\6\254\354\332\304(=\324\313]2\00066\23Y\247\200\211\345X\347\302\374\31#\255\265)\350\335\200i\3721=\317A\33\273-\\,C\2306\256\240\327K\313T", 52, 0, NULL, 0) = 52
recvfrom(22, 0x3003a9c, 16, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 22, {EPOLLIN, {u32=22, u64=22}}) = 0
recvfrom(16, 0x3003a9c, 16, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 16, {EPOLLIN, {u32=16, u64=16}}) = 0
recvfrom(14, 0x3003a9c, 16, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 14, {EPOLLIN, {u32=14, u64=14}}) = 0
recvfrom(15, 0x3003a9c, 16, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 15, {EPOLLIN, {u32=15, u64=15}}) = 0
epoll_wait(17, {}, 32, 100) = 0
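For reference, this is not teuthology code, just a minimal Python loop that produces the same EAGAIN / EPOLL_CTL_ADD / 100 ms epoll_wait syscall pattern when the peer never sends anything (the socketpair stands in for the real ssh sockets; Linux only):

import errno
import select
import socket

a, b = socket.socketpair()          # stand-in for the ssh sockets (fds 14/15/16/22)
a.setblocking(False)
ep = select.epoll()                 # fd 17 in the trace is such an epoll instance
try:
    a.recv(16)                      # shows up as recvfrom(...) = -1 EAGAIN
except socket.error as e:
    assert e.errno == errno.EAGAIN
ep.register(a.fileno(), select.EPOLLIN)
for _ in range(3):
    # epoll_wait(..., 100) = 0 over and over: nothing ever arrives
    assert ep.poll(0.1) == []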
This pattern repeated over and over: each nonblocking recvfrom() returns EAGAIN and every epoll_wait() times out after 100 ms with no events, so the process is polling sockets that never deliver anything. So what are those FDs?
$ sudo ls -l /proc/17147/fd/{1,2,14,15,16,17,22}
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/1 -> /dev/pts/44 (deleted)
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/14 -> socket:[52369526]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/15 -> socket:[52369578]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/16 -> socket:[52369531]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/17 -> anon_inode:[eventpoll]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/2 -> /dev/pts/44 (deleted)
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/22 -> socket:[52358838]
Sockets, sure, but to what? (Also, why are 1 and 2 deleted?!)
$ sudo netstat -apeen | grep 17147
tcp        0      0 10.214.137.23:35793     10.214.137.23:11300     ESTABLISHED 1001       36003617    17147/python
tcp        0      0 10.214.137.23:42653     10.214.137.23:11300     ESTABLISHED 1001       36957961    17147/python
tcp        0      0 127.0.0.1:60819         127.0.1.1:11300         ESTABLISHED 1001       29322390    17147/python
tcp        0      0 10.214.137.23:42641     10.214.138.155:22       ESTABLISHED 1001       52369526    17147/python
tcp        0      0 127.0.0.1:49513         127.0.1.1:11300         ESTABLISHED 1001       26605391    17147/python
tcp        0      0 10.214.137.23:58754     10.214.138.174:22       ESTABLISHED 1001       52358838    17147/python
tcp        0      0 10.214.137.23:35965     10.214.138.180:22       ESTABLISHED 1001       52369578    17147/python
tcp        0      0 10.214.137.23:60255     10.214.137.23:11300     ESTABLISHED 1001       36756739    17147/python
tcp        0      0 10.214.137.23:36774     10.214.138.177:22       ESTABLISHED 1001       52369531    17147/python
udp        0      0 0.0.0.0:47954           0.0.0.0:*                           1001       52210114    17147/python
udp        0      0 0.0.0.0:38453           0.0.0.0:*                           1001       52210113    17147/python
They are ssh connections to vpm104, vpm105, vpm106, and vpm107 - all four of the targets of the job.
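Incidentally, that manual cross-reference (fd -> socket inode -> remote endpoint) can be scripted. A rough sketch; socket_peers is a made-up helper, IPv4 TCP only, and the byte order assumes a little-endian host like these VMs:

import os
import re
import socket
import struct

def socket_peers(pid):
    """Map each socket fd of `pid` to its (remote_ip, remote_port)."""
    # Resolve fd -> socket inode via the /proc/<pid>/fd symlinks,
    # exactly like the `ls -l` above.
    inodes = {}
    fd_dir = '/proc/%d/fd' % pid
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd went away while we were looking
        m = re.match(r'socket:\[(\d+)\]$', target)
        if m:
            inodes[m.group(1)] = int(fd)
    # /proc/net/tcp (what netstat reads) has one hex-encoded row per
    # socket: field 2 is rem_address, field 9 the inode.
    peers = {}
    with open('/proc/net/tcp') as f:
        next(f)  # skip the header row
        for line in f:
            fields = line.split()
            if fields[9] in inodes:
                hexip, hexport = fields[2].split(':')
                ip = socket.inet_ntoa(struct.pack('<I', int(hexip, 16)))
                peers[inodes[fields[9]]] = (ip, int(hexport, 16))
    return peers

Here socket_peers(17147) would return {14: ('10.214.138.155', 22), ...}, matching the netstat output above.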
My conclusion is that either _run_tests() or something in orchestra.remote or orchestra.run needs to notice when the connection dies and raise an exception instead of blocking forever.
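A minimal sketch of the kind of check meant here, assuming it would run somewhere in orchestra's run/wait path. ConnectionLostError and check_connection are hypothetical names, not existing teuthology API; orchestra does keep a paramiko SSHClient per remote (the self.ssh visible in the traceback further down):

import paramiko


class ConnectionLostError(Exception):
    """Hypothetical: raised instead of blocking forever once the ssh transport dies."""


def check_connection(client):
    # `client` is a paramiko.SSHClient. get_transport() returns None before
    # connect(), and Transport.is_active() goes False once the TCP session
    # is gone (remote rebooted, connection dropped, etc.).
    transport = client.get_transport()
    if transport is None or not transport.is_active():
        raise ConnectionLostError('ssh connection to remote host was lost')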
Updated by Ian Colle about 10 years ago
- Assignee set to Sandon Van Ness
- Priority changed from High to Urgent
Updated by Zack Cerza about 10 years ago
- Assignee deleted (Sandon Van Ness)
I was wrong, it's not a network problem.
From vpm104:
root      2094  0.0  0.1  94088  3780 ?      Ss   Mar02   0:00 sshd: ubuntu [priv]
ubuntu    2097  0.0  0.0  94088  1760 ?      S    Mar02   0:13 sshd: ubuntu@notty
ubuntu    3561  0.0  0.0 106060  1512 ?      Ss   Mar02   0:00 bash -c mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp &&
ubuntu    3581  0.0  0.0 106060  1292 ?      S    Mar02   0:00 /bin/sh -ex /home/ubuntu/cephtest/workunit.client.0/rados/test.sh
ubuntu    4081  0.0  0.5 712976  9852 ?      Sl   Mar02   0:28 ceph_test_rados_api_misc
root     32527  0.0  0.1  94088  3804 ?      Ss   10:56   0:01 sshd: ubuntu [priv]
ubuntu   32530  0.0  0.0  94088  1764 ?      S    10:56   0:00 sshd: ubuntu@pts/0
ubuntu   32531  0.0  0.0 108300  1892 pts/0  Ss   10:57   0:00 -bash
postfix  32630  0.0  0.1  81280  3320 ?      S    12:06   0:00 pickup -l -t fifo -u
root     32699  0.0  0.1  94088  3804 ?      Ss   12:57   0:00 sshd: ubuntu [priv]
ubuntu   32702  0.0  0.0  94088  1752 ?      S    12:57   0:00 sshd: ubuntu@pts/1
ubuntu   32703  0.0  0.0 108300  1880 pts/1  Ss+  12:57   0:00 -bash
Me using strace while Sandon kills the ceph_test_rados_api_misc process:
[ubuntu@vpm104 ~]$ sudo strace -s 1024 -p 4081
Process 4081 attached - interrupt to quit
futex(0x7fff80df211c, FUTEX_WAIT_PRIVATE, 1, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGTERM (Terminated) @ 0 (0) ---
Process 4081 detached
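That futex(FUTEX_WAIT_PRIVATE, ...) with a NULL timeout is what a thread parked forever on an in-process lock or condition variable looks like under strace, which fits the "not a network problem" reading. A trivial reproducer (exact futex flags vary by libc and interpreter version; this hangs on purpose):

import threading

lock = threading.Lock()
lock.acquire()
# Second acquire deadlocks this thread forever; `strace -p` on the process
# now shows it parked in futex(..., FUTEX_WAIT_PRIVATE, ...), like pid 4081.
lock.acquire()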
And the teuthology output from when he killed it:
2014-03-02T18:56:59.365 INFO:teuthology.task.workunit.client.0.out:[10.214.138.174]: [ RUN      ] LibRadosMiscPP.CopyPP
2014-03-04T10:35:01.079 INFO:teuthology.task.workunit.client.0.err:[10.214.138.174]: /home/ubuntu/cephtest/workunit.client.0/rados/test.sh: line 7:  4081 Terminated              ceph_test_rados_api_misc
2014-03-04T10:35:01.270 INFO:teuthology.task.workunit:Stopping rados/test.sh on client.0...
2014-03-04T10:35:01.270 DEBUG:teuthology.orchestra.run:Running [10.214.138.174]: 'rm -rf -- /home/ubuntu/cephtest/workunits.list /home/ubuntu/cephtest/workunit.client.0'
2014-03-04T10:35:02.143 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-firefly/teuthology/parallel.py", line 82, in __exit__
    for result in self:
  File "/home/teuthworker/teuthology-firefly/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/teuthology-firefly/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/teuthology-firefly/teuthology/task/workunit.py", line 345, in _run_tests
    args=args,
  File "/home/teuthworker/teuthology-firefly/teuthology/orchestra/remote.py", line 106, in run
    r = self._runner(client=self.ssh, **kwargs)
  File "/home/teuthworker/teuthology-firefly/teuthology/orchestra/run.py", line 328, in run
    r.exitstatus = _check_status(r.exitstatus)
  File "/home/teuthworker/teuthology-firefly/teuthology/orchestra/run.py", line 324, in _check_status
    raise CommandFailedError(command=r.command, exitstatus=status, node=host)
CommandFailedError: Command failed on 10.214.138.174 with status 143: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=32a4e903497bcf40a50e5e683c7bda17eccd11b3 TESTDIR="/home/ubuntu/cephtest" CEPH_ID="0" adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage /home/ubuntu/cephtest/workunit.client.0/rados/test.sh'
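The status 143 in that CommandFailedError is just the shell's encoding of a death by SIGTERM, i.e. 128 plus the signal number:

import signal

# A process killed by a signal exits with status 128 + signum;
# SIGTERM is 15 on Linux, hence 143.
print(128 + signal.SIGTERM)  # 143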
Updated by Zack Cerza about 10 years ago
- Project changed from teuthology to rgw
- Subject changed from "task/workunit.py: hang in _run_tests()" to "hang in rados/test.sh"
Updated by Zack Cerza about 10 years ago
This job was running for over 60 hours:
On plana55:
ubuntu    3615  0.0 10.5 2231508 864528 ?    Ssl  Mar02   2:15 ceph_test_librbd_fsx -d -W -R -p 100 -P /home/ubuntu/cephtest/archive -t 1 -l 250000000 -S 0 -N 5000 pool_client.0 image_client.0

$ sudo strace -p 3615
Process 3615 attached - interrupt to quit
futex(0x148b948, FUTEX_WAIT_PRIVATE, 2, NULL
I killed it.
Updated by Zack Cerza about 10 years ago
Unsure if this is identical:
http://pulpito.front.sepia.ceph.com/teuthology-2014-03-05_02:30:28-rados-firefly-testing-basic-plana/117303/
http://qa-proxy.ceph.com/teuthology/teuthology-2014-03-05_02:30:28-rados-firefly-testing-basic-plana/117303/teuthology.log
2014-03-05T10:10:53.091 INFO:teuthology.task.ceph.mon.a.err:[10.214.133.22]: mon/PGMap.h: In function 'const pool_stat_t& PGMap::get_pg_pool_sum_stat(int64_t)' thread 7fcaade21700 time 2014-03-05 10:10:53.086771
2014-03-05T10:10:53.091 INFO:teuthology.task.ceph.mon.a.err:[10.214.133.22]: mon/PGMap.h: 210: FAILED assert(p != pg_pool_sum.end())

(The failed assert means the monitor was asked for per-pool stats for a pool id that is not in its pg_pool_sum map; the lookup returned end().)
I'm not going to kill the job because it looks fairly interesting.
Updated by Sage Weil about 10 years ago
- Status changed from New to Duplicate
- Assignee deleted (Yehuda Sadeh)
The mon assert is a dup.