Bug #7597
hang in rados/test.sh (closed)
Status: Duplicate
Priority: Urgent
Assignee: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Looking into a particular hung job:
$ cat /a/teuthology-2014-03-02_01:10:05-ceph-deploy-firefly-distro-basic-vps/114221/pid
17147
$ sudo strace -s 1024 -p 17147
Process 17147 attached - interrupt to quit
read(27, "o\360\2522E\2548c", 8) = 8
times({tms_utime=55678, tms_stime=5002, tms_cutime=144, tms_cstime=62}) = 1819307113
sendto(22, "\6\254\354\332\304(=\324\313]2\00066\23Y\247\200\211\345X\347\302\374\31#\255\265)\350\335\200i\3721=\317A\33\273-\\,C\2306\256\240\327K\313T", 52, 0, NULL, 0) = 52
recvfrom(22, 0x3003a9c, 16, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 22, {EPOLLIN, {u32=22, u64=22}}) = 0
recvfrom(16, 0x3003a9c, 16, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 16, {EPOLLIN, {u32=16, u64=16}}) = 0
recvfrom(14, 0x3003a9c, 16, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 14, {EPOLLIN, {u32=14, u64=14}}) = 0
recvfrom(15, 0x3003a9c, 16, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 15, {EPOLLIN, {u32=15, u64=15}}) = 0
epoll_wait(17, {}, 32, 100) = 0
Things like this repeated a lot. So what are those FDs?
$ sudo ls -l /proc/17147/fd/{1,2,14,15,16,17,22}
lrwx------ 1 teuthworker teuthworker 64 Mar 4 07:50 /proc/17147/fd/1 -> /dev/pts/44 (deleted)
lrwx------ 1 teuthworker teuthworker 64 Mar 4 07:50 /proc/17147/fd/14 -> socket:[52369526]
lrwx------ 1 teuthworker teuthworker 64 Mar 4 07:50 /proc/17147/fd/15 -> socket:[52369578]
lrwx------ 1 teuthworker teuthworker 64 Mar 4 07:50 /proc/17147/fd/16 -> socket:[52369531]
lrwx------ 1 teuthworker teuthworker 64 Mar 4 07:50 /proc/17147/fd/17 -> anon_inode:[eventpoll]
lrwx------ 1 teuthworker teuthworker 64 Mar 4 07:50 /proc/17147/fd/2 -> /dev/pts/44 (deleted)
lrwx------ 1 teuthworker teuthworker 64 Mar 4 07:50 /proc/17147/fd/22 -> socket:[52358838]
Sockets, sure, but to what? (Also, why are 1 and 2 deleted?!)
$ sudo netstat -apeen | grep 17147
tcp 0 0 10.214.137.23:35793 10.214.137.23:11300 ESTABLISHED 1001 36003617 17147/python
tcp 0 0 10.214.137.23:42653 10.214.137.23:11300 ESTABLISHED 1001 36957961 17147/python
tcp 0 0 127.0.0.1:60819 127.0.1.1:11300 ESTABLISHED 1001 29322390 17147/python
tcp 0 0 10.214.137.23:42641 10.214.138.155:22 ESTABLISHED 1001 52369526 17147/python
tcp 0 0 127.0.0.1:49513 127.0.1.1:11300 ESTABLISHED 1001 26605391 17147/python
tcp 0 0 10.214.137.23:58754 10.214.138.174:22 ESTABLISHED 1001 52358838 17147/python
tcp 0 0 10.214.137.23:35965 10.214.138.180:22 ESTABLISHED 1001 52369578 17147/python
tcp 0 0 10.214.137.23:60255 10.214.137.23:11300 ESTABLISHED 1001 36756739 17147/python
tcp 0 0 10.214.137.23:36774 10.214.138.177:22 ESTABLISHED 1001 52369531 17147/python
udp 0 0 0.0.0.0:47954 0.0.0.0:* 1001 52210114 17147/python
udp 0 0 0.0.0.0:38453 0.0.0.0:* 1001 52210113 17147/python
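Matching the socket inode numbers from /proc against the inode column of the netstat output shows that FDs 14, 15, 16, and 22 are the four port 22 connections. They are ssh connections to vpm104, vpm105, vpm106, and vpm107: all four of the targets of the job.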
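The fd-to-host matching can also be done mechanically. The following is only an illustrative sketch, not part of teuthology: the helper names (inode_to_remote, fd_to_remote) are made up here, and the input is a hand-copied excerpt of the ls and netstat output above. It joins the socket inode from each /proc fd link against the inode column of netstat.

```python
import re

# Excerpt of the `ls -l /proc/17147/fd/...` output above: fd -> link target.
FD_LINKS = {
    14: "socket:[52369526]",
    15: "socket:[52369578]",
    16: "socket:[52369531]",
    22: "socket:[52358838]",
}

# Excerpt of the `netstat -apeen` output above (only the four port-22 rows).
NETSTAT = """\
tcp 0 0 10.214.137.23:42641 10.214.138.155:22 ESTABLISHED 1001 52369526 17147/python
tcp 0 0 10.214.137.23:58754 10.214.138.174:22 ESTABLISHED 1001 52358838 17147/python
tcp 0 0 10.214.137.23:35965 10.214.138.180:22 ESTABLISHED 1001 52369578 17147/python
tcp 0 0 10.214.137.23:36774 10.214.138.177:22 ESTABLISHED 1001 52369531 17147/python
"""


def inode_to_remote(netstat_text):
    """Map socket inode (string) -> remote address for each tcp row."""
    remotes = {}
    for line in netstat_text.splitlines():
        fields = line.split()
        # Columns: proto recv-q send-q local remote state user inode pid/prog
        if len(fields) >= 8 and fields[0] == "tcp":
            remotes[fields[7]] = fields[4]
    return remotes


def fd_to_remote(fd_links, netstat_text):
    """Map fd number -> remote address by joining on the socket inode."""
    remotes = inode_to_remote(netstat_text)
    result = {}
    for fd, target in fd_links.items():
        match = re.fullmatch(r"socket:\[(\d+)\]", target)
        if match and match.group(1) in remotes:
            result[fd] = remotes[match.group(1)]
    return result


FD_REMOTES = fd_to_remote(FD_LINKS, NETSTAT)
for fd in sorted(FD_REMOTES):
    print(fd, "->", FD_REMOTES[fd])
```

Every fd that strace shows being polled resolves to a remote port 22, which is consistent with the process sitting in a loop waiting on its ssh sessions.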
My conclusion is that either _run_tests() or something in orchestra.remote or orchestra.run needs to notice if the connection dies and raise an exception.
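As a rough sketch of what "notice and raise" could look like, the loop below polls for command completion but also checks a liveness probe and an optional deadline on every pass. This is not the actual orchestra code: wait_for_exit, ConnectionLostError, and the is_finished/is_alive callables are hypothetical names used for illustration. If the connection is a paramiko transport, is_alive could plausibly be backed by Transport.is_active(), ideally with Transport.set_keepalive() enabled so that a silently vanished peer is eventually detected; the timeout is there as a backstop for the case where it is not.

```python
import time


class ConnectionLostError(Exception):
    """Raised when the remote connection dies before the command finishes."""


def wait_for_exit(is_finished, is_alive, poll_interval=0.1, timeout=None,
                  clock=time.monotonic, sleep=time.sleep):
    """Block until is_finished() is true, failing loudly instead of hanging.

    is_finished and is_alive are caller-supplied callables (hypothetical
    hooks; the real orchestra API may look nothing like this).
    """
    deadline = None if timeout is None else clock() + timeout
    while True:
        if is_finished():
            return
        # A dead connection can never report completion, so stop waiting.
        if not is_alive():
            raise ConnectionLostError(
                "connection died while waiting for command")
        # Backstop for peers that vanish without the probe noticing.
        if deadline is not None and clock() >= deadline:
            raise TimeoutError(
                "command did not finish within %s seconds" % timeout)
        sleep(poll_interval)
```

With something like this in the wait path, a job whose target goes away would fail with an exception rather than spin in epoll_wait indefinitely.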