Bug #7597

closed

hang in rados/test.sh

Added by Zack Cerza about 10 years ago. Updated about 10 years ago.

Status: Duplicate
Priority: Urgent
Assignee: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Looking into a particular hung job:

$ cat /a/teuthology-2014-03-02_01:10:05-ceph-deploy-firefly-distro-basic-vps/114221/pid
17147
$ sudo strace -s 1024 -p 17147
Process 17147 attached - interrupt to quit
read(27, "o\360\2522E\2548c", 8)        = 8
times({tms_utime=55678, tms_stime=5002, tms_cutime=144, tms_cstime=62}) = 1819307113
sendto(22, "\6\254\354\332\304(=\324\313]2\00066\23Y\247\200\211\345X\347\302\374\31#\255\265)\350\335\200i\3721=\317A\33\273-\\,C\2306\256\240\327K\313T", 52, 0, NULL, 0) = 52
recvfrom(22, 0x3003a9c, 16, 0, 0, 0)    = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 22, {EPOLLIN, {u32=22, u64=22}}) = 0
recvfrom(16, 0x3003a9c, 16, 0, 0, 0)    = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 16, {EPOLLIN, {u32=16, u64=16}}) = 0
recvfrom(14, 0x3003a9c, 16, 0, 0, 0)    = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 14, {EPOLLIN, {u32=14, u64=14}}) = 0
recvfrom(15, 0x3003a9c, 16, 0, 0, 0)    = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 15, {EPOLLIN, {u32=15, u64=15}}) = 0
epoll_wait(17, {}, 32, 100)             = 0

This pattern repeated many times: each recvfrom() returns EAGAIN, and epoll_wait() times out after 100 ms with no events. So what are those FDs?

$ sudo ls -l /proc/17147/fd/{1,2,14,15,16,17,22}
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/1 -> /dev/pts/44 (deleted)
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/14 -> socket:[52369526]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/15 -> socket:[52369578]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/16 -> socket:[52369531]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/17 -> anon_inode:[eventpoll]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/2 -> /dev/pts/44 (deleted)
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/22 -> socket:[52358838]

Sockets, sure, but to what? (Also, why are 1 and 2 deleted?!)

$ sudo netstat -apeen | grep 17147
tcp        0      0 10.214.137.23:35793     10.214.137.23:11300     ESTABLISHED 1001       36003617    17147/python    
tcp        0      0 10.214.137.23:42653     10.214.137.23:11300     ESTABLISHED 1001       36957961    17147/python    
tcp        0      0 127.0.0.1:60819         127.0.1.1:11300         ESTABLISHED 1001       29322390    17147/python    
tcp        0      0 10.214.137.23:42641     10.214.138.155:22       ESTABLISHED 1001       52369526    17147/python    
tcp        0      0 127.0.0.1:49513         127.0.1.1:11300         ESTABLISHED 1001       26605391    17147/python    
tcp        0      0 10.214.137.23:58754     10.214.138.174:22       ESTABLISHED 1001       52358838    17147/python    
tcp        0      0 10.214.137.23:35965     10.214.138.180:22       ESTABLISHED 1001       52369578    17147/python    
tcp        0      0 10.214.137.23:60255     10.214.137.23:11300     ESTABLISHED 1001       36756739    17147/python    
tcp        0      0 10.214.137.23:36774     10.214.138.177:22       ESTABLISHED 1001       52369531    17147/python    
udp        0      0 0.0.0.0:47954           0.0.0.0:*                           1001       52210114    17147/python    
udp        0      0 0.0.0.0:38453           0.0.0.0:*                           1001       52210113    17147/python    

Matching the socket inodes from /proc against the netstat output, fds 14, 15, 16, and 22 are the ssh (port 22) connections to vpm104, vpm105, vpm106, and vpm107 - all four of the targets of the job.
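For reference, the inode matching done by hand above can be scripted. This is only a rough sketch: the function names are made up for illustration, and the sample strings are trimmed copies of the `ls -l` and `netstat -apeen` output shown in this report.

```python
import re

def socket_inodes(ls_output):
    """Map fd number -> socket inode, from `ls -l /proc/<pid>/fd/...` output."""
    fds = {}
    for line in ls_output.splitlines():
        m = re.search(r"/fd/(\d+) -> socket:\[(\d+)\]", line)
        if m:
            fds[int(m.group(1))] = m.group(2)
    return fds

def tcp_peers(netstat_output):
    """Map socket inode -> remote address, from `netstat -apeen` tcp lines."""
    peers = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # tcp columns: proto recv-q send-q local remote state user inode pid/prog
        if len(fields) >= 8 and fields[0] == "tcp":
            peers[fields[7]] = fields[4]
    return peers

def fd_peers(ls_output, netstat_output):
    """Join the two maps on inode: fd number -> remote address (or None)."""
    peers = tcp_peers(netstat_output)
    return {fd: peers.get(inode)
            for fd, inode in socket_inodes(ls_output).items()}

# Sample data trimmed from the output in this report.
LS_SAMPLE = """\
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/1 -> /dev/pts/44 (deleted)
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/14 -> socket:[52369526]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/15 -> socket:[52369578]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/16 -> socket:[52369531]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/17 -> anon_inode:[eventpoll]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/22 -> socket:[52358838]
"""

NETSTAT_SAMPLE = """\
tcp        0      0 10.214.137.23:35793     10.214.137.23:11300     ESTABLISHED 1001       36003617    17147/python
tcp        0      0 10.214.137.23:42641     10.214.138.155:22       ESTABLISHED 1001       52369526    17147/python
tcp        0      0 10.214.137.23:58754     10.214.138.174:22       ESTABLISHED 1001       52358838    17147/python
tcp        0      0 10.214.137.23:35965     10.214.138.180:22       ESTABLISHED 1001       52369578    17147/python
tcp        0      0 10.214.137.23:36774     10.214.138.177:22       ESTABLISHED 1001       52369531    17147/python
udp        0      0 0.0.0.0:47954           0.0.0.0:*                           1001       52210114    17147/python
"""

for fd, peer in fd_peers(LS_SAMPLE, NETSTAT_SAMPLE).items():
    print("fd %d -> %s" % (fd, peer))
```

All four socket fds from the strace resolve to a remote port 22, which is what points at the ssh connections.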

My conclusion is that either _run_tests() or something in orchestra.remote or orchestra.run needs to notice if the connection dies and raise an exception.
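To illustrate the kind of check I mean, here is a minimal sketch. It is not the orchestra code: `wait_for_exit`, `ConnectionLost`, and the two callables are hypothetical names. With paramiko, `is_done` could plausibly be a channel's `exit_status_ready` and `is_alive` the transport's `is_active`, but how that would be wired into orchestra.run is an assumption. A dead peer is also only noticed once the transport itself detects it, so something like paramiko's `Transport.set_keepalive()` would probably be needed as well.

```python
import time

class ConnectionLost(Exception):
    """Raised when the connection dies before the remote command finishes."""

def wait_for_exit(is_done, is_alive, poll_interval=1.0, sleep=time.sleep):
    """Poll until is_done() returns True.

    Between polls, check is_alive(); if it returns False, raise
    ConnectionLost instead of waiting forever on a dead connection.
    Both arguments are zero-argument callables (hypothetical hooks).
    """
    while not is_done():
        if not is_alive():
            raise ConnectionLost("connection died while waiting for command")
        sleep(poll_interval)
```

The point is only that the wait loop has a second exit condition besides "command finished", so a dropped connection turns into a failed job rather than a hung one.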
