Bug #7597 (closed): hang in rados/test.sh

Added by Zack Cerza about 10 years ago. Updated about 10 years ago.

Status: Duplicate
Priority: Urgent
Assignee: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Looking into a particular hung job:

$ cat /a/teuthology-2014-03-02_01:10:05-ceph-deploy-firefly-distro-basic-vps/114221/pid
17147
$ sudo strace -s 1024 -p 17147                                                                                                     
Process 17147 attached - interrupt to quit
read(27, "o\360\2522E\2548c", 8)        = 8
times({tms_utime=55678, tms_stime=5002, tms_cutime=144, tms_cstime=62}) = 1819307113
sendto(22, "\6\254\354\332\304(=\324\313]2\00066\23Y\247\200\211\345X\347\302\374\31#\255\265)\350\335\200i\3721=\317A\33\273-\\,C\2306\256\240\327K\313T", 52, 0, NULL, 0) = 52
recvfrom(22, 0x3003a9c, 16, 0, 0, 0)    = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 22, {EPOLLIN, {u32=22, u64=22}}) = 0
recvfrom(16, 0x3003a9c, 16, 0, 0, 0)    = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 16, {EPOLLIN, {u32=16, u64=16}}) = 0
recvfrom(14, 0x3003a9c, 16, 0, 0, 0)    = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 14, {EPOLLIN, {u32=14, u64=14}}) = 0
recvfrom(15, 0x3003a9c, 16, 0, 0, 0)    = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 15, {EPOLLIN, {u32=15, u64=15}}) = 0
epoll_wait(17, {}, 32, 100)             = 0

Output like this repeated over and over: recvfrom() returning EAGAIN, the FDs being re-registered with epoll, and epoll_wait() timing out every 100 ms. So what are those FDs?

$ sudo ls -l /proc/17147/fd/{1,2,14,15,16,17,22}
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/1 -> /dev/pts/44 (deleted)
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/14 -> socket:[52369526]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/15 -> socket:[52369578]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/16 -> socket:[52369531]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/17 -> anon_inode:[eventpoll]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/2 -> /dev/pts/44 (deleted)
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/22 -> socket:[52358838]

Sockets, sure, but to what? (Also, why are 1 and 2 deleted?!)

$ sudo netstat -apeen | grep 17147
tcp        0      0 10.214.137.23:35793     10.214.137.23:11300     ESTABLISHED 1001       36003617    17147/python    
tcp        0      0 10.214.137.23:42653     10.214.137.23:11300     ESTABLISHED 1001       36957961    17147/python    
tcp        0      0 127.0.0.1:60819         127.0.1.1:11300         ESTABLISHED 1001       29322390    17147/python    
tcp        0      0 10.214.137.23:42641     10.214.138.155:22       ESTABLISHED 1001       52369526    17147/python    
tcp        0      0 127.0.0.1:49513         127.0.1.1:11300         ESTABLISHED 1001       26605391    17147/python    
tcp        0      0 10.214.137.23:58754     10.214.138.174:22       ESTABLISHED 1001       52358838    17147/python    
tcp        0      0 10.214.137.23:35965     10.214.138.180:22       ESTABLISHED 1001       52369578    17147/python    
tcp        0      0 10.214.137.23:60255     10.214.137.23:11300     ESTABLISHED 1001       36756739    17147/python    
tcp        0      0 10.214.137.23:36774     10.214.138.177:22       ESTABLISHED 1001       52369531    17147/python    
udp        0      0 0.0.0.0:47954           0.0.0.0:*                           1001       52210114    17147/python    
udp        0      0 0.0.0.0:38453           0.0.0.0:*                           1001       52210113    17147/python    

The four port-22 connections are SSH sessions to vpm104, vpm105, vpm106, and vpm107, i.e. all four targets of the job; the connections to port 11300 are presumably teuthology's beanstalkd job queue.
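
So the teuthology process is just sitting in an epoll loop over those SSH channels, waiting for output that never arrives. For reference, a minimal Python sketch of that polling pattern (illustrative only, not teuthology's actual code; the hostname is made up):

# Register the SSH sockets with epoll, wait with a 100 ms timeout, loop.
# This is the shape of the strace above: recvfrom() -> EAGAIN, epoll_ctl(),
# then epoll_wait(..., 100) timing out over and over.
import select
import socket

socks = [socket.create_connection(('vpm104.example.com', 22))]

ep = select.epoll()
for s in socks:
    ep.register(s.fileno(), select.EPOLLIN)

while True:
    events = ep.poll(0.1)       # 100 ms, matching epoll_wait(17, {}, 32, 100)
    if not events:
        continue                # nothing readable; poll again, forever
    for fd, _mask in events:
        pass                    # a real client would read from the ready fd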

My conclusion is that either _run_tests(), or something in orchestra.remote or orchestra.run, needs to notice when the connection dies and raise an exception instead of polling forever.
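
A minimal sketch of the kind of check I mean, assuming a paramiko channel underneath (ConnectionLostError and wait_for_exit() are hypothetical names, not teuthology's actual API):

# Hypothetical sketch: poll the channel for the remote command's exit status,
# but bail out if the SSH transport dies instead of spinning in epoll forever.
import time


class ConnectionLostError(Exception):
    """The SSH connection died while the remote command was still running."""


def wait_for_exit(channel, poll_interval=1.0):
    while not channel.exit_status_ready():
        transport = channel.get_transport()
        if transport is None or not transport.is_active():
            raise ConnectionLostError('SSH connection lost mid-command')
        time.sleep(poll_interval)
    return channel.recv_exit_status()

orchestra.run could then turn that exception into a failed test instead of a silent hang.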

Update #1 by Ian Colle about 10 years ago

  • Assignee set to Sandon Van Ness
  • Priority changed from High to Urgent

Update #2 by Zack Cerza about 10 years ago

  • Assignee deleted (Sandon Van Ness)

I was wrong; it's not a network problem.

From vpm104:

root      2094  0.0  0.1  94088  3780 ?        Ss   Mar02   0:00 sshd: ubuntu [priv]
ubuntu    2097  0.0  0.0  94088  1760 ?        S    Mar02   0:13 sshd: ubuntu@notty
ubuntu    3561  0.0  0.0 106060  1512 ?        Ss   Mar02   0:00 bash -c mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp &&
ubuntu    3581  0.0  0.0 106060  1292 ?        S    Mar02   0:00 /bin/sh -ex /home/ubuntu/cephtest/workunit.client.0/rados/test.sh
ubuntu    4081  0.0  0.5 712976  9852 ?        Sl   Mar02   0:28 ceph_test_rados_api_misc
root     32527  0.0  0.1  94088  3804 ?        Ss   10:56   0:01 sshd: ubuntu [priv]
ubuntu   32530  0.0  0.0  94088  1764 ?        S    10:56   0:00 sshd: ubuntu@pts/0
ubuntu   32531  0.0  0.0 108300  1892 pts/0    Ss   10:57   0:00 -bash
postfix  32630  0.0  0.1  81280  3320 ?        S    12:06   0:00 pickup -l -t fifo -u
root     32699  0.0  0.1  94088  3804 ?        Ss   12:57   0:00 sshd: ubuntu [priv]
ubuntu   32702  0.0  0.0  94088  1752 ?        S    12:57   0:00 sshd: ubuntu@pts/1
ubuntu   32703  0.0  0.0 108300  1880 pts/1    Ss+  12:57   0:00 -bash

strace output while Sandon kills the ceph_test_rados_api_misc process:

[ubuntu@vpm104 ~]$ sudo strace -s 1024 -p 4081
Process 4081 attached - interrupt to quit
futex(0x7fff80df211c, FUTEX_WAIT_PRIVATE, 1, NULL

) = ? ERESTARTSYS (To be restarted)
--- SIGTERM (Terminated) @ 0 (0) ---
Process 4081 detached
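
For context: futex(..., FUTEX_WAIT_PRIVATE, ..., NULL) with a NULL timeout means the process is parked indefinitely on a lock or condition variable rather than spinning or doing I/O. A tiny illustration (nothing to do with the actual ceph_test_rados_api_misc code) of what such a hang looks like under strace:

# Purely illustrative AB/BA deadlock: attach strace to this process and both
# threads sit in futex(..., FUTEX_WAIT_PRIVATE, ..., NULL), like the trace above.
import threading
import time

a, b = threading.Lock(), threading.Lock()

def worker(first, second):
    with first:
        time.sleep(0.1)   # let the other thread grab its first lock
        with second:      # blocks forever once both outer locks are held
            pass

threading.Thread(target=worker, args=(a, b)).start()
threading.Thread(target=worker, args=(b, a)).start()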

And the teuthology output from when he killed it:

2014-03-02T18:56:59.365 INFO:teuthology.task.workunit.client.0.out:[10.214.138.174]: [ RUN      ] LibRadosMiscPP.CopyPP
2014-03-04T10:35:01.079 INFO:teuthology.task.workunit.client.0.err:[10.214.138.174]: /home/ubuntu/cephtest/workunit.client.0/rados/test.sh: line 7:  4081 Terminated              ceph_test_rados_api_misc
2014-03-04T10:35:01.270 INFO:teuthology.task.workunit:Stopping rados/test.sh on client.0...
2014-03-04T10:35:01.270 DEBUG:teuthology.orchestra.run:Running [10.214.138.174]: 'rm -rf -- /home/ubuntu/cephtest/workunits.list /home/ubuntu/cephtest/workunit.client.0'
2014-03-04T10:35:02.143 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-firefly/teuthology/parallel.py", line 82, in __exit__
    for result in self:
  File "/home/teuthworker/teuthology-firefly/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/teuthology-firefly/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/teuthology-firefly/teuthology/task/workunit.py", line 345, in _run_tests
    args=args,
  File "/home/teuthworker/teuthology-firefly/teuthology/orchestra/remote.py", line 106, in run
    r = self._runner(client=self.ssh, **kwargs)
  File "/home/teuthworker/teuthology-firefly/teuthology/orchestra/run.py", line 328, in run
    r.exitstatus = _check_status(r.exitstatus)
  File "/home/teuthworker/teuthology-firefly/teuthology/orchestra/run.py", line 324, in _check_status
    raise CommandFailedError(command=r.command, exitstatus=status, node=host)
CommandFailedError: Command failed on 10.214.138.174 with status 143: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=32a4e903497bcf40a50e5e683c7bda17eccd11b3 TESTDIR="/home/ubuntu/cephtest" CEPH_ID="0" adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage /home/ubuntu/cephtest/workunit.client.0/rados/test.sh'
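
For the record, exit status 143 is just the shell's 128 + signal-number encoding, confirming the process died from the SIGTERM Sandon sent rather than failing on its own:

# 143 = 128 + 15, i.e. "killed by SIGTERM" in the shell's 128+N convention.
import signal

status = 143
assert status - 128 == signal.SIGTERM == 15
print(signal.Signals(status - 128).name)   # -> SIGTERM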

Update #3 by Zack Cerza about 10 years ago

  • Project changed from teuthology to rgw
  • Subject changed from task/workunit.py: hang in _run_tests() to hang in rados/test.sh

Update #4 by Ian Colle about 10 years ago

  • Assignee set to Yehuda Sadeh

Update #5 by Zack Cerza about 10 years ago

This job had been running for over 60 hours:

http://pulpito.front.sepia.ceph.com/teuthology-2014-03-01_23:02:09-rbd-firefly-distro-basic-plana/113994/

On plana55:

ubuntu    3615  0.0 10.5 2231508 864528 ?      Ssl  Mar02   2:15 ceph_test_librbd_fsx -d -W -R -p 100 -P /home/ubuntu/cephtest/archive -t 1 -l 250000000 -S 0 -N 5000 pool_client.0 image_client.0

$ sudo strace -p 3615
Process 3615 attached - interrupt to quit
futex(0x148b948, FUTEX_WAIT_PRIVATE, 2, NULL

I killed it.

Update #6 by Zack Cerza about 10 years ago

Unsure if this is identical:
http://pulpito.front.sepia.ceph.com/teuthology-2014-03-05_02:30:28-rados-firefly-testing-basic-plana/117303/
http://qa-proxy.ceph.com/teuthology/teuthology-2014-03-05_02:30:28-rados-firefly-testing-basic-plana/117303/teuthology.log

2014-03-05T10:10:53.091 INFO:teuthology.task.ceph.mon.a.err:[10.214.133.22]: mon/PGMap.h: In function 'const pool_stat_t& PGMap::get_pg_pool_sum_stat(int64_t)' thread 7fcaade21700 time 2014-03-05 10:10:53.086771
2014-03-05T10:10:53.091 INFO:teuthology.task.ceph.mon.a.err:[10.214.133.22]: mon/PGMap.h: 210: FAILED assert(p != pg_pool_sum.end())

I'm not going to kill the job because it looks fairly interesting.

Update #7 by Sage Weil about 10 years ago

  • Status changed from New to Duplicate
  • Assignee deleted (Yehuda Sadeh)

The mon assert is a dup.
