Bug #7597
hang in rados/test.sh
Status: Closed
Description
Looking into a particular hung job:
$ cat /a/teuthology-2014-03-02_01:10:05-ceph-deploy-firefly-distro-basic-vps/114221/pid
17147
$ sudo strace -s 1024 -p 17147
Process 17147 attached - interrupt to quit
read(27, "o\360\2522E\2548c", 8) = 8
times({tms_utime=55678, tms_stime=5002, tms_cutime=144, tms_cstime=62}) = 1819307113
sendto(22, "\6\254\354\332\304(=\324\313]2\00066\23Y\247\200\211\345X\347\302\374\31#\255\265)\350\335\200i\3721=\317A\33\273-\\,C\2306\256\240\327K\313T", 52, 0, NULL, 0) = 52
recvfrom(22, 0x3003a9c, 16, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 22, {EPOLLIN, {u32=22, u64=22}}) = 0
recvfrom(16, 0x3003a9c, 16, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 16, {EPOLLIN, {u32=16, u64=16}}) = 0
recvfrom(14, 0x3003a9c, 16, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 14, {EPOLLIN, {u32=14, u64=14}}) = 0
recvfrom(15, 0x3003a9c, 16, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(17, EPOLL_CTL_ADD, 15, {EPOLLIN, {u32=15, u64=15}}) = 0
epoll_wait(17, {}, 32, 100) = 0
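For reference, this is not teuthology code, just a minimal Python loop that produces the same EAGAIN / EPOLL_CTL_ADD / 100 ms epoll_wait syscall pattern when the peer never sends anything (the socketpair stands in for the real ssh sockets; Linux only):

import errno
import select
import socket

a, b = socket.socketpair()          # stand-in for the ssh sockets (fds 14/15/16/22)
a.setblocking(False)
ep = select.epoll()                 # fd 17 in the trace is such an epoll instance
try:
    a.recv(16)                      # shows up as recvfrom(...) = -1 EAGAIN
except socket.error as e:
    assert e.errno == errno.EAGAIN
ep.register(a.fileno(), select.EPOLLIN)
for _ in range(3):
    # epoll_wait(..., 100) = 0 over and over: nothing ever arrives
    assert ep.poll(0.1) == []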
This pattern repeated over and over: each nonblocking recvfrom() returns EAGAIN and every epoll_wait() times out after 100 ms with no events, so the process is polling sockets that never deliver anything. So what are those FDs?
$ sudo ls -l /proc/17147/fd/{1,2,14,15,16,17,22}
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/1 -> /dev/pts/44 (deleted)
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/14 -> socket:[52369526]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/15 -> socket:[52369578]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/16 -> socket:[52369531]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/17 -> anon_inode:[eventpoll]
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/2 -> /dev/pts/44 (deleted)
lrwx------ 1 teuthworker teuthworker 64 Mar  4 07:50 /proc/17147/fd/22 -> socket:[52358838]
Sockets, sure, but to what? (Also, why are 1 and 2 deleted?!)
$ sudo netstat -apeen | grep 17147
tcp        0      0 10.214.137.23:35793     10.214.137.23:11300     ESTABLISHED 1001       36003617    17147/python
tcp        0      0 10.214.137.23:42653     10.214.137.23:11300     ESTABLISHED 1001       36957961    17147/python
tcp        0      0 127.0.0.1:60819         127.0.1.1:11300         ESTABLISHED 1001       29322390    17147/python
tcp        0      0 10.214.137.23:42641     10.214.138.155:22       ESTABLISHED 1001       52369526    17147/python
tcp        0      0 127.0.0.1:49513         127.0.1.1:11300         ESTABLISHED 1001       26605391    17147/python
tcp        0      0 10.214.137.23:58754     10.214.138.174:22       ESTABLISHED 1001       52358838    17147/python
tcp        0      0 10.214.137.23:35965     10.214.138.180:22       ESTABLISHED 1001       52369578    17147/python
tcp        0      0 10.214.137.23:60255     10.214.137.23:11300     ESTABLISHED 1001       36756739    17147/python
tcp        0      0 10.214.137.23:36774     10.214.138.177:22       ESTABLISHED 1001       52369531    17147/python
udp        0      0 0.0.0.0:47954           0.0.0.0:*                           1001       52210114    17147/python
udp        0      0 0.0.0.0:38453           0.0.0.0:*                           1001       52210113    17147/python
They are ssh connections to vpm104, vpm105, vpm106, and vpm107 - all four of the targets of the job.
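Incidentally, that manual cross-reference (fd -> socket inode -> remote endpoint) can be scripted. A rough sketch; socket_peers is a made-up helper, IPv4 TCP only, and the byte order assumes a little-endian host like these VMs:

import os
import re
import socket
import struct

def socket_peers(pid):
    """Map each socket fd of `pid` to its (remote_ip, remote_port)."""
    # Resolve fd -> socket inode via the /proc/<pid>/fd symlinks,
    # exactly like the `ls -l` above.
    inodes = {}
    fd_dir = '/proc/%d/fd' % pid
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd went away while we were looking
        m = re.match(r'socket:\[(\d+)\]$', target)
        if m:
            inodes[m.group(1)] = int(fd)
    # /proc/net/tcp (what netstat reads) has one hex-encoded row per
    # socket: field 2 is rem_address, field 9 the inode.
    peers = {}
    with open('/proc/net/tcp') as f:
        next(f)  # skip the header row
        for line in f:
            fields = line.split()
            if fields[9] in inodes:
                hexip, hexport = fields[2].split(':')
                ip = socket.inet_ntoa(struct.pack('<I', int(hexip, 16)))
                peers[inodes[fields[9]]] = (ip, int(hexport, 16))
    return peers

Here socket_peers(17147) would return {14: ('10.214.138.155', 22), ...}, matching the netstat output above.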
My conclusion is that either _run_tests() or something in orchestra.remote or orchestra.run needs to notice when the connection dies and raise an exception instead of blocking forever.
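A minimal sketch of the kind of check meant here, assuming it would run somewhere in orchestra's run/wait path. ConnectionLostError and check_connection are hypothetical names, not existing teuthology API; orchestra does keep a paramiko SSHClient per remote (the self.ssh visible in the traceback further down):

import paramiko


class ConnectionLostError(Exception):
    """Hypothetical: raised instead of blocking forever once the ssh transport dies."""


def check_connection(client):
    # `client` is a paramiko.SSHClient. get_transport() returns None before
    # connect(), and Transport.is_active() goes False once the TCP session
    # is gone (remote rebooted, connection dropped, etc.).
    transport = client.get_transport()
    if transport is None or not transport.is_active():
        raise ConnectionLostError('ssh connection to remote host was lost')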
Updated by Ian Colle about 10 years ago
- Assignee set to Sandon Van Ness
- Priority changed from High to Urgent
Updated by Zack Cerza about 10 years ago
- Assignee deleted (Sandon Van Ness)
I was wrong, it's not a network problem.
From vpm104:
root      2094  0.0  0.1  94088  3780 ?      Ss   Mar02   0:00 sshd: ubuntu [priv]
ubuntu    2097  0.0  0.0  94088  1760 ?      S    Mar02   0:13 sshd: ubuntu@notty
ubuntu    3561  0.0  0.0 106060  1512 ?      Ss   Mar02   0:00 bash -c mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp &&
ubuntu    3581  0.0  0.0 106060  1292 ?      S    Mar02   0:00 /bin/sh -ex /home/ubuntu/cephtest/workunit.client.0/rados/test.sh
ubuntu    4081  0.0  0.5 712976  9852 ?      Sl   Mar02   0:28 ceph_test_rados_api_misc
root     32527  0.0  0.1  94088  3804 ?      Ss   10:56   0:01 sshd: ubuntu [priv]
ubuntu   32530  0.0  0.0  94088  1764 ?      S    10:56   0:00 sshd: ubuntu@pts/0
ubuntu   32531  0.0  0.0 108300  1892 pts/0  Ss   10:57   0:00 -bash
postfix  32630  0.0  0.1  81280  3320 ?      S    12:06   0:00 pickup -l -t fifo -u
root     32699  0.0  0.1  94088  3804 ?      Ss   12:57   0:00 sshd: ubuntu [priv]
ubuntu   32702  0.0  0.0  94088  1752 ?      S    12:57   0:00 sshd: ubuntu@pts/1
ubuntu   32703  0.0  0.0 108300  1880 pts/1  Ss+  12:57   0:00 -bash
Me using strace while Sandon kills the ceph_test_rados_api_misc process:
[ubuntu@vpm104 ~]$ sudo strace -s 1024 -p 4081
Process 4081 attached - interrupt to quit
futex(0x7fff80df211c, FUTEX_WAIT_PRIVATE, 1, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGTERM (Terminated) @ 0 (0) ---
Process 4081 detached
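That futex(FUTEX_WAIT_PRIVATE, ...) with a NULL timeout is what a thread parked forever on an in-process lock or condition variable looks like under strace, which fits the "not a network problem" reading. A trivial reproducer (exact futex flags vary by libc and interpreter version; this hangs on purpose):

import threading

lock = threading.Lock()
lock.acquire()
# Second acquire deadlocks this thread forever; `strace -p` on the process
# now shows it parked in futex(..., FUTEX_WAIT_PRIVATE, ...), like pid 4081.
lock.acquire()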
And the teuthology output from when he killed it:
2014-03-02T18:56:59.365 INFO:teuthology.task.workunit.client.0.out:[10.214.138.174]: [ RUN      ] LibRadosMiscPP.CopyPP
2014-03-04T10:35:01.079 INFO:teuthology.task.workunit.client.0.err:[10.214.138.174]: /home/ubuntu/cephtest/workunit.client.0/rados/test.sh: line 7:  4081 Terminated              ceph_test_rados_api_misc
2014-03-04T10:35:01.270 INFO:teuthology.task.workunit:Stopping rados/test.sh on client.0...
2014-03-04T10:35:01.270 DEBUG:teuthology.orchestra.run:Running [10.214.138.174]: 'rm -rf -- /home/ubuntu/cephtest/workunits.list /home/ubuntu/cephtest/workunit.client.0'
2014-03-04T10:35:02.143 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-firefly/teuthology/parallel.py", line 82, in __exit__
    for result in self:
  File "/home/teuthworker/teuthology-firefly/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/teuthology-firefly/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/teuthology-firefly/teuthology/task/workunit.py", line 345, in _run_tests
    args=args,
  File "/home/teuthworker/teuthology-firefly/teuthology/orchestra/remote.py", line 106, in run
    r = self._runner(client=self.ssh, **kwargs)
  File "/home/teuthworker/teuthology-firefly/teuthology/orchestra/run.py", line 328, in run
    r.exitstatus = _check_status(r.exitstatus)
  File "/home/teuthworker/teuthology-firefly/teuthology/orchestra/run.py", line 324, in _check_status
    raise CommandFailedError(command=r.command, exitstatus=status, node=host)
CommandFailedError: Command failed on 10.214.138.174 with status 143: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=32a4e903497bcf40a50e5e683c7bda17eccd11b3 TESTDIR="/home/ubuntu/cephtest" CEPH_ID="0" adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage /home/ubuntu/cephtest/workunit.client.0/rados/test.sh'
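The status 143 in that CommandFailedError is just the shell's encoding of a death by SIGTERM, i.e. 128 plus the signal number:

import signal

# A process killed by a signal exits with status 128 + signum;
# SIGTERM is 15 on Linux, hence 143.
print(128 + signal.SIGTERM)  # 143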
Updated by Zack Cerza about 10 years ago
- Project changed from teuthology to rgw
- Subject changed from "task/workunit.py: hang in _run_tests()" to "hang in rados/test.sh"
Updated by Zack Cerza about 10 years ago
This job was running for over 60 hours:
On plana55:
ubuntu    3615  0.0 10.5 2231508 864528 ?    Ssl  Mar02   2:15 ceph_test_librbd_fsx -d -W -R -p 100 -P /home/ubuntu/cephtest/archive -t 1 -l 250000000 -S 0 -N 5000 pool_client.0 image_client.0

$ sudo strace -p 3615
Process 3615 attached - interrupt to quit
futex(0x148b948, FUTEX_WAIT_PRIVATE, 2, NULL
I killed it.
Updated by Zack Cerza about 10 years ago
Unsure if this is identical:
http://pulpito.front.sepia.ceph.com/teuthology-2014-03-05_02:30:28-rados-firefly-testing-basic-plana/117303/
http://qa-proxy.ceph.com/teuthology/teuthology-2014-03-05_02:30:28-rados-firefly-testing-basic-plana/117303/teuthology.log
2014-03-05T10:10:53.091 INFO:teuthology.task.ceph.mon.a.err:[10.214.133.22]: mon/PGMap.h: In function 'const pool_stat_t& PGMap::get_pg_pool_sum_stat(int64_t)' thread 7fcaade21700 time 2014-03-05 10:10:53.086771
2014-03-05T10:10:53.091 INFO:teuthology.task.ceph.mon.a.err:[10.214.133.22]: mon/PGMap.h: 210: FAILED assert(p != pg_pool_sum.end())

(The failed assert means the monitor was asked for per-pool stats for a pool id that is not in its pg_pool_sum map; the lookup returned end().)
I'm not going to kill the job because it looks fairly interesting.
Updated by Sage Weil about 10 years ago
- Status changed from New to Duplicate
- Assignee deleted (Yehuda Sadeh)
The mon assert is a dup.