Bug #44526
sporadic cephadm bootstrap failures: 'timed out'
Description
2020-03-09T12:37:04.299 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:Non-zero exit code 1 from /bin/podman run --rm --net=host -e CONTAINER_IMAGE=quay.io/ceph-ci/ceph:8c7e5228ba7bdea07454d0dd9c211c1b6b6dcfc8 -e NODE_NAME=smithi123 -v /var/log/ceph/7d5d8d88-6202-11ea-9a3e-001a4aab830c:/var/log/ceph:z -v /tmp/ceph-tmpw7u46ywy:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpquh1zk_7:/etc/ceph/ceph.conf:z --entrypoint /usr/bin/ceph quay.io/ceph-ci/ceph:8c7e5228ba7bdea07454d0dd9c211c1b6b6dcfc8 -h
2020-03-09T12:37:04.299 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stdout
2020-03-09T12:37:04.299 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stdout General usage:
2020-03-09T12:37:04.299 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stdout ==============
...
2020-03-09T12:37:04.320 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stdout critical, interesting, useful, noninteresting, debug
2020-03-09T12:37:04.320 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stdout List shows a table of all available stats
2020-03-09T12:37:04.320 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stdout Run <count> times (default forever),
2020-03-09T12:37:04.320 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stdout once per <interval> seconds (default 1)
2020-03-09T12:37:04.320 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stdout
2020-03-09T12:37:04.321 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stderr timed out
2020-03-09T12:37:04.321 INFO:teuthology.orchestra.run.smithi123.stderr:Traceback (most recent call last):
2020-03-09T12:37:04.321 INFO:teuthology.orchestra.run.smithi123.stderr:  File "/home/ubuntu/cephtest/cephadm", line 3729, in <module>
2020-03-09T12:37:04.321 INFO:teuthology.orchestra.run.smithi123.stderr:    r = args.func()
2020-03-09T12:37:04.321 INFO:teuthology.orchestra.run.smithi123.stderr:  File "/home/ubuntu/cephtest/cephadm", line 2053, in command_bootstrap
2020-03-09T12:37:04.321 INFO:teuthology.orchestra.run.smithi123.stderr:    is_available('Dashboard', is_dashboard_available)
2020-03-09T12:37:04.321 INFO:teuthology.orchestra.run.smithi123.stderr:  File "/home/ubuntu/cephtest/cephadm", line 575, in is_available
2020-03-09T12:37:04.322 INFO:teuthology.orchestra.run.smithi123.stderr:    if func():
2020-03-09T12:37:04.322 INFO:teuthology.orchestra.run.smithi123.stderr:  File "/home/ubuntu/cephtest/cephadm", line 2051, in is_dashboard_available
2020-03-09T12:37:04.322 INFO:teuthology.orchestra.run.smithi123.stderr:    out = cli(['-h'], timeout=timeout)
2020-03-09T12:37:04.322 INFO:teuthology.orchestra.run.smithi123.stderr:  File "/home/ubuntu/cephtest/cephadm", line 1924, in cli
2020-03-09T12:37:04.322 INFO:teuthology.orchestra.run.smithi123.stderr:    ).run(timeout=timeout)
2020-03-09T12:37:04.322 INFO:teuthology.orchestra.run.smithi123.stderr:  File "/home/ubuntu/cephtest/cephadm", line 1681, in run
2020-03-09T12:37:04.323 INFO:teuthology.orchestra.run.smithi123.stderr:    self.run_cmd(), desc=self.entrypoint, timeout=timeout)
2020-03-09T12:37:04.323 INFO:teuthology.orchestra.run.smithi123.stderr:  File "/home/ubuntu/cephtest/cephadm", line 518, in call_throws
2020-03-09T12:37:04.323 INFO:teuthology.orchestra.run.smithi123.stderr:    raise RuntimeError('Failed command: %s' % ' '.join(command))
2020-03-09T12:37:04.324 INFO:teuthology.orchestra.run.smithi123.stderr:RuntimeError: Failed command: /bin/podman run --rm --net=host -e CONTAINER_IMAGE=quay.io/ceph-ci/ceph:8c7e5228ba7bdea07454d0dd9c211c1b6b6dcfc8 -e NODE_NAME=smithi123 -v /var/log/ceph/7d5d8d88-6202-11ea-9a3e-001a4aab830c:/var/log/ceph:z -v /tmp/ceph-tmpw7u46ywy:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpquh1zk_7:/etc/ceph/ceph.conf:z --entrypoint /usr/bin/ceph quay.io/ceph-ci/ceph:8c7e5228ba7bdea07454d0dd9c211c1b6b6dcfc8 -h
http://pulpito.ceph.com/sage-2020-03-09_11:35:13-rados:thrash-old-clients-wip-sage-testing-2020-03-08-2121-distro-basic-smithi/
I'm able to locally reproduce ceph -h ending with 'timed out' if I set ms_inject_socket_failures = 2000.
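For anyone trying to reproduce this locally, a sketch of where that option goes: the [global] section of the test cluster's ceph.conf (it asks the messenger to inject a socket failure roughly once every 2000 socket operations; the exact section may vary with your setup):

    [global]
            ms_inject_socket_failures = 2000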
Related issues
History
#1 Updated by Sage Weil over 3 years ago
- Description updated (diff)
#2 Updated by Sage Weil over 3 years ago
ceph.in sets a short 5s timeout for -h, and hitting that timeout triggers shutdown, but then ceph isn't stopping cleanly...
2020-03-09T11:08:59.462-0500 7f71637fe700 10 monclient(hunting): ms_handle_reset hunted mon v1:10.3.64.27:6789/0
timed out
2020-03-09T11:09:04.442-0500 7f71617fa700 10 monclient(hunting): shutdown:
 ceph version 15.1.0-1931-g5206bfbd484 (5206bfbd48467d551aaf2e0fce29cfcebfa44e33) octopus (rc)
 1: (ceph::BackTrace::BackTrace(int)+0x45) [0x7f717dbe0a3f]
 2: (MonClient::shutdown()+0x11b) [0x7f717e0322b5]
 3: (()+0xe74a5) [0x7f7186da34a5]
 4: (rados_shutdown()+0x124) [0x7f7186cf7e30]
 5: (()+0x5f2b1) [0x7f7186f9b2b1]
 6: (_PyMethodDef_RawFastCallDict()+0x11b) [0x7f719477ef6b]
 7: (_PyCFunction_FastCallDict()+0x23) [0x7f719477f263]
 8: (_PyEval_EvalFrameDefault()+0x6796) [0x7f71947eee56]
 9: (_PyFunction_FastCallKeywords()+0x102) [0x7f71947a0b12]
 10: (()+0x14180f) [0x7f71947b280f]
 11: (_PyEval_EvalFrameDefault()+0xbfe) [0x7f71947e92be]
 12: (_PyFunction_FastCallKeywords()+0x102) [0x7f71947a0b12]
 13: (()+0x14180f) [0x7f71947b280f]
 14: (_PyEval_EvalFrameDefault()+0xbfe) [0x7f71947e92be]
 15: (_PyFunction_FastCallDict()+0x11a) [0x7f71947a117a]
 16: (_PyObject_Call_Prepend()+0x66) [0x7f7194774f76]
 17: (PyObject_Call()+0x7e) [0x7f71947a188e]
 18: (()+0x207706) [0x7f7194878706]
 19: (()+0x188204) [0x7f71947f9204]
 20: (()+0x94e2) [0x7f71946584e2]
 21: (clone()+0x43) [0x7f7194abe693]
2020-03-09T11:09:04.447-0500 7f71617fa700 20 monclient(hunting): shutdown discarding 1 pending message(s)
2020-03-09T11:09:04.447-0500 7f71617fa700 1 -- 10.3.64.23:0/4226455987 >> v1:10.3.64.27:6789/0 conn(0x7f7174065dd0 legacy=0x7f7174110130 unknown :-1 s=STATE_CLOSED l=1).mark_down
2020-03-09T11:09:04.447-0500 7f71617fa700 20 --1- 10.3.64.23:0/4226455987 >> v1:10.3.64.27:6789/0 conn(0x7f7174065dd0 0x7f7174110130 :-1 s=CLOSED pgs=1315 cs=1 l=1).stop
2020-03-09T11:09:04.447-0500 7f71617fa700 10 -- 10.3.64.23:0/4226455987 shutdown 10.3.64.23:0/4226455987
2020-03-09T11:09:04.447-0500 7f71617fa700 10 Processor -- stop
2020-03-09T11:09:04.447-0500 7f71617fa700 20 Event(0x7f717405deb0 nevent=5000 time_id=1).wakeup
2020-03-09T11:09:04.448-0500 7f71617fa700 1 -- 10.3.64.23:0/4226455987 shutdown_connections
2020-03-09T11:09:04.448-0500 7f71617fa700 5 -- 10.3.64.23:0/4226455987 shutdown_connections mark down v1:10.3.64.27:6789/0 0x7f7174065dd0
2020-03-09T11:09:04.448-0500 7f71617fa700 20 --1- 10.3.64.23:0/4226455987 >> v1:10.3.64.27:6789/0 conn(0x7f7174065dd0 0x7f7174110130 :-1 s=CLOSED pgs=1315 cs=1 l=1).stop
2020-03-09T11:09:04.448-0500 7f71617fa700 5 -- 10.3.64.23:0/4226455987 shutdown_connections delete 0x7f7174065dd0
2020-03-09T11:09:04.448-0500 7f71617fa700 1 -- 10.3.64.23:0/4226455987 >> 10.3.64.23:0/4226455987 conn(0x7f7174061670 msgr2=0x7f7174106c80 unknown :-1 s=STATE_NONE l=0).mark_down
2020-03-09T11:09:04.448-0500 7f71617fa700 20 --1- 10.3.64.23:0/4226455987 >> 10.3.64.23:0/4226455987 conn(0x7f7174061670 0x7f7174106c80 :-1 s=NONE pgs=0 cs=0 l=0).stop
2020-03-09T11:09:04.448-0500 7f71617fa700 2 --1- 10.3.64.23:0/4226455987 >> 10.3.64.23:0/4226455987 conn(0x7f7174061670 0x7f7174106c80 :-1 s=NONE pgs=0 cs=0 l=0).stop
2020-03-09T11:09:04.448-0500 7f71617fa700 5 --1- 10.3.64.23:0/4226455987 >> 10.3.64.23:0/4226455987 conn(0x7f7174061670 0x7f7174106c80 :-1 s=NONE pgs=0 cs=0 l=0).reset_recv_state
2020-03-09T11:09:04.448-0500 7f71617fa700 20 Event(0x7f717405deb0 nevent=5000 time_id=1).wakeup
2020-03-09T11:09:04.448-0500 7f71617fa700 10 --1- 10.3.64.23:0/4226455987 >> 10.3.64.23:0/4226455987 conn(0x7f7174061670 0x7f7174106c80 :-1 s=NONE pgs=0 cs=0 l=0).discard_out_queue started
2020-03-09T11:09:04.448-0500 7f717a174700 5 --1- 10.3.64.23:0/4226455987 >> 10.3.64.23:0/4226455987 conn(0x7f7174061670 0x7f7174106c80 :-1 s=NONE pgs=0 cs=0 l=0).reset_recv_state (warped) reseting security handlers
2020-03-09T11:09:04.448-0500 7f71617fa700 20 Event(0x7f717405deb0 nevent=5000 time_id=1).wakeup
2020-03-09T11:09:04.448-0500 7f717a174700 5 --1- 10.3.64.23:0/4226455987 >> 10.3.64.23:0/4226455987 conn(0x7f7174061670 0x7f7174106c80 :-1 s=CLOSED pgs=0 cs=0 l=0).reset_security
2020-03-09T11:09:04.448-0500 7f71617fa700 20 Event(0x7f717405ef60 nevent=5000 time_id=5).wakeup
2020-03-09T11:09:04.448-0500 7f71617fa700 20 Event(0x7f7174060040 nevent=5000 time_id=1).wakeup
2020-03-09T11:09:04.448-0500 7f71617fa700 10 -- 10.3.64.23:0/4226455987 wait: waiting for dispatch queue
2020-03-09T11:09:04.448-0500 7f71617fa700 10 -- 10.3.64.23:0/4226455987 wait: dispatch queue is stopped
2020-03-09T11:09:04.448-0500 7f71617fa700 1 -- 10.3.64.23:0/4226455987 shutdown_connections
2020-03-09T11:09:04.448-0500 7f71617fa700 5 -- 10.3.64.23:0/4226455987 shutdown_connections delete 0x7f7174061670
2020-03-09T11:09:04.448-0500 7f71617fa700 20 Event(0x7f717405deb0 nevent=5000 time_id=1).wakeup
2020-03-09T11:09:04.448-0500 7f71617fa700 20 Event(0x7f717405ef60 nevent=5000 time_id=5).wakeup
2020-03-09T11:09:04.448-0500 7f71617fa700 20 Event(0x7f7174060040 nevent=5000 time_id=1).wakeup
2020-03-09T11:09:04.448-0500 7f71617fa700 10 -- 10.3.64.23:0/4226455987 wait: done.
2020-03-09T11:09:04.448-0500 7f71617fa700 1 -- 10.3.64.23:0/4226455987 wait complete.
^CTraceback (most recent call last):
  File "bin/ceph", line 1282, in <module>
    run_in_thread(cluster_handle.shutdown)
  File "/home/sage/src/ceph/src/pybind/ceph_argparse.py", line 1331, in run_in_thread
    t.join(timeout=timeout)
  File "/usr/lib64/python3.7/threading.py", line 1048, in join
    self._wait_for_tstate_lock(timeout=max(timeout, 0))
  File "/usr/lib64/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
(Here I hit Ctrl-C a few seconds after it failed to exit.)
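For reference, the timeout plumbing visible in that traceback is run_in_thread() from ceph_argparse.py. A minimal sketch of the pattern (simplified and illustrative, not the actual ceph.in code; func/result/target are made-up names):

    import threading

    def run_in_thread(func, timeout=5):
        # Run a potentially blocking librados call on a worker thread and
        # give up waiting after `timeout` seconds (simplified sketch).
        result = {}

        def target():
            result['value'] = func()

        t = threading.Thread(target=target)
        t.daemon = True  # a stuck librados call must not block interpreter exit
        t.start()
        t.join(timeout=timeout)
        if t.is_alive():
            # The worker is still stuck inside librados: the caller sees
            # 'timed out', but the thread itself keeps running.
            raise RuntimeError('timed out')
        return result.get('value')

Note that after the timeout fires, the worker thread is still alive inside librados; when the cleanup path then calls run_in_thread(cluster_handle.shutdown), rados_shutdown() runs concurrently with whatever that thread is still doing, which sets up the race analyzed in the next note.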
#3 Updated by Sage Weil over 3 years ago
I think the fundamental problem here is how ceph.in is using librados. One thread is trying to do some work, which involves rados_connect and then some commands, while another thread enforces a timeout... and if the timeout fires, it calls rados_shutdown. That shutdown may race with the connect: either while the connect is blocked (which is somewhat fixable, though gross), or the shutdown may even fire before the connect gets started, in which case it's unclear how that thread should know that it should do nothing.
I think the real fix is to not call rados_shutdown in the timeout case...
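A minimal sketch of that direction (illustrative only; teardown and timed_out are made-up names, and the actual change is in the pull request referenced below):

    import sys

    def teardown(cluster_handle, timed_out):
        # Hypothetical sketch: rados_shutdown() can race with a rados_connect()
        # that is still blocked (or has not even started) on the worker thread,
        # so only shut down on the clean path. On timeout, just exit; the
        # worker threads are daemonized, so the process can still terminate.
        if timed_out:
            sys.exit(1)
        cluster_handle.shutdown()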
#4 Updated by Sage Weil over 3 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 33825
#5 Updated by Sage Weil over 3 years ago
- Status changed from Fix Under Review to Resolved
#6 Updated by Sage Weil over 3 years ago
- Related to Bug #44566: ceph tell segv: librados fini vs protocolv2 added