Bug #44526
sporadic cephadm bootstrap failures: 'timed out'
Description
2020-03-09T12:37:04.299 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:Non-zero exit code 1 from /bin/podman run --rm --net=host -e CONTAINER_IMAGE=quay.io/ceph-ci/ceph:8c7e5228ba7bdea07454d0dd9c211c1b6b6dcfc8 -e NODE_NAME=smithi123 -v /var/log/ceph/7d5d8d88-6202-11ea-9a3e-001a4aab830c:/var/log/ceph:z -v /tmp/ceph-tmpw7u46ywy:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpquh1zk_7:/etc/ceph/ceph.conf:z --entrypoint /usr/bin/ceph quay.io/ceph-ci/ceph:8c7e5228ba7bdea07454d0dd9c211c1b6b6dcfc8 -h
2020-03-09T12:37:04.299 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stdout
2020-03-09T12:37:04.299 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stdout General usage:
2020-03-09T12:37:04.299 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stdout ==============
...
2020-03-09T12:37:04.320 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stdout critical, interesting, useful, noninteresting, debug
2020-03-09T12:37:04.320 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stdout List shows a table of all available stats
2020-03-09T12:37:04.320 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stdout Run <count> times (default forever),
2020-03-09T12:37:04.320 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stdout once per <interval> seconds (default 1)
2020-03-09T12:37:04.320 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stdout
2020-03-09T12:37:04.321 INFO:teuthology.orchestra.run.smithi123.stderr:INFO:cephadm:/usr/bin/ceph:stderr timed out
2020-03-09T12:37:04.321 INFO:teuthology.orchestra.run.smithi123.stderr:Traceback (most recent call last):
2020-03-09T12:37:04.321 INFO:teuthology.orchestra.run.smithi123.stderr:  File "/home/ubuntu/cephtest/cephadm", line 3729, in <module>
2020-03-09T12:37:04.321 INFO:teuthology.orchestra.run.smithi123.stderr:    r = args.func()
2020-03-09T12:37:04.321 INFO:teuthology.orchestra.run.smithi123.stderr:  File "/home/ubuntu/cephtest/cephadm", line 2053, in command_bootstrap
2020-03-09T12:37:04.321 INFO:teuthology.orchestra.run.smithi123.stderr:    is_available('Dashboard', is_dashboard_available)
2020-03-09T12:37:04.321 INFO:teuthology.orchestra.run.smithi123.stderr:  File "/home/ubuntu/cephtest/cephadm", line 575, in is_available
2020-03-09T12:37:04.322 INFO:teuthology.orchestra.run.smithi123.stderr:    if func():
2020-03-09T12:37:04.322 INFO:teuthology.orchestra.run.smithi123.stderr:  File "/home/ubuntu/cephtest/cephadm", line 2051, in is_dashboard_available
2020-03-09T12:37:04.322 INFO:teuthology.orchestra.run.smithi123.stderr:    out = cli(['-h'], timeout=timeout)
2020-03-09T12:37:04.322 INFO:teuthology.orchestra.run.smithi123.stderr:  File "/home/ubuntu/cephtest/cephadm", line 1924, in cli
2020-03-09T12:37:04.322 INFO:teuthology.orchestra.run.smithi123.stderr:    ).run(timeout=timeout)
2020-03-09T12:37:04.322 INFO:teuthology.orchestra.run.smithi123.stderr:  File "/home/ubuntu/cephtest/cephadm", line 1681, in run
2020-03-09T12:37:04.323 INFO:teuthology.orchestra.run.smithi123.stderr:    self.run_cmd(), desc=self.entrypoint, timeout=timeout)
2020-03-09T12:37:04.323 INFO:teuthology.orchestra.run.smithi123.stderr:  File "/home/ubuntu/cephtest/cephadm", line 518, in call_throws
2020-03-09T12:37:04.323 INFO:teuthology.orchestra.run.smithi123.stderr:    raise RuntimeError('Failed command: %s' % ' '.join(command))
2020-03-09T12:37:04.324 INFO:teuthology.orchestra.run.smithi123.stderr:RuntimeError: Failed command: /bin/podman run --rm --net=host -e CONTAINER_IMAGE=quay.io/ceph-ci/ceph:8c7e5228ba7bdea07454d0dd9c211c1b6b6dcfc8 -e NODE_NAME=smithi123 -v /var/log/ceph/7d5d8d88-6202-11ea-9a3e-001a4aab830c:/var/log/ceph:z -v /tmp/ceph-tmpw7u46ywy:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpquh1zk_7:/etc/ceph/ceph.conf:z --entrypoint /usr/bin/ceph quay.io/ceph-ci/ceph:8c7e5228ba7bdea07454d0dd9c211c1b6b6dcfc8 -h
http://pulpito.ceph.com/sage-2020-03-09_11:35:13-rados:thrash-old-clients-wip-sage-testing-2020-03-08-2121-distro-basic-smithi/
I'm able to locally reproduce ceph -h ending with 'timed out' if I set ms_inject_socket_failures = 2000.
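For anyone trying to reproduce this locally, a sketch of where that option goes: the [global] section of the test cluster's ceph.conf (it asks the messenger to inject a socket failure roughly once every 2000 socket operations; the exact section may vary with your setup):

    [global]
            ms_inject_socket_failures = 2000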
Related issues
History
#1 Updated by Sage Weil over 3 years ago
- Description updated (diff)
#2 Updated by Sage Weil over 3 years ago
ceph.in sets a short 5s timeout for -h, and hitting that timeout triggers shutdown, but then ceph isn't stopping cleanly...
2020-03-09T11:08:59.462-0500 7f71637fe700 10 monclient(hunting): ms_handle_reset hunted mon v1:10.3.64.27:6789/0
timed out
2020-03-09T11:09:04.442-0500 7f71617fa700 10 monclient(hunting): shutdown:
 ceph version 15.1.0-1931-g5206bfbd484 (5206bfbd48467d551aaf2e0fce29cfcebfa44e33) octopus (rc)
 1: (ceph::BackTrace::BackTrace(int)+0x45) [0x7f717dbe0a3f]
 2: (MonClient::shutdown()+0x11b) [0x7f717e0322b5]
 3: (()+0xe74a5) [0x7f7186da34a5]
 4: (rados_shutdown()+0x124) [0x7f7186cf7e30]
 5: (()+0x5f2b1) [0x7f7186f9b2b1]
 6: (_PyMethodDef_RawFastCallDict()+0x11b) [0x7f719477ef6b]
 7: (_PyCFunction_FastCallDict()+0x23) [0x7f719477f263]
 8: (_PyEval_EvalFrameDefault()+0x6796) [0x7f71947eee56]
 9: (_PyFunction_FastCallKeywords()+0x102) [0x7f71947a0b12]
 10: (()+0x14180f) [0x7f71947b280f]
 11: (_PyEval_EvalFrameDefault()+0xbfe) [0x7f71947e92be]
 12: (_PyFunction_FastCallKeywords()+0x102) [0x7f71947a0b12]
 13: (()+0x14180f) [0x7f71947b280f]
 14: (_PyEval_EvalFrameDefault()+0xbfe) [0x7f71947e92be]
 15: (_PyFunction_FastCallDict()+0x11a) [0x7f71947a117a]
 16: (_PyObject_Call_Prepend()+0x66) [0x7f7194774f76]
 17: (PyObject_Call()+0x7e) [0x7f71947a188e]
 18: (()+0x207706) [0x7f7194878706]
 19: (()+0x188204) [0x7f71947f9204]
 20: (()+0x94e2) [0x7f71946584e2]
 21: (clone()+0x43) [0x7f7194abe693]
2020-03-09T11:09:04.447-0500 7f71617fa700 20 monclient(hunting): shutdown discarding 1 pending message(s)
2020-03-09T11:09:04.447-0500 7f71617fa700 1 -- 10.3.64.23:0/4226455987 >> v1:10.3.64.27:6789/0 conn(0x7f7174065dd0 legacy=0x7f7174110130 unknown :-1 s=STATE_CLOSED l=1).mark_down
2020-03-09T11:09:04.447-0500 7f71617fa700 20 --1- 10.3.64.23:0/4226455987 >> v1:10.3.64.27:6789/0 conn(0x7f7174065dd0 0x7f7174110130 :-1 s=CLOSED pgs=1315 cs=1 l=1).stop
2020-03-09T11:09:04.447-0500 7f71617fa700 10 -- 10.3.64.23:0/4226455987 shutdown 10.3.64.23:0/4226455987
2020-03-09T11:09:04.447-0500 7f71617fa700 10 Processor -- stop
2020-03-09T11:09:04.447-0500 7f71617fa700 20 Event(0x7f717405deb0 nevent=5000 time_id=1).wakeup
2020-03-09T11:09:04.448-0500 7f71617fa700 1 -- 10.3.64.23:0/4226455987 shutdown_connections
2020-03-09T11:09:04.448-0500 7f71617fa700 5 -- 10.3.64.23:0/4226455987 shutdown_connections mark down v1:10.3.64.27:6789/0 0x7f7174065dd0
2020-03-09T11:09:04.448-0500 7f71617fa700 20 --1- 10.3.64.23:0/4226455987 >> v1:10.3.64.27:6789/0 conn(0x7f7174065dd0 0x7f7174110130 :-1 s=CLOSED pgs=1315 cs=1 l=1).stop
2020-03-09T11:09:04.448-0500 7f71617fa700 5 -- 10.3.64.23:0/4226455987 shutdown_connections delete 0x7f7174065dd0
2020-03-09T11:09:04.448-0500 7f71617fa700 1 -- 10.3.64.23:0/4226455987 >> 10.3.64.23:0/4226455987 conn(0x7f7174061670 msgr2=0x7f7174106c80 unknown :-1 s=STATE_NONE l=0).mark_down
2020-03-09T11:09:04.448-0500 7f71617fa700 20 --1- 10.3.64.23:0/4226455987 >> 10.3.64.23:0/4226455987 conn(0x7f7174061670 0x7f7174106c80 :-1 s=NONE pgs=0 cs=0 l=0).stop
2020-03-09T11:09:04.448-0500 7f71617fa700 2 --1- 10.3.64.23:0/4226455987 >> 10.3.64.23:0/4226455987 conn(0x7f7174061670 0x7f7174106c80 :-1 s=NONE pgs=0 cs=0 l=0).stop
2020-03-09T11:09:04.448-0500 7f71617fa700 5 --1- 10.3.64.23:0/4226455987 >> 10.3.64.23:0/4226455987 conn(0x7f7174061670 0x7f7174106c80 :-1 s=NONE pgs=0 cs=0 l=0).reset_recv_state
2020-03-09T11:09:04.448-0500 7f71617fa700 20 Event(0x7f717405deb0 nevent=5000 time_id=1).wakeup
2020-03-09T11:09:04.448-0500 7f71617fa700 10 --1- 10.3.64.23:0/4226455987 >> 10.3.64.23:0/4226455987 conn(0x7f7174061670 0x7f7174106c80 :-1 s=NONE pgs=0 cs=0 l=0).discard_out_queue started
2020-03-09T11:09:04.448-0500 7f717a174700 5 --1- 10.3.64.23:0/4226455987 >> 10.3.64.23:0/4226455987 conn(0x7f7174061670 0x7f7174106c80 :-1 s=NONE pgs=0 cs=0 l=0).reset_recv_state (warped) reseting security handlers
2020-03-09T11:09:04.448-0500 7f71617fa700 20 Event(0x7f717405deb0 nevent=5000 time_id=1).wakeup
2020-03-09T11:09:04.448-0500 7f717a174700 5 --1- 10.3.64.23:0/4226455987 >> 10.3.64.23:0/4226455987 conn(0x7f7174061670 0x7f7174106c80 :-1 s=CLOSED pgs=0 cs=0 l=0).reset_security
2020-03-09T11:09:04.448-0500 7f71617fa700 20 Event(0x7f717405ef60 nevent=5000 time_id=5).wakeup
2020-03-09T11:09:04.448-0500 7f71617fa700 20 Event(0x7f7174060040 nevent=5000 time_id=1).wakeup
2020-03-09T11:09:04.448-0500 7f71617fa700 10 -- 10.3.64.23:0/4226455987 wait: waiting for dispatch queue
2020-03-09T11:09:04.448-0500 7f71617fa700 10 -- 10.3.64.23:0/4226455987 wait: dispatch queue is stopped
2020-03-09T11:09:04.448-0500 7f71617fa700 1 -- 10.3.64.23:0/4226455987 shutdown_connections
2020-03-09T11:09:04.448-0500 7f71617fa700 5 -- 10.3.64.23:0/4226455987 shutdown_connections delete 0x7f7174061670
2020-03-09T11:09:04.448-0500 7f71617fa700 20 Event(0x7f717405deb0 nevent=5000 time_id=1).wakeup
2020-03-09T11:09:04.448-0500 7f71617fa700 20 Event(0x7f717405ef60 nevent=5000 time_id=5).wakeup
2020-03-09T11:09:04.448-0500 7f71617fa700 20 Event(0x7f7174060040 nevent=5000 time_id=1).wakeup
2020-03-09T11:09:04.448-0500 7f71617fa700 10 -- 10.3.64.23:0/4226455987 wait: done.
2020-03-09T11:09:04.448-0500 7f71617fa700 1 -- 10.3.64.23:0/4226455987 wait complete.
^CTraceback (most recent call last):
  File "bin/ceph", line 1282, in <module>
    run_in_thread(cluster_handle.shutdown)
  File "/home/sage/src/ceph/src/pybind/ceph_argparse.py", line 1331, in run_in_thread
    t.join(timeout=timeout)
  File "/usr/lib64/python3.7/threading.py", line 1048, in join
    self._wait_for_tstate_lock(timeout=max(timeout, 0))
  File "/usr/lib64/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
(Here I hit Ctrl-C a few seconds after it failed to exit.)
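For reference, the timeout plumbing visible in that traceback is run_in_thread() from ceph_argparse.py. A minimal sketch of the pattern (simplified and illustrative, not the actual ceph.in code; func/result/target are made-up names):

    import threading

    def run_in_thread(func, timeout=5):
        # Run a potentially blocking librados call on a worker thread and
        # give up waiting after `timeout` seconds (simplified sketch).
        result = {}

        def target():
            result['value'] = func()

        t = threading.Thread(target=target)
        t.daemon = True  # a stuck librados call must not block interpreter exit
        t.start()
        t.join(timeout=timeout)
        if t.is_alive():
            # The worker is still stuck inside librados: the caller sees
            # 'timed out', but the thread itself keeps running.
            raise RuntimeError('timed out')
        return result.get('value')

Note that after the timeout fires, the worker thread is still alive inside librados; when the cleanup path then calls run_in_thread(cluster_handle.shutdown), rados_shutdown() runs concurrently with whatever that thread is still doing, which sets up the race analyzed in the next note.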
#3 Updated by Sage Weil over 3 years ago
I think the fundamental problem here is how ceph.in is using librados. One thread is trying to do some work, which involves rados_connect and then some commands, while another thread enforces a timeout... and if the timeout fires, it calls rados_shutdown. That shutdown may race with the connect: either while the connect is blocked (which is somewhat fixable, though gross), or the shutdown may even fire before the connect gets started, in which case it's unclear how that thread should know that it should do nothing.
I think the real fix is to not call rados_shutdown in the timeout case...
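A minimal sketch of that direction (illustrative only; teardown and timed_out are made-up names, and the actual change is in the pull request referenced below):

    import sys

    def teardown(cluster_handle, timed_out):
        # Hypothetical sketch: rados_shutdown() can race with a rados_connect()
        # that is still blocked (or has not even started) on the worker thread,
        # so only shut down on the clean path. On timeout, just exit; the
        # worker threads are daemonized, so the process can still terminate.
        if timed_out:
            sys.exit(1)
        cluster_handle.shutdown()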
#4 Updated by Sage Weil over 3 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 33825
#5 Updated by Sage Weil over 3 years ago
- Status changed from Fix Under Review to Resolved
#6 Updated by Sage Weil over 3 years ago
- Related to Bug #44566: ceph tell segv: librados fini vs protocolv2 added