Bug #53768: timed out waiting for admin_socket to appear after osd.2 restart in thrasher/defaults workload/small-objects - RADOS - Ceph

Added by Joseph Sawaya over 2 years ago. Updated 1 day ago.

Status: Closed
Priority: Normal
Assignee: Samuel Just
Backport: pacific
Severity: 3 - minor

Description

Error snippet:

2022-01-02T01:37:09.296 DEBUG:teuthology.orchestra.run.smithi086:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok dump_ops_in_flight
2022-01-02T01:37:09.410 INFO:teuthology.orchestra.run.smithi086.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
2022-01-02T01:37:09.413 DEBUG:teuthology.orchestra.run:got remote process result: 22
2022-01-02T01:37:09.413 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph_a2f5a3c1dbfa4dce41e25da4f029a8fdb8c8d864/qa/tasks/ceph_manager.py", line 189, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph_a2f5a3c1dbfa4dce41e25da4f029a8fdb8c8d864/qa/tasks/ceph_manager.py", line 1412, in _do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph_a2f5a3c1dbfa4dce41e25da4f029a8fdb8c8d864/qa/tasks/ceph_manager.py", line 581, in revive_osd
    skip_admin_check=skip_admin_check)
  File "/home/teuthworker/src/github.com_ceph_ceph_a2f5a3c1dbfa4dce41e25da4f029a8fdb8c8d864/qa/tasks/ceph_manager.py", line 3021, in revive_osd
    timeout=timeout, stdout=DEVNULL)
  File "/home/teuthworker/src/github.com_ceph_ceph_a2f5a3c1dbfa4dce41e25da4f029a8fdb8c8d864/qa/tasks/ceph_manager.py", line 1963, in wait_run_admin_socket
    id=service_id))
Exception: timed out waiting for admin_socket to appear after osd.2 restart

2022-01-02T01:37:09.414 ERROR:tasks.thrashosds.thrasher:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph_a2f5a3c1dbfa4dce41e25da4f029a8fdb8c8d864/qa/tasks/ceph_manager.py", line 1280, in do_thrash
    self._do_thrash()
  File "/home/teuthworker/src/github.com_ceph_ceph_a2f5a3c1dbfa4dce41e25da4f029a8fdb8c8d864/qa/tasks/ceph_manager.py", line 189, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph_a2f5a3c1dbfa4dce41e25da4f029a8fdb8c8d864/qa/tasks/ceph_manager.py", line 1412, in _do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph_a2f5a3c1dbfa4dce41e25da4f029a8fdb8c8d864/qa/tasks/ceph_manager.py", line 581, in revive_osd
    skip_admin_check=skip_admin_check)
  File "/home/teuthworker/src/github.com_ceph_ceph_a2f5a3c1dbfa4dce41e25da4f029a8fdb8c8d864/qa/tasks/ceph_manager.py", line 3021, in revive_osd
    timeout=timeout, stdout=DEVNULL)
  File "/home/teuthworker/src/github.com_ceph_ceph_a2f5a3c1dbfa4dce41e25da4f029a8fdb8c8d864/qa/tasks/ceph_manager.py", line 1963, in wait_run_admin_socket
    id=service_id))
Exception: timed out waiting for admin_socket to appear after osd.2 restart
2022-01-02T01:37:09.743 INFO:tasks.ceph.osd.3.smithi086.stderr:INFO 2022-01-02 01:37:09,809 [shard 0] alienstore - stat
2022-01-02T01:37:09.814 INFO:tasks.ceph.osd.2.smithi086.stderr:INFO 2022-01-02 01:37:09,879 [shard 0] alienstore - stat
2022-01-02T01:37:09.901 INFO:tasks.ceph.osd.1.smithi074.stderr:INFO 2022-01-02 01:37:09,967 [shard 0] alienstore - stat
2022-01-02T01:37:10.194 INFO:tasks.ceph.osd.1.smithi074.stderr:ERROR 2022-01-02 01:37:10,259 [shard 0] ms - ms_dispatch unhandled message ping magic: 0 v1
2022-01-02T01:37:10.194 INFO:tasks.ceph.osd.3.smithi086.stderr:ERROR 2022-01-02 01:37:10,259 [shard 0] ms - ms_dispatch unhandled message ping magic: 0 v1
2022-01-02T01:37:10.459 DEBUG:teuthology.orchestra.run.smithi074:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd unset noscrub
2022-01-02T01:37:10.616 INFO:tasks.daemonwatchdog.daemon_watchdog:OSDThrasher failed
2022-01-02T01:37:10.617 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK! unmounting mounts and killing all daemons

Full logs found here: http://qa-proxy.ceph.com/teuthology/teuthology-2022-01-02_01:01:03-crimson-rados-master-distro-default-smithi/6589721/teuthology.log

It looks like osd.2 is failing to restart correctly. Later in the logs there also seem to be some memory leaks pertaining to PGs.
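
For anyone trying to reproduce this by hand, the difference between a missing socket file and "[Errno 111] Connection refused" can be checked with a short script. The following is only a rough sketch, not the logic of wait_run_admin_socket in qa/tasks/ceph_manager.py: it connects to the Unix socket directly instead of running `ceph --admin-daemon`, and the function names, state labels, and default timeout/interval values are all assumptions made up for illustration.

```python
import socket
import time


def probe_admin_socket(path):
    """Classify a Unix socket path as 'missing', 'refused', or 'listening'.

    Hypothetical helper; it only attempts a connect and sends no command.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        try:
            sock.connect(path)
        except FileNotFoundError:
            # No socket file at this path (daemon never created it).
            return "missing"
        except ConnectionRefusedError:
            # Errno 111 on Linux: the file exists but nothing accepts on it,
            # e.g. a stale socket left behind by a previous daemon instance.
            return "refused"
    return "listening"


def wait_for_admin_socket(path, timeout=300.0, interval=5.0):
    """Poll until the socket accepts connections; raise after the deadline.

    The timeout and interval defaults are arbitrary illustration values.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = probe_admin_socket(path)
        if state == "listening":
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{path} still {state} after {timeout}s")
        time.sleep(interval)
```

Calling `probe_admin_socket("/var/run/ceph/ceph-osd.2.asok")` on the test node while the thrasher is waiting would show whether the socket file is absent or present but not being served. The log above shows the second case at the moment of the failed `dump_ops_in_flight` call.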

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

/a/yuriw-2022-07-22_03:30:40-rados-wip-yuri3-testing-2022-07-21-1604-distro-default-smithi/6943791/

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

Hey Joseph, what's the status on this?

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

/a/yuriw-2022-07-22_03:30:40-rados-wip-yuri3-testing-2022-07-21-1604-distro-default-smithi/6944338/

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

The job is dead (it hit the max timeout), but the traceback suggests:

Exception: timed out waiting for admin_socket to appear after osd.2 restart

/a/yuriw-2022-07-22_03:30:40-rados-wip-yuri3-testing-2022-07-21-1604-distro-default-smithi/6943718
/a/yuriw-2022-07-22_03:30:40-rados-wip-yuri3-testing-2022-07-21-1604-distro-default-smithi/6943741
