Bug #52562
Thrashosds read error injection failed with error ENXIO
Status: Closed
Description
/a/yuriw-2021-09-08_15:10:21-rados-wip-yuri2-testing-2021-09-07-1258-distro-basic-smithi/6379886
As part of the thrashosds test, OSDs that are out/dead are revived; here, osd.0 was revived from the dead state. After the daemon was restarted, the command to inject a read error failed with ENXIO:
2021-09-08T15:47:42.406 INFO:tasks.thrashosds.thrasher:reviving osd
2021-09-08T15:47:42.406 INFO:tasks.thrashosds.thrasher:Reviving osd 0
2021-09-08T15:47:42.407 INFO:tasks.ceph.osd.0:Restarting daemon
2021-09-08T15:47:42.407 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-osd -f --cluster ceph -i 0
2021-09-08T15:47:42.410 INFO:tasks.ceph.osd.0:Started
2021-09-08T15:47:42.410 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
2021-09-08T15:47:42.531 INFO:teuthology.orchestra.run.smithi064.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
2021-09-08T15:47:42.533 DEBUG:teuthology.orchestra.run:got remote process result: 22
2021-09-08T15:47:42.533 INFO:tasks.ceph.ceph_manager.ceph:waiting on admin_socket for osd-0, ['dump_ops_in_flight']
2021-09-08T15:47:43.076 INFO:tasks.ceph.osd.0.smithi064.stderr:2021-09-08T15:47:43.074+0000 7f36982490c0 -1 Falling back to public interface
2021-09-08T15:47:46.993 INFO:tasks.ceph.osd.0.smithi064.stderr:2021-09-08T15:47:46.992+0000 7f36982490c0 -1 osd.0 244 log_to_monitors {default=true}
2021-09-08T15:47:47.344 INFO:tasks.ceph.osd.0.smithi064.stderr:2021-09-08T15:47:47.341+0000 7f368959e700 -1 osd.0 301 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2021-09-08T15:47:47.534 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
2021-09-08T15:47:47.663 INFO:teuthology.orchestra.run.smithi064.stdout:{
2021-09-08T15:47:47.663 INFO:teuthology.orchestra.run.smithi064.stdout:    "ops": [],
2021-09-08T15:47:47.663 INFO:teuthology.orchestra.run.smithi064.stdout:    "num_ops": 0
2021-09-08T15:47:47.663 INFO:teuthology.orchestra.run.smithi064.stdout:}
2021-09-08T15:47:47.673 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config set filestore_debug_random_read_err 0.33
2021-09-08T15:47:47.798 INFO:teuthology.orchestra.run.smithi064.stdout:{
2021-09-08T15:47:47.799 INFO:teuthology.orchestra.run.smithi064.stdout:    "success": "filestore_debug_random_read_err = '0.330000' (not observed, change may require restart) "
2021-09-08T15:47:47.799 INFO:teuthology.orchestra.run.smithi064.stdout:}
2021-09-08T15:47:47.809 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config set bluestore_debug_random_read_err 0.33
2021-09-08T15:47:47.935 INFO:teuthology.orchestra.run.smithi064.stdout:{
2021-09-08T15:47:47.936 INFO:teuthology.orchestra.run.smithi064.stdout:    "success": "bluestore_debug_random_read_err = '0.330000' (not observed, change may require restart) "
2021-09-08T15:47:47.936 INFO:teuthology.orchestra.run.smithi064.stdout:}
2021-09-08T15:47:47.945 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell osd.0 injectargs --filestore_debug_random_read_err=0.0
2021-09-08T15:47:48.067 INFO:teuthology.orchestra.run.smithi064.stderr:Error ENXIO: problem getting command descriptions from osd.0
2021-09-08T15:47:48.069 DEBUG:teuthology.orchestra.run:got remote process result: 6
2021-09-08T15:47:48.070 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5091269d28a67ed0674391279081ff3cac6e91e1/qa/tasks/ceph_manager.py", line 188, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5091269d28a67ed0674391279081ff3cac6e91e1/qa/tasks/ceph_manager.py", line 1416, in _do_thrash
    'filestore_debug_random_read_err', '0.0')
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5091269d28a67ed0674391279081ff3cac6e91e1/qa/tasks/ceph_manager.py", line 1990, in inject_args
    self.raw_cluster_cmd('--', 'tell', whom, 'injectargs', opt_arg)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5091269d28a67ed0674391279081ff3cac6e91e1/qa/tasks/ceph_manager.py", line 1596, in raw_cluster_cmd
    return self.run_cluster_cmd(**kwargs).stdout.getvalue()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5091269d28a67ed0674391279081ff3cac6e91e1/qa/tasks/ceph_manager.py", line 1587, in run_cluster_cmd
    return self.controller.run(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b135a4f82b02b966053efba0501e6a47c9d2d793/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b135a4f82b02b966053efba0501e6a47c9d2d793/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b135a4f82b02b966053efba0501e6a47c9d2d793/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b135a4f82b02b966053efba0501e6a47c9d2d793/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed on smithi064 with status 6: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell osd.0 injectargs --filestore_debug_random_read_err=0.0'
There were fixes made in the test as part of https://tracker.ceph.com/issues/21206 (quite a while ago), but this seems to be different and related to the time it took for osd.0 to become active.
Updated by Neha Ojha over 2 years ago
- Status changed from New to Triaged
Looking at the osd and mon logs:
Here is when the osd was restarted in revive_osd():
2021-09-08T15:47:42.550+0000 7f36982490c0 0 ceph version 17.0.0-7521-g5091269d (5091269d28a67ed0674391279081ff3cac6e91e1) quincy (dev), process ceph-osd, pid 90160
but the OSD wasn't up until
2021-09-08T15:47:47.372+0000 7f3671360700 20 osd.0 301 dispatch_context not up in osdmap
The boot message was sent here:
2021-09-08T15:47:47.364+0000 7f368959e700 1 -- v1:172.21.15.64:6809/90160 --> v1:172.21.15.64:6789/0 -- osd_boot(osd.0 booted 0 features 4540138303579357183 v301) v7 -- 0x55c6c7a1cc00 con 0x55c6c2483c00
It was marked UP in the osdmap only here:
2021-09-08T15:47:48.314+0000 7f527f8b3700 2 mon.a@0(leader).osd e301 osd.0 UP v1:172.21.15.64:6809/90160
The failing tell command was run before that:
2021-09-08T15:47:47.945 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell osd.0 injectargs --filestore_debug_random_read_err=0.0
2021-09-08T15:47:48.067 INFO:teuthology.orchestra.run.smithi064.stderr:Error ENXIO: problem getting command descriptions from osd.0
We can probably use something like "wait_for_all_osds_up()", which has a check for up osds in the osdmap, to avoid such issues.
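Such a guard could look roughly like the following minimal sketch. The helper name and the shape of the osdmap data are assumptions for illustration (the real wait_for_all_osds_up() lives in the qa tasks and operates on a ceph_manager instance); the point is simply to poll the osdmap until every OSD reports up before proceeding:

```python
import time

def wait_for_all_osds_up(get_osd_dump, timeout=60, interval=1.0):
    """Poll the osdmap until every OSD reports up=1, or raise on timeout.

    get_osd_dump is assumed to return a list of dicts with 'osd' and 'up'
    fields, mirroring the shape of `ceph osd dump --format json` output.
    """
    deadline = time.time() + timeout
    while True:
        osds = get_osd_dump()
        if all(o['up'] == 1 for o in osds):
            return
        if time.time() >= deadline:
            down = [o['osd'] for o in osds if o['up'] != 1]
            raise TimeoutError(f'osds still down after {timeout}s: {down}')
        time.sleep(interval)
```

Injecting args only after this returns would avoid racing the revived OSD's boot sequence.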
Updated by Deepika Upadhyay over 2 years ago
2021-09-14T04:26:45.603 INFO:teuthology.orchestra.run.smithi041.stderr:Error ENXIO: problem getting command descriptions from osd.1
/ceph/teuthology-archive/yuriw-2021-09-13_19:12:32-rados-wip-yuri6-testing-2021-09-13-1025-pacific-distro-basic-smithi/6387610/teuthology.log
Updated by Neha Ojha over 2 years ago
Deepika Upadhyay wrote:
[...]
/ceph/teuthology-archive/yuriw-2021-09-13_19:12:32-rados-wip-yuri6-testing-2021-09-13-1025-pacific-distro-basic-smithi/6387610/teuthology.log
Deepika, I think what you saw is different; osd.1 here hit the https://tracker.ceph.com/issues/52124 issue.
Updated by Laura Flores over 2 years ago
- Related to Bug #53767: qa/workunits/cls/test_cls_2pc_queue.sh: killing an osd during thrashing causes timeout added
Updated by Aishwarya Mathuria about 2 years ago
/a/yuriw-2022-03-29_21:35:32-rados-wip-yuri5-testing-2022-03-29-1152-quincy-distro-default-smithi/6767850
Updated by Laura Flores almost 2 years ago
- Related to deleted (Bug #53767: qa/workunits/cls/test_cls_2pc_queue.sh: killing an osd during thrashing causes timeout)
Updated by Nitzan Mordechai about 1 year ago
- Related to Bug #56097: Timeout on `sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.1 flush_pg_stats` added
Updated by Sridhar Seshasayee 3 months ago
/a/yuriw-2024-01-26_01:08:12-rados-wip-yuri2-testing-2024-01-25-1327-reef-distro-default-smithi/7533809
Updated by Laura Flores 3 months ago
Sridhar Seshasayee wrote:
/a/yuriw-2024-01-26_01:08:12-rados-wip-yuri2-testing-2024-01-25-1327-reef-distro-default-smithi/7533809
I see this in the teuthology log:
2024-01-27T06:15:21.583 INFO:tasks.ceph.osd.7.smithi196.stderr:*** Caught signal (Aborted) **
2024-01-27T06:15:21.584 INFO:tasks.ceph.osd.7.smithi196.stderr: in thread 2ca9e640 thread_name:tp_osd_tp
2024-01-27T06:15:21.590 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 10 client.23175.objecter _submit_command 12 [{"prefix": "injectargs", "injected_args": ["--osd_enable_op_tracker=true"]}]
2024-01-27T06:15:21.590 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 15 client.23175.objecter _session_command_op_assign -1 12
2024-01-27T06:15:21.590 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 20 client.23175.objecter _get_session s=0x7f53041ed790 osd=5 3
2024-01-27T06:15:21.590 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 20 client.23175.objecter put_session s=0x7f53041ed790 osd=5 3
2024-01-27T06:15:21.591 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 20 client.23175.objecter _get_session s=0x7f53041ed790 osd=5 3
2024-01-27T06:15:21.591 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 15 client.23175.objecter _session_command_op_remove -1 12
2024-01-27T06:15:21.591 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 20 client.23175.objecter get_session s=0x7f53041ed790 osd=5 3
2024-01-27T06:15:21.591 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 15 client.23175.objecter _session_command_op_assign 5 12
2024-01-27T06:15:21.591 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 20 client.23175.objecter put_session s=0x7f53041ed790 osd=5 4
2024-01-27T06:15:21.591 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 10 client.23175.objecter _send_command 12
2024-01-27T06:15:21.591 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 1 -- 172.21.15.161:0/1160651865 --> v2:172.21.15.196:6800/34680 -- command(tid 12: {"prefix": "injectargs", "injected_args": ["--osd_enable_op_tracker=true"]}) v1 -- 0x7f53041f0770 con 0x7f53041ede40
2024-01-27T06:15:21.638 INFO:tasks.ceph.osd.7.smithi196.stderr: ceph version 18.2.1-302-ga6afa309 (a6afa309b73acbe67470860ef3bb48c61ffcf770) reef (stable)
2024-01-27T06:15:21.638 INFO:tasks.ceph.osd.7.smithi196.stderr: 1: /lib64/libc.so.6(+0x54db0) [0x549fdb0]
2024-01-27T06:15:21.639 INFO:tasks.ceph.osd.7.smithi196.stderr: 2: /lib64/libc.so.6(+0xa154c) [0x54ec54c]
2024-01-27T06:15:21.639 INFO:tasks.ceph.osd.7.smithi196.stderr: 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >)+0x226) [0xb9ab06]
2024-01-27T06:15:21.639 INFO:tasks.ceph.osd.7.smithi196.stderr: 4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >)+0x70) [0xb9ac00]
2024-01-27T06:15:21.639 INFO:tasks.ceph.osd.7.smithi196.stderr: 5: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x248) [0xbab188]
2024-01-27T06:15:21.639 INFO:tasks.ceph.osd.7.smithi196.stderr: 6: ceph-osd(+0xaa3704) [0xbab704]
2024-01-27T06:15:21.639 INFO:tasks.ceph.osd.7.smithi196.stderr: 7: /lib64/libc.so.6(+0x9f802) [0x54ea802]
2024-01-27T06:15:21.639 INFO:tasks.ceph.osd.7.smithi196.stderr: 8: clone()
Updated by Radoslaw Zarzynski 3 months ago
When is ENXIO generated? Doesn't this happen when somebody opens a unix socket without a peer? If so, the death of an OSD may be just a trigger of this issue.
Updated by Radoslaw Zarzynski 3 months ago
When is ENXIO generated? Doesn't this happen when somebody opens a unix socket without a peer?
Rather no. From man unix:
ECONNREFUSED  The remote address specified by connect(2) was not a listening socket. This error can also occur if the target pathname is not a socket.
ECONNRESET    Remote socket was unexpectedly closed.
...
ENOMEM        Out of memory.
ENOTCONN      Socket operation needs a target address, but the socket is not connected.
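The man-page point can be checked directly: on Linux, connecting to a unix-socket path with no listener fails with ECONNREFUSED, not ENXIO. A small illustrative sketch (not Ceph code):

```python
import errno
import os
import socket
import tempfile

# Create a unix-socket file, then close it so nothing is listening on it.
path = os.path.join(tempfile.mkdtemp(), 'dead.sock')
s = socket.socket(socket.AF_UNIX)
s.bind(path)
s.close()

# Connecting to the dead socket path raises ECONNREFUSED, not ENXIO,
# which supports the argument that the ENXIO seen here comes from
# elsewhere (the mon client), not from the admin socket itself.
c = socket.socket(socket.AF_UNIX)
try:
    c.connect(path)
except OSError as e:
    print(errno.errorcode[e.errno])
```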
MonClient can generate ENXIO:
int monc_error_category::from_code(int ev) const noexcept {
  if (ev == 0)
    return 0;
  switch (static_cast<monc_errc>(ev)) {
  case monc_errc::shutting_down:
    return -ECANCELED;
  case monc_errc::session_reset:
    return -EAGAIN;
  case monc_errc::rank_dne:
    [[fallthrough]];
  case monc_errc::mon_dne:
    return -ENOENT;
  case monc_errc::timed_out:
    return -ETIMEDOUT;
  case monc_errc::mon_unavailable:
    return -ENXIO;
  }
  return -EDOM;
}
Updated by Radoslaw Zarzynski 3 months ago
Yeah, there is only one place which generates the mon_unavailable
error:
void MonClient::_send_command(MonCommand *r)
{
  if (r->is_tell()) {
    ++r->send_attempts;
    if (r->send_attempts > cct->_conf->mon_client_directed_command_retry) {
      _finish_command(r, monc_errc::mon_unavailable, "mon unavailable", {});
      return;
    }
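The guard above can be paraphrased as a small sketch (hypothetical names, not the actual MonClient code): each send attempt bumps a counter, and once it exceeds mon_client_directed_command_retry the command is finished with ENXIO ("mon unavailable"), which is exactly the error the test saw:

```python
import errno

class CommandFailed(Exception):
    """Stand-in for the error surfaced to the caller (hypothetical)."""
    def __init__(self, code, msg):
        super().__init__(msg)
        self.code = code

def send_directed_command(attempts_so_far, max_retries, deliver):
    """Mimic MonClient::_send_command's retry guard.

    attempts_so_far counts prior sends; max_retries plays the role of
    mon_client_directed_command_retry; deliver() performs the actual send.
    Returns the updated attempt count.
    """
    attempts = attempts_so_far + 1
    if attempts > max_retries:
        # Mirrors _finish_command(r, monc_errc::mon_unavailable, ...),
        # which maps to -ENXIO in monc_error_category::from_code().
        raise CommandFailed(-errno.ENXIO, 'mon unavailable')
    deliver()
    return attempts
```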
So MonClient was unable to talk to a monitor, while heartbeats between OSDs were observed as well.
A lab / networking issue?
Updated by Radoslaw Zarzynski 3 months ago
- Tags set to test-failure
Updated by Radoslaw Zarzynski 2 months ago
- Status changed from Triaged to Closed
Closing per c#12 -- likely a networking issue.