Bug #52562 (Closed)

Thrashosds read error injection failed with error ENXIO

Added by Sridhar Seshasayee over 2 years ago. Updated 2 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
test-failure
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2021-09-08_15:10:21-rados-wip-yuri2-testing-2021-09-07-1258-distro-basic-smithi/6379886

As part of the thrashosds test, osds that are out/dead are revived. In this run, osd.0 was revived from the dead state. After the daemon was restarted, the command to inject a read error failed with ENXIO:

2021-09-08T15:47:42.406 INFO:tasks.thrashosds.thrasher:reviving osd
2021-09-08T15:47:42.406 INFO:tasks.thrashosds.thrasher:Reviving osd 0
2021-09-08T15:47:42.407 INFO:tasks.ceph.osd.0:Restarting daemon
2021-09-08T15:47:42.407 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-osd -f --cluster ceph -i 0
2021-09-08T15:47:42.410 INFO:tasks.ceph.osd.0:Started
2021-09-08T15:47:42.410 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
2021-09-08T15:47:42.531 INFO:teuthology.orchestra.run.smithi064.stderr:admin_socket: exception getting command descriptions: [Errno 111] Connection refused
2021-09-08T15:47:42.533 DEBUG:teuthology.orchestra.run:got remote process result: 22
2021-09-08T15:47:42.533 INFO:tasks.ceph.ceph_manager.ceph:waiting on admin_socket for osd-0, ['dump_ops_in_flight']
2021-09-08T15:47:43.076 INFO:tasks.ceph.osd.0.smithi064.stderr:2021-09-08T15:47:43.074+0000 7f36982490c0 -1 Falling back to public interface
2021-09-08T15:47:46.993 INFO:tasks.ceph.osd.0.smithi064.stderr:2021-09-08T15:47:46.992+0000 7f36982490c0 -1 osd.0 244 log_to_monitors {default=true}
2021-09-08T15:47:47.344 INFO:tasks.ceph.osd.0.smithi064.stderr:2021-09-08T15:47:47.341+0000 7f368959e700 -1 osd.0 301 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
2021-09-08T15:47:47.534 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
2021-09-08T15:47:47.663 INFO:teuthology.orchestra.run.smithi064.stdout:{
2021-09-08T15:47:47.663 INFO:teuthology.orchestra.run.smithi064.stdout:    "ops": [],
2021-09-08T15:47:47.663 INFO:teuthology.orchestra.run.smithi064.stdout:    "num_ops": 0
2021-09-08T15:47:47.663 INFO:teuthology.orchestra.run.smithi064.stdout:}
2021-09-08T15:47:47.673 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config set filestore_debug_random_read_err 0.33
2021-09-08T15:47:47.798 INFO:teuthology.orchestra.run.smithi064.stdout:{
2021-09-08T15:47:47.799 INFO:teuthology.orchestra.run.smithi064.stdout:    "success": "filestore_debug_random_read_err = '0.330000' (not observed, change may require restart) " 
2021-09-08T15:47:47.799 INFO:teuthology.orchestra.run.smithi064.stdout:}
2021-09-08T15:47:47.809 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config set bluestore_debug_random_read_err 0.33
2021-09-08T15:47:47.935 INFO:teuthology.orchestra.run.smithi064.stdout:{
2021-09-08T15:47:47.936 INFO:teuthology.orchestra.run.smithi064.stdout:    "success": "bluestore_debug_random_read_err = '0.330000' (not observed, change may require restart) " 
2021-09-08T15:47:47.936 INFO:teuthology.orchestra.run.smithi064.stdout:}
2021-09-08T15:47:47.945 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell osd.0 injectargs --filestore_debug_random_read_err=0.0
2021-09-08T15:47:48.067 INFO:teuthology.orchestra.run.smithi064.stderr:Error ENXIO: problem getting command descriptions from osd.0
2021-09-08T15:47:48.069 DEBUG:teuthology.orchestra.run:got remote process result: 6
2021-09-08T15:47:48.070 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5091269d28a67ed0674391279081ff3cac6e91e1/qa/tasks/ceph_manager.py", line 188, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5091269d28a67ed0674391279081ff3cac6e91e1/qa/tasks/ceph_manager.py", line 1416, in _do_thrash
    'filestore_debug_random_read_err', '0.0')
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5091269d28a67ed0674391279081ff3cac6e91e1/qa/tasks/ceph_manager.py", line 1990, in inject_args
    self.raw_cluster_cmd('--', 'tell', whom, 'injectargs', opt_arg)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5091269d28a67ed0674391279081ff3cac6e91e1/qa/tasks/ceph_manager.py", line 1596, in raw_cluster_cmd
    return self.run_cluster_cmd(**kwargs).stdout.getvalue()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_5091269d28a67ed0674391279081ff3cac6e91e1/qa/tasks/ceph_manager.py", line 1587, in run_cluster_cmd
    return self.controller.run(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b135a4f82b02b966053efba0501e6a47c9d2d793/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b135a4f82b02b966053efba0501e6a47c9d2d793/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b135a4f82b02b966053efba0501e6a47c9d2d793/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_b135a4f82b02b966053efba0501e6a47c9d2d793/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed on smithi064 with status 6: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell osd.0 injectargs --filestore_debug_random_read_err=0.0'

Fixes were made to the test quite a while ago as part of https://tracker.ceph.com/issues/21206, but this seems different and appears related to the time it took for osd.0 to become active.


Related issues: 1 (1 open, 0 closed)

Related to RADOS - Bug #56097: Timeout on `sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.1 flush_pg_stats` (Fix Under Review, Nitzan Mordechai)

Actions #1

Updated by Neha Ojha over 2 years ago

  • Status changed from New to Triaged

Looking at the osd and mon logs

Here's when the osd was restarted in revive_osd()

2021-09-08T15:47:42.550+0000 7f36982490c0  0 ceph version 17.0.0-7521-g5091269d (5091269d28a67ed0674391279081ff3cac6e91e1) quincy (dev), process ceph-osd, pid 90160

It wasn't up yet at this point:

2021-09-08T15:47:47.372+0000 7f3671360700 20 osd.0 301 dispatch_context not up in osdmap

The boot message was sent here:

2021-09-08T15:47:47.364+0000 7f368959e700  1 -- v1:172.21.15.64:6809/90160 --> v1:172.21.15.64:6789/0 -- osd_boot(osd.0 booted 0 features 4540138303579357183 v301) v7 -- 0x55c6c7a1cc00 con 0x55c6c2483c00

It was marked UP in the osdmap only here:

2021-09-08T15:47:48.314+0000 7f527f8b3700  2 mon.a@0(leader).osd e301  osd.0 UP v1:172.21.15.64:6809/90160

The failing command was run before that:

2021-09-08T15:47:47.945 DEBUG:teuthology.orchestra.run.smithi064:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell osd.0 injectargs --filestore_debug_random_read_err=0.0
2021-09-08T15:47:48.067 INFO:teuthology.orchestra.run.smithi064.stderr:Error ENXIO: problem getting command descriptions from osd.0

We can probably use something like "wait_for_all_osds_up()", which has a check for up osds in the osdmap, to avoid such issues.
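
A rough sketch of that kind of guard is below. This is only an illustration based on the CephManager plumbing visible in the traceback above (raw_cluster_cmd, inject_args); wait_until_osd_up is a hypothetical helper, and the real fix could simply call the existing wait_for_all_osds_up() before injecting args.

import json
import time

def wait_until_osd_up(manager, osd_id, timeout=120):
    """Poll the osdmap until osd.<osd_id> is marked up, or raise on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        dump = json.loads(manager.raw_cluster_cmd('osd', 'dump', '--format=json'))
        if any(o['osd'] == osd_id and o['up'] == 1 for o in dump['osds']):
            return
        time.sleep(1)
    raise RuntimeError('osd.%d did not come up within %ds' % (osd_id, timeout))

# In the thrasher, after reviving an OSD and before clearing the read error
# injection, something along these lines:
#   self.revive_osd(osd)
#   wait_until_osd_up(self.ceph_manager, osd)
#   self.ceph_manager.inject_args('osd', osd,
#                                 'filestore_debug_random_read_err', '0.0')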

Actions #2

Updated by Deepika Upadhyay over 2 years ago

2021-09-14T04:26:45.603 INFO:teuthology.orchestra.run.smithi041.stderr:Error ENXIO: problem getting command descriptions from osd.1

/ceph/teuthology-archive/yuriw-2021-09-13_19:12:32-rados-wip-yuri6-testing-2021-09-13-1025-pacific-distro-basic-smithi/6387610/teuthology.log
Actions #3

Updated by Neha Ojha over 2 years ago

Deepika Upadhyay wrote:

[...]
/ceph/teuthology-archive/yuriw-2021-09-13_19:12:32-rados-wip-yuri6-testing-2021-09-13-1025-pacific-distro-basic-smithi/6387610/teuthology.log

Deepika, I think what you saw is different; osd.1 here hit the issue tracked in https://tracker.ceph.com/issues/52124.

Actions #4

Updated by Laura Flores over 2 years ago

  • Related to Bug #53767: qa/workunits/cls/test_cls_2pc_queue.sh: killing an osd during thrashing causes timeout added
Actions #5

Updated by Aishwarya Mathuria about 2 years ago

/a/yuriw-2022-03-29_21:35:32-rados-wip-yuri5-testing-2022-03-29-1152-quincy-distro-default-smithi/6767850

Actions #6

Updated by Laura Flores almost 2 years ago

  • Related to deleted (Bug #53767: qa/workunits/cls/test_cls_2pc_queue.sh: killing an osd during thrashing causes timeout)
Actions #7

Updated by Nitzan Mordechai about 1 year ago

  • Related to Bug #56097: Timeout on `sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.1 flush_pg_stats` added
Actions #8

Updated by Sridhar Seshasayee 3 months ago

/a/yuriw-2024-01-26_01:08:12-rados-wip-yuri2-testing-2024-01-25-1327-reef-distro-default-smithi/7533809

Actions #9

Updated by Laura Flores 3 months ago

Sridhar Seshasayee wrote:

/a/yuriw-2024-01-26_01:08:12-rados-wip-yuri2-testing-2024-01-25-1327-reef-distro-default-smithi/7533809

I see this in the teuthology log:

2024-01-27T06:15:21.583 INFO:tasks.ceph.osd.7.smithi196.stderr:*** Caught signal (Aborted) **
2024-01-27T06:15:21.584 INFO:tasks.ceph.osd.7.smithi196.stderr: in thread 2ca9e640 thread_name:tp_osd_tp
2024-01-27T06:15:21.590 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 10 client.23175.objecter _submit_command 12 [{"prefix": "injectargs", "injected_args": ["--osd_enable_op_tracker=true"]}]
2024-01-27T06:15:21.590 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 15 client.23175.objecter _session_command_op_assign -1 12
2024-01-27T06:15:21.590 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 20 client.23175.objecter _get_session s=0x7f53041ed790 osd=5 3
2024-01-27T06:15:21.590 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 20 client.23175.objecter put_session s=0x7f53041ed790 osd=5 3
2024-01-27T06:15:21.591 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 20 client.23175.objecter _get_session s=0x7f53041ed790 osd=5 3
2024-01-27T06:15:21.591 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 15 client.23175.objecter _session_command_op_remove -1 12
2024-01-27T06:15:21.591 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 20 client.23175.objecter get_session s=0x7f53041ed790 osd=5 3
2024-01-27T06:15:21.591 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 15 client.23175.objecter _session_command_op_assign 5 12
2024-01-27T06:15:21.591 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 20 client.23175.objecter put_session s=0x7f53041ed790 osd=5 4
2024-01-27T06:15:21.591 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640 10 client.23175.objecter _send_command 12
2024-01-27T06:15:21.591 INFO:teuthology.orchestra.run.smithi161.stderr:2024-01-27T06:15:21.590+0000 7f530c705640  1 -- 172.21.15.161:0/1160651865 --> v2:172.21.15.196:6800/34680 -- command(tid 12: {"prefix": "injectargs", "injected_args": ["--osd_enable_op_tracker=true"]}) v1 -- 0x7f53041f0770 con 0x7f53041ede40
2024-01-27T06:15:21.638 INFO:tasks.ceph.osd.7.smithi196.stderr: ceph version 18.2.1-302-ga6afa309 (a6afa309b73acbe67470860ef3bb48c61ffcf770) reef (stable)
2024-01-27T06:15:21.638 INFO:tasks.ceph.osd.7.smithi196.stderr: 1: /lib64/libc.so.6(+0x54db0) [0x549fdb0]
2024-01-27T06:15:21.639 INFO:tasks.ceph.osd.7.smithi196.stderr: 2: /lib64/libc.so.6(+0xa154c) [0x54ec54c]
2024-01-27T06:15:21.639 INFO:tasks.ceph.osd.7.smithi196.stderr: 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >)+0x226) [0xb9ab06]
2024-01-27T06:15:21.639 INFO:tasks.ceph.osd.7.smithi196.stderr: 4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >)+0x70) [0xb9ac00]
2024-01-27T06:15:21.639 INFO:tasks.ceph.osd.7.smithi196.stderr: 5: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x248) [0xbab188]
2024-01-27T06:15:21.639 INFO:tasks.ceph.osd.7.smithi196.stderr: 6: ceph-osd(+0xaa3704) [0xbab704]
2024-01-27T06:15:21.639 INFO:tasks.ceph.osd.7.smithi196.stderr: 7: /lib64/libc.so.6(+0x9f802) [0x54ea802]
2024-01-27T06:15:21.639 INFO:tasks.ceph.osd.7.smithi196.stderr: 8: clone()

Actions #10

Updated by Radoslaw Zarzynski 3 months ago

When is ENXIO generated? Doesn't this happen when somebody opens a unix socket without a peer?
If so, the death of an OSD may be just a trigger for this issue.

https://stackoverflow.com/questions/73738257/calling-open-on-a-unix-domain-socket-failed-with-error-no-such-device-or-addr
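
For reference, a minimal sketch of the behavior that link describes, assuming Linux semantics where a plain open(2) on a UNIX-domain-socket inode fails with ENXIO:

import errno
import os
import socket
import tempfile

# Create a socket inode, then try to open(2) it like a regular file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.sock')   # throwaway path for the demo
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)                           # creates the socket inode
    try:
        os.open(path, os.O_RDONLY)           # open(), not connect()
    except OSError as e:
        assert e.errno == errno.ENXIO        # 'No such device or address'
        print('open() on a socket file fails with ENXIO:', e)
    finally:
        srv.close()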

Actions #11

Updated by Radoslaw Zarzynski 3 months ago

When is ENXIO generated? Doesn't this happen when somebody opens a unix socket without a peer?

Rather no. From man unix:

       ECONNREFUSED
              The remote address specified by connect(2) was not a listening socket.  This error can also occur if the target pathname is not a socket.

       ECONNRESET
              Remote socket was unexpectedly closed.
...
       ENOMEM Out of memory.

       ENOTCONN
              Socket operation needs a target address, but the socket is not connected.

MonClient can generate ENXIO:

int monc_error_category::from_code(int ev) const noexcept {
  if (ev == 0)
    return 0;

  switch (static_cast<monc_errc>(ev)) {
  case monc_errc::shutting_down:
    return -ECANCELED;
  case monc_errc::session_reset:
    return -EAGAIN;
  case monc_errc::rank_dne:
    [[fallthrough]];
  case monc_errc::mon_dne:
    return -ENOENT;
  case monc_errc::timed_out:
    return -ETIMEDOUT;
  case monc_errc::mon_unavailable:
    return -ENXIO;
  }
  return -EDOM;
}
Actions #12

Updated by Radoslaw Zarzynski 3 months ago

Yeah, there is only one place which generates the mon_unavailable error:

void MonClient::_send_command(MonCommand *r)
{
  if (r->is_tell()) {
    ++r->send_attempts;
    if (r->send_attempts > cct->_conf->mon_client_directed_command_retry) {
      _finish_command(r, monc_errc::mon_unavailable, "mon unavailable", {});
      return;
    }

So MonClient was unable to talk to a monitor, even though heartbeats between OSDs were still being observed.
A lab / networking issue?

Actions #13

Updated by Radoslaw Zarzynski 3 months ago

  • Tags set to test-failure
Actions #14

Updated by Radoslaw Zarzynski 2 months ago

  • Status changed from Triaged to Closed

Closing per c#12 -- likely a networking issue.
