Bug #64214 (closed)

Health check failed: 1 osds down (OSD_DOWN) in cluster logs.

Added by Sridhar Seshasayee 3 months ago. Updated 3 months ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2024-01-26_01:08:12-rados-wip-yuri2-testing-2024-01-25-1327-reef-distro-default-smithi/7533539
/a/yuriw-2024-01-26_01:08:12-rados-wip-yuri2-testing-2024-01-25-1327-reef-distro-default-smithi/7533680

Logs from JobID: 7533539:

During the execution of the tests in test_cls_rbd.sh, osd.0 goes silent for around 90 seconds, during which the OSD_DOWN warning is generated.

2024-01-27T02:49:18.597+0000 24728640 20 osd.0 op_wq(2) _process empty q, waiting
2024-01-27T02:49:18.597+0000 28730640 20 osd.0 op_wq(2) _process empty q, waiting
2024-01-27T02:49:18.606+0000 23f27640 10 osd.0 241 maybe_share_map con 0x116e2b50 v1:172.21.15.99:6802/34719 map epoch 237 -> 241 (shared)

<silence for around 90 secs>

2024-01-27T02:50:48.138+0000 23f27640  1 -- v1:172.21.15.46:6813/34693 --> v1:172.21.15.99:6802/34719 -- pg_query2(56.1 56.1 query(info 0'0 epoch_sent 241) e241/241) v1 -- 0x13fe6330 con 0x116e2b50 
2024-01-27T02:50:48.139+0000 23f27640 10 osd.0 241 handle_pg_create_info new pg pg[56.1( empty local-lis/les=0/0 n=0 ec=241/241 lis/c=0/0 les/c/f=0/0/0 sis=241) [0,4] r=0 lpr=241 crt=0'0 mlcod 0'0 creating+peering mbc={}]
2024-01-27T02:50:48.139+0000 23f27640 10 osd.0:1._attach_pg 56.1 0x60aa5a0
2024-01-27T02:50:48.139+0000 23f27640 20 osd.0:1._wake_pg_slot _wake_pg_slot 56.1 to_process <> waiting <> waiting_peering {}
2024-01-27T02:50:48.139+0000 23f27640 20 osd.0 241 identify_splits_and_merges 56.1 e241 to e241
2024-01-27T02:50:48.139+0000 23f27640 20 osd.0 241 identify_splits_and_merges 56.1 e241 to e241 pg_nums {241=32}
2024-01-27T02:50:48.139+0000 23f27640 10 osd.0:1._prime_splits 
2024-01-27T02:50:48.139+0000 23f27640 10 osd.0 pg_epoch: 241 pg[56.1( empty local-lis/les=0/0 n=0 ec=241/241 lis/c=0/0 les/c/f=0/0/0 sis=241) [0,4] r=0 lpr=241 crt=0'0 mlcod 0'0 creating+peering mbc={}] do_peering_event: epoch_sent: 241 epoch_requested: 241 NullEvt +create_info
2024-01-27T02:50:48.139+0000 23f27640 10 log is not dirty
2024-01-27T02:50:48.139+0000 23f27640 10 osd.0 241 queue_want_up_thru want 241 <= queued 241, currently 237
2024-01-27T02:50:48.139+0000 23f27640  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x23f27640' had timed out after 15.000000954s
...
2024-01-27T02:50:48.260+0000 e257640 10 osd.0 241 maybe_update_heartbeat_peers forcing update after 102.696689 seconds

The last log line above shows that the heartbeat peer update was forced about 102.7 seconds after the previous one, long past the grace period, indicating that the OSD was unresponsive for longer than usual. This matches the roughly 90-second gap in the OSD log between 02:49:18 and 02:50:48.
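
For reference, a minimal way to locate this gap in the archived job is to grep the osd.0 log for the forced heartbeat update; the remote/*/log/ceph-osd.0.log.gz path is an assumption based on the usual teuthology archive layout:

cd /a/yuriw-2024-01-26_01:08:12-rados-wip-yuri2-testing-2024-01-25-1327-reef-distro-default-smithi/7533539
# Assumed archive layout; the glob avoids guessing which smithi host ran osd.0.
zgrep -n "maybe_update_heartbeat_peers forcing update" remote/*/log/ceph-osd.0.log.gz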

osd.0 was marked DOWN during the period of silence shown below:

2024-01-27T02:50:43.783410+0000 mon.a (mon.0) 906 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)" in cluster log

However, none of the tests in the rados_cls_all suite failed. The symptoms indicate that connection re-establishment took longer than usual after a probable failure injection. The test sets "osd heartbeat grace" to 80 seconds, but the OSD took longer than that to re-establish connections with its peers (one way to cross-check the configured grace is sketched below).
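
As a rough cross-check of the configured grace, the job's merged config can be searched for the override; config.yaml sitting in the job directory and the exact option spelling are assumptions here, not something confirmed from this run:

cd /a/yuriw-2024-01-26_01:08:12-rados-wip-yuri2-testing-2024-01-25-1327-reef-distro-default-smithi/7533539
# Assumed: the teuthology job directory keeps the merged config.yaml with the conf overrides.
grep -i -A2 "heartbeat" config.yaml
# On a live cluster the effective value could also be read with:
#   ceph config get osd osd_heartbeat_grace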


Related issues 1 (1 open, 0 closed)

Is duplicate of RADOS - Bug #61774: centos 9 testing reveals rocksdb "Leak_StillReachable" memory leak in mons (Fix Under Review, assigned to Laura Flores)

Actions #1

Updated by Radoslaw Zarzynski 3 months ago

Note from the bug scrub: this has been run under Valgrind.
I don't think this is an OSD issue; if the execution stops, it's expected to see heartbeat timeouts.
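
A quick way to confirm which daemons ran under Valgrind in this job is to list the per-daemon Valgrind logs; the remote/*/log/valgrind/ layout below is inferred from the mon.a.log.gz path quoted in the next comment:

# Inferred layout: one Valgrind XML log per instrumented daemon under remote/<host>/log/valgrind/.
ls /a/yuriw-2024-01-26_01:08:12-rados-wip-yuri2-testing-2024-01-25-1327-reef-distro-default-smithi/7533539/remote/*/log/valgrind/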

Actions #2

Updated by Laura Flores 3 months ago

Checked the logs:

/a/yuriw-2024-01-26_01:08:12-rados-wip-yuri2-testing-2024-01-25-1327-reef-distro-default-smithi/7533539$ cat teuthology.log | grep "Exit " 
2024-01-27T03:47:56.670 INFO:tasks.ceph.mon.a.smithi046.stderr:==00:01:48:19.793 34377== Exit program on first error (--exit-on-first-error=yes)
2024-01-27T03:48:00.555 INFO:tasks.ceph.mon.c.smithi046.stderr:==00:01:48:23.677 34381== Exit program on first error (--exit-on-first-error=yes)
2024-01-27T03:48:06.651 INFO:tasks.ceph.mon.b.smithi099.stderr:==00:01:48:29.728 34118== Exit program on first error (--exit-on-first-error=yes)

/a/yuriw-2024-01-26_01:08:12-rados-wip-yuri2-testing-2024-01-25-1327-reef-distro-default-smithi/7533539/remote/smithi046/log/valgrind/mon.a.log.gz

...
      <fn>void std::vector&lt;std::unique_ptr&lt;rocksdb::ObjectLibrary::Entry, std::default_delete&lt;rocksdb::ObjectLibrary::Entry&gt; &gt;, std::allocator&lt;std::unique_ptr&lt;rocksdb::ObjectLibrary::Entry, std::default_delete&lt;rocksdb::ObjectLibrary::Entry&gt; &gt; &gt; &gt;::_M_realloc_insert&lt;std::unique_ptr&lt;rocksdb::ObjectLibrary::Entry, std::default_delete&lt;rocksdb::ObjectLibrary::Entry&gt; &gt; &gt;(__gnu_cxx::__normal_iterator&lt;std::unique_ptr&lt;rocksdb::ObjectLibrary::Entry, std::default_delete&lt;rocksdb::ObjectLibrary::Entry&gt; &gt;*, std::vector&lt;std::unique_ptr&lt;rocksdb::ObjectLibrary::Entry, std::default_delete&lt;rocksdb::ObjectLibrary::Entry&gt; &gt;, std::allocator&lt;std::unique_ptr&lt;rocksdb::ObjectLibrary::Entry, std::default_delete&lt;rocksdb::ObjectLibrary::Entry&gt; &gt; &gt; &gt; &gt;, std::unique_ptr&lt;rocksdb::ObjectLibrary::Entry, std::default_delete&lt;rocksdb::ObjectLibrary::Entry&gt; &gt;&amp;&amp;)</fn>
...

Looks like a case of https://tracker.ceph.com/issues/61774.
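
If this is indeed the same leak as #61774, the memcheck XML should record the leak class in a <kind> element; a hedged way to pull the surrounding record from the quoted mon log (assuming the leak record is present in this particular file) is:

# Show leak records mentioning Leak_StillReachable, with a little surrounding context.
zgrep -B3 -A3 "Leak_StillReachable" \
    /a/yuriw-2024-01-26_01:08:12-rados-wip-yuri2-testing-2024-01-25-1327-reef-distro-default-smithi/7533539/remote/smithi046/log/valgrind/mon.a.log.gz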

Actions #3

Updated by Laura Flores 3 months ago

  • Status changed from New to Duplicate
Actions #4

Updated by Laura Flores 3 months ago

  • Is duplicate of Bug #61774: centos 9 testing reveals rocksdb "Leak_StillReachable" memory leak in mons added