Bug #62162

open

local_shared_foreign_ptr: Assertion `ptr && *ptr' failed

Added by Matan Breizman 9 months ago. Updated about 1 hour ago.

Status:
Need More Info
Priority:
High
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

https://pulpito.ceph.com/matan-2023-07-25_10:55:34-crimson-rados-wip-matanb-crimson-only-wip-62098-distro-crimson-smithi/

OSDs 0 and 3 crashed with the same backtrace:

DEBUG 2023-07-25 11:35:31,184 [shard 0] bluestore - bluestore(/var/lib/ceph/osd/ceph-0) statfs store_statfs(0x167e5ad000/0x0/0x1680000000, data 0x9547a/0x1e000, compress 0x749c/0xa000/0x8c000, omap 0x0, meta 0x1a30000)
DEBUG 2023-07-25 11:35:31,205 [shard 0] osd - maybe_share_osdmap updating peer 3 session's projected_epochfrom 13 to ping map epoch of 14
DEBUG 2023-07-25 11:35:31,205 [shard 0] osd - maybe_share_osdmap peer 3 projected_epoch 14 is already later than our osdmap epoch of 14
DEBUG 2023-07-25 11:35:31,205 [shard 0] osd - maybe_share_osdmap peer 3 projected_epoch 14 is already later than our osdmap epoch of 14
DEBUG 2023-07-25 11:35:31,229 [shard 0] bluestore - bluestore.MempoolThread(0x621000025288) _resize_shards cache_size: 2845415832 kv_alloc: 1275068416 kv_used: 1234 kv_onode_alloc: 128849018 kv_onode_used: -22 meta_alloc: 1207959552 meta_used: 40435 data_alloc: 234881024 data_used: 0
INFO  2023-07-25 11:35:31,235 [shard 0] ms - [0x6110000dc040 osd.0(client) v2:172.21.15.52:6803/2925641458@62481 >> mgr.4101 v2:172.21.15.153:6802/578688534] mark_down() at io_stat(io_state=open, in_seq=1, out_seq=1, out_pending_msgs_size=0, out_sent_msgs_size=0, need_ack=0, need_keepalive=0, need_keepalive_ack=0), send 1 notify_mark_down()
INFO  2023-07-25 11:35:31,236 [shard 0] ms - [0x6110000dc040 osd.0(client) v2:172.21.15.52:6803/2925641458@62481 >> mgr.4101 v2:172.21.15.153:6802/578688534] closing: reset no, replace no
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-5133-g64136e5f/rpm/el8/BUILD/ceph-18.0.0-5133-g64136e5f/src/crimson/common/local_shared_foreign_ptr.h:75: crimson::local_shared_foreign_ptr<PtrType>::element_type* crimson::local_shared_foreign_ptr<PtrType>::operator->() const [with PtrType = seastar::shared_ptr<crimson::net::Connection>; crimson::local_shared_foreign_ptr<PtrType>::element_type = crimson::net::Connection]: Assertion `ptr && *ptr' failed.
Aborting on shard 0.
Backtrace:
Reactor stalled for 91 ms on shard 0. Backtrace: 0x45d4d 0x483e13b9 0x48102e77 0x4811e28b 0x4811e732 0x4811e888 0x4811ecd3 0x12cef 0xf9912 0x3f8b4a06 0x3f8b693c 0x3f8bc011 0x3f8bcc82 0x3f8bd350 0x3f8b10db 0x3f8b15bf 0x3f8b1b64 0x12cef 0x4eace 0x21ea4 0x21d78 0x47425 0x4004e210 0x4004e864 0x4009e573 0x400a1403 0x400a2075 0x480e0c50 0x4813c6d3 0x48330bf9 0x48332c10 0x47c40eb1 0x47c45c2f 0x391b5009 0x3ad84 0x38e233cd
kernel callstack: 0xffffffffffffff80 0xffffffff8eedc905 0xffffffff8eee200d 0xffffffff8ecf512e 0xffffffff8ecf6bfd 0xffffffff8ecf708b 0xffffffff8ec0539b 0xffffffff8f8000a9
Reactor stalled for 278 ms on shard 0. Backtrace: 0x45d4d 0x483e13b9 0x48102e77 0x4811e28b 0x4811e732 0x4811e888 0x4811ecd3 0x12cef 0xf9912 0x3f8b4a06 0x3f8b849a 0x3f8bc011 0x3f8bcc82 0x3f8bd350 0x3f8b10db 0x3f8b15bf 0x3f8b1b64 0x12cef 0x4eace 0x21ea4 0x21d78 0x47425 0x4004e210 0x4004e864 0x4009e573 0x400a1403 0x400a2075 0x480e0c50 0x4813c6d3 0x48330bf9 0x48332c10 0x47c40eb1 0x47c45c2f 0x391b5009 0x3ad84 0x38e233cd
kernel callstack: 0xffffffffffffff80 0xffffffff8fa01440
Reactor stalled for 563 ms on shard 0. Backtrace: 0x45d4d 0x483e13b9 0x48102e77 0x4811e28b 0x4811e732 0x4811e888 0x4811ecd3 0x12cef 0xf9912 0x3f8b4a06 0x3f8b849a 0x3f8bc011 0x3f8bcc82 0x3f8bd350 0x3f8b10db 0x3f8b15bf 0x3f8b1b64 0x12cef 0x4eace 0x21ea4 0x21d78 0x47425 0x4004e210 0x4004e864 0x4009e573 0x400a1403 0x400a2075 0x480e0c50 0x4813c6d3 0x48330bf9 0x48332c10 0x47c40eb1 0x47c45c2f 0x391b5009 0x3ad84 0x38e233cd
kernel callstack: 0xffffffffffffff80 0xffffffff8eedc909 0xffffffff8eee200d 0xffffffff8ecf512e 0xffffffff8ecf6bfd 0xffffffff8ecf708b 0xffffffff8ec0539b 0xffffffff8f8000a9
 0# gsignal in /lib64/libc.so.6
 1# abort in /lib64/libc.so.6
 2# 0x00007F81FD6A0D79 in /lib64/libc.so.6
 3# 0x00007F81FD6C6426 in /lib64/libc.so.6
 4# 0x0000563B685B1211 in ceph-osd
 5# 0x0000563B685B1865 in ceph-osd
 6# auto seastar::internal::future_invoke<seastar::noncopyable_function<seastar::future<void> (std::unique_ptr<Message, crimson::common::UniquePtrDeleter>&&)>&, std::unique_ptr<Message, crimson::common::UniquePtrDeleter> >(seastar::noncopyable_function<seastar::future<void> (std::unique_ptr<Message, crimson::common::UniquePtrDeleter>&&)>&, std::unique_ptr<Message, crimson::common::UniquePtrDeleter>&&) in ceph-osd
 7# void seastar::futurize<seastar::future<void> >::satisfy_with_result_of<seastar::future<std::unique_ptr<Message, crimson::common::UniquePtrDeleter> >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<void> (std::unique_ptr<Message, crimson::common::UniquePtrDeleter>&&)>, seastar::future<void> >(seastar::noncopyable_function<seastar::future<void> (std::unique_ptr<Message, crimson::common::UniquePtrDeleter>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (std::unique_ptr<Message, crimson::common::UniquePtrDeleter>&&)>&, seastar::future_state<std::unique_ptr<Message, crimson::common::UniquePtrDeleter> >&&)#1}::operator()(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (std::unique_ptr<Message, crimson::common::UniquePtrDeleter>&&)>&, seastar::future_state<std::unique_ptr<Message, crimson::common::UniquePtrDeleter> >&&) const::{lambda()#1}>(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (std::unique_ptr<Message, crimson::common::UniquePtrDeleter>&&)>&&) in ceph-osd
 8# seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> (std::unique_ptr<Message, crimson::common::UniquePtrDeleter>&&)>, seastar::future<std::unique_ptr<Message, crimson::common::UniquePtrDeleter> >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<void> (std::unique_ptr<Message, crimson::common::UniquePtrDeleter>&&)>, seastar::future<void> >(seastar::noncopyable_function<seastar::future<void> (std::unique_ptr<Message, crimson::common::UniquePtrDeleter>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (std::unique_ptr<Message, crimson::common::UniquePtrDeleter>&&)>&, seastar::future_state<std::unique_ptr<Message, crimson::common::UniquePtrDeleter> >&&)#1}, std::unique_ptr<Message, crimson::common::UniquePtrDeleter> >::run_and_dispose() in ceph-osd
 9# 0x0000563B70643C51 in ceph-osd
Actions #1

Updated by Matan Breizman 9 months ago

  • Description updated (diff)
Actions #4

Updated by Matan Breizman about 1 month ago

I'm afraid this was introduced in: https://github.com/ceph/ceph/pull/50835
We should verify whether local_shared_foreign_ptr is used appropriately.

Actions #5

Updated by Matan Breizman about 1 month ago

  • Status changed from New to Need More Info
  • Assignee set to Matan Breizman

It looks like we added support for null shared_foreign pointers; this may cause the later assertion when they are dereferenced:

/// Wraps ptr in a local_shared_foreign_ptr<>.
template <typename T>
local_shared_foreign_ptr<T> make_local_shared_foreign(T &&ptr) {
  return make_local_shared_foreign<T>(
    ptr ? seastar::make_foreign(std::forward<T>(ptr)) : nullptr);  // <-- a null ptr is accepted here
}

ceph-osd: ./src/crimson/common/local_shared_foreign_ptr.h:75:

crimson::local_shared_foreign_ptr<PtrType>::element_type* 
crimson::local_shared_foreign_ptr<PtrType>::operator->() const 
[with PtrType = seastar::shared_ptr<crimson::net::Connection>;
 crimson::local_shared_foreign_ptr<PtrType>::element_type = crimson::net::Connection]

: Assertion `ptr && *ptr' failed.

Note: The issue occurs only at startup.
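
One way to picture how a wrapper can look "set" and still fail `ptr && *ptr` is sketched below. This is a simplified model built on std::shared_ptr, not the Ceph class: `two_level_ptr`, `outer_set()` and `dereferenceable()` are hypothetical names, and whether the real code can actually reach this state is exactly what still needs to be verified.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for crimson::net::Connection.
struct Connection {
  std::string peer;
};

// Simplified model, NOT the Ceph implementation: an outer shared pointer
// owning an inner pointer, mirroring the two levels checked by `ptr && *ptr`.
template <typename T>
class two_level_ptr {
  std::shared_ptr<std::shared_ptr<T>> ptr;

public:
  two_level_ptr() = default;

  // Wraps `inner` even when it is null, so the outer level is always set.
  explicit two_level_ptr(std::shared_ptr<T> inner)
    : ptr(std::make_shared<std::shared_ptr<T>>(std::move(inner))) {}

  // Assumed shallow check: looks at the outer level only.
  bool outer_set() const noexcept { return static_cast<bool>(ptr); }

  // Same condition as the failing assertion: both levels must be non-null.
  bool dereferenceable() const noexcept { return ptr && *ptr; }

  T* operator->() const noexcept {
    assert(ptr && *ptr);
    return ptr->get();
  }
};
```

In this model, a wrapper constructed from a null inner pointer reports `outer_set() == true` while `dereferenceable() == false`, so a caller that only performs the shallow check would go on to hit the assertion in `operator->()`.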

Actions #8

Updated by Matan Breizman 2 days ago

  • Assignee changed from Matan Breizman to Nitzan Mordechai
Actions #10

Updated by Matan Breizman about 1 hour ago

Looking at:

OSDs 2 and 3:
https://pulpito.ceph.com/matan-2024-05-02_11:41:00-crimson-rados-wip-crimson-only-coherent-log-and-at_version-distro-crimson-smithi/7685283

osd2:

INFO  2024-05-02 12:48:43,705 [shard 0:main] ms - [0x6110000c3540 osd.2(client) v2:172.21.15.183:6803/3044837935@59592 >> mgr.4100 v2:172.21.15.183:6800/114576641] mark_down() at io_stat(io_state=open, in_seq=1, out_seq=2, out_pending_msgs_size=0, out_sent_msgs_size=0, need_ack=0, need_keepalive=0, need_keepalive_ack=0), send 1 notify_mark_down()
INFO  2024-05-02 12:48:43,705 [shard 0:main] ms - [0x6110000c3540 osd.2(client) v2:172.21.15.183:6803/3044837935@59592 >> mgr.4100 v2:172.21.15.183:6800/114576641] closing: reset no, replace no
INFO  2024-05-02 12:48:43,706 [shard 0:main] ms - [0x6110000c3540 osd.2(client) v2:172.21.15.183:6803/3044837935@59592 >> mgr.4100 v2:172.21.15.183:6800/114576641] do_in_dispatch(): fault at drop, io_stat(io_state=drop, in_seq=1, out_seq=2, out_pending_msgs_size=0, out_sent_msgs_size=0, need_ack=0, need_keepalive=0, need_keepalive_ack=0) -- read eof
INFO  2024-05-02 12:48:43,707 [shard 0:main] alienstore - pool_statfs
DEBUG 2024-05-02 12:48:43,707 [shard 0:main] bluestore - bluestore(/var/lib/ceph/osd/ceph-2) pool_statfs pool 2
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-3442-ga8289204/rpm/el9/BUILD/ceph-19.0.0-3442-ga8289204/src/crimson/common/local_shared_foreign_ptr.h:75: crimson::local_shared_foreign_ptr<PtrType>::element_type* crimson::local_shared_foreign_ptr<PtrType>::operator->() const [with PtrType = seastar::shared_ptr<crimson::net::Connection>; element_type = crimson::net::Connection]: Assertion `ptr && *ptr' failed.

osd3:

INFO  2024-05-02 12:48:43,706 [shard 0:main] ms - [0x6110000bd780 osd.3(client) v2:172.21.15.183:6804/1792673789@55251 >> mgr.4100 v2:172.21.15.183:6800/114576641] mark_down() at io_stat(io_state=open, in_seq=1, out_seq=2, out_pending_msgs_size=0, out_sent_msgs_size=0, need_ack=0, need_keepalive=0, need_keepalive_ack=0), send 1 notify_mark_down()
INFO  2024-05-02 12:48:43,706 [shard 0:main] ms - [0x6110000bd780 osd.3(client) v2:172.21.15.183:6804/1792673789@55251 >> mgr.4100 v2:172.21.15.183:6800/114576641] closing: reset no, replace no
DEBUG 2024-05-02 12:48:43,706 [shard 0:main] bluestore - bluestore(/var/lib/ceph/osd/ceph-3) pool_statfs pool 1
INFO  2024-05-02 12:48:43,706 [shard 0:main] ms - [0x6110000bd780 osd.3(client) v2:172.21.15.183:6804/1792673789@55251 >> mgr.4100 v2:172.21.15.183:6800/114576641] do_in_dispatch(): fault at drop, io_stat(io_state=drop, in_seq=1, out_seq=2, out_pending_msgs_size=0, out_sent_msgs_size=0, need_ack=0, need_keepalive=0, need_keepalive_ack=0) -- read eof
DEBUG 2024-05-02 12:48:43,707 [shard 0:main] bluestore - bluestore(/var/lib/ceph/osd/ceph-3) pool_statfsstore_statfs(0x0/0x0/0x0, data 0x90220/0xf000, compress 0x754e/0xa000/0x8c000, omap 0x0, meta 0x0)
INFO  2024-05-02 12:48:43,707 [shard 0:main] alienstore - pool_statfs
DEBUG 2024-05-02 12:48:43,707 [shard 0:main] bluestore - bluestore(/var/lib/ceph/osd/ceph-3) pool_statfs pool 2
DEBUG 2024-05-02 12:48:43,707 [shard 0:main] bluestore - bluestore(/var/lib/ceph/osd/ceph-3) pool_statfsstore_statfs(0x0/0x0/0x0, data 0x0/0x0, compress 0x0/0x0/0x0, omap 0x0, meta 0x0)
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-3442-ga8289204/rpm/el9/BUILD/ceph-19.0.0-3442-ga8289204/src/crimson/common/local_shared_foreign_ptr.h:75: crimson::local_shared_foreign_ptr<PtrType>::element_type* crimson::local_shared_foreign_ptr<PtrType>::operator->() const [with PtrType = seastar::shared_ptr<crimson::net::Connection>; element_type = crimson::net::Connection]: Assertion `ptr && *ptr' failed.

Notice the similar timestamps of 12:48:43.

On the non-crashing OSDs 0 and 1, the report from the OSD to the mgr is skipped:
osd.0:

INFO  2024-05-02 12:48:43,704 [shard 0:main] ms - [0x6110000c3540 osd.0(client) v2:172.21.15.151:6804/626009721@51081 >> mgr.4100 v2:172.21.15.183:6800/114576641] mark_down() at io_stat(io_state=open, in_seq=1, out_seq=1, out_pending_msgs_size=0, out_sent_msgs_size=0, need_ack=0, need_keepalive=0, need_keepalive_ack=0), send 1 notify_mark_down()
INFO  2024-05-02 12:48:43,704 [shard 0:main] ms - [0x6110000c3540 osd.0(client) v2:172.21.15.151:6804/626009721@51081 >> mgr.4100 v2:172.21.15.183:6800/114576641] closing: reset no, replace no
WARN  2024-05-02 12:48:43,705 [shard 0:main] mgrc - cannot send report; no conn available
WARN  2024-05-02 12:48:43,705 [shard 0:main] mgrc - report: no conn available; report skipped

osd.1:

INFO  2024-05-02 12:48:43,704 [shard 0:main] ms - [0x6110000bd640 osd.1(client) v2:172.21.15.151:6801/1072082880@62730 >> mgr.4100 v2:172.21.15.183:6800/114576641] mark_down() at io_stat(io_state=open, in_seq=1, out_seq=1, out_pending_msgs_size=0, out_sent_msgs_size=0, need_ack=0, need_keepalive=0, need_keepalive_ack=0), send 1 notify_mark_down()
INFO  2024-05-02 12:48:43,704 [shard 0:main] ms - [0x6110000bd640 osd.1(client) v2:172.21.15.151:6801/1072082880@62730 >> mgr.4100 v2:172.21.15.183:6800/114576641] closing: reset no, replace no
WARN  2024-05-02 12:48:43,705 [shard 0:main] mgrc - cannot send report; no conn available
WARN  2024-05-02 12:48:43,705 [shard 0:main] mgrc - report: no conn available; report skipped

All OSDs handle the MMgrConfigure message at around 12:48:38,699:

INFO  2024-05-02 12:48:38,699 [shard 0:main] mgrc - handle_mgr_conf mgrconfigure(period=5, threshold=5) v4

handle_mgr_conf arms the report_timer, which calls Client::report() every 5 seconds.
This means the first report is sent at 12:48:38,699 + 5 s = 12:48:43,699, which matches the time at which the OSDs crash.
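
The timing can be checked with a few lines of std::chrono. This is only a worked version of the arithmetic; `time_of_day()` and `first_report()` are hypothetical helpers introduced for illustration, and the timestamps are the ones from the logs quoted in this comment.

```cpp
#include <cassert>
#include <chrono>

// Converts a log timestamp HH:MM:SS,mmm to milliseconds since midnight.
constexpr std::chrono::milliseconds time_of_day(int h, int m, int s, int ms) {
  return std::chrono::hours(h) + std::chrono::minutes(m) +
         std::chrono::seconds(s) + std::chrono::milliseconds(ms);
}

// Time of the first periodic report, assuming the timer first fires one
// full period after it is armed.
constexpr std::chrono::milliseconds first_report(
    std::chrono::milliseconds armed_at, std::chrono::seconds period) {
  return armed_at + period;
}
```

With `armed_at = time_of_day(12, 48, 38, 699)` and a period of 5 seconds, `first_report()` yields 12:48:43,699. The last line logged by osd.2 before the assertion is stamped 12:48:43,707, which is 8 ms later.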

I suspect the connection is not yet set up properly and, for some reason, the "if (!conn)" check does not catch it.
I have increased mgr_stats_period to 30 to give the connection more time to be set up:

https://pulpito.ceph.com/matan-2024-05-02_13:55:54-crimson-rados-wip-crimson-only-coherent-log-and-at_version-distro-crimson-smithi/

We should also add more logs around Client::_send_report() to verify the suspicion above.
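
As a rough idea of what such logging could look like, here is a self-contained sketch. It is not the crimson mgr client code: `try_send_report()`, `send_result` and the `connected` flag are hypothetical, and a plain std::ostream stands in for the real logger. The point is only that each decision records the connection state it observed.

```cpp
#include <cassert>
#include <memory>
#include <ostream>
#include <sstream>

// Hypothetical stand-in for a connection; `connected` is an assumed flag.
struct Connection {
  bool connected = false;
};

enum class send_result { sent, skipped_no_conn, skipped_not_connected };

// Logs the observed connection state before deciding whether to send.
send_result try_send_report(const std::shared_ptr<Connection>& conn,
                            std::ostream& log) {
  log << "send_report: conn=" << (conn ? "set" : "null");
  if (!conn) {
    log << " -> skipped\n";
    return send_result::skipped_no_conn;
  }
  log << " connected=" << (conn->connected ? "yes" : "no");
  if (!conn->connected) {
    // The pointer is set but the connection is not usable yet.
    log << " -> skipped\n";
    return send_result::skipped_not_connected;
  }
  log << " -> sent\n";
  return send_result::sent;
}
```

If the crash comes from a pointer that is set but not yet usable, log lines of this kind would show "conn=set connected=no" immediately before the failing report.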

Actions #11

Updated by Matan Breizman about 1 hour ago

  • Priority changed from Normal to High

This issue has affected most recent test runs; bumping the priority.
