Bug #64824
mon: ceph-16.2.14/src/mon/Monitor.cc: 5661: FAILED ceph_assert(err == 0)
Description
-1> 2024-03-11T02:29:03.716+0000 7f6600eaf700 -1 /root/rpmbuild/BUILD/ceph-16.2.14/src/mon/Monitor.cc: In function 'bool Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> >*, int*)' thread 7f6600eaf700 time 2024-03-11T02:29:03.716280+0000
/root/rpmbuild/BUILD/ceph-16.2.14/src/mon/Monitor.cc: 5661: FAILED ceph_assert(err == 0)
ceph version 16.2.14 (8ee3a81f5d70c1cd5b0b8c5cfade8580a0025906) pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f660ea0dfcc]
2: /usr/lib64/ceph/libceph-common.so.2(+0x27a1e6) [0x7f660ea0e1e6]
3: (Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, int*)+0x14bf) [0x562d64b265bf]
4: (Monitor::handle_scrub(boost::intrusive_ptr<MonOpRequest>)+0x288) [0x562d64b2a7d8]
5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xdc0) [0x562d64b4b090]
6: (Monitor::_ms_dispatch(Message*)+0x670) [0x562d64b4c080]
7: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x562d64b7b2bc]
8: (DispatchQueue::entry()+0x126a) [0x7f660ec55b7a]
9: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f660ed0a1c1]
10: /lib64/libpthread.so.0(+0x81ca) [0x7f660c73a1ca]
11: clone()
Updated by Radoslaw Zarzynski about 2 months ago
- Status changed from New to Need More Info
Looks like a mon-scrub failure. This can be caused by a HW issue or by a corruption.
Is there a sign of malfunctioning hardware?
Is a coredump available by any chance?
Updated by yite gu about 2 months ago
Radoslaw Zarzynski wrote:
Looks like a mon-scrub failure. This can be caused by a HW issue or by a corruption.
Is there a sign of malfunctioning hardware?
Is a coredump available by any chance?
There is no sign of malfunctioning hardware. The crash occurred on four consecutive days, at the same time each day.
/opt/rook/rook-ceph/crash/posted# ll
total 16
drwx------ 2 167 167 4096 Mar 8 10:29 2024-03-08T02:29:00.281609Z_caf1e9d9-2079-4a3d-aa31-1cc374f9aa7b
drwx------ 2 167 167 4096 Mar 9 10:29 2024-03-09T02:29:00.960189Z_52fdca6e-28fc-4f55-bbf1-da1d1bf3c1a4
drwx------ 2 167 167 4096 Mar 10 10:29 2024-03-10T02:29:02.119014Z_730c5ede-6717-4ed5-91a7-25dbf5e76786
drwx------ 2 167 167 4096 Mar 11 16:25 2024-03-11T02:29:03.718856Z_07e5c327-21ba-43f7-81f8-e051eea27e49
After redeploying the problematic monitor, the crash did not happen today.
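The "same time each day" observation can be checked directly from the crash directory names. The following is a minimal sketch, assuming the names follow the `<UTC timestamp>_<uuid>` pattern seen in the listing; the helper names `crash_time` and `seconds_into_day` are hypothetical and only for illustration.

```python
from datetime import datetime, timezone

# Crash directory names copied from the listing in this comment.
CRASH_DIRS = [
    "2024-03-08T02:29:00.281609Z_caf1e9d9-2079-4a3d-aa31-1cc374f9aa7b",
    "2024-03-09T02:29:00.960189Z_52fdca6e-28fc-4f55-bbf1-da1d1bf3c1a4",
    "2024-03-10T02:29:02.119014Z_730c5ede-6717-4ed5-91a7-25dbf5e76786",
    "2024-03-11T02:29:03.718856Z_07e5c327-21ba-43f7-81f8-e051eea27e49",
]


def crash_time(name: str) -> datetime:
    """Parse the UTC timestamp that precedes the first underscore (assumed layout)."""
    stamp = name.split("_", 1)[0]
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def seconds_into_day(ts: datetime) -> float:
    """Return the time of day as seconds since midnight UTC."""
    return ts.hour * 3600 + ts.minute * 60 + ts.second + ts.microsecond / 1e6


times = [crash_time(name) for name in CRASH_DIRS]
offsets = [seconds_into_day(t) for t in times]

# Distinct calendar days on which a crash was recorded.
days = sorted({t.date() for t in times})
# Difference between the earliest and latest time of day across all crashes.
spread = max(offsets) - min(offsets)

print(f"{len(days)} days, time-of-day spread {spread:.3f} s")
```

A spread of only a few seconds across separate days would point to something that runs on a fixed daily schedule rather than to a random fault.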
Updated by Radoslaw Zarzynski about 2 months ago
Would need logs with debug_mon=20 and debug_rocksdb=20 from the period before the assertion.
Updated by yite gu about 2 months ago
Radoslaw Zarzynski wrote:
Would need logs with debug_mon=20 and debug_rocksdb=20 from the period before the assertion.
OK, I will wait for the next crash.
Updated by yite gu about 1 month ago
-31> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 mon.d@0(leader) e10 _scrub last_key (kv,35617) scrubbed_keys 100 has_next 1
-30> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 15 mon.d@0(leader) e10 scrub_reset_timeout reset timeout event
-29> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 mon.d@0(leader) e10 _ms_dispatch existing session 0x55888a354240 for mon.0
-28> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 mon.d@0(leader) e10 entity_name global_id 0 (none) caps allow *
-27> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 5 mon.d@0(leader).paxos(paxos active c 40569131..40569761) is_readable = 1 - now=2024-03-28T06:42:38.171076+0000 lease_expire=2024-03-28T06:42:40.652792+0000 has v0 lc 40569761
-26> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader).log v19867627 preprocess_query log(1 entries from seq 21188 at 2024-03-28T06:42:38.169947+0000) v1 from mon.0 v2:10.6.153.24:3300/0
-25> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader).log v19867627 preprocess_log log(1 entries from seq 21188 at 2024-03-28T06:42:38.169947+0000) v1 from mon.0
-24> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 is_capable service=log command= write addr - on cap allow *
-23> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 allow so far , doing grant allow *
-22> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 allow all
-21> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader).log v19867627 prepare_update log(1 entries from seq 21188 at 2024-03-28T06:42:38.169947+0000) v1 from mon.0 v2:10.6.153.24:3300/0
-20> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader).log v19867627 prepare_log log(1 entries from seq 21188 at 2024-03-28T06:42:38.169947+0000) v1 from mon.0
-19> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader).log v19867627 logging 2024-03-28T06:42:38.169947+0000 mon.d (mon.0) 21188 : cluster [DBG] scrub ok on 0,1,2: ScrubResult(keys {auth=100} crc {auth=275392284})
-18> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 mon.d@0(leader) e10 _ms_dispatch existing session 0x55888a354480 for mon.2
-17> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 mon.d@0(leader) e10 entity_name global_id 0 (none) caps allow *
-16> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 is_capable service=mon command= read addr v2:10.6.153.26:3300/0 on cap allow *
-15> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 allow so far , doing grant allow *
-14> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 allow all
-13> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader) e10 handle_scrub mon_scrub(result v 40569761 ScrubResult(keys {auth=26,config=2,health=12,kv=60} crc {auth=4123610191,config=876675216,health=2092533985,kv=2385793469}) num_keys 100 key (kv,35617)) v2
-12> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 15 mon.d@0(leader) e10 scrub_reset_timeout reset timeout event
-11> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 mon.d@0(leader) e10 _ms_dispatch existing session 0x55888a11f8c0 for mon.1
-10> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 mon.d@0(leader) e10 entity_name global_id 0 (none) caps allow *
-9> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 is_capable service=mon command= read addr v2:10.6.153.23:3300/0 on cap allow *
-8> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 allow so far , doing grant allow *
-7> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 allow all
-6> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader) e10 handle_scrub mon_scrub(result v 40569761 ScrubResult(keys {auth=26,config=2,health=12,kv=60} crc {auth=4123610191,config=876675216,health=2092533985,kv=2385793469}) num_keys 100 key (kv,35617)) v2
-5> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 15 mon.d@0(leader) e10 scrub_reset_timeout reset timeout event
-4> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader) e10 scrub_check_results
-3> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2: ScrubResult(keys {auth=26,config=2,health=12,kv=60} crc {auth=4123610191,config=876675216,health=2092533985,kv=2385793469})
-2> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader) e10 _scrub start (kv,35617) num_keys 100
-1> 2024-03-28T06:42:38.172+0000 7f2c51bc8700 -1 /root/rpmbuild/BUILD/ceph-16.2.14-2/src/mon/Monitor.cc: In function 'bool Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> >*, int*)' thread 7f2c51bc8700 time 2024-03-28T06:42:38.172021+0000
/root/rpmbuild/BUILD/ceph-16.2.14-2/src/mon/Monitor.cc: 5661: FAILED ceph_assert(err == 0)
ceph version 16.2.14-2 (8ee3a81f5d70c1cd5b0b8c5cfade8580a0025906) pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f2c5f726fcc]
2: /usr/lib64/ceph/libceph-common.so.2(+0x27a1e6) [0x7f2c5f7271e6]
3: (Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, int*)+0x14bf) [0x5588862375bf]
4: (Monitor::scrub()+0x3b8) [0x55888623b198]
5: (Monitor::handle_scrub(boost::intrusive_ptr<MonOpRequest>)+0x470) [0x55888623b9c0]
6: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xdc0) [0x55888625c090]
7: (Monitor::_ms_dispatch(Message*)+0x670) [0x55888625d080]
8: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55888628c2bc]
9: (DispatchQueue::entry()+0x126a) [0x7f2c5f96eb7a]
10: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f2c5fa231c1]
11: /lib64/libpthread.so.0(+0x81ca) [0x7f2c5d4531ca]
12: clone()