Bug #64824

open

mon: ceph-16.2.14/src/mon/Monitor.cc: 5661: FAILED ceph_assert(err == 0)

Added by yite gu about 2 months ago. Updated about 1 month ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

-1> 2024-03-11T02:29:03.716+0000 7f6600eaf700 -1 /root/rpmbuild/BUILD/ceph-16.2.14/src/mon/Monitor.cc: In function 'bool Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> >*, int*)' thread 7f6600eaf700 time 2024-03-11T02:29:03.716280+0000
/root/rpmbuild/BUILD/ceph-16.2.14/src/mon/Monitor.cc: 5661: FAILED ceph_assert(err == 0)

ceph version 16.2.14 (8ee3a81f5d70c1cd5b0b8c5cfade8580a0025906) pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f660ea0dfcc]
2: /usr/lib64/ceph/libceph-common.so.2(+0x27a1e6) [0x7f660ea0e1e6]
3: (Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, int*)+0x14bf) [0x562d64b265bf]
4: (Monitor::handle_scrub(boost::intrusive_ptr<MonOpRequest>)+0x288) [0x562d64b2a7d8]
5: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xdc0) [0x562d64b4b090]
6: (Monitor::_ms_dispatch(Message*)+0x670) [0x562d64b4c080]
7: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x562d64b7b2bc]
8: (DispatchQueue::entry()+0x126a) [0x7f660ec55b7a]
9: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f660ed0a1c1]
10: /lib64/libpthread.so.0(+0x81ca) [0x7f660c73a1ca]
11: clone()

Actions #1

Updated by Radoslaw Zarzynski about 2 months ago

  • Status changed from New to Need More Info

Looks like a mon-scrub failure. This can be caused by a HW issue or by a corruption.
Is there a sign of malfunctioning hardware?
Is a coredump available by any chance?

Actions #2

Updated by yite gu about 2 months ago

Radoslaw Zarzynski wrote:

Looks like a mon-scrub failure. This can be caused by a HW issue or by a corruption.
Is there a sign of malfunctioning hardware?
Is a coredump available by any chance?

No, there is no sign of malfunctioning hardware. The crash occurred on 4 consecutive days, each time at the same time of day (about 02:29 UTC, per the crash directory names).

/opt/rook/rook-ceph/crash/posted# ll
total 16
drwx------ 2 167 167 4096 Mar  8 10:29 2024-03-08T02:29:00.281609Z_caf1e9d9-2079-4a3d-aa31-1cc374f9aa7b
drwx------ 2 167 167 4096 Mar  9 10:29 2024-03-09T02:29:00.960189Z_52fdca6e-28fc-4f55-bbf1-da1d1bf3c1a4
drwx------ 2 167 167 4096 Mar 10 10:29 2024-03-10T02:29:02.119014Z_730c5ede-6717-4ed5-91a7-25dbf5e76786
drwx------ 2 167 167 4096 Mar 11 16:25 2024-03-11T02:29:03.718856Z_07e5c327-21ba-43f7-81f8-e051eea27e49

After redeploying the problem monitor, the crash did not happen today.
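
To make the "same time point" observation concrete, here is a minimal Python sketch that parses the UTC timestamps from the four crash directory names listed above and prints the gap between consecutive crashes. The helper names (`crash_time`, `intervals_seconds`) are hypothetical, and the script assumes the directory name format is always `<timestamp>Z_<crash id>`, as in the listing.

```python
from datetime import datetime, timezone

# Crash directory names copied from the listing above; each one is
# "<UTC timestamp>Z_<crash id>" (format assumed from these four entries).
CRASH_DIRS = [
    "2024-03-08T02:29:00.281609Z_caf1e9d9-2079-4a3d-aa31-1cc374f9aa7b",
    "2024-03-09T02:29:00.960189Z_52fdca6e-28fc-4f55-bbf1-da1d1bf3c1a4",
    "2024-03-10T02:29:02.119014Z_730c5ede-6717-4ed5-91a7-25dbf5e76786",
    "2024-03-11T02:29:03.718856Z_07e5c327-21ba-43f7-81f8-e051eea27e49",
]


def crash_time(dirname):
    """Parse the UTC timestamp that prefixes a crash directory name."""
    stamp = dirname.split("Z_", 1)[0]
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def intervals_seconds(dirnames):
    """Return the gaps in seconds between consecutive crashes, oldest first."""
    times = sorted(crash_time(d) for d in dirnames)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    for gap in intervals_seconds(CRASH_DIRS):
        # Show how far each gap deviates from exactly 24 hours.
        print(f"{gap:.3f} s since previous crash ({gap - 86400:+.3f} s vs. 24 h)")
```

Each gap comes out within about two seconds of 24 hours. That would be consistent with a once-a-day trigger such as a periodic monitor scrub, although this thread does not confirm what schedules it.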

Actions #3

Updated by Radoslaw Zarzynski about 2 months ago

Would need logs with debug_mon=20 and debug_rocksdb=20 from period before the assertion.
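
One possible way to raise these two debug levels on the monitors is through the centralized config store with `ceph config set`. The Python sketch below only builds and prints the commands by default; the function names and the `dry_run` flag are hypothetical, and it assumes the `ceph` CLI is reachable from where the script runs (for a Rook deployment that would typically be the toolbox pod).

```python
import shlex
import subprocess

# Debug levels requested in the comment above.
DEBUG_LEVELS = {"debug_mon": "20", "debug_rocksdb": "20"}


def build_commands(levels, who="mon"):
    """Build one `ceph config set <who> <option> <value>` command per option."""
    return [["ceph", "config", "set", who, opt, val] for opt, val in levels.items()]


def apply(commands, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            # check=True raises CalledProcessError if the CLI reports a failure.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    apply(build_commands(DEBUG_LEVELS))
```

Level 20 logging is very verbose, so it is probably worth removing the overrides again (for example with `ceph config rm mon debug_mon`) once the crash has been captured.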

Actions #4

Updated by yite gu about 2 months ago

Radoslaw Zarzynski wrote:

Would need logs with debug_mon=20 and debug_rocksdb=20 from period before the assertion.

OK, I will wait for the next crash.

Actions #5

Updated by yite gu about 1 month ago

   -31> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 mon.d@0(leader) e10 _scrub last_key (kv,35617) scrubbed_keys 100 has_next 1
   -30> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 15 mon.d@0(leader) e10 scrub_reset_timeout reset timeout event
   -29> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 mon.d@0(leader) e10 _ms_dispatch existing session 0x55888a354240 for mon.0
   -28> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 mon.d@0(leader) e10  entity_name  global_id 0 (none) caps allow *
   -27> 2024-03-28T06:42:38.170+0000 7f2c51bc8700  5 mon.d@0(leader).paxos(paxos active c 40569131..40569761) is_readable = 1 - now=2024-03-28T06:42:38.171076+0000 lease_expire=2024-03-28T06:42:40.652792+0000 has v0 lc 40569761
   -26> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader).log v19867627 preprocess_query log(1 entries from seq 21188 at 2024-03-28T06:42:38.169947+0000) v1 from mon.0 v2:10.6.153.24:3300/0
   -25> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader).log v19867627 preprocess_log log(1 entries from seq 21188 at 2024-03-28T06:42:38.169947+0000) v1 from mon.0
   -24> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 is_capable service=log command= write addr - on cap allow *
   -23> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20  allow so far , doing grant allow *
   -22> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20  allow all
   -21> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader).log v19867627 prepare_update log(1 entries from seq 21188 at 2024-03-28T06:42:38.169947+0000) v1 from mon.0 v2:10.6.153.24:3300/0
   -20> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader).log v19867627 prepare_log log(1 entries from seq 21188 at 2024-03-28T06:42:38.169947+0000) v1 from mon.0
   -19> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader).log v19867627  logging 2024-03-28T06:42:38.169947+0000 mon.d (mon.0) 21188 : cluster [DBG] scrub ok on 0,1,2: ScrubResult(keys {auth=100} crc {auth=275392284})
   -18> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 mon.d@0(leader) e10 _ms_dispatch existing session 0x55888a354480 for mon.2
   -17> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 mon.d@0(leader) e10  entity_name  global_id 0 (none) caps allow *
   -16> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 is_capable service=mon command= read addr v2:10.6.153.26:3300/0 on cap allow *
   -15> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20  allow so far , doing grant allow *
   -14> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20  allow all
   -13> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader) e10 handle_scrub mon_scrub(result v 40569761 ScrubResult(keys {auth=26,config=2,health=12,kv=60} crc {auth=4123610191,config=876675216,health=2092533985,kv=2385793469}) num_keys 100 key (kv,35617)) v2
   -12> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 15 mon.d@0(leader) e10 scrub_reset_timeout reset timeout event
   -11> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 mon.d@0(leader) e10 _ms_dispatch existing session 0x55888a11f8c0 for mon.1
   -10> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 mon.d@0(leader) e10  entity_name  global_id 0 (none) caps allow *
    -9> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20 is_capable service=mon command= read addr v2:10.6.153.23:3300/0 on cap allow *
    -8> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20  allow so far , doing grant allow *
    -7> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 20  allow all
    -6> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader) e10 handle_scrub mon_scrub(result v 40569761 ScrubResult(keys {auth=26,config=2,health=12,kv=60} crc {auth=4123610191,config=876675216,health=2092533985,kv=2385793469}) num_keys 100 key (kv,35617)) v2
    -5> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 15 mon.d@0(leader) e10 scrub_reset_timeout reset timeout event
    -4> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader) e10 scrub_check_results
    -3> 2024-03-28T06:42:38.170+0000 7f2c51bc8700  0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2: ScrubResult(keys {auth=26,config=2,health=12,kv=60} crc {auth=4123610191,config=876675216,health=2092533985,kv=2385793469})
    -2> 2024-03-28T06:42:38.170+0000 7f2c51bc8700 10 mon.d@0(leader) e10 _scrub start (kv,35617) num_keys 100
    -1> 2024-03-28T06:42:38.172+0000 7f2c51bc8700 -1 /root/rpmbuild/BUILD/ceph-16.2.14-2/src/mon/Monitor.cc: In function 'bool Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> >*, int*)' thread 7f2c51bc8700 time 2024-03-28T06:42:38.172021+0000
/root/rpmbuild/BUILD/ceph-16.2.14-2/src/mon/Monitor.cc: 5661: FAILED ceph_assert(err == 0)

 ceph version 16.2.14-2 (8ee3a81f5d70c1cd5b0b8c5cfade8580a0025906) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f2c5f726fcc]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x27a1e6) [0x7f2c5f7271e6]
 3: (Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, int*)+0x14bf) [0x5588862375bf]
 4: (Monitor::scrub()+0x3b8) [0x55888623b198]
 5: (Monitor::handle_scrub(boost::intrusive_ptr<MonOpRequest>)+0x470) [0x55888623b9c0]
 6: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xdc0) [0x55888625c090]
 7: (Monitor::_ms_dispatch(Message*)+0x670) [0x55888625d080]
 8: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55888628c2bc]
 9: (DispatchQueue::entry()+0x126a) [0x7f2c5f96eb7a]
 10: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f2c5fa231c1]
 11: /lib64/libpthread.so.0(+0x81ca) [0x7f2c5d4531ca]
 12: clone()