Bug #50950

MIMIC OSD very high CPU usage (3xx%), stops responding to other OSDs, causing PGs stuck at peering

Added by Bin Guo almost 3 years ago. Updated almost 3 years ago.

Status: Won't Fix
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I've been running this Mimic cluster (about 530 OSDs) for over a year. Recently I found that some particular OSDs randomly run into a busy-loop mode with very high CPU usage (300%~400%, which is right at the Pod resource limit). Meanwhile, these OSDs stop responding to any messages from outside, and the cluster status shows some PGs stuck in the peering state.

All of the problems mentioned above disappear after about 3 to 4 hours, and then everything goes back to normal. I can't reproduce this, but it has happened 3 times.

Any help will be appreciated!

#1

Updated by Bin Guo almost 3 years ago

Finally, I got the stack trace of the CPU killer:

#0  0x00007fd2b41ba5ea in btree::btree<btree::btree_map_params<pg_t, int*, std::less<pg_t>, std::allocator<std::pair<pg_t const, int*> >, 256> >::internal_insert(btree::btree_iterator<btree::btree_node<btree::btree_map_params<pg_t, int*, std::less<pg_t>, std::allocator<std::pair<pg_t const, int*> >, 256> >, std::pair<pg_t const, int*>&, std::pair<pg_t const, int*>*>, std::pair<pg_t const, int*> const&) ()
   from /usr/lib/ceph/libceph-common.so.0
#1  0x00007fd2b41bb268 in PGTempMap::decode(ceph::buffer::list::iterator&) () from /usr/lib/ceph/libceph-common.so.0
#2  0x00007fd2b41a0a42 in OSDMap::decode(ceph::buffer::list::iterator&) () from /usr/lib/ceph/libceph-common.so.0
#3  0x00007fd2b41a3261 in OSDMap::decode(ceph::buffer::list&) () from /usr/lib/ceph/libceph-common.so.0
#4  0x0000000000755438 in OSDService::try_get_map(unsigned int) ()
#5  0x0000000000759e7d in OSD::build_initial_pg_history(spg_t, unsigned int, utime_t, pg_history_t*, PastIntervals*) ()
#6  0x0000000000763a9a in OSD::handle_pg_create(boost::intrusive_ptr<OpRequest>) ()
#7  0x00000000007641e9 in OSD::dispatch_op(boost::intrusive_ptr<OpRequest>) ()
#8  0x0000000000764708 in OSD::_dispatch(Message*) ()
#9  0x0000000000764a06 in OSD::ms_dispatch(Message*) ()
#10 0x00007fd2b409ab32 in DispatchQueue::entry() () from /usr/lib/ceph/libceph-common.so.0
#11 0x00007fd2b413971d in DispatchQueue::DispatchThread::entry() () from /usr/lib/ceph/libceph-common.so.0
#12 0x00007fd2b26a76ba in start_thread (arg=0x7fd2a02fe700) at pthread_create.c:333
#13 0x00007fd2b1cb641d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

#2

Updated by Bin Guo almost 3 years ago

And here is what it looks like in top:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                  
1974330 64045     20   0 2015312   1.1g  41396 S 320.1   1.7 149:25.28 /usr/bin/ceph-osd --cluster ceph -f -i 208 --setuser ceph --setgroup disk

I'm assuming that the OSD wants to rebuild the in-memory btree from disk, but for some unknown reason the `internal_insert` operation went crazy!
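
To make the hypothesis concrete: frames #4-#6 of the trace show that handling a pg create walks historical OSDMaps, and every epoch that misses the OSD's map cache gets decoded from disk, which rebuilds the whole pg_temp table entry by entry (frames #0-#2). Below is a minimal, self-contained sketch of that cost model; the types, names, and sizes are hypothetical stand-ins (std::map plays the role of Ceph's vendored cpp-btree btree_map), not Ceph's actual code:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical stand-in for Ceph's pg_t (pool id + placement seed).
struct PgId {
    uint64_t pool;
    uint32_t seed;
    bool operator<(const PgId& o) const {
        return std::tie(pool, seed) < std::tie(o.pool, o.seed);
    }
};

// Stand-in for one serialized pg_temp entry: pg -> acting-set size.
using PgTempEntry = std::pair<PgId, uint32_t>;

// Models what PGTempMap::decode conceptually does: rebuild the whole
// in-memory tree from the on-disk encoding, one insert per entry
// (each insert playing the role of btree internal_insert).
static std::map<PgId, uint32_t>
decode_pg_temp(const std::vector<PgTempEntry>& encoded) {
    std::map<PgId, uint32_t> tree;
    for (const auto& e : encoded)
        tree.emplace(e.first, e.second);
    return tree;
}

int main() {
    // Assume a large pg_temp, e.g. during a big rebalance (sizes
    // here are made up for illustration) ...
    const size_t entries = 200000;
    std::vector<PgTempEntry> encoded;
    encoded.reserve(entries);
    for (size_t i = 0; i < entries; ++i)
        encoded.push_back({{i % 64, static_cast<uint32_t>(i)}, 3});

    // ... and a pg create that needs a long run of historical epochs,
    // none of them cached, so each one pays the full rebuild.
    const int epochs = 500;
    auto t0 = std::chrono::steady_clock::now();
    size_t total = 0;
    for (int e = 0; e < epochs; ++e)
        total += decode_pg_temp(encoded).size();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - t0).count();
    std::printf("decoded %d epochs (%zu inserts) in %lld ms\n",
                epochs, total, static_cast<long long>(ms));
}

If a large pg_temp really is in play here, repeating this full per-entry rebuild for every historical epoch would line up with internal_insert dominating the profile for hours.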

#3

Updated by Neha Ojha almost 3 years ago

  • Project changed from bluestore to RADOS
  • Status changed from New to Won't Fix

Mimic is EOL. Can you please upgrade to a newer version and re-open this ticket if you continue to see this issue?
