Bug #23712 (closed)

High MON cpu usage when cluster is changing

Added by Xiaoxi Chen about 6 years ago. Updated about 6 years ago.

Status: Duplicate
Priority: High
Assignee: -
Category: Monitor
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After upgrading to Luminous 12.2.4 (from Jewel 10.2.5), we consistently see high CPU usage on the monitors when the OSDMap changes, especially when OSDs go from DOWN to UP.

At first the behavior was that when a single OSD goes down and comes back up (triggered by `ceph osd down osd.x`), the monitor quorum gets stuck and re-elects due to a lease timeout. Further investigation shows:
1. The lease message is not handled for 3s+ during this period (log excerpt below; see also the mon_lease note after it).
2. CPU usage reaches 300% during that period, indicating that 3 async worker threads are not enough.
```
2018-04-09 23:46:51.030583 7fecad4de700 10 mon.lvs3b03c-2gsx@1(peon).paxos(paxos active c 112056948..112057621) handle_lease on 112057621 now 2018-04-09 23:46:52.480946
```
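
For context on point 1 above: the lease window involved here is controlled by mon_lease (5 seconds by default), so a 3s+ stall in dispatching the lease message consumes most of it. A quick way to confirm the configured value on the peon from the excerpt (a sketch using the mon admin socket; adjust the mon id for your host):
```
# Ask the running monitor for its configured lease window (default 5.0 seconds).
ceph daemon mon.lvs3b03c-2gsx config get mon_lease
```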

Then we changed to 8 worker threads, or switched back to the simple messenger (Jewel defaults to simple); both looked better in the above case, i.e. a single OSD restart no longer causes a lease timeout. But rebooting an OSD node with 24 OSDs, or even just
`for i in $(seq 1 10); do ceph osd down $i; done`, causes the monitors to use massive CPU (we saw 1600% CPU, i.e. 16 cores, with the simple messenger), and the last 1 or 2 OSDs take a very long time for the `boot` message to go through.
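
For reference, a minimal ceph.conf sketch of the two workarounds mentioned above (standard ms_async_op_threads / ms_type options; a monitor restart is needed for either change to take effect, and the [mon] placement is just one way to scope it):
```
[mon]
# Workaround 1: raise the async messenger worker threads from the default 3 to 8.
ms_async_op_threads = 8

# Workaround 2 (alternative): fall back to the simple messenger, as in Jewel.
#ms_type = simple
```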

It looks like the ms_dispatch thread is busy handling mon_subscribe requests from clients (we have 7000+ volumes). Interestingly, handling one mon_subscribe request usually takes about 0.2ms (counted as the interval between two consecutive `handle_subscribe mon_subscribe` lines in the log file) on both 12.2.4 and 10.2.5, but it can sometimes take 15ms+ during the case described above (a sketch of one way to measure these intervals follows the excerpt below).

```
2018-04-13 02:42:51.740904 7f173c8f5700 10 mon.lvs3b03c-2fht@0(leader) e37 handle_subscribe mon_subscribe({osdmap=2038709}) v2
2018-04-13 02:42:51.740910 7f173c8f5700 10 mon.lvs3b03c-2fht@0(leader).osd e2038709 check_osdmap_sub 0x7f178079f100 next 2038709 (onetime)
2018-04-13 02:42:51.740912 7f173c8f5700 5 mon.lvs3b03c-2fht@0(leader).osd e2038709 send_incremental [2038709..2038709] to client.? 10.173.84.46:0/524267465
2018-04-13 02:42:51.740915 7f173c8f5700 10 mon.lvs3b03c-2fht@0(leader).osd e2038709 build_incremental [2038709..2038709]
2018-04-13 02:42:51.754653 7f173c8f5700 10 mon.lvs3b03c-2fht@0(leader) e37 handle_subscribe mon_subscribe({osdmap=2038708}) v2
2018-04-13 02:42:51.754658 7f173c8f5700 10 mon.lvs3b03c-2fht@0(leader).osd e2038709 check_osdmap_sub 0x7f175dd8db80 next 2038708 (onetime)
2018-04-13 02:42:51.754660 7f173c8f5700 5 mon.lvs3b03c-2fht@0(leader).osd e2038709 send_incremental [2038708..2038709] to client.? 10.161.98.161:0/559412026
2018-04-13 02:42:51.754683 7f173c8f5700 10 mon.lvs3b03c-2fht@0(leader).osd e2038709 build_incremental [2038708..2038709]

```
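The 0.2ms / 15ms figures come from the gaps between consecutive handle_subscribe lines; a rough sketch of that measurement against a mon log (assumes the default timestamp format in the second column, a log that stays within a single day, and the usual /var/log/ceph file naming):
```
# Print the largest gaps (in seconds) between consecutive mon_subscribe handlings.
grep 'handle_subscribe mon_subscribe' /var/log/ceph/ceph-mon.lvs3b03c-2fht.log \
  | awk '{ split($2, t, ":"); s = t[1]*3600 + t[2]*60 + t[3];
           if (prev != "") printf "%.6f\n", s - prev; prev = s }' \
  | sort -n | tail
```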
I don't have any further idea of what is going on, and the cluster is now fairly unstable if a bunch of OSDs go up/down within a short interval.

I can collect whatever logs are needed for debugging, as it is fairly easy to reproduce.
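
If more logs would help, this is roughly how I would raise monitor debugging on one of the mons before reproducing (a sketch via the admin socket; the mon id below is the leader from the excerpt above, so adjust it for your host):
```
# Temporarily raise monitor and messenger debug levels on the local mon; revert afterwards.
ceph daemon mon.lvs3b03c-2fht config set debug_mon 10
ceph daemon mon.lvs3b03c-2fht config set debug_ms 1
```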

#1

Updated by Greg Farnum about 6 years ago

  • Status changed from New to Duplicate