Bug #23713
closedHigh MON cpu usage when cluster is changing
0%
Description
After upgrading to Luminous 12.2.4 (from Jewel 10.2.5), we consistently see high cpu usage when OSDMap changes , especially from DOWN to UP.
At first the behavior was when one OSD gets down and up (trigger by `ceph osd down osd.x`), monitor quorum will stuck and re-election due to lease timeout. Further investigation shows
1. lease message is not getting handled for 3s+ during this period.
2. the CPU usage reach 300% in that periods, indicating 3 async worker thread is not enough.
2018-04-09 23:46:51.030583 7fecad4de700 10 mon.lvs3b03c-2gsx@1(peon).paxos(paxos active c 112056948..112057621) handle_lease on 112057621 now 2018-04-09 23:46:52.480946
Then we changed to use 8 worker threads , or changed back to simple messenger(as jewel default to simple), both looks better in above cases, i.e , single OSD restart will not causing lease timeout. But, if we reboot a OSD node with 24 osds, or just
`for i in seq 1 10 ; do ceph osd down $i; done;` can cause monitors use massive CPU (we saw 1600% CPU -> 16 cores in simple messenger) and the last 1 or 2 osd take very long time for `boot` message.
It looks like the ms_dispatch thread is busy with handling mon_subscribe from clients ( we have 7000+ volumes), interestingly, it usually take 0.2ms(counted by the interval between each 2 `handle_subscribe mon_subscribe` log from logfile) per mon_subscribe request in both 12.2.4 and 10.2.5, but it may take 15ms + sometimes, during above mentioned case.
@
2018-04-13 02:42:51.740904 7f173c8f5700 10 mon.lvs3b03c-2fht@0(leader) e37 handle_subscribe mon_subscribe({osdmap=2038709}) v2
2018-04-13 02:42:51.740910 7f173c8f5700 10 mon.lvs3b03c-2fht@0(leader).osd e2038709 check_osdmap_sub 0x7f178079f100 next 2038709 (onetime)
2018-04-13 02:42:51.740912 7f173c8f5700 5 mon.lvs3b03c-2fht@0(leader).osd e2038709 send_incremental [2038709..2038709] to client.? 10.173.84.46:0/524267465
2018-04-13 02:42:51.740915 7f173c8f5700 10 mon.lvs3b03c-2fht@0(leader).osd e2038709 build_incremental [2038709..2038709]
2018-04-13 02:42:51.754653 7f173c8f5700 10 mon.lvs3b03c-2fht@0(leader) e37 handle_subscribe mon_subscribe({osdmap=2038708}) v2
2018-04-13 02:42:51.754658 7f173c8f5700 10 mon.lvs3b03c-2fht@0(leader).osd e2038709 check_osdmap_sub 0x7f175dd8db80 next 2038708 (onetime)
2018-04-13 02:42:51.754660 7f173c8f5700 5 mon.lvs3b03c-2fht@0(leader).osd e2038709 send_incremental [2038708..2038709] to client.? 10.161.98.161:0/559412026
2018-04-13 02:42:51.754683 7f173c8f5700 10 mon.lvs3b03c-2fht@0(leader).osd e2038709 build_incremental [2038708..2038709]
@
I dont have further idea about what is going on, and the cluster now is fairly unstable if a bunch of OSDs get up/down in a short interval.
i can get whatever logs for debugging, as it is fair easy to reproduce
Updated by Sage Weil about 6 years ago
My guess is that this is the compat reencoding of the OSDMap for the pre-luminous clients.
Are you by chance making use of the crush-compat balancer? That would additionally require a reencoded crush map.
Updated by Greg Farnum about 6 years ago
- Project changed from Ceph to RADOS
- Category deleted (
Monitor) - Component(RADOS) Monitor added
Updated by Sage Weil about 6 years ago
- Status changed from New to Fix Under Review
Updated by Sage Weil almost 6 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to luminous
Updated by Nathan Cutler almost 6 years ago
- Copied to Backport #23912: luminous: mon: High MON cpu usage when cluster is changing added
Updated by Nathan Cutler over 5 years ago
- Status changed from Pending Backport to Resolved