Actions
Bug #44945
openMon High CPU usage when another mon syncing from it
Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Each sync request take very long time to come back, as show below. And from the TOP of source-mon, the CPU was on dispatcher, perf shows mostly on buffer operations.
One special thing in this cluster is , the cluster is heavily used by rbd snapshot, where 1K + volumes make snapshots every hour and keep for 7 days.
2020-04-05 19:53:09.722 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0x199af980 2020-04-05 19:53:09.722 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0x199af980 con 0xc632d80 2020-04-05 19:53:09.758 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 244 ==== mon_sync(chunk cookie 79456894977 lc 40807336 bl 1625633 bytes) v2 ==== 1625721+0+0 (secure 0 0 0) 0x199af980 con 0xc632d80 2020-04-05 19:53:09.766 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0x199ae580 2020-04-05 19:53:09.766 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0x199ae580 con 0xc632d80 2020-04-05 19:53:23.219 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 245 ==== mon_sync(chunk cookie 79456894977 lc 40807337 bl 998724 bytes last_key osd_snap,removed_snap_1_00000000004fedcb) v2 ==== 998851+0+0 (secure 0 0 0) 0x199ae580 con 0xc632d80 2020-04-05 19:53:23.235 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0x199ae300 2020-04-05 19:53:23.235 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0x199ae300 con 0xc632d80 2020-04-05 19:53:23.271 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 246 ==== mon_sync(chunk cookie 79456894977 lc 40807338 bl 1175179 bytes) v2 ==== 1175267+0+0 (secure 0 0 0) 0x199ae300 con 0xc632d80 2020-04-05 19:53:23.279 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0x199af200 2020-04-05 19:53:23.279 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0x199af200 con 0xc632d80 2020-04-05 19:53:40.436 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 247 ==== mon_sync(chunk cookie 79456894977 lc 40807338 bl 1048478 bytes last_key osd_snap,removed_snap_1_0000000000504aa4) v2 ==== 1048605+0+0 (secure 0 0 0) 0x199af200 con 0xc632d80 2020-04-05 19:53:40.448 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0x199ac780 2020-04-05 19:53:40.448 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0x199ac780 con 0xc632d80 2020-04-05 19:53:54.404 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 248 ==== mon_sync(chunk cookie 79456894977 lc 40807339 bl 998698 bytes last_key osd_snap,removed_snap_1_000000000050a0f4) v2 ==== 998825+0+0 (secure 0 0 0) 0x199ac780 con 0xc632d80 2020-04-05 19:53:54.420 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0x199afc00 2020-04-05 19:53:54.420 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0x199afc00 con 0xc632d80 2020-04-05 19:54:12.133 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 249 ==== mon_sync(chunk cookie 79456894977 lc 40807339 bl 1048478 bytes last_key osd_snap,removed_snap_1_000000000051020c) v2 ==== 1048605+0+0 (secure 0 0 0) 0x199afc00 con 0xc632d80 2020-04-05 19:54:12.149 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0x199ac280 2020-04-05 19:54:12.149 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0x199ac280 con 0xc632d80 2020-04-05 19:54:12.173 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 250 ==== mon_sync(chunk cookie 79456894977 lc 40807340 bl 1626593 bytes) v2 ==== 1626681+0+0 (secure 0 0 0) 0x199ac280 con 0xc632d80 2020-04-05 19:54:12.181 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0x199af480 2020-04-05 19:54:12.181 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0x199af480 con 0xc632d80 2020-04-05 19:54:26.138 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 251 ==== mon_sync(chunk cookie 79456894977 lc 40807341 bl 998690 bytes last_key osd_snap,removed_snap_1_0000000000515c15) v2 ==== 998817+0+0 (secure 0 0 0) 0x199af480 con 0xc632d80 2020-04-05 19:54:26.154 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0xce4ed00 2020-04-05 19:54:26.154 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0xce4ed00 con 0xc632d80 2020-04-05 19:54:43.459 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 252 ==== mon_sync(chunk cookie 79456894977 lc 40807341 bl 1048478 bytes last_key osd_snap,removed_snap_1_000000000051c031) v2 ==== 1048605+0+0 (secure 0 0 0) 0xce4ed00 con 0xc632d80 2020-04-05 19:54:43.475 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0xce4ef80 2020-04-05 19:54:43.475 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0xce4ef80 con 0xc632d80 2020-04-05 19:54:43.507 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 253 ==== mon_sync(chunk cookie 79456894977 lc 40807342 bl 1624418 bytes) v2 ==== 1624506+0+0 (secure 0 0 0) 0xce4ef80 con 0xc632d80 2020-04-05 19:54:43.511 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0xce4f200 2020-04-05 19:54:43.511 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0xce4f200 con 0xc632d80 2020-04-05 19:54:57.272 7f34ada19700 1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 254 ==== mon_sync(chunk cookie 79456894977 lc 40807343 bl 998702 bytes last_key osd_snap,removed_snap_1_0000000000521b3f) v2 ==== 998829+0+0 (secure 0 0 0) 0xce4f200 con 0xc632d80
top - 19:52:23 up 333 days, 8:14, 2 users, load average: 1.21, 1.17, 0.95 Threads: 27 total, 1 running, 26 sleeping, 0 stopped, 0 zombie %Cpu(s): 13.6 us, 0.2 sy, 0.0 ni, 86.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem : 16431696 total, 8569568 free, 1426536 used, 6435592 buff/cache KiB Swap: 0 total, 0 free, 0 used. 14443308 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 3812474 ceph 20 0 1654848 940032 28164 R 99.0 5.7 33:25.47 ms_dispatch 3812472 ceph 20 0 1654848 940032 28164 S 2.7 5.7 17:08.01 cpu_tp 3812470 ceph 20 0 1654848 940032 28164 S 2.3 5.7 18:23.44 cpu_tp 3812471 ceph 20 0 1654848 940032 28164 S 2.3 5.7 17:27.31 cpu_tp 3812473 ceph 20 0 1654848 940032 28164 S 2.3 5.7 18:07.42 cpu_tp 3812439 ceph 20 0 1654848 940032 28164 S 0.3 5.7 0:08.48 log 3812464 ceph 20 0 1654848 940032 28164 S 0.3 5.7 0:14.46 msgr-worker-0 3812437 ceph 20 0 1654848 940032 28164 S 0.0 5.7 0:00.47 ceph-mon 3812440 ceph 20 0 1654848 940032 28164 S 0.0 5.7 0:00.69 service 3812441 ceph 20 0 1654848 940032 28164 S 0.0 5.7 0:03.91 admin_socket 3812442 ceph 20 0 1654848 940032 28164 S 0.0 5.7 0:00.00 signal_handler 3812443 ceph 20 0 1654848 940032 28164 S 0.0 5.7 27:21.85 rocksdb:low0 3812444 ceph 20 0 1654848 940032 28164 S 0.0 5.7 1:18.43 rocksdb:high0 3812445 ceph 20 0 1654848 940032 28164 S 0.0 5.7 0:00.00 ceph-mon 3812461 ceph 20 0 1654848 940032 28164 S 0.0 5.7 0:00.05 rocksdb:dump_st
15.16% libceph-common.so.0 [.] _ZN4ceph6buffer7v14_2_03ptr6appendEPKcj ▒ 12.67% libceph-common.so.0 [.] _ZN4ceph6buffer7v14_2_03ptr7releaseEv ▒ 11.29% libceph-common.so.0 [.] _ZN4ceph6buffer7v14_2_04list6appendEPKcj ▒ 9.08% libceph-common.so.0 [.] _ZN4ceph6buffer7v14_2_03ptrC1ERKS2_ ▒ 8.63% libtcmalloc.so.4.2.6 [.] tc_malloc ▒ 4.26% libtcmalloc.so.4.2.6 [.] _ZN8tcmalloc11ThreadCache21ReleaseToCentralCacheEPNS0_8FreeListEmi ▒ 4.02% libtcmalloc.so.4.2.6 [.] tc_deletearray ▒ 3.99% libceph-common.so.0 [.] _ZNK4ceph6buffer7v14_2_03ptr18unused_tail_lengthEv ▒ 3.90% ceph-mon [.] _ZNK14MonitorDBStore11Transaction6encodeERN4ceph6buffer7v14_2_04listE
Updated by Xiaoxi Chen about 4 years ago
It is probably relate with huge removed_snap keys.
Updated by Neha Ojha over 3 years ago
- Status changed from New to Need More Info
Which ceph version is this? I'd be curious to know if this is still an issue with Octopus since we improved removed snaps handling a bunch there.
Updated by Wout van Heeswijk over 3 years ago
Actions