Project

General

Profile

Bug #44945

Mon High CPU usage when another mon syncing from it

Added by Xiaoxi Chen almost 3 years ago. Updated about 1 year ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Each sync request take very long time to come back, as show below. And from the TOP of source-mon, the CPU was on dispatcher, perf shows mostly on buffer operations.

One special thing in this cluster is , the cluster is heavily used by rbd snapshot, where 1K + volumes make snapshots every hour and keep for 7 days.

2020-04-05 19:53:09.722 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0x199af980
2020-04-05 19:53:09.722 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0x199af980 con 0xc632d80
2020-04-05 19:53:09.758 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 244 ==== mon_sync(chunk cookie 79456894977 lc 40807336 bl 1625633 bytes) v2 ==== 1625721+0+0 (secure 0 0 0) 0x199af980 con 0xc632d80
2020-04-05 19:53:09.766 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0x199ae580
2020-04-05 19:53:09.766 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0x199ae580 con 0xc632d80
2020-04-05 19:53:23.219 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 245 ==== mon_sync(chunk cookie 79456894977 lc 40807337 bl 998724 bytes last_key osd_snap,removed_snap_1_00000000004fedcb) v2 ==== 998851+0+0 (secure 0 0 0) 0x199ae580 con 0xc632d80
2020-04-05 19:53:23.235 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0x199ae300
2020-04-05 19:53:23.235 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0x199ae300 con 0xc632d80
2020-04-05 19:53:23.271 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 246 ==== mon_sync(chunk cookie 79456894977 lc 40807338 bl 1175179 bytes) v2 ==== 1175267+0+0 (secure 0 0 0) 0x199ae300 con 0xc632d80
2020-04-05 19:53:23.279 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0x199af200
2020-04-05 19:53:23.279 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0x199af200 con 0xc632d80
2020-04-05 19:53:40.436 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 247 ==== mon_sync(chunk cookie 79456894977 lc 40807338 bl 1048478 bytes last_key osd_snap,removed_snap_1_0000000000504aa4) v2 ==== 1048605+0+0 (secure 0 0 0) 0x199af200 con 0xc632d80
2020-04-05 19:53:40.448 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0x199ac780
2020-04-05 19:53:40.448 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0x199ac780 con 0xc632d80
2020-04-05 19:53:54.404 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 248 ==== mon_sync(chunk cookie 79456894977 lc 40807339 bl 998698 bytes last_key osd_snap,removed_snap_1_000000000050a0f4) v2 ==== 998825+0+0 (secure 0 0 0) 0x199ac780 con 0xc632d80
2020-04-05 19:53:54.420 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0x199afc00
2020-04-05 19:53:54.420 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0x199afc00 con 0xc632d80
2020-04-05 19:54:12.133 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 249 ==== mon_sync(chunk cookie 79456894977 lc 40807339 bl 1048478 bytes last_key osd_snap,removed_snap_1_000000000051020c) v2 ==== 1048605+0+0 (secure 0 0 0) 0x199afc00 con 0xc632d80
2020-04-05 19:54:12.149 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0x199ac280
2020-04-05 19:54:12.149 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0x199ac280 con 0xc632d80
2020-04-05 19:54:12.173 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 250 ==== mon_sync(chunk cookie 79456894977 lc 40807340 bl 1626593 bytes) v2 ==== 1626681+0+0 (secure 0 0 0) 0x199ac280 con 0xc632d80
2020-04-05 19:54:12.181 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0x199af480
2020-04-05 19:54:12.181 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0x199af480 con 0xc632d80
2020-04-05 19:54:26.138 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 251 ==== mon_sync(chunk cookie 79456894977 lc 40807341 bl 998690 bytes last_key osd_snap,removed_snap_1_0000000000515c15) v2 ==== 998817+0+0 (secure 0 0 0) 0x199af480 con 0xc632d80
2020-04-05 19:54:26.154 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0xce4ed00
2020-04-05 19:54:26.154 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0xce4ed00 con 0xc632d80
2020-04-05 19:54:43.459 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 252 ==== mon_sync(chunk cookie 79456894977 lc 40807341 bl 1048478 bytes last_key osd_snap,removed_snap_1_000000000051c031) v2 ==== 1048605+0+0 (secure 0 0 0) 0xce4ed00 con 0xc632d80
2020-04-05 19:54:43.475 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0xce4ef80
2020-04-05 19:54:43.475 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0xce4ef80 con 0xc632d80
2020-04-05 19:54:43.507 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 253 ==== mon_sync(chunk cookie 79456894977 lc 40807342 bl 1624418 bytes) v2 ==== 1624506+0+0 (secure 0 0 0) 0xce4ef80 con 0xc632d80
2020-04-05 19:54:43.511 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] send_to--> mon [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- ?+0 0xce4f200
2020-04-05 19:54:43.511 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] --> [v2:10.75.88.177:3300/0,v1:10.75.88.177:6789/0] -- mon_sync(get_chunk cookie 79456894977) v2 -- 0xce4f200 con 0xc632d80
2020-04-05 19:54:57.272 7f34ada19700  1 -- [v2:10.75.86.255:3300/0,v1:10.75.86.255:6789/0] <== mon.2 v2:10.75.88.177:3300/0 254 ==== mon_sync(chunk cookie 79456894977 lc 40807343 bl 998702 bytes last_key osd_snap,removed_snap_1_0000000000521b3f) v2 ==== 998829+0+0 (secure 0 0 0) 0xce4f200 con 0xc632d80
top - 19:52:23 up 333 days,  8:14,  2 users,  load average: 1.21, 1.17, 0.95
Threads:  27 total,   1 running,  26 sleeping,   0 stopped,   0 zombie
%Cpu(s): 13.6 us,  0.2 sy,  0.0 ni, 86.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16431696 total,  8569568 free,  1426536 used,  6435592 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 14443308 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
3812474 ceph      20   0 1654848 940032  28164 R 99.0  5.7  33:25.47 ms_dispatch
3812472 ceph      20   0 1654848 940032  28164 S  2.7  5.7  17:08.01 cpu_tp
3812470 ceph      20   0 1654848 940032  28164 S  2.3  5.7  18:23.44 cpu_tp
3812471 ceph      20   0 1654848 940032  28164 S  2.3  5.7  17:27.31 cpu_tp
3812473 ceph      20   0 1654848 940032  28164 S  2.3  5.7  18:07.42 cpu_tp
3812439 ceph      20   0 1654848 940032  28164 S  0.3  5.7   0:08.48 log
3812464 ceph      20   0 1654848 940032  28164 S  0.3  5.7   0:14.46 msgr-worker-0
3812437 ceph      20   0 1654848 940032  28164 S  0.0  5.7   0:00.47 ceph-mon
3812440 ceph      20   0 1654848 940032  28164 S  0.0  5.7   0:00.69 service
3812441 ceph      20   0 1654848 940032  28164 S  0.0  5.7   0:03.91 admin_socket
3812442 ceph      20   0 1654848 940032  28164 S  0.0  5.7   0:00.00 signal_handler
3812443 ceph      20   0 1654848 940032  28164 S  0.0  5.7  27:21.85 rocksdb:low0
3812444 ceph      20   0 1654848 940032  28164 S  0.0  5.7   1:18.43 rocksdb:high0
3812445 ceph      20   0 1654848 940032  28164 S  0.0  5.7   0:00.00 ceph-mon
3812461 ceph      20   0 1654848 940032  28164 S  0.0  5.7   0:00.05 rocksdb:dump_st
  15.16%  libceph-common.so.0   [.] _ZN4ceph6buffer7v14_2_03ptr6appendEPKcj                                                                                                                                                                                                    ▒
  12.67%  libceph-common.so.0   [.] _ZN4ceph6buffer7v14_2_03ptr7releaseEv                                                                                                                                                                                                      ▒
  11.29%  libceph-common.so.0   [.] _ZN4ceph6buffer7v14_2_04list6appendEPKcj                                                                                                                                                                                                   ▒
   9.08%  libceph-common.so.0   [.] _ZN4ceph6buffer7v14_2_03ptrC1ERKS2_                                                                                                                                                                                                        ▒
   8.63%  libtcmalloc.so.4.2.6  [.] tc_malloc                                                                                                                                                                                                                                  ▒
   4.26%  libtcmalloc.so.4.2.6  [.] _ZN8tcmalloc11ThreadCache21ReleaseToCentralCacheEPNS0_8FreeListEmi                                                                                                                                                                         ▒
   4.02%  libtcmalloc.so.4.2.6  [.] tc_deletearray                                                                                                                                                                                                                             ▒
   3.99%  libceph-common.so.0   [.] _ZNK4ceph6buffer7v14_2_03ptr18unused_tail_lengthEv                                                                                                                                                                                         ▒
   3.90%  ceph-mon              [.] _ZNK14MonitorDBStore11Transaction6encodeERN4ceph6buffer7v14_2_04listE

History

#1 Updated by Xiaoxi Chen almost 3 years ago

It is probably relate with huge removed_snap keys.

#2 Updated by Josh Durgin almost 3 years ago

  • Priority changed from Normal to High

#3 Updated by Neha Ojha over 2 years ago

  • Status changed from New to Need More Info

Which ceph version is this? I'd be curious to know if this is still an issue with Octopus since we improved removed snaps handling a bunch there.

#4 Updated by Wout van Heeswijk about 2 years ago

I think this might be related to #42830. If so it may be resolved with Ceph Nautilus 14.2.10 by backport #44464. If so, can this report be closed?

#5 Updated by Neha Ojha almost 2 years ago

Wout van Heeswijk wrote:

I think this might be related to #42830. If so it may be resolved with Ceph Nautilus 14.2.10 by backport #44464. If so, can this report be closed?

Have you had a chance to test with 14.2.10 yet?

#6 Updated by Neha Ojha about 1 year ago

  • Priority changed from High to Normal

Also available in: Atom PDF