Bug #42830

problem returning mon to cluster

Added by Nikola Ciprich almost 4 years ago. Updated over 2 years ago.



2 - major


as discussed on the list, here

After rebooting one of the nodes, starting its monitor makes the whole cluster
appear to hang, including IO, ceph -s, etc. When this mon is stopped again,
everything continues. Spawning a new monitor leads to the same problem
(even on a different node).

All cluster nodes are CentOS 7 machines. I have 3 monitors (so 2 are now running) and I'm
using Ceph 13.2.6. The monitor database is not very large, ~65MB. None of the cluster machines is overloaded.

Update: after some discussion on the list, I was able to work around this by setting the mon lease timeout to 50s, waiting for the monitor to join the cluster, and then setting it back to 5s. The mon took hours to connect, by the way! Once it was OK, stopping and starting it works without a flaw.
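The raise, wait, restore sequence described above could be sketched roughly as follows. This is only an illustration, not the exact procedure that was used: the option name `mon_lease` is an assumption (the report only says "mon lease timeout"), the `ceph config set mon ...` form is borrowed from the workaround posted later in this thread, and `ceph`, `with_temporary_mon_option` and `wait_until_joined` are hypothetical helper names.

```python
import subprocess


def ceph(*args):
    """Run a ceph CLI command and return its stdout."""
    result = subprocess.run(
        ["ceph", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def with_temporary_mon_option(option, temporary, original, wait_until_joined, run=ceph):
    """Set a mon option to a temporary value, wait, then always restore it.

    `option` is whatever setting is being relaxed; "mon_lease" is an
    ASSUMPTION, since the report does not name the exact option.
    `wait_until_joined` is a caller-supplied callable that blocks until the
    monitor has joined the cluster (that took hours in the report).
    """
    run("config", "set", "mon", option, str(temporary))
    try:
        wait_until_joined()
    finally:
        # Restore the original value even if waiting fails or is interrupted.
        run("config", "set", "mon", option, str(original))
```

A call such as `with_temporary_mon_option("mon_lease", 50, 5, wait_until_joined=my_wait)` would mirror the 50s and 5s values mentioned above, where `my_wait` is whatever check you trust for "the monitor has joined".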

I'm quite sure there is no network issue here, and since this first case we have been hit by it on another cluster.

The probably good news is that I was able to reproduce this problem by creating the same test environment in VMs, with the same hostnames, addresses and Ceph version, and with copied monitor data. So if anyone is interested, we can give SSH access or the exact steps and data to reproduce.

If I can provide more data, please let me know. I'm also attaching ceph-mon.log with debug_mon set to 10/10.

ceph-mon.nodev1d.log (190 KB) Nikola Ciprich, 11/15/2019 07:30 AM

Related issues

Related to RADOS - Bug #44453: mon: fix/improve mon sync over small keys Resolved


#1 Updated by Jérôme Poulin almost 4 years ago

We encountered the same problem last week: after stopping a monitor service on a server in the cluster, trying to start it again makes it cycle between peon and electing. We destroyed all of the monitors except one and integrated the second one, with some issues that only waiting could fix. Starting the third, without even adding it to the monmap, makes any command to the cluster block until I stop the new monitor. The monitor now stays stuck at synchronizing.

2019-11-18 16:37:57.614 7fa35f587700 1 mon.sg2crsrv02@-1(synchronizing) e45 sync_obtain_latest_monmap
2019-11-18 16:37:57.674 7fa35f587700 1 mon.sg2crsrv02@-1(synchronizing) e45 sync_obtain_latest_monmap obtained monmap e45

#2 Updated by Jérôme Poulin almost 4 years ago

I forgot, Ceph is at version 14.2.1 on our side.

#3 Updated by Greg Farnum almost 4 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (Monitor)
  • Component(RADOS) Monitor added

#4 Updated by Dan van der Ster over 3 years ago

Seeing the same here in 13.2.8 when starting a new empty mon. The leader's CPU goes to 100% until an election is called, then the mons start flapping between themselves.

#5 Updated by Dan van der Ster over 3 years ago

I noticed there is very little osdmap caching in the leader mon -- here we see only a single osdmap in the mempool.
Overall the memory usage is very low on our mons.

[16:36][root@p05517715y58557 (production:ceph/beesly/mon*2:) ~]# ceph daemon mon.`hostname -s` dump_mempools
{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_cache_data": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_cache_onode": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_cache_other": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_writing_deferred": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_writing": {
                "items": 0,
                "bytes": 0
            },
            "bluefs": {
                "items": 0,
                "bytes": 0
            },
            "buffer_anon": {
                "items": 1779,
                "bytes": 36199481
            },
            "buffer_meta": {
                "items": 8,
                "bytes": 512
            },
            "osd": {
                "items": 0,
                "bytes": 0
            },
            "osd_mapbl": {
                "items": 0,
                "bytes": 0
            },
            "osd_pglog": {
                "items": 0,
                "bytes": 0
            },
            "osdmap": {
                "items": 23486,
                "bytes": 608920
            },
            "osdmap_mapping": {
                "items": 293744,
                "bytes": 2400928
            },
            "pgmap": {
                "items": 9670,
                "bytes": 125168
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 328687,
            "bytes": 39335009
        }
    }
}
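
To make a dump like this easier to read, a small script can rank the non-empty pools by size. The sketch below assumes the JSON layout shown above (`mempool` / `by_pool` / `total`); `rank_mempools` is a hypothetical helper name, and the sample values are copied from the dump with the empty pools left out.

```python
import json


def rank_mempools(dump_text):
    """Return (name, bytes, items) for every non-empty pool, largest first."""
    pools = json.loads(dump_text)["mempool"]["by_pool"]
    used = [
        (name, pool["bytes"], pool["items"])
        for name, pool in pools.items()
        if pool["bytes"] > 0
    ]
    return sorted(used, key=lambda row: row[1], reverse=True)


# Non-empty pools copied from the dump above.
sample = json.dumps({
    "mempool": {
        "by_pool": {
            "buffer_anon": {"items": 1779, "bytes": 36199481},
            "buffer_meta": {"items": 8, "bytes": 512},
            "osdmap": {"items": 23486, "bytes": 608920},
            "osdmap_mapping": {"items": 293744, "bytes": 2400928},
            "pgmap": {"items": 9670, "bytes": 125168},
        },
        "total": {"items": 328687, "bytes": 39335009},
    }
})

for name, nbytes, items in rank_mempools(sample):
    print(f"{name:16s} {nbytes:>10d} bytes {items:>7d} items")
```

On this sample, `buffer_anon` accounts for most of the roughly 39 MB total, while `osdmap` holds about 0.6 MB.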

perf top shows this when ceph-mon is at 100% CPU:

  33.34%                             [.] __memcmp_sse4_1
  17.98%                        [.] 0x000000000003961c
  13.35%                             [.] __memcpy_ssse3_back
  11.68%                        [.] 0x000000000005d7ec
   5.99%  [kernel]                                 [k] clear_page_c_e
   5.24%                        [.] 0x000000000005d7f4
   4.83%  [kernel]                                 [k] perf_event_task_tick
   3.60%  [kernel]                                 [k] avtab_search_node
   1.06%                      [.] std::_Rb_tree<snapid_t, std::pair<snapid_t const, sn
   0.73%  [kernel]                                 [k] nmi
   0.54%                        [.] 0x000000000005d88d
   0.47%                        [.] 0x000000000005111d
   0.44%  [kernel]                                 [k] copy_pte_range
   0.28%                           [.] 0x0000000000010f1c
   0.19%  [kernel]                                 [k] page_add_new_anon_rmap
   0.09%  [kernel]                                 [k] worker_thread
   0.04%                      [.] std::string::compare
   0.03%                               [.] strchr
   0.01%                        [.] 0x0000000000009855
   0.01%  [kernel]                                 [k] __put_user_4
   0.01%                        [.] 0x0000000000038db8
   0.01%                      [.] ceph::buffer::list::append
   0.01%                      [.] ceph::buffer::ptr::append
   0.01%                        [.] 0x000000000005d8a5

#6 Updated by Wido den Hollander over 3 years ago

I also posted this on the mailing list, but let me post it here as well:

I can chime in here: I had this happen to a customer as well.

Compact did not work.

Some background:

5 Monitors and the DBs were ~350M in size. They upgraded one MON from
13.2.6 to 13.2.8 and that caused one MON (sync source) to eat 100% CPU.

The logs showed that the upgraded MON (which was restarted) was in the
synchronizing state.

Because they had 5 MONs they now had 3 left so the cluster kept running.

I left this for about 5 minutes, but it never synced.

I tried a compact, didn't work either.

Eventually I stopped one MON, tarballed its database and used that to
bring back the MON which was upgraded to 13.2.8

That worked without any hiccups. The MON joined again within a few seconds.

#7 Updated by Neha Ojha over 3 years ago

  • Related to Bug #44453: mon: fix/improve mon sync over small keys added

#8 Updated by Dan van der Ster over 3 years ago

Workaround in our case is: `ceph config set mon mon_sync_max_payload_size 4096`
We have 5 mons again!
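
For anyone who wants to script this, here is a rough sketch that applies the setting above and then polls until the expected number of monitors is in quorum. It relies on `ceph quorum_status` and its `quorum_names` field; the helper names (`ceph`, `wait_for_quorum`, `apply_sync_workaround`) and the timeout and interval defaults are assumptions made for illustration. Note that, as reported at the top of this ticket, ceph commands may themselves block while a monitor is stuck.

```python
import json
import subprocess
import time


def ceph(*args):
    """Run a ceph CLI command and return its stdout."""
    result = subprocess.run(
        ["ceph", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def wait_for_quorum(expected, run=ceph, timeout=600, interval=5, sleep=time.sleep):
    """Poll `ceph quorum_status` until at least `expected` mons are in quorum."""
    deadline = time.monotonic() + timeout
    while True:
        status = json.loads(run("quorum_status", "--format", "json"))
        names = status["quorum_names"]
        if len(names) >= expected:
            return names
        if time.monotonic() >= deadline:
            raise TimeoutError(f"only {len(names)} of {expected} mons in quorum")
        sleep(interval)


def apply_sync_workaround(expected_mons, payload_size=4096, run=ceph, **wait_kwargs):
    """Lower mon_sync_max_payload_size, then wait for the quorum to fill up."""
    run("config", "set", "mon", "mon_sync_max_payload_size", str(payload_size))
    return wait_for_quorum(expected_mons, run=run, **wait_kwargs)
```

In our setup the call would be `apply_sync_workaround(5)`; the 600s timeout is an arbitrary default, not a value taken from this cluster.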

#9 Updated by Wout van Heeswijk over 2 years ago

Am I correct in thinking this is resolved for Nautilus by backport #44464 and that it is not going to be backported to Mimic?

After several hours of problems with one monitor, I've used the `ceph config set mon mon_sync_max_payload_size 4096` fix successfully on a Nautilus 14.2.9 cluster. The monitor then joined quorum after a few seconds. We were considering the fix @wido suggested, as we have successfully executed that before.
