Bug #42830: problem returning mon to cluster - RADOS - Ceph

Bug #42830

open

problem returning mon to cluster

Added by Nikola Ciprich over 4 years ago. Updated over 3 years ago.

Status:

New

Priority:

Normal

Assignee:

Category:

Target version:

% Done:

Source:

Tags:

Backport:

Regression:

Severity:

2 - major

Reviewed:

Affected Versions:

Ceph - v13.2.6

ceph-qa-suite:

Component(RADOS):

Monitor

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

As discussed on the ceph-users list: https://www.spinics.net/lists/ceph-users/msg55977.html

After rebooting one of the nodes, starting its monitor makes the whole cluster
appear to hang, including IO, ceph -s, etc. When this mon is stopped again,
everything continues. Trying to spawn a new monitor leads to the same problem
(even on a different node).

All cluster nodes are CentOS 7 machines. I have 3 monitors (so 2 are now running) and I'm
using Ceph 13.2.6. The monitor database is not very large, ~65MB. None of the cluster machines is overloaded.

Update: after some discussion on the list, I was able to work around this by setting the mon lease timeout to 50s, waiting for the monitor to join the cluster, and then setting it back to 5s. Joining took hours, by the way! Once the monitor was OK, stopping and starting it works without problems.

I'm quite sure there is no network issue here, and since this first case we have been hit by the same problem on another cluster.

The good news is probably that I was able to reproduce this problem by creating the same test environment in VMs, with the same hostnames, addresses and Ceph version, and with copied monitor data. If anyone is interested, we can give SSH access or the exact steps and data to reproduce.

If I can provide more data, please let me know. I'm also attaching ceph-mon.log with debug_mon set to 10/10.

Files

ceph-mon.nodev1d.log (190 KB)

Nikola Ciprich, 11/15/2019 07:30 AM

Related issues 1 (0 open — 1 closed)

Updated by Jérôme Poulin over 4 years ago

We encountered the same problem last week. After stopping a monitor service on one server in the cluster, starting it again made it cycle between peon and electing. We destroyed all of the monitors except one and re-integrated the second one, with some issues that only waiting could fix. Starting the third, even without adding it to the monmap, makes every command to the cluster block until I stop the new monitor. The monitor now stays stuck at synchronizing.

2019-11-18 16:37:57.614 7fa35f587700 1 mon.sg2crsrv02@-1(synchronizing) e45 sync_obtain_latest_monmap
2019-11-18 16:37:57.674 7fa35f587700 1 mon.sg2crsrv02@-1(synchronizing) e45 sync_obtain_latest_monmap obtained monmap e45

Updated by Jérôme Poulin over 4 years ago

I forgot, Ceph is at version 14.2.1 on our side.

Updated by Greg Farnum over 4 years ago

Project changed from Ceph to RADOS
Category deleted (~~Monitor~~)
Component(RADOS) Monitor added

Updated by Dan van der Ster about 4 years ago

Seeing the same here on 13.2.8 when starting a new, empty mon. The leader's CPU goes to 100% until an election is called, then the mons start flapping between themselves.

Updated by Dan van der Ster about 4 years ago

I noticed there is very little osdmap caching in the leader mon: here we see only a single osdmap in the mempool.
Overall, memory usage is very low on our mons.

[16:36][root@p05517715y58557 (production:ceph/beesly/mon*2:) ~]# ceph daemon mon.`hostname -s` dump_mempools
{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_cache_data": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_cache_onode": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_cache_other": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_writing_deferred": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_writing": {
                "items": 0,
                "bytes": 0
            },
            "bluefs": {
                "items": 0,
                "bytes": 0
            },
            "buffer_anon": {
                "items": 1779,
                "bytes": 36199481
            },
            "buffer_meta": {
                "items": 8,
                "bytes": 512
            },
            "osd": {
                "items": 0,
                "bytes": 0
            },
            "osd_mapbl": {
                "items": 0,
                "bytes": 0
            },
            "osd_pglog": {
                "items": 0,
                "bytes": 0
            },
            "osdmap": {
                "items": 23486,
                "bytes": 608920
            },
            "osdmap_mapping": {
                "items": 293744,
                "bytes": 2400928
            },
            "pgmap": {
                "items": 9670,
                "bytes": 125168
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 328687,
            "bytes": 39335009
        }
    }
}
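
To make the "very low memory usage" observation easier to read off, here is a minimal Python sketch (not part of the original report) that lists only the non-empty pools from `dump_mempools` output, largest first. The function name `nonempty_mempools` and the `SAMPLE` constant are illustrative; the sample is a trimmed copy of the output above.

```python
import json


def nonempty_mempools(dump_text):
    """Return (pool, items, bytes) tuples for pools with bytes > 0, largest first."""
    by_pool = json.loads(dump_text)["mempool"]["by_pool"]
    rows = [
        (name, stats["items"], stats["bytes"])
        for name, stats in by_pool.items()
        if stats["bytes"] > 0
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Trimmed copy of the dump above; most zero-byte pools are omitted.
SAMPLE = """
{"mempool": {"by_pool": {
    "bloom_filter":   {"items": 0,      "bytes": 0},
    "buffer_anon":    {"items": 1779,   "bytes": 36199481},
    "buffer_meta":    {"items": 8,      "bytes": 512},
    "osdmap":         {"items": 23486,  "bytes": 608920},
    "osdmap_mapping": {"items": 293744, "bytes": 2400928},
    "pgmap":          {"items": 9670,   "bytes": 125168}
}, "total": {"items": 328687, "bytes": 39335009}}}
"""

if __name__ == "__main__":
    for name, items, nbytes in nonempty_mempools(SAMPLE):
        print(f"{name:16s} {items:>8d} items {nbytes / 2**20:8.2f} MiB")
```

On a live monitor the same function could be fed the JSON text printed by the `ceph daemon ... dump_mempools` command shown above.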

perf top shows this when ceph-mon is at 100% CPU:

  33.34%  libc-2.17.so                             [.] __memcmp_sse4_1
  17.98%  libfreeblpriv3.so                        [.] 0x000000000003961c
  13.35%  libc-2.17.so                             [.] __memcpy_ssse3_back
  11.68%  libfreeblpriv3.so                        [.] 0x000000000005d7ec
   5.99%  [kernel]                                 [k] clear_page_c_e
   5.24%  libfreeblpriv3.so                        [.] 0x000000000005d7f4
   4.83%  [kernel]                                 [k] perf_event_task_tick
   3.60%  [kernel]                                 [k] avtab_search_node
   1.06%  libceph-common.so.0                      [.] std::_Rb_tree<snapid_t, std::pair<snapid_t const, sn
   0.73%  [kernel]                                 [k] nmi
   0.54%  libfreeblpriv3.so                        [.] 0x000000000005d88d
   0.47%  libfreeblpriv3.so                        [.] 0x000000000005111d
   0.44%  [kernel]                                 [k] copy_pte_range
   0.28%  libjq.so.1.0.4                           [.] 0x0000000000010f1c
   0.19%  [kernel]                                 [k] page_add_new_anon_rmap
   0.09%  [kernel]                                 [k] worker_thread
   0.04%  libceph-common.so.0                      [.] std::string::compare
   0.03%  ld-2.17.so                               [.] strchr
   0.01%  libfreeblpriv3.so                        [.] 0x0000000000009855
   0.01%  [kernel]                                 [k] __put_user_4
   0.01%  libfreeblpriv3.so                        [.] 0x0000000000038db8
   0.01%  libceph-common.so.0                      [.] ceph::buffer::list::append
   0.01%  libceph-common.so.0                      [.] ceph::buffer::ptr::append
   0.01%  libfreeblpriv3.so                        [.] 0x000000000005d8a5
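
As a reading aid (not part of the original comment), the per-symbol percentages can be summed per shared object to see where the time goes overall. This is a minimal Python sketch that assumes the three-column `perf top` layout shown above; `overhead_by_object` and `SAMPLE` are illustrative names, and the sample holds only the first five lines of the listing.

```python
import re
from collections import defaultdict

# Matches lines such as "  33.34%  libc-2.17.so   [.] __memcmp_sse4_1".
LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\S+)\s+\[[.k]\]\s+(.*)$")


def overhead_by_object(perf_text):
    """Sum the overhead percentage per shared object, largest first."""
    totals = defaultdict(float)
    for line in perf_text.splitlines():
        match = LINE.match(line)
        if match:
            totals[match.group(2)] += float(match.group(1))
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# First five lines of the listing above.
SAMPLE = """\
  33.34%  libc-2.17.so       [.] __memcmp_sse4_1
  17.98%  libfreeblpriv3.so  [.] 0x000000000003961c
  13.35%  libc-2.17.so       [.] __memcpy_ssse3_back
  11.68%  libfreeblpriv3.so  [.] 0x000000000005d7ec
   5.99%  [kernel]           [k] clear_page_c_e
"""

if __name__ == "__main__":
    for obj, pct in overhead_by_object(SAMPLE):
        print(f"{pct:6.2f}%  {obj}")
```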

Updated by Wido den Hollander about 4 years ago

I also posted this on the mailing list, but let me post it here as well:

I can chime in here: I had this happen to a customer as well.

Compact did not work.

Some background:

5 Monitors and the DBs were ~350M in size. They upgraded one MON from
13.2.6 to 13.2.8 and that caused one MON (sync source) to eat 100% CPU.

The logs showed that the upgraded MON (which was restarted) was in the
synchronizing state.

Because they had 5 MONs, they still had 3 left, so the cluster kept running.

I left this for about 5 minutes, but it never synced.

I tried a compact, which didn't work either.

Eventually I stopped one MON, tarballed its database and used that to
bring back the MON which was upgraded to 13.2.8.

That worked without any hiccups. The MON joined again within a few seconds.

Updated by Neha Ojha about 4 years ago

Related to Bug #44453: mon: fix/improve mon sync over small keys added

Updated by Dan van der Ster about 4 years ago

Workaround in our case is: `ceph config set mon mon_sync_max_payload_size 4096`
We have 5 mons again!
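
For anyone scripting this, here is a minimal Python sketch of the workaround with a check and a revert step. Only the `ceph config set` line comes from this thread. The follow-up `ceph config get`, `ceph quorum_status` and `ceph config rm` calls are my assumption of a sensible check and revert sequence, and the helper names are illustrative. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess

OPTION = "mon_sync_max_payload_size"


def workaround_commands(payload_size=4096):
    """Commands to apply the workaround, then check the setting and the quorum."""
    return [
        ["ceph", "config", "set", "mon", OPTION, str(payload_size)],
        # Assumed verification steps, not taken from the thread:
        ["ceph", "config", "get", "mon", OPTION],
        ["ceph", "quorum_status", "--format", "json-pretty"],
    ]


def revert_command():
    """Command to drop the override again (assumed cleanup step)."""
    return ["ceph", "config", "rm", "mon", OPTION]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("+", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(workaround_commands())  # dry run: prints the commands only
```

Call `run(workaround_commands(), dry_run=False)` on a node with admin credentials to actually apply it, and `run([revert_command()], dry_run=False)` once the monitor has joined, if you want the default back.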

Updated by Wout van Heeswijk over 3 years ago

Am I correct in thinking this is resolved for Nautilus by backport #44464 and that it is not going to be backported to Mimic?

After several hours of problems with one monitor, I've used the `ceph config set mon mon_sync_max_payload_size 4096` fix successfully on a Nautilus 14.2.9 cluster. The monitor then joined quorum after a few seconds. We were considering the fix @Wido den Hollander suggested, as we have successfully executed that before.
