Bug #42830

open

problem returning mon to cluster

Added by Nikola Ciprich over 4 years ago. Updated over 3 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

As discussed on the list, here: https://www.spinics.net/lists/ceph-users/msg55977.html

After rebooting one of the nodes, starting its monitor makes the whole cluster
appear to hang, including I/O and commands such as ceph -s. As soon as this mon
is stopped again, everything resumes. Spawning a new monitor leads to the same
problem, even on a different node.

All cluster nodes are CentOS 7 machines. There are 3 monitors (so 2 are currently running), and I'm
using Ceph 13.2.6. The monitor database is not very large, ~65 MB. None of the cluster machines is overloaded.

Update: after some discussion on the list, I was able to work around this by setting the mon lease timeout to 50s, waiting for the monitor to join the cluster, and then setting it back to 5s. The mon took hours to connect, by the way! Once it was OK, stopping and starting it works without problems.
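The workaround above could be scripted roughly as follows. This is only a sketch under several assumptions, not the exact procedure used in the report: it assumes the "mon lease timeout" corresponds to the `mon_lease` option, that the option is changed with `ceph config set`, and that the monitor's name is known (the name `nodev1d` in the usage note is taken from the attached log filename and may not match). The helper names are hypothetical.

```python
import json
import subprocess
import time

# Assumption: the "mon lease timeout" in the report maps to this option.
MON_LEASE_OPTION = "mon_lease"


def lease_command(seconds):
    """Build the CLI call that sets the lease option for all monitors."""
    return ["ceph", "config", "set", "mon", MON_LEASE_OPTION, str(seconds)]


def mon_in_quorum(quorum_status_json, mon_name):
    """Return True if mon_name is listed in the quorum_names field."""
    status = json.loads(quorum_status_json)
    return mon_name in status.get("quorum_names", [])


def wait_for_quorum(mon_name, poll_seconds=30):
    """Poll `ceph quorum_status` until the monitor has joined the quorum."""
    while True:
        result = subprocess.run(
            ["ceph", "quorum_status", "--format", "json"],
            check=True, capture_output=True, text=True,
        )
        if mon_in_quorum(result.stdout, mon_name):
            return
        time.sleep(poll_seconds)


def rejoin_with_relaxed_lease(mon_name, relaxed=50, default=5):
    """Raise the lease, wait for the mon to join, then restore the lease."""
    subprocess.run(lease_command(relaxed), check=True)
    try:
        wait_for_quorum(mon_name)
    finally:
        # Restore the original value even if waiting is interrupted.
        subprocess.run(lease_command(default), check=True)
```

A call such as `rejoin_with_relaxed_lease("nodev1d")` would be made after starting the affected monitor. Since the join took hours in the reported case, the loop has no timeout and the original lease value is restored in a `finally` block.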

I'm quite sure there is no network issue here, and since this first case we have been hit by the same problem on another cluster.

The good news is probably that I was able to reproduce this problem by creating the same test environment in VMs, with the same hostnames, addresses, and Ceph version, and with the monitor data copied over. So if anyone is interested, we can provide SSH access or the exact steps and data to reproduce it.

If I can provide more data, please let me know. I'm also attaching ceph-mon.log with debug_mon set to 10/10.


Files

ceph-mon.nodev1d.log (190 KB) Nikola Ciprich, 11/15/2019 07:30 AM

Related issues 1 (0 open, 1 closed)

Related to RADOS - Bug #44453: mon: fix/improve mon sync over small keys (Resolved)
