Bug #42830
open
problem returning mon to cluster
Description
As discussed on the list: https://www.spinics.net/lists/ceph-users/msg55977.html
After rebooting one of the nodes, starting its monitor makes the whole cluster
appear to hang, including IO, ceph -s, etc. When this mon is stopped again,
everything continues. Spawning a new monitor leads to the same problem
(even on a different node).
All cluster nodes are CentOS 7 machines. I have 3 monitors (so 2 are now running) and I'm
using ceph 13.2.6. The monitor database is not very large, ~65MB. None of the cluster machines is overloaded.
Update: after some discussion on the list, I was able to work around this by setting the mon lease timeout to 50s, waiting for the monitor to join the cluster, and then setting it back to 5s. The mon took hours to connect, by the way! Once it was OK, stopping and starting it works without a flaw.
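For anyone who wants to script the workaround above, here is a rough Python sketch of the three steps (raise the lease, wait for the mon to show up in quorum, restore the lease). It rests on assumptions that are not confirmed in this report: that "mon lease timeout" means the mon_lease option, that the value is changed with `ceph config set`, that the change takes effect without restarting the monitors, and that quorum membership can be read from the quorum_names field of `ceph quorum_status`. The function names and the polling interval are made up for illustration.

```python
import json
import subprocess
import time

# Assumption: the "mon lease timeout" in this report is the `mon_lease`
# option. The report does not name the option explicitly.
LEASE_OPTION = "mon_lease"


def lease_command(seconds):
    """Build the `ceph config set` command that sets the lease for all mons."""
    return ["ceph", "config", "set", "mon", LEASE_OPTION, str(seconds)]


def mon_in_quorum(quorum_status_json, mon_name):
    """Return True if mon_name is listed in `ceph quorum_status` JSON output."""
    status = json.loads(quorum_status_json)
    return mon_name in status.get("quorum_names", [])


def rejoin_with_long_lease(mon_name, long_lease=50, normal_lease=5,
                           poll=30, cmd_timeout=120):
    """Raise the lease, wait until mon_name joins quorum, restore the lease."""
    subprocess.run(lease_command(long_lease), check=True)
    try:
        while True:
            try:
                out = subprocess.run(
                    ["ceph", "quorum_status", "--format", "json"],
                    check=True, capture_output=True, text=True,
                    timeout=cmd_timeout,
                ).stdout
                if mon_in_quorum(out, mon_name):
                    return
            except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
                # The cluster may be slow to answer while the mon is joining.
                pass
            time.sleep(poll)
    finally:
        # Always put the lease back, even if the wait is interrupted.
        subprocess.run(lease_command(normal_lease), check=True)
```

Calling rejoin_with_long_lease("mon3") (mon3 being a placeholder name) after starting the stuck monitor would then block until it joins. Since the join took hours here, the loop has no overall deadline on purpose; interrupting it still restores the 5s lease through the finally clause.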
I'm quite sure there is no network issue here, and since this first case we have been hit by it on another cluster.
The probably good news is that I was able to reproduce this problem by creating the same test environment in VMs, with the same hostnames, addresses and ceph version, and copied monitor data. So if anyone is interested, we can give SSH access or the exact steps and data to reproduce.
If I can provide more data, please let me know. I'm also attaching ceph-mon.log with debug_mon set to 10/10.
Files