Bug #45006
ceph-mgr runs on inactive node
Description
I have a cluster with Ubuntu 19.10 and Ceph 14.2.8. It ran fine for a while, but since it is my test cluster on my laptop, it gets shut down and restarted frequently.
A couple of days ago, while restarting the cluster node by node, I noticed that ceph -s reported the mgr as active on a node that was not up yet, with its run time counting up.
According to the documentation, the active ceph-mgr should automatically be replaced by a standby if it does not send a beacon within the configured timeout. Since the node the supposed mgr is running on is down, it cannot send a beacon, yet the mons do not replace it.
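For reference, this is roughly how I check the mgr map and the beacon timeout; mon_mgr_beacon_grace is the standard option name, and the defaults apply in my setup:

    # Show the active mgr and the available standbys as the mons see them
    ceph mgr dump | grep -E '"active_name"|"standbys"'
    # Grace period after which the mons should declare the active mgr dead
    ceph config get mon mon_mgr_beacon_grace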
When I start the node that the supposed ceph-mgr is on, the mgr goes into a crash loop: it is marked as (active, starting since X seconds) and never leaves that state. In the log I see the same sequence repeating endlessly (see the attached ceph-mgr_failure.log).
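To watch the crash loop I follow the daemon's journal; the unit name below assumes the standard systemd layout with mgr id node3:

    # Follow the mgr daemon log across its restarts
    journalctl -u ceph-mgr@node3 -f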
Furthermore, the mgr leaves its socket file (/var/run/ceph/ceph-mgr.node3.asok) in place when it dies, so on the next restart it complains that it cannot create the admin socket. Removing the file by hand lets it start again, but does not fix the underlying problem.
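The manual workaround, for the record (paths and names as in my setup):

    # Remove the stale admin socket left behind by the crashed mgr,
    # then restart the daemon
    rm /var/run/ceph/ceph-mgr.node3.asok
    systemctl restart ceph-mgr@node3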
I tried starting the mgr and manually failing it over, without success, as the mgr never seems to reach a state where it actually listens on the socket.
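The failover attempt looked like this (node3 is the mgr id in my cluster):

    # Ask the mons to mark the current active mgr as failed
    # so a standby can take over
    ceph mgr fail node3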
I tried tracing the ceph-mgr, but have not found anything pertinent yet. I am attaching the strace output and the stdout captured during the strace run; they look the same for every run I try. The process runs for about 100 seconds before it fails and restarts.
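Roughly how I ran the trace; the exact flags for the attached run may have differed slightly:

    # Run the mgr in the foreground under strace, following child
    # threads, with timestamps, writing the trace to a file
    strace -f -tt -o ceph-mgr.strace ceph-mgr -f -i node3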
I initially suspected the dashboard module, but disabling it made no difference.
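For completeness, this is how I excluded the dashboard:

    # Disable the dashboard module to rule it out as the cause
    ceph mgr module disable dashboard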
I am going to leave the cluster alone for now so I can provide more data if requested.
Files