ceph-mgr module(s) inaccessible after a reboot
I've enabled and secured the RESTful API and verified that I can access its web UI at https://<active-mgr-node-ip-address>:8003/doc.
After a reboot, I tried to access the UI again, but the page just kept loading forever. When I restarted ceph-mgr, I was able to access the page as expected.
I'm attaching the ceph-mgr log if that helps.
ceph version 12.1.2-1.el7cp (b661348f156f148d764b998b65b90451f096cb27) luminous (rc)
#2 Updated by John Spray over 4 years ago
Hmm, I can't see much from that log -- it looks like it's from a single run of the mgr rather than spanning the reboots?
If this is reproducible, then please could you set "debug mgr = 20" in the mgr's ceph.conf, and try to get a log that shows all three mgr lifetimes (before the reboot, after the reboot, after the restart) - thanks!
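For anyone following along, the requested setting is a single line under an `[mgr]` section of ceph.conf. Below is a minimal sketch in Python that adds it to a config text; the helper name `with_mgr_debug` and the placeholder `fsid` value are made up for illustration, and a real ceph.conf may contain syntax that Python's `configparser` does not accept, so treat this as a way to see the resulting fragment rather than a tool for editing production configs.

```python
import configparser
import io


def with_mgr_debug(conf_text, level=20):
    """Return conf_text with 'debug mgr = <level>' set in the [mgr] section.

    Hypothetical helper for illustration only.
    """
    # interpolation=None so that '%' characters in values are left alone
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("mgr"):
        parser.add_section("mgr")
    parser.set("mgr", "debug mgr", str(level))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    # Placeholder config; the fsid below is not from a real cluster.
    sample = "[global]\nfsid = 00000000-0000-0000-0000-000000000000\n"
    print(with_mgr_debug(sample))
```

The mgr has to be restarted after the change for the new log level to apply from startup.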
#3 Updated by Boris Ranto over 4 years ago
John, this is reproducible: you just need to deploy a cluster with ansible and reboot the machine. For some reason, the mgr daemon is started in a weird way. The output of ps -A shows the ceph-mgr process as 'exe' (ps aux shows the proper name). After the mgr restart, everything is fine and ps -A shows the process as ceph-mgr as usual.
The ps output when this is happening:
$ ps -A|grep ceph
 1126 ?        00:00:01 ceph-mon
 1145 ?        00:00:00 ceph-osd
$ ps aux|grep ceph
ceph  1126  1.3  1.5 439308 29912 ?      Ssl  10:38  0:01 /usr/bin/ceph-mon -f --cluster ceph --id node1 --setuser ceph --setgroup ceph
ceph  1129  0.8  0.7 362716 13736 ?      Ssl  10:38  0:00 /usr/bin/ceph-mgr -f --cluster ceph --id node1 --setuser ceph --setgroup ceph
ceph  1145  0.3  1.3 752680 25424 ?      Ssl  10:38  0:00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
root  2695  0.0  0.0 112648   992 pts/0  R+   10:40  0:00 grep --color=auto ceph
$ ps -A|grep 1129
 1129 ?        00:00:01 exe
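The two ps invocations disagree because they read different fields: on Linux, ps -A prints the kernel's short command name (/proc/<pid>/comm), while ps aux prints the full argument list (/proc/<pid>/cmdline). A minimal sketch in Python that reads both for a given PID, assuming a Linux /proc filesystem; the function name `process_names` is made up for illustration:

```python
import os
from pathlib import Path


def process_names(pid):
    """Return (comm, argv) for a PID by reading /proc (Linux only)."""
    proc = Path("/proc") / str(pid)
    # Short name kept by the kernel; this is what 'ps -A' shows in CMD.
    comm = (proc / "comm").read_text().strip()
    # NUL-separated argument vector; this is what 'ps aux' shows in COMMAND.
    raw = (proc / "cmdline").read_bytes()
    argv = [part.decode(errors="replace") for part in raw.split(b"\0") if part]
    return comm, argv


if __name__ == "__main__":
    comm, argv = process_names(os.getpid())
    print("comm:", comm)
    print("argv:", argv)
```

Running it against the mgr's PID (1129 in the output above) while the problem is present should show the mismatch between the two fields.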
#4 Updated by John Spray over 4 years ago
Good to know it's reproducible. If you could send me the log from the existing instance where this is happening, that would save me the effort of setting up another environment myself.
The `exe` thing is a quick fix (http://tracker.ceph.com/issues/21404). Is there any indication that it's related to this issue?
#5 Updated by Boris Ranto over 4 years ago
I think it might be related. If there are any issues with the module not being killed properly before respawn then the socket can remain open/active and that will block the module start on respawn. Are you calling shutdown on the modules when doing the respawn?
#6 Updated by John Spray over 4 years ago
The 'exe' thing is really just the process naming - even if there is an issue of sockets not getting torn down, that would not be related to the process/thread name.
The question in my mind is why ceph-mgr appears to have respawned immediately after a reboot -- the debug logs might give me a clue.
#7 Updated by Boris Ranto over 4 years ago
- Priority changed from Normal to High
Sorry for the delay. I have been playing with this a bit. I am not saying it is related to the executable name being 'exe', but rather to the fact that the respawn occurred. I suspect that we do not call the shutdown method for the modules when doing a respawn. I was able to hit 'Address already in use' tracebacks while playing with this, which suggests that the previous server instance was not torn down properly.
Also, I tend to see this behaviour when the exec name is just 'exe'. If it shows as a regular exec name (and hence, the respawn probably did not occur), everything seems to work fine.
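For reference, the 'Address already in use' error is what the OS returns when a second listener tries to bind an address and port that an earlier listener still holds. A minimal sketch in plain Python sockets that reproduces the error; this is not ceph-mgr or restful module code, and the function name `second_bind_fails` is made up for illustration:

```python
import errno
import socket


def second_bind_fails(host="127.0.0.1"):
    """Bind a listener, then try to bind a second socket to the same port.

    Returns True if the second bind is rejected with EADDRINUSE.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))        # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # same address while 'first' is still open
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE
        return False
    finally:
        second.close()
        first.close()


if __name__ == "__main__":
    print("second bind rejected:", second_bind_fails())
```

If the old server socket were closed before the new one is created, the second bind would succeed, which is why a missing shutdown call on respawn is a plausible suspect.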
#9 Updated by John Spray almost 4 years ago
- Status changed from New to Closed
There's no log to look at, and no further reports, so I'm going to close this.
The fix for the 'exe' thing went into 12.2.2 (https://github.com/ceph/ceph/pull/18738/commits/7e08cdf53992570d27b47d0028c698b78908ba83)