Bug #21399
closed
ceph-mgr module(s) inaccessible after a reboot
Added by Bara Ancincova over 6 years ago.
Updated about 6 years ago.
Description
I've enabled and secured the RESTful API and verified that I can access its web UI at https://<active-mgr-node-ip-address>:8003/doc.
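For reference, enabling and securing the module was the standard few commands (the key name below is only an example):

    ceph mgr module enable restful
    ceph restful create-self-signed-cert
    ceph restful create-key admin

8003 is the module's default port, so no extra port configuration was needed.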
After a reboot, I tried to access the UI again, but the page just kept loading forever. When I restarted ceph-mgr, I was able to access the page as expected.
I'm attaching the ceph-mgr log if that helps.
ceph version 12.1.2-1.el7cp (b661348f156f148d764b998b65b90451f096cb27) luminous (rc)
ceph-mgr-12.1.2-1.el7cp.x86_64
- Description updated
Hmm, I can't see much from that log -- it looks like it's from a single run of the mgr rather than spanning the reboots?
If this is reproducible, then please could you set "debug mgr = 20" in the mgr's ceph.conf, and try to get a log that shows all three mgr lifetimes (before the reboot, after the reboot, after the restart) - thanks!
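For reference, that means adding the following to ceph.conf on the mgr node; it is picked up the next time the daemon starts:

    [mgr]
        debug mgr = 20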
John, this is reproducible; you just need to deploy a cluster with Ansible and reboot the machine. For some reason the mgr daemon is started in a strange way after the reboot: ps -A lists the ceph-mgr process as "exe" (ps aux shows the proper name). After restarting the mgr, everything works again and ps -A lists the process as ceph-mgr as usual.
The ps output when this is happening:
7-1 ps -A|grep ceph
1126 ? 00:00:01 ceph-mon
1145 ? 00:00:00 ceph-osd
7-1 ps aux|grep ceph
ceph 1126 1.3 1.5 439308 29912 ? Ssl 10:38 0:01 /usr/bin/ceph-mon -f --cluster ceph --id node1 --setuser ceph --setgroup ceph
ceph 1129 0.8 0.7 362716 13736 ? Ssl 10:38 0:00 /usr/bin/ceph-mgr -f --cluster ceph --id node1 --setuser ceph --setgroup ceph
ceph 1145 0.3 1.3 752680 25424 ? Ssl 10:38 0:00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
root 2695 0.0 0.0 112648 992 pts/0 R+ 10:40 0:00 grep --color=auto ceph
7-1 ps -A|grep 1129
1129 ? 00:00:01 exe
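Side note (not part of the original comment): the name in the ps -A CMD column comes from the process's comm field, while ps aux prints the full command line, which is why the two disagree here. With the PID from the output above you can compare them directly:

    cat /proc/1129/comm        # prints "exe" while the problem is present
    readlink /proc/1129/exe    # expected to still point at the real /usr/bin/ceph-mgr binary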
Good to know it's reproducible. If I can get the log from the existing instance where this is happening, that would save me the effort of setting up another environment myself.
The `exe` thing is a quick fix (http://tracker.ceph.com/issues/21404); is there any indication that it's related to this issue?
I think it might be related. If the module is not being killed properly before the respawn, then its socket can remain open/active, and that will block the module from starting on respawn. Are you calling shutdown on the modules when doing the respawn?
The 'exe' thing is really just the process naming - even if there is an issue of sockets not getting torn down, that would not be related to the process/thread name.
The question in my mind is why ceph-mgr appears to have respawned immediately after a reboot -- the debug logs might give me a clue.
- Priority changed from Normal to High
Sorry for the delay; I have been playing with this a bit. I am not saying it is related to the executable name being exe, but to the fact that the respawn occurred. I suspect that we do not call the modules' shutdown method when doing a respawn. While playing with this I was able to hit 'Address already in use' tracebacks, which suggests that the previous server was not destroyed properly.
Also, I tend to see this behaviour only when the executable name shows as 'exe'. If it shows as a regular executable name (and hence the respawn probably did not occur), everything seems to work fine.
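To illustrate the failure mode described above, here is a minimal sketch in plain Python (not mgr code): if the previous server socket is never closed because the module's shutdown hook does not run before the respawn, the new instance cannot bind the same port until the old socket goes away.

    import errno
    import socket

    # Stand-in for the listening socket of the previous server instance.
    # The restful module defaults to port 8003; an ephemeral port is used
    # here just to keep the sketch self-contained.
    old_server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    old_server.bind(("127.0.0.1", 0))
    old_server.listen(5)
    port = old_server.getsockname()[1]

    # The respawned module tries to bring its server up on the same port.
    new_server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        new_server.bind(("127.0.0.1", port))
    except socket.error as e:
        assert e.errno == errno.EADDRINUSE
        print("Address already in use: the previous server was never shut down")

    # What a proper shutdown would have done: close the old socket first,
    # after which the new instance binds cleanly.
    old_server.close()
    new_server.bind(("127.0.0.1", port))
    new_server.listen(5)
    print("bound cleanly once the old server socket was closed")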
- Project changed from Ceph to mgr
- Status changed from New to Closed