Bug #21399: ceph-mgr module(s) inaccessible after a reboot - mgr - Ceph

Added by Bara Ancincova over 6 years ago. Updated about 6 years ago.

Status: Closed
Priority: High
Severity: 3 - minor

Description

I've enabled and secured the RESTful API and verified that I can access its web UI at https://<active-mgr-node-ip-address>:8003/doc.

After a reboot, I tried to access the UI again, but the page just kept loading forever. When I restarted ceph-mgr, I was able to access the page as expected.

I'm attaching the ceph-mgr log if that helps.

ceph version 12.1.2-1.el7cp (b661348f156f148d764b998b65b90451f096cb27) luminous (rc)
ceph-mgr-12.1.2-1.el7cp.x86_64

Files

ceph-mgr.node2.log (10.1 KB)

Bara Ancincova, 09/15/2017 10:59 AM

Updated by Bara Ancincova over 6 years ago

Description updated

Updated by John Spray over 6 years ago

Hmm, I can't see much from that log -- it looks like it's from a single run of the mgr rather than spanning the reboots?

If this is reproducible, then please could you set "debug mgr = 20" in the mgr's ceph.conf, and try to get a log that shows all three mgr lifetimes (before the reboot, after the reboot, after the restart) - thanks!
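
For reference, the setting requested above would normally go in the `[mgr]` section of the mgr node's ceph.conf. This is a minimal sketch of that fragment only; the file location (commonly /etc/ceph/ceph.conf) and the section placement are assumptions, since the thread does not spell them out.

```ini
# Assumed location: /etc/ceph/ceph.conf on the node running ceph-mgr
[mgr]
# Verbose mgr logging, as requested for capturing all three mgr lifetimes
debug mgr = 20
```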

Updated by Boris Ranto over 6 years ago

John, this is reproducible: you just need to deploy a cluster with Ansible and reboot the machine. For some reason, the mgr daemon is started in an unusual way: ps -A reports the ceph-mgr process as exe, while ps aux shows the proper name. After the mgr restart, everything is fine and ps -A reports the process as ceph-mgr as usual.

The ps output when this is happening:

7-1 ps -A|grep ceph
   1126 ?        00:00:01 ceph-mon
   1145 ?        00:00:00 ceph-osd
7-1 ps aux|grep ceph
ceph        1126  1.3  1.5 439308 29912 ?        Ssl  10:38   0:01 /usr/bin/ceph-mon -f --cluster ceph --id node1 --setuser ceph --setgroup ceph
ceph        1129  0.8  0.7 362716 13736 ?        Ssl  10:38   0:00 /usr/bin/ceph-mgr -f --cluster ceph --id node1 --setuser ceph --setgroup ceph
ceph        1145  0.3  1.3 752680 25424 ?        Ssl  10:38   0:00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
root        2695  0.0  0.0 112648   992 pts/0    R+   10:40   0:00 grep --color=auto ceph
7-1 ps -A|grep 1129
   1129 ?        00:00:01 exe
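
The difference between the two ps outputs above comes from where each one reads the name: by default, ps -A prints the kernel's short command name, while ps aux prints the full argument list. On Linux, both are exposed under /proc. The sketch below reads them for a given PID; the helper name `process_names` is hypothetical and not part of Ceph. A process that is re-executed through a path whose last component is `exe` (for example /proc/self/exe) ends up with the short name `exe`, which is one plausible explanation for the output above, though the thread does not confirm it.

```python
from __future__ import annotations

import os
from pathlib import Path


def process_names(pid: int) -> tuple[str, str]:
    """Return (comm, cmdline) for a Linux process, read from /proc.

    Hypothetical helper for illustration; requires a Linux /proc filesystem.
    """
    proc = Path("/proc") / str(pid)
    # comm: the kernel's short task name, which is what `ps -A` prints.
    comm = (proc / "comm").read_text().strip()
    # cmdline: the NUL-separated argv, which is what `ps aux` prints.
    raw = (proc / "cmdline").read_bytes()
    cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
    return comm, cmdline


if __name__ == "__main__":
    comm, cmdline = process_names(os.getpid())
    print(f"comm:    {comm}")
    print(f"cmdline: {cmdline}")
```

Calling `process_names(1129)` on the affected node would be expected to show the mismatch directly, assuming the mgr PID from the output above.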

Updated by John Spray over 6 years ago

Good to know it's reproducible. If I could get the log from the existing instance where this is happening, that would save me the effort of setting up another environment myself.

The `exe` thing is a quick fix (http://tracker.ceph.com/issues/21404). Is there any indication that it's related to this issue?

Updated by Boris Ranto over 6 years ago

I think it might be related. If there are any issues with the module not being killed properly before respawn then the socket can remain open/active and that will block the module start on respawn. Are you calling shutdown on the modules when doing the respawn?

Updated by John Spray over 6 years ago

The 'exe' thing is really just the process naming - even if there is an issue of sockets not getting torn down, that would not be related to the process/thread name.

The question in my mind is why ceph-mgr appears to have respawned immediately after a reboot -- the debug logs might give me a clue.

Updated by Boris Ranto over 6 years ago

Priority changed from Normal to High

Sorry for the delay. I have been playing with this a bit. I am not saying the problem is caused by the executable name being exe, but by the fact that a respawn occurred. I suspect that we do not call the shutdown method for the modules when doing a respawn. I was able to hit 'Address already in use' tracebacks while playing with this, which suggests that the server was not torn down properly beforehand.

Also, I tend to see this behaviour when the exec name is just 'exe'. If it shows as a regular exec name (and hence, the respawn probably did not occur), everything seems to work fine.
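
The 'Address already in use' symptom described above can be reproduced in isolation. The following is a minimal sketch using plain Python sockets, not the mgr module code: a second listener cannot bind a port while the first one is still open, and it can once the first is closed. The helper names `try_bind` and `demo` are hypothetical, and the loopback address is an assumption made to keep the example local.

```python
import errno
import socket


def try_bind(port: int, host: str = "127.0.0.1") -> socket.socket:
    """Bind and listen on (host, port); close the socket and re-raise on failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


def demo() -> tuple:
    """Return (blocked, rebound).

    blocked: a second bind failed with EADDRINUSE while the first was open.
    rebound: the same port could be bound again after the first was closed.
    """
    first = try_bind(0)  # port 0 lets the OS pick a free port
    port = first.getsockname()[1]

    # Stand-in for a server that was never shut down before the restart.
    try:
        try_bind(port).close()
        blocked = False
    except OSError as exc:
        blocked = exc.errno == errno.EADDRINUSE

    # Stand-in for a clean shutdown: the listener is closed first.
    first.close()
    try:
        try_bind(port).close()
        rebound = True
    except OSError:
        rebound = False
    return blocked, rebound


if __name__ == "__main__":
    print(demo())
```

If the old listener in the mgr case is never closed before the respawn, the new module instance would be in the position of the second `try_bind` call here.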
