Bug #21399

ceph-mgr module(s) inaccessible after a reboot

Added by Bara Ancincova about 2 years ago. Updated almost 2 years ago.

Status: Closed
Priority: High
Assignee: -
Category: -
Target version: -
Start date: 09/15/2017
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

I've enabled and secured the RESTful API and verified that I can access its web UI at https://<active-mgr-node-ip-address>:8003/doc.

After a reboot, I tried to access the UI again, but the page just kept loading forever. When I restarted ceph-mgr, I was able to access the page as expected.

I'm attaching the ceph-mgr log if that helps.

ceph version 12.1.2-1.el7cp (b661348f156f148d764b998b65b90451f096cb27) luminous (rc)
ceph-mgr-12.1.2-1.el7cp.x86_64

Attachment: ceph-mgr.node2.log (10.1 KB), Bara Ancincova, 09/15/2017 10:59 AM

History

#1 Updated by Bara Ancincova about 2 years ago

  • Description updated

#2 Updated by John Spray about 2 years ago

Hmm, I can't see much from that log -- it looks like it's from a single run of the mgr rather than spanning the reboots?

If this is reproducible, then please could you set "debug mgr = 20" in the mgr's ceph.conf, and try to get a log that shows all three mgr lifetimes (before the reboot, after the reboot, after the restart) - thanks!
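
For reference, a minimal sketch of the requested setting. The `[mgr]` section name is an assumption (the comment above only says "the mgr's ceph.conf"); the option itself is quoted from the comment.

```ini
# ceph.conf on the node running ceph-mgr
# (section name assumed; the option is the one requested above)
[mgr]
debug mgr = 20
```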

#3 Updated by Boris Ranto about 2 years ago

John, this is reproducible: you just need to deploy a cluster with ansible and reboot the machine. The mgr daemon seems to be started in an unusual way. The output of ps -A lists the ceph-mgr process as exe (ps aux shows the proper name). After a mgr restart, everything works and ps -A lists the process as ceph-mgr as usual.

The ps output when this is happening:

7-1 ps -A|grep ceph
   1126 ?        00:00:01 ceph-mon
   1145 ?        00:00:00 ceph-osd
7-1 ps aux|grep ceph
ceph        1126  1.3  1.5 439308 29912 ?        Ssl  10:38   0:01 /usr/bin/ceph-mon -f --cluster ceph --id node1 --setuser ceph --setgroup ceph
ceph        1129  0.8  0.7 362716 13736 ?        Ssl  10:38   0:00 /usr/bin/ceph-mgr -f --cluster ceph --id node1 --setuser ceph --setgroup ceph
ceph        1145  0.3  1.3 752680 25424 ?        Ssl  10:38   0:00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
root        2695  0.0  0.0 112648   992 pts/0    R+   10:40   0:00 grep --color=auto ceph
7-1 ps -A|grep 1129
   1129 ?        00:00:01 exe
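
The difference between the two ps views can be checked directly in /proc: ps -A prints the kernel's short command name (comm), while ps aux prints the full argument list (cmdline). The sketch below reads both for a given PID. It is Linux-only, and the helper name `process_names` is made up for illustration. That the mgr ends up with comm "exe" because it re-executes itself through /proc/self/exe is an assumption, not something the output above proves.

```python
import os
from pathlib import Path
from typing import Tuple


def process_names(pid: int) -> Tuple[str, str]:
    """Return (comm, cmdline) for a Linux process, read from /proc.

    comm is the short name shown by `ps -A`; cmdline is the argument
    list shown by `ps aux`.
    """
    proc = Path("/proc") / str(pid)
    comm = (proc / "comm").read_text().strip()
    # cmdline is NUL-separated; join the arguments with spaces for display.
    raw = (proc / "cmdline").read_bytes()
    cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
    return comm, cmdline


if __name__ == "__main__":
    comm, cmdline = process_names(os.getpid())
    print(f"comm:    {comm}")
    print(f"cmdline: {cmdline}")
```

Running `process_names(1129)` on the affected node would be expected to show the mismatch seen above, with a comm of exe next to a cmdline starting with /usr/bin/ceph-mgr.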

#4 Updated by John Spray about 2 years ago

Good to know it's reproducible. If you could send me the log from the existing instance where this is happening, that would save me the effort of setting up another environment myself.

The `exe` thing is a quick fix (http://tracker.ceph.com/issues/21404). Is there any indication that it's related to this issue?

#5 Updated by Boris Ranto about 2 years ago

I think it might be related. If there are any issues with the module not being killed properly before respawn then the socket can remain open/active and that will block the module start on respawn. Are you calling shutdown on the modules when doing the respawn?

#6 Updated by John Spray about 2 years ago

The 'exe' thing is really just the process naming - even if there is an issue of sockets not getting torn down, that would not be related to the process/thread name.

The question in my mind is why ceph-mgr appears to have respawned immediately after a reboot -- the debug logs might give me a clue.

#7 Updated by Boris Ranto about 2 years ago

  • Priority changed from Normal to High

Sorry for the delay. I have been playing with this a bit. I am not saying it is related to the executable name being exe, but to the fact that the respawn occurred. I suspect that we do not call the shutdown method for the modules when doing a respawn. I was able to hit 'Address already in use' tracebacks while playing with this, which suggests that the previous server was not torn down properly.

Also, I tend to see this behaviour when the exec name is just 'exe'. If it shows as a regular exec name (and hence, the respawn probably did not occur), everything seems to work fine.
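
To illustrate the 'Address already in use' symptom in isolation, here is a minimal sketch using plain Python sockets. It is not the mgr or restful module code, and the names `try_listen` and `demo` are made up for illustration. It only shows the general mechanism suspected above: while an old listener still holds a port, a new server cannot bind it, and once the old one is closed the bind succeeds.

```python
import errno
import socket
from typing import Optional, Tuple

HOST = "127.0.0.1"


def try_listen(port: int) -> Optional[socket.socket]:
    """Bind and listen on HOST:port; return None if the address is in use."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((HOST, port))
        sock.listen(1)
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None
        raise
    return sock


def demo() -> Tuple[bool, bool]:
    """Return (blocked_while_old_open, ok_after_old_closed)."""
    old = try_listen(0)  # port 0: let the kernel pick a free port
    assert old is not None
    port = old.getsockname()[1]

    # The "respawned" server tries the same port while the old one is open.
    blocked = try_listen(port) is None

    # Closing the old listener stands in for a proper module shutdown.
    old.close()
    new = try_listen(port)
    ok = new is not None
    if new is not None:
        new.close()
    return blocked, ok


if __name__ == "__main__":
    print(demo())
```

If the modules are not shut down before a respawn, the situation would correspond to the first `try_listen(port)` call in `demo`, where the old listener is still open.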

#8 Updated by Patrick Donnelly almost 2 years ago

  • Project changed from Ceph to mgr

#9 Updated by John Spray almost 2 years ago

  • Status changed from New to Closed

There's no log to look at, and no further reports, so I'm going to close this.

The fix for the 'exe' thing went into 12.2.2 (https://github.com/ceph/ceph/pull/18738/commits/7e08cdf53992570d27b47d0028c698b78908ba83)
