Bug #21399
closed
ceph-mgr module(s) inaccessible after a reboot
Added by Bara Ancincova over 6 years ago.
Updated about 6 years ago.
Description
I've enabled and secured the RESTful API and verified that I can access its web UI at https://<active-mgr-node-ip-address>:8003/doc.
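For reference, enabling and securing the module was the standard few commands (the key name below is only an example):

    ceph mgr module enable restful
    ceph restful create-self-signed-cert
    ceph restful create-key admin

8003 is the module's default port, so no extra port configuration was needed.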
After a reboot, I tried to access the UI again, but the page just kept loading forever. When I restarted ceph-mgr, I was able to access the page as expected.
I'm attaching the ceph-mgr log if that helps.
ceph version 12.1.2-1.el7cp (b661348f156f148d764b998b65b90451f096cb27) luminous (rc)
ceph-mgr-12.1.2-1.el7cp.x86_64
- Description updated
Hmm, I can't see much from that log -- it looks like it's from a single run of the mgr rather than spanning the reboots?
If this is reproducible, then please could you set "debug mgr = 20" in the mgr's ceph.conf, and try to get a log that shows all three mgr lifetimes (before the reboot, after the reboot, after the restart) - thanks!
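For reference, that means adding the following to ceph.conf on the mgr node; it is picked up the next time the daemon starts:

    [mgr]
        debug mgr = 20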
John, this is reproducible; you just need to deploy a cluster with Ansible and reboot the machine. For some reason the mgr daemon is started in a strange way after the reboot: ps -A lists the ceph-mgr process as "exe" (ps aux shows the proper name). After restarting the mgr, everything works again and ps -A lists the process as ceph-mgr as usual.
The ps output when this is happening:
7-1 ps -A|grep ceph
1126 ? 00:00:01 ceph-mon
1145 ? 00:00:00 ceph-osd
7-1 ps aux|grep ceph
ceph 1126 1.3 1.5 439308 29912 ? Ssl 10:38 0:01 /usr/bin/ceph-mon -f --cluster ceph --id node1 --setuser ceph --setgroup ceph
ceph 1129 0.8 0.7 362716 13736 ? Ssl 10:38 0:00 /usr/bin/ceph-mgr -f --cluster ceph --id node1 --setuser ceph --setgroup ceph
ceph 1145 0.3 1.3 752680 25424 ? Ssl 10:38 0:00 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
root 2695 0.0 0.0 112648 992 pts/0 R+ 10:40 0:00 grep --color=auto ceph
7-1 ps -A|grep 1129
1129 ? 00:00:01 exe
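Side note (not part of the original comment): the name in the ps -A CMD column comes from the process's comm field, while ps aux prints the full command line, which is why the two disagree here. With the PID from the output above you can compare them directly:

    cat /proc/1129/comm        # prints "exe" while the problem is present
    readlink /proc/1129/exe    # expected to still point at the real /usr/bin/ceph-mgr binary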
Good to know it's reproducible. If I can get the log from the existing instance where this is happening, that would save me the effort of setting up another environment myself.
The `exe` thing is a quick fix (http://tracker.ceph.com/issues/21404); is there any indication that it's related to this issue?
I think it might be related. If the module is not being killed properly before the respawn, then its socket can remain open/active, and that will block the module from starting on respawn. Are you calling shutdown on the modules when doing the respawn?
The 'exe' thing is really just the process naming - even if there is an issue of sockets not getting torn down, that would not be related to the process/thread name.
The question in my mind is why ceph-mgr appears to have respawned immediately after a reboot -- the debug logs might give me a clue.
- Priority changed from Normal to High
Sorry for the delay; I have been playing with this a bit. I am not saying it is related to the executable name being exe, but to the fact that the respawn occurred. I suspect that we do not call the modules' shutdown method when doing a respawn. While playing with this I was able to hit 'Address already in use' tracebacks, which suggests that the previous server was not destroyed properly.
Also, I tend to see this behaviour only when the executable name shows as 'exe'. If it shows as a regular executable name (and hence the respawn probably did not occur), everything seems to work fine.
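To illustrate the failure mode described above, here is a minimal sketch in plain Python (not mgr code): if the previous server socket is never closed because the module's shutdown hook does not run before the respawn, the new instance cannot bind the same port until the old socket goes away.

    import errno
    import socket

    # Stand-in for the listening socket of the previous server instance.
    # The restful module defaults to port 8003; an ephemeral port is used
    # here just to keep the sketch self-contained.
    old_server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    old_server.bind(("127.0.0.1", 0))
    old_server.listen(5)
    port = old_server.getsockname()[1]

    # The respawned module tries to bring its server up on the same port.
    new_server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        new_server.bind(("127.0.0.1", port))
    except socket.error as e:
        assert e.errno == errno.EADDRINUSE
        print("Address already in use: the previous server was never shut down")

    # What a proper shutdown would have done: close the old socket first,
    # after which the new instance binds cleanly.
    old_server.close()
    new_server.bind(("127.0.0.1", port))
    new_server.listen(5)
    print("bound cleanly once the old server socket was closed")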
- Project changed from Ceph to mgr
- Status changed from New to Closed