Bug #45006 (open): ceph-mgr runs on inactive node

Added by Christian Huebner about 4 years ago. Updated about 4 years ago.

Status: New
Priority: High
Assignee: -
Category: ceph-mgr
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: -
Regression: No
Severity: 2 - major
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

I have a cluster running Ubuntu 19.10 and Ceph 14.2.8. It ran fine for a while, but it gets shut down and restarted a lot, as it is my test cluster on my laptop.

A couple of days ago I found that, when restarting the cluster node by node, ceph -s reports the mgr on a node that is not up yet. In ceph -s I see its run time counting up, and the mgr is listed as active.
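
For what it is worth, the mgr map the mons advertise can be inspected with the standard command (nothing here is specific to my setup):

    ceph mgr dump | grep -E '"active_name"|"available"'

which shows which daemon the mons currently consider the active mgr and whether they consider it available.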

According to the documentation, the ceph-mgr should automatically be replaced by a standby if it does not send a beacon within the required timeout. Since the node the supposed mgr is running on is down, it cannot send a beacon, yet the Ceph mons do not replace the mgr.
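
If I understand the failover mechanism correctly, the timeout in question is the mon_mgr_beacon_grace option on the mons (I believe the default is 30 seconds). It can be checked with:

    ceph config get mon mon_mgr_beacon_grace

Even long after that grace period has passed, the mons keep reporting the down mgr as active.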

When I start the node with the supposed ceph-mgr on it, the mgr goes into a crash loop. It will be marked as (active, starting since X seconds). It never leaves that state. In the log I see the same sequence repeating endlessly (see attached file ceph-mgr_failure.log).

Furthermore, the mgr leaves the socket file (/var/run/ceph/ceph-mgr.node3.asok) in place when it dies, so on the next restart it complains that it cannot create the admin socket. Erasing the file manually helps, but does not fix the problem.
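
The manual cleanup amounts to something like the following (node3 is the mgr host in my setup, and the systemd unit name is the standard one):

    rm -f /var/run/ceph/ceph-mgr.node3.asok
    systemctl restart ceph-mgr@node3

but, as noted, the daemon still ends up in the same crash loop afterwards.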

I tried starting the mgr and manually failing it over, unsuccessfully, as the mgr never appears to reach a state where it can actually listen on the socket.
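
For reference, a manual failover is normally forced with the standard command (node3 being the mgr id here):

    ceph mgr fail node3

but in this state it does not help; the mgr never leaves the (active, starting) state.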

I tried tracing the ceph-mgr but have not found anything pertinent yet. I am attaching the strace output and the stdout captured during the strace run. They look the same for every run I try. The process runs for about 100 seconds before it fails and restarts.
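
For anyone who wants to reproduce the trace: running the daemon in the foreground under strace along these lines should produce equivalent output (the id node3 is from my setup, the output file name is arbitrary, and the flags mirror the standard systemd unit):

    strace -f -tt -o ceph-mgr_trace.txt \
        /usr/bin/ceph-mgr -f --cluster ceph --id node3 --setuser ceph --setgroup ceph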

I initially suspected the dashboard module, but excluding it has not made a difference.

I am going to leave the cluster alone for now so I can provide more data if requested.


Files

ceph-mgr_failure.log (11.3 KB), Christian Huebner, 04/08/2020 11:52 PM
ceph-mgr_failure_trace.stdout.txt (1.68 KB), Christian Huebner, 04/09/2020 12:25 AM
crushmap.out (2.15 KB), Christian Huebner, 04/09/2020 12:25 AM
ceph-mgr_failure_tracefile.txt (371 KB), Christian Huebner, 04/09/2020 12:25 AM
#1

Updated by Neha Ojha about 4 years ago

  • Priority changed from Normal to High