Bug #45006
ceph-mgr runs on inactive node
Description
I have a cluster with Ubuntu 19.10 and Ceph 14.2.8. It ran fine for a while, but since it is my test cluster on my laptop, it gets shut down and restarted frequently.
A couple of days ago, while restarting the cluster node by node, I noticed that ceph -s reported the mgr as active on a node that was not up yet, with its run time counting up.
According to the documentation, the active ceph-mgr should automatically be replaced by a standby if it does not send a beacon within the configured timeout. Since the node the supposed mgr is running on is down, it cannot send a beacon, yet the mons do not replace it.
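For reference, this is roughly how I check the mgr map and the beacon timeout; mon_mgr_beacon_grace is the standard option name, and the defaults apply in my setup:

    # Show the active mgr and the available standbys as the mons see them
    ceph mgr dump | grep -E '"active_name"|"standbys"'
    # Grace period after which the mons should declare the active mgr dead
    ceph config get mon mon_mgr_beacon_grace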
When I start the node that the supposed ceph-mgr is on, the mgr goes into a crash loop: it is marked as (active, starting since X seconds) and never leaves that state. In the log I see the same sequence repeating endlessly (see the attached ceph-mgr_failure.log).
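To watch the crash loop I follow the daemon's journal; the unit name below assumes the standard systemd layout with mgr id node3:

    # Follow the mgr daemon log across its restarts
    journalctl -u ceph-mgr@node3 -f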
Furthermore, the mgr leaves its socket file (/var/run/ceph/ceph-mgr.node3.asok) in place when it dies, so on the next restart it complains that it cannot create the admin socket. Removing the file by hand lets it start again, but does not fix the underlying problem.
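The manual workaround, for the record (paths and names as in my setup):

    # Remove the stale admin socket left behind by the crashed mgr,
    # then restart the daemon
    rm /var/run/ceph/ceph-mgr.node3.asok
    systemctl restart ceph-mgr@node3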
I tried starting the mgr and manually failing it over, without success, as the mgr never seems to reach a state where it actually listens on the socket.
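The failover attempt looked like this (node3 is the mgr id in my cluster):

    # Ask the mons to mark the current active mgr as failed
    # so a standby can take over
    ceph mgr fail node3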
I tried tracing the ceph-mgr, but have not found anything pertinent yet. I am attaching the strace output and the stdout captured during the strace run; they look the same for every run I try. The process runs for about 100 seconds before it fails and restarts.
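Roughly how I ran the trace; the exact flags for the attached run may have differed slightly:

    # Run the mgr in the foreground under strace, following child
    # threads, with timestamps, writing the trace to a file
    strace -f -tt -o ceph-mgr.strace ceph-mgr -f -i node3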
I initially suspected the dashboard module, but disabling it made no difference.
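For completeness, this is how I excluded the dashboard:

    # Disable the dashboard module to rule it out as the cause
    ceph mgr module disable dashboard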
I am going to leave the cluster alone for now so I can provide more data if requested.
Files