Bug #49693

open

Manager daemon is unresponsive, replacing it with standby daemon

Added by Gunther Heinrich about 3 years ago. Updated about 1 month ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
ceph-mgr
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I noticed that on the cluster the active mgr daemon is marked as unresponsive and another mgr takes over. Currently this happens on a daily basis. journalctl doesn't show anything suspicious. A restart of the daemon via systemctl seems to help for the moment.
Here are the journalctl entries:

Mar 08 19:47:15 iz-ceph-01-mon-03 bash[1567]: cluster 2021-03-08T18:47:15.474781+0000 mon.iz-ceph-01-mon-01 (mon.0) 472871 : cluster [INF] Manager daemon iz-ceph-01-mon-03.gjmkfc is unresponsive, replacing it with standby daemon iz-ceph-01-mon-02.gfiexf
Mar 09 17:25:59 iz-ceph-01-mon-02 bash[1592]: cluster 2021-03-09T16:25:59.541147+0000 mon.iz-ceph-01-mon-01 (mon.0) 538293 : cluster [INF] Manager daemon iz-ceph-01-mon-02.gfiexf is unresponsive, replacing it with standby daemon iz-ceph-01-mon-05.exotes

The cluster runs version 15.2.7 on Ubuntu 20.04.2.
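
For reference, a minimal sketch of the workaround described above on a cephadm-managed cluster; the daemon name is taken from the log lines above, and the FSID in the unit name is a placeholder that depends on the deployment:

# Force a failover to a standby mgr instead of waiting for the monitor to do it
ceph mgr fail iz-ceph-01-mon-03.gjmkfc

# Or restart the systemd unit on the host running the active mgr
# (cephadm names units ceph-<fsid>@mgr.<host>.<id>.service; adjust to your cluster)
systemctl restart ceph-<fsid>@mgr.iz-ceph-01-mon-03.gjmkfc.service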

Actions #1

Updated by Neha Ojha about 3 years ago

This could be a result of one of the mgr modules overloading the mgr. Have you done any debugging around that?
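
A rough starting point for that kind of debugging could look like the following (standard ceph CLI; the daemon name is a placeholder based on the logs above):

# See which mgr modules are enabled on the active mgr
ceph mgr module ls

# Raise mgr log verbosity so per-module activity shows up in the logs
ceph config set mgr debug_mgr 20

# Dump perf counters from the active mgr's admin socket
# (run inside the mgr container when deployed with cephadm)
ceph daemon mgr.iz-ceph-01-mon-03.gjmkfc perf dump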

Actions #2

Updated by Gunther Heinrich about 3 years ago

I haven't started debugging it yet, but I noticed that this could be related to the dashboard.

When I keep the dashboard open for 16 hours (logging in again after 8 hours and letting it display its data until it logs the user out again), the mgr becomes unresponsive. But when I open the dashboard for only 8 hours, or not at all, the mgr problem does not occur.

The strange thing is that this did not happen while the cluster was online but not yet crunching data; I only began to move data (and IOPS) onto it a few days ago.
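
One low-impact way to test the dashboard hypothesis would be to disable the module for a while and see whether the unresponsive-mgr events stop (at the cost of losing the UI in the meantime); a minimal sketch using the standard module commands:

# Temporarily disable the dashboard module on the mgr
ceph mgr module disable dashboard

# ... once the behaviour is confirmed one way or the other, re-enable it
ceph mgr module enable dashboard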

Actions #3

Updated by Josias Montag about 3 years ago

We are regularly facing the same issue. We are using ceph version 15.2.10, deployed via ceph orch.

The mgr seems to freeze but not to crash: its logging just stops, the dashboard becomes unreachable, the mgr's CPU usage goes to 100%, and a standby mgr takes over.
These are the last log lines before the mgr freezes:

debug 2021-03-31T20:37:13.566+0000 7fe3718e0700 10 mgr.server tick 
debug 2021-03-31T20:37:13.566+0000 7fe3718e0700 15 mgr get_health_checks getting health checks for balancer
debug 2021-03-31T20:37:13.566+0000 7fe3718e0700 15 mgr get_health_checks getting health checks for cephadm
debug 2021-03-31T20:37:13.566+0000 7fe3718e0700 15 mgr get_health_checks getting health checks for crash
debug 2021-03-31T20:37:13.566+0000 7fe3718e0700 15 mgr get_health_checks getting health checks for dashboard
debug 2021-03-31T20:37:13.566+0000 7fe3718e0700 15 mgr get_health_checks getting health checks for devicehealth
debug 2021-03-31T20:37:13.566+0000 7fe3718e0700 15 mgr get_health_checks getting health checks for iostat
debug 2021-03-31T20:37:13.566+0000 7fe3718e0700 15 mgr get_health_checks getting health checks for orchestrator
debug 2021-03-31T20:37:13.566+0000 7fe3718e0700 15 mgr get_health_checks getting health checks for pg_autoscaler
debug 2021-03-31T20:37:13.566+0000 7fe3718e0700 15 mgr get_health_checks getting health checks for progress
debug 2021-03-31T20:37:13.566+0000 7fe3718e0700 15 mgr get_health_checks getting health checks for prometheus
debug 2021-03-31T20:37:13.566+0000 7fe3718e0700 15 mgr get_health_checks getting health checks for rbd_support
debug 2021-03-31T20:37:13.566+0000 7fe3718e0700 15 mgr get_health_checks getting health checks for restful
debug 2021-03-31T20:37:13.566+0000 7fe3718e0700 15 mgr get_health_checks getting health checks for status
debug 2021-03-31T20:37:13.566+0000 7fe3718e0700 15 mgr get_health_checks getting health checks for telemetry
debug 2021-03-31T20:37:13.566+0000 7fe3718e0700 15 mgr get_health_checks getting health checks for volumes
debug 2021-03-31T20:37:13.566+0000 7fe3718e0700 10 mgr update_delta_stats  v1665
debug 2021-03-31T20:37:13.582+0000 7fe3728e2700 10 mgr.server handle_report from 0x55d03b1abc00 osd.9
debug 2021-03-31T20:37:14.242+0000 7fe3945b0700 10 mgr tick tick
debug 2021-03-31T20:37:14.242+0000 7fe3945b0700 20 mgr send_beacon active
debug 2021-03-31T20:37:14.242+0000 7fe3945b0700 15 mgr send_beacon noting RADOS client for blacklist: v2:192.168.240.1:0/2356899642
debug 2021-03-31T20:37:14.242+0000 7fe3945b0700 15 mgr send_beacon noting RADOS client for blacklist: v2:192.168.240.1:0/489716832
debug 2021-03-31T20:37:14.242+0000 7fe3945b0700 15 mgr send_beacon noting RADOS client for blacklist: v2:192.168.240.1:0/4021494541
debug 2021-03-31T20:37:14.242+0000 7fe3945b0700 10 mgr send_beacon sending beacon as gid 34266

Another freeze:

debug 2021-03-31T19:41:46.752+0000 7fd95550b700 10 mgr ms_dispatch2 active service_map(e842 1 svc) v1
debug 2021-03-31T19:41:46.752+0000 7fd95550b700 10 mgr ms_dispatch2 service_map(e842 1 svc) v1
debug 2021-03-31T19:41:46.752+0000 7fd95550b700 10 mgr handle_service_map e842
debug 2021-03-31T19:41:46.752+0000 7fd95550b700 10 mgr.server operator() got updated map e842
debug 2021-03-31T19:41:46.752+0000 7fd95550b700 10 mgr notify_all notify_all: notify_all service_map
debug 2021-03-31T19:41:46.760+0000 7fd92f8b5700 10 mgr.server handle_report from 0x561b4c91f000 osd.27
debug 2021-03-31T19:41:46.760+0000 7fd92f8b5700 10 mgr.server handle_report daemon_health_metrics [SLOW_OPS(0|(0,0)),PENDING_CREATING_PGS(0|(0,0))]
debug 2021-03-31T19:41:46.776+0000 7fd92f8b5700 10 mgr.server handle_report from 0x561b4b937c00 mgr.storage2.wnzgqv
debug 2021-03-31T19:41:46.796+0000 7fd95550b700 10 mgr ms_dispatch2 active service_map(e842 1 svc) v1
debug 2021-03-31T19:41:46.796+0000 7fd95550b700 10 mgr ms_dispatch2 service_map(e842 1 svc) v1
debug 2021-03-31T19:41:46.796+0000 7fd95550b700 10 mgr handle_service_map e842
debug 2021-03-31T19:41:46.796+0000 7fd95550b700 10 mgr.server operator() got updated map e842
debug 2021-03-31T19:41:46.796+0000 7fd95550b700 10 mgr notify_all notify_all: notify_all service_map
debug 2021-03-31T19:41:46.948+0000 7fd92f8b5700 10 mgr.server handle_report from 0x561b4d441000 osd.7
debug 2021-03-31T19:41:46.948+0000 7fd92f8b5700 10 mgr.server handle_report daemon_health_metrics [SLOW_OPS(0|(0,0)),PENDING_CREATING_PGS(0|(0,0))]
debug 2021-03-31T19:41:47.132+0000 7fd92f8b5700 10 mgr.server handle_report from 0x561b4bd86800 osd.1
debug 2021-03-31T19:41:47.132+0000 7fd92f8b5700 10 mgr.server handle_report daemon_health_metrics [SLOW_OPS(0|(0,0)),PENDING_CREATING_PGS(0|(0,0))]
debug 2021-03-31T19:41:47.244+0000 7fd951503700 10 mgr tick tick
debug 2021-03-31T19:41:47.244+0000 7fd951503700 10 mgr send_beacon sending beacon as gid 34087
debug 2021-03-31T19:41:47.260+0000 7fd92f8b5700 10 mgr.server handle_report from 0x561b47890400 osd.17
debug 2021-03-31T19:41:47.260+0000 7fd92f8b5700 10 mgr.server handle_report daemon_health_metrics [SLOW_OPS(0|(0,0)),PENDING_CREATING_PGS(0|(0,0))]
debug 2021-03-31T19:41:47.344+0000 7fd92f8b5700 10 mgr.server handle_report from 0x561b49e8fc00 osd.16
debug 2021-03-31T19:41:47.344+0000 7fd92f8b5700 10 mgr.server handle_report daemon_health_metrics [SLOW_OPS(0|(0,0)),PENDING_CREATING_PGS(0|(0,0))]
debug 2021-03-31T19:41:47.508+0000 7fd92f8b5700 10 mgr.server handle_report from 0x561b4bd85c00 osd.11
debug 2021-03-31T19:41:47.508+0000 7fd92f8b5700 10 mgr.server handle_report daemon_health_metrics [SLOW_OPS(0|(0,0)),PENDING_CREATING_PGS(0|(0,0))]
debug 2021-03-31T19:41:47.648+0000 7fd92e8b3700 10 mgr.server tick 
debug 2021-03-31T19:41:47.648+0000 7fd92e8b3700 10 mgr update_delta_stats  v40764
debug 2021-03-31T19:41:47.808+0000 7fd92f8b5700 10 mgr.server handle_report from 0x561b472ce400 osd.2
debug 2021-03-31T19:41:47.948+0000 7fd958d12700 10 mgr.server ms_handle_authentication ms_handle_authentication new session 0x561b4737b0e0 con 0x561b4a9b5c00 entity osd.10 addr 
debug 2021-03-31T19:41:47.948+0000 7fd958d12700 10 mgr.server ms_handle_authentication  session 0x561b4737b0e0 osd.10 has caps profile osd 'allow profile osd'
debug 2021-03-31T19:41:50.576+0000 7fd959d14700 10 mgr.server ms_handle_authentication ms_handle_authentication new session 0x561b497e78c0 con 0x561b4a9b4000 entity osd.17 addr 
debug 2021-03-31T19:41:50.576+0000 7fd959d14700 10 mgr.server ms_handle_authentication  session 0x561b497e78c0 osd.17 has caps profile osd 'allow profile osd'

I think it is just normal logging before the freeze.

It seems to be related to using the dashboard. When the dashboard is actively used, these freezes happen every few hours. We have now tried not opening the dashboard at all, and the mgr has been running for 40 hours without issues.
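
Since the mgr pins a core at 100% while frozen, a stack dump of the hung process would probably say more than the logs. A minimal sketch, assuming py-spy and/or gdb are available on the host (or inside the mgr container) and <mgr-pid> stands for the active mgr's PID:

# Find the PID of the active ceph-mgr process on the host
pgrep -af ceph-mgr

# Dump the Python thread stacks of the hung mgr
# (the mgr embeds a CPython interpreter for its modules, including the dashboard)
py-spy dump --pid <mgr-pid>

# Alternatively, grab native thread backtraces with gdb
gdb -p <mgr-pid> --batch -ex 'thread apply all bt'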

Actions #4

Updated by Matthew Hutchinson about 2 months ago

I'm running a cluster on Ubuntu 20.04 with Quincy version 17.2.7, and I'm encountering the same issue. Once I access the dashboard, it functions for about a minute before becoming unresponsive. This leads to the Manager (MGR) crashing or initiating a failover/restart process.

Given that the post hasn't been updated in three years, I'm exploring what additional information I can provide to address this problem.
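
In case it helps, a few things commonly attached to mgr/dashboard issues could be gathered with the standard CLI before reproducing the hang (ceph dashboard debug enable is present in recent releases; treat it as optional):

# Versions in use and any crash reports registered by the crash module
ceph versions
ceph crash ls

# Turn up mgr and dashboard logging before reproducing the hang
ceph config set mgr debug_mgr 20
ceph dashboard debug enable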

Actions #5

Updated by Matthew Hutchinson about 1 month ago

Are there any logs or other information I can provide to get this resolved? Currently the dashboard is disabled on the cluster to stop the MGR from crashing, and I would like to be able to use the Ceph dashboard again.
