Bug #49808

open

Ceph manager becomes unresponsive and is replaced by standby daemon

Added by Mathias Lindberg about 3 years ago. Updated almost 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are running the dashboard on CentOS 7.9.2009, an OS it does not officially support; dependencies not available in EPEL were installed via pip. Ceph version is 15.2.9. This time the MGR process ran for more than 2 weeks before becoming unresponsive; previous runs have been shorter. I do not expect support for an unsupported OS/component combination, I am just providing this information in case it also affects users running supported combinations.
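For context, the active/standby replacement described in the title can be observed and forced manually with the standard CLI; a minimal sketch, assuming admin keyring access on a monitor node (jq is only used for readability and is an extra dependency):

```shell
# Show the current active mgr and the list of standbys
ceph mgr stat

# More detail: active daemon name and standby names
ceph mgr dump | jq '{active: .active_name, standbys: [.standbys[].name]}'

# Force a failover to a standby by failing the current active mgr
# (the same replacement the cluster performs on its own when the
# active mgr stops sending beacons)
ceph mgr fail "$(ceph mgr dump | jq -r .active_name)"
```

After the failover, `ceph mgr stat` should report one of the former standbys as active.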

Manager logs:

2021-03-15T13:32:46.585+0100 7f908eabe700 0 [prometheus DEBUG root] Starting method get_rbd_stats.
2021-03-15T13:32:46.585+0100 7f908eabe700 0 [prometheus DEBUG root] Method get_rbd_stats ran 0.000 seconds.
2021-03-15T13:32:46.597+0100 7f90775d0700 0 [dashboard DEBUG request] [********:50471] [GET] [dash] /api/summary
2021-03-15T13:32:46.597+0100 7f90775d0700 0 [dashboard DEBUG auth] token: *******************
2021-03-15T13:32:46.597+0100 7f90775d0700 4 mgr get_store get_store key: mgr/dashboard/jwt_token_black_list
2021-03-15T13:32:46.597+0100 7f90775d0700 0 [dashboard DEBUG auth] checking authorization...
2021-03-15T13:32:46.629+0100 7f9061ee2700 0 [dashboard DEBUG viewcache] starting execution of <function get_daemons_and_pools at 0x7f90a005c2f0>
2021-03-15T13:32:46.757+0100 7f9099713700 0 log_channel(audit) log [DBG] : from='client.151670041 -' entity='client.admin' cmd=[{"prefix": "osd pool stats", "target": ["mon-mgr", ""], "format": "json"}]: dispatch
2021-03-15T13:32:46.812+0100 7f908eabe700 0 [prometheus DEBUG root] Method collect ran 0.733 seconds.
2021-03-15T13:32:46.812+0100 7f908eabe700 0 [prometheus DEBUG root] collecting cache in thread done
2021-03-15T13:32:46.840+0100 7f9061ee2700 0 [dashboard DEBUG controllers.rbd_mirror] Constructing IOCtx rbd
2021-03-15T13:32:46.841+0100 7f9061ee2700 0 [dashboard DEBUG controllers.rbd_mirror] Constructing IOCtx ssc-cinder-volumes-md
2021-03-15T13:32:46.841+0100 7f9061ee2700 0 [dashboard DEBUG controllers.rbd_mirror] Constructing IOCtx ssc-glance-images-md
2021-03-15T13:32:46.842+0100 7f9061ee2700 0 [dashboard DEBUG controllers.rbd_mirror] Constructing IOCtx ssc-nova-vms-md
2021-03-15T13:32:46.843+0100 7f9061ee2700 0 [dashboard DEBUG controllers.rbd_mirror] Constructing IOCtx mare4
2021-03-15T13:32:46.844+0100 7f9061ee2700 0 [dashboard DEBUG viewcache] execution of <function get_daemons_and_pools at 0x7f90a005c2f0> finished in: 0.21473383903503418
2021-03-15T13:32:46.845+0100 7f90775d0700 0 [dashboard INFO request] [********:50471] [GET] [200] [0.249s] [dash] [241.0B] /api/summary
2021-03-15T13:32:47.373+0100 7f9071dc5700 0 [dashboard DEBUG notification_queue] processing queue: 1
2021-03-15T13:32:47.508+0100 7f9071dc5700 0 [dashboard DEBUG notification_queue] processing queue: 1
2021-03-15T13:32:47.571+0100 7f9077dd1700 0 [dashboard DEBUG request] [********:62014] [GET] [dash] /api/cluster_conf/
2021-03-15T13:32:47.571+0100 7f9077dd1700 0 [dashboard DEBUG auth] token: *******************
2021-03-15T13:32:47.571+0100 7f9077dd1700 4 mgr get_store get_store key: mgr/dashboard/jwt_token_black_list
2021-03-15T13:32:47.572+0100 7f9077dd1700 0 [dashboard DEBUG auth] checking authorization...
2021-03-15T13:32:47.572+0100 7f9077dd1700 0 [dashboard DEBUG auth] checking '['read']' access to 'config-opt' scope
2021-03-15T13:32:47.751+0100 7f9087970700 0 [rbd_support DEBUG root] TaskHandler: tick
2021-03-15T13:32:47.751+0100 7f908d97c700 0 [rbd_support DEBUG root] PerfHandler: tick


Files

ceph-mgr.cephyr-mon1.log.20210518.gz (38.2 KB), MGR logs, added by Mathias Lindberg, 05/19/2021 03:03 PM
Actions #1

Updated by Ernesto Puerta about 3 years ago

Thanks, Mathias!

- When you say 'dependencies not in EPEL', which ones are missing? AFAIK all deps should be available as RPMs from EPEL 7.
- Additionally, we recently faced and fixed an issue involving the dashboard becoming unresponsive due to a bug in cheroot (https://tracker.ceph.com/issues/48973). However, in this case it looks like the whole ceph-mgr daemon stops responding, so this points to a different issue.
- Have you checked system logs for OOM events? If you are running Dashboard with the monitoring stack, you can check the Grafana dashboards + Prometheus metrics to look for memory/CPU load increases.
- Additionally, I'd suggest increasing the verbosity level of the manager logs.
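The OOM check and verbosity bump suggested above look roughly like this on the mgr host (a sketch; debug_mgr at 20 is very chatty, so remember to revert it):

```shell
# Look for kernel OOM-killer activity around the time of the hang
journalctl -k --since "2021-03-15" | grep -iE 'out of memory|oom'

# Raise ceph-mgr log verbosity via the centralized config
# (available since Nautilus); 20 is the maximum
ceph config set mgr debug_mgr 20

# Revert to the default when done collecting logs
ceph config rm mgr debug_mgr
```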

Actions #2

Updated by Mathias Lindberg about 3 years ago

Hi,
Only python3 packages related to the dashboard (unsupported on CentOS 7) were missing. The following packages were installed with pip: pecan, cheroot, jaraco.collections, more-itertools, portend, zc.lockfile and repoze.lru.
We built RPMs ourselves for python3-cherrypy, python3-jwt, python3-more-itertools and python3-routes.
Building scipy (with its dependencies) for ceph-mgr-diskprediction-local seemed too big a task at the time.
We noticed the cheroot issue and upgraded to v8.5.2 a while back.
No OOM events as far as I can see; I have increased the verbosity level to 20 now.
Thank you!
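For reference, installing the packages listed above with pip would be along these lines (package names as given in the comment; no versions pinned here, which a production deployment would want):

```shell
# Dashboard dependencies not packaged in EPEL 7, installed via pip
pip3 install pecan cheroot jaraco.collections more-itertools \
    portend zc.lockfile repoze.lru
```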

Actions #3

Updated by Mathias Lindberg almost 3 years ago

Finally got the manager process to become unresponsive again; it took a good 2+ months this time. Attaching a log covering the last minute or so. The manager was not listed as a standby, and the manager process itself was running at ~100% CPU. I have strace output from the manager process, both with and without -f, if that is of use. Currently at version 15.2.12.
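When this reproduces again, pinning down which thread is spinning may complement the strace output; a sketch of the usual steps (gdb needs debuginfo packages for useful native stacks, and py-spy is an extra pip install, not something shipped with Ceph):

```shell
# Find the mgr pid and list its busiest threads
pid=$(pgrep -o ceph-mgr)
top -H -b -n 1 -p "$pid" | head -n 20

# Native stack of every thread in the daemon
gdb -p "$pid" -batch -ex 'thread apply all bt'

# Python-level stacks of the mgr modules (dashboard, prometheus, ...)
py-spy dump --pid "$pid"
```

A thread stuck at 100% CPU inside a Python module shows up clearly in the py-spy dump, which helps distinguish a mgr-core hang from a misbehaving module.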
