Bug #49808

open

Ceph manager becomes unresponsive and is replaced by standby daemon

Added by Mathias Lindberg about 3 years ago. Updated almost 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are running the dashboard on CentOS 7.9.2009, an OS it does not officially support; dependencies not available in EPEL were installed via pip. Ceph version is 15.2.9. This time the MGR process ran for more than 2 weeks before becoming unresponsive; previous runs have been shorter. I do not expect support for an unsupported OS/component combination, I am just providing this information in case it also affects users running supported combinations.
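For context, the active/standby replacement described in the title can be observed and forced manually with the standard CLI; a minimal sketch, assuming admin keyring access on a monitor node (jq is only used for readability and is an extra dependency):

```shell
# Show the current active mgr and the list of standbys
ceph mgr stat

# More detail: active daemon name and standby names
ceph mgr dump | jq '{active: .active_name, standbys: [.standbys[].name]}'

# Force a failover to a standby by failing the current active mgr
# (the same replacement the cluster performs on its own when the
# active mgr stops sending beacons)
ceph mgr fail "$(ceph mgr dump | jq -r .active_name)"
```

After the failover, `ceph mgr stat` should report one of the former standbys as active.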

Manager logs:

2021-03-15T13:32:46.585+0100 7f908eabe700 0 [prometheus DEBUG root] Starting method get_rbd_stats.
2021-03-15T13:32:46.585+0100 7f908eabe700 0 [prometheus DEBUG root] Method get_rbd_stats ran 0.000 seconds.
2021-03-15T13:32:46.597+0100 7f90775d0700 0 [dashboard DEBUG request] [********:50471] [GET] [dash] /api/summary
2021-03-15T13:32:46.597+0100 7f90775d0700 0 [dashboard DEBUG auth] token: *******************
2021-03-15T13:32:46.597+0100 7f90775d0700 4 mgr get_store get_store key: mgr/dashboard/jwt_token_black_list
2021-03-15T13:32:46.597+0100 7f90775d0700 0 [dashboard DEBUG auth] checking authorization...
2021-03-15T13:32:46.629+0100 7f9061ee2700 0 [dashboard DEBUG viewcache] starting execution of <function get_daemons_and_pools at 0x7f90a005c2f0>
2021-03-15T13:32:46.757+0100 7f9099713700 0 log_channel(audit) log [DBG] : from='client.151670041 -' entity='client.admin' cmd=[{"prefix": "osd pool stats", "target": ["mon-mgr", ""], "format": "json"}]: dispatch
2021-03-15T13:32:46.812+0100 7f908eabe700 0 [prometheus DEBUG root] Method collect ran 0.733 seconds.
2021-03-15T13:32:46.812+0100 7f908eabe700 0 [prometheus DEBUG root] collecting cache in thread done
2021-03-15T13:32:46.840+0100 7f9061ee2700 0 [dashboard DEBUG controllers.rbd_mirror] Constructing IOCtx rbd
2021-03-15T13:32:46.841+0100 7f9061ee2700 0 [dashboard DEBUG controllers.rbd_mirror] Constructing IOCtx ssc-cinder-volumes-md
2021-03-15T13:32:46.841+0100 7f9061ee2700 0 [dashboard DEBUG controllers.rbd_mirror] Constructing IOCtx ssc-glance-images-md
2021-03-15T13:32:46.842+0100 7f9061ee2700 0 [dashboard DEBUG controllers.rbd_mirror] Constructing IOCtx ssc-nova-vms-md
2021-03-15T13:32:46.843+0100 7f9061ee2700 0 [dashboard DEBUG controllers.rbd_mirror] Constructing IOCtx mare4
2021-03-15T13:32:46.844+0100 7f9061ee2700 0 [dashboard DEBUG viewcache] execution of <function get_daemons_and_pools at 0x7f90a005c2f0> finished in: 0.21473383903503418
2021-03-15T13:32:46.845+0100 7f90775d0700 0 [dashboard INFO request] [********:50471] [GET] [200] [0.249s] [dash] [241.0B] /api/summary
2021-03-15T13:32:47.373+0100 7f9071dc5700 0 [dashboard DEBUG notification_queue] processing queue: 1
2021-03-15T13:32:47.508+0100 7f9071dc5700 0 [dashboard DEBUG notification_queue] processing queue: 1
2021-03-15T13:32:47.571+0100 7f9077dd1700 0 [dashboard DEBUG request] [********:62014] [GET] [dash] /api/cluster_conf/
2021-03-15T13:32:47.571+0100 7f9077dd1700 0 [dashboard DEBUG auth] token: *******************
2021-03-15T13:32:47.571+0100 7f9077dd1700 4 mgr get_store get_store key: mgr/dashboard/jwt_token_black_list
2021-03-15T13:32:47.572+0100 7f9077dd1700 0 [dashboard DEBUG auth] checking authorization...
2021-03-15T13:32:47.572+0100 7f9077dd1700 0 [dashboard DEBUG auth] checking '['read']' access to 'config-opt' scope
2021-03-15T13:32:47.751+0100 7f9087970700 0 [rbd_support DEBUG root] TaskHandler: tick
2021-03-15T13:32:47.751+0100 7f908d97c700 0 [rbd_support DEBUG root] PerfHandler: tick


Files

ceph-mgr.cephyr-mon1.log.20210518.gz (38.2 KB), MGR logs, added by Mathias Lindberg, 05/19/2021 03:03 PM
Actions #1

Updated by Ernesto Puerta about 3 years ago

Thanks, Mathias!

- When you say 'dependencies not in EPEL', which ones are missing? AFAIK all deps should be available as RPMs from EPEL 7.
- Additionally, we recently faced and fixed an issue involving the dashboard becoming unresponsive due to a bug in cheroot (https://tracker.ceph.com/issues/48973). However, in this case it looks like the whole ceph-mgr daemon stops responding, so this points to a different issue.
- Have you checked system logs for OOM events? If you are running Dashboard with the monitoring stack, you can check the Grafana dashboards + Prometheus metrics to look for memory/CPU load increases.
- Additionally, I'd suggest increasing the verbosity level of the manager logs.
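The OOM check and verbosity bump suggested above look roughly like this on the mgr host (a sketch; debug_mgr at 20 is very chatty, so remember to revert it):

```shell
# Look for kernel OOM-killer activity around the time of the hang
journalctl -k --since "2021-03-15" | grep -iE 'out of memory|oom'

# Raise ceph-mgr log verbosity via the centralized config
# (available since Nautilus); 20 is the maximum
ceph config set mgr debug_mgr 20

# Revert to the default when done collecting logs
ceph config rm mgr debug_mgr
```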

Actions #2

Updated by Mathias Lindberg about 3 years ago

Hi,
Only python3 packages related to the dashboard (unsupported on CentOS 7) were missing. The following packages were installed with pip: pecan, cheroot, jaraco.collections, more-itertools, portend, zc.lockfile and repoze.lru.
We built RPMs ourselves for python3-cherrypy, python3-jwt, python3-more-itertools and python3-routes.
Building scipy (with its dependencies) for ceph-mgr-diskprediction-local seemed too big a task at the time.
We noticed the cheroot issue and upgraded to v8.5.2 a while back.
No OOM events as far as I can see; I have increased the verbosity level to 20 now.
Thank you!
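For reference, installing the packages listed above with pip would be along these lines (package names as given in the comment; no versions pinned here, which a production deployment would want):

```shell
# Dashboard dependencies not packaged in EPEL 7, installed via pip
pip3 install pecan cheroot jaraco.collections more-itertools \
    portend zc.lockfile repoze.lru
```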

Actions #3

Updated by Mathias Lindberg almost 3 years ago

Finally got the manager process to become unresponsive again; it took a good 2+ months this time. Attaching a log covering the last minute or so. The manager was not listed as a standby, and the manager process itself was running at ~100% CPU. I have strace output from the manager process, both with and without -f, if that is of use. Currently at version 15.2.12.
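When this reproduces again, pinning down which thread is spinning may complement the strace output; a sketch of the usual steps (gdb needs debuginfo packages for useful native stacks, and py-spy is an extra pip install, not something shipped with Ceph):

```shell
# Find the mgr pid and list its busiest threads
pid=$(pgrep -o ceph-mgr)
top -H -b -n 1 -p "$pid" | head -n 20

# Native stack of every thread in the daemon
gdb -p "$pid" -batch -ex 'thread apply all bt'

# Python-level stacks of the mgr modules (dashboard, prometheus, ...)
py-spy dump --pid "$pid"
```

A thread stuck at 100% CPU inside a Python module shows up clearly in the py-spy dump, which helps distinguish a mgr-core hang from a misbehaving module.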
