Bug #47110

Ceph dashboard not working : rook-ceph-mgr-a pod : "OOM KILL" and "CrashLoopBackOff".

Added by julian kraif 26 days ago. Updated 24 days ago.

Status: New
Priority: Normal
Assignee: -
Category: Monitor
Target version:
% Done: 0%
Source: Support
Tags: mgr oom kill
Backport:
Regression: No
Severity: 3 - minor
Reviewed: 08/24/2020
Affected Versions:
ceph-qa-suite: upgrade/nautilus-x
Pull request ID:
Crash signature:

Description

Hi all,

This is my first time posting here, so sorry in advance if I forget important details; just ask if you need more.
Also, I'm not sure about the values I entered for "Source", "ceph-qa-suite" and "Pull request ID" on this ticket.

My colleague and I administer a DevOps foundry built on k8s and Rook Ceph.
We currently have a serious problem getting the MGR to work properly, and we can't access the Dashboard over external HTTPS.

The problem is that our MGR is constantly getting OOMKilled and going into CrashLoopBackOff.

(in CRD)

mgr:
  limits:
    cpu: 1Gi
    memory: 3Gi
  requests:
    cpu: 500m
    memory: 1Gi

QoS Class: Burstable
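
For readers wondering why the pod is reported as Burstable, here is a minimal Python sketch of the documented Kubernetes QoS rules for a single-container pod. The function name `qos_class` is hypothetical, and the quantities are compared as plain strings (a simplification: Kubernetes would treat `1` and `1000m` as equal). The values are copied exactly as written above.

```python
def qos_class(requests: dict, limits: dict) -> str:
    """Classify a single-container pod following the documented QoS rules.

    Simplification: quantities are compared as strings, not parsed.
    """
    if not requests and not limits:
        return "BestEffort"
    guaranteed = all(
        # A missing request defaults to the limit in Kubernetes.
        res in limits and requests.get(res, limits[res]) == limits[res]
        for res in ("cpu", "memory")
    )
    return "Guaranteed" if guaranteed else "Burstable"


# Values copied as written in this report.
mgr_requests = {"cpu": "500m", "memory": "1Gi"}
mgr_limits = {"cpu": "1Gi", "memory": "3Gi"}

print(qos_class(mgr_requests, mgr_limits))
```

Because the requests are lower than the limits, the pod cannot be Guaranteed. The container is still OOM-killed as soon as it exceeds the 3Gi memory limit, whatever the QoS class.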

On v1.1.7 the mgr pod was taking way too much memory (possibly a memory leak), so we upgraded Rook to v1.3.7.

Environment:
Linux 3.10.0-862.el7.x86_64 2018 x86_64 x86_64 x86_64 GNU/Linux
REDHAT 7

Image: ceph/ceph:v15.2.4-20200630

Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", Platform:"linux/amd64"}

Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
HEALTH_WARN no active mgr; clock skew detected on mon.az, mon.ba, mon.bc; mon ba is low on available space; Reduced data availability: 9 pgs inactive; 8 pool(s) have no replicas configured; 1 pool(s) have non-power-of-two pg_num; 8 pools have too few placement groups; 1585 slow ops, oldest one blocked for 349118 sec, daemons [osd.4,osd.5,osd.6] have slow ops.
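
Since the health summary packs several warnings into one line, a small Python sketch like the following can split it into individual checks and convert the slow-ops age into something readable. The helper names `split_health` and `humanize` are hypothetical, and splitting on `;` is only an assumption that holds for the summary pasted above.

```python
def split_health(summary: str):
    """Split a one-line health summary into (status, list of checks)."""
    status, _, rest = summary.partition(" ")
    checks = [c.strip().rstrip(".") for c in rest.split(";") if c.strip()]
    return status, checks


def humanize(seconds: int) -> str:
    """Render a duration in seconds as days, hours and minutes."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, _ = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m"


# Summary copied from the report above.
summary = (
    "HEALTH_WARN no active mgr; clock skew detected on mon.az, mon.ba, mon.bc; "
    "mon ba is low on available space; Reduced data availability: 9 pgs inactive; "
    "8 pool(s) have no replicas configured; 1 pool(s) have non-power-of-two pg_num; "
    "8 pools have too few placement groups; 1585 slow ops, oldest one blocked for "
    "349118 sec, daemons [osd.4,osd.5,osd.6] have slow ops."
)

status, checks = split_health(summary)
print(status, len(checks))
for check in checks:
    print(" -", check)
print("oldest slow op blocked for", humanize(349118))
```

This gives eight separate warnings, and the oldest slow op has been blocked for roughly four days, so the cluster has more problems than the missing mgr alone.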

Thanks in advance to all for your precious help :)

PS: Here are some additional logs:

@debug 2020-08-24T12:18:26.998+0000 7fdc5be75080 0 set uid:gid to 167:167 (ceph:ceph)
debug 2020-08-24T12:18:26.998+0000 7fdc5be75080 0 ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable), process ceph-mgr, pid 1
debug 2020-08-24T12:18:26.998+0000 7fdc5be75080 0 pidfile_write: ignore empty --pid-file
debug 2020-08-24T12:18:27.036+0000 7fdc5be75080 1 mgr[py] Loading python module 'alerts'
debug 2020-08-24T12:18:27.246+0000 7fdc5be75080 1 mgr[py] Loading python module 'balancer'
debug 2020-08-24T12:18:27.340+0000 7fdc5be75080 1 mgr[py] Loading python module 'cephadm'
debug 2020-08-24T12:18:28.009+0000 7fdc5be75080 1 mgr[py] Loading python module 'crash'
debug 2020-08-24T12:18:28.106+0000 7fdc5be75080 1 mgr[py] Loading python module 'dashboard'
debug 2020-08-24T12:18:29.816+0000 7fdc5be75080 1 mgr[py] Loading python module 'devicehealth'
debug 2020-08-24T12:18:29.901+0000 7fdc5be75080 1 mgr[py] Loading python module 'diskprediction_local'
debug 2020-08-24T12:18:30.417+0000 7fdc5be75080 1 mgr[py] Loading python module 'influx'
debug 2020-08-24T12:18:30.554+0000 7fdc5be75080 1 mgr[py] Loading python module 'insights'
debug 2020-08-24T12:18:30.639+0000 7fdc5be75080 1 mgr[py] Loading python module 'iostat'
debug 2020-08-24T12:18:30.715+0000 7fdc5be75080 1 mgr[py] Loading python module 'k8sevents'
debug 2020-08-24T12:18:32.796+0000 7fdc5be75080 1 mgr[py] Loading python module 'localpool'
debug 2020-08-24T12:18:32.874+0000 7fdc5be75080 1 mgr[py] Loading python module 'orchestrator'
debug 2020-08-24T12:18:33.147+0000 7fdc5be75080 1 mgr[py] Loading python module 'osd_support'
debug 2020-08-24T12:18:33.227+0000 7fdc5be75080 1 mgr[py] Loading python module 'pg_autoscaler'
debug 2020-08-24T12:18:33.354+0000 7fdc5be75080 1 mgr[py] Loading python module 'progress'
debug 2020-08-24T12:18:33.467+0000 7fdc5be75080 1 mgr[py] Loading python module 'prometheus'
debug 2020-08-24T12:18:34.044+0000 7fdc5be75080 1 mgr[py] Loading python module 'rbd_support'
debug 2020-08-24T12:18:34.214+0000 7fdc5be75080 1 mgr[py] Loading python module 'restful'
debug 2020-08-24T12:18:34.813+0000 7fdc5be75080 1 mgr[py] Loading python module 'rook'
debug 2020-08-24T12:18:35.667+0000 7fdc5be75080 1 mgr[py] Loading python module 'selftest'
debug 2020-08-24T12:18:35.751+0000 7fdc5be75080 1 mgr[py] Loading python module 'status'
debug 2020-08-24T12:18:35.857+0000 7fdc5be75080 1 mgr[py] Loading python module 'telegraf'
debug 2020-08-24T12:18:35.954+0000 7fdc5be75080 1 mgr[py] Loading python module 'telemetry'
debug 2020-08-24T12:18:36.163+0000 7fdc5be75080 1 mgr[py] Loading python module 'test_orchestrator'
debug 2020-08-24T12:18:36.552+0000 7fdc5be75080 1 mgr[py] Loading python module 'volumes'
debug 2020-08-24T12:18:36.837+0000 7fdc5be75080 1 mgr[py] Loading python module 'zabbix'
debug 2020-08-24T12:18:36.929+0000 7fdc490fa700 0 ms_deliver_dispatch: unhandled message 0x55704d97c420 mon_map magic: 0 v1 from mon.2 v2:10.233.3.180:3300/0
debug 2020-08-24T12:18:38.139+0000 7fdc490fa700 1 mgr handle_mgr_map Activating!
debug 2020-08-24T12:18:38.139+0000 7fdc490fa700 1 mgr handle_mgr_map I am now activating
debug 2020-08-24T12:18:38.233+0000 7fdc2baf1700 0 [balancer DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2020-08-24T12:18:38.233+0000 7fdc2baf1700 1 mgr load Constructed class from module: balancer
debug 2020-08-24T12:18:38.234+0000 7fdc2baf1700 0 [crash DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2020-08-24T12:18:38.234+0000 7fdc2baf1700 1 mgr load Constructed class from module: crash
debug 2020-08-24T12:18:38.235+0000 7fdc2baf1700 0 [dashboard DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2020-08-24T12:18:38.235+0000 7fdc2baf1700 1 mgr load Constructed class from module: dashboard
debug 2020-08-24T12:18:38.239+0000 7fdc2baf1700 0 [devicehealth DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2020-08-24T12:18:38.239+0000 7fdc2baf1700 1 mgr load Constructed class from module: devicehealth
debug 2020-08-24T12:18:38.240+0000 7fdc2baf1700 0 [iostat DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2020-08-24T12:18:38.241+0000 7fdc2baf1700 1 mgr load Constructed class from module: iostat
debug 2020-08-24T12:18:38.243+0000 7fdc2baf1700 0 [orchestrator DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2020-08-24T12:18:38.243+0000 7fdc2baf1700 1 mgr load Constructed class from module: orchestrator
debug 2020-08-24T12:18:38.244+0000 7fdc2baf1700 0 [osd_support DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2020-08-24T12:18:38.244+0000 7fdc2baf1700 1 mgr load Constructed class from module: osd_support
debug 2020-08-24T12:18:38.247+0000 7fdc2baf1700 0 [pg_autoscaler DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2020-08-24T12:18:38.247+0000 7fdc2baf1700 1 mgr load Constructed class from module: pg_autoscaler
debug 2020-08-24T12:18:38.253+0000 7fdc2baf1700 0 [progress DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2020-08-24T12:18:38.253+0000 7fdc2baf1700 1 mgr load Constructed class from module: progress
debug 2020-08-24T12:18:38.279+0000 7fdc2baf1700 0 [prometheus DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2020-08-24T12:18:38.279+0000 7fdc2baf1700 1 mgr load Constructed class from module: prometheus
debug 2020-08-24T12:18:38.281+0000 7fdc2baf1700 0 [rbd_support DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
[24/Aug/2020:12:18:38] ENGINE Bus STARTING
CherryPy Checker:
The Application mounted at '' has an empty config.

[24/Aug/2020:12:18:38] ENGINE Serving on http://0.0.0.0:9283
[24/Aug/2020:12:18:38] ENGINE Bus STARTED
debug 2020-08-24T12:18:39.152+0000 7fdc2c2f2700 0 log_channel(audit) log [DBG] : from='client.18915725 -' entity='client.csi-rbd-provisioner' cmd=[{"image_spec": "monitoring-r1/csi-vol-5bb629a1-a733-11ea-91a6-1e72bbfcc7ba", "prefix": "rbd task add remove", "target": ["mgr", ""]}]: dispatch
debug 2020-08-24T12:18:39.155+0000 7fdc2c2f2700 0 log_channel(audit) log [DBG] : from='client.20194281 -' entity='client.crash' cmd=[{"prefix": "crash post", "target": ["mon-mgr", ""]}]: dispatch
debug 2020-08-24T12:18:39.156+0000 7fdc2c2f2700 0 log_channel(audit) log [DBG] : from='client.18823654 -' entity='client.csi-rbd-provisioner' cmd=[{"image_spec": "builds-r1/csi-vol-b47fa0bd-d30a-11ea-a2eb-72f121e0663b", "prefix": "rbd task add remove", "target": ["mgr", ""]}]: dispatch
debug 2020-08-24T12:18:39.156+0000 7fdc2c2f2700 0 log_channel(audit) log [DBG] : from='client.20116395 -' entity='client.admin' cmd=[{"prefix": "balancer mode", "target": ["mgr", ""], "mode": "upmap", "format": "json"}]: dispatch
debug 2020-08-24T12:18:39.156+0000 7fdc2c2f2700 0 log_channel(audit) log [DBG] : from='client.20194890 -' entity='client.admin' cmd=[{"prefix": "osd ok-to-stop", "target": ["mgr", ""], "ids": ["1"], "format": "json"}]: dispatch
debug 2020-08-24T12:18:39.156+0000 7fdc2c2f2700 -1 mgr.server reply reply (11) Resource temporarily unavailable 191 pgs have unknown state; cannot draw any conclusions
debug 2020-08-24T12:18:39.162+0000 7fdc2c2f2700 0 log_channel(audit) log [DBG] : from='client.18922959 -' entity='client.csi-rbd-provisioner' cmd=[{"image_spec": "monitoring-r1/csi-vol-8347786b-af0b-11ea-91a6-1e72bbfcc7ba", "prefix": "rbd task add remove", "target": ["mgr", ""]}]: dispatch
debug 2020-08-24T12:18:39.162+0000 7fdc2c2f2700 0 log_channel(audit) log [DBG] : from='client.20120167 -' entity='client.crash' cmd=[{"prefix": "crash post", "target": ["mon-mgr", ""]}]: dispatch
debug 2020-08-24T12:18:39.163+0000 7fdc2c2f2700 0 log_channel(audit) log [DBG] : from='client.20183056 -' entity='client.crash' cmd=[{"prefix": "crash post", "target": ["mon-mgr", ""]}]: dispatch
debug 2020-08-24T12:18:39.163+0000 7fdc2c2f2700 0 log_channel(audit) log [DBG] : from='client.18735276 -' entity='client.csi-rbd-provisioner' cmd=[{"image_spec": "monitoring-r3/csi-vol-5d68973b-c1ee-11ea-a223-da83f88a0486", "prefix": "rbd task add remove", "target": ["mgr", ""]}]: dispatch
debug 2020-08-24T12:18:39.164+0000 7fdc2c2f2700 0 log_channel(audit) log [DBG] : from='client.20120207 -' entity='client.crash' cmd=[{"prefix": "crash post", "target": ["mon-mgr", ""]}]: dispatch
debug 2020-08-24T12:18:39.164+0000 7fdc2c2f2700 0 log_channel(audit) log [DBG] : from='client.18922863 -' entity='client.csi-rbd-provisioner' cmd=[{"image_spec": "builds-r1/csi-vol-0526fe14-cd91-11ea-ae5b-a273a14ecc5d", "prefix": "rbd task add remove", "target": ["mgr", ""]}]: dispatch
debug 2020-08-24T12:18:39.165+0000 7fdc2c2f2700 0 log_channel(audit) log [DBG] : from='client.20195390 -' entity='client.crash' cmd=[{"prefix": "crash post", "target": ["mon-mgr", ""]}]: dispatch
@
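
The audit lines at the end of this log show many commands being dispatched to the mgr right after it activates. As a rough way to see who is sending what, here is a Python sketch that tallies the commands per client entity. The regular expression and the `tally` helper are assumptions based only on the line format visible in this log, not an official Ceph log parser; the sample lines are copied from the log above.

```python
import json
import re
from collections import Counter

# Matches: entity='<name>' cmd=[<json objects>]: dispatch
AUDIT_RE = re.compile(r"entity='([^']+)' cmd=(\[.*\]): dispatch")


def tally(lines):
    """Count dispatched audit commands per (entity, command prefix)."""
    counts = Counter()
    for line in lines:
        match = AUDIT_RE.search(line)
        if not match:
            continue  # not an audit dispatch line
        entity = match.group(1)
        for cmd in json.loads(match.group(2)):
            counts[(entity, cmd.get("prefix", "?"))] += 1
    return counts


sample_log = """
debug 2020-08-24T12:18:39.152+0000 7fdc2c2f2700 0 log_channel(audit) log [DBG] : from='client.18915725 -' entity='client.csi-rbd-provisioner' cmd=[{"image_spec": "monitoring-r1/csi-vol-5bb629a1-a733-11ea-91a6-1e72bbfcc7ba", "prefix": "rbd task add remove", "target": ["mgr", ""]}]: dispatch
debug 2020-08-24T12:18:39.155+0000 7fdc2c2f2700 0 log_channel(audit) log [DBG] : from='client.20194281 -' entity='client.crash' cmd=[{"prefix": "crash post", "target": ["mon-mgr", ""]}]: dispatch
debug 2020-08-24T12:18:39.156+0000 7fdc2c2f2700 0 log_channel(audit) log [DBG] : from='client.20116395 -' entity='client.admin' cmd=[{"prefix": "balancer mode", "target": ["mgr", ""], "mode": "upmap", "format": "json"}]: dispatch
debug 2020-08-24T12:18:39.162+0000 7fdc2c2f2700 0 log_channel(audit) log [DBG] : from='client.20120167 -' entity='client.crash' cmd=[{"prefix": "crash post", "target": ["mon-mgr", ""]}]: dispatch
"""

counts = tally(sample_log.strip().splitlines())
for (entity, prefix), n in counts.most_common():
    print(f"{n:3d}  {entity}  {prefix}")
```

Run against the full mgr log, this kind of tally would show whether the queued "rbd task add remove" and "crash post" requests dominate the traffic when the mgr starts.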

History

#1 Updated by julian kraif 26 days ago

Some logs from last week that could help:

debug 2020-08-20T11:14:35.339+0000 7fe10ed2b700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.1 ()
debug 2020-08-20T11:14:35.346+0000 7fe10ed2b700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.6 ()
debug 2020-08-20T11:14:35.347+0000 7fe10ed2b700 0 [devicehealth ERROR root] Fail to parse JSON result from daemon osd.7 ()
CherryPy Checker:
File "/lib/python3.6/site-packages/cherrypy/_cprequest.py", line 638, in respond
File "/lib/python3.6/site-packages/cherrypy/_cprequest.py", line 697, in _do_respond
File "/lib/python3.6/site-packages/cherrypy/lib/encoding.py", line 219, in call
File "/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 54, in call
KeyError: 'wait'
debug 2020-08-20T11:14:54.472+0000 7fe0fde10700 0 [prometheus ERROR cherrypy.error.140604633254600] [20/Aug/2020:11:14:54] HTTP
File "/lib/python3.6/site-packages/cherrypy/_cprequest.py", line 638, in respond
File "/lib/python3.6/site-packages/cherrypy/_cprequest.py", line 697, in _do_respond
File "/lib/python3.6/site-packages/cherrypy/lib/encoding.py", line 219, in call
File "/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 54, in call

#2 Updated by julian kraif 24 days ago

Hi all,

Yesterday we tried downgrading the Ceph version used by the MGR deployment (we tried a previous Octopus build and even Nautilus). Unfortunately, nothing worked.

We are now completely out of ideas for making it work. Please help.
