Bug #41736

"ActivePyModule.cc: 54: FAILED ceph_assert(pClassInstance != nullptr)" due to race when loading modules

Added by Gavin Baker 6 months ago. Updated 5 months ago.

Status: Fix Under Review
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: nautilus
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

After a yum update and service restart of a Ceph cluster, the manager services crash and fail to restart. The main error appears to be: "mgr operator() Failed to run module in active mode ('rbd_support')".

Sep  9 19:13:28 ceph-mgr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep  9 19:13:28 ceph-mgr: -235> 2019-09-09 19:13:28.427 7fbbed24a700 -1 mgr load Failed to construct class in 'rbd_support'
Sep  9 19:13:28 ceph-mgr: -218> 2019-09-09 19:13:28.427 7fbbed24a700 -1 mgr load Traceback (most recent call last):
Sep  9 19:13:28 ceph-mgr: File "/usr/share/ceph/mgr/rbd_support/module.py", line 1326, in __init__
Sep  9 19:13:28 ceph-mgr: self.task = TaskHandler(self)
Sep  9 19:13:28 ceph-mgr: File "/usr/share/ceph/mgr/rbd_support/module.py", line 610, in __init__
Sep  9 19:13:28 ceph-mgr: self.init_task_queue()
Sep  9 19:13:28 ceph-mgr: File "/usr/share/ceph/mgr/rbd_support/module.py", line 674, in init_task_queue
Sep  9 19:13:28 ceph-mgr: self.load_task_queue(ioctx, pool_name)
Sep  9 19:13:28 ceph-mgr: File "/usr/share/ceph/mgr/rbd_support/module.py", line 708, in load_task_queue
Sep  9 19:13:28 ceph-mgr: ioctx.operate_read_op(read_op, RBD_TASK_OID)
Sep  9 19:13:28 ceph-mgr: File "rados.pyx", line 516, in rados.requires.wrapper.validate_func (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.3/rpm/el7/BUILD/ceph-14.2.3/build/src/pybind/rados/pyrex/rados.c:4721)
Sep  9 19:13:28 ceph-mgr: File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.3/rpm/el7/BUILD/ceph-14.2.3/build/src/pybind/rados/pyrex/rados.c:36554)
Sep  9 19:13:28 ceph-mgr: PermissionError: [errno 1] Failed to operate read op for oid rbd_task
Sep  9 19:13:28 ceph-mgr: -217> 2019-09-09 19:13:28.583 7fbbed24a700 -1 mgr operator() Failed to run module in active mode ('rbd_support')
Sep  9 19:13:28 ceph-mgr: -128> 2019-09-09 19:13:28.590 7fbbed24a700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.3/rpm/el7/BUILD/ceph-14.2.3/src/mgr/ActivePyModule.cc: In function 'void ActivePyModule::notify(const string&, const string&)' thread 7fbbed24a700 time 2019-09-09 19:13:28.590091
Sep  9 19:13:28 ceph-mgr: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.3/rpm/el7/BUILD/ceph-14.2.3/src/mgr/ActivePyModule.cc: 54: FAILED ceph_assert(pClassInstance != nullptr)
Sep  9 19:13:28 ceph-mgr: ceph version 14.2.3 (0f776cf838a1ae3130b2b73dc26be9c95c6ccc39) nautilus (stable)
Sep  9 19:13:28 ceph-mgr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7fbc0e38eac2]
Sep  9 19:13:28 ceph-mgr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7fbc0e38ec90]
Sep  9 19:13:28 ceph-mgr: 3: (ActivePyModule::notify(std::string const&, std::string const&)+0x4f5) [0x56043aea69f5]
Sep  9 19:13:28 ceph-mgr: 4: (FunctionContext::finish(int)+0x2c) [0x56043aeb8eac]
Sep  9 19:13:28 ceph-mgr: 5: (Context::complete(int)+0x9) [0x56043aeb5659]
Sep  9 19:13:28 ceph-mgr: 6: (Finisher::finisher_thread_entry()+0x156) [0x7fbc0e3d5cc6]
Sep  9 19:13:28 ceph-mgr: 7: (()+0x7dd5) [0x7fbc0bc8ddd5]
Sep  9 19:13:28 ceph-mgr: 8: (clone()+0x6d) [0x7fbc0a93702d]
Sep  9 19:13:28 ceph-mgr: -106> 2019-09-09 19:13:28.591 7fbbed24a700 -1 *** Caught signal (Aborted) **
Sep  9 19:13:28 ceph-mgr: in thread 7fbbed24a700 thread_name:mgr-fin
Sep  9 19:13:28 ceph-mgr: ceph version 14.2.3 (0f776cf838a1ae3130b2b73dc26be9c95c6ccc39) nautilus (stable)
Sep  9 19:13:28 ceph-mgr: 1: (()+0xf5d0) [0x7fbc0bc955d0]
Sep  9 19:13:28 ceph-mgr: 2: (gsignal()+0x37) [0x7fbc0a86f2c7]
Sep  9 19:13:28 ceph-mgr: 3: (abort()+0x148) [0x7fbc0a8709b8]
Sep  9 19:13:28 ceph-mgr: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x7fbc0e38eb11]
Sep  9 19:13:28 ceph-mgr: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7fbc0e38ec90]
Sep  9 19:13:28 ceph-mgr: 6: (ActivePyModule::notify(std::string const&, std::string const&)+0x4f5) [0x56043aea69f5]
Sep  9 19:13:28 ceph-mgr: 7: (FunctionContext::finish(int)+0x2c) [0x56043aeb8eac]
Sep  9 19:13:28 ceph-mgr: 8: (Context::complete(int)+0x9) [0x56043aeb5659]
Sep  9 19:13:28 ceph-mgr: 9: (Finisher::finisher_thread_entry()+0x156) [0x7fbc0e3d5cc6]
Sep  9 19:13:28 ceph-mgr: 10: (()+0x7dd5) [0x7fbc0bc8ddd5]
Sep  9 19:13:28 ceph-mgr: 11: (clone()+0x6d) [0x7fbc0a93702d]

History

#1 Updated by Gavin Baker 6 months ago

Removing several old config options appears to have allowed the mgr service to start. However, the ceph status command now reports the following error:

health: HEALTH_ERR
Module 'rbd_support' has failed: Not found or unloadable

Config options that were removed:

mgr advanced mgr/balancer/active true
mgr advanced mgr/balancer/mode crush-compat
mgr advanced mgr/balancer/pool_ids 24,1,2,20,14,22,23
mgr advanced mgr/devicehealth/enable_monitoring false

#2 Updated by Greg Farnum 5 months ago

  • Project changed from Ceph to mgr

#3 Updated by Gavin Baker 5 months ago

After further testing: if the balancer is turned on, the manager daemon fails entirely, as initially described. If the balancer is off, the rbd_support module error still appears but the service starts. This behavior is consistent across our two separate long-running Ceph production clusters.

With the balancer off, these are the error messages produced in the logs:

```
2019-09-23 12:00:25.532 7f4ca4423700 -1 mgr load Failed to construct class in 'rbd_support'
2019-09-23 12:00:25.532 7f4ca4423700 -1 mgr load Traceback (most recent call last):
File "/usr/share/ceph/mgr/rbd_support/module.py", line 1326, in __init__
self.task = TaskHandler(self)
File "/usr/share/ceph/mgr/rbd_support/module.py", line 610, in __init__
self.init_task_queue()
File "/usr/share/ceph/mgr/rbd_support/module.py", line 674, in init_task_queue
self.load_task_queue(ioctx, pool_name)
File "/usr/share/ceph/mgr/rbd_support/module.py", line 708, in load_task_queue
ioctx.operate_read_op(read_op, RBD_TASK_OID)
File "rados.pyx", line 516, in rados.requires.wrapper.validate_func (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
PermissionError: [errno 1] Failed to operate read op for oid rbd_task

2019-09-23 12:00:25.533 7f4ca4423700 -1 mgr operator() Failed to run module in active mode ('rbd_support')
```

#4 Updated by Gavin Baker 5 months ago

This does appear to be an auth permissions issue with the mgr caps. I'm not sure whether the base caps changed since we deployed, but adding osd/mds permissions seems to have cleared the warning and stopped the mgr from crashing.

#5 Updated by Mikaël Cluseau 5 months ago

I confirm the same action fixes this issue:

```
ceph auth caps mgr.{ID} mon 'allow profile mgr' osd 'allow *' mds 'allow *'
```
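To check whether a given mgr key is in the same state, one could compare its caps against the services named in the command above. The following is only a minimal sketch: it assumes the caps have already been obtained as a dict mapping service name to cap string (for example, parsed by hand from `ceph auth get` output), and `REQUIRED_SERVICES`, `missing_caps`, and the `before` value are hypothetical, not part of Ceph or taken from this report.

```python
# Services that the workaround above grants caps for.
REQUIRED_SERVICES = ("mon", "osd", "mds")


def missing_caps(caps):
    """Return the services in REQUIRED_SERVICES that have no cap string in `caps`."""
    return [svc for svc in REQUIRED_SERVICES if not caps.get(svc, "").strip()]


# Hypothetical caps of a mgr key that only has the mon profile.
before = {"mon": "allow profile mgr"}
# Caps after running the `ceph auth caps` command above.
after = {"mon": "allow profile mgr", "osd": "allow *", "mds": "allow *"}

print(missing_caps(before))  # ['osd', 'mds']
print(missing_caps(after))   # []
```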

#6 Updated by Sebastian Wagner 5 months ago

  • Description updated (diff)

#7 Updated by Mykola Golub 5 months ago

  • Status changed from New to In Progress
  • Assignee set to Mykola Golub

So, I see two issues here:

1) the rbd_support module failed to load because the mgr auth caps were not properly configured;
2) there is a race in ceph-mgr on module load (a notify can be received after a loading module has been added to the modules list but before it has finished loading) that may lead to the mgr crash.

Right now I don't know whether (1) is a bug (in docs, upgrade procedures, etc.) and whether something needs to be done about it, but I am going to work on (2).

Note that 'notify' is called when mgr module config options are processed, which is why removing the config options worked around issue (2).
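The window described in (2) can be illustrated with a small, single-threaded Python model. This is only a sketch under stated assumptions, not the ceph-mgr code and not the change in the pull request: `Registry`, `register`, `construct`, and the `guard` flag are hypothetical names, and skipping unready modules is just one possible way to avoid the assert.

```python
class Registry:
    """Toy stand-in for the mgr's list of active modules (hypothetical)."""

    def __init__(self, guard):
        self.guard = guard      # True: skip modules that are not constructed yet
        self.instances = {}     # module name -> instance, or None while loading
        self.delivered = []     # (module, notify_type) pairs actually delivered

    def register(self, name):
        # Step 1: the module becomes visible to notify() before it is constructed.
        self.instances[name] = None

    def construct(self, name, factory):
        # Step 2: construction may fail (e.g. PermissionError), leaving None behind.
        try:
            self.instances[name] = factory()
        except Exception:
            return False
        return True

    def notify(self, notify_type):
        for name, instance in self.instances.items():
            if instance is None:
                if self.guard:
                    continue    # not ready: skip instead of asserting
                # Plays the role of the failed ceph_assert(pClassInstance != nullptr).
                raise AssertionError(f"{name}: instance is None")
            self.delivered.append((name, notify_type))


def failing_factory():
    raise PermissionError("Failed to operate read op for oid rbd_task")


# Without a guard, a notify that lands between step 1 and step 2 trips the assert.
unguarded = Registry(guard=False)
unguarded.register("rbd_support")
try:
    unguarded.notify("config")
    crashed = False
except AssertionError:
    crashed = True

# With a guard, the unready (or failed) module is skipped and the others are notified.
guarded = Registry(guard=True)
guarded.register("balancer")
guarded.construct("balancer", object)
guarded.register("rbd_support")
guarded.construct("rbd_support", failing_factory)
guarded.notify("config")

print(crashed)            # True
print(guarded.delivered)  # [('balancer', 'config')]
```

In this model a failed construction also leaves the entry at None, which matches the log above, where the assert fires shortly after rbd_support fails to construct.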

#8 Updated by Mykola Golub 5 months ago

  • Status changed from In Progress to Fix Under Review
  • Target version deleted (v14.2.4)
  • Backport set to nautilus
  • Pull request ID set to 30670

#9 Updated by Mykola Golub 5 months ago

  • Subject changed from ceph-mgr crashes with "Failed to run module in active mode ('rbd_support')" after upgrade from 14.2.2 -> 14.2. to "ActivePyModule.cc: 54: FAILED ceph_assert(pClassInstance != nullptr)" due to race when loading modules
