Project

General

Profile

Bug #42744

mgr/dashboard: Executing the run-backend-api-tests script results in infinite loop

Added by Laura Paduano 4 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Category:
dashboard/qa
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Test output:

2019-11-11 14:51:45,469.469 INFO:__main__:Running ['./bin/ceph', 'log', 'Starting test tasks.mgr.test_dashboard.TestDashboard.test_standby']
2019-11-11 14:51:46,778.778 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-11-11 14:51:46,797.797 INFO:__main__:Found ps line for daemon: 43098 ?        Ssl    0:05 ./bin/./ceph-mgr -i y
2019-11-11 14:51:46,798.798 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-11-11 14:51:46,815.815 INFO:__main__:Found ps line for daemon: 43098 ?        Ssl    0:05 ./bin/./ceph-mgr -i y
2019-11-11 14:51:46,815.815 INFO:__main__:Killing PID 43098 for mgr.y
2019-11-11 14:51:46,816.816 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-11-11 14:51:46,836.836 INFO:__main__:No match for mgr y
2019-11-11 14:51:46,837.837 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-11-11 14:51:46,860.860 INFO:__main__:No match for mgr y
2019-11-11 14:51:46,861.861 INFO:__main__:Running ['./bin/ceph', 'mgr', 'fail', 'y']
2019-11-11 14:51:47,815.815 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-11-11 14:51:47,835.835 INFO:__main__:Found ps line for daemon: 43128 ?        Ssl    0:05 ./bin/./ceph-mgr -i x
2019-11-11 14:51:47,836.836 INFO:__main__:Running ['ps', 'ww', '-u0']

mgr.log:

   -18> 2019-11-11T15:06:53.342+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
   -17> 2019-11-11T15:06:53.342+0000 7f0c79ffb700 20 mgr Gil Switched to new thread state 0x55872f2b4fc0
   -16> 2019-11-11T15:06:53.342+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
   -15> 2019-11-11T15:06:53.342+0000 7f0c79ffb700 20 mgr Gil Switched to new thread state 0x55872f2b4fc0
   -14> 2019-11-11T15:06:53.342+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
   -13> 2019-11-11T15:06:53.342+0000 7f0c79ffb700 20 mgr Gil Switched to new thread state 0x55872f2b4fc0
   -12> 2019-11-11T15:06:53.342+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
   -11> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr Gil Switched to new thread state 0x55872f2b4fc0
   -10> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
    -9> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr Gil Switched to new thread state 0x55872f2b4fc0
    -8> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
    -7> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr Gil Switched to new thread state 0x55872f2b4fc0
    -6> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
    -5> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr Gil Switched to new thread state 0x55872f2b4fc0
    -4> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
    -3> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr Gil Switched to new thread state 0x55872f2b4fc0
    -2> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
    -1> 2019-11-11T15:06:53.354+0000 7f0c79ffb700 -1 /ceph/src/mgr/ActivePyModule.cc: In function 'void ActivePyModule::notify(const string&, const string&)' thread 7f0c79ffb700 time 2019-11-11T15:06:53.348308+0000
/ceph/src/mgr/ActivePyModule.cc: 56: FAILED ceph_assert(pClassInstance != nullptr)

 ceph version Development (no_version) octopus (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19d) [0x7f0c90b98091]
 2: (()+0x15e5313) [0x7f0c90b98313]
 3: (ActivePyModule::notify(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x4c) [0x558727e282c4]
 4: (()+0xa0da9c) [0x558727e31a9c]
 5: (()+0xa15580) [0x558727e39580]
 6: (Context::complete(int)+0x27) [0x558727e39da3]
 7: (Finisher::finisher_thread_entry()+0x39d) [0x7f0c90acafff]
 8: (Finisher::FinisherThread::entry()+0x1c) [0x558727e3a2cc]
 9: (Thread::entry_wrapper()+0x78) [0x7f0c90b38b44]
 10: (Thread::_entry_func(void*)+0x18) [0x7f0c90b38ac2]
 11: (()+0x8f2a) [0x7f0c8ed00f2a]
 12: (clone()+0x3f) [0x7f0c8e86a4af]

     0> 2019-11-11T15:06:53.366+0000 7f0c79ffb700 -1 *** Caught signal (Aborted) **
 in thread 7f0c79ffb700 thread_name:mgr-fin

 ceph version Development (no_version) octopus (dev)
 1: (()+0xd4ef74) [0x558728172f74]
 2: (()+0x142d0) [0x7f0c8ed0c2d0]
 3: (gsignal()+0x141) [0x7f0c8e7a70d1]
 4: (abort()+0x127) [0x7f0c8e79055d]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x36c) [0x7f0c90b98260]
 6: (()+0x15e5313) [0x7f0c90b98313]
 7: (ActivePyModule::notify(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x4c) [0x558727e282c4]
 8: (()+0xa0da9c) [0x558727e31a9c]
 9: (()+0xa15580) [0x558727e39580]
 10: (Context::complete(int)+0x27) [0x558727e39da3]
 11: (Finisher::finisher_thread_entry()+0x39d) [0x7f0c90acafff]
 12: (Finisher::FinisherThread::entry()+0x1c) [0x558727e3a2cc]
 13: (Thread::entry_wrapper()+0x78) [0x7f0c90b38b44]
 14: (Thread::_entry_func(void*)+0x18) [0x7f0c90b38ac2]
 15: (()+0x8f2a) [0x7f0c8ed00f2a]
 16: (clone()+0x3f) [0x7f0c8e86a4af]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Also see: https://jenkins.ceph.com/job/ceph-dashboard-pr-backend/462/

This issue is blocking developers from executing the tests locally and it also causes the Jenkins job to fail.


Related issues

Related to fs - Bug #42299: mgr/volumes: cleanup libcephfs handles on mgr shutdown Pending Backport
Related to mgr - Bug #42981: run-backend-api-tests.sh: mgr oneshot signal handlers do not revert to killing process New
Related to mgr - Bug #38573: mgr/ActivePyModule.cc: 54: FAILED ceph_assert(pClassInstance != nullptr) Resolved
Blocks fs - Backport #42738: nautilus: mgr/volumes: cleanup libcephfs handles on mgr shutdown Need More Info

History

#1 Updated by Laura Paduano 4 months ago

  • Description updated (diff)

#2 Updated by Tatjana Dehler 4 months ago

I can also tell, it happens on various test cases, e.g.:

2019-11-11 11:44:58,589.589 INFO:__main__:ERROR: setUpClass (tasks.mgr.dashboard.test_pool.PoolTest)
2019-11-11 11:44:58,589.589 INFO:__main__:----------------------------------------------------------------------
2019-11-11 11:44:58,589.589 INFO:__main__:Traceback (most recent call last):
2019-11-11 11:44:58,589.589 INFO:__main__:  File "/ceph/qa/tasks/mgr/dashboard/helper.py", line 113, in setUpClass
2019-11-11 11:44:58,590.590 INFO:__main__:    cls._assign_ports("dashboard", "ssl_server_port")
2019-11-11 11:44:58,590.590 INFO:__main__:  File "/ceph/qa/tasks/mgr/mgr_test_case.py", line 193, in _assign_ports
2019-11-11 11:44:58,590.590 INFO:__main__:    cls.mgr_cluster.mgr_stop(mgr_id)
2019-11-11 11:44:58,590.590 INFO:__main__:  File "/ceph/qa/tasks/mgr/mgr_test_case.py", line 28, in mgr_stop
2019-11-11 11:44:58,590.590 INFO:__main__:    self.mgr_daemons[mgr_id].stop()
2019-11-11 11:44:58,590.590 INFO:__main__:  File "../qa/tasks/vstart_runner.py", line 443, in stop
2019-11-11 11:44:58,591.591 INFO:__main__:    self.daemon_type, self.daemon_id))
2019-11-11 11:44:58,591.591 INFO:__main__:MaxWhileTries: Timed out waiting for daemon mgr.x
2019-11-11 14:45:42,929.929 INFO:__main__:ERROR: setUpClass (tasks.mgr.dashboard.test_settings.SettingsTest)
2019-11-11 14:45:42,929.929 INFO:__main__:----------------------------------------------------------------------
2019-11-11 14:45:42,929.929 INFO:__main__:Traceback (most recent call last):
2019-11-11 14:45:42,930.930 INFO:__main__:  File "/ceph/qa/tasks/mgr/dashboard/helper.py", line 113, in setUpClass
2019-11-11 14:45:42,930.930 INFO:__main__:    cls._assign_ports("dashboard", "ssl_server_port")
2019-11-11 14:45:42,931.931 INFO:__main__:  File "/ceph/qa/tasks/mgr/mgr_test_case.py", line 193, in _assign_ports
2019-11-11 14:45:42,931.931 INFO:__main__:    cls.mgr_cluster.mgr_stop(mgr_id)
2019-11-11 14:45:42,931.931 INFO:__main__:  File "/ceph/qa/tasks/mgr/mgr_test_case.py", line 28, in mgr_stop
2019-11-11 14:45:42,932.932 INFO:__main__:    self.mgr_daemons[mgr_id].stop()
2019-11-11 14:45:42,932.932 INFO:__main__:  File "../qa/tasks/vstart_runner.py", line 443, in stop
2019-11-11 14:45:42,932.932 INFO:__main__:    self.daemon_type, self.daemon_id))
2019-11-11 14:45:42,932.932 INFO:__main__:MaxWhileTries: Timed out waiting for daemon mgr.x
2019-11-11 14:46:18,961.961 INFO:__main__:ERROR: setUpClass (tasks.mgr.dashboard.test_rgw.RgwDaemonTest)
2019-11-11 14:46:18,961.961 INFO:__main__:----------------------------------------------------------------------
2019-11-11 14:46:18,961.961 INFO:__main__:Traceback (most recent call last):
2019-11-11 14:46:18,961.961 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-dashboard-pr-backend/qa/tasks/mgr/dashboard/helper.py", line 113, in setUpClass
2019-11-11 14:46:18,962.962 INFO:__main__:    cls._assign_ports("dashboard", "ssl_server_port")
2019-11-11 14:46:18,962.962 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-dashboard-pr-backend/qa/tasks/mgr/mgr_test_case.py", line 193, in _assign_ports
2019-11-11 14:46:18,962.962 INFO:__main__:    cls.mgr_cluster.mgr_stop(mgr_id)
2019-11-11 14:46:18,962.962 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-dashboard-pr-backend/qa/tasks/mgr/mgr_test_case.py", line 28, in mgr_stop
2019-11-11 14:46:18,962.962 INFO:__main__:    self.mgr_daemons[mgr_id].stop()
2019-11-11 14:46:18,962.962 INFO:__main__:  File "../qa/tasks/vstart_runner.py", line 443, in stop
2019-11-11 14:46:18,962.962 INFO:__main__:    self.daemon_type, self.daemon_id))
2019-11-11 14:46:18,962.962 INFO:__main__:MaxWhileTries: Timed out waiting for daemon mgr.x
2019-11-13 09:01:54,801.801 INFO:__main__:ERROR: test_standby (tasks.mgr.test_dashboard.TestDashboard)
2019-11-13 09:01:54,801.801 INFO:__main__:----------------------------------------------------------------------
2019-11-13 09:01:54,802.802 INFO:__main__:Traceback (most recent call last):
2019-11-13 09:01:54,802.802 INFO:__main__:  File "/ceph/qa/tasks/mgr/test_dashboard.py", line 18, in setUp
2019-11-13 09:01:54,802.802 INFO:__main__:    self._assign_ports("dashboard", "ssl_server_port")
2019-11-13 09:01:54,802.802 INFO:__main__:  File "/ceph/qa/tasks/mgr/mgr_test_case.py", line 193, in _assign_ports
2019-11-13 09:01:54,802.802 INFO:__main__:    cls.mgr_cluster.mgr_stop(mgr_id)
2019-11-13 09:01:54,802.802 INFO:__main__:  File "/ceph/qa/tasks/mgr/mgr_test_case.py", line 28, in mgr_stop
2019-11-13 09:01:54,802.802 INFO:__main__:    self.mgr_daemons[mgr_id].stop()
2019-11-13 09:01:54,802.802 INFO:__main__:  File "../qa/tasks/vstart_runner.py", line 443, in stop
2019-11-13 09:01:54,802.802 INFO:__main__:    self.daemon_type, self.daemon_id))
2019-11-13 09:01:54,802.802 INFO:__main__:MaxWhileTries: Timed out waiting for daemon mgr.x

They all have `cls._assign_ports("dashboard", "ssl_server_port")` in common.

#3 Updated by Lenz Grimmer 4 months ago

Is this limited to a certain OS/platform or a generic issue?

#4 Updated by Laura Paduano 4 months ago

Lenz Grimmer wrote:

Is this limited to a certain OS/platform or a generic issue?

Seems like it's a generic issue, happened in my docker container and also in Jenkins.

#6 Updated by Kiefer Chang 4 months ago

Saw the same issue locally, waiting for mgr.x to be shutdown.

At the moment, `ceph -s` doesn't contain mgr.x. But mgr.x continue writing messages to its log:

2019-11-13T08:38:42.284+0000 7f37ce16c700 -1 mgr handle_signal *** Got signal Terminated ***
2019-11-13T08:38:42.284+0000 7f37be8a0700  1 -- [v2:10.156.43.6:6800/4506,v1:10.156.43.6:6801/4506] shutdown_connections
2019-11-13T08:38:42.284+0000 7f37be8a0700  1 -- [v2:10.156.43.6:6800/4506,v1:10.156.43.6:6801/4506] shutdown_connections
2019-11-13T08:38:42.284+0000 7f37be8a0700  1 -- [v2:10.156.43.6:6800/4506,v1:10.156.43.6:6801/4506] wait complete.
2019-11-13T08:38:42.288+0000 7f37be8a0700  4 mgr[balancer] Stopping
2019-11-13T08:38:42.288+0000 7f37be8a0700  4 mgr[devicehealth] Stopping
2019-11-13T08:38:42.288+0000 7f37be8a0700  4 mgr[pg_autoscaler] Stopping pg_autoscaler
2019-11-13T08:38:42.292+0000 7f37be8a0700  4 mgr[volumes] shutting down
2019-11-13T08:38:43.280+0000 7f37cd16a700  1 -- 10.156.43.6:0/4506 <== mon.0 v2:10.156.43.6:40917/0 20 ==== mgrmap(e 50) v1 ==== 34349+0+0 (crc 0 0 0) 0x5579e9f82000 con 0x5579eabc4d80
2019-11-13T08:38:43.280+0000 7f37cd16a700  4 mgr handle_mgr_map received map epoch 50
2019-11-13T08:38:43.280+0000 7f37cd16a700  4 mgr handle_mgr_map active in map: 1 active is 4996
2019-11-13T08:38:45.356+0000 7f37cc168700 10 monclient: tick
2019-11-13T08:38:45.356+0000 7f37cc168700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-11-13T08:38:15.358104+0000)
2019-11-13T08:38:45.356+0000 7f37cc168700 20 monclient: _un_backoff reopen_interval_multipler now 1
2019-11-13T08:38:55.356+0000 7f37cc168700 10 monclient: tick
2019-11-13T08:38:55.356+0000 7f37cc168700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-11-13T08:38:25.358368+0000)
2019-11-13T08:38:55.356+0000 7f37cc168700 20 monclient: _un_backoff reopen_interval_multipler now 1
2019-11-13T08:39:05.356+0000 7f37cc168700 10 monclient: tick
...

gdb trace for mgr.x daemon:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f37d30516a5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.30-1.1.x86_64 libblkid1-debuginfo-2.34-1.1.x86_64 libbz2-1-debuginfo-1.0.8-1.1.x86_64 libcrypt1-debuginfo-4.4.3-1.1.x86_64 libexpat1-debuginfo-2.2.8-1.1.x86_64 libffi7-debuginfo-3.2.1.git259-6.3.x86_64 libgcc_s1-debuginfo-9.1.1+r273734-1.1.x86_64 libibverbs1-debuginfo-25.1-3.1.x86_64 libltdl7-debuginfo-2.4.6-7.9.x86_64 liblttng-ust0-debuginfo-2.10.5-1.1.x86_64 liblzma5-debuginfo-5.2.4-1.5.x86_64 libnl3-200-debuginfo-3.4-1.6.x86_64 libopenssl1_1-debuginfo-1.1.1c-1.3.x86_64 libpython3_7m1_0-debuginfo-3.7.3-1.3.x86_64 librdmacm1-debuginfo-25.1-3.1.x86_64 libstdc++6-debuginfo-9.1.1+r273734-1.1.x86_64 libudev1-debuginfo-243-3.1.x86_64 libunwind-debuginfo-1.3.1-1.1.x86_64 liburcu6-debuginfo-0.10.0-2.6.x86_64 libuuid1-debuginfo-2.34-1.1.x86_64 libxml2-2-debuginfo-2.9.9-1.1.x86_64 libxmlsec1-1-debuginfo-1.2.28-1.4.x86_64 libxmlsec1-openssl1-debuginfo-1.2.28-1.4.x86_64 libxslt1-debuginfo-1.1.33-3.2.x86_64 libz1-debuginfo-1.2.11-6.1.x86_64 python3-MarkupSafe-debuginfo-1.1.1-1.4.x86_64 python3-base-debuginfo-3.7.3-1.3.x86_64 python3-cffi-debuginfo-1.13.0-1.1.x86_64 python3-cryptography-debuginfo-2.8-1.1.x86_64
(gdb) bt
#0  0x00007f37d30516a5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f37d2eacafc in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib64/libstdc++.so.6
#2  0x00007f37d3e25877 in AsyncMessenger::wait (this=0x5579eab1e000) at /ceph/src/msg/async/AsyncMessenger.cc:535
#3  0x00005579e9361ef4 in MgrStandby::main (this=this@entry=0x7ffec744fbd0, args=...) at /usr/include/c++/9/bits/unique_ptr.h:357
#4  0x00005579e9360baa in main (argc=3, argv=<optimized out>) at /usr/include/c++/9/bits/stl_algobase.h:465

#7 Updated by Laura Paduano 4 months ago

  • Description updated (diff)

#8 Updated by Ricardo Marques 4 months ago

  • Target version set to v15.0.0

#9 Updated by Ricardo Marques 4 months ago

  • Priority changed from Normal to High

#10 Updated by Patrick Donnelly 4 months ago

  • Status changed from 12 to In Progress
  • Assignee set to Patrick Donnelly
  • Start date deleted (11/11/2019)

The issue is that the modules have been deleted during shutdown but some notify finishers are still being called. I'm working on a fix.

#11 Updated by Patrick Donnelly 4 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 31620

#12 Updated by Sage Weil 3 months ago

this pr fixes the other half of the problem: https://github.com/ceph/ceph/pull/31685

#13 Updated by Patrick Donnelly 3 months ago

  • Related to Bug #42299: mgr/volumes: cleanup libcephfs handles on mgr shutdown added

#14 Updated by Patrick Donnelly 3 months ago

  • Blocked by Backport #42738: nautilus: mgr/volumes: cleanup libcephfs handles on mgr shutdown added

#15 Updated by Patrick Donnelly 3 months ago

  • Blocked by deleted (Backport #42738: nautilus: mgr/volumes: cleanup libcephfs handles on mgr shutdown)

#16 Updated by Patrick Donnelly 3 months ago

  • Blocks Backport #42738: nautilus: mgr/volumes: cleanup libcephfs handles on mgr shutdown added

#17 Updated by Patrick Donnelly 3 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to nautilus

#18 Updated by Patrick Donnelly 3 months ago

  • Status changed from Pending Backport to Resolved
  • Backport deleted (nautilus)

Nautilus backport will be tracked by #42738.

#19 Updated by Sage Weil 3 months ago

  • Related to Bug #42981: run-backend-api-tests.sh: mgr oneshot signal handlers do not revert to killing process added

#20 Updated by Sage Weil 3 months ago

  • Related to Bug #38573: mgr/ActivePyModule.cc: 54: FAILED ceph_assert(pClassInstance != nullptr) added

Also available in: Atom PDF