Bug #42744
mgr/dashboard: Executing the run-backend-api-tests script results in infinite loop
Description
Test output:
2019-11-11 14:51:45,469.469 INFO:__main__:Running ['./bin/ceph', 'log', 'Starting test tasks.mgr.test_dashboard.TestDashboard.test_standby']
2019-11-11 14:51:46,778.778 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-11-11 14:51:46,797.797 INFO:__main__:Found ps line for daemon: 43098 ? Ssl 0:05 ./bin/./ceph-mgr -i y
2019-11-11 14:51:46,798.798 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-11-11 14:51:46,815.815 INFO:__main__:Found ps line for daemon: 43098 ? Ssl 0:05 ./bin/./ceph-mgr -i y
2019-11-11 14:51:46,815.815 INFO:__main__:Killing PID 43098 for mgr.y
2019-11-11 14:51:46,816.816 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-11-11 14:51:46,836.836 INFO:__main__:No match for mgr y
2019-11-11 14:51:46,837.837 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-11-11 14:51:46,860.860 INFO:__main__:No match for mgr y
2019-11-11 14:51:46,861.861 INFO:__main__:Running ['./bin/ceph', 'mgr', 'fail', 'y']
2019-11-11 14:51:47,815.815 INFO:__main__:Running ['ps', 'ww', '-u0']
2019-11-11 14:51:47,835.835 INFO:__main__:Found ps line for daemon: 43128 ? Ssl 0:05 ./bin/./ceph-mgr -i x
2019-11-11 14:51:47,836.836 INFO:__main__:Running ['ps', 'ww', '-u0']
mgr.log:
-18> 2019-11-11T15:06:53.342+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
-17> 2019-11-11T15:06:53.342+0000 7f0c79ffb700 20 mgr Gil Switched to new thread state 0x55872f2b4fc0
-16> 2019-11-11T15:06:53.342+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
-15> 2019-11-11T15:06:53.342+0000 7f0c79ffb700 20 mgr Gil Switched to new thread state 0x55872f2b4fc0
-14> 2019-11-11T15:06:53.342+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
-13> 2019-11-11T15:06:53.342+0000 7f0c79ffb700 20 mgr Gil Switched to new thread state 0x55872f2b4fc0
-12> 2019-11-11T15:06:53.342+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
-11> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr Gil Switched to new thread state 0x55872f2b4fc0
-10> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
-9> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr Gil Switched to new thread state 0x55872f2b4fc0
-8> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
-7> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr Gil Switched to new thread state 0x55872f2b4fc0
-6> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
-5> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr Gil Switched to new thread state 0x55872f2b4fc0
-4> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
-3> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr Gil Switched to new thread state 0x55872f2b4fc0
-2> 2019-11-11T15:06:53.346+0000 7f0c79ffb700 20 mgr ~Gil Destroying new thread state 0x55872f2b4fc0
-1> 2019-11-11T15:06:53.354+0000 7f0c79ffb700 -1 /ceph/src/mgr/ActivePyModule.cc: In function 'void ActivePyModule::notify(const string&, const string&)' thread 7f0c79ffb700 time 2019-11-11T15:06:53.348308+0000
/ceph/src/mgr/ActivePyModule.cc: 56: FAILED ceph_assert(pClassInstance != nullptr)
ceph version Development (no_version) octopus (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19d) [0x7f0c90b98091]
2: (()+0x15e5313) [0x7f0c90b98313]
3: (ActivePyModule::notify(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x4c) [0x558727e282c4]
4: (()+0xa0da9c) [0x558727e31a9c]
5: (()+0xa15580) [0x558727e39580]
6: (Context::complete(int)+0x27) [0x558727e39da3]
7: (Finisher::finisher_thread_entry()+0x39d) [0x7f0c90acafff]
8: (Finisher::FinisherThread::entry()+0x1c) [0x558727e3a2cc]
9: (Thread::entry_wrapper()+0x78) [0x7f0c90b38b44]
10: (Thread::_entry_func(void*)+0x18) [0x7f0c90b38ac2]
11: (()+0x8f2a) [0x7f0c8ed00f2a]
12: (clone()+0x3f) [0x7f0c8e86a4af]
0> 2019-11-11T15:06:53.366+0000 7f0c79ffb700 -1 *** Caught signal (Aborted) **
in thread 7f0c79ffb700 thread_name:mgr-fin
ceph version Development (no_version) octopus (dev)
1: (()+0xd4ef74) [0x558728172f74]
2: (()+0x142d0) [0x7f0c8ed0c2d0]
3: (gsignal()+0x141) [0x7f0c8e7a70d1]
4: (abort()+0x127) [0x7f0c8e79055d]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x36c) [0x7f0c90b98260]
6: (()+0x15e5313) [0x7f0c90b98313]
7: (ActivePyModule::notify(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x4c) [0x558727e282c4]
8: (()+0xa0da9c) [0x558727e31a9c]
9: (()+0xa15580) [0x558727e39580]
10: (Context::complete(int)+0x27) [0x558727e39da3]
11: (Finisher::finisher_thread_entry()+0x39d) [0x7f0c90acafff]
12: (Finisher::FinisherThread::entry()+0x1c) [0x558727e3a2cc]
13: (Thread::entry_wrapper()+0x78) [0x7f0c90b38b44]
14: (Thread::_entry_func(void*)+0x18) [0x7f0c90b38ac2]
15: (()+0x8f2a) [0x7f0c8ed00f2a]
16: (clone()+0x3f) [0x7f0c8e86a4af]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Also see: https://jenkins.ceph.com/job/ceph-dashboard-pr-backend/462/
This issue blocks developers from running the tests locally, and it also causes the Jenkins job to fail.
Related issues
History
#1 Updated by Laura Paduano over 4 years ago
- Description updated (diff)
#2 Updated by Tatjana Dehler over 4 years ago
I can confirm that this happens in various test cases, e.g.:
2019-11-11 11:44:58,589.589 INFO:__main__:ERROR: setUpClass (tasks.mgr.dashboard.test_pool.PoolTest)
2019-11-11 11:44:58,589.589 INFO:__main__:----------------------------------------------------------------------
2019-11-11 11:44:58,589.589 INFO:__main__:Traceback (most recent call last):
2019-11-11 11:44:58,589.589 INFO:__main__:  File "/ceph/qa/tasks/mgr/dashboard/helper.py", line 113, in setUpClass
2019-11-11 11:44:58,590.590 INFO:__main__:    cls._assign_ports("dashboard", "ssl_server_port")
2019-11-11 11:44:58,590.590 INFO:__main__:  File "/ceph/qa/tasks/mgr/mgr_test_case.py", line 193, in _assign_ports
2019-11-11 11:44:58,590.590 INFO:__main__:    cls.mgr_cluster.mgr_stop(mgr_id)
2019-11-11 11:44:58,590.590 INFO:__main__:  File "/ceph/qa/tasks/mgr/mgr_test_case.py", line 28, in mgr_stop
2019-11-11 11:44:58,590.590 INFO:__main__:    self.mgr_daemons[mgr_id].stop()
2019-11-11 11:44:58,590.590 INFO:__main__:  File "../qa/tasks/vstart_runner.py", line 443, in stop
2019-11-11 11:44:58,591.591 INFO:__main__:    self.daemon_type, self.daemon_id))
2019-11-11 11:44:58,591.591 INFO:__main__:MaxWhileTries: Timed out waiting for daemon mgr.x

2019-11-11 14:45:42,929.929 INFO:__main__:ERROR: setUpClass (tasks.mgr.dashboard.test_settings.SettingsTest)
2019-11-11 14:45:42,929.929 INFO:__main__:----------------------------------------------------------------------
2019-11-11 14:45:42,929.929 INFO:__main__:Traceback (most recent call last):
2019-11-11 14:45:42,930.930 INFO:__main__:  File "/ceph/qa/tasks/mgr/dashboard/helper.py", line 113, in setUpClass
2019-11-11 14:45:42,930.930 INFO:__main__:    cls._assign_ports("dashboard", "ssl_server_port")
2019-11-11 14:45:42,931.931 INFO:__main__:  File "/ceph/qa/tasks/mgr/mgr_test_case.py", line 193, in _assign_ports
2019-11-11 14:45:42,931.931 INFO:__main__:    cls.mgr_cluster.mgr_stop(mgr_id)
2019-11-11 14:45:42,931.931 INFO:__main__:  File "/ceph/qa/tasks/mgr/mgr_test_case.py", line 28, in mgr_stop
2019-11-11 14:45:42,932.932 INFO:__main__:    self.mgr_daemons[mgr_id].stop()
2019-11-11 14:45:42,932.932 INFO:__main__:  File "../qa/tasks/vstart_runner.py", line 443, in stop
2019-11-11 14:45:42,932.932 INFO:__main__:    self.daemon_type, self.daemon_id))
2019-11-11 14:45:42,932.932 INFO:__main__:MaxWhileTries: Timed out waiting for daemon mgr.x

2019-11-11 14:46:18,961.961 INFO:__main__:ERROR: setUpClass (tasks.mgr.dashboard.test_rgw.RgwDaemonTest)
2019-11-11 14:46:18,961.961 INFO:__main__:----------------------------------------------------------------------
2019-11-11 14:46:18,961.961 INFO:__main__:Traceback (most recent call last):
2019-11-11 14:46:18,961.961 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-dashboard-pr-backend/qa/tasks/mgr/dashboard/helper.py", line 113, in setUpClass
2019-11-11 14:46:18,962.962 INFO:__main__:    cls._assign_ports("dashboard", "ssl_server_port")
2019-11-11 14:46:18,962.962 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-dashboard-pr-backend/qa/tasks/mgr/mgr_test_case.py", line 193, in _assign_ports
2019-11-11 14:46:18,962.962 INFO:__main__:    cls.mgr_cluster.mgr_stop(mgr_id)
2019-11-11 14:46:18,962.962 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-dashboard-pr-backend/qa/tasks/mgr/mgr_test_case.py", line 28, in mgr_stop
2019-11-11 14:46:18,962.962 INFO:__main__:    self.mgr_daemons[mgr_id].stop()
2019-11-11 14:46:18,962.962 INFO:__main__:  File "../qa/tasks/vstart_runner.py", line 443, in stop
2019-11-11 14:46:18,962.962 INFO:__main__:    self.daemon_type, self.daemon_id))
2019-11-11 14:46:18,962.962 INFO:__main__:MaxWhileTries: Timed out waiting for daemon mgr.x

2019-11-13 09:01:54,801.801 INFO:__main__:ERROR: test_standby (tasks.mgr.test_dashboard.TestDashboard)
2019-11-13 09:01:54,801.801 INFO:__main__:----------------------------------------------------------------------
2019-11-13 09:01:54,802.802 INFO:__main__:Traceback (most recent call last):
2019-11-13 09:01:54,802.802 INFO:__main__:  File "/ceph/qa/tasks/mgr/test_dashboard.py", line 18, in setUp
2019-11-13 09:01:54,802.802 INFO:__main__:    self._assign_ports("dashboard", "ssl_server_port")
2019-11-13 09:01:54,802.802 INFO:__main__:  File "/ceph/qa/tasks/mgr/mgr_test_case.py", line 193, in _assign_ports
2019-11-13 09:01:54,802.802 INFO:__main__:    cls.mgr_cluster.mgr_stop(mgr_id)
2019-11-13 09:01:54,802.802 INFO:__main__:  File "/ceph/qa/tasks/mgr/mgr_test_case.py", line 28, in mgr_stop
2019-11-13 09:01:54,802.802 INFO:__main__:    self.mgr_daemons[mgr_id].stop()
2019-11-13 09:01:54,802.802 INFO:__main__:  File "../qa/tasks/vstart_runner.py", line 443, in stop
2019-11-13 09:01:54,802.802 INFO:__main__:    self.daemon_type, self.daemon_id))
2019-11-13 09:01:54,802.802 INFO:__main__:MaxWhileTries: Timed out waiting for daemon mgr.x
They all have `cls._assign_ports("dashboard", "ssl_server_port")` in common.
#3 Updated by Lenz Grimmer over 4 years ago
Is this limited to a certain OS/platform, or is it a generic issue?
#4 Updated by Laura Paduano over 4 years ago
Lenz Grimmer wrote:
Is this limited to a certain OS/platform, or is it a generic issue?
It seems to be a generic issue; it happened in my Docker container and also in Jenkins.
#5 Updated by Sebastian Wagner over 4 years ago
- Status changed from New to 12
#6 Updated by Kiefer Chang over 4 years ago
I saw the same issue locally while waiting for mgr.x to shut down.
At that moment, `ceph -s` no longer lists mgr.x, but mgr.x keeps writing messages to its log:
2019-11-13T08:38:42.284+0000 7f37ce16c700 -1 mgr handle_signal *** Got signal Terminated ***
2019-11-13T08:38:42.284+0000 7f37be8a0700  1 -- [v2:10.156.43.6:6800/4506,v1:10.156.43.6:6801/4506] shutdown_connections
2019-11-13T08:38:42.284+0000 7f37be8a0700  1 -- [v2:10.156.43.6:6800/4506,v1:10.156.43.6:6801/4506] shutdown_connections
2019-11-13T08:38:42.284+0000 7f37be8a0700  1 -- [v2:10.156.43.6:6800/4506,v1:10.156.43.6:6801/4506] wait complete.
2019-11-13T08:38:42.288+0000 7f37be8a0700  4 mgr[balancer] Stopping
2019-11-13T08:38:42.288+0000 7f37be8a0700  4 mgr[devicehealth] Stopping
2019-11-13T08:38:42.288+0000 7f37be8a0700  4 mgr[pg_autoscaler] Stopping pg_autoscaler
2019-11-13T08:38:42.292+0000 7f37be8a0700  4 mgr[volumes] shutting down
2019-11-13T08:38:43.280+0000 7f37cd16a700  1 -- 10.156.43.6:0/4506 <== mon.0 v2:10.156.43.6:40917/0 20 ==== mgrmap(e 50) v1 ==== 34349+0+0 (crc 0 0 0) 0x5579e9f82000 con 0x5579eabc4d80
2019-11-13T08:38:43.280+0000 7f37cd16a700  4 mgr handle_mgr_map received map epoch 50
2019-11-13T08:38:43.280+0000 7f37cd16a700  4 mgr handle_mgr_map active in map: 1 active is 4996
2019-11-13T08:38:45.356+0000 7f37cc168700 10 monclient: tick
2019-11-13T08:38:45.356+0000 7f37cc168700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-11-13T08:38:15.358104+0000)
2019-11-13T08:38:45.356+0000 7f37cc168700 20 monclient: _un_backoff reopen_interval_multipler now 1
2019-11-13T08:38:55.356+0000 7f37cc168700 10 monclient: tick
2019-11-13T08:38:55.356+0000 7f37cc168700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-11-13T08:38:25.358368+0000)
2019-11-13T08:38:55.356+0000 7f37cc168700 20 monclient: _un_backoff reopen_interval_multipler now 1
2019-11-13T08:39:05.356+0000 7f37cc168700 10 monclient: tick
...
gdb trace for mgr.x daemon:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f37d30516a5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.30-1.1.x86_64 libblkid1-debuginfo-2.34-1.1.x86_64 libbz2-1-debuginfo-1.0.8-1.1.x86_64 libcrypt1-debuginfo-4.4.3-1.1.x86_64 libexpat1-debuginfo-2.2.8-1.1.x86_64 libffi7-debuginfo-3.2.1.git259-6.3.x86_64 libgcc_s1-debuginfo-9.1.1+r273734-1.1.x86_64 libibverbs1-debuginfo-25.1-3.1.x86_64 libltdl7-debuginfo-2.4.6-7.9.x86_64 liblttng-ust0-debuginfo-2.10.5-1.1.x86_64 liblzma5-debuginfo-5.2.4-1.5.x86_64 libnl3-200-debuginfo-3.4-1.6.x86_64 libopenssl1_1-debuginfo-1.1.1c-1.3.x86_64 libpython3_7m1_0-debuginfo-3.7.3-1.3.x86_64 librdmacm1-debuginfo-25.1-3.1.x86_64 libstdc++6-debuginfo-9.1.1+r273734-1.1.x86_64 libudev1-debuginfo-243-3.1.x86_64 libunwind-debuginfo-1.3.1-1.1.x86_64 liburcu6-debuginfo-0.10.0-2.6.x86_64 libuuid1-debuginfo-2.34-1.1.x86_64 libxml2-2-debuginfo-2.9.9-1.1.x86_64 libxmlsec1-1-debuginfo-1.2.28-1.4.x86_64 libxmlsec1-openssl1-debuginfo-1.2.28-1.4.x86_64 libxslt1-debuginfo-1.1.33-3.2.x86_64 libz1-debuginfo-1.2.11-6.1.x86_64 python3-MarkupSafe-debuginfo-1.1.1-1.4.x86_64 python3-base-debuginfo-3.7.3-1.3.x86_64 python3-cffi-debuginfo-1.13.0-1.1.x86_64 python3-cryptography-debuginfo-2.8-1.1.x86_64
(gdb) bt
#0  0x00007f37d30516a5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f37d2eacafc in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib64/libstdc++.so.6
#2  0x00007f37d3e25877 in AsyncMessenger::wait (this=0x5579eab1e000) at /ceph/src/msg/async/AsyncMessenger.cc:535
#3  0x00005579e9361ef4 in MgrStandby::main (this=this@entry=0x7ffec744fbd0, args=...) at /usr/include/c++/9/bits/unique_ptr.h:357
#4  0x00005579e9360baa in main (argc=3, argv=<optimized out>) at /usr/include/c++/9/bits/stl_algobase.h:465
#7 Updated by Laura Paduano over 4 years ago
- Description updated (diff)
#8 Updated by Ricardo Marques over 4 years ago
- Target version set to v15.0.0
#9 Updated by Ricardo Marques over 4 years ago
- Priority changed from Normal to High
#10 Updated by Patrick Donnelly over 4 years ago
- Status changed from 12 to In Progress
- Assignee set to Patrick Donnelly
- Start date deleted (11/11/2019)
The issue is that the modules have been deleted during shutdown but some notify finishers are still being called. I'm working on a fix.
#11 Updated by Patrick Donnelly over 4 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 31620
#12 Updated by Sage Weil over 4 years ago
This PR fixes the other half of the problem: https://github.com/ceph/ceph/pull/31685
#13 Updated by Patrick Donnelly over 4 years ago
- Related to Bug #42299: mgr/volumes: cleanup libcephfs handles on mgr shutdown added
#14 Updated by Patrick Donnelly over 4 years ago
- Blocked by Backport #42738: nautilus: mgr/volumes: cleanup libcephfs handles on mgr shutdown added
#15 Updated by Patrick Donnelly over 4 years ago
- Blocked by deleted (Backport #42738: nautilus: mgr/volumes: cleanup libcephfs handles on mgr shutdown)
#16 Updated by Patrick Donnelly over 4 years ago
- Blocks Backport #42738: nautilus: mgr/volumes: cleanup libcephfs handles on mgr shutdown added
#17 Updated by Patrick Donnelly over 4 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to nautilus
#18 Updated by Patrick Donnelly over 4 years ago
- Status changed from Pending Backport to Resolved
- Backport deleted (nautilus)
Nautilus backport will be tracked by #42738.
#19 Updated by Sage Weil over 4 years ago
- Related to Bug #42981: run-backend-api-tests.sh: mgr oneshot signal handlers do not revert to killing process added
#20 Updated by Sage Weil over 4 years ago
- Related to Bug #38573: mgr/ActivePyModule.cc: 54: FAILED ceph_assert(pClassInstance != nullptr) added
#21 Updated by Nathan Cutler over 3 years ago
- Status changed from Resolved to Pending Backport
- Backport set to nautilus
#22 Updated by Nathan Cutler over 3 years ago
- Copied to Backport #48014: nautilus: mgr/dashboard: Executing the run-backend-api-tests script results in infinite loop added
#23 Updated by Nathan Cutler over 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
#24 Updated by Ernesto Puerta almost 3 years ago
- Project changed from mgr to Dashboard
- Category changed from 151 to Testing & QA