run-backend-api-tests.sh: mgr oneshot signal handlers do not revert to killing process
On master, when running the dashboard tests, the test tries to kill the mgr. The first SIGINT triggers shutdown of the python modules, but that invariably seems to hang due to some deadlock. Subsequent kill signals fail to kill the process.
To reproduce, run
cd ../src/pybind/mgr/dashboard
./run-backend-api-tests.sh
and it will eventually time out trying to kill ceph-mgr.
The mgr log will note that the first signal was received,
2019-11-22T11:44:12.259-0600 7fef96fa4700 -1 received signal: Terminated from python ../qa/tasks/vstart_runner.py --ignore-missing-binaries tasks.mgr.test_dashboard tasks.mgr.dashboard.test_auth tasks.mgr.dashboard.test_cephfs tasks.mgr.dashboard.test_cluster_configuration tasks.mgr.dashboard.test_erasure_code_profile tasks.mgr.dashboard.test_ganesha tasks.mgr.dashboard.test_health tasks.mgr.dashboard.test_host tasks.mgr.dashboard.test_logs tasks.mgr.dashboard.test_mgr_module tasks.mgr.dashboard.test_monitor tasks.mgr.dashboard.test_orchestrator tasks.mgr.dashboard.test_osd tasks.mgr.dashboard.test_perf_counters tasks.mgr.dashboard.test_pool tasks.mgr.dashboard.test_rbd_mirroring tasks.mgr.dashboard.test_rbd tasks.mgr.dashboard.test_requests tasks.mgr.dashboard.test_rgw tasks.mgr.dashboard.test_role tasks.mgr.dashboard.test_settings tasks.mgr.dashboard.test_summary tasks.mgr.dashboard.test_user tasks.mgr.test_module_selftest (PID: 156456) UID: 1031
2019-11-22T11:44:12.259-0600 7fef96fa4700 -1 mgr handle_mgr_signal *** Got signal Terminated ***
and shutdown deadlocks (a different bug!). But sending another SIGINT fails to kill the process.
The signals are registered with
register_async_signal_handler_oneshot(SIGINT, handle_mgr_signal);
register_async_signal_handler_oneshot(SIGTERM, handle_mgr_signal);
and the oneshot sets the SA_RESETHAND flag,
act.sa_flags = SA_SIGINFO | (oneshot ? SA_RESETHAND : 0);
I tested that this works correctly with a kludge to ceph_mgr.cc (see attached), but it's not working later on for some reason!
#1 Updated by Sage Weil about 2 months ago
To clarify: the signal handler is (supposed to be) installed as a one-shot: the first SIGINT/SIGTERM will trigger the handling code, and remove the signal handler, reverting to the default, so that the next SIGINT/SIGTERM just kills the process immediately. I'm not sure why this isn't happening, but what I observe is that a SIGTERM is sent hundreds of times but is seemingly ignored.