Bug #41513 (Closed)
mgr fails to respawn in 14.2.2->nautilus upgrade
Description
Run: http://pulpito.front.sepia.ceph.com/yuriw-2019-08-22_16:46:25-upgrade:nautilus-p2p-nautilus-distro-basic-smithi/
Jobs: '4241587', '4241602', '4241597', '4241592'
Logs: http://qa-proxy.ceph.com/teuthology/yuriw-2019-08-22_16:46:25-upgrade:nautilus-p2p-nautilus-distro-basic-smithi/4241587/teuthology.log
2019-08-24T16:43:23.634 INFO:teuthology.orchestra.run.smithi180.stderr:s3tests.functional.test_s3.test_object_copy_retaining_metadata ... ok
2019-08-24T16:43:28.339 INFO:teuthology.orchestra.run.smithi180.stderr:s3tests.functional.test_s3.test_object_copy_replacing_metadata ... ok
2019-08-24T16:43:29.701 INFO:teuthology.orchestra.run.smithi180.stderr:s3tests.functional.test_s3.test_object_copy_bucket_not_found ... ok
2019-08-24T16:43:31.093 INFO:teuthology.orchestra.run.smithi180.stderr:s3tests.functional.test_s3.test_object_copy_key_not_found ... ok
2019-08-24T16:43:33.294 DEBUG:teuthology.orchestra.run:got remote process result: 124
2019-08-24T16:43:33.295 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/task/parallel.py", line 56, in task
    p.spawn(_run_spawned, ctx, confg, taskname)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 85, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 99, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 22, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/task/parallel.py", line 64, in _run_spawned
    mgr = run_tasks.run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/task/sequential.py", line 49, in task
    mgr.__enter__()
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/teuthworker/src/github.com_ceph_ceph_nautilus/qa/tasks/ceph.py", line 1639, in restart
    healthy(ctx=ctx, config=dict(cluster=cluster))
  File "/home/teuthworker/src/github.com_ceph_ceph_nautilus/qa/tasks/ceph.py", line 1477, in healthy
    manager.wait_for_clean()
  File "/home/teuthworker/src/github.com_ceph_ceph_nautilus/qa/tasks/ceph_manager.py", line 2168, in wait_for_clean
    num_active_clean = self.get_num_active_clean()
  File "/home/teuthworker/src/github.com_ceph_ceph_nautilus/qa/tasks/ceph_manager.py", line 2069, in get_num_active_clean
    pgs = self.get_pg_stats()
  File "/home/teuthworker/src/github.com_ceph_ceph_nautilus/qa/tasks/ceph_manager.py", line 1853, in get_pg_stats
    out = self.raw_cluster_cmd('pg', 'dump', '--format=json')
  File "/home/teuthworker/src/github.com_ceph_ceph_nautilus/qa/tasks/ceph_manager.py", line 1157, in raw_cluster_cmd
    stdout=StringIO(),
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 205, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 437, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 162, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 184, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed on smithi150 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph pg dump --format=json'
Updated by Yuri Weinstein over 4 years ago
- Subject changed from "https://tracker.ceph.com/issues/41257" in upgrade:nautilus-p2p-nautilus to "Timeout 120 running ceph --cluster ceph pg dump --format=json" in upgrade:nautilus-p2p-nautilus
Updated by Josh Durgin over 4 years ago
It seems in each of these nautilus-p2p parallel runs, 'ceph pg dump' was run after the ceph cli was upgraded to nautilus, but ceph-mgr was not restarted, so it was still running 14.2.2.
In each case, the command times out, but the mgr and audit log have no record of this command at the same time.
Updated by Josh Durgin over 4 years ago
Yuri is re-running with some client debugging here: http://pulpito.ceph.com/yuriw-2019-08-26_22:39:26-upgrade:nautilus-p2p:nautilus-p2p-parallel-wip_nautilus_14.2.3_RC1-distro-basic-smithi/
Updated by Josh Durgin over 4 years ago
- Project changed from Ceph to mgr
- Subject changed from "Timeout 120 running ceph --cluster ceph pg dump --format=json" in upgrade:nautilus-p2p-nautilus to mgr fails to respawn in 14.2.2->nautilus upgrade
- Category set to ceph-mgr
- Priority changed from Normal to Urgent
This appears to be a mgr bug - the mgr tries to respawn at 23:31, but the client (and cluster log) show no active mgr after that.
From the end of /a/yuriw-2019-08-26_22:39:26-upgrade:nautilus-p2p:nautilus-p2p-parallel-wip_nautilus_14.2.3_RC1-distro-basic-smithi/4255489/remote/smithi071/log/ceph-mgr.x.log.gz
2019-08-26 23:31:35.309 7f2ba1146700 1 -- 172.21.15.71:0/10717 <== mon.1 v2:172.21.15.71:3300/0 9 ==== mgrmap(e 4) v1 ==== 14119+0+0 (crc 0 0 0) 0x563a8c3df080 con 0x563a8c279000
2019-08-26 23:31:35.309 7f2ba1146700 4 mgr ms_dispatch active mgrmap(e 4) v1
2019-08-26 23:31:35.309 7f2ba1146700 4 mgr handle_mgr_map received map epoch 4
2019-08-26 23:31:35.309 7f2ba1146700 4 mgr handle_mgr_map active in map: 1 active is 4101
2019-08-26 23:31:35.310 7f2ba1146700 1 mgr respawn e: 'ceph-mgr'
2019-08-26 23:31:35.310 7f2ba1146700 1 mgr respawn 0: 'ceph-mgr'
2019-08-26 23:31:35.310 7f2ba1146700 1 mgr respawn 1: '-f'
2019-08-26 23:31:35.310 7f2ba1146700 1 mgr respawn 2: '--cluster'
2019-08-26 23:31:35.310 7f2ba1146700 1 mgr respawn 3: 'ceph'
2019-08-26 23:31:35.310 7f2ba1146700 1 mgr respawn 4: '-i'
2019-08-26 23:31:35.310 7f2ba1146700 1 mgr respawn 5: 'x'
2019-08-26 23:31:35.310 7f2ba1146700 1 mgr respawn respawning with exe /usr/bin/ceph-mgr;5d646944 (deleted)
2019-08-26 23:31:35.310 7f2ba1146700 1 mgr respawn exe_path /proc/self/exe
I see no sign of the mgr reconnecting after this point in mon.a's log either.
Updated by Josh Durgin over 4 years ago
Reproduced with a manual run. Automatic respawning failed with no trace, but starting ceph-mgr again manually worked fine.
Updated by Kefu Chai over 4 years ago
2019-08-26T23:21:21.949 INFO:teuthology.orchestra.run.smithi071.stdout: ceph-mgr.x86_64 2:14.2.2-469.ge157074.el7
...
2019-08-26T23:21:21.951 INFO:teuthology.orchestra.run.smithi071.stdout: librados2.x86_64 2:14.2.2-469.ge157074.el7
...
2019-08-26T23:31:35.431 INFO:tasks.ceph.mgr.x.smithi071.stderr:ceph-mgr: symbol lookup error: ceph-mgr: undefined symbol: _ZTIN4ceph6buffer5errorE
$ c++filt _ZTIN4ceph6buffer5errorE
typeinfo for ceph::buffer::error

master|$ nm -C ./lib/libceph-common.so | grep 'typeinfo for ceph::buffer.*error'
0000000002299df8 V typeinfo for ceph::buffer::v14_2_0::error_code
0000000002299e58 V typeinfo for ceph::buffer::v14_2_0::error
- in 14.2.2, mgr expects ceph::buffer::error in libceph-common.so
- in nautilus (14.2.3), we have 4ac8bc79e7fc128f10b38457c93b092b5c8060ad backported in https://github.com/ceph/ceph/pull/29244, so ceph::buffer::error is now actually exported as "ceph::buffer::v14_2_0::error".
And the way we locate the executable for respawn prevents ceph-mgr from using the newly installed binary:
/* Determine the path to our executable, test if Linux /proc/self/exe exists.
 * This allows us to exec the same executable even if it has since been
 * unlinked.
 */
char exe_path[PATH_MAX] = "";
if (readlink(PROCPREFIX "/proc/self/exe", exe_path, PATH_MAX-1) == -1) {
  /* Print CWD for the user's interest */
  char buf[PATH_MAX];
  char *cwd = getcwd(buf, sizeof(buf));
  ceph_assert(cwd);
  dout(1) << " cwd " << cwd << dendl;
  /* Fall back to a best-effort: just running in our CWD */
  strncpy(exe_path, orig_argv[0], PATH_MAX-1);
} else {
  dout(1) << "respawning with exe " << exe_path << dendl;
  strcpy(exe_path, PROCPREFIX "/proc/self/exe");
}
That's why the log shows:
2019-08-26 23:31:35.310 7f2ba1146700 1 mgr respawn respawning with exe /usr/bin/ceph-mgr;5d646944 (deleted)
we were execv'ing an already deleted executable, like reviving a zombie...
Updated by Kefu Chai over 4 years ago
- Status changed from New to 12
- Assignee set to Kefu Chai
Updated by Patrick Donnelly over 4 years ago
IRC conversation:
2019-08-28 08:08:45 batrick joshd: kefu: it looks like the issue is that the old mgr executable can't use the new ceph-common shared library, right?
2019-08-28 08:09:00 joshd yes
2019-08-28 08:09:08 batrick mds will probably have similar issues then
2019-08-28 08:09:27 agil pg 4.1e is stuck inactive for 60643.351410, current state unknown, last acting []
2019-08-28 08:09:32 batrick at the time it seemed like a good use of /proc/self/exe to prevent accidental upgrades of the mds during respawn
2019-08-28 08:09:37 agil any hint on how to remove them?
2019-08-28 08:09:39 joshd batrick: can do the same fix there if you like
2019-08-28 08:10:57 batrick it doesn't matter too much
2019-08-28 08:11:10 batrick ceph should really run containerized where this would be a non-issue
2019-08-28 08:23:51 batrick joshd: looking at this test more closely
2019-08-28 08:24:04 batrick why are we upgrading the cluster starting with mds, then osd, then mon?
2019-08-28 08:24:18 batrick /ceph/teuthology-archive/yuriw-2019-08-26_22:39:26-upgrade:nautilus-p2p:nautilus-p2p-parallel-wip_nautilus_14.2.3_RC1-distro-basic-smithi/4255489
2019-08-28 08:24:28 joshd batrick: we do different orders in different suites to verify no crashes
2019-08-28 08:24:42 joshd even if some of them aren't the ideal order
2019-08-28 08:26:09 batrick looks like the MgrMap modules changed which triggered the respawn
2019-08-28 08:26:59 batrick it concerns me that this test doesn't even have an explicit restart of the mgr to upgrade
2019-08-28 08:26:59 joshd that's right
2019-08-28 08:27:10 batrick but the mgr is upgrading as a side-effect of the monitors updating the mgrmap
2019-08-28 08:27:26 joshd I agree, I think it's an oversight
2019-08-28 08:28:04 batrick since this /proc/self/exe has not been a problem until now, perhaps the right fix is to update the upgrade sequence to restart the mgr
2019-08-28 08:28:35 batrick separately maybe we can discuss changing the execve executable if it's not desirable
2019-08-28 08:29:31 joshd that would be even simpler
2019-08-28 08:29:55 joshd I think that's a good idea, since in practice systemd will end up restarting the failed process as well