Bug #43914
nautilus: ceph tell command times out
Description
2020-01-30T04:08:15.082 INFO:teuthology.orchestra.run.smithi067:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.6 flush_pg_stats
2020-01-30T04:10:15.119 DEBUG:teuthology.orchestra.run:got remote process result: 124
2020-01-30T04:10:15.119 ERROR:teuthology.run_tasks:Manager failed: thrashosds
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 159, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-sage-testing-2020-01-29-1034/qa/tasks/thrashosds.py", line 216, in task
    cluster_manager.flush_all_pg_stats()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-sage-testing-2020-01-29-1034/qa/tasks/ceph_manager.py", line 1491, in flush_all_pg_stats
    self.flush_pg_stats(range(len(self.get_osd_dump())))
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-sage-testing-2020-01-29-1034/qa/tasks/ceph_manager.py", line 1467, in flush_pg_stats
    for osd in osds}
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-sage-testing-2020-01-29-1034/qa/tasks/ceph_manager.py", line 1467, in <dictcomp>
    for osd in osds}
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-sage-testing-2020-01-29-1034/qa/tasks/ceph_manager.py", line 1353, in raw_cluster_cmd
    stdout=StringIO(),
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 198, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 433, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 158, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 180, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed on smithi067 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.6 flush_pg_stats'
/a/sage-2020-01-29_20:14:58-rados-wip-sage-testing-2020-01-29-1034-distro-basic-smithi/4718262
Notably, these commands normally log profusely to stderr, but this one did not, nor did it generate a log file.
History
#1 Updated by Sage Weil over 3 years ago
- Status changed from New to Need More Info
#2 Updated by Neha Ojha over 3 years ago
This is on nautilus: /a/nojha-2020-02-21_20:34:10-upgrade:mimic-x:stress-split-nautilus-distro-basic-smithi/4788575/
CommandFailedError: Command failed on smithi110 with status 124: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.4 flush_pg_stats'
#3 Updated by Neha Ojha over 3 years ago
First observation from teuthology.log for /a/nojha-2020-02-21_20:34:10-upgrade:mimic-x:stress-split-nautilus-distro-basic-smithi/4788575: we see "ceph tell osd.4 flush_pg_stats" being called thrice at different stages and note that the third one fails.
2020-02-21T21:10:06.676 INFO:teuthology.orchestra.run.smithi110:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.4 flush_pg_stats
2020-02-21T21:13:02.778 INFO:teuthology.orchestra.run.smithi110:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.4 flush_pg_stats
2020-02-21T23:17:49.733 INFO:teuthology.orchestra.run.smithi110:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.4 flush_pg_stats
From ceph-osd.4's log, only the first two have been received:
2020-02-21 21:10:06.752 7f77078f7700 1 -- 172.21.15.58:6804/12942 <== client.4234 172.21.15.110:0/4111042505 2 ==== command(tid 2: {"prefix": "flush_pg_stats"}) v1 ==== 52+0+0 (878328554 0 0) 0x3b1ac00 con 0x3da4300
2020-02-21 21:10:06.752 7f77078f7700 20 osd.4 20 OSD::ms_dispatch: command(tid 2: {"prefix": "flush_pg_stats"}) v1
2020-02-21 21:13:02.857 7fc329a43700 1 -- [v2:172.21.15.58:6800/15244,v1:172.21.15.58:6801/15244] <== client.24271 v1:172.21.15.110:0/2263514793 2 ==== command(tid 2: {"prefix": "flush_pg_stats"}) v1 ==== 52+0+0 (unknown 878328554 0 0) 0xd6f72a0 con 0xdafc880
2020-02-21 21:13:02.857 7fc329a43700 20 osd.4 28 OSD::ms_dispatch: command(tid 2: {"prefix": "flush_pg_stats"}) v1
#4 Updated by Neha Ojha over 3 years ago
This run has more relevant information: /a/nojha-2020-02-26_03:20:34-upgrade:mimic-x:stress-split-nautilus-distro-basic-smithi/4802539
Even here, the third flush_pg_stats timed out.
From the teuthology log:
2020-02-26T03:56:16.776 INFO:teuthology.orchestra.run.smithi151:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.6 flush_pg_stats
2020-02-26T03:58:33.442 INFO:teuthology.orchestra.run.smithi151:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.6 flush_pg_stats
2020-02-26T05:33:06.811 INFO:teuthology.orchestra.run.smithi151:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph tell osd.6 flush_pg_stats
From osd.6's log:
2020-02-26 03:56:16.903 7f880b8c8700 1 -- 172.21.15.83:6812/10841 <== client.4261 172.21.15.151:0/3358669268 2 ==== command(tid 2: {"prefix": "flush_pg_stats"}) v1 ==== 52+0+0 (3076564282 0 0) 0x55a5bee6bc00 con 0x55a5beecf100
2020-02-26 03:58:33.566 7f3d48187700 1 -- [v2:172.21.15.83:6816/13319,v1:172.21.15.83:6817/13319] <== client.34221 v1:172.21.15.151:0/810849350 2 ==== command(tid 2: {"prefix": "flush_pg_stats"}) v1 ==== 52+0+0 (unknown 3076564282 0 0) 0x55add43b8760 con 0x55adf1979680
2020-02-26 05:34:24.472 7f9ae52d7700 1 -- [v2:172.21.15.83:6816/59862,v1:172.21.15.83:6817/59862] <== client.58053 v1:172.21.15.151:0/3801625840 17 ==== command(tid 2: {"prefix": "flush_pg_stats"}) v1 ==== 52+0+0 (unknown 3076564282 0 0) 0x5612f814c780 con 0x5612e91cdf80
Note the timestamps of when each command was sent versus when the OSD processed it. The third command took much longer for the OSD to see, though the OSD did respond once it saw it:
2020-02-26 05:35:10.691 7f9acfc53700 20 osd.6 3705 do_command tid 2 [{"prefix": "flush_pg_stats"}]
2020-02-26 05:35:10.691 7f9acfc53700 1 -- [v2:172.21.15.83:6816/59862,v1:172.21.15.83:6817/59862] --> [v2:172.21.15.151:6832/14617,v1:172.21.15.151:6833/14617] -- pg_stats(62 pgs tid 0 v 0) v2 -- 0x5612f0236300 con 0x5612ea043b00
The third command was sent at 05:33:06 but do_command only ran at 05:35:10, roughly 124 seconds later; since the command is wrapped in timeout 120, the test failed.
#5 Updated by Sage Weil over 3 years ago
The thread (or lock?) is busy with
2020-02-26 05:34:24.432 7f9ae0474700 20 osd.6 3677 OSD::ms_dispatch: osd_map(3200..3213 src has 1..3213) v4
2020-02-26 05:34:24.436 7f9ae0474700 10 osd.6 3705 do_waiters -- start
2020-02-26 05:34:24.436 7f9ae0474700 10 osd.6 3705 handle_pg_create osd_pg_create(e3678 36.2:2247 36.7:2247 36.b:2247 36.d:2247 36.13:2247 36.1d:2247 36.1e:2247 36.1f:2247 36.20:2247 36.28:2247 36.31:2247 36.3e:2247 37.6:2249 37.c:2249 37.1b:2249 37.1c:2249 37.35:2249 37.37:2249 37.3c:2249 38.2:2251 38.c:2251 38.f:
2020-02-26 05:34:24.436 7f9ae0474700 15 osd.6 3705 require_same_or_newer_map 3678 (i am 3705) 0x5612f45a7600
2020-02-26 05:34:24.436 7f9ae0474700 20 osd.6 3705 mkpg 36.2 e2247@2020-02-26 04:58:03.765875
2020-02-26 05:34:24.436 7f9ae0474700 10 osd.6 3705 build_initial_pg_history 36.2 created 2247
2020-02-26 05:34:24.436 7f9ae0474700 20 osd.6 3705 get_map 2247 - loading and decoding 0x5612ea0a8800
2020-02-26 05:34:24.436 7f9ae0474700 10 osd.6 3705 add_map_bl 2247 5240 bytes
2020-02-26 05:34:24.436 7f9ae0474700 20 osd.6 3705 get_map 2248 - loading and decoding 0x5612ec07c300
2020-02-26 05:34:24.440 7f9ae0474700 10 osd.6 3705 add_map_bl 2248 5251 bytes
...
2020-02-26 05:35:10.691 7f9ae0474700 10 osd.6 3705 add_map_bl 3702 7769 bytes
2020-02-26 05:35:10.691 7f9ae0474700 20 osd.6 3705 get_map 3703 - loading and decoding 0x5612ea531700
2020-02-26 05:35:10.691 7f9ae0474700 10 osd.6 3705 add_map_bl 3703 7769 bytes
2020-02-26 05:35:10.691 7f9ae0474700 20 osd.6 3705 get_map 3704 - loading and decoding 0x5612ea15d900
2020-02-26 05:35:10.691 7f9ae0474700 10 osd.6 3705 add_map_bl 3704 7769 bytes
2020-02-26 05:35:10.691 7f9ae0474700 20 osd.6 3705 build_initial_pg_history check_new_interval interval(2281-2659 up [6,4](6) acting [6,4](6)) up_thru 2655 up_from 614 last_epoch_clean 0 interval(2281-2659 up [6,4](6) acting [6,4](6) maybe_went_rw) : primary up 614-2655 includes interval
check_new_interval interval(2660-2680 up [4](4) acting [4](4)) up_thru 2671 up_from 1900 last_epoch_clean 0 interval(2660-2680 up [4](4) acting [4](4) maybe_went_rw) : primary up 1900-2671 includes interval
check_new_interval interval(2681-3100 up [6,4](6) acting [6,4](6)) up_thru 3025 up_from 2681 last_epoch_clean 0 interval(2681-3100 up [6,4](6) acting [6,4](6) maybe_went_rw) : primary up 2681-3025 includes interval
check_new_interval interval(3101-3118 up [8,4](8) acting [8,4](8)) up_thru 3101 up_from 1908 last_epoch_clean 0 interval(3101-3118 up [8,4](8) acting [8,4](8) maybe_went_rw) : primary up 1908-3101 includes interval
check_new_interval interval(3119-3129 up [8,9](8) acting [8,9](8)) up_thru 3122 up_from 1908 last_epoch_clean 0 interval(3119-3129 up [8,9](8) acting [8,9](8) maybe_went_rw) : primary up 1908-3122 includes interval
check_new_interval interval(3130-3134 up [6,8](6) acting [6,8](6)) up_thru 3025 up_from 2681 last_epoch_clean 0 interval(3130-3134 up [6,8](6) acting [6,8](6)) : primary up 2681-3025 does not include interval
Right after that, the do_command thread picks it up and does the work. It is probably contended on osd_lock or something, but the mkpg handling taking that long is the real problem.
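A minimal, self-contained sketch of the suspected pattern (a toy model, not Ceph source; the lock name, epoch count, and sleep durations are illustrative assumptions scaled down from the log): a dispatch-style thread holds one global lock while replaying ~1,450 historical maps, so a command worker that needs the same lock cannot run before the client's deadline.

// Toy model, not Ceph source: a dispatch-style thread holds one global lock while
// doing long pg-create work, so a command worker that needs the same lock stalls
// past the client's deadline. Names, counts, and sleeps are illustrative.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

using namespace std::chrono_literals;

std::mutex global_lock;          // stand-in for OSD::osd_lock
std::mutex done_mtx;
std::condition_variable done_cv;
bool command_done = false;

void dispatch_thread() {
  std::lock_guard<std::mutex> l(global_lock);
  // Simulate handle_pg_create/build_initial_pg_history replaying the old maps
  // (epochs 2247..3704 in the log): ~46 s there, scaled down to ~1.5 s here.
  for (int epoch = 2247; epoch <= 3704; ++epoch) {
    std::this_thread::sleep_for(1ms);
  }
}

void command_thread() {
  std::lock_guard<std::mutex> l(global_lock);  // blocks until the dispatcher is done
  // This is where a do_command("flush_pg_stats")-style handler would finally run.
  {
    std::lock_guard<std::mutex> dl(done_mtx);
    command_done = true;
  }
  done_cv.notify_one();
}

int main() {
  std::thread dispatcher(dispatch_thread);
  std::this_thread::sleep_for(10ms);   // the tell command arrives while dispatch is busy
  std::thread commander(command_thread);

  // The client gives up after a fixed deadline, like timeout 120 ceph tell ...
  // (120 s in the test, scaled down to 500 ms here).
  std::unique_lock<std::mutex> dl(done_mtx);
  bool in_time = done_cv.wait_for(dl, 500ms, [] { return command_done; });
  dl.unlock();  // let the command thread finish so the joins below do not deadlock
  std::cout << (in_time ? "command completed in time\n"
                        : "client timed out (exit status 124 in the real test)\n");

  dispatcher.join();
  commander.join();
  return 0;
}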
#6 Updated by Sage Weil over 3 years ago
So, this was fixed in nautilus, in the sense that https://github.com/ceph/ceph/pull/27696 went into nautilus.
#7 Updated by Sage Weil over 3 years ago
Okay, yeah: it's because the command wq uses osd_lock:
void _process(Command *c, ThreadPool::TPHandle &) override {
  osd->osd_lock.lock();
  if (osd->is_stopping()) {
    osd->osd_lock.unlock();
    delete c;
    return;
  }
  osd->do_command(c->con.get(), c->tid, c->cmd, c->indata);
  osd->osd_lock.unlock();
  delete c;
}
Meanwhile, on the pg_create handling side, it's an MOSDPGCreate (not MOSDPGCreate2), because require_osd_release < NAUTILUS. So, AFAICS, this problem goes away (a toy contrast is sketched after this list):
(1) once the upgrade to nautilus completes, because osd_lock is not needed for MOSDPGCreate2 handling,
(2) even more completely with octopus, because the mon is passing along the past_intervals, and
(3) if the cluster is not being thrashed for extended periods while a pool is being created (it's the pool creation epoch being 1000+ epochs in the past that led to this).
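For contrast, a toy sketch of the shape described in (1) and (2) (again not Ceph source; the struct, field names, and timings are illustrative assumptions): the create message ships precomputed past intervals and is handled without the global lock, so a command that needs that lock is not starved.

// Toy contrast, not Ceph source: a fast-dispatch-style create handler that works
// without the global lock, with the intervals shipped in the message instead of
// being rebuilt from ~1,400 historical maps. Names and types are illustrative.
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using namespace std::chrono_literals;

std::mutex global_lock;  // stand-in for osd_lock; only the command path takes it here

// Octopus-style create: the mon ships the past intervals (illustrative struct).
struct PGCreate2Like {
  int pool;
  std::vector<std::pair<int, int>> past_intervals;  // precomputed [first, last] epoch ranges
};

void fast_dispatch_create(const PGCreate2Like &m) {
  // No global lock and no per-epoch map replay: the PG can be instantiated
  // directly from the shipped intervals.
  std::this_thread::sleep_for(1ms);
  std::cout << "created pgs for pool " << m.pool << " from "
            << m.past_intervals.size() << " shipped intervals\n";
}

void command_thread() {
  std::lock_guard<std::mutex> l(global_lock);  // nothing long-running holds it now
  std::cout << "flush_pg_stats-style command ran well within its deadline\n";
}

int main() {
  PGCreate2Like m{36, {{2281, 2659}, {2660, 2680}, {2681, 3100}}};
  std::thread creator(fast_dispatch_create, m);
  std::thread commander(command_thread);
  creator.join();
  commander.join();
  return 0;
}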
#8 Updated by Sage Weil over 3 years ago
- Status changed from Need More Info to Fix Under Review
- Pull request ID set to 33560
#9 Updated by Neha Ojha over 3 years ago
- Subject changed from ceph tell command times out to nautilus: ceph tell command times out
#10 Updated by Nathan Cutler over 3 years ago
- Status changed from Fix Under Review to Resolved