Bug #48030: mon/caps.sh: mgr command(pg dump) waits forever due to rados_mon_op_timeout not getting set correctly - RADOS - Ceph

closed

Added by Neha Ojha over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Urgent
Assignee: Patrick Donnelly
Category:
Target version: Ceph - v16.0.0
% Done:
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID: 38358
Crash signature (v1):
Crash signature (v2):

Description

2020-10-28T11:57:03.059 INFO:tasks.mon_thrash:Sending STOP to mon f
2020-10-28T11:57:03.060 INFO:tasks.ceph.mon.f:Sent signal 19
2020-10-28T11:57:03.060 INFO:tasks.mon_thrash.mon_thrasher:waiting for 15.0 secs to unfreeze mons
2020-10-28T11:57:16.940 DEBUG:teuthology.orchestra.run:got remote process result: 124
2020-10-28T11:57:16.940 INFO:tasks.workunit:Stopping ['mon/pool_ops.sh', 'mon/crush_ops.sh', 'mon/osd.sh', 'mon/caps.sh'] on client.0...
2020-10-28T11:57:16.941 INFO:teuthology.orchestra.run.smithi001:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2020-10-28T11:57:17.136 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 90, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 69, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph_master/qa/tasks/workunit.py", line 134, in task
    coverage_and_limits=not config.get('no_coverage_and_limits', None))
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph_master/qa/tasks/workunit.py", line 425, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 215, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 446, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 160, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 182, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test mon/caps.sh) on smithi001 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=b28174f13f6751174b9367d264ece135eff641ed TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/mon/caps.sh'

/a/teuthology-2020-10-28_07:01:02-rados-master-distro-basic-smithi/5567245

Updated by Neha Ojha over 3 years ago

/a/teuthology-2020-11-04_07:01:02-rados-master-distro-basic-smithi/5590040 looks similar

Updated by Neha Ojha over 3 years ago

Priority changed from Normal to High

/a/teuthology-2020-11-04_07:01:02-rados-master-distro-basic-smithi/5590078

Updated by Neha Ojha over 3 years ago

Priority changed from High to Urgent

Fails on every run

https://pulpito.ceph.com/teuthology-2020-11-08_07:01:02-rados-master-distro-basic-smithi/
https://pulpito.ceph.com/teuthology-2020-11-09_07:01:01-rados-master-distro-basic-smithi/

Updated by Neha Ojha over 3 years ago

Fails deterministically: https://pulpito.ceph.com/nojha-2020-11-09_22:09:31-rados:monthrash-master-distro-basic-smithi/

Removed msgr failure injection and only ran mon/caps.sh [rados:monthrash/{ceph clusters/3-mons mon_election/classic msgr/async-v2only objectstore/bluestore-stupid rados supported-random-distro$/{ubuntu_latest} thrashers/many workloads/rados_mon_workunits}] but it still fails https://pulpito.ceph.com/nojha-2020-11-10_20:16:13-rados:monthrash-master-distro-basic-smithi/. Need to dig into the logs.

Updated by Deepika Upadhyay over 3 years ago

Backport set to octopus

https://pulpito.ceph.com/yuriw-2020-11-10_19:24:45-rados-wip-yuri4-testing-2020-11-10-0959-distro-basic-smithi/
https://trello.com/c/ehTuoslB/1063-wip-yuri4-testing-2020-11-10-0959-old-wip-yuri4-testing-2020-11-09-1025

Updated by Neha Ojha over 3 years ago

Backport deleted (~~octopus~~)

Deepika Upadhyay wrote:

seeing on octopus as well:
https://pulpito.ceph.com/yuriw-2020-11-10_19:24:45-rados-wip-yuri4-testing-2020-11-10-0959-distro-basic-smithi/
https://trello.com/c/ehTuoslB/1063-wip-yuri4-testing-2020-11-10-0959-old-wip-yuri4-testing-2020-11-09-1025

Deepika: How is this octopus?

Updated by Deepika Upadhyay over 3 years ago

Ah, I was working alongside the octopus batch and might have confused the two, sorry.

Updated by Sridhar Seshasayee over 3 years ago

Assignee set to Sridhar Seshasayee

I am assigning this to myself. Looking into the logs.

Updated by Sridhar Seshasayee over 3 years ago

Updating the findings so far from logs under https://pulpito.ceph.com/nojha-2020-11-10_20:16:13-rados:monthrash-master-distro-basic-smithi/.

1. During the mon thrash tests the following workunit (mon/caps.sh) is kicked off.
Note that the timeout for this script is set to 3 hours.

2020-11-10T20:40:14.708 INFO:tasks.workunit:Running workunit mon/caps.sh...
2020-11-10T20:40:14.708 INFO:teuthology.orchestra.run.smithi118:workunit test mon/caps.sh> mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=f0f965e563334578e1aa1b4c4389210994e1ec54 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/mon/caps.sh
2020-11-10T20:40:14.761 INFO:tasks.workunit.client.0.smithi118.stderr:+ tmp=/tmp/cephtest-mon-caps-madness
2020-11-10T20:40:14.761 INFO:tasks.workunit.client.0.smithi118.stderr:+ exit_on_error=1
2020-11-10T20:40:14.761 INFO:tasks.workunit.client.0.smithi118.stderr:+ [[ ! -z '' ]]
2020-11-10T20:40:14.762 INFO:tasks.workunit.client.0.smithi118.stderr:++ uname
2020-11-10T20:40:14.763 INFO:tasks.workunit.client.0.smithi118.stderr:+ '[' Linux = FreeBSD ']'
2020-11-10T20:40:14.763 INFO:tasks.workunit.client.0.smithi118.stderr:+ ETIMEDOUT=110
2020-11-10T20:40:14.763 INFO:tasks.workunit.client.0.smithi118.stderr:+ expect 'ceph auth get-or-create client.bazar > /tmp/cephtest-mon-caps-madness.bazar.keyring' 0
2020-11-10T20:40:14.763 INFO:tasks.workunit.client.0.smithi118.stderr:+ cmd='ceph auth get-or-create client.bazar > /tmp/cephtest-mon-caps-madness.bazar.keyring'
2020-11-10T20:40:14.763 INFO:tasks.workunit.client.0.smithi118.stderr:+ expected_ret=0
...
...

2. The script executes the "pg dump" command with rados_mon_op_timeout set to 300 secs and expects it to fail with ETIMEDOUT (110):

2020-11-10T20:40:18.211 INFO:tasks.workunit.client.0.smithi118.stderr:+ export CEPH_ARGS=--rados-mon-op-timeout=300
2020-11-10T20:40:18.211 INFO:tasks.workunit.client.0.smithi118.stderr:+ CEPH_ARGS=--rados-mon-op-timeout=300
2020-11-10T20:40:18.212 INFO:tasks.workunit.client.0.smithi118.stderr:+ expect 'ceph -k /tmp/cephtest-mon-caps-madness.foo.keyring --user foo pg dump' 110
2020-11-10T20:40:18.212 INFO:tasks.workunit.client.0.smithi118.stderr:+ cmd='ceph -k /tmp/cephtest-mon-caps-madness.foo.keyring --user foo pg dump'
2020-11-10T20:40:18.212 INFO:tasks.workunit.client.0.smithi118.stderr:+ expected_ret=110
2020-11-10T20:40:18.212 INFO:tasks.workunit.client.0.smithi118.stderr:+ echo ceph -k /tmp/cephtest-mon-caps-madness.foo.keyring --user foo pg dump
2020-11-10T20:40:18.212 INFO:tasks.workunit.client.0.smithi118.stderr:+ eval ceph -k /tmp/cephtest-mon-caps-madness.foo.keyring --user foo pg dump
2020-11-10T20:40:18.212 INFO:tasks.workunit.client.0.smithi118.stdout:ceph -k /tmp/cephtest-mon-caps-madness.foo.keyring --user foo pg dump
...
...

3. But the command appears to be stuck indefinitely: with no response from
"pg dump", the script runs for the full 3 hours (instead of 300 secs) and is
then forcibly killed (SIGTERM) with a return status of 124, as shown below:

2020-11-10T23:40:10.426 INFO:tasks.mon_thrash.mon_thrasher:waiting for 20.0 secs to unfreeze mons
2020-11-10T23:40:14.763 DEBUG:teuthology.orchestra.run:got remote process result: 124
2020-11-10T23:40:14.763 INFO:tasks.workunit:Stopping ['mon/caps.sh'] on client.0...
2020-11-10T23:40:14.764 INFO:teuthology.orchestra.run.smithi118:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2020-11-10T23:40:14.983 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 90, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 69, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_neha-ojha_ceph_wip-48030/qa/tasks/workunit.py", line 134, in task
    coverage_and_limits=not config.get('no_coverage_and_limits', None))
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_neha-ojha_ceph_wip-48030/qa/tasks/workunit.py", line 425, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 215, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 446, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 160, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 182, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test mon/caps.sh) on smithi118 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=f0f965e563334578e1aa1b4c4389210994e1ec54 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/mon/caps.sh'

4. The mon.b log shows the last command, "osd dump", being rejected with an
"access denied" status as expected. After this, the "pg dump" command is not
logged anywhere (audit, mon or mgr logs).

2020-11-10T20:40:18.207+0000 7f91b5803700  1 -- [v2:172.21.15.118:3300/0,v1:172.21.15.118:6789/0] <== client.? 172.21.15.118:0/224822075 6 ==== mon_command({"prefix": "osd dump"} v 0) v1 ==== 64+0+0 (secure 0 0 0) 0x559d472fcb80 con 0x559d464af400
2020-11-10T20:40:18.207+0000 7f91b5803700 20 mon.b@0(leader) e1 _ms_dispatch existing session 0x559d47268f40 for client.?
2020-11-10T20:40:18.207+0000 7f91b5803700 20 mon.b@0(leader) e1  entity client.foo caps allow command "auth ls", allow command quorum_status
2020-11-10T20:40:18.207+0000 7f91b5803700  0 mon.b@0(leader) e1 handle_command mon_command({"prefix": "osd dump"} v 0) v1
2020-11-10T20:40:18.207+0000 7f91b5803700 20 is_capable service=osd command=osd dump read addr 172.21.15.118:0/224822075 on cap allow command "auth ls", allow command quorum_status
2020-11-10T20:40:18.207+0000 7f91b5803700 20  allow so far , doing grant allow command "auth ls" 
2020-11-10T20:40:18.207+0000 7f91b5803700 20  allow so far , doing grant allow command quorum_status
2020-11-10T20:40:18.207+0000 7f91b5803700 10 mon.b@0(leader) e1 _allowed_command not capable
2020-11-10T20:40:18.207+0000 7f91b5803700  1 mon.b@0(leader) e1 handle_command access denied
2020-11-10T20:40:18.207+0000 7f91b5803700  0 log_channel(audit) log [DBG] : from='client.? 172.21.15.118:0/224822075' entity='client.foo' cmd=[{"prefix": "osd dump"}]:  access denied
2020-11-10T20:40:18.207+0000 7f91b5803700  1 -- [v2:172.21.15.118:3300/0,v1:172.21.15.118:6789/0] --> [v2:172.21.15.118:3300/0,v1:172.21.15.118:6789/0] -- log(1 entries from seq 172 at 2020-11-10T20:40:18.210009+0000) v1 -- 0x559d47214c40 con 0x559d464acc00
2020-11-10T20:40:18.207+0000 7f91b5803700  2 mon.b@0(leader) e1 send_reply 0x559d47262690 0x559d463aebe0 mon_command_ack([{"prefix": "osd dump"}]=-13 access denied v0) v1
2020-11-10T20:40:18.207+0000 7f91b5803700  1 -- [v2:172.21.15.118:3300/0,v1:172.21.15.118:6789/0] --> 172.21.15.118:0/224822075 -- mon_command_ack([{"prefix": "osd dump"}]=-13 access denied v0) v1 -- 0x559d463aebe0 con 0x559d464af400

Next step is to investigate why the pg dump command was stuck indefinitely.

#10

Updated by Sridhar Seshasayee over 3 years ago

The logs gave no clue as to why the "pg dump" command hung forever.
The suspicion is that the "rados_mon_op_timeout" value is somehow not getting set correctly, even
though the caps.sh script apparently sets it to 300 secs.

Also, the tests were run using teuthology without the mon/caps.sh workunit and, as
expected, they passed. Results are here:
https://pulpito.ceph.com/sseshasa-2020-11-26_12:14:23-rados:monthrash-master-distro-basic-smithi/

To test the above suspicion, the issue was reproduced using a vstart cluster on the
latest master. On manually running steps similar to those in the caps.sh script, the "pg dump"
command did hang forever. For the test, the "rados_mon_op_timeout" value was set to a
smaller value of 15 secs, as shown below:

$ echo $CEPH_ARGS 
--rados-mon-op-timeout=15

Attaching gdb to the hung process running the "pg dump" command and dumping the value
confirms that the value set as part of $CEPH_ARGS is not reflected:

(gdb) bt
#0  0x00007f42770c59f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f42670ce470) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7f42670ce420, cond=0x7f42670ce448) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7f42670ce448, mutex=0x7f42670ce420) at pthread_cond_wait.c:655
#3  0x00007f426b6bec2f in ceph::condition_variable_debug::wait (this=0x7f42670ce448, lock=...) at /home/sseshasa/ceph/src/common/condition_variable_debug.cc:28
#4  0x00007f4274b2a994 in ceph::condition_variable_debug::wait<C_SaferCond::wait()::{lambda()#1}>(std::unique_lock<ceph::mutex_debug_detail::mutex_debug_impl<false> >&, C_SaferCond::wait()::{lambda()#1}) (this=0x7f42670ce448, lock=..., pred=...) at /home/sseshasa/ceph/src/common/condition_variable_debug.h:34
#5  0x00007f4274b23fc4 in C_SaferCond::wait (this=0x7f42670ce3e0) at /home/sseshasa/ceph/src/common/Cond.h:100
#6  0x00007f4274b4e272 in librados::v14_2_0::RadosClient::mgr_command (this=0x7f426005b2a0, cmd=std::vector of length 1, capacity 1 = {...}, inbl=..., outbl=0x7f42670ce5f0, 
    outs=0x7f42670ce610) at /home/sseshasa/ceph/src/librados/RadosClient.cc:858
#7  0x00007f4274a715d9 in _rados_mgr_command (cluster=0x7f426005b2a0, cmd=0x144a410, cmdlen=1, inbuf=0x7f4277864690 "", inbuflen=0, outbuf=0x7f42670ce6d0, outbuflen=0x7f42670ce6d8, 
    outs=0x7f42670ce6e0, outslen=0x7f42670ce700) at /home/sseshasa/ceph/src/librados/librados_c.cc:912
#8  0x00007f427509d99f in __pyx_pf_5rados_5Rados_60mgr_command (__pyx_v_timeout=<optimized out>, __pyx_v_target=0x9d4380 <_Py_NoneStruct>, __pyx_v_inbuf=<optimized out>, 
    __pyx_v_cmd=<optimized out>, __pyx_v_self=0x7f4267bb7938) at /home/sseshasa/ceph/build/src/pybind/rados/rados.c:21583
#9  __pyx_pw_5rados_5Rados_61mgr_command (__pyx_v_self=0x7f4267bb7938, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at /home/sseshasa/ceph/build/src/pybind/rados/rados.c:21281
#10 0x00007f4274fd15cc in __Pyx_CyFunction_CallAsMethod (kw=0x7f4264412168, args=<optimized out>, func=0x7f4267c26048) at /home/sseshasa/ceph/build/src/pybind/rados/rados.c:87622
#11 __Pyx_CyFunction_CallAsMethod (func=0x7f4267c26048, args=<optimized out>, kw=0x7f4264412168) at /home/sseshasa/ceph/build/src/pybind/rados/rados.c:22070

Frame 6 above, at line 858 of RadosClient.cc, confirms that the rados_mon_op_timeout value was
not set and that the thread waits forever to be notified. Details of frame 6 and the actual value
of "rados_mon_op_timeout" are dumped below:

(gdb) info frame     
Stack level 6, frame at 0x7f42670ce4b0:
 rip = 0x7f4274b4e272 in librados::v14_2_0::RadosClient::mgr_command (/home/sseshasa/ceph/src/librados/RadosClient.cc:858); saved rip = 0x7f4274a715d9
 called by frame at 0x7f42670ce680, caller of frame at 0x7f42670ce3a0
 source language c++.
 Arglist at 0x7f42670ce4a0, args: this=0x7f426005b2a0, cmd=std::vector of length 1, capacity 1 = {...}, inbl=..., outbl=0x7f42670ce5f0, outs=0x7f42670ce610
 Locals at 0x7f42670ce4a0, Previous frame's sp is 0x7f42670ce4b0
 Saved registers:
  rbx at 0x7f42670ce498, rbp at 0x7f42670ce4a0, rip at 0x7f42670ce4a8

(gdb) info locals
l = {_M_device = @0x7f426005bfb0}
cond = {<Context> = {_vptr.Context = 0x7f4274fa45b0 <vtable for C_SaferCond+16>}, lock = {<ceph::mutex_debug_detail::mutex_debugging_base> = {group = "C_SaferCond", id = 31, 
      lockdep = true, backtrace = false, nlock = 0, locked_by = {_M_thread = 0}}, m = pthread_mutex_t = {Type = Error check, Status = Not acquired, Robust = No, Shared = No, 
      Protocol = None}, static recursive = false}, cond = {cond = pthread_cond_t = {Threads known to still execute a wait function = 1, Clock ID = CLOCK_REALTIME, Shared = No}, 
    waiter_mutex = 0x7f42670ce3e8}, done = false, rval = 0}
r = 0
(gdb) p rados_mon_op_timeout
$1 = {__r = 0}

(gdb) p rados_mon_op_timeout.count()
$2 = 0

(gdb) p cmd
$4 = std::vector of length 1, capacity 1 = {"{\"prefix\": \"pg dump\", \"target\": [\"mon-mgr\", \"\"]}"}
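
To make the significance of the zero value concrete, here is a minimal C++ sketch of a wait that treats a zero timeout as "no deadline". It is not the librados code: `Completion` and its members are hypothetical names, and the rule "use a timed wait only when the timeout is greater than zero" is an assumption inferred from the untimed `C_SaferCond::wait` in frame 5 together with the zero value printed above.

```cpp
#include <cerrno>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a completion that a reply handler would signal.
struct Completion {
  std::mutex m;
  std::condition_variable cv;
  bool done = false;
  int rval = 0;

  // Called by whoever delivers the reply.
  void complete(int r) {
    std::lock_guard<std::mutex> l(m);
    rval = r;
    done = true;
    cv.notify_all();
  }

  // Wait for completion. Assumption: a timeout of zero means "no deadline",
  // so the caller blocks until complete() is called, however long that takes.
  int wait(std::chrono::duration<double> timeout) {
    std::unique_lock<std::mutex> l(m);
    if (timeout.count() > 0) {
      // Timed wait: give up with -ETIMEDOUT if no reply arrives in time.
      if (!cv.wait_for(l, timeout, [this] { return done; })) {
        return -ETIMEDOUT;
      }
    } else {
      // Untimed wait: this is the branch a zero timeout would take.
      cv.wait(l, [this] { return done; });
    }
    return rval;
  }
};
```

In this sketch, calling `wait()` with a positive timeout on a completion that is never signalled returns `-ETIMEDOUT`, which is the error the caps.sh script expects (110 on Linux). Calling it with zero blocks until `complete()` runs, which matches the hang seen here when no reply ever arrives.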

The next step is to examine the code path involving propagation of options set in CEPH_ARGS and where
the value is getting dropped.

#11

Updated by Sridhar Seshasayee over 3 years ago

Looking at the constructor of the RadosClient class, the "add_observer()" call is missing, so
the config options are not being propagated. After adding the observer in the RadosClient constructor,
the "pg dump" command now times out as expected.

Neha indicated that this PR (https://github.com/ceph/ceph/pull/37529) could have caused the
regression and it does appear to be the case. The author of the above PR would be in the
best position to assess how this change got missed.

Here's a diff of the change I made that resolved the hung command issue,


diff --git a/src/librados/RadosClient.cc b/src/librados/RadosClient.cc
index fa996d4522..d950857126 100644
--- a/src/librados/RadosClient.cc
+++ b/src/librados/RadosClient.cc
@@ -56,7 +56,10 @@ namespace ca = ceph::async;
 namespace cb = ceph::buffer;

 librados::RadosClient::RadosClient(CephContext *cct_)
-  : Dispatcher(cct_->get()) {}
+  : Dispatcher(cct_->get())
+{
+  cct_->_conf.add_observer(this);
+}
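
Why a missing observer registration leaves the timeout at zero can be sketched with a simplified model. This is an illustration under stated assumptions, not the Ceph implementation: `ConfigObserver`, `Config`, `Client` and the `register_observer` flag are hypothetical names, and the model assumes the client refreshes its cached timeout only through the observer callback.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <vector>

// Hypothetical observer interface: notified whenever an option changes.
struct ConfigObserver {
  virtual ~ConfigObserver() = default;
  virtual void handle_conf_change(const std::string& key, double value) = 0;
};

// Hypothetical config store that pushes changes to registered observers.
class Config {
  std::map<std::string, double> values_;
  std::vector<ConfigObserver*> observers_;

public:
  void add_observer(ConfigObserver* o) { observers_.push_back(o); }

  void set(const std::string& key, double value) {
    values_[key] = value;
    for (auto* o : observers_) {
      o->handle_conf_change(key, value);
    }
  }
};

// Hypothetical client that caches the timeout. Assumption: the cache is
// refreshed only in handle_conf_change(), so an unregistered client keeps
// its default of zero.
class Client : public ConfigObserver {
  std::chrono::duration<double> mon_op_timeout_{0};

public:
  Client(Config& conf, bool register_observer) {
    if (register_observer) {
      conf.add_observer(this);  // the step the diff above restores
    }
  }

  void handle_conf_change(const std::string& key, double value) override {
    if (key == "rados_mon_op_timeout") {
      mon_op_timeout_ = std::chrono::duration<double>(value);
    }
  }

  double mon_op_timeout_secs() const { return mon_op_timeout_.count(); }
};
```

With two clients built on the same `Config`, one registered and one not, a later `conf.set("rados_mon_op_timeout", 15.0)` updates only the registered one. The other still reports zero, which is consistent with the `rados_mon_op_timeout.count()` value of 0 seen in gdb.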

There could be other initializations of member variables of this class that
could be missing.

The following commit from a private branch does appear to have the above change,
but it somehow got missed when merging into master:

https://github.com/batrick/ceph/commit/c25acb50eaa0ec746de529734ebc2c1761e78d73

I think the author of the above PR would be in the best position to assess the
missing changes and fix the same.

#12

Updated by Neha Ojha over 3 years ago

Subject changed from mon/caps.sh: unfreeze times out to mon/caps.sh: mgr command(pg dump) waits forever due to rados_mon_op_timeout not getting set correctly
Assignee changed from Sridhar Seshasayee to Patrick Donnelly