Bug #36389
untar encounters unexpected EPERM on kclient/multimds cluster with thrashing
Description
2018-10-10T11:22:58.275 INFO:tasks.mds_thrash.fs.[cephfs]:mds.e-s has gained rank=0, replacing gid=4215
2018-10-10T11:22:58.275 INFO:tasks.mds_thrash.fs.[cephfs]:waiting for 7 secs before reviving mds.b
2018-10-10T11:22:59.800 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/scripts/test: Cannot mkdir: Permission denied
2018-10-10T11:22:59.804 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/scripts/test/generate_index.pl: Cannot open: Permission denied
2018-10-10T11:23:00.811 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/scripts/test/coverage_report.pl: Cannot open: Permission denied
2018-10-10T11:23:00.811 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/scripts/test/testconformance.pl: Cannot open: Permission denied
2018-10-10T11:23:00.812 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/scripts/test/Makefile.in: Cannot open: Permission denied
2018-10-10T11:23:00.812 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/scripts/test/testcoverage.pl: Cannot open: Permission denied
2018-10-10T11:23:00.812 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/scripts/test/gsum2html.pl: Cannot open: Permission denied
2018-10-10T11:23:00.812 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/scripts/test/conformance_report.pl: Cannot open: Permission denied
2018-10-10T11:23:00.812 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/scripts/test/Makefile.am: Cannot open: Permission denied
2018-10-10T11:23:00.813 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/scripts/test/gcov2html.pl: Cannot open: Permission denied
2018-10-10T11:23:00.813 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/scripts/Makefile.am: Cannot open: Permission denied
2018-10-10T11:23:00.813 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/scripts: Cannot utime: Permission denied
2018-10-10T11:23:00.813 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/plugins: Cannot mkdir: Permission denied
2018-10-10T11:23:00.813 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/plugins/ipmi: Cannot mkdir: Permission denied
2018-10-10T11:23:00.813 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/plugins/ipmi/ipmi_entity_event.c: Cannot open: Permission denied
2018-10-10T11:23:00.813 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/plugins/ipmi/atca_vshm_rdrs.c: Cannot open: Permission denied
2018-10-10T11:23:00.813 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/plugins/ipmi/ipmi_sensor_event.c: Cannot open: Permission denied
2018-10-10T11:23:00.813 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/plugins/ipmi/ipmi_close.c: Cannot open: Permission denied
2018-10-10T11:23:00.814 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/plugins/ipmi/hotswap.c: Cannot open: Permission denied
2018-10-10T11:23:00.814 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/plugins/ipmi/ipmi_control_event.c: Cannot open: Permission denied
2018-10-10T11:23:00.814 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/plugins/ipmi/ipmi_connection.c: Cannot open: Permission denied
2018-10-10T11:23:00.814 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/plugins/ipmi/ekeyfru.h: Cannot open: Permission denied
2018-10-10T11:23:00.814 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite/plugins/ipmi/ipmi_controls.c: Cannot open: Permission denied
2018-10-10T11:23:00.814 INFO:tasks.workunit.client.0.smithi177.stderr:
2018-10-10T11:23:00.814 INFO:tasks.workunit.client.0.smithi177.stderr:gzip: stdin: Permission denied
2018-10-10T11:23:00.815 INFO:tasks.workunit.client.0.smithi177.stderr:tar: Unexpected EOF in archive
2018-10-10T11:23:00.815 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases/open_hpi_testsuite: Cannot utime: Permission denied
2018-10-10T11:23:00.815 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231/testcases: Cannot utime: Permission denied
2018-10-10T11:23:00.815 INFO:tasks.workunit.client.0.smithi177.stderr:tar: ltp-full-20091231: Cannot utime: Permission denied
2018-10-10T11:23:00.815 INFO:tasks.workunit.client.0.smithi177.stderr:tar: Error is not recoverable: exiting now
2018-10-10T11:23:00.815 DEBUG:teuthology.orchestra.run:got remote process result: 2
2018-10-10T11:23:00.815 INFO:tasks.workunit:Stopping ['suites/fsstress.sh'] on client.0...
2018-10-10T11:23:00.816 INFO:teuthology.orchestra.run.smithi177:Running: 'sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0'
2018-10-10T11:23:00.999 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-pdonnell-testing/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-pdonnell-testing/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20181008.224656/qa/tasks/workunit.py", line 136, in task
    cleanup=cleanup)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20181008.224656/qa/tasks/workunit.py", line 286, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-pdonnell-testing/teuthology/parallel.py", line 85, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-pdonnell-testing/teuthology/parallel.py", line 99, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-pdonnell-testing/teuthology/parallel.py", line 22, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20181008.224656/qa/tasks/workunit.py", line 409, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-pdonnell-testing/teuthology/orchestra/remote.py", line 193, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-pdonnell-testing/teuthology/orchestra/run.py", line 429, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-pdonnell-testing/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_wip-pdonnell-testing/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed (workunit test suites/fsstress.sh) on smithi177 with status 2: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=wip-pdonnell-testing-20181008.224656 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/suites/fsstress.sh'
From: /ceph/teuthology-archive/pdonnell-2018-10-09_01:07:48-multimds-wip-pdonnell-testing-20181008.224656-distro-basic-smithi/3119273/teuthology.log
Branch: https://github.com/ceph/ceph-ci/tree/wip-pdonnell-testing-20181008.224656
History
#1 Updated by Patrick Donnelly almost 5 years ago
- Description updated (diff)
#2 Updated by Zheng Yan almost 5 years ago
The 'Permission denied' errors occurred because the client was blacklisted. The client did notice that mds.0 entered the reconnect state:
2018-10-10T11:19:03.816779+00:00 smithi177 kernel: libceph: mds0 172.21.15.143:6814 socket closed (con state CONNECTING)
2018-10-10T11:19:10.906802+00:00 smithi177 kernel: ceph: mds0 reconnect start
2018-10-10T11:19:10.926818+00:00 smithi177 kernel: libceph: mds0 172.21.15.143:6813 socket closed (con state OPEN)
2018-10-10 11:19:10.900 7f5c5e20c700 1 mds.0.14 handle_mds_map i am now mds.0.14
2018-10-10 11:19:10.900 7f5c5e20c700 1 mds.0.14 handle_mds_map state change up:resolve --> up:reconnect
2018-10-10 11:19:10.900 7f5c5e20c700 1 mds.0.14 reconnect_start
2018-10-10 11:19:10.900 7f5c5e20c700 10 mds.0.server apply_blacklist: killed 0
2018-10-10 11:19:10.900 7f5c5e20c700 4 mds.0.14 reconnect_start: killed 0 blacklisted sessions (1 blacklist entries, 1)
...
2018-10-10 11:20:00.182 7f5c5ba07700 10 mds.0.server reconnect timed out
2018-10-10 11:20:00.182 7f5c5ba07700 1 mds.0.server reconnect gives up on client.4502 172.21.15.177:0/550996098
2018-10-10 11:20:00.182 7f5c5ba07700 0 log_channel(cluster) log [WRN] : evicting unresponsive client smithi177: (4502), after waiting 45 seconds during MDS startup
2018-10-10 11:20:00.182 7f5c5ba07700 4 mds.0.14 Preparing blacklist command... (wait=0)
2018-10-10 11:20:00.183 7f5c5ba07700 4 mds.0.14 Sending mon blacklist command: {"prefix":"osd blacklist", "blacklistop":"add","addr":"172.21.15.177:0/550996098"}
It's likely that there is a bug in the messenger that caused message loss. Another example is http://tracker.ceph.com/issues/36349
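For reference, the eviction path in the log above ends with the MDS sending a mon command to blacklist the client's address. A minimal sketch of building that same payload (the `build_blacklist_cmd` helper is hypothetical, not a real Ceph function; the actual dispatch through MonClient is elided):

```python
import json

def build_blacklist_cmd(addr):
    # Mirrors the payload logged by mds.0.14 above:
    # {"prefix":"osd blacklist", "blacklistop":"add","addr":"..."}
    # Helper name and structure are illustrative only.
    return json.dumps({
        "prefix": "osd blacklist",
        "blacklistop": "add",
        "addr": addr,
    })

cmd = build_blacklist_cmd("172.21.15.177:0/550996098")
print(cmd)
```

The equivalent operator-facing command would be `ceph osd blacklist add <addr>`, and `ceph osd blacklist ls` can confirm whether a client's address is currently blacklisted.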
#3 Updated by Patrick Donnelly almost 5 years ago
- Duplicated by Bug #36387: "workunit test fs/snaps/snaptest-git-ceph.sh) on smithi031 with status 128" added
#4 Updated by Patrick Donnelly almost 5 years ago
- Related to Bug #36540: msg: messages are queued but not sent added
#5 Updated by Patrick Donnelly over 4 years ago
- Target version changed from v14.0.0 to v15.0.0
#6 Updated by Patrick Donnelly over 4 years ago
- Target version deleted (v15.0.0)
#7 Updated by Patrick Donnelly over 3 years ago
- Target version set to v16.0.0
- Backport changed from mimic,luminous to octopus,nautilus
#8 Updated by Patrick Donnelly about 3 years ago
2020-09-20T18:07:04.733+0000 7fed1ec0d700 5 --1- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> v1:172.21.15.6:0/4159690151 conn(0x55751dd3a800 0x55751db8f800 :6828 s=READ_FOOTER_AND_DISPATCH pgs=237 cs=1 l=0). rx client.4583 seq 31 0x55751e2db180 client_request(client.4583:1662 create #0x10000000835/rpc01 2020-09-20T18:06:45.566163+0000 RETRY=1 REPLAY caller_uid=1000, caller_gid=1229{}) v2
2020-09-20T18:07:04.733+0000 7fed1ec0d700 0 -- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> v1:172.21.15.6:0/4159690151 conn(0x55751dd3a800 legacy=0x55751db8f800 unknown :6828 s=STATE_CONNECTION_ESTABLISHED l=0).read_until injecting socket failure
2020-09-20T18:07:04.733+0000 7fed1c408700 1 -- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] <== client.4583 v1:172.21.15.6:0/4159690151 31 ==== client_request(client.4583:1662 create #0x10000000835/rpc01 2020-09-20T18:06:45.566163+0000 RETRY=1 REPLAY caller_uid=1000, caller_gid=1229{}) v2 ==== 127+0+62 (unknown 3686338695 0 1524826435) 0x55751e2db180 con 0x55751dd3a800
2020-09-20T18:07:04.733+0000 7fed1ec0d700 1 -- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> v1:172.21.15.6:0/4159690151 conn(0x55751dd3a800 legacy=0x55751db8f800 unknown :6828 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 36
2020-09-20T18:07:04.733+0000 7fed1ec0d700 1 -- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> v1:172.21.15.6:0/4159690151 conn(0x55751dd3a800 legacy=0x55751db8f800 unknown :6828 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2020-09-20T18:07:04.733+0000 7fed1ec0d700 1 --1- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> v1:172.21.15.6:0/4159690151 conn(0x55751dd3a800 0x55751db8f800 :6828 s=OPENED pgs=237 cs=1 l=0).handle_message read tag failed
2020-09-20T18:07:04.733+0000 7fed1c408700 20 mds.0.302 get_session have 0x55751dd77f80 client.4583 v1:172.21.15.6:0/4159690151 state open
2020-09-20T18:07:04.733+0000 7fed1c408700 3 mds.0.server queuing replayed op
2020-09-20T18:07:04.733+0000 7fed1c408700 10 -- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] dispatch_throttle_release 189 to dispatch throttler 189/104857600
2020-09-20T18:07:04.733+0000 7fed1ec0d700 5 --1- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> v1:172.21.15.6:0/4159690151 conn(0x55751dd3a800 0x55751db8f800 :6828 s=OPENED pgs=237 cs=1 l=0).reset_recv_state
2020-09-20T18:07:04.733+0000 7fed1ec0d700 5 --1- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> v1:172.21.15.6:0/4159690151 conn(0x55751dd3a800 0x55751db8f800 :6828 s=OPENED pgs=237 cs=1 l=0).reset_security
2020-09-20T18:07:04.733+0000 7fed1ec0d700 10 --1- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> v1:172.21.15.6:0/4159690151 conn(0x55751dd3a800 0x55751db8f800 :6828 s=OPENED pgs=237 cs=1 l=0).fault with nothing to send, going to standby
2020-09-20T18:07:04.733+0000 7fed1ec0d700 10 -- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> v1:172.21.15.6:0/4159690151 conn(0x55751dd3a800 legacy=0x55751db8f800 unknown :6828 s=STATE_CONNECTION_ESTABLISHED l=0).handle_write
2020-09-20T18:07:04.733+0000 7fed1ec0d700 10 --1- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> v1:172.21.15.6:0/4159690151 conn(0x55751dd3a800 0x55751db8f800 :6828 s=STANDBY pgs=237 cs=1 l=0).write_event
2020-09-20T18:07:05.031+0000 7fed1ec0d700 10 --2- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> [v2:172.21.15.97:6808/22951,v1:172.21.15.97:6809/22951] conn(0x55751d11bc00 0x55751dd13900 crc :-1 s=THROTTLE_DONE pgs=100 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_dispatch tag=17
2020-09-20T18:07:05.031+0000 7fed1ec0d700 5 --2- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> [v2:172.21.15.97:6808/22951,v1:172.21.15.97:6809/22951] conn(0x55751d11bc00 0x55751dd13900 crc :-1 s=THROTTLE_DONE pgs=100 cs=0 l=1 rev1=1 rx=0 tx=0).handle_message got 206 + 0 + 0 byte message. envelope type=43 src osd.0 off 0
2020-09-20T18:07:05.031+0000 7fed1ec0d700 5 --2- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> [v2:172.21.15.97:6808/22951,v1:172.21.15.97:6809/22951] conn(0x55751d11bc00 0x55751dd13900 crc :-1 s=READ_MESSAGE_COMPLETE pgs=100 cs=0 l=1 rev1=1 rx=0 tx=0).handle_message received message m=0x55751dd3c240 seq=7 from=osd.0 type=43 osd_op_reply(26 10000000753.00000000 [truncate 0,omap-clear] v85'3348 uv3348 ondisk = 0) v8
2020-09-20T18:07:05.031+0000 7fed1ec0d700 1 -- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] <== osd.0 v2:172.21.15.97:6808/22951 7 ==== osd_op_reply(26 10000000753.00000000 [truncate 0,omap-clear] v85'3348 uv3348 ondisk = 0) v8 ==== 206+0+0 (crc 0 0 0) 0x55751dd3c240 con 0x55751d11bc00
2020-09-20T18:07:05.031+0000 7fed1ec0d700 10 -- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] dispatch_throttle_release 206 to dispatch throttler 206/104857600
2020-09-20T18:07:05.031+0000 7fed163fc700 10 MDSIOContextBase::complete: 25C_IO_MDC_FragmentPurgeOld
2020-09-20T18:07:05.031+0000 7fed163fc700 10 MDSContext::complete: 25C_IO_MDC_FragmentPurgeOld
2020-09-20T18:07:05.031+0000 7fed163fc700 10 mds.0.cache fragment_old_purged 0x10000000753
2020-09-20T18:07:05.031+0000 7fed163fc700 10 mds.0.cache finish_uncommitted_fragments: base dirfrag 0x10000000753 op finish
2020-09-20T18:07:05.031+0000 7fed153fa700 5 mds.0.log _submit_thread 25165076~169 : EFragment finish 0x10000000753 * by 3 [metablob]
2020-09-20T18:07:05.386+0000 7fed1a404700 20 mds.beacon.d sender thread waiting interval 2.94502s
2020-09-20T18:07:05.423+0000 7fed1ac05700 20 mds.0.302 get_task_status
2020-09-20T18:07:05.423+0000 7fed1ac05700 20 mds.0.302 send_task_status: updating 0 status keys
2020-09-20T18:07:05.423+0000 7fed1ac05700 20 mds.0.302 schedule_update_timer_task
2020-09-20T18:07:05.423+0000 7fed18400700 10 mds.0.cache cache not ready for trimming
2020-09-20T18:07:05.423+0000 7fed18400700 20 mds.0.cache upkeep thread waiting interval 1s
2020-09-20T18:07:05.534+0000 7fed1fc0f700 10 Processor -- accept listen_fd=19
2020-09-20T18:07:05.534+0000 7fed1fc0f700 10 Processor -- accept listen_fd=20
2020-09-20T18:07:05.534+0000 7fed1fc0f700 10 Processor -- accept accepted incoming on sd 36
2020-09-20T18:07:05.534+0000 7fed1fc0f700 10 -- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> conn(0x55751dd3b400 legacy=0x55751db90000 unknown :-1 s=STATE_NONE l=0).accept sd=36 listen_addr v1:172.21.15.29:6828/2755833938 peer_addr v1:172.21.15.6:52912/0
2020-09-20T18:07:05.534+0000 7fed1fc0f700 1 --1- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> conn(0x55751dd3b400 0x55751db90000 :6828 s=ACCEPTING pgs=0 cs=0 l=0).send_server_banner sd=36 legacy v1:172.21.15.29:6828/2755833938 socket_addr v1:172.21.15.29:6828/2755833938 target_addr v1:172.21.15.6:52912/0
2020-09-20T18:07:05.534+0000 7fed1fc0f700 10 -- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> conn(0x55751dd3b400 legacy=0x55751db90000 unknown :6828 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send sent bytes 281 remaining bytes 0
2020-09-20T18:07:05.534+0000 7fed1fc0f700 10 --1- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> conn(0x55751dd3b400 0x55751db90000 :6828 s=ACCEPTING pgs=0 cs=0 l=0).handle_server_banner_write write banner and addr done: -
2020-09-20T18:07:05.534+0000 7fed1fc0f700 10 --1- [v2:172.21.15.29:6826/2755833938,v1:172.21.15.29:6828/2755833938] >> conn(0x55751dd3b400 0x55751db90000 :6828 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner accept peer addr is v1:172.21.15.6:0/4159690151
...
2020-09-20T18:07:52.385+0000 7fed1ac05700 7 mds.0.server reconnect timed out, 1 clients have not reconnected in time
2020-09-20T18:07:52.386+0000 7fed1ac05700 1 mds.0.server reconnect gives up on client.4583 v1:172.21.15.6:0/4159690151
2020-09-20T18:07:52.386+0000 7fed1ac05700 0 log_channel(cluster) log [WRN] : evicting unresponsive client smithi006: (4583), after waiting 47.6692 seconds during MDS startup
From: /ceph/teuthology-archive/pdonnell-2020-09-20_07:13:47-multimds-wip-pdonnell-testing-20200920.040823-distro-basic-smithi/5451585/remote/smithi029/log/ceph-mds.d.log.gz
This happens after a socket failure injection. Not sure why the reconnect didn't work...
#9 Updated by Patrick Donnelly almost 3 years ago
- Related to Bug #47723: corrupted ceph_msg_connect message added
#10 Updated by Patrick Donnelly almost 3 years ago
I think this might be a dup of #47723
#11 Updated by Ilya Dryomov almost 3 years ago
Patrick Donnelly wrote:
I think this might be a dup of #47723
Yes, it probably is. There is no evidence in /ceph/teuthology-archive/pdonnell-2020-09-20_07:13:47-multimds-wip-pdonnell-testing-20200920.040823-distro-basic-smithi/5451585/remote/smithi029/log/ceph-mds.d.log.gz because the messenger log level is too low, but if the data section of CEPH_MSG_CLIENT_RECONNECT was shorter than ceph_msg_connect, I would expect the hang.
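A toy illustration of the failure shape described here (this is not the real Ceph wire format; it's plain length-prefixed framing over a socketpair, with a short recv timeout standing in for the indefinite hang a real reader would experience):

```python
import socket
import struct

def read_exact(sock, n):
    """Read exactly n bytes, blocking until they all arrive."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("peer closed before full payload arrived")
        buf += chunk
    return buf

a, b = socket.socketpair()
# Declare a 64-byte payload but send only 16 bytes -- analogous to a
# reconnect message whose data section is shorter than advertised.
a.sendall(struct.pack("!I", 64) + b"x" * 16)

b.settimeout(0.5)  # stands in for "hangs forever" in this sketch
(declared,) = struct.unpack("!I", read_exact(b, 4))
try:
    read_exact(b, declared)
    outcome = "complete"
except socket.timeout:
    outcome = "hung waiting for bytes that never arrive"
print(outcome)
```

The reader trusts the declared length, consumes the 16 bytes that did arrive, and then blocks on a recv that can never be satisfied, which is why a truncated data section would present as a hang rather than an error.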
#12 Updated by Patrick Donnelly over 2 years ago
- Target version changed from v16.0.0 to v17.0.0
- Backport changed from octopus,nautilus to pacific,octopus,nautilus
#13 Updated by Patrick Donnelly about 1 year ago
- Target version deleted (v17.0.0)