Bug #23883
kclient: CephFS kernel client hang
Description
Ceph: 12.2.4/12.2.5
OS: Debian Jessie
Kernel: 4.9/4.4
After restarting all MDS daemons (6 in total: 5 active, 1 standby), the client hangs in the 'stat' operation according to strace (test commands: df, lsof, ...). The issue is very easy to trigger by running 'ls -lR' on a huge directory containing millions of files before restarting all the MDS daemons.
dmesg shows that the reconnect was denied. I did not run into this issue with Jewel. This is critical because many commands hang on the client machine, and rebooting seems to be the only way to recover.
I am not sure whether this patch (https://github.com/ceph/ceph/pull/21592) will fix this issue as well, or whether it is related to authentication bug #110131.
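The reproduction described above can be sketched roughly as follows; the mount point, directory name, and systemd unit name are illustrative assumptions, not taken from the report:

```shell
# On the client: warm the cache by listing a huge directory
# (millions of files). /mnt/cephfs/bigdir is hypothetical.
ls -lR /mnt/cephfs/bigdir > /dev/null

# On each MDS host: restart the MDS daemon (6 in total: 5 active, 1 standby).
# The unit name depends on the deployment; ceph-mds@<id> is typical.
systemctl restart "ceph-mds@$(hostname)"

# Back on the client: stat-based commands now hang, e.g.
df /mnt/cephfs    # strace shows it stuck in stat()
```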
[164426.867436] libceph: mds4 10.20.52.159:6800 socket closed (con state OPEN)
[164427.871075] libceph: mds0 10.20.52.160:6800 socket closed (con state OPEN)
[164431.274627] libceph: mds0 10.20.52.160:6800 socket closed (con state CONNECTING)
[164432.042622] libceph: mds0 10.20.52.160:6800 socket closed (con state CONNECTING)
[164433.034613] libceph: mds0 10.20.52.160:6800 socket closed (con state CONNECTING)
[164435.050590] libceph: mds0 10.20.52.160:6800 socket closed (con state CONNECTING)
[164439.213001] libceph: wrong peer, want 10.20.52.160:6800/-713929449, got 10.20.52.160:6800/1282156498
[164439.213003] libceph: mds0 10.20.52.160:6800 wrong peer at address
[164479.144207] ceph: mds0 caps stale
[164479.144209] ceph: mds4 caps stale
[164488.007687] ceph: mds4 reconnect start
[164636.153486] ceph: mds0 reconnect start
[164647.877800] ceph: mds4 recovery completed
[164647.877807] ceph: mds0 recovery completed
[164647.878088] libceph: mon0 10.20.63.194:6789 session lost, hunting for new mon
[164647.887724] ceph: mds0 reconnect denied
[164647.887742] ceph: mds4 reconnect denied
[164647.890247] libceph: mon4 10.20.64.66:6789 session established
[164654.634241] libceph: mds4 10.20.52.155:6800 socket closed (con state NEGOTIATING)
[165547.897829] libceph: mds2 10.20.52.158:6800 socket closed (con state OPEN)
[165555.605291] libceph: mds4 10.20.52.155:6800 socket closed (con state OPEN)
[166274.096654] libceph: mds1 10.20.52.156:6800 socket closed (con state OPEN)
[166469.161566] libceph: mds3 10.20.52.159:6800 socket closed (con state OPEN)
Updated by Patrick Donnelly about 6 years ago
- Subject changed from CephFS kernel client hang to kclient: CephFS kernel client hang
- Priority changed from Normal to High
- Target version set to v13.0.0
- Source set to Community (user)
- Severity changed from 3 - minor to 2 - major
- Component(FS) kceph added
- Labels (FS) multimds added
Updated by Zheng Yan about 6 years ago
- Component(FS) Common/Protocol added
Besides, the 4.4/4.9 kernels are too old to be used with multi-MDS.
Updated by wei jin about 6 years ago
Zheng Yan wrote:
Besides, the 4.4/4.9 kernels are too old to be used with multi-MDS.
It is very difficult to upgrade the kernel on machines that are already in production.
When using multi-MDS, I exported each user's directory to a different rank; is that workable with an old kernel?
Also, I have just reproduced this issue with only one active MDS, with the other five as standbys.
Updated by wei jin about 6 years ago
Zheng Yan wrote:
please upload mds log
Which level?
After setting debug_mds = 20 and debug_ms = 1, the log file grows to about 500 MB per minute.
Updated by Zheng Yan about 6 years ago
debug_mds = 10, and only for the period while the MDS is recovering.
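A runtime level change avoids leaving verbose logging enabled permanently. A sketch using the standard Ceph admin CLI (the daemon id `mds.0` is an assumption; substitute your MDS name):

```shell
# Raise the MDS debug level just before triggering recovery
# (runtime only, not persisted to ceph.conf):
ceph tell mds.0 injectargs '--debug_mds 10'

# ... restart the MDS, reproduce the hang, collect the log ...

# Restore the default level afterwards:
ceph tell mds.0 injectargs '--debug_mds 1/5'
```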
Updated by wei jin about 6 years ago
2018-04-28 17:58:57.506825 7f2dcf726700 10 mds.0.journal EMetaBlob.replay for [2,head] had [inode 0x100067aa9dd [...2,head] /jinwei_test/tmp/n20-064-084/n20-064-084_1085/ auth v136880 f(v0 m2018-03-28 00:07:43.390546 1=0+1) n(v15 rc2018-03-28 00:08:23.687127 b9825152 100002=100000+2) (iversion lock) | dirfrag=1 0x56101da4d800]
2018-04-28 17:58:57.506835 7f2dcf726700 10 mds.0.log _replay - complete, 29900 events
2018-04-28 17:58:57.506837 7f2dcf726700 10 mds.0.log _replay_thread kicking waiters
2018-04-28 17:58:57.506841 7f2dcf726700 10 MDSInternalContextBase::complete: 15C_MDS_BootStart
2018-04-28 17:58:57.506844 7f2dcf726700 1 mds.0.140006 replay_done
2018-04-28 17:58:57.506846 7f2dcf726700 1 mds.0.140006 making mds journal writeable
2018-04-28 17:58:57.507025 7f2dcf726700 2 mds.0.140006 i am alone, moving to state reconnect
2018-04-28 17:58:57.507030 7f2dcf726700 3 mds.0.140006 request_state up:reconnect
2018-04-28 17:58:57.507031 7f2dcf726700 10 mds.beacon.n20-052-155 set_want_state: up:replay -> up:reconnect
2018-04-28 17:58:57.507041 7f2dcf726700 10 mds.beacon.n20-052-155 _send up:reconnect seq 3
2018-04-28 17:58:57.507054 7f2dcf726700 10 mds.0.log _replay_thread finish
2018-04-28 17:58:58.365873 7f2dd9ecb700 10 mds.client.admin existing session 0x561016b28d80 for client.2902113 10.20.63.195:0/1496633167 existing con 0, new/authorizing con 0x561026ef0800
2018-04-28 17:58:58.365888 7f2dd9ecb700 10 mds.client.admin ms_verify_authorizer: parsing auth_cap_str='allow *'
2018-04-28 17:58:58.365939 7f2dd6f35700 10 mds.n20-052-155 ms_handle_accept 10.20.63.195:0/1496633167 con 0x561026ef0800 session 0x561016b28d80
2018-04-28 17:58:58.365952 7f2dd6f35700 10 mds.n20-052-155 session connection 0 -> 0x561026ef0800
2018-04-28 17:58:58.365973 7f2dd6f35700 5 mds.n20-052-155 handle_mds_map epoch 140007 from mon.1
2018-04-28 17:58:58.366014 7f2dd6f35700 10 mds.n20-052-155 my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2}
2018-04-28 17:58:58.366021 7f2dd6f35700 10 mds.n20-052-155 mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2}
2018-04-28 17:58:58.366025 7f2dd6f35700 10 mds.n20-052-155 map says I am 10.20.52.155:6800/1926089105 mds.0.140006 state up:reconnect
2018-04-28 17:58:58.366032 7f2dd6f35700 10 mds.n20-052-155 handle_mds_map: handling map as rank 0
2018-04-28 17:58:58.366034 7f2dd6f35700 1 mds.0.140006 handle_mds_map i am now mds.0.140006
2018-04-28 17:58:58.366036 7f2dd6f35700 1 mds.0.140006 handle_mds_map state change up:replay --> up:reconnect
2018-04-28 17:58:58.366047 7f2dd6f35700 1 mds.0.140006 reconnect_start
2018-04-28 17:58:58.366050 7f2dd6f35700 1 mds.0.140006 reopen_log
2018-04-28 17:58:58.366051 7f2dd6f35700 10 mds.0.cache rollback_uncommitted_fragments: 0 pending
2018-04-28 17:58:58.366058 7f2dd6f35700 10 mds.0.server apply_blacklist: killed 0
2018-04-28 17:58:58.366060 7f2dd6f35700 4 mds.0.140006 reconnect_start: killed 0 blacklisted sessions (5 blacklist entries, 2)
2018-04-28 17:58:58.366064 7f2dd6f35700 1 mds.0.server reconnect_clients -- 2 sessions
2018-04-28 17:58:58.366066 7f2dd6f35700 10 mds.0.sessionmap dump
2018-04-28 17:58:58.366067 7f2dd6f35700 10 mds.0.sessionmap client.3193954 0x561016b27a00 state open completed {} prealloc_inos [] used_inos []
2018-04-28 17:58:58.366070 7f2dd6f35700 10 mds.0.sessionmap client.2902113 0x561016b28d80 state open completed {} prealloc_inos [] used_inos []
2018-04-28 17:58:58.366105 7f2dd6f35700 7 mds.0.server handle_client_reconnect client.2902113
2018-04-28 17:58:58.366110 7f2dd6f35700 10 mds.0.server reconnect_start 2018-04-28 17:58:58.366063 delay 0.000045
2018-04-28 17:58:58.366145 7f2dd6f35700 0 log_channel(cluster) log [DBG] : reconnect by client.2902113 10.20.63.195:0/1496633167 after 0.000045
2018-04-28 17:58:58.366151 7f2dd6f35700 10 mds.0.cache.snaprealm(0x1 seq 1 0x56101693f840) have_past_parents_open [1,head]
2018-04-28 17:58:58.366154 7f2dd6f35700 10 mds.0.cache.snaprealm(0x1 seq 1 0x56101693f840) build_snap_set [0,head] on snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} 0x56101693f840)
2018-04-28 17:58:58.366158 7f2dd6f35700 10 mds.0.cache.snaprealm(0x1 seq 1 0x56101693f840) build_snap_trace my_snaps []
2018-04-28 17:58:58.366161 7f2dd6f35700 10 mds.0.cache.snaprealm(0x1 seq 1 0x56101693f840) check_cache rebuilt seq 1 cached_seq 1 cached_last_created 0 cached_last_destroyed 0)
2018-04-28 17:58:58.366164 7f2dd6f35700 10 mds.0.cache finish_snaprealm_reconnect client.2902113 up to date on snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} 0x56101693f840)
2018-04-28 17:58:58.366169 7f2dd6f35700 10 mds.0.cache.ino(0x1) add_client_cap first cap, joining realm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} 0x56101693f840)
2018-04-28 17:58:58.366723 7f2dd6f35700 10 mds.beacon.n20-052-155 handle_mds_beacon up:reconnect seq 3 rtt 0.859677
2018-04-28 17:59:07.045243 7f2dd6f35700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 4335) from client.2902113
2018-04-28 17:59:07.045267 7f2dd6f35700 10 mds.0.sessionmap touch_session s=0x561016b28d80 name=client.2902113
2018-04-28 17:59:07.507095 7f2dd3f2f700 10 mds.beacon.n20-052-155 _send up:reconnect seq 4
2018-04-28 17:59:07.507566 7f2dd6f35700 10 mds.beacon.n20-052-155 handle_mds_beacon up:reconnect seq 4 rtt 0.000460
2018-04-28 17:59:17.507143 7f2dd3f2f700 10 mds.beacon.n20-052-155 _send up:reconnect seq 5
2018-04-28 17:59:17.507759 7f2dd6f35700 10 mds.beacon.n20-052-155 handle_mds_beacon up:reconnect seq 5 rtt 0.000605
2018-04-28 17:59:27.045248 7f2dd6f35700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 4336) from client.2902113
2018-04-28 17:59:27.045271 7f2dd6f35700 10 mds.0.sessionmap touch_session s=0x561016b28d80 name=client.2902113
2018-04-28 17:59:27.507195 7f2dd3f2f700 10 mds.beacon.n20-052-155 _send up:reconnect seq 6
2018-04-28 17:59:27.507694 7f2dd6f35700 10 mds.beacon.n20-052-155 handle_mds_beacon up:reconnect seq 6 rtt 0.000488
2018-04-28 17:59:37.507244 7f2dd3f2f700 10 mds.beacon.n20-052-155 _send up:reconnect seq 7
2018-04-28 17:59:37.507748 7f2dd6f35700 10 mds.beacon.n20-052-155 handle_mds_beacon up:reconnect seq 7 rtt 0.000492
2018-04-28 17:59:47.045236 7f2dd6f35700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 4337) from client.2902113
2018-04-28 17:59:47.045261 7f2dd6f35700 10 mds.0.sessionmap touch_session s=0x561016b28d80 name=client.2902113
2018-04-28 17:59:47.507299 7f2dd3f2f700 10 mds.beacon.n20-052-155 _send up:reconnect seq 8
2018-04-28 17:59:47.507868 7f2dd6f35700 10 mds.beacon.n20-052-155 handle_mds_beacon up:reconnect seq 8 rtt 0.000556
2018-04-28 17:59:57.507356 7f2dd3f2f700 10 mds.beacon.n20-052-155 _send up:reconnect seq 9
2018-04-28 17:59:57.507858 7f2dd6f35700 10 mds.beacon.n20-052-155 handle_mds_beacon up:reconnect seq 9 rtt 0.000491
2018-04-28 18:00:07.045267 7f2dd6f35700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 4338) from client.2902113
2018-04-28 18:00:07.045290 7f2dd6f35700 10 mds.0.sessionmap touch_session s=0x561016b28d80 name=client.2902113
2018-04-28 18:00:07.507407 7f2dd3f2f700 10 mds.beacon.n20-052-155 _send up:reconnect seq 10
2018-04-28 18:00:07.507902 7f2dd6f35700 10 mds.beacon.n20-052-155 handle_mds_beacon up:reconnect seq 10 rtt 0.000484
2018-04-28 18:00:17.507466 7f2dd3f2f700 10 mds.beacon.n20-052-155 _send up:reconnect seq 11
2018-04-28 18:00:17.508056 7f2dd6f35700 10 mds.beacon.n20-052-155 handle_mds_beacon up:reconnect seq 11 rtt 0.000579
2018-04-28 18:00:27.045255 7f2dd6f35700 3 mds.0.server handle_client_session client_session(request_renewcaps seq 4339) from client.2902113
2018-04-28 18:00:27.045278 7f2dd6f35700 10 mds.0.sessionmap touch_session s=0x561016b28d80 name=client.2902113
2018-04-28 18:00:27.507531 7f2dd3f2f700 10 mds.beacon.n20-052-155 _send up:reconnect seq 12
2018-04-28 18:00:27.508062 7f2dd6f35700 10 mds.beacon.n20-052-155 handle_mds_beacon up:reconnect seq 12 rtt 0.000516
2018-04-28 18:00:37.507588 7f2dd3f2f700 10 mds.beacon.n20-052-155 _send up:reconnect seq 13
2018-04-28 18:00:37.508095 7f2dd6f35700 10 mds.beacon.n20-052-155 handle_mds_beacon up:reconnect seq 13 rtt 0.000496
2018-04-28 18:00:41.637045 7f2dd4730700 10 mds.0.server reconnect timed out
2018-04-28 18:00:41.637051 7f2dd4730700 1 mds.0.server reconnect gave up on client.3193954 10.8.147.33:0/2305264606
2018-04-28 18:00:41.637095 7f2dd4730700 0 log_channel(cluster) log [WRN] : evicting unresponsive client n8-147-033: (3193954), after waiting 100 seconds during MDS startup
2018-04-28 18:00:41.637105 7f2dd4730700 4 mds.0.140006 Preparing blacklist command... (wait=0)
2018-04-28 18:00:41.637124 7f2dd4730700 4 mds.0.140006 Sending mon blacklist command: {"prefix":"osd blacklist", "blacklistop":"add","addr":"10.8.147.33:0/2305264606"}
2018-04-28 18:00:41.637139 7f2dd4730700 1 mds.0.server reconnect will complete once clients are evicted
2018-04-28 18:00:41.704825 7f2dd6f35700 7 mds.0.server operator(): full = 0 epoch = 63658
2018-04-28 18:00:41.704834 7f2dd6f35700 4 mds.0.140006 handle_osd_map epoch 63658, 1 new blacklist entries
2018-04-28 18:00:41.704838 7f2dd6f35700 10 mds.0.server kill_session 0x561016b27a00
2018-04-28 18:00:41.704854 7f2dd6f35700 10 mds.0.server apply_blacklist: killed 1
2018-04-28 18:00:41.704859 7f2dd6f35700 4 mds.0.140006 set_osd_epoch_barrier: epoch=63658
2018-04-28 18:00:41.704868 7f2dcff27700 5 mds.0.log _submit_thread 5911330478902~97 : ESession client.3193954 10.8.147.33:0/2305264606 close cmapv 1107443677
2018-04-28 18:00:41.763626 7f2dd0f29700 10 MDSIOContextBase::complete: 20C_MDS_session_finish
2018-04-28 18:00:41.763632 7f2dd0f29700 10 mds.0.server _session_logged client.3193954 10.8.147.33:0/2305264606 state_seq 2 close 1107443677
2018-04-28 18:00:41.763640 7f2dd0f29700 10 mds.0.sessionmap remove_session s=0x561016b27a00 name=client.3193954
2018-04-28 18:00:41.787344 7f2dd0f29700 4 mds.0.140006 set_osd_epoch_barrier: epoch=63658
2018-04-28 18:00:41.787349 7f2dd0f29700 1 mds.0.140006 session 3193954 was removed while we waited for blacklist
2018-04-28 18:00:41.787352 7f2dd0f29700 10 MDSInternalContextBase::complete: 25MDSInternalContextWrapper
2018-04-28 18:00:41.787356 7f2dd0f29700 7 mds.0.server reconnect_gather_finish. failed on 1 clients
2018-04-28 18:00:41.787358 7f2dd0f29700 10 MDSInternalContextBase::complete: 12C_MDS_VoidFn
2018-04-28 18:00:41.787359 7f2dd0f29700 1 mds.0.140006 reconnect_done
2018-04-28 18:00:41.787360 7f2dd0f29700 3 mds.0.140006 request_state up:rejoin
2018-04-28 18:00:41.787362 7f2dd0f29700 10 mds.beacon.n20-052-155 set_want_state: up:reconnect -> up:rejoin
2018-04-28 18:00:41.787368 7f2dd0f29700 10 mds.beacon.n20-052-155 _send up:rejoin seq 14
2018-04-28 18:00:42.704532 7f2dd6f35700 5 mds.n20-052-155 handle_mds_map epoch 140009 from mon.1
2018-04-28 18:00:42.704568 7f2dd6f35700 10 mds.n20-052-155 my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2}
2018-04-28 18:00:42.704573 7f2dd6f35700 10 mds.n20-052-155 mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2}
2018-04-28 18:00:42.704576 7f2dd6f35700 10 mds.n20-052-155 map says I am 10.20.52.155:6800/1926089105 mds.0.140006 state up:rejoin
2018-04-28 18:00:42.704582 7f2dd6f35700 10 mds.n20-052-155 handle_mds_map: handling map as rank 0
2018-04-28 18:00:42.704584 7f2dd6f35700 1 mds.0.140006 handle_mds_map i am now mds.0.140006
2018-04-28 18:00:42.704586 7f2dd6f35700 1 mds.0.140006 handle_mds_map state change up:reconnect --> up:rejoin
2018-04-28 18:00:42.704596 7f2dd6f35700 1 mds.0.140006 rejoin_start
2018-04-28 18:00:42.704603 7f2dd6f35700 10 mds.0.cache rejoin_start
2018-04-28 18:00:42.704606 7f2dd6f35700 10 mds.0.cache process_imported_caps
2018-04-28 18:00:42.704607 7f2dd6f35700 7 mds.0.cache trim_non_auth
2018-04-28 18:00:42.728120 7f2dd6f35700 10 mds.0.cache |__ 0 auth [dir 0x100 ~mds0/ [2,head] auth v=200553 cv=0/0 dir_auth=0 state=1610612736 f(v0 10=0+10) n(v76 rc2018-04-27 15:49:01.421535 b355887 3361=3351+10) hs=3+0,ss=0+0 dirty=3 | child=1 subtree=1 subtreetemp=0 dirty=1 0x561016a18a00]
2018-04-28 18:00:42.728166 7f2dd6f35700 10 mds.0.cache |__ 0 auth [dir 0x1 / [2,head] auth v=749783 cv=0/0 dir_auth=0 state=1610612736 f(v0 m2018-04-26 15:20:11.253382 1=0+1) n(v41662 rc2018-04-27 15:49:01.421535 b93269611600943 10417207935=10416998889+209046) hs=1+1,ss=0+0 dirty=2 | child=1 subtree=1 subtreetemp=0 dirty=1 0x561016a17c00]
2018-04-28 18:00:42.728403 7f2dd6f35700 1 mds.0.140006 rejoin_joint_start
2018-04-28 18:00:42.728407 7f2dd6f35700 10 mds.0.cache rejoin_send_rejoins with recovery_set
2018-04-28 18:00:42.728411 7f2dd6f35700 10 mds.0.cache nothing to rejoin
2018-04-28 18:00:42.728412 7f2dd6f35700 10 mds.0.cache rejoin_gather_finish
2018-04-28 18:00:42.728413 7f2dd6f35700 10 mds.0.cache open_undef_inodes_dirfrags 0 inodes 0 dirfrags
2018-04-28 18:00:42.728415 7f2dd6f35700 10 mds.0.cache process_imported_caps
2018-04-28 18:00:42.728417 7f2dd6f35700 10 mds.0.cache choose_lock_states_and_reconnect_caps
2018-04-28 18:00:42.728423 7f2dd6f35700 10 mds.0.cache.snaprealm(0x1 seq 1 0x56101693f840) have_past_parents_open [1,head]
2018-04-28 18:00:42.728426 7f2dd6f35700 10 mds.0.cache have past snap parents for realm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} 0x56101693f840) on [inode 0x1 [...2,head] / auth v339255 snaprealm=0x56101693f840 f(v0 m2018-04-26 15:20:11.253382 1=0+1) n(v41662 rc2018-04-27 15:49:01.421535 b93269611600943 10417207936=10416998889+209047)/n(v0 1=0+1) (inest sync dirty) (iversion lock) caps={2902113=pAsLsXsFs/-@0} | dirtyscattered=1 dirfrag=1 caps=1 dirty=1 0x561016a17500]
2018-04-28 18:00:42.728453 7f2dd6f35700 10 mds.0.cache.snaprealm(0x1 seq 1 0x56101693f840) have_past_parents_open [1,head]
2018-04-28 18:00:42.728455 7f2dd6f35700 10 mds.0.cache have past snap parents for realm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} 0x56101693f840) on [inode 0x1 [...2,head] / auth v339255 snaprealm=0x56101693f840 f(v0 m2018-04-26 15:20:11.253382 1=0+1) n(v41662 rc2018-04-27 15:49:01.421535 b93269611600943 10417207936=10416998889+209047)/n(v0 1=0+1) (inest sync dirty) (iversion lock) caps={2902113=pAsLsXsFs/-@0} | dirtyscattered=1 dirfrag=1 caps=1 dirty=1 0x561016a17500]
......
Updated by wei jin about 6 years ago
client kernel dmesg:
[ 818.110868] ceph: loaded (mds proto 32)
[ 818.130534] libceph: mon4 10.20.64.66:6789 session established
[ 818.149258] libceph: client3193954 fsid c3e505ba-e748-4043-a493-9cc74a215bcd
[ 2829.862812] libceph: mds0 10.20.52.155:6800 socket closed (con state OPEN)
[ 2847.113131] ceph: mds0 reconnect start
[ 2937.976098] ceph: mds0 caps stale
[ 2937.976335] libceph: mon4 10.20.64.66:6789 session lost, hunting for new mon
[ 2937.996316] ceph: mds0 reconnect success
[ 2937.998839] libceph: mon3 10.20.64.30:6789 session established
[ 2943.907730] ceph: mds0 recovery completed
[ 4929.770933] perf: interrupt took too long (2512 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[ 7204.949325] libceph: mds0 10.20.52.157:6800 socket closed (con state OPEN)
[ 7222.551954] ceph: mds0 reconnect start
[ 7247.320959] libceph: mon3 10.20.64.30:6789 session lost, hunting for new mon
[ 7376.090566] INFO: task kworker/0:1:313 blocked for more than 120 seconds.
[ 7376.090586] Not tainted 4.9.0-0.bpo.3-amd64 #1 Debian 4.9.30-2+deb9u2~bpo8+1
[ 7376.090606] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7376.090626] kworker/0:1 D 0 313 2 0x00000000
[ 7376.090648] Workqueue: events delayed_work [ceph]
[ 7376.090650] ffff88532ef64800 0000000000000000 ffff883333301100 ffff883331e87080
[ 7376.090652] ffff88333fc187c0 ffffb3e31962fd58 ffffffff8f202a4d 0000000000000000
[ 7376.090654] ffff883333301100 000000008ecc339b ffff883331e87080 ffff883331e87080
[ 7376.090656] Call Trace:
[ 7376.090662] [<ffffffff8f202a4d>] ? __schedule+0x23d/0x6d0
[ 7376.090664] [<ffffffff8f202f12>] ? schedule+0x32/0x80
[ 7376.090665] [<ffffffff8f20319a>] ? schedule_preempt_disabled+0xa/0x10
[ 7376.090667] [<ffffffff8f204d64>] ? __mutex_lock_slowpath+0xb4/0x130
[ 7376.090669] [<ffffffff8f204dfb>] ? mutex_lock+0x1b/0x30
[ 7376.090675] [<ffffffffc0c86cf6>] ? delayed_work+0xf6/0x200 [ceph]
[ 7376.090677] [<ffffffff8ec917cb>] ? process_one_work+0x14b/0x410
[ 7376.090679] [<ffffffff8ec92285>] ? worker_thread+0x65/0x4a0
[ 7376.090680] [<ffffffff8ec92220>] ? rescuer_thread+0x340/0x340
[ 7376.090681] [<ffffffff8ec97580>] ? kthread+0xf0/0x110
[ 7376.090683] [<ffffffff8ec2476b>] ? __switch_to+0x2bb/0x700
[ 7376.090685] [<ffffffff8ec97490>] ? kthread_park+0x60/0x60
[ 7376.090686] [<ffffffff8f207a35>] ? ret_from_fork+0x25/0x30
[ 7417.651326] ceph: mds0 caps stale
[ 7417.670181] ceph: mds0 reconnect denied
[ 7417.672496] libceph: mon0 10.20.63.194:6789 session established
[ 7417.682234] libceph: mds0 10.20.52.155:6800 socket closed (con state NEGOTIATING)
Updated by dongdong tao about 6 years ago
Hi Wei,
This is a very interesting problem. From your description, I would like to share my thoughts:
Your client "10.8.147.33" is probably holding too many (millions of) file capabilities. One million caps is about 100MB, but that alone cannot explain why this client needs more than 100 seconds to send all of its caps, so I suspect a very slow network link between your client (10.8.147.33) and the MDS (10.20.52.155); it looks like they are not in the same network segment.
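The arithmetic here can be sanity-checked with a quick back-of-envelope sketch; the per-cap size is implied by the "1 million caps is about 100MB" figure above, while the link speed is purely an illustrative assumption:

```shell
# Hypothetical numbers for illustration only:
caps=1000000        # caps held by the slow client
bytes_per_cap=100   # assumed wire size per capability (~100MB / 1M caps)
link_mbps=8         # assumed effective throughput, megabits per second

total_mb=$(( caps * bytes_per_cap / 1000000 ))  # total payload, MB
seconds=$(( total_mb * 8 / link_mbps ))         # transfer time, seconds
echo "${total_mb} MB of caps, ~${seconds}s to send"
# prints: 100 MB of caps, ~100s to send
```

At an effective throughput of roughly 8 Mbit/s, the cap transfer alone would consume the entire reconnect window, which is consistent with the "after waiting 100 seconds during MDS startup" eviction of client.3193954 in the MDS log above.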
Updated by Jeff Layton about 6 years ago
v4.9 is quite old at this point, so it would be helpful to know whether this is something that has already been fixed in more recent kernels. Is this reproducible on a more recent kernel, ideally something around v4.16?
Updated by Patrick Donnelly almost 6 years ago
- Priority changed from High to Normal
- Target version changed from v13.0.0 to v14.0.0
Updated by Patrick Donnelly about 5 years ago
- Target version changed from v14.0.0 to v15.0.0