Bug #47140
mgr/volumes: unresponsive Client::abort_conn() when cleaning stale libcephfs handle
Status: Closed
Description
The libcephfs connection pool in the mgr (mgr_util) identifies stale filesystem handles and cleans them up by calling abort_conn(). In certain teuthology runs, abort_conn() never returns, causing the test to fail with a timeout.
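The stale-handle check described above can be sketched as follows. This is an illustrative simplification, not the actual mgr_util code: the class and function names (Connection, cleanup_if_stale) are hypothetical stand-ins, and only the fs_id comparison mirrors the "self.fs_id=2, fs_id=3" debug lines below.

```python
# Hypothetical sketch of the stale libcephfs handle cleanup; the real
# implementation lives in src/pybind/mgr/mgr_util.py. All names here are
# simplified stand-ins for illustration.
class Connection:
    def __init__(self, fs_id):
        self.fs_id = fs_id    # fscid recorded when the handle was opened
        self.aborted = False

    def abort_conn(self):
        # In libcephfs this unmounts with abort; in this bug, the real
        # call never returns.
        self.aborted = True


def cleanup_if_stale(conn, current_fs_id):
    """Drop the cached handle when the filesystem's fscid has changed
    (e.g. the fs was deleted and recreated), so the caller opens a
    fresh connection instead of reusing a stale one."""
    if conn.fs_id != current_fs_id:
        conn.abort_conn()
        return None           # caller must establish a new connection
    return conn
```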
From manager logs:
2020-08-26T05:56:16.163+0000 7fab865a3700 1 -- [v2:172.21.15.142:6800/15025,v1:172.21.15.142:6801/15025] <== client.7542 172.21.15.142:0/1077893343 1 ==== mgr_command(tid 0: {"prefix": "fs subvolume create", "vol_name": "cephfs", "sub_name": "subvolume_0000000000761393", "target": ["mon-mgr", ""]}) v1 ==== 148+0+0 (secure 0 0 0) 0x55f674b95e40 con 0x55f674be9400
2020-08-26T05:56:16.167+0000 7fab865a3700 0 log_channel(audit) log [DBG] : from='client.7542 -' entity='client.admin' cmd=[{"prefix": "fs subvolume create", "vol_name": "cephfs", "sub_name": "subvolume_0000000000761393", "target": ["mon-mgr", ""]}]: dispatch
2020-08-26T05:56:16.167+0000 7fab86da4700 0 [volumes INFO volumes.module] Starting _cmd_fs_subvolume_create(prefix:fs subvolume create, sub_name:subvolume_0000000000761393, target:['mon-mgr', ''], vol_name:cephfs) < ""
2020-08-26T05:56:16.167+0000 7fab86da4700 0 [volumes DEBUG mgr_util] self.fs_id=2, fs_id=3
2020-08-26T05:56:16.167+0000 7fab86da4700 0 [volumes DEBUG mgr_util] self.fs_id=2, fs_id=3
2020-08-26T05:56:16.167+0000 7fab86da4700 0 [volumes INFO mgr_util] aborting connection from cephfs 'cephfs'
2020-08-26T05:56:16.167+0000 7fab86da4700 2 client.5220 unmounting (abort)
No subsequent log messages are seen indicating that the abort returned, in particular this one: https://github.com/ceph/ceph/blob/master/src/pybind/mgr/mgr_util.py#L138
Updated by Venky Shankar over 3 years ago
This is again the nasty "subvolume snapshot rm --force" test. The early suspect is the global lock. However, the hang is not due to the command blocking on the lock (at least for this run).
2020-08-26T06:46:41.472+0000 7f436130d700 1 -- [v2:172.21.15.135:6834/14995,v1:172.21.15.135:6835/14995] <== client.35092 172.21.15.201:0/3007761787 1 ==== mgr_command(tid 0: {"prefix": "fs subvolume snapshot rm", "vol_name": "cephfs", "sub_name": "subvolume_0000000000673810", "snap_name": "snapshot_0000000000470811", "force": true, "target": ["mon-mgr", ""]}) v1 ==== 210+0+0 (secure 0 0 0) 0x560a0d3c0f20 con 0x560a0bfed800
2020-08-26T06:46:41.472+0000 7f436130d700 0 log_channel(audit) log [DBG] : from='client.35092 -' entity='client.admin' cmd=[{"prefix": "fs subvolume snapshot rm", "vol_name": "cephfs", "sub_name": "subvolume_0000000000673810", "snap_name": "snapshot_0000000000470811", "force": true, "target": ["mon-mgr", ""]}]: dispatch
2020-08-26T06:46:41.472+0000 7f4361b0e700 0 [volumes INFO volumes.module] Starting _cmd_fs_subvolume_snapshot_rm(force:True, prefix:fs subvolume snapshot rm, snap_name:snapshot_0000000000470811, sub_name:subvolume_0000000000673810, target:['mon-mgr', ''], vol_name:cephfs) < ""
2020-08-26T06:46:41.472+0000 7f4361b0e700 0 [volumes DEBUG volumes.fs.operations.lock] entering global lock
2020-08-26T06:46:41.472+0000 7f4361b0e700 0 [volumes DEBUG volumes.fs.operations.lock] acquired global lock
2020-08-26T06:46:41.472+0000 7f4361b0e700 0 [volumes DEBUG mgr_util] self.fs_id=84, fs_id=85
2020-08-26T06:46:41.472+0000 7f4361b0e700 0 [volumes DEBUG mgr_util] self.fs_id=84, fs_id=85
2020-08-26T06:46:41.472+0000 7f4361b0e700 0 [volumes INFO mgr_util] aborting connection from cephfs 'cephfs'
2020-08-26T06:46:41.472+0000 7f4361b0e700 2 client.25946 unmounting (abort)
Updated by Shyamsundar Ranganathan over 3 years ago
@Venky this looks like a duplicate of https://tracker.ceph.com/issues/46882
From the logs further down,
Job ID: 5376678
Log: smithi142/log/ceph-mgr.y.log.gz
2020-08-26T05:56:16.167+0000 7fab86da4700 2 client.5220 cache still has 0+1 items, waiting (for caps to release?)
...
2020-08-26T05:56:21.167+0000 7fab86da4700 1 client.5220 dump_cache
2020-08-26T05:56:21.167+0000 7fab86da4700 1 client.5220 dump_inode: DISCONNECTED inode 0x1000000025d #0x1000000025d ref 1 0x1000000025d.head(faked_ino=0 ref=1 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 nlink=0 btime=0.00000 0 mtime=2020-08-26T05:55:23.567230+0000 ctime=2020-08-26T05:55:23.568744+0000 caps=- has_dir_layout 0x55f674305b80)
2020-08-26T05:56:21.167+0000 7fab86da4700 2 client.5220 cache still has 0+1 items, waiting (for caps to release?)
Job ID: 5376679
Log: smithi135/log/ceph-mgr.x.log.gz
2020-08-26T06:46:41.472+0000 7f4361b0e700 2 client.25946 cache still has 0+1 items, waiting (for caps to release?)
...
2020-08-26T06:46:46.476+0000 7f4361b0e700 1 client.25946 dump_cache
2020-08-26T06:46:46.476+0000 7f4361b0e700 1 client.25946 dump_inode: DISCONNECTED inode 0x10000000616 #0x10000000616 ref 1 0x10000000616.2(faked_ino=0 ref=1 ll_ref=0 cap_refs={1024=0,2048=0} open={1=1} mode=100644 size=1048576/0 nlink=1 btime=2020-08-26T06:45:24.554181+0000 mtime=2020-08-26T06:45:24.569918+0000 ctime=2020-08-26T06:45:24.569918+0000 caps=pAsLsXsFscrl objectset[0x10000000616 ts 0/0 objects 1 dirty_or_tx 0] 0x560a0b938580)
2020-08-26T06:46:46.476+0000 7f4361b0e700 2 client.25946 cache still has 0+1 items, waiting (for caps to release?)
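The "cache still has 0+1 items, waiting" lines show the abort blocking indefinitely while the client cache drains. As a generic defensive pattern (not the fix adopted for this bug), a caller can run a potentially hanging cleanup call in a worker thread and wait with a timeout, so the hang is detected rather than blocking the mgr thread forever. The helper name call_with_timeout is hypothetical:

```python
import threading

def call_with_timeout(fn, timeout):
    """Run fn in a daemon thread and wait at most `timeout` seconds.

    Returns True if fn finished in time, False if it is still running
    (i.e. apparently hung). Illustrative only: this does not cancel the
    hung call, it merely lets the caller detect the hang and move on.
    """
    done = threading.Event()

    def worker():
        fn()
        done.set()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return done.wait(timeout)
```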
Updated by Venky Shankar over 3 years ago
- Is duplicate of Bug #46882: client: mount abort hangs: [volumes INFO mgr_util] aborting connection from cephfs 'cephfs' added