Bug #47140: mgr/volumes: unresponsive Client::abort_conn() when cleaning stale libcephfs handle - CephFS - Ceph

Actions

Copy link

Bug #47140

closed

mgr/volumes: unresponsive Client::abort_conn() when cleaning stale libcephfs handle

Added by Venky Shankar over 3 years ago. Updated over 3 years ago.

Status:

Duplicate

Priority:

High

Assignee:

Category:

Correctness/Safety

Target version:

Ceph - v16.0.0

% Done:

Source:

Q/A

Tags:

Backport:

Regression:

Severity:

2 - major

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

mgr/volumes

Labels (FS):

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Libcephfs connection pool in mgr (mgr_util) identifies stale filesystem handles and cleans them up by calling abort_conn(). In certain teuthology runs abort_conn() never returns causing the test to fail (timeout).

Sample run: https://pulpito.ceph.com/vshankar-2020-08-26_05:34:12-fs-wip-pdonnell-testing-20200826.032941-distro-basic-smithi/5376678/

From manager logs:

2020-08-26T05:56:16.163+0000 7fab865a3700  1 -- [v2:172.21.15.142:6800/15025,v1:172.21.15.142:6801/15025] <== client.7542 172.21.15.142:0/1077893343 1 ==== mgr_command(tid 0: {"prefix": "fs subvolume create", "vol_name": "cephfs", "sub_name": "subvolume_0000000000761393", "target": ["mon-mgr", ""]}) v1 ==== 148+0+0 (secure 0 0 0) 0x55f674b95e40 con 0x55f674be9400
2020-08-26T05:56:16.167+0000 7fab865a3700  0 log_channel(audit) log [DBG] : from='client.7542 -' entity='client.admin' cmd=[{"prefix": "fs subvolume create", "vol_name": "cephfs", "sub_name": "subvolume_0000000000761393", "target": ["mon-mgr", ""]}]: dispatch
2020-08-26T05:56:16.167+0000 7fab86da4700  0 [volumes INFO volumes.module] Starting _cmd_fs_subvolume_create(prefix:fs subvolume create, sub_name:subvolume_0000000000761393, target:['mon-mgr', ''], vol_name:cephfs) < "" 
2020-08-26T05:56:16.167+0000 7fab86da4700  0 [volumes DEBUG mgr_util] self.fs_id=2, fs_id=3
2020-08-26T05:56:16.167+0000 7fab86da4700  0 [volumes DEBUG mgr_util] self.fs_id=2, fs_id=3
2020-08-26T05:56:16.167+0000 7fab86da4700  0 [volumes INFO mgr_util] aborting connection from cephfs 'cephfs'
2020-08-26T05:56:16.167+0000 7fab86da4700  2 client.5220 unmounting (abort)

So subsequent log messages seen w.r.t. abort returning. esp this log: https://github.com/ceph/ceph/blob/master/src/pybind/mgr/mgr_util.py#L138

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Venky Shankar over 3 years ago

https://pulpito.ceph.com/vshankar-2020-08-26_05:34:12-fs-wip-pdonnell-testing-20200826.032941-distro-basic-smithi/5376679/

This is again the nasty "subvolume snapshot rm --force" test. Early suspects is the global lock. Hpwever, the hang is not due to the command blocking on the lock (at least for this run).

2020-08-26T06:46:41.472+0000 7f436130d700  1 -- [v2:172.21.15.135:6834/14995,v1:172.21.15.135:6835/14995] <== client.35092 172.21.15.201:0/3007761787 1 ==== mgr_command(tid 0: {"prefix": "fs subvolume snapshot rm", "vol_name": "cephfs", "sub_name": "subvolume_0000000000673810", "snap_name": "snapshot_0000000000470811", "force": true, "target": ["mon-mgr", ""]}) v1 ==== 210+0+0 (secure 0 0 0) 0x560a0d3c0f20 con 0x560a0bfed800
2020-08-26T06:46:41.472+0000 7f436130d700  0 log_channel(audit) log [DBG] : from='client.35092 -' entity='client.admin' cmd=[{"prefix": "fs subvolume snapshot rm", "vol_name": "cephfs", "sub_name": "subvolume_0000000000673810", "snap_name": "snapshot_0000000000470811", "force": true, "target": ["mon-mgr", ""]}]: dispatch
2020-08-26T06:46:41.472+0000 7f4361b0e700  0 [volumes INFO volumes.module] Starting _cmd_fs_subvolume_snapshot_rm(force:True, prefix:fs subvolume snapshot rm, snap_name:snapshot_0000000000470811, sub_name:subvolume_0000000000673810, target:['mon-mgr', ''], vol_name:cephfs) < "" 
2020-08-26T06:46:41.472+0000 7f4361b0e700  0 [volumes DEBUG volumes.fs.operations.lock] entering global lock
2020-08-26T06:46:41.472+0000 7f4361b0e700  0 [volumes DEBUG volumes.fs.operations.lock] acquired global lock
2020-08-26T06:46:41.472+0000 7f4361b0e700  0 [volumes DEBUG mgr_util] self.fs_id=84, fs_id=85
2020-08-26T06:46:41.472+0000 7f4361b0e700  0 [volumes DEBUG mgr_util] self.fs_id=84, fs_id=85
2020-08-26T06:46:41.472+0000 7f4361b0e700  0 [volumes INFO mgr_util] aborting connection from cephfs 'cephfs'
2020-08-26T06:46:41.472+0000 7f4361b0e700  2 client.25946 unmounting (abort)

Actions

Copy link

Updated by Shyamsundar Ranganathan over 3 years ago

@Venky looks like a duplicate of https://tracker.ceph.com/issues/46882

From the logs further down,
Job ID: 5376678
Log: smithi142/log/ceph-mgr.y.log.gz

2020-08-26T05:56:16.167+0000 7fab86da4700  2 client.5220 cache still has 0+1 items, waiting (for caps to release?)
...
2020-08-26T05:56:21.167+0000 7fab86da4700  1 client.5220 dump_cache
2020-08-26T05:56:21.167+0000 7fab86da4700  1 client.5220 dump_inode: DISCONNECTED inode 0x1000000025d #0x1000000025d ref 1 0x1000000025d.head(faked_ino=0 ref=1 ll_ref=0 cap_refs={} open={} mode=40755 size=0/0 nlink=0 btime=0.00000        0 mtime=2020-08-26T05:55:23.567230+0000 ctime=2020-08-26T05:55:23.568744+0000 caps=- has_dir_layout 0x55f674305b80)
2020-08-26T05:56:21.167+0000 7fab86da4700  2 client.5220 cache still has 0+1 items, waiting (for caps to release?)

Job ID: 5376679
Log: smithi135/log/ceph-mgr.x.log.gz

2020-08-26T06:46:41.472+0000 7f4361b0e700  2 client.25946 cache still has 0+1 items, waiting (for caps to release?)
...
2020-08-26T06:46:46.476+0000 7f4361b0e700  1 client.25946 dump_cache
2020-08-26T06:46:46.476+0000 7f4361b0e700  1 client.25946 dump_inode: DISCONNECTED inode 0x10000000616 #0x10000000616 ref 1 0x10000000616.2(faked_ino=0 ref=1 ll_ref=0 cap_refs={1024=0,2048=0} open={1=1} mode=100644 size=1048576/0 nlink=1 
btime=2020-08-26T06:45:24.554181+0000 mtime=2020-08-26T06:45:24.569918+0000 ctime=2020-08-26T06:45:24.569918+0000 caps=pAsLsXsFscrl objectset[0x10000000616 ts 0/0 objects 1 dirty_or_tx 0] 0x560a0b938580)
2020-08-26T06:46:46.476+0000 7f4361b0e700  2 client.25946 cache still has 0+1 items, waiting (for caps to release?)

Actions

Copy link