Bug #48679

closed

client: items pinned in cache preventing unmount

Added by Patrick Donnelly over 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

100%

Source:
Q/A
Tags:
Backport:
pacific,octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client
Labels (FS):
qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2020-12-18T02:17:36.017+0000 7f967501e200  2 client.4623 unmounting
...
2020-12-18T02:17:36.018+0000 7f967501e200  2 client.4623 cache still has 0+25 items, waiting (for caps to release?)
...

From: /ceph/teuthology-archive/pdonnell-2020-12-17_23:13:08-fs-wip-pdonnell-testing-20201217.205941-distro-basic-smithi/5715878/remote/smithi091/log/ceph-client.0.36435.log.gz

Symptom:

2020-12-18T02:32:45.191 ERROR:tasks.cephfs.fuse_mount:process failed to terminate after unmount. This probably indicates a bug within ceph-fuse.
2020-12-18T02:32:45.192 ERROR:teuthology.run_tasks:Manager failed: ceph-fuse
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 173, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20201217.205941/qa/tasks/ceph_fuse.py", line 175, in task
    mount.umount_wait()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20201217.205941/qa/tasks/cephfs/fuse_mount.py", line 380, in umount_wait
    run.wait([self.fuse_daemon], timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 473, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (150) after waiting for 900 seconds

From: /ceph/teuthology-archive/pdonnell-2020-12-17_23:13:08-fs-wip-pdonnell-testing-20201217.205941-distro-basic-smithi/5715878/teuthology.log


Subtasks 1 (0 open, 1 closed)

Bug #49928: client: items pinned in cache preventing unmount x2 - Duplicate - Xiubo Li

Related issues 4 (0 open, 4 closed)

Related to Linux kernel client - Bug #48640: qa: snapshot mismatch during mds thrashing - Resolved - Xiubo Li
Copied to CephFS - Backport #50025: pacific: client: items pinned in cache preventing unmount - Resolved - Patrick Donnelly
Copied to CephFS - Backport #50026: nautilus: client: items pinned in cache preventing unmount - Resolved - Nathan Cutler
Copied to CephFS - Backport #50027: octopus: client: items pinned in cache preventing unmount - Resolved - Nathan Cutler
Actions #1

Updated by Patrick Donnelly over 3 years ago

  description: fs/thrash/workloads/{begin clusters/1-mds-1-client-coloc conf/{client
    mds mon osd} distro/{centos_8} mount/fuse msgr-failures/none objectstore-ec/filestore-xfs
    overrides/{frag_enable session_timeout thrashosds-health whitelist_health whitelist_wrongly_marked_down}
    tasks/{1-thrash/osd 2-workunit/cfuse_workunit_snaptests}}
Actions #2

Updated by Xiubo Li over 3 years ago

Patrick, this one seems similar to one I have fixed before; I will take it.
Thanks.

Actions #3

Updated by Xiubo Li over 3 years ago

  • Status changed from New to In Progress
  • Assignee set to Xiubo Li
Actions #4

Updated by Xiubo Li over 3 years ago

For example, for inode 0x10000000e51:

1426613:2020-12-18T01:31:54.411+0000 7f964f7fe700 10 client.4623 queue_cap_snap 0x10000000e51.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={4=0,1024=1,4096=0,8192=1} open={2=0} mode=100664 size=62/4194304 nlink=1 btime=2020-12-18T01:31:49.263043+0000 mtime=2020-12-18T01:31:49.264384+0000 ctime=2020-12-18T01:31:49.264384+0000 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=Fw objectset[0x10000000e51 ts 0/0 objects 1 dirty_or_tx 62] parents=0x100000005fa.head["34"] 0x7f9648532110) snapc 5e=[5e,8] used Fcb
1426614:2020-12-18T01:31:54.411+0000 7f964f7fe700 15 inode.get on 0x7f9648532110 0x10000000e51.head now 7
...
1426615:2020-12-18T01:31:54.411+0000 7f964f7fe700 10 client.4623 finish_cap_snap 0x10000000e51.head(faked_ino=0 ref=7 ll_ref=1 cap_refs={4=0,1024=1,4096=0,8192=1} open={2=0} mode=100664 size=62/4194304 nlink=1 btime=2020-12-18T01:31:49.263043+0000 mtime=2020-12-18T01:31:49.264384+0000 ctime=2020-12-18T01:31:49.264384+0000 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=Fw objectset[0x10000000e51 ts 0/0 objects 1 dirty_or_tx 62] parents=0x100000005fa.head["34"] 0x7f9648532110) capsnap 0x7f9648869a08 used Fcb
1426616:2020-12-18T01:31:54.411+0000 7f964f7fe700 10 client.4623 finish_cap_snap 0x10000000e51.head(faked_ino=0 ref=7 ll_ref=1 cap_refs={4=0,1024=1,4096=0,8192=1} open={2=0} mode=100664 size=62/4194304 nlink=1 btime=2020-12-18T01:31:49.263043+0000 mtime=2020-12-18T01:31:49.264384+0000 ctime=2020-12-18T01:31:49.264384+0000 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=Fw objectset[0x10000000e51 ts 0/0 objects 1 dirty_or_tx 62] parents=0x100000005fa.head["34"] 0x7f9648532110) cap_snap 0x7f9648869a08 used 9216 WRBUFFER, delaying

Because it still holds the Fb cap, the cap snap flush was delayed, but it seems it was never done later. So the inode is still kept pinned in cap_snaps, which holds an extra ref.


struct CapSnap {
  //snapid_t follows;  // map key
  InodeRef in;         // ===> an inode ref
  SnapContext context;
  int issued, dirty;
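
The pinning mechanism can be shown with a tiny standalone example (made-up InodeLike/CapSnapLike types, not the real Client.cc structures): as long as a delayed cap snap record exists, it keeps a reference on the inode, so the inode's refcount never drops to zero and unmount keeps printing "cache still has ... items, waiting".

// Standalone illustration of the pinning behaviour described above
// (hypothetical types, not libcephfs code).
#include <cstdint>
#include <iostream>
#include <map>
#include <memory>

struct CapSnapLike;

struct InodeLike {
  uint64_t ino = 0;
  // like Inode::cap_snaps, keyed by snap id
  std::map<uint64_t, std::shared_ptr<CapSnapLike>> cap_snaps;
};

struct CapSnapLike {
  std::shared_ptr<InodeLike> in;  // like CapSnap::in (InodeRef): pins the inode
  bool delayed = false;           // flush postponed because Fb (dirty buffers) was held
};

int main() {
  auto in = std::make_shared<InodeLike>();
  in->ino = 0x10000000e51;

  // queue_cap_snap()/finish_cap_snap() while Fb is still in use:
  // the snap flush is delayed, but the cap snap keeps a ref on the inode.
  in->cap_snaps[0x5e] = std::make_shared<CapSnapLike>();
  in->cap_snaps[0x5e]->in = in;
  in->cap_snaps[0x5e]->delayed = true;

  std::cout << "inode refs while the cap snap is pending: " << in.use_count() << "\n";  // 2

  // If nothing ever finishes the delayed cap snap, the next step never
  // happens and the inode stays in the client cache, blocking unmount.
  in->cap_snaps.clear();
  std::cout << "inode refs after the cap snap is dropped: " << in.use_count() << "\n";  // 1
  return 0;
}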

Actions #5

Updated by Xiubo Li over 3 years ago

Xiubo Li wrote:

For example, for inode 0x10000000e51:

[...]

Because it still holds the Fb cap, the cap snap flush was delayed, but it seems it was never done later. So the inode is still kept pinned in cap_snaps, which holds an extra ref.
[...]

After the cap snap flush was delayed, the inode "10000000e51" never ran "finish_cap_snap()" again:

2020-12-18T01:32:03.525+0000 7f9665d37700  1 -- 192.168.0.1:0/722175349 <== osd.6 v2:172.21.15.199:6819/33452 35 ==== osd_op_reply(3021 10000000e51.00000000 [write 0~62] v181'385 uv385 ondisk = 0) v8 ==== 164+0+0 (crc 0 0 0) 0x7f9658052040 con 0x7f962000a110
2020-12-18T01:32:03.525+0000 7f9654ff9700 10 client.4623 _flushed 0x10000000e51.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={4=0,1024=1,4096=0,8192=1} open={2=0} mode=100664 size=62/4194304 nlink=1 btime=2020-12-18T01:31:49.263043+0000 mtime=2020-12-18T01:31:49.264384+0000 ctime=2020-12-18T01:31:49.264384+0000 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x10000000e51 ts 0/0 objects 1 dirty_or_tx 0] parents=0x100000005fa.head["34"] 0x7f9648532110)
2020-12-18T01:32:03.525+0000 7f9654ff9700  5 client.4623 put_cap_ref dropped last FILE_BUFFER ref on 0x10000000e51.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={4=0,1024=0,4096=0,8192=0} open={2=0} mode=100664 size=62/4194304 nlink=1 btime=2020-12-18T01:31:49.263043+0000 mtime=2020-12-18T01:31:49.264384+0000 ctime=2020-12-18T01:31:49.264384+0000 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x10000000e51 ts 0/0 objects 1 dirty_or_tx 0] parents=0x100000005fa.head["34"] 0x7f9648532110)
2020-12-18T01:32:03.525+0000 7f9654ff9700  5 client.4623 put_cap_ref dropped last FILE_CACHE ref on 0x10000000e51.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={4=0,1024=0,4096=0,8192=0} open={2=0} mode=100664 size=62/4194304 nlink=1 btime=2020-12-18T01:31:49.263043+0000 mtime=2020-12-18T01:31:49.264384+0000 ctime=2020-12-18T01:31:49.264384+0000 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x10000000e51 ts 0/0 objects 1 dirty_or_tx 0] parents=0x100000005fa.head["34"] 0x7f9648532110)
2020-12-18T01:32:03.525+0000 7f9654ff9700 20 client.4623 put_inode on 0x10000000e51.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={4=0,1024=0,4096=0,8192=0} open={2=0} mode=100664 size=62/4194304 nlink=1 btime=2020-12-18T01:31:49.263043+0000 mtime=2020-12-18T01:31:49.264384+0000 ctime=2020-12-18T01:31:49.264384+0000 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x10000000e51 ts 0/0 objects 1 dirty_or_tx 0] parents=0x100000005fa.head["34"] 0x7f9648532110) n = 2
2020-12-18T01:32:03.525+0000 7f9654ff9700 20 client.4623 put_inode on 0x10000000e51.head(faked_ino=0 ref=6 ll_ref=1 cap_refs={4=0,1024=0,4096=0,8192=0} open={2=0} mode=100664 size=62/4194304 nlink=1 btime=2020-12-18T01:31:49.263043+0000 mtime=2020-12-18T01:31:49.264384+0000 ctime=2020-12-18T01:31:49.264384+0000 caps=pAsLsXsFsxcrwb(0=pAsLsXsFsxcrwb) objectset[0x10000000e51 ts 0/0 objects 1 dirty_or_tx 0] parents=0x100000005fa.head["34"] 0x7f9648532110) n = 1

We should check the "CEPH_CAP_FILE_BUFFER" cap when its last reference is dropped and then call "finish_cap_snap()".
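
A minimal sketch of that idea, reusing the made-up types from the example above in Actions #4 (this is only a sketch, not necessarily how the actual fix is implemented): once the last FILE_BUFFER reference is gone, any cap snap that was delayed with "WRBUFFER, delaying" can be finished, which releases its reference on the inode.

// Hypothetical sketch of the proposed check, using the same made-up
// types as the earlier example.
constexpr int FILE_BUFFER = 8192;  // CEPH_CAP_FILE_BUFFER bit as it appears in the log ("8192=1")

void put_cap_ref_sketch(const std::shared_ptr<InodeLike> &in, int dropped_cap) {
  if (!(dropped_cap & FILE_BUFFER))
    return;  // only care about the last dirty-buffer reference

  // All buffered data has reached the OSDs, so delayed cap snaps can be finished.
  for (auto it = in->cap_snaps.begin(); it != in->cap_snaps.end();) {
    if (it->second->delayed) {
      // finish_cap_snap() equivalent: flush the snap and drop the cap snap
      // record, which releases its reference on the inode.
      it = in->cap_snaps.erase(it);
    } else {
      ++it;
    }
  }
}

With something along these lines the delayed cap snap no longer pins the inode after write-back completes, the cache can drain, and ceph-fuse can exit during unmount.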

Actions #6

Updated by Xiubo Li over 3 years ago

  • Pull request ID set to 38732
Actions #7

Updated by Xiubo Li over 3 years ago

  • Status changed from In Progress to Fix Under Review
Actions #8

Updated by Patrick Donnelly over 3 years ago

  • Target version changed from v16.0.0 to v17.0.0
  • Backport set to pacific,octopus,nautilus
Actions #9

Updated by Ilya Dryomov about 3 years ago

  • Related to Bug #48640: qa: snapshot mismatch during mds thrashing added
Actions #10

Updated by Patrick Donnelly about 3 years ago

  • Related to Bug #49928: client: items pinned in cache preventing unmount x2 added
Actions #11

Updated by Patrick Donnelly about 3 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Source set to Q/A
Actions #12

Updated by Backport Bot about 3 years ago

  • Copied to Backport #50025: pacific: client: items pinned in cache preventing unmount added
Actions #13

Updated by Backport Bot about 3 years ago

  • Copied to Backport #50026: nautilus: client: items pinned in cache preventing unmount added
Actions #14

Updated by Backport Bot about 3 years ago

  • Copied to Backport #50027: octopus: client: items pinned in cache preventing unmount added
Actions #15

Updated by Loïc Dachary almost 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
