Bug #44947

Hung ops for evicted CephFS clients do not get cleaned up fully

Added by David Piper about 4 years ago. Updated almost 4 years ago.

Status: Need More Info
Priority: High
Assignee: -
Category: Correctness/Safety
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

After noticing some hung CephFS operations on my client, I rebooted the client. Ceph evicted and blacklisted this client, and the hung operations progressed to the "cleaned up request" event, but they are still listed by dump_ops_in_flight and are preventing the rebooted client (which was assigned a new client ID on remounting) from accessing the same inode. New attempts to access this inode result in additional hung operations. The only way I found to clear the hung ops completely and restore access to the inode was to restart my MDS.

I would have expected Ceph to terminate all operations for a client when that client is evicted. Is this behaviour configurable? Are there additional diagnostics I can collect if this recurs?
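
For reference, this kind of information can be gathered with commands along these lines (mds.<name> is a placeholder for the local daemon name):

    # on the MDS host: list in-flight MDS operations via the admin socket
    ceph daemon mds.<name> dump_ops_in_flight

    # list the client sessions currently known to rank 0
    ceph tell mds.0 client ls

    # list blacklisted client addresses
    ceph osd blacklist ls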

Details of the current setup:
• ceph version 14.2.5 (ad5bd132e1492173c85fda2cc863152730b16a92) nautilus (stable)
• We're using the CephFS kernel driver; kernel: 5.5.7-1.el7.elrepo.x86_64
• The client server has 38 separate directories mounted, all from the same CephFS filesystem.
• All 38 directories are mounted with the same config by three separate clients.
• Mount config (in fstab; shown here as reported by mount, with an illustrative fstab form sketched after this list): 10.225.44.236,10.225.44.237,10.225.44.238:6789:/albacore/system/deploy on /opt/dcl/deploy type ceph (rw,noatime,name=albacore,secret=<hidden>,acl,wsize=32768,rsize=32768,_netdev)
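
For illustration only, an fstab line equivalent to the mount above might look like the following; the secretfile path is a made-up placeholder (the real mounts use an inline secret= option, hidden above):

    10.225.44.236,10.225.44.237,10.225.44.238:/albacore/system/deploy /opt/dcl/deploy ceph name=albacore,secretfile=/etc/ceph/albacore.secret,acl,noatime,wsize=32768,rsize=32768,_netdev 0 0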

Timeline:

1) 2020-03-28 21:38:58 - a CephFS op from client:366380 on inode .tmp_depl_license_status.svr01 gets stuck at "failed to wrlock, waiting" (see dump_ops_in_flight). Other ops for the same inode over the course of the next few days get stuck in the "dispatched" state (again see dump_ops_in_flight). Ceph health reports multiple slow ops.

2) 2020-03-30 11:11:44.582 - the client server is rebooted (with a "reboot" command from the shell). The Ceph MDS logs show the MDS evicting client session 366380. The client no longer appears in the output of `ceph tell mds.0 client ls`.

3) 2020-03-30 11:11:44.664068 onwards - all the existing ops_in_flight for this client progress through the events "failed to wrlock, waiting", "killing request", "cleaned up request", but the ops remain in the ops_in_flight list and still count towards Ceph's slow ops count. The client no longer records these ops under /sys/kernel/debug/ceph/*/mdsc.

4) 2020-03-30 11:18:37 - the client server comes back online and remounts the directory from CephFS, getting a new client session ID: 877605.

5) 2020-03-30 11:21:06 - client:877605 tries to access the inode in question (.tmp_depl_license_status.svr01) and gets stuck in "failed to wrlock, waiting". More ops get caught behind it in the "dispatched" state, as before. These ops appear under /sys/kernel/debug/ceph/*/mdsc (see the sketch after this list).
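
The client-side view mentioned in steps 3 and 5 comes from the kernel driver's debugfs; something along these lines (run on the client and on an admin node respectively) shows the pending requests and the blacklist entry:

    # on the client: pending MDS requests as tracked by the kernel driver
    cat /sys/kernel/debug/ceph/*/mdsc

    # on an admin node: confirm the evicted client's address was blacklisted
    ceph osd blacklist ls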

Kind regards,

Dave


Files

dump_ops_in_flight.txt (51.3 KB) - David Piper, 04/06/2020 08:27 AM
mds_2.txt (123 KB) - David Piper, 04/06/2020 08:31 AM
#1

Updated by Greg Farnum about 4 years ago

  • Project changed from Ceph to CephFS
  • Category set to Correctness/Safety
  • Priority changed from Normal to High
  • Component(FS) MDS added

This is quite odd — the only way for a request to get marked as cleaned up like that is after it does what should be all of the cleanup, which involves dropping the locks.

The request can be kept around if something keeps a reference to it, which I would guess is why it's still showing up, but I'm not sure how that could be blocking ongoing IO or holding on to locks...

#2

Updated by Zheng Yan about 4 years ago

It's an MDS bug. If you can compile Ceph from source, please try https://github.com/ceph/ceph/pull/34338
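
For anyone wanting to test that fix, a rough sketch of one way to build the PR from source, assuming the standard Ceph build scripts (the wip-34338 branch name is just a local label):

    git clone --recursive https://github.com/ceph/ceph.git
    cd ceph
    # fetch the proposed fix into a local branch
    git fetch origin pull/34338/head:wip-34338
    git checkout wip-34338
    ./install-deps.sh
    ./do_cmake.sh
    cd build && make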

#3

Updated by Zheng Yan about 4 years ago

https://github.com/ceph/ceph/pull/32073 could also explain this. Please try 14.2.8.
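
After upgrading, something like the following confirms what each daemon is actually running:

    # summarise running daemon versions across the cluster
    ceph versions

    # or query a specific MDS over its admin socket
    ceph daemon mds.<name> version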

#4

Updated by Patrick Donnelly almost 4 years ago

  • Status changed from New to Need More Info
#5

Updated by David Piper almost 4 years ago

We haven't seen this again since I raised the ticket. We've upgraded to 14.2.9 recently; I'll keep an eye out for this happening again.
