
Bug #42338

file system keeps on deadlocking with unresolved slow requests (failed to authpin, subtree is being exported)

Added by Kenneth Waegeman over 1 year ago. Updated over 1 year ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS, ceph-fuse, kceph
Labels (FS):
multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While syncing data to CephFS, using either the fuse client or the kernel client, our file systems always get stuck. After I restart the MDS with the slow requests, it works for a short period of time, then the slow requests start again. Eventually, no operations can be done on the fs until I restart the MDS. The issue does not go away by restarting the MDS or remounting the clients; it keeps coming back while syncing data.

2019-10-15 17:49:45.752 7f3174869700  0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} v=232162868 cv=232162574/232162574 ap=0+1 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412702 rc2035-10-29 07:32:22.000000 b6728047022346 54954765=50183087+4771678)/n(v412702 rc2035-10-29 07:32:22.000000 b6728047025656 54954765=50183087+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=0 0x559ba0c9cf00]
2019-10-15 17:49:46.192 7f3174869700  0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} v=232162868 cv=232162574/232162574 ap=0+2 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412702 rc2035-10-29 07:32:22.000000 b6728047022346 54954765=50183087+4771678)/n(v412702 rc2035-10-29 07:32:22.000000 b6728047025656 54954765=50183087+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=0 0x559ba0c9cf00]
2019-10-15 17:49:48.062 7f317a875700  1 mds.mds01 Updating MDS map to version 567880 from mon.0
2019-10-15 17:49:51.332 7f317a875700  1 mds.mds01 Updating MDS map to version 567881 from mon.0
2019-10-15 17:49:59.812 7f317a875700  0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} pv=232162881 v=232162877 cv=232162574/232162574 ap=2+2 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412703 rc2035-10-29 07:32:22.000000 b6728048257510 54954854=50183176+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=1 0x559ba0c9cf00]
2019-10-15 17:50:03.762 7f317a875700  1 mds.mds01 Updating MDS map to version 567882 from mon.0
2019-10-15 17:50:19.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 33.413517 secs
2019-10-15 17:50:19.802 7f3178070700  0 log_channel(cluster) log [WRN] : slow request 33.413517 seconds old, received at 2019-10-15 17:49:46.402110: client_request(client.41588185:115259 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_844.mat 2019-10-15 17:49:46.402038 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-10-15 17:50:19.802 7f3178070700  0 log_channel(cluster) log [WRN] : slow request 32.804341 seconds old, received at 2019-10-15 17:49:47.011286: client_request(client.41588185:115282 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_1308.mat 2019-10-15 17:49:47.011193 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-10-15 17:50:22.002 7f317a875700  1 mds.mds01 Updating MDS map to version 567883 from mon.0
2019-10-15 17:50:24.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 38.413619 secs
2019-10-15 17:50:29.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 43.413714 secs
2019-10-15 17:50:34.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 48.413976 secs
2019-10-15 17:50:39.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 53.414073 secs
2019-10-15 17:50:44.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 58.414164 secs
2019-10-15 17:50:49.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 63.414259 secs
2019-10-15 17:50:49.802 7f3178070700  0 log_channel(cluster) log [WRN] : slow request 63.414258 seconds old, received at 2019-10-15 17:49:46.402110: client_request(client.41588185:115259 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_844.mat 2019-10-15 17:49:46.402038 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-10-15 17:50:49.802 7f3178070700  0 log_channel(cluster) log [WRN] : slow request 62.805083 seconds old, received at 2019-10-15 17:49:47.011286: client_request(client.41588185:115282 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_1308.mat 2019-10-15 17:49:47.011193 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-10-15 17:50:54.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 68.414353 secs
2019-10-15 17:50:59.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 73.414450 secs
....
2019-10-15 17:53:02.422 7f317a875700  1 mds.mds01 Updating MDS map to version 567906 from mon.0
2019-10-15 17:53:04.812 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 198.417149 secs
2019-10-15 17:53:05.732 7f317a875700  1 mds.mds01 Updating MDS map to version 567907 from mon.0
2019-10-15 17:53:09.812 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 203.417241 secs
2019-10-15 17:53:14.412 7f317a875700  1 mds.mds01 Updating MDS map to version 567908 from mon.0
2019-10-15 17:53:14.812 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 208.417346 secs
2019-10-15 17:53:17.662 7f317a875700  1 mds.mds01 Updating MDS map to version 567909 from mon.0
2019-10-15 17:53:19.812 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 213.417443 secs
2019-10-15 17:53:22.002 7f317a875700  1 mds.mds01 Updating MDS map to version 567910 from mon.0
2019-10-15 17:53:24.812 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 218.417538 secs
2019-10-15 17:53:26.362 7f317a875700  1 mds.mds01 Updating MDS map to version 567911 from mon.0
...
2019-10-15 17:58:14.832 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 508.440553 secs
2019-10-15 17:58:18.062 7f317a875700  1 mds.mds01 Updating MDS map to version 567970 from mon.0
2019-10-15 17:58:19.832 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 513.440638 secs
2019-10-15 17:58:21.672 7f317a875700  1 mds.mds01 Updating MDS map to version 567971 from mon.0
2019-10-15 17:58:24.832 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 518.440702 secs
2019-10-15 17:58:26.162 7f317a875700  1 mds.mds01 Updating MDS map to version 567972 from mon.0
2019-10-15 17:58:29.832 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 523.440775 secs 
...

2019-10-16 11:01:21.730 7f3178070700  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61895.334789 secs
2019-10-16 11:01:26.730 7f3178070700  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61900.334856 secs
2019-10-16 11:01:31.730 7f3178070700  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61905.334924 secs
2019-10-16 11:01:36.730 7f3178070700  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61910.334989 secs
2019-10-16 11:01:41.730 7f3178070700  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61915.335083 secs
2019-10-16 11:01:46.730 7f3178070700  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61920.335180 secs
2019-10-16 11:01:51.729 7f3178070700  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61925.335275 secs

This makes our fs unusable. Is there anything I can do to further debug this?


Related issues

Duplicates CephFS - Bug #39987: mds: MDCache::cow_inode does not cleanup unneeded client_snap_caps Resolved

History

#1 Updated by Zheng Yan over 1 year ago

Do you use snapshots?

#2 Updated by Kenneth Waegeman over 1 year ago

Yes, we have a few snapshots.
Thanks

#3 Updated by Zheng Yan over 1 year ago

Did you create new snapshots while syncing data?

Please run 'ceph daemon mds.x dump cache /tmp/cachedump.x' for each MDS and upload the cachedump files.
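The request above can be scripted for convenience. A minimal sketch, assuming hypothetical MDS daemon names mds01 and mds02 (the real names come from `ceph fs status`); it only prints the commands, since `ceph daemon` must be run on the host holding each MDS's admin socket:

```shell
# Build the dump-cache command for every MDS (assumed names mds01/mds02).
# On a real cluster, run each printed command on the matching MDS host.
for mds in mds01 mds02; do
  cmd="ceph daemon mds.${mds} dump cache /tmp/cachedump.${mds}"
  echo "${cmd}"   # print only; swap echo for direct execution on the MDS host
done
```

The resulting /tmp/cachedump.* files are written on each MDS host, not the admin node.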

#4 Updated by Kenneth Waegeman over 1 year ago

Hi,

I didn't create a new snapshot while syncing data.

Files sent by mail; I was not able to post them here.

#5 Updated by Zheng Yan over 1 year ago

Where did you send the log file? Please share the file with me () through Google Drive.

#6 Updated by Patrick Donnelly over 1 year ago

  • Status changed from New to Need More Info

#7 Updated by Kenneth Waegeman over 1 year ago

I've sent it through Google Drive this time. Thanks again!
K

#8 Updated by Zheng Yan over 1 year ago

  • Status changed from Need More Info to Duplicate

Dup of https://tracker.ceph.com/issues/39987; it will be fixed in v14.2.5. You can avoid this bug by not creating new snapshots (after restarting all MDS daemons).

#9 Updated by Patrick Donnelly over 1 year ago

  • Duplicates Bug #39987: mds: MDCache::cow_inode does not cleanup unneeded client_snap_caps added

#10 Updated by Kenneth Waegeman over 1 year ago

I restarted all MDS daemons and did not create a snapshot after that, but I'm still seeing those slow requests.

#11 Updated by Zheng Yan over 1 year ago

Please dump the cache and share it again.

#12 Updated by Kenneth Waegeman over 1 year ago

I restarted the MDS at 15:06, didn't take any snapshots, and dumped the cache with the slow requests around 16:00. I've sent them through Drive.
Thanks again!

#13 Updated by Zheng Yan over 1 year ago

Still the same issue. The logs show there are inode locks in 'snap->sync' states. Try unmounting all clients and restarting all MDS daemons.

#14 Updated by Kenneth Waegeman over 1 year ago

Still seeing the issue after unmounting and restarting :-(
