Bug #42338
closedfile system keeps on deadlocking with unresolved slow requests (failed to authpin, subtree is being exported)
0%
Description
While syncing data to cephfs , using both fuse or kclient, our file systems always got stuck. After I restart the mds with the slow requests, it works for a small period of time, then it starts with slow requests. Eventually, no operations can be done on the fs until I restart the MDS. This issue does not go away by restarting mds or remounting the clients, it keeps on coming back when syncing data.
2019-10-15 17:49:45.752 7f3174869700 0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} v=232162868 cv=232162574/232162574 ap=0+1 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412702 rc2035-10-29 07:32:22.000000 b6728047022346 54954765=50183087+4771678)/n(v412702 rc2035-10-29 07:32:22.000000 b6728047025656 54954765=50183087+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=0 0x559ba0c9cf00] 2019-10-15 17:49:46.192 7f3174869700 0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} v=232162868 cv=232162574/232162574 ap=0+2 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412702 rc2035-10-29 07:32:22.000000 b6728047022346 54954765=50183087+4771678)/n(v412702 rc2035-10-29 07:32:22.000000 b6728047025656 54954765=50183087+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=0 0x559ba0c9cf00] 2019-10-15 17:49:48.062 7f317a875700 1 mds.mds01 Updating MDS map to version 567880 from mon.0 2019-10-15 17:49:51.332 7f317a875700 1 mds.mds01 Updating MDS map to version 567881 from mon.0 2019-10-15 17:49:59.812 7f317a875700 0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} pv=232162881 v=232162877 cv=232162574/232162574 ap=2+2 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412703 rc2035-10-29 07:32:22.000000 b6728048257510 54954854=50183176+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=1 0x559ba0c9cf00] 2019-10-15 17:50:03.762 7f317a875700 1 mds.mds01 Updating MDS map to version 567882 from mon.0 2019-10-15 17:50:19.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 33.413517 secs 2019-10-15 17:50:19.802 7f3178070700 0 log_channel(cluster) log [WRN] : slow request 33.413517 seconds old, received at 2019-10-15 17:49:46.402110: client_request(client.41588185:115259 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_844.mat 2019-10-15 17:49:46.402038 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported 2019-10-15 17:50:19.802 7f3178070700 0 log_channel(cluster) log [WRN] : slow request 32.804341 seconds old, received at 2019-10-15 17:49:47.011286: client_request(client.41588185:115282 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_1308.mat 2019-10-15 17:49:47.011193 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported 2019-10-15 17:50:22.002 7f317a875700 1 mds.mds01 Updating MDS map to version 567883 from mon.0 2019-10-15 17:50:24.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 38.413619 secs 2019-10-15 17:50:29.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 43.413714 secs 2019-10-15 17:50:34.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 48.413976 secs 2019-10-15 17:50:39.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 53.414073 secs 2019-10-15 17:50:44.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 58.414164 secs 2019-10-15 17:50:49.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 63.414259 secs 2019-10-15 17:50:49.802 7f3178070700 0 log_channel(cluster) log [WRN] : slow request 63.414258 seconds old, received at 2019-10-15 17:49:46.402110: client_request(client.41588185:115259 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_844.mat 2019-10-15 17:49:46.402038 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported 2019-10-15 17:50:49.802 7f3178070700 0 log_channel(cluster) log [WRN] : slow request 62.805083 seconds old, received at 2019-10-15 17:49:47.011286: client_request(client.41588185:115282 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_1308.mat 2019-10-15 17:49:47.011193 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported 2019-10-15 17:50:54.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 68.414353 secs 2019-10-15 17:50:59.802 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 73.414450 secs .... 019-10-15 17:53:02.422 7f317a875700 1 mds.mds01 Updating MDS map to version 567906 from mon.0 2019-10-15 17:53:04.812 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 198.417149 secs 2019-10-15 17:53:05.732 7f317a875700 1 mds.mds01 Updating MDS map to version 567907 from mon.0 2019-10-15 17:53:09.812 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 203.417241 secs 2019-10-15 17:53:14.412 7f317a875700 1 mds.mds01 Updating MDS map to version 567908 from mon.0 2019-10-15 17:53:14.812 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 208.417346 secs 2019-10-15 17:53:17.662 7f317a875700 1 mds.mds01 Updating MDS map to version 567909 from mon.0 2019-10-15 17:53:19.812 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 213.417443 secs 2019-10-15 17:53:22.002 7f317a875700 1 mds.mds01 Updating MDS map to version 567910 from mon.0 2019-10-15 17:53:24.812 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 218.417538 secs 2019-10-15 17:53:26.362 7f317a875700 1 mds.mds01 Updating MDS map to version 567911 from mon.0 ... 2019-10-15 17:58:14.832 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 508.440553 secs 2019-10-15 17:58:18.062 7f317a875700 1 mds.mds01 Updating MDS map to version 567970 from mon.0 2019-10-15 17:58:19.832 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 513.440638 secs 2019-10-15 17:58:21.672 7f317a875700 1 mds.mds01 Updating MDS map to version 567971 from mon.0 2019-10-15 17:58:24.832 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 518.440702 secs 2019-10-15 17:58:26.162 7f317a875700 1 mds.mds01 Updating MDS map to version 567972 from mon.0 2019-10-15 17:58:29.832 7f3178070700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 523.440775 secs ... 2019-10-16 11:01:21.730 7f3178070700 0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61895.334789 secs 2019-10-16 11:01:26.730 7f3178070700 0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61900.334856 secs 2019-10-16 11:01:31.730 7f3178070700 0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61905.334924 secs 2019-10-16 11:01:36.730 7f3178070700 0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61910.334989 secs 2019-10-16 11:01:41.730 7f3178070700 0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61915.335083 secs 2019-10-16 11:01:46.730 7f3178070700 0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61920.335180 secs 2019-10-16 11:01:51.729 7f3178070700 0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61925.335275 secs
This makes our fs unusable.. Anything I can do to further debug this?
Updated by Zheng Yan over 4 years ago
did you create new snapshots while syncing data?
please run 'ceph daemon mds.x dump cache /tmp/cachedump.x' for all mds and upload cachedump files
Updated by Kenneth Waegeman over 4 years ago
Hi,
I didn't create a new snapshot while syncing data.
Files sent by mail¸ not able to post them here.
Updated by Zheng Yan over 4 years ago
where did you send the log file? please share the file to me (ukernel@gmail.com) through google drive.
Updated by Patrick Donnelly over 4 years ago
- Status changed from New to Need More Info
Updated by Kenneth Waegeman over 4 years ago
I've sent it through google drive this time. Thanks again!
K
Updated by Zheng Yan over 4 years ago
- Status changed from Need More Info to Duplicate
dup of https://tracker.ceph.com/issues/39987. will be fixed by v14.2.5. you can avoid this bug by not creating new snapshots (after restart all mds)
Updated by Patrick Donnelly over 4 years ago
- Is duplicate of Bug #39987: mds: MDCache::cow_inode does not cleanup unneeded client_snap_caps added
Updated by Kenneth Waegeman over 4 years ago
I restarted all mds and did not create a snapshot after that, but still seeing those slow requests..
Updated by Kenneth Waegeman over 4 years ago
I've restarted the mds at 15:06, didn't take any snapshots and dumped the cache with the slow requests around 16:00. I've sent them through drive.
Thanks again!
Updated by Zheng Yan over 4 years ago
still the same issue. logs show there are inode locks in 'snap->sync' states. try umounting all client and restart all mds
Updated by Kenneth Waegeman over 4 years ago
Still seeing the issue after unmounting and restarting :-(