
Bug #42338

file system keeps on deadlocking with unresolved slow requests (failed to authpin, subtree is being exported)

Added by Kenneth Waegeman over 1 year ago. Updated over 1 year ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS, ceph-fuse, kceph
Labels (FS):
multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While syncing data to CephFS, using either the fuse client or the kernel client, our file systems always get stuck. After I restart the MDS with the slow requests, it works for a short period of time, then the slow requests start again. Eventually, no operations can be done on the fs until I restart the MDS. The issue does not go away by restarting the MDS or remounting the clients; it keeps coming back while syncing data.

2019-10-15 17:49:45.752 7f3174869700  0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} v=232162868 cv=232162574/232162574 ap=0+1 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412702 rc2035-10-29 07:32:22.000000 b6728047022346 54954765=50183087+4771678)/n(v412702 rc2035-10-29 07:32:22.000000 b6728047025656 54954765=50183087+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=0 0x559ba0c9cf00]
2019-10-15 17:49:46.192 7f3174869700  0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} v=232162868 cv=232162574/232162574 ap=0+2 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412702 rc2035-10-29 07:32:22.000000 b6728047022346 54954765=50183087+4771678)/n(v412702 rc2035-10-29 07:32:22.000000 b6728047025656 54954765=50183087+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=0 0x559ba0c9cf00]
2019-10-15 17:49:48.062 7f317a875700  1 mds.mds01 Updating MDS map to version 567880 from mon.0
2019-10-15 17:49:51.332 7f317a875700  1 mds.mds01 Updating MDS map to version 567881 from mon.0
2019-10-15 17:49:59.812 7f317a875700  0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} pv=232162881 v=232162877 cv=232162574/232162574 ap=2+2 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412703 rc2035-10-29 07:32:22.000000 b6728048257510 54954854=50183176+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=1 0x559ba0c9cf00]
2019-10-15 17:50:03.762 7f317a875700  1 mds.mds01 Updating MDS map to version 567882 from mon.0
2019-10-15 17:50:19.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 33.413517 secs
2019-10-15 17:50:19.802 7f3178070700  0 log_channel(cluster) log [WRN] : slow request 33.413517 seconds old, received at 2019-10-15 17:49:46.402110: client_request(client.41588185:115259 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_844.mat 2019-10-15 17:49:46.402038 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-10-15 17:50:19.802 7f3178070700  0 log_channel(cluster) log [WRN] : slow request 32.804341 seconds old, received at 2019-10-15 17:49:47.011286: client_request(client.41588185:115282 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_1308.mat 2019-10-15 17:49:47.011193 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-10-15 17:50:22.002 7f317a875700  1 mds.mds01 Updating MDS map to version 567883 from mon.0
2019-10-15 17:50:24.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 38.413619 secs
2019-10-15 17:50:29.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 43.413714 secs
2019-10-15 17:50:34.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 48.413976 secs
2019-10-15 17:50:39.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 53.414073 secs
2019-10-15 17:50:44.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 58.414164 secs
2019-10-15 17:50:49.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 63.414259 secs
2019-10-15 17:50:49.802 7f3178070700  0 log_channel(cluster) log [WRN] : slow request 63.414258 seconds old, received at 2019-10-15 17:49:46.402110: client_request(client.41588185:115259 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_844.mat 2019-10-15 17:49:46.402038 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-10-15 17:50:49.802 7f3178070700  0 log_channel(cluster) log [WRN] : slow request 62.805083 seconds old, received at 2019-10-15 17:49:47.011286: client_request(client.41588185:115282 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_1308.mat 2019-10-15 17:49:47.011193 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-10-15 17:50:54.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 68.414353 secs
2019-10-15 17:50:59.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 73.414450 secs
....
2019-10-15 17:53:02.422 7f317a875700  1 mds.mds01 Updating MDS map to version 567906 from mon.0
2019-10-15 17:53:04.812 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 198.417149 secs
2019-10-15 17:53:05.732 7f317a875700  1 mds.mds01 Updating MDS map to version 567907 from mon.0
2019-10-15 17:53:09.812 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 203.417241 secs
2019-10-15 17:53:14.412 7f317a875700  1 mds.mds01 Updating MDS map to version 567908 from mon.0
2019-10-15 17:53:14.812 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 208.417346 secs
2019-10-15 17:53:17.662 7f317a875700  1 mds.mds01 Updating MDS map to version 567909 from mon.0
2019-10-15 17:53:19.812 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 213.417443 secs
2019-10-15 17:53:22.002 7f317a875700  1 mds.mds01 Updating MDS map to version 567910 from mon.0
2019-10-15 17:53:24.812 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 218.417538 secs
2019-10-15 17:53:26.362 7f317a875700  1 mds.mds01 Updating MDS map to version 567911 from mon.0
...
2019-10-15 17:58:14.832 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 508.440553 secs
2019-10-15 17:58:18.062 7f317a875700  1 mds.mds01 Updating MDS map to version 567970 from mon.0
2019-10-15 17:58:19.832 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 513.440638 secs
2019-10-15 17:58:21.672 7f317a875700  1 mds.mds01 Updating MDS map to version 567971 from mon.0
2019-10-15 17:58:24.832 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 518.440702 secs
2019-10-15 17:58:26.162 7f317a875700  1 mds.mds01 Updating MDS map to version 567972 from mon.0
2019-10-15 17:58:29.832 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 523.440775 secs 
...

2019-10-16 11:01:21.730 7f3178070700  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61895.334789 secs
2019-10-16 11:01:26.730 7f3178070700  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61900.334856 secs
2019-10-16 11:01:31.730 7f3178070700  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61905.334924 secs
2019-10-16 11:01:36.730 7f3178070700  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61910.334989 secs
2019-10-16 11:01:41.730 7f3178070700  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61915.335083 secs
2019-10-16 11:01:46.730 7f3178070700  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61920.335180 secs
2019-10-16 11:01:51.729 7f3178070700  0 log_channel(cluster) log [WRN] : 4 slow requests, 0 included below; oldest blocked for > 61925.335275 secs

This makes our fs unusable. Is there anything I can do to further debug this?


Related issues

Duplicates CephFS - Bug #39987: mds: MDCache::cow_inode does not cleanup unneeded client_snap_caps Resolved

History

#1 Updated by Zheng Yan over 1 year ago

Do you use snapshots?

#2 Updated by Kenneth Waegeman over 1 year ago

Yes, we have a few snapshots.
Thanks

#3 Updated by Zheng Yan over 1 year ago

Did you create new snapshots while syncing data?

Please run 'ceph daemon mds.x dump cache /tmp/cachedump.x' for each MDS and upload the cachedump files.
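The request above can be scripted for convenience. A minimal sketch, assuming hypothetical MDS daemon names mds01 and mds02 (the real names come from `ceph fs status`); it only prints the commands, since `ceph daemon` must be run on the host holding each MDS's admin socket:

```shell
# Build the dump-cache command for every MDS (assumed names mds01/mds02).
# On a real cluster, run each printed command on the matching MDS host.
for mds in mds01 mds02; do
  cmd="ceph daemon mds.${mds} dump cache /tmp/cachedump.${mds}"
  echo "${cmd}"   # print only; swap echo for direct execution on the MDS host
done
```

The resulting /tmp/cachedump.* files are written on each MDS host, not the admin node.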

#4 Updated by Kenneth Waegeman over 1 year ago

Hi,

I didn't create a new snapshot while syncing data.

Files sent by mail; I was not able to post them here.

#5 Updated by Zheng Yan over 1 year ago

Where did you send the log file? Please share the file with me () through Google Drive.

#6 Updated by Patrick Donnelly over 1 year ago

  • Status changed from New to Need More Info

#7 Updated by Kenneth Waegeman over 1 year ago

I've sent it through Google Drive this time. Thanks again!
K

#8 Updated by Zheng Yan over 1 year ago

  • Status changed from Need More Info to Duplicate

Dup of https://tracker.ceph.com/issues/39987; it will be fixed in v14.2.5. You can avoid this bug by not creating new snapshots (after restarting all MDS daemons).

#9 Updated by Patrick Donnelly over 1 year ago

  • Duplicates Bug #39987: mds: MDCache::cow_inode does not cleanup unneeded client_snap_caps added

#10 Updated by Kenneth Waegeman over 1 year ago

I restarted all MDS daemons and did not create a snapshot after that, but I'm still seeing those slow requests.

#11 Updated by Zheng Yan over 1 year ago

Please dump the cache and share it again.

#12 Updated by Kenneth Waegeman over 1 year ago

I restarted the MDS at 15:06, didn't take any snapshots, and dumped the cache with the slow requests around 16:00. I've sent them through Drive.
Thanks again!

#13 Updated by Zheng Yan over 1 year ago

Still the same issue. The logs show there are inode locks in 'snap->sync' states. Try unmounting all clients and restarting all MDS daemons.

#14 Updated by Kenneth Waegeman over 1 year ago

Still seeing the issue after unmounting and restarting :-(
