Bug #58617: mds: "Failed to authpin,subtree is being exported" results in large number of blocked requests - CephFS - Ceph


Bug #58617

open

mds: "Failed to authpin,subtree is being exported" results in large number of blocked requests

Added by zhikuo du about 1 year ago. Updated 7 months ago.

Status: Triaged
Priority: Normal
Assignee: zhikuo du
Category: Correctness/Safety
Target version:
% Done:
Source: Community (user)
Tags:
Backport: pacific,quincy
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): multimds
Pull request ID: 49940
Crash signature (v1):
Crash signature (v2):

Description

The cluster (Octopus 15.2.16) has a large number of blocked requests. The warning logged for the blocked requests is:

2023-01-02T15:59:10.078+0800 7f55b1734700  0 log_channel(cluster) log [WRN] : slow request 15364.865004 seconds old, received at 2023-01-02T11:43:05.214763+0800: client_request(client.450609:59264338 lookup #0x40004e86a72/halo_ce_grace_F4284.spec.pt 2023-01-02T11:43:05.153256+0800 caller_uid=0, caller_gid=0{}) currently failed to authpin, subtree is being exported
2023-01-02T15:59:10.078+0800 7f55b1734700  0 log_channel(cluster) log [WRN] : slow request 15360.774800 seconds old, received at 2023-01-02T11:43:09.304967+0800: client_request(client.450609:59265051 lookup #0x40004e86a72/halo_ce_grace_F3233.wav 2023-01-02T11:43:09.243256+0800 caller_uid=0, caller_gid=0{}) currently failed to authpin, subtree is being exported
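To see how many requests are blocked and for what reason, warnings like the two above can be pulled out of the MDS log with a small script. This is only a sketch: the regular expression is written against the line format shown in this report and may need adjusting for other Ceph versions, and the names `SLOW_REQUEST_RE`, `parse_slow_request` and `count_by_state` are made up for illustration.

```python
import re
from collections import Counter

# Pattern matching the "slow request" warning format quoted in this report.
# Assumption: other releases may format the line differently.
SLOW_REQUEST_RE = re.compile(
    r"slow request (?P<age>\d+(?:\.\d+)?) seconds old, "
    r"received at (?P<received>\S+): "
    r"client_request\((?P<client>client\.\d+):(?P<tid>\d+) (?P<op>\w+) .*\) "
    r"currently (?P<state>.+)$"
)


def parse_slow_request(line):
    """Return a dict describing one slow-request warning, or None if no match."""
    match = SLOW_REQUEST_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["age"] = float(fields["age"])  # seconds the request has been blocked
    return fields


def count_by_state(lines):
    """Count slow requests per blocking reason (the text after 'currently')."""
    counts = Counter()
    for line in lines:
        parsed = parse_slow_request(line)
        if parsed is not None:
            counts[parsed["state"]] += 1
    return counts
```

Feeding the log file to `count_by_state(open(path))` gives a quick tally of how many requests are stuck on "failed to authpin, subtree is being exported" versus other reasons.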

Eventually, many requests are blocked for hours. We can restore the cluster by restarting the affected MDS.

The relevant log:

2023-01-02T18:38:32.319+0800 7f55b3738700 10 mds.11.mig show_exporting  exporting to 8: (6) warning 0x40004e86a72.001001100* [dir 0x40004e86a72.001001100* /data/46f/732/03237764b1b2b824550ff4e750/data/vits_data/generate/wavs/ [2,head] auth{0=2,1=1,2=1,3=1,6=1,8=2,10=1} v=116159 cv=116158/116158 dir_auth=11,11 state=1610875907|complete|frozentree|exporting f(v1690 m2022-12-16T15:05:36.373759+0800 1929=1929+0) n(v900 rc2022-12-16T15:05:  36.373759+0800 b822676763 1929=1929+0) hs=1929+0,ss=0+0 | ptrwaiter=1 request=0 child=1 frozen=1 subtree=1 importing=0 replicated=1 dirty=1 waiter=1 authpin=0 0x563e5ee71200]
2023-01-02T18:38:33.103+0800 7f55b3738700 10 mds.11.mig show_exporting  exporting to 8: (6) warning 0x40004e86a72.001001100* [dir 0x40004e86a72.001001100* /data/46f/732/03237764b1b2b824550ff4e750/data/vits_data/generate/wavs/ [2,head] auth{0=2,1=1,2=1,3=1,6=1,8=2,10=1} v=116159 cv=116158/116158 dir_auth=11,11 state=1610875907|complete|frozentree|exporting f(v1690 m2022-12-16T15:05:36.373759+0800 1929=1929+0) n(v900 rc2022-12-16T15:05:  36.373759+0800 b822676763 1929=1929+0) hs=1929+0,ss=0+0 | ptrwaiter=1 request=0 child=1 frozen=1 subtree=1 importing=0 replicated=1 dirty=1 waiter=1 authpin=0 0x563e5ee71200]

After reading the migration code, I think the cause is:
When one or more CEPH_SESSION_FLUSHMSG or MSG_MDS_EXPORTDIRNOTIFY messages are lost, for example because the session is reset or the underlying connection is reconnected (I think this is what happened in our case), the export of the directory never proceeds and the directory stays frozen forever.
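The suspected failure mode can be illustrated with a toy model. It is not the MDS code: `ToyExport`, `handle_ack` and `run` are hypothetical names, the rank numbers are arbitrary, and the real Migrator has more states and message types. It only shows that an exporter which waits for an ack from every peer never leaves its waiting state, and never unfreezes the directory, if a single message is dropped.

```python
from dataclasses import dataclass, field


@dataclass
class ToyExport:
    """Hypothetical stand-in for one in-flight directory export."""
    bystanders: set
    state: str = "WARNING"
    frozen: bool = True  # the subtree stays frozen while the export is in flight
    ack_waiting: set = field(default_factory=set)

    def __post_init__(self):
        # The exporter expects one ack from every bystander it notified.
        self.ack_waiting = set(self.bystanders)

    def handle_ack(self, rank):
        self.ack_waiting.discard(rank)
        if not self.ack_waiting:
            # Only once every ack has arrived can the export move on and unfreeze.
            self.state = "DONE"
            self.frozen = False


def run(bystanders, lost):
    """Notify every bystander; ranks in `lost` never receive the message."""
    export = ToyExport(set(bystanders))
    for rank in bystanders:
        if rank in lost:
            continue  # message dropped: the peer never sees it and never acks
        export.handle_ack(rank)
    return export


ok = run([0, 1, 2], lost=set())
stuck = run([0, 1, 2], lost={1})
print(ok.state, ok.frozen)                          # DONE False
print(stuck.state, stuck.frozen, stuck.ack_waiting)  # WARNING True {1}
```

In the stuck case nothing ever retries, so requests that need to authpin inside the frozen subtree would wait indefinitely, which matches the slow-request warnings above.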


Updated by Xiubo Li about 1 year ago

There is another tracker that is also stuck on "Failed to authpin, subtree is being exported":

https://tracker.ceph.com/issues/58488

But that one was caused by a large mdlog event size issue.


Updated by zhikuo du about 1 year ago

There is another tracker for "Failed to authpin, subtree is being exported":
https://tracker.ceph.com/issues/42338

But we do not use snapshots on this fs, so I do not think it is the same problem as the one fixed by https://tracker.ceph.com/issues/39987 .

Reports of this problem can also be found on the internet:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IYBVGVKCX2OE66EYC34YQNZOZ7BATUZJ/
https://www.spinics.net/lists/ceph-users/msg52056.html


Updated by Xiubo Li about 1 year ago

zhikuo du wrote:

There is another tracker for "Failed to authpin, subtree is being exported":
https://tracker.ceph.com/issues/42338

But we do not use snapshots on this fs, so I do not think it is the same problem as the one fixed by https://tracker.ceph.com/issues/39987 .

Reports of this problem can also be found on the internet:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IYBVGVKCX2OE66EYC34YQNZOZ7BATUZJ/
https://www.spinics.net/lists/ceph-users/msg52056.html

I only mentioned the other tracker here because its symptoms and logs look somewhat similar.


Updated by zhikuo du about 1 year ago

1. The first commit in PR 49940 handles the case where the cluster hangs in state EXPORT_WARNING, which is consistent with the log.
2. For the similar case where the cluster hangs in state EXPORT_NOTIFYING, a new commit has been added. Note that the fix does not cancel the export of this dir; it only resends the messages.
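The "resend instead of cancel" idea can be sketched roughly as follows. This is a toy illustration and not the code in PR 49940: `ToyChannel`, `ToyExporter` and `on_connection_reset` are hypothetical names, and a delivered message is assumed to be acked immediately.

```python
class ToyChannel:
    """Hypothetical lossy link: drops the first N messages sent to given peers."""

    def __init__(self, drops):
        self.drops = dict(drops)  # peer rank -> number of sends still to drop

    def send(self, peer):
        if self.drops.get(peer, 0) > 0:
            self.drops[peer] -= 1
            return False  # message lost, e.g. the connection was reset
        return True  # delivered; in this toy the peer acks right away


class ToyExporter:
    """Hypothetical exporter that waits in NOTIFYING until every peer has acked."""

    def __init__(self, peers, channel):
        self.channel = channel
        self.state = "NOTIFYING"
        self.cancelled = False
        self.ack_waiting = set(peers)
        for peer in sorted(peers):
            self._send(peer)

    def _send(self, peer):
        if self.channel.send(peer):
            self.ack_waiting.discard(peer)
            if not self.ack_waiting:
                self.state = "FINISHED"

    def on_connection_reset(self, peer):
        # Resend to a peer we are still waiting on; the export itself is kept.
        if peer in self.ack_waiting:
            self._send(peer)


exporter = ToyExporter({1, 2, 3}, ToyChannel({2: 1}))
print(exporter.state, exporter.ack_waiting)  # NOTIFYING {2}
exporter.on_connection_reset(2)
print(exporter.state, exporter.cancelled)    # FINISHED False
```

Resending keeps the work already done for the export, whereas cancelling would have to unwind the frozen subtree and start the migration over.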
