Bug #58617

mds: "Failed to authpin,subtree is being exported" results in large number of blocked requests

Added by zhikuo du about 1 year ago. Updated 7 months ago.

Status:
Triaged
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The problem: the cluster (Octopus 15.2.16) has a large number of blocked requests. The error associated with the blocked requests is:

2023-01-02T15:59:10.078+0800 7f55b1734700  0 log_channel(cluster) log [WRN] : slow request 15364.865004 seconds old, received at 2023-01-02T11:43:05.214763+0800: client_request(client.450609:59264338 lookup #0x40004e86a72/halo_ce_grace_F4284.spec.pt 2023-01-02T11:43:05.153256+0800 caller_uid=0, caller_gid=0{}) currently failed to authpin, subtree is being exported
2023-01-02T15:59:10.078+0800 7f55b1734700 0 log_channel(cluster) log [WRN] : slow request 15360.774800 seconds old, received at 2023-01-02T11:43:09.304967+0800: client_request(client.450609:59265051 lookup #0x40004e86a72/halo_ce_grace_F3233.wav 2023-01-02T11:43:09.243256+0800 caller_uid=0, caller_gid=0{}) currently failed to authpin, subtree is being exported

Eventually, many requests are blocked for hours. We can restore the cluster by restarting the affected MDS.

The relevant log:

2023-01-02T18:38:32.319+0800 7f55b3738700 10 mds.11.mig show_exporting exporting to 8: (6) warning 0x40004e86a72.001001100* [dir 0x40004e86a72.001001100* /data/46f/732/03237764b1b2b824550ff4e750/data/vits_data/generate/wavs/ [2,head] auth{0=2,1=1,2=1,3=1,6=1,8=2,10=1} v=116159 cv=116158/116158 dir_auth=11,11 state=1610875907|complete|frozentree|exporting f(v1690 m2022-12-16T15:05:36.373759+0800 1929=1929+0) n(v900 rc2022-12-16T15:05:36.373759+0800 b822676763 1929=1929+0) hs=1929+0,ss=0+0 | ptrwaiter=1 request=0 child=1 frozen=1 subtree=1 importing=0 replicated=1 dirty=1 waiter=1 authpin=0 0x563e5ee71200]
2023-01-02T18:38:33.103+0800 7f55b3738700 10 mds.11.mig show_exporting exporting to 8: (6) warning 0x40004e86a72.001001100* [dir 0x40004e86a72.001001100* /data/46f/732/03237764b1b2b824550ff4e750/data/vits_data/generate/wavs/ [2,head] auth{0=2,1=1,2=1,3=1,6=1,8=2,10=1} v=116159 cv=116158/116158 dir_auth=11,11 state=1610875907|complete|frozentree|exporting f(v1690 m2022-12-16T15:05:36.373759+0800 1929=1929+0) n(v900 rc2022-12-16T15:05:36.373759+0800 b822676763 1929=1929+0) hs=1929+0,ss=0+0 | ptrwaiter=1 request=0 child=1 frozen=1 subtree=1 importing=0 replicated=1 dirty=1 waiter=1 authpin=0 0x563e5ee71200]

After reading the migration code, I think the reason is the following:
When one or more CEPH_SESSION_FLUSHMSG or MSG_MDS_EXPORTDIRNOTIFY messages are lost, for example because a session is reset or the underlying connection is reconnected (I believe our case is the latter), the export of the dir never makes progress again and the dir stays frozen forever.
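
To make the failure mode concrete, here is a minimal standalone C++ model of the export handshake. This is only a sketch of the mechanism described above, assuming a simplified one-ack-per-replica protocol; the names (Exporter, notify_ack_waiting, and so on) are hypothetical and this is not the actual Migrator code. With no resend path, a single lost message leaves the dir frozen forever:

// Toy model of the hang: the exporter freezes the dir, sends a "warning"
// notify to every replica, and only proceeds once every ack has arrived.
#include <cstdio>
#include <set>

enum class ExportState { Warning, Exporting };

struct Exporter {
  ExportState state = ExportState::Warning;
  std::set<int> notify_ack_waiting;  // replica ranks we still expect an ack from
  bool dir_frozen = true;            // the dir is frozen for the whole export

  void send_warnings(const std::set<int>& replicas) {
    notify_ack_waiting = replicas;   // expect one ack per replica
    // (messages go out here; if the session to a replica resets,
    //  its ack may never arrive and nothing re-sends the notify)
  }

  void handle_notify_ack(int from) {
    notify_ack_waiting.erase(from);
    if (notify_ack_waiting.empty()) {
      state = ExportState::Exporting;  // only now does the export proceed
      dir_frozen = false;              // ...and the dir can thaw
    }
  }
};

int main() {
  Exporter ex;
  ex.send_warnings({0, 1, 8});

  // Acks from ranks 0 and 1 arrive; the ack from rank 8 is lost
  // because its session was reset mid-export.
  ex.handle_notify_ack(0);
  ex.handle_notify_ack(1);

  // The exporter is stuck in Warning forever: every client request that
  // needs an authpin on this dir now fails with
  // "failed to authpin, subtree is being exported".
  std::printf("state=%d frozen=%d waiting=%zu\n",
              static_cast<int>(ex.state), static_cast<int>(ex.dir_frozen),
              ex.notify_ack_waiting.size());
}

Running this prints state=0 frozen=1 waiting=1: the exporter never leaves the warning stage, matching the frozentree|exporting state in the log above.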

Actions #1

Updated by Xiubo Li about 1 year ago

There is another tracker that is also stuck on "failed to authpin, subtree is being exported":

https://tracker.ceph.com/issues/58488

But that one was caused by a large mdlog event size issue.

Actions #2

Updated by zhikuo du about 1 year ago

There is another tracker for "Failed to authpin, subtree is being exported":
https://tracker.ceph.com/issues/42338

But we do not use snapshots on this fs, so I do not think it is the same problem as the one solved by https://tracker.ceph.com/issues/39987 .

Reports of this problem can also be found on the internet:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IYBVGVKCX2OE66EYC34YQNZOZ7BATUZJ/
https://www.spinics.net/lists/ceph-users/msg52056.html

Actions #3

Updated by Xiubo Li about 1 year ago

zhikuo du wrote:

There is another tracker for "Failed to authpin, subtree is being exported":
https://tracker.ceph.com/issues/42338

But we do not use snapshots on this fs, so I do not think it is the same problem as the one solved by https://tracker.ceph.com/issues/39987 .

Reports of this problem can also be found on the internet:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IYBVGVKCX2OE66EYC34YQNZOZ7BATUZJ/
https://www.spinics.net/lists/ceph-users/msg52056.html

I mentioned it here only because their results and logs look a little similar.

Actions #4

Updated by zhikuo du about 1 year ago

1. The first commit in PR 49940 handles the case where the cluster hangs in state EXPORT_WARNING, which is consistent with the log above.
2. For the similar case where the cluster hangs in state EXPORT_NOTIFYING, a new commit has been added. It is noteworthy that the fix does not cancel the export of the dir, but simply resends the messages; see the sketch after this list.
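
A toy model of that resend approach, written against the same simplified state machine as the sketch in the description (all names here are hypothetical; this is not the actual code in PR 49940):

// On reconnect, walk the in-flight exports and re-send any outstanding
// notify for that rank, instead of cancelling the export.
#include <cstdint>
#include <cstdio>
#include <map>
#include <set>

struct ExportInfo {
  std::set<int> notify_ack_waiting;  // ranks whose acks are still outstanding
};

struct Migrator {
  std::map<uint64_t, ExportInfo> exports;  // in-flight exports, keyed by dir ino

  // Called when the session/connection to `rank` is re-established.
  void handle_reconnect(int rank) {
    for (auto& [dir, info] : exports)
      if (info.notify_ack_waiting.count(rank))
        resend_notify(dir, rank);  // the lost message goes out again, so the
                                   // ack can arrive and the frozen dir can thaw
  }

  void resend_notify(uint64_t dir, int rank) {
    std::printf("resending export notify for dir 0x%llx to mds.%d\n",
                static_cast<unsigned long long>(dir), rank);
  }
};

int main() {
  Migrator m;
  m.exports[0x40004e86a72].notify_ack_waiting = {8};  // ack from mds.8 was lost
  m.handle_reconnect(8);  // the reconnect triggers the resend
}

The point is only the shape of the fix: a reconnect event re-emits the notify for every export still waiting on that rank, rather than aborting the migration.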

Actions #5

Updated by Venky Shankar about 1 year ago

  • Status changed from New to Triaged
  • Assignee set to zhikuo du
  • Target version set to v18.0.0
  • Source set to Community (user)
  • Tags deleted (mds)
  • Backport set to pacific,quincy
Actions #6

Updated by Patrick Donnelly 7 months ago

  • Target version deleted (v18.0.0)