Bug #58617
mds: "failed to authpin, subtree is being exported" results in a large number of blocked requests
Description
Our cluster (octopus 15.2.16) has a large number of blocked requests. The warnings associated with the blocked requests are:
2023-01-02T15:59:10.078+0800 7f55b1734700 0 log_channel(cluster) log [WRN] : slow request 15364.865004 seconds old, received at 2023-01-02T11:43:05.214763+0800: client_request(client.450609:59264338 lookup #0x40004e86a72/halo_ce_grace_F4284.spec.pt 2023-01-02T11:43:05.153256+0800 caller_uid=0, caller_gid=0{}) currently failed to authpin, subtree is being exported
2023-01-02T15:59:10.078+0800 7f55b1734700 0 log_channel(cluster) log [WRN] : slow request 15360.774800 seconds old, received at 2023-01-02T11:43:09.304967+0800: client_request(client.450609:59265051 lookup #0x40004e86a72/halo_ce_grace_F3233.wav 2023-01-02T11:43:09.243256+0800 caller_uid=0, caller_gid=0{}) currently failed to authpin, subtree is being exported
Eventually, many requests are blocked for hours. We can restore the cluster by restarting the affected MDS.
The relevant log lines:
2023-01-02T18:38:32.319+0800 7f55b3738700 10 mds.11.mig show_exporting exporting to 8: (6) warning 0x40004e86a72.001001100* [dir 0x40004e86a72.001001100* /data/46f/732/03237764b1b2b824550ff4e750/data/vits_data/generate/wavs/ [2,head] auth{0=2,1=1,2=1,3=1,6=1,8=2,10=1} v=116159 cv=116158/116158 dir_auth=11,11 state=1610875907|complete|frozentree|exporting f(v1690 m2022-12-16T15:05:36.373759+0800 1929=1929+0) n(v900 rc2022-12-16T15:05: 36.373759+0800 b822676763 1929=1929+0) hs=1929+0,ss=0+0 | ptrwaiter=1 request=0 child=1 frozen=1 subtree=1 importing=0 replicated=1 dirty=1 waiter=1 authpin=0 0x563e5ee71200]
2023-01-02T18:38:33.103+0800 7f55b3738700 10 mds.11.mig show_exporting exporting to 8: (6) warning 0x40004e86a72.001001100* [dir 0x40004e86a72.001001100* /data/46f/732/03237764b1b2b824550ff4e750/data/vits_data/generate/wavs/ [2,head] auth{0=2,1=1,2=1,3=1,6=1,8=2,10=1} v=116159 cv=116158/116158 dir_auth=11,11 state=1610875907|complete|frozentree|exporting f(v1690 m2022-12-16T15:05:36.373759+0800 1929=1929+0) n(v900 rc2022-12-16T15:05: 36.373759+0800 b822676763 1929=1929+0) hs=1929+0,ss=0+0 | ptrwaiter=1 request=0 child=1 frozen=1 subtree=1 importing=0 replicated=1 dirty=1 waiter=1 authpin=0 0x563e5ee71200]
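To check whether the same export stays in the same state across many such lines, the show_exporting output can be summarized with a small script. This is only a sketch: the regular expression is inferred from the two lines quoted above, not from any documented log format, and the names parse_show_exporting and export_spans are made up for illustration.

```python
import re
from datetime import datetime

# Layout inferred from the show_exporting lines quoted in this report
# (an assumption, not a documented format):
# <timestamp> <thread> <level> mds.<rank>.mig show_exporting
#   exporting to <peer>: (<state number>) <state name> <dirfrag> [dir ...]
SHOW_EXPORTING = re.compile(
    r"^(?P<ts>\S+) \S+ +\d+ mds\.(?P<rank>\d+)\.mig show_exporting "
    r"exporting to (?P<peer>\d+): \((?P<state_num>\d+)\) (?P<state>\w+) "
    r"(?P<dirfrag>\S+)"
)


def parse_show_exporting(line):
    """Return a dict for a show_exporting line, or None if it does not match."""
    m = SHOW_EXPORTING.match(line)
    if m is None:
        return None
    return {
        "time": datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%f%z"),
        "rank": int(m["rank"]),
        "peer": int(m["peer"]),
        "state_num": int(m["state_num"]),
        "state": m["state"],
        "dirfrag": m["dirfrag"],
    }


def export_spans(lines):
    """Map (dirfrag, peer, state) to the seconds between its first and last sighting."""
    first, last = {}, {}
    for line in lines:
        rec = parse_show_exporting(line)
        if rec is None:
            continue  # not a show_exporting line
        key = (rec["dirfrag"], rec["peer"], rec["state"])
        first.setdefault(key, rec["time"])
        last[key] = rec["time"]
    return {key: (last[key] - first[key]).total_seconds() for key in first}
```

Feeding an MDS log file to export_spans (for example `export_spans(open(path))`, where path is the local log file) gives one entry per export; an entry whose span keeps growing while the state name never changes matches the hang described here.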
After reading the migration code, I think the reason is:
If one or more CEPH_SESSION_FLUSHMSG or MSG_MDS_EXPORTDIRNOTIFY messages are lost, for example because the session is reset or the underlying connection is reconnected (I think this is our situation), the export of the directory never proceeds and the directory stays frozen forever.
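The suspected mechanism can be illustrated with a toy model. This is not Ceph's Migrator code: the class ToyExport, the helper deliver_acks, and the peer numbers are hypothetical, and the model assumes only that the exporter advances once every expected acknowledgement has arrived and has no other way out of the waiting state.

```python
class ToyExport:
    """Toy model of an export waiting for acks (not Ceph's Migrator)."""

    def __init__(self, bystanders):
        # While in "warning" the subtree is frozen, so client requests
        # that need an authpin on it have to wait.
        self.state = "warning"
        self.waiting_for = set(bystanders)

    def handle_ack(self, peer):
        self.waiting_for.discard(peer)
        # The only way forward is an empty waiting set.
        if not self.waiting_for and self.state == "warning":
            self.state = "exporting"


def deliver_acks(export, peers, lost=()):
    """Deliver one ack per peer, except for peers whose message was lost."""
    for peer in peers:
        if peer not in lost:
            export.handle_ack(peer)


# One lost message is enough to keep the export in "warning" forever.
stuck = ToyExport([0, 1, 2])
deliver_acks(stuck, [0, 1, 2], lost={2})
print(stuck.state, sorted(stuck.waiting_for))
```

In this model nothing ever retries the message to peer 2, so the state cannot change until the process is restarted, which is consistent with the observation that restarting the affected MDS clears the blocked requests.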
History
#1 Updated by Xiubo Li about 2 months ago
There is another tracker where requests are also stuck with "failed to authpin, subtree is being exported":
https://tracker.ceph.com/issues/58488
But that one was caused by an overly large mdlog event.
#2 Updated by zhikuo du about 2 months ago
https://tracker.ceph.com/issues/42338
This is another tracker for "failed to authpin, subtree is being exported".
But we do not use snapshots on this fs, so I do not think it is the same problem as the one solved by https://tracker.ceph.com/issues/39987 .
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IYBVGVKCX2OE66EYC34YQNZOZ7BATUZJ/
https://www.spinics.net/lists/ceph-users/msg52056.html
Reports of this problem can also be found elsewhere on the internet; see the two mailing list threads above.
#3 Updated by Xiubo Li about 2 months ago
zhikuo du wrote:
https://tracker.ceph.com/issues/42338
This is another tracker for "failed to authpin, subtree is being exported". But we do not use snapshots on this fs, so I do not think it is the same problem as the one solved by https://tracker.ceph.com/issues/39987 .
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IYBVGVKCX2OE66EYC34YQNZOZ7BATUZJ/
https://www.spinics.net/lists/ceph-users/msg52056.html
Reports of this problem can also be found elsewhere on the internet; see the two mailing list threads above.
I mentioned that tracker here only because its symptoms and logs look similar.
#4 Updated by zhikuo du about 2 months ago
1. The first commit in PR 49940 handles the case where the cluster hangs in state EXPORT_WARNING, which is consistent with the log.
2. For the similar case where the cluster hangs in state EXPORT_NOTIFYING, a new commit has been added. Note that the fix does not cancel the export of the directory; it only resends the messages.
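The "resend instead of cancel" idea can be sketched as follows. This is not the code from the PR: the class ResendingExport, the resend_after timeout, and the send callback are assumptions made for illustration, and the sketch assumes the receiver tolerates duplicate notifications.

```python
class ResendingExport:
    """Toy exporter that resends unacknowledged notifies (illustrative only)."""

    def __init__(self, bystanders, send, resend_after=10.0):
        self.state = "warning"
        self.bystanders = list(bystanders)
        self.send = send                  # hypothetical callable: send(peer)
        self.resend_after = resend_after  # assumed timeout, in seconds
        self.last_sent = {}               # peer -> time of the last send

    def start(self, now):
        for peer in self.bystanders:
            self.send(peer)
            self.last_sent[peer] = now

    def handle_ack(self, peer):
        self.last_sent.pop(peer, None)
        if not self.last_sent and self.state == "warning":
            self.state = "exporting"

    def tick(self, now):
        """Resend to peers that have not acked within resend_after seconds."""
        resent = []
        for peer, sent_at in self.last_sent.items():
            if now - sent_at >= self.resend_after:
                self.send(peer)
                self.last_sent[peer] = now  # existing key, safe while iterating
                resent.append(peer)
        return resent  # the export itself is never cancelled here
```

Calling tick from a periodic timer lets the export recover from a lost message without abandoning the work already done. The cost is that peers may see the same notification twice, so their handler has to be idempotent.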
#5 Updated by Venky Shankar about 2 months ago
- Status changed from New to Triaged
- Assignee set to zhikuo du
- Target version set to v18.0.0
- Source set to Community (user)
- Tags deleted (mds)
- Backport set to pacific,quincy