Bug #19955: Too many stat ops when MDS trying to probe a large file

Added by Sandy Xu almost 7 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Category: -
Target version: -
% Done: 0%
Backport: jewel, kraken
Regression: No
Severity: 3 - minor
ceph-qa-suite: fs
Component(FS): MDS

Description

When the MDS recovers, it may issue a huge number of `stat` ops to the OSDs while probing a large file, which in turn prevents clients from accessing that file. In our environment, the cluster complains:

2017-05-16 09:29:20.131180 7f9cc4914700 0 log_channel(cluster) log [WRN] : slow request 962.445029 seconds old, received at 2017-05-16 09:13:17.686086: client_request(client.1225271:14693 getattr pAsLsXsFs #10000012f8f 2017-05-16 09:02:36.373213 RETRY=2) currently failed to rdlock, waiting
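
For scale: assuming the default CephFS file layout of 4 MiB objects (the report does not state the actual layout), a 10 TB file is backed by roughly 2.4 million RADOS objects, so probing it object by object means millions of `stat` ops. A quick back-of-the-envelope check:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Assumption: default CephFS file layout, i.e. 4 MiB objects; the
    // actual layout of the affected file is not given in the report.
    const uint64_t file_size   = 10ULL * 1000 * 1000 * 1000 * 1000; // ~10 TB
    const uint64_t object_size = 4ULL * 1024 * 1024;                // 4 MiB
    const uint64_t objects     = (file_size + object_size - 1) / object_size;
    // Prints roughly 2.4 million -- one stat per backing object adds up fast.
    std::printf("backing objects to stat: %llu\n",
                (unsigned long long)objects);
    return 0;
}
```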

The MDS objecter log shows a large number of `stat` ops (inode 10000012f8f refers to a 10 TB file):

2017-05-16 11:29:15.129724 7f9cc2910700 10 mds.0.objecter _op_submit op 0x55664bc241c0
2017-05-16 11:29:15.129902 7f9cc2910700 10 mds.0.objecter _calc_target pgid 2.b5d43f4c acting [63,69,50,23,64,49]
2017-05-16 11:29:15.129911 7f9cc2910700 10 mds.0.objecter _op_submit oid 10000012f8f.00003b8d '@2' '@2' [stat] tid 5729647 osd.63
2017-05-16 11:29:15.129920 7f9cc2910700 5 mds.0.objecter 1 unacked, 0 uncommitted
2017-05-16 11:29:15.138341 7f9cb737e700 10 mds.0.objecter ms_dispatch 0x55664b4a0000 osd_op_reply(5729647 10000012f8f.00003b8d [stat] v0'0 uv7450 ondisk = 0) v7
2017-05-16 11:29:15.138352 7f9cb737e700 10 mds.0.objecter in handle_osd_op_reply
2017-05-16 11:29:15.138356 7f9cb737e700 7 mds.0.objecter handle_osd_op_reply 5729647 ondisk v 0'0 uv 7450 in 2.b5d43f4c attempt 0
2017-05-16 11:29:15.138359 7f9cb737e700 10 mds.0.objecter op 0 rval 0 len 16
2017-05-16 11:29:15.138365 7f9cb737e700 5 mds.0.objecter 0 unacked, 0 uncommitted
2017-05-16 11:29:15.138418 7f9cc2910700 10 mds.0.objecter _op_submit op 0x55664ec53200
2017-05-16 11:29:15.138594 7f9cc2910700 10 mds.0.objecter _calc_target pgid 2.876e8608 acting [34,1,57,30,41,2]
2017-05-16 11:29:15.138603 7f9cc2910700 10 mds.0.objecter _op_submit oid 10000012f8f.00003b8c '@2' '@2' [stat] tid 5729648 osd.34
2017-05-16 11:29:15.138613 7f9cc2910700 5 mds.0.objecter 1 unacked, 0 uncommitted
2017-05-16 11:29:15.139957 7f9cb6772700 10 mds.0.objecter ms_dispatch 0x55664b4a0000 osd_op_reply(5729648 10000012f8f.00003b8c [stat] v0'0 uv7416 ondisk = 0) v7
2017-05-16 11:29:15.139969 7f9cb6772700 10 mds.0.objecter in handle_osd_op_reply
2017-05-16 11:29:15.139974 7f9cb6772700 7 mds.0.objecter handle_osd_op_reply 5729648 ondisk v 0'0 uv 7416 in 2.876e8608 attempt 0
2017-05-16 11:29:15.139980 7f9cb6772700 10 mds.0.objecter op 0 rval 0 len 16

I suspect the MDS is busy running here:
https://github.com/ceph/ceph/blob/master/src/mds/RecoveryQueue.cc#L102-L104
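
The code referenced above hands the inode to the filer to probe for the real end of file. Conceptually, the cost comes from statting the file's backing objects one after another, which is consistent with the objecter log above, where each stat appears to complete before the next one is submitted. The sketch below is not the Ceph implementation, only a simplified, self-contained illustration of that cost; `FakeStore` and `probe_file_end` are hypothetical names.

```cpp
#include <cstdint>
#include <cstdio>

// Stand-in for a RADOS stat op on one backing object of the file.  In the
// real system each call is a round trip to an OSD; here we just simulate a
// file whose data ends at `true_size` and count the calls.
struct FakeStore {
    uint64_t true_size;
    uint64_t object_size;
    uint64_t stats_issued = 0;

    bool stat_object(uint64_t index, uint64_t *bytes_in_object) {
        ++stats_issued;
        uint64_t start = index * object_size;
        if (start >= true_size)
            return false;                       // object does not exist
        uint64_t remaining = true_size - start;
        *bytes_in_object = remaining < object_size ? remaining : object_size;
        return true;
    }
};

// Simplified illustration (not the Ceph code) of an end-of-file probe:
// stat every candidate object of the file to find the last one that holds
// data.  The cost is one stat per object, so a 10 TB file with 4 MiB
// objects costs ~2.4 million stat ops.
uint64_t probe_file_end(FakeStore &store, uint64_t max_size) {
    const uint64_t nobj =
        (max_size + store.object_size - 1) / store.object_size;
    uint64_t end = 0;
    for (uint64_t i = 0; i < nobj; ++i) {
        uint64_t len = 0;
        if (store.stat_object(i, &len))
            end = i * store.object_size + len;  // highest data seen so far
    }
    return end;
}

int main() {
    FakeStore store{/*true_size=*/10ULL * 1000 * 1000 * 1000 * 1000,
                    /*object_size=*/4ULL * 1024 * 1024};
    uint64_t size = probe_file_end(store, store.true_size);
    std::printf("recovered size: %llu bytes after %llu stat ops\n",
                (unsigned long long)size,
                (unsigned long long)store.stats_issued);
    return 0;
}
```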


Related issues (2: 0 open, 2 closed)

Copied to CephFS - Backport #20148: jewel: Too many stat ops when MDS trying to probe a large file (Resolved, Nathan Cutler)
Copied to CephFS - Backport #20149: kraken: Too many stat ops when MDS trying to probe a large file (Rejected)