Bug #19955

Too many stat ops when MDS trying to probe a large file

Added by Sangdi Xu 4 months ago. Updated 15 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
05/17/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
jewel, kraken
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Release:
jewel
Component(FS):
MDS
Needs Doc:
No

Description

When the MDS recovers, it may emit a huge number of `stat` ops to the OSDs while trying to probe a large file, which in turn prevents clients from accessing that file. In our environment, the cluster complains:

2017-05-16 09:29:20.131180 7f9cc4914700 0 log_channel(cluster) log [WRN] : slow request 962.445029 seconds old, received at 2017-05-16 09:13:17.686086: client_request(client.1225271:14693 getattr pAsLsXsFs #10000012f8f 2017-05-16 09:02:36.373213 RETRY=2) currently failed to rdlock, waiting

The MDS objecter log shows a lot of `stat` ops (inode 10000012f8f refers to a file of size 10TB):

2017-05-16 11:29:15.129724 7f9cc2910700 10 mds.0.objecter _op_submit op 0x55664bc241c0
2017-05-16 11:29:15.129902 7f9cc2910700 10 mds.0.objecter _calc_target pgid 2.b5d43f4c acting [63,69,50,23,64,49]
2017-05-16 11:29:15.129911 7f9cc2910700 10 mds.0.objecter _op_submit oid 10000012f8f.00003b8d '@2' '@2' [stat] tid 5729647 osd.63
2017-05-16 11:29:15.129920 7f9cc2910700 5 mds.0.objecter 1 unacked, 0 uncommitted
2017-05-16 11:29:15.138341 7f9cb737e700 10 mds.0.objecter ms_dispatch 0x55664b4a0000 osd_op_reply(5729647 10000012f8f.00003b8d [stat] v0'0 uv7450 ondisk = 0) v7
2017-05-16 11:29:15.138352 7f9cb737e700 10 mds.0.objecter in handle_osd_op_reply
2017-05-16 11:29:15.138356 7f9cb737e700 7 mds.0.objecter handle_osd_op_reply 5729647 ondisk v 0'0 uv 7450 in 2.b5d43f4c attempt 0
2017-05-16 11:29:15.138359 7f9cb737e700 10 mds.0.objecter op 0 rval 0 len 16
2017-05-16 11:29:15.138365 7f9cb737e700 5 mds.0.objecter 0 unacked, 0 uncommitted
2017-05-16 11:29:15.138418 7f9cc2910700 10 mds.0.objecter _op_submit op 0x55664ec53200
2017-05-16 11:29:15.138594 7f9cc2910700 10 mds.0.objecter _calc_target pgid 2.876e8608 acting [34,1,57,30,41,2]
2017-05-16 11:29:15.138603 7f9cc2910700 10 mds.0.objecter _op_submit oid 10000012f8f.00003b8c '@2' '@2' [stat] tid 5729648 osd.34
2017-05-16 11:29:15.138613 7f9cc2910700 5 mds.0.objecter 1 unacked, 0 uncommitted
2017-05-16 11:29:15.139957 7f9cb6772700 10 mds.0.objecter ms_dispatch 0x55664b4a0000 osd_op_reply(5729648 10000012f8f.00003b8c [stat] v0'0 uv7416 ondisk = 0) v7
2017-05-16 11:29:15.139969 7f9cb6772700 10 mds.0.objecter in handle_osd_op_reply
2017-05-16 11:29:15.139974 7f9cb6772700 7 mds.0.objecter handle_osd_op_reply 5729648 ondisk v 0'0 uv 7416 in 2.876e8608 attempt 0
2017-05-16 11:29:15.139980 7f9cb6772700 10 mds.0.objecter op 0 rval 0 len 16

My guess is that the MDS is busy looping here:
https://github.com/ceph/ceph/blob/master/src/mds/RecoveryQueue.cc#L102-L104
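For readers unfamiliar with the mechanism: CephFS stripes a file across RADOS objects named `<ino>.<index>` (both in hex, as in the `10000012f8f.00003b8d` oids above), and on recovery the MDS stats these objects to find the real end of file. A minimal Python sketch of why that is expensive (illustrative names and simplifications, not the actual Filer/RecoveryQueue code):

```python
# Hedged sketch (NOT the actual Ceph code): how end-of-file probing turns
# into a flood of stat ops. object_exists() stands in for a RADOS stat.

OBJECT_SIZE = 4 * 1024 * 1024  # CephFS default object size, 4 MiB

def object_name(ino, index):
    """CephFS data objects are named <ino-hex>.<index-as-8-hex-digits>."""
    return "%x.%08x" % (ino, index)

def probe_file_size(ino, client_max_size, object_exists):
    """Stat every object index up to the client-reported max_size.

    For a 10 TiB file with 4 MiB objects that is ~2.6 million stat ops,
    which matches the flood of [stat] submissions in the objecter log.
    """
    last = -1
    max_index = client_max_size // OBJECT_SIZE
    for idx in range(max_index + 1):
        if object_exists(object_name(ino, idx)):
            last = idx
    # file size is bounded by the last existing object
    return 0 if last < 0 else (last + 1) * OBJECT_SIZE
```

In the log above the probe actually walks downward (`.00003b8d`, then `.00003b8c`), but the cost is the same order: one stat per candidate object.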


Related issues

Copied to fs - Backport #20148: jewel: Too many stat ops when MDS trying to probe a large file Resolved
Copied to fs - Backport #20149: kraken: Too many stat ops when MDS trying to probe a large file Rejected

History

#1 Updated by John Spray 4 months ago

File size recovery should only happen when a client has failed to reconnect properly during the MDS restart. Is that happening here?

#2 Updated by Zheng Yan 4 months ago

  • Status changed from New to Need Review

#3 Updated by Sangdi Xu 4 months ago

John Spray wrote:

File size recovery should only happen when a client has failed to reconnect properly during the MDS restart. Is that happening here?

I'm afraid not; at least I didn't find any reconnect-phase errors or failures in the MDS log. See:

2017-05-16 09:13:17.685743 7f9cc721a700 1 mds.0.270 handle_mds_map state change up:replay --> up:reconnect
2017-05-16 09:13:17.685753 7f9cc721a700 1 mds.0.270 reconnect_start
2017-05-16 09:13:17.685755 7f9cc721a700 1 mds.0.270 reopen_log
2017-05-16 09:13:17.685762 7f9cc721a700 1 mds.0.server reconnect_clients -- 4 sessions
2017-05-16 09:13:17.686137 7f9cc721a700 0 log_channel(cluster) log [DBG] : reconnect by client.1299088 172.12.15.6:0/1397634078 after 0.000358
2017-05-16 09:13:17.686308 7f9cc721a700 0 log_channel(cluster) log [DBG] : reconnect by client.1225271 172.12.15.10:0/2404582558 after 0.000540
2017-05-16 09:13:17.686341 7f9cc721a700 0 log_channel(cluster) log [DBG] : reconnect by client.844340 172.12.15.11:0/1952305680 after 0.000575
2017-05-16 09:13:17.686436 7f9cc721a700 0 log_channel(cluster) log [DBG] : reconnect by client.1272254 172.12.15.6:0/2634976994 after 0.000671
2017-05-16 09:13:17.686445 7f9cc721a700 1 mds.0.270 reconnect_done
2017-05-16 09:13:18.686234 7f9cc721a700 1 mds.0.270 handle_mds_map i am now mds.0.270
2017-05-16 09:13:18.686239 7f9cc721a700 1 mds.0.270 handle_mds_map state change up:reconnect --> up:rejoin
2017-05-16 09:13:18.686249 7f9cc721a700 1 mds.0.270 rejoin_start
2017-05-16 09:13:18.686516 7f9cc721a700 1 mds.0.270 rejoin_joint_start
2017-05-16 09:13:18.686570 7f9cc721a700 1 mds.0.270 rejoin_done
2017-05-16 09:13:19.688104 7f9cc721a700 1 mds.0.270 handle_mds_map i am now mds.0.270
2017-05-16 09:13:19.688110 7f9cc721a700 1 mds.0.270 handle_mds_map state change up:rejoin --> up:active
2017-05-16 09:13:19.688120 7f9cc721a700 1 mds.0.270 recovery_done -- successful recovery!
2017-05-16 09:13:19.688231 7f9cc721a700 1 mds.0.270 active_start
2017-05-16 09:13:19.690655 7f9cc721a700 1 mds.0.270 cluster recovered.
2017-05-16 09:13:50.115214 7f9cc4914700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 32.429071 secs
2017-05-16 09:13:50.115242 7f9cc4914700 0 log_channel(cluster) log [WRN] : slow request 32.429071 seconds old, received at 2017-05-16 09:13:17.686086: client_request(client.1225271:14693 getattr pAsLsXsFs #10000012f8f 2017-05-16 09:02:36.373213 RETRY=2) currently failed to rdlock, waiting
2017-05-16 09:14:20.115731 7f9cc4914700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 62.429599 secs
2017-05-16 09:14:20.115742 7f9cc4914700 0 log_channel(cluster) log [WRN] : slow request 62.429599 seconds old, received at 2017-05-16 09:13:17.686086: client_request(client.1225271:14693 getattr pAsLsXsFs #10000012f8f 2017-05-16 09:02:36.373213 RETRY=2) currently failed to rdlock, waiting
2017-05-16 09:15:20.116604 7f9cc4914700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 122.430466 secs
2017-05-16 09:15:20.116616 7f9cc4914700 0 log_channel(cluster) log [WRN] : slow request 122.430466 seconds old, received at 2017-05-16 09:13:17.686086: client_request(client.1225271:14693 getattr pAsLsXsFs #10000012f8f 2017-05-16 09:02:36.373213 RETRY=2) currently failed to rdlock, waiting
2017-05-16 09:17:20.118358 7f9cc4914700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 242.432208 secs
2017-05-16 09:17:20.118369 7f9cc4914700 0 log_channel(cluster) log [WRN] : slow request 242.432208 seconds old, received at 2017-05-16 09:13:17.686086: client_request(client.1225271:14693 getattr pAsLsXsFs #10000012f8f 2017-05-16 09:02:36.373213 RETRY=2) currently failed to rdlock, waiting
2017-05-16 09:21:20.122082 7f9cc4914700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 482.435939 secs
2017-05-16 09:21:20.122093 7f9cc4914700 0 log_channel(cluster) log [WRN] : slow request 482.435939 seconds old, received at 2017-05-16 09:13:17.686086: client_request(client.1225271:14693 getattr pAsLsXsFs #10000012f8f 2017-05-16 09:02:36.373213 RETRY=2) currently failed to rdlock, waiting
2017-05-16 09:29:20.131170 7f9cc4914700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 962.445029 secs
2017-05-16 09:29:20.131180 7f9cc4914700 0 log_channel(cluster) log [WRN] : slow request 962.445029 seconds old, received at 2017-05-16 09:13:17.686086: client_request(client.1225271:14693 getattr pAsLsXsFs #10000012f8f 2017-05-16 09:02:36.373213 RETRY=2) currently failed to rdlock, waiting

#4 Updated by Zheng Yan 4 months ago

It's possible that client_range is not properly cleared in some corner cases. We need detailed logs to find out.

#5 Updated by John Spray 4 months ago

Zheng -- should we backport this? What will happen to old clients trying to access MDSs using the new behaviour?

#6 Updated by Zheng Yan 4 months ago

Yes, we should backport it.

When writing a large file, an old client does not report the file size in time; it may not report the new file size until the file size reaches the max file size.

#7 Updated by John Spray 4 months ago

  • Status changed from Need Review to Pending Backport
  • Backport set to jewel, kraken

#8 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #20148: jewel: Too many stat ops when MDS trying to probe a large file added

#9 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #20149: kraken: Too many stat ops when MDS trying to probe a large file added

#10 Updated by Sangdi Xu 3 months ago

Later I found two more related problems that may need further discussion:
  1. Some tools (like fio) set the size of the test file first, then gradually write contents to it. In this case, the MDS still has to probe a lot to finally find the file size (i.e. the last existing object of the file).
  2. In the current recovery implementation, the MDS is asked to determine the mtime of the file as well, so it has to probe all objects, which means the probe will still take a long time for a large file, regardless of whether Zheng's patch is applied.
    see: https://github.com/ceph/ceph/blob/master/src/osdc/Filer.cc#L259-L260
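The cost difference between the two cases can be sketched as follows (a hedged illustration with hypothetical helper names, not the Filer implementation): the size alone only needs the last existing object, so a backward scan can stop at the first hit, while recovering mtime must look at the stat result of every object, because the most recently written object can be anywhere in the file:

```python
# Sketch: why size-only probing can stop early but mtime probing cannot.

def probe_size_only(index_exists, max_index):
    """Backward scan; stop at the first existing object. Best case O(1) stats."""
    stats = 0
    for idx in range(max_index, -1, -1):
        stats += 1
        if index_exists(idx):
            return idx, stats
    return None, stats

def probe_size_and_mtime(object_stat, max_index):
    """mtime can live on any object, so every object must be stat'ed: O(n)."""
    last, mtime, stats = None, 0.0, 0
    for idx in range(max_index + 1):
        stats += 1
        st = object_stat(idx)  # per-object mtime, or None if absent
        if st is not None:
            last = idx
            mtime = max(mtime, st)
    return last, mtime, stats
```

Note that problem 1 above defeats even the early-stop optimization for size: a preallocated-but-sparse file forces the backward scan through many non-existent objects before it finds data.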
This issue can be reproduced by the following steps:
  1. Create a large file, e.g. `fio -filename=test.txt --size=10T --numjobs=4 --iodepth=16 --ioengine=libaio --ramp_time=100 --runtime=6000 --direct=1 --group_reporting --rw=write --bs=1024K -name=test`
  2. After a few seconds, stop it with Ctrl-C
  3. Restart the active MDS

It doesn't occur every time, but the probability is not low.

#11 Updated by Nathan Cutler 15 days ago

  • Status changed from Pending Backport to Resolved
