Support #25089

Many slow requests

Added by Robert Sander about 3 years ago. Updated over 2 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Component(FS): MDS
Labels (FS): multimds
Pull request ID:

Description

We see many slow requests on a multi MDS setup with 12.2.7.

With 12.2.4 this was not the case, at least not that much.

2018-07-25 10:14:52.417720 mon.ceph01 [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)
2018-07-25 10:14:51.245950 mds.ceph03 [WRN] 2 slow requests, 2 included below; oldest blocked for > 33.760319 secs
2018-07-25 10:14:51.245958 mds.ceph03 [WRN] slow request 33.760319 seconds old, received at 2018-07-25 10:14:17.485572: client_request(client.153879532:1509046 lookup #0x10000000000/home 2018-07-25 10:14:17.484696 caller_uid=1015, caller_gid=1015{}) currently failed to authpin local pins
2018-07-25 10:14:51.245965 mds.ceph03 [WRN] slow request 31.961043 seconds old, received at 2018-07-25 10:14:19.284848: client_request(client.153879532:1509047 lookup #0x10000000000/vol 2018-07-25 10:14:19.280739 caller_uid=0, caller_gid=0{}) currently failed to authpin local pins
2018-07-25 10:14:56.246069 mds.ceph03 [WRN] 7 slow requests, 5 included below; oldest blocked for > 38.760401 secs
2018-07-25 10:14:56.246076 mds.ceph03 [WRN] slow request 34.121549 seconds old, received at 2018-07-25 10:14:22.124423: client_request(client.154479915:1933797 lookup #0x10000000000/vol 2018-07-25 10:14:22.123282 caller_uid=0, caller_gid=0{}) currently failed to authpin local pins
2018-07-25 10:14:56.246081 mds.ceph03 [WRN] slow request 33.253892 seconds old, received at 2018-07-25 10:14:22.992080: client_request(client.154479915:1933798 lookup #0x10000000000/home 2018-07-25 10:14:22.991306 caller_uid=1111, caller_gid=3009{}) currently failed to authpin local pins
2018-07-25 10:14:56.246086 mds.ceph03 [WRN] slow request 32.919725 seconds old, received at 2018-07-25 10:14:23.326247: client_request(client.154886789:3800753 getattr pAsLsXsFs #0x100005c5eae 2018-07-25 10:14:23.321833 caller_uid=65534, caller_gid=65534{}) currently failed to authpin local pins
2018-07-25 10:14:56.246090 mds.ceph03 [WRN] slow request 31.955614 seconds old, received at 2018-07-25 10:14:24.290358: client_request(client.154479915:1933801 getattr pAsLsXsFs #0x10000000000 2018-07-25 10:14:24.287342 caller_uid=0, caller_gid=0{}) currently failed to rdlock, waiting
2018-07-25 10:14:56.246094 mds.ceph03 [WRN] slow request 31.319837 seconds old, received at 2018-07-25 10:14:24.926135: client_request(client.154886789:3800755 lookup #0x10000000000/vol 2018-07-25 10:14:24.921880 caller_uid=0, caller_gid=0{}) currently failed to authpin local pins
2018-07-25 10:15:09.714818 mon.ceph01 [WRN] Health check update: 2 MDSs report slow requests (MDS_SLOW_REQUEST)
2018-07-25 10:15:07.992519 mds.ceph02 [WRN] 1 slow requests, 1 included below; oldest blocked for > 34.475350 secs
2018-07-25 10:15:07.993364 mds.ceph02 [WRN] slow request 34.475350 seconds old, received at 2018-07-25 10:14:33.517074: client_request(client.153879532:1509060 getattr pAsLsXsFs #0x100013aa51f 2018-07-25 10:14:33.513075 caller_uid=1015, caller_gid=1015{}) currently initiated
2018-07-25 10:15:20.777839 mon.ceph01 [INF] MDS health message cleared (mds.2): 1 slow requests are blocked > 30 sec
2018-07-25 10:15:21.827595 mon.ceph01 [WRN] Health check update: 1 MDSs report slow requests (MDS_SLOW_REQUEST)
2018-07-25 10:15:24.318133 mon.ceph01 [INF] MDS health message cleared (mds.0): 7 slow requests are blocked > 30 sec
2018-07-25 10:15:24.742362 mon.ceph01 [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
2018-07-25 10:16:04.643491 mon.ceph01 [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)
2018-07-25 10:16:03.064511 mds.ceph01 [WRN] 3 slow requests, 3 included below; oldest blocked for > 31.964549 secs
2018-07-25 10:16:03.064518 mds.ceph01 [WRN] slow request 31.964549 seconds old, received at 2018-07-25 10:15:31.099864: client_request(client.154886789:3800777 lookup #0x10000000001/hm 2018-07-25 10:15:31.095792 caller_uid=0, caller_gid=0{}) currently failed to authpin local pins
2018-07-25 10:16:03.064522 mds.ceph01 [WRN] slow request 31.926208 seconds old, received at 2018-07-25 10:15:31.138205: client_request(client.154479915:1937002 lookup #0x10000000001/hm 2018-07-25 10:15:31.133185 caller_uid=0, caller_gid=0{}) currently failed to authpin local pins
2018-07-25 10:16:03.064525 mds.ceph01 [WRN] slow request 31.947578 seconds old, received at 2018-07-25 10:15:31.116836: client_request(client.153879532:1509342 lookup #0x10000000001/se 2018-07-25 10:15:31.110435 caller_uid=0, caller_gid=0{}) currently failed to authpin local pins
2018-07-25 10:16:31.714097 mon.ceph01 [INF] MDS health message cleared (mds.1): 3 slow requests are blocked > 30 sec
2018-07-25 10:16:31.990076 mon.ceph01 [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
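When requests stall like this ("currently failed to authpin local pins"), the blocked operations can be inspected directly on the affected MDS through its admin socket. A minimal sketch, assuming you run it on the host carrying the MDS daemon (names ceph01/ceph03 taken from the log above):

```shell
# List operations currently in flight on this MDS, including how long
# each has been blocked and which lock/authpin it is waiting on:
ceph daemon mds.ceph03 dump_ops_in_flight

# Recently completed slow operations are kept in a ring buffer:
ceph daemon mds.ceph03 dump_historic_ops

# Cluster-wide view of which MDSs are currently reporting slow requests:
ceph health detail
```

The `currently ...` field in each op's dump shows the same state strings seen in the warnings above, which helps tell lock contention apart from a genuinely stuck subtree.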

There are currently 3 active MDS daemons. I am trying to reduce that with "ceph fs set cephfs max_mds 1", but the rank 1 and rank 2 MDSs are still active. The documentation says they should stop themselves and transfer their cache to the remaining MDS.

History

#1 Updated by Zheng Yan about 3 years ago

  • Status changed from New to 4

you need to run "ceph mds deactivate xxx". see http://docs.ceph.com/docs/luminous/cephfs/multimds/
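On Luminous, lowering max_mds does not by itself stop the extra ranks; they must be deactivated explicitly, highest rank first. A sketch of the full sequence, assuming the filesystem name cephfs and ranks 0–2 from the report above:

```shell
# Lower the target number of active MDS daemons to 1:
ceph fs set cephfs max_mds 1

# Explicitly deactivate the surplus ranks, highest first. Each rank
# flushes its journal and hands its state to the remaining active MDS
# before dropping back to standby:
ceph mds deactivate 2
ceph mds deactivate 1

# Watch the ranks drain; a deactivating MDS shows as "stopping"
# until the hand-off completes:
ceph fs status
```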

#2 Updated by Robert Sander about 3 years ago

Zheng Yan wrote:

you need to run "ceph mds deactivate xxx". see http://docs.ceph.com/docs/luminous/cephfs/multimds/

Now I see it. I was reading the master documentation, which is now mimic…

I was able to reduce the number of active MDS daemons to 1, and I have not seen any slow requests for an hour now.

I also reduced the number of CephFS clients (this is a CTDB-clustered Samba server) and will now start the nodes again.

#3 Updated by Patrick Donnelly about 3 years ago

  • Tracker changed from Bug to Support
  • Status changed from 4 to Rejected

Please seek further help on ceph-users first, until it's more certain that a bug has been found.

#4 Updated by Robert Sander about 3 years ago

Why was this rejected?

I see slow requests when running more than one active MDS in Ceph Luminous, which is documented to support multiple active MDS daemons.
I did not see this behaviour in 12.2.4, only after recently upgrading to 12.2.7.

This looks like a bug to me.

#5 Updated by Patrick Donnelly about 3 years ago

Robert Sander wrote:

Why was this rejected?

I see slow requests when running more than one active MDS in Ceph Luminous, which is documented to support multiple active MDS daemons.
I did not see this behaviour in 12.2.4, only after recently upgrading to 12.2.7.

This looks like a bug to me.

The presence of health warnings (which may be related to other parts of Ceph that have changed) does not by itself indicate a bug. It would be helpful to get feedback from other community members (on ceph-users) to see whether this is a problem unique to your cluster configuration or whether there is some baseline to draw insight from.

#6 Updated by Jeremy Austin almost 3 years ago

I believe I'm seeing this bug. I was running multiple active MDS daemons and exporting a ceph-fuse mount via Samba. After upgrading to 12.2.7, I experienced a complete MDS lockup about once per day, typically when a client was traversing the CephFS tree (correlated with search-indexing servers).

Increasing the MDS cache helped somewhat, but didn't completely prevent periodic slowdowns, even when a full lockup did not occur.

I'm moving back to a single active MDS to see if that helps. (In this case, with a single client [the Samba exporter], there's likely no performance benefit to multiple MDS daemons anyway.)
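The cache increase mentioned above is governed in Luminous by mds_cache_memory_limit (which replaced the older mds_cache_size inode count). A hedged sketch of raising it at runtime and persisting it; the 4 GiB value is illustrative only:

```shell
# Apply a larger cache limit (here 4 GiB) to all MDS daemons at
# runtime, without a restart:
ceph tell mds.* injectargs '--mds_cache_memory_limit=4294967296'

# Persist the setting in ceph.conf on the MDS hosts:
# [mds]
#     mds_cache_memory_limit = 4294967296
```

Note the limit is a soft target: the MDS also needs client cooperation in releasing caps, so a larger cache can mask but not remove lock contention between active ranks.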

#7 Updated by Patrick Donnelly over 2 years ago

  • Category deleted (90)
