Bug #17656

open

cephfs: high concurrency causing slow requests

Added by william sheng over 7 years ago. Updated about 7 years ago.

Status:
Need More Info
Priority:
Low
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
getattr pAsLsXsFs
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Background:

We use CephFS as a CDN backend. When the CDN vendor prefetches video files from CephFS, it generates highly concurrent purge ops, which trigger the Ceph slow-request warnings below.

Logs:
2016-10-21 09:12:58.505854 mds.0 192.168.194.118:6800/114274 14182 : cluster [WRN] 46 slow requests, 5 included below; oldest blocked for > 196.557580 secs
2016-10-21 09:12:58.505861 mds.0 192.168.194.118:6800/114274 14183 : cluster [WRN] slow request 121.856035 seconds old, received at 2016-10-21 09:10:56.649713: client_request(client.6137540:145761 getattr pAsLsXsFs #1000001afb6 2016-10-21 09:10:56.656620) currently failed to rdlock, waiting
2016-10-21 09:12:58.505865 mds.0 192.168.194.118:6800/114274 14184 : cluster [WRN] slow request 121.731314 seconds old, received at 2016-10-21 09:10:56.774434: client_request(client.6137540:145762 getattr pAsLsXsFs #1000001afb6 2016-10-21 09:10:56.781622) currently failed to rdlock, waiting
2016-10-21 09:12:58.505869 mds.0 192.168.194.118:6800/114274 14185 : cluster [WRN] slow request 121.145797 seconds old, received at 2016-10-21 09:10:57.359951: client_request(client.3759987:1135738 getattr pAsLsXsFs #1000001afb6 2016-10-21 09:10:57.364154) currently failed to rdlock, waiting
2016-10-21 09:12:58.505873 mds.0 192.168.194.118:6800/114274 14186 : cluster [WRN] slow request 33.414109 seconds old, received at 2016-10-21 09:12:25.091639: client_request(client.6137540:145782 getattr pAsLsXsFs #1000001afb6 2016-10-21 09:12:25.099527) currently failed to rdlock, waiting
2016-10-21 09:12:58.505877 mds.0 192.168.194.118:6800/114274 14187 : cluster [WRN] slow request 32.263839 seconds old, received at 2016-10-21 09:12:26.241909: client_request(client.3759987:1135763 getattr pAsLsXsFs #1000001afb6 2016-10-21 09:12:26.245974) currently failed to rdlock, waiting
2016-10-21 09:13:03.505931 mds.0 192.168.194.118:6800/114274 14188 : cluster [WRN] 46 slow requests, 3 included below; oldest blocked for > 201.557676 secs
2016-10-21 09:13:03.505936 mds.0 192.168.194.118:6800/114274 14189 : cluster [WRN] slow request 124.578078 seconds old, received at 2016-10-21 09:10:58.927767: client_request(client.6137540:145763 getattr pAsLsXsFs #1000001afb6 2016-10-21 09:10:58.934669) currently failed to rdlock, waiting
2016-10-21 09:13:03.505940 mds.0 192.168.194.118:6800/114274 14190 : cluster [WRN] slow request 130.552917 seconds old, received at 2016-10-21 09:10:52.952927: client_request(client.3759987:1135732 getattr pAsLsXsFs #1000001afb6 2016-10-21 09:10:52.956163) currently failed to rdlock, waiting
2016-10-21 09:13:03.505943 mds.0 192.168.194.118:6800/114274 14191 : cluster [WRN] slow request 129.566257 seconds old, received at 2016-10-21 09:10:53.939588: client_request(client.6137540:145755 getattr pAsLsXsFs #1000001afb6 2016-10-21 09:10:53.946561) currently failed to rdlock, waiting
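
For reference, a minimal diagnostic sketch (assuming "mds.0" stands in for the active MDS daemon name, the ceph daemon calls are run on that MDS host, and the MDS op tracker commands are available in this release):

ceph health detail                      # summarizes the slow requests shown above
ceph daemon mds.0 dump_ops_in_flight    # lists each blocked op and the lock it is waiting on
ceph daemon mds.0 session ls            # lists client sessions (e.g. client.6137540, client.3759987) and their caps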


Files

node1.rar (730 KB) node1.rar jichao sun, 03/02/2017 11:51 AM
node2.rar (798 KB) node2.rar jichao sun, 03/02/2017 11:51 AM
node3.rar (811 KB) node3.rar jichao sun, 03/02/2017 11:51 AM
node4.rar (820 KB) node4.rar jichao sun, 03/02/2017 11:51 AM
node5.rar (808 KB) node5.rar jichao sun, 03/02/2017 11:51 AM
Actions #1

Updated by Haomai Wang over 7 years ago

  • Description updated (diff)
Actions #2

Updated by Haomai Wang over 7 years ago

  • Subject changed from cluster [WRN] 22 slow requests, 2 included below; oldest blocked for > 96.555802 secs to cephfs: high concurrency causing slow requests
  • Target version deleted (v10.2.4)
Actions #3

Updated by Kefu Chai over 7 years ago

William, could you describe the issue in detail from Ceph's perspective?

Actions #4

Updated by Greg Farnum over 7 years ago

  • Status changed from New to Need More Info
  • Priority changed from High to Low

Just from the description it sounds like we're backing up while the MDS purges deleted files from RADOS. You can adjust the mds_max_purge_files, mds_max_purge_ops, and mds_max_purge_ops_per_pg config options on the MDS to allow faster purging. Increasing the "mds cache size" (default 100k) may also put this problem off a bit.
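
As a rough sketch of the tuning above, those options could go in an [mds] section of ceph.conf; the values below are illustrative placeholders rather than tested recommendations, and take effect after an MDS restart (or via injectargs):

[mds]
mds_max_purge_files = 128        # how many files the MDS purges in parallel
mds_max_purge_ops = 16384        # overall cap on in-flight purge operations
mds_max_purge_ops_per_pg = 1.0   # additional cap on purge ops, scaled by PG count
mds_cache_size = 200000          # inodes kept in the MDS cache (default 100000)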

Actions #5

Updated by william sheng over 7 years ago

Greg Farnum wrote:

Just from the description it sounds like we're backing up while the MDS purges deleted files from RADOS. You can adjust the mds_max_purge_files, mds_max_purge_ops, and mds_max_purge_ops_per_pg config options on the MDS to allow faster purging. Increasing the "mds cache size" (default 100k) may also put this problem off a bit.

OK, I will verify ...

Actions #6

Updated by william sheng over 7 years ago

william sheng wrote:

Greg Farnum wrote:

Just from the description it sounds like we're backing up while the MDS purges deleted files from RADOS. You can adjust the mds_max_purge_files, mds_max_purge_ops, and mds_max_purge_ops_per_pg config options on the MDS to allow faster purging. Increasing the "mds cache size" (default 100k) may also put this problem off a bit.

OK, I will verify ...

My Ceph configuration:

[global]
fsid = 75f7dde4-d350-4853-9f73-33b0036b2ed2
mon_initial_members = l10-10-118 l10-10-120 l10-10-122 l10-10-124 l10-10-126
mon_host = 10.10.10.118 10.10.10.120 10.10.10.122 10.10.10.124 10.10.10.126
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

public_network = 10.10.10.0/24
osd_pool_default_size = 3
osd_pool_default_min_size = 2
osd_pool_default_pgp_num = 1024
osd_pool_default_pg_num = 1024
osd_crush_chooseleaf_type = 1
mon_clock_drift_allowed = 2
mon_clock_drift_warn_backoff = 30
mon_osd_full_ratio = .85
mon_osd_nearfull_ratio = .85
osd_op_complaint_time = 120
mds_max_purge_files = 256
mds_max_purge_ops = 16384
max_mds = 20
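
As a side note, a minimal sketch of pushing the purge settings above to the running MDS without a restart (assuming "mds.0" names the active daemon and an admin keyring is available):

ceph tell mds.0 injectargs '--mds_max_purge_files 256 --mds_max_purge_ops 16384'
ceph daemon mds.0 config get mds_max_purge_files    # confirm the running value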

My Ceph log:

2016-12-02 02:47:48.560591 mon.0 10.10.10.118:6789/0 277601 : cluster [INF] HEALTH_WARN; mds0: Client l168-222-45 failing to respond to capability release
2016-12-02 02:47:48.862549 mon.0 10.10.10.118:6789/0 277602 : cluster [INF] pgmap v8189234: 7132 pgs: 7132 active+clean; 2808 GB data, 8981 GB used, 383 TB / 392 TB avail
2016-12-02 02:47:49.871613 mon.0 10.10.10.118:6789/0 277603 : cluster [INF] pgmap v8189235: 7132 pgs: 7132 active+clean; 2808 GB data, 8981 GB used, 383 TB / 392 TB avail; 1527 B/s wr, 0 op/s
2016-12-02 02:47:44.638579 mds.0 10.10.10.120:6800/8833 230 : cluster [WRN] 20 slow requests, 2 included below; oldest blocked for > 112.913897 secs
2016-12-02 02:47:44.638585 mds.0 10.10.10.120:6800/8833 231 : cluster [WRN] slow request 63.695560 seconds old, received at 2016-12-02 02:46:40.942933: client_request(client.7814035:508614 getattr pAsLsXsFs #1000002ce7d 2016-12-02 02:46:40.944241) currently failed to rdlock, waiting
2016-12-02 02:47:44.638604 mds.0 10.10.10.120:6800/8833 232 : cluster [WRN] slow request 33.619523 seconds old, received at 2016-12-02 02:47:11.018970: client_request(client.7814035:508619 getattr pAsLsXsFs #1000002ce7d 2016-12-02 02:47:11.019795) currently failed to rdlock, waiting
2016-12-02 02:47:49.638645 mds.0 10.10.10.120:6800/8833 233 : cluster [WRN] 20 slow requests, 1 included below; oldest blocked for > 117.913988 secs
2016-12-02 02:47:49.638652 mds.0 10.10.10.120:6800/8833 234 : cluster [WRN] slow request 60.762285 seconds old, received at 2016-12-02 02:46:48.876300: client_request(client.7509649:900606 getattr pAsLsXsFs #1000002ce7d 2016-12-02 02:46:48.876545) currently failed to rdlock, waiting
2016-12-02 02:47:51.875852 mon.0 10.10.10.118:6789/0 277604 : cluster [INF] pgmap v8189236: 7132 pgs: 7132 active+clean; 2808 GB data, 8981 GB used, 383 TB / 392 TB avail; 1019 B/s wr, 0 op/s
2016-12-02 02:47:52.884659 mon.0 10.10.10.118:6789/0 277605 : cluster [INF] pgmap v8189237: 7132 pgs: 7132 active+clean; 2808 GB data, 8981 GB used, 383 TB / 392 TB avail
2016-12-02 02:47:53.886552 mon.0 10.10.10.118:6789/0 277606 : cluster [INF] pgmap v8189238: 7132 pgs: 7132 active+clean; 2808 GB data, 8981 GB used, 383 TB / 392 TB avail
2016-12-02 02:47:54.888791 mon.0 10.10.10.118:6789/0 277607 : cluster [INF] pgmap v8189239: 7132 pgs: 7132 active+clean; 2808 GB data, 8981 GB used, 383 TB / 392 TB avail
2016-12-02 02:47:54.638795 mds.0 10.10.10.120:6800/8833 235 : cluster [WRN] 21 slow requests, 5 included below; oldest blocked for > 122.914086 secs
2016-12-02 02:47:54.638830 mds.0 10.10.10.120:6800/8833 236 : cluster [WRN] slow request 122.914086 seconds old, received at 2016-12-02 02:45:51.724596: client_request(client.7814035:508607 getattr pAsLsXsFs #1000002ce7d 2016-12-02 02:45:51.726334) currently failed to rdlock, waiting
2016-12-02 02:47:54.638835 mds.0 10.10.10.120:6800/8833 237 : cluster [WRN] slow request 122.906916 seconds old, received at 2016-12-02 02:45:51.731766: client_request(client.7509649:900595 getattr pAsLsXsFs #1000002ce7d 2016-12-02 02:45:51.731778) currently failed to rdlock, waiting
2016-12-02 02:47:54.638841 mds.0 10.10.10.120:6800/8833 238 : cluster [WRN] slow request 122.891636 seconds old, received at 2016-12-02 02:45:51.747046: client_request(client.7509649:900596 getattr pAsLsXsFs #1000002ce7d 2016-12-02 02:45:51.746778) currently failed to rdlock, waiting
2016-12-02 02:47:54.638848 mds.0 10.10.10.120:6800/8833 239 : cluster [WRN] slow request 122.882734 seconds old, received at 2016-12-02 02:45:51.755948: client_request(client.7814035:508608 getattr pAsLsXsFs #1000002ce7d 2016-12-02 02:45:51.758335) currently failed to rdlock, waiting

Actions #7

Updated by jichao sun about 7 years ago

I have the same problem too!!!

Actions #8

Updated by jichao sun about 7 years ago

jichao sun wrote:

I have the same problem too!!!
