Support #16528


Stuck with CephFS with 1M files in one dir

Added by elder one almost 8 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Tags:
cephfs
Reviewed:
Affected Versions:
Component(FS):
Labels (FS):
Pull request ID:

Description

I'm pretty much stuck with CephFS (Jewel 10.2.2): 1 million zero-byte files in one directory, left behind by an unsuccessful bonnie++ run.
I can't list or delete the files or the directory itself, at least not in a reasonable time (waited for 4 hours).

Ceph kernel client with 4.4.14 kernel on Ubuntu 14.04.

2 metadata servers: 1 active, the other in standby-replay mode.

From the active MDS log when I try to access the directory:

2016-06-29 19:05:39.053345 7fcc198da700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 10
2016-06-29 19:05:57.472549 7fcc198da700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 33.047606 secs
2016-06-29 19:05:57.472563 7fcc198da700  0 log_channel(cluster) log [WRN] : slow request 33.047606 seconds old, received at 2016-06-29 19:05:24.424893: client_request(client.28058182:3 readdir #100000021c2 2016-06-29 19:05:24.420352) currently acquired locks
2016-06-29 19:06:25.457003 7fcc198da700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 61.032040 secs
2016-06-29 19:06:25.457022 7fcc198da700  0 log_channel(cluster) log [WRN] : slow request 61.032040 seconds old, received at 2016-06-29 19:05:24.424893: client_request(client.28058182:3 readdir #100000021c2 2016-06-29 19:05:24.420352) currently acquired locks
2016-06-29 19:08:02.719494 7fcc198da700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 35.652664 secs
2016-06-29 19:08:02.719520 7fcc198da700  0 log_channel(cluster) log [WRN] : slow request 35.652664 seconds old, received at 2016-06-29 19:07:27.066769: client_request(client.28058182:6 readdir #100000021c2 00000ef1d6BKMay 2016-06-29 19:07:27.054945) currently acquired locks
2016-06-29 19:08:30.702883 7fcc198da700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 63.636058 secs
2016-06-29 19:08:30.702899 7fcc198da700  0 log_channel(cluster) log [WRN] : slow request 63.636058 seconds old, received at 2016-06-29 19:07:27.066769: client_request(client.28058182:6 readdir #100000021c2 00000ef1d6BKMay 2016-06-29 19:07:27.054945) currently acquired locks
2016-06-29 19:10:43.184162 7fcc198da700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 35.598239 secs
2016-06-29 19:10:43.184177 7fcc198da700  0 log_channel(cluster) log [WRN] : slow request 35.598239 seconds old, received at 2016-06-29 19:10:07.585870: client_request(client.28058182:9 readdir #100000021c2 0000022f71S 2016-06-29 19:10:07.574338) currently acquired locks
2016-06-29 19:11:10.606474 7fcc198da700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 63.020538 secs
2016-06-29 19:11:10.606503 7fcc198da700  0 log_channel(cluster) log [WRN] : slow request 63.020538 seconds old, received at 2016-06-29 19:10:07.585870: client_request(client.28058182:9 readdir #100000021c2 0000022f71S 2016-06-29 19:10:07.574338) currently acquired locks
2016-06-29 19:13:43.266474 7fcc198da700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 35.753201 secs
2016-06-29 19:13:43.266489 7fcc198da700  0 log_channel(cluster) log [WRN] : slow request 35.753201 seconds old, received at 2016-06-29 19:13:07.513218: client_request(client.28058182:13 readdir #100000021c2 0000012468KDbTAmV 2016-06-29 19:13:07.502135) currently acquired locks
2016-06-29 19:14:11.108257 7fcc198da700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 63.594955 secs
2016-06-29 19:14:11.108278 7fcc198da700  0 log_channel(cluster) log [WRN] : slow request 63.594955 seconds old, received at 2016-06-29 19:13:07.513218: client_request(client.28058182:13 readdir #100000021c2 0000012468KDbTAmV 2016-06-29 19:13:07.502135) currently acquired locks

Only one client is mounted to the fs, as follows:
192.168.30.71,192.168.30.72,192.168.30.73:/ on /mnt/cephfs type ceph (name=admin,rsize=2097152,wsize=2097152,readdir_max_entries=10240,readdir_max_bytes=2097152,key=client.admin)
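
For reference, a mount invocation that would produce the options shown above might look roughly like this (the secretfile path is an assumption; the original mount used the client.admin key):

mount -t ceph 192.168.30.71,192.168.30.72,192.168.30.73:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret,rsize=2097152,wsize=2097152,readdir_max_entries=10240,readdir_max_bytes=2097152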

rados df

pool name                 KB      objects       clones     degraded      unfound           rd        rd KB           wr        wr KB
cephfs_data                0      1022482            0            0            0         9938     14623544      3655424     25191882
cephfs_metadata        35953           31            0            0            0        28177     67305532       147345      6187783

Ceph cluster status is OK


Files

ceph.conf (2.53 KB) - elder one, 06/29/2016 04:59 PM
Actions #1

Updated by elder one almost 8 years ago

Actions #2

Updated by Greg Farnum almost 8 years ago

  • Tracker changed from Bug to Support
  • Status changed from New to Closed

Assuming your MDS server has enough memory (it probably does), turn up the "mds cache size" to a number larger than 1 million and it should work.

This hang is a consequence, right now, of how directory listing works and of our having disabled directory fragmentation. We have various mitigations in progress, as well as work underway to get the proper solution in.
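
A sketch of how the suggested change could be applied, either persistently in ceph.conf or injected into the running daemon (the MDS daemon name "a" below is an assumption, not from the original report):

# In ceph.conf on the MDS host (takes effect on restart):
[mds]
    mds cache size = 3000000

# Or inject into the running MDS without a restart (daemon name "a" assumed):
ceph tell mds.a injectargs '--mds-cache-size 3000000'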

Actions #3

Updated by elder one almost 8 years ago

Thank you!

Raised "mds cache size" to 3M and it took couple of minutes to list this dir.

Actions #4

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
