Support #16528


Stuck with CephFS with 1M files in one dir

Added by elder one almost 8 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Tags:
cephfs
Reviewed:
Affected Versions:
Component(FS):
Labels (FS):
Pull request ID:

Description

I'm pretty much stuck with CephFS (Jewel 10.2.2): 1 million zero-byte files in one directory, left behind by an unsuccessful bonnie++ run.
I can't list or delete the files or the directory itself, at least not in a reasonable time (waited for 4 hours).

Ceph kernel client with 4.4.14 kernel on Ubuntu 14.04.

2 metadata servers: 1 active, the other in standby-replay mode.

From the active MDS log when I try to access the directory:

2016-06-29 19:05:39.053345 7fcc198da700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 10
2016-06-29 19:05:57.472549 7fcc198da700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 33.047606 secs
2016-06-29 19:05:57.472563 7fcc198da700  0 log_channel(cluster) log [WRN] : slow request 33.047606 seconds old, received at 2016-06-29 19:05:24.424893: client_request(client.28058182:3 readdir #100000021c2 2016-06-29 19:05:24.420352) currently acquired locks
2016-06-29 19:06:25.457003 7fcc198da700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 61.032040 secs
2016-06-29 19:06:25.457022 7fcc198da700  0 log_channel(cluster) log [WRN] : slow request 61.032040 seconds old, received at 2016-06-29 19:05:24.424893: client_request(client.28058182:3 readdir #100000021c2 2016-06-29 19:05:24.420352) currently acquired locks
2016-06-29 19:08:02.719494 7fcc198da700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 35.652664 secs
2016-06-29 19:08:02.719520 7fcc198da700  0 log_channel(cluster) log [WRN] : slow request 35.652664 seconds old, received at 2016-06-29 19:07:27.066769: client_request(client.28058182:6 readdir #100000021c2 00000ef1d6BKMay 2016-06-29 19:07:27.054945) currently acquired locks
2016-06-29 19:08:30.702883 7fcc198da700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 63.636058 secs
2016-06-29 19:08:30.702899 7fcc198da700  0 log_channel(cluster) log [WRN] : slow request 63.636058 seconds old, received at 2016-06-29 19:07:27.066769: client_request(client.28058182:6 readdir #100000021c2 00000ef1d6BKMay 2016-06-29 19:07:27.054945) currently acquired locks
2016-06-29 19:10:43.184162 7fcc198da700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 35.598239 secs
2016-06-29 19:10:43.184177 7fcc198da700  0 log_channel(cluster) log [WRN] : slow request 35.598239 seconds old, received at 2016-06-29 19:10:07.585870: client_request(client.28058182:9 readdir #100000021c2 0000022f71S 2016-06-29 19:10:07.574338) currently acquired locks
2016-06-29 19:11:10.606474 7fcc198da700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 63.020538 secs
2016-06-29 19:11:10.606503 7fcc198da700  0 log_channel(cluster) log [WRN] : slow request 63.020538 seconds old, received at 2016-06-29 19:10:07.585870: client_request(client.28058182:9 readdir #100000021c2 0000022f71S 2016-06-29 19:10:07.574338) currently acquired locks
2016-06-29 19:13:43.266474 7fcc198da700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 35.753201 secs
2016-06-29 19:13:43.266489 7fcc198da700  0 log_channel(cluster) log [WRN] : slow request 35.753201 seconds old, received at 2016-06-29 19:13:07.513218: client_request(client.28058182:13 readdir #100000021c2 0000012468KDbTAmV 2016-06-29 19:13:07.502135) currently acquired locks
2016-06-29 19:14:11.108257 7fcc198da700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 63.594955 secs
2016-06-29 19:14:11.108278 7fcc198da700  0 log_channel(cluster) log [WRN] : slow request 63.594955 seconds old, received at 2016-06-29 19:13:07.513218: client_request(client.28058182:13 readdir #100000021c2 0000012468KDbTAmV 2016-06-29 19:13:07.502135) currently acquired locks

Only one client is mounted to the fs, as follows:
192.168.30.71,192.168.30.72,192.168.30.73:/ on /mnt/cephfs type ceph (name=admin,rsize=2097152,wsize=2097152,readdir_max_entries=10240,readdir_max_bytes=2097152,key=client.admin)
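
For reference, a mount invocation that would produce the options shown above might look roughly like this (the secretfile path is an assumption; the original mount used the client.admin key):

mount -t ceph 192.168.30.71,192.168.30.72,192.168.30.73:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret,rsize=2097152,wsize=2097152,readdir_max_entries=10240,readdir_max_bytes=2097152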

rados df

pool name                 KB      objects       clones     degraded      unfound           rd        rd KB           wr        wr KB
cephfs_data                0      1022482            0            0            0         9938     14623544      3655424     25191882
cephfs_metadata        35953           31            0            0            0        28177     67305532       147345      6187783

Ceph cluster status is OK


Files

ceph.conf (2.53 KB) - elder one, 06/29/2016 04:59 PM
Actions #1

Updated by elder one almost 8 years ago

Actions #2

Updated by Greg Farnum almost 8 years ago

  • Tracker changed from Bug to Support
  • Status changed from New to Closed

Assuming your MDS server has enough memory (it probably does), turn up the "mds cache size" to a number larger than 1 million and it should work.

This hang is a consequence, right now, of how directory listing works and of our having disabled directory fragmentation. We have various mitigations in progress, as well as work underway to get the proper solution in.
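
A sketch of how the suggested change could be applied, either persistently in ceph.conf or injected into the running daemon (the MDS daemon name "a" below is an assumption, not from the original report):

# In ceph.conf on the MDS host (takes effect on restart):
[mds]
    mds cache size = 3000000

# Or inject into the running MDS without a restart (daemon name "a" assumed):
ceph tell mds.a injectargs '--mds-cache-size 3000000'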

Actions #3

Updated by elder one almost 8 years ago

Thank you!

Raised "mds cache size" to 3M and it took couple of minutes to list this dir.

Actions #4

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
