Support #16528
Stuck with CephFS with 1M files in one dir
Status: Closed
Description
I'm pretty much stuck with CephFS (Jewel 10.2.2): 1 million zero-byte files in one directory, left behind by an unsuccessful bonnie++ run.
I can't list or delete the files or that directory, at least not in reasonable time (I waited for 4 hours).
Ceph kernel client with a 4.4.14 kernel on Ubuntu 14.04.
Two metadata servers: one active, the other in standby-replay mode.
From the active MDS log when I try to access the dir:
2016-06-29 19:05:39.053345 7fcc198da700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 10
2016-06-29 19:05:57.472549 7fcc198da700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 33.047606 secs
2016-06-29 19:05:57.472563 7fcc198da700 0 log_channel(cluster) log [WRN] : slow request 33.047606 seconds old, received at 2016-06-29 19:05:24.424893: client_request(client.28058182:3 readdir #100000021c2 2016-06-29 19:05:24.420352) currently acquired locks
2016-06-29 19:06:25.457003 7fcc198da700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 61.032040 secs
2016-06-29 19:06:25.457022 7fcc198da700 0 log_channel(cluster) log [WRN] : slow request 61.032040 seconds old, received at 2016-06-29 19:05:24.424893: client_request(client.28058182:3 readdir #100000021c2 2016-06-29 19:05:24.420352) currently acquired locks
2016-06-29 19:08:02.719494 7fcc198da700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 35.652664 secs
2016-06-29 19:08:02.719520 7fcc198da700 0 log_channel(cluster) log [WRN] : slow request 35.652664 seconds old, received at 2016-06-29 19:07:27.066769: client_request(client.28058182:6 readdir #100000021c2 00000ef1d6BKMay 2016-06-29 19:07:27.054945) currently acquired locks
2016-06-29 19:08:30.702883 7fcc198da700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 63.636058 secs
2016-06-29 19:08:30.702899 7fcc198da700 0 log_channel(cluster) log [WRN] : slow request 63.636058 seconds old, received at 2016-06-29 19:07:27.066769: client_request(client.28058182:6 readdir #100000021c2 00000ef1d6BKMay 2016-06-29 19:07:27.054945) currently acquired locks
2016-06-29 19:10:43.184162 7fcc198da700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 35.598239 secs
2016-06-29 19:10:43.184177 7fcc198da700 0 log_channel(cluster) log [WRN] : slow request 35.598239 seconds old, received at 2016-06-29 19:10:07.585870: client_request(client.28058182:9 readdir #100000021c2 0000022f71S 2016-06-29 19:10:07.574338) currently acquired locks
2016-06-29 19:11:10.606474 7fcc198da700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 63.020538 secs
2016-06-29 19:11:10.606503 7fcc198da700 0 log_channel(cluster) log [WRN] : slow request 63.020538 seconds old, received at 2016-06-29 19:10:07.585870: client_request(client.28058182:9 readdir #100000021c2 0000022f71S 2016-06-29 19:10:07.574338) currently acquired locks
2016-06-29 19:13:43.266474 7fcc198da700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 35.753201 secs
2016-06-29 19:13:43.266489 7fcc198da700 0 log_channel(cluster) log [WRN] : slow request 35.753201 seconds old, received at 2016-06-29 19:13:07.513218: client_request(client.28058182:13 readdir #100000021c2 0000012468KDbTAmV 2016-06-29 19:13:07.502135) currently acquired locks
2016-06-29 19:14:11.108257 7fcc198da700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 63.594955 secs
2016-06-29 19:14:11.108278 7fcc198da700 0 log_channel(cluster) log [WRN] : slow request 63.594955 seconds old, received at 2016-06-29 19:13:07.513218: client_request(client.28058182:13 readdir #100000021c2 0000012468KDbTAmV 2016-06-29 19:13:07.502135) currently acquired locks
Only one client is mounted to the fs, as follows:
192.168.30.71,192.168.30.72,192.168.30.73:/ on /mnt/cephfs type ceph (name=admin,rsize=2097152,wsize=2097152,readdir_max_entries=10240,readdir_max_bytes=2097152,key=client.admin)
rados df:
pool name        KB     objects  clones  degraded  unfound  rd     rd KB     wr       wr KB
cephfs_data      0      1022482  0       0         0        9938   14623544  3655424  25191882
cephfs_metadata  35953  31       0       0         0        28177  67305532  147345   6187783
Ceph cluster health status is OK.
Updated by Greg Farnum almost 8 years ago
- Tracker changed from Bug to Support
- Status changed from New to Closed
Assuming your MDS server has enough memory (it probably does), turn up the "mds cache size" to a number larger than 1 million and it should work.
This hang is currently a consequence of how directory listing works, combined with our having disabled directory fragmentation. We've got various mitigations in progress, as well as work to land the proper solution.
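For reference, a sketch of how the cache limit suggested above can be raised on a Jewel-era cluster. The MDS daemon name (`mds.a`) and the value 3000000 are illustrative assumptions, not taken from this ticket; `mds cache size` is counted in inodes, so it needs to exceed the number of entries in the directory.

```shell
# Raise the MDS inode cache limit at runtime (Jewel-era option name);
# "mds.a" is a placeholder for the active MDS daemon's name.
ceph tell mds.a injectargs '--mds-cache-size 3000000'

# To make the change persistent across restarts, set it in the [mds]
# section of ceph.conf on the MDS hosts:
#   [mds]
#   mds cache size = 3000000
```

Note that raising this limit increases MDS memory usage roughly in proportion to the number of cached inodes, so check the MDS host has headroom first.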
Updated by elder one almost 8 years ago
Thank you!
Raised "mds cache size" to 3M, and it took a couple of minutes to list this dir.
Updated by John Spray over 7 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.