Bug #21551
Ceph FS not recovering space on Luminous
Status: Open
% Done: 0%
Description
I was running a test on a Ceph file system in which I was creating and deleting about 45,000 files in a loop, taking a snapshot every hour. When the file system got over 60% full, a cron job deleted snapshots until usage was back under 60%. This test ran for several days, until I noticed the file system was hung: it had completely filled one of the OSDs, and several other OSDs were close to full. I added 6 more OSDs to the cluster to get out of the full condition. Once I could access the file system again, I verified there were no snapshots left and removed all files from the Ceph file system, but I cannot get the space to recover. I rebooted all nodes and the space still does not recover. It has now been stuck in this state for several days.
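For reference, the hourly snapshot-and-prune job described above could be sketched roughly as follows. The /cephfs path and the 60% threshold are from this report; the script itself, including the snapshot naming and the usage_pct helper, is an illustrative assumption, not the actual cron job.

```shell
#!/bin/sh
# Sketch of the test's hourly cron job (assumed layout, not the real script).
FS=/cephfs
THRESHOLD=60

# Extract the Use% column for a mount point from POSIX `df -P` output.
usage_pct() {
    df -P "$1" | awk 'NR==2 {gsub(/%/, "", $5); print $5}'
}

# Take a timestamped snapshot, then prune the oldest snapshots
# while the file system is over the threshold.
snapshot_and_prune() {
    mkdir "$FS/.snap/snap-$(date +%Y%m%d%H%M)"
    while [ "$(usage_pct "$FS")" -gt "$THRESHOLD" ]; do
        oldest=$(ls "$FS/.snap" | sort | head -n 1)
        [ -n "$oldest" ] || break
        rmdir "$FS/.snap/$oldest"
    done
}
```

CephFS exposes snapshot creation and deletion as mkdir/rmdir inside the hidden .snap directory, which is why no special tooling is needed here.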
ls -la /cephfs/
total 4
drwxr-xr-x  1 root root    0 Sep 25 17:38 .
drwxr-xr-x 23 root root 4096 Sep  5 16:41 ..

du -a /cephfs/
0       /cephfs/

du -a /cephfs/.snap
0       /cephfs/.snap

ls -la /cephfs/.snap
total 0
drwxr-xr-x 1 root root 0 Dec 31  1969 .
drwxr-xr-x 1 root root 0 Sep 25 17:38 ..

df /cephfs/
Filesystem                                         1K-blocks       Used  Available Use% Mounted on
10.14.2.11:6789,10.14.2.12:6789,10.14.2.13:6789:/ 1481248768 1006370816  474877952  68% /cephfs

grep ceph /proc/mounts
10.14.2.11:6789,10.14.2.12:6789,10.14.2.13:6789:/ /cephfs ceph rw,noatime,name=cephfs,secret=<hidden>,rbytes,acl 0 0

ceph df detail
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED     OBJECTS
    1412G     452G      959G         67.94         725k
POOLS:
    NAME                ID     QUOTA OBJECTS     QUOTA BYTES     USED     %USED     MAX AVAIL     OBJECTS     DIRTY      READ       WRITE      RAW USED
    cephfs_data         1      N/A               N/A             285G     51.11     272G          642994      627k       23664k     35531k     855G
    cephfs_metadata     2      N/A               N/A             125M     0.05      272G          100401      100401     1974k      15320k     377M

ceph -s
  cluster:
    id:     85a91bbe-b287-11e4-889f-001517987704
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum ede-c1-mon01,ede-c1-mon02,ede-c1-mon03
    mgr: ede-c1-mon01(active), standbys: ede-c1-mon03, ede-c1-mon02
    mds: cephfs-1/1/1 up {0=ede-c1-mon01=up:active}, 1 up:standby-replay, 1 up:standby
    osd: 24 osds: 24 up, 24 in
  data:
    pools:   2 pools, 1280 pgs
    objects: 725k objects, 285 GB
    usage:   959 GB used, 452 GB / 1412 GB avail
    pgs:     1280 active+clean
  io:
    client:   852 B/s rd, 2 op/s rd, 0 op/s wr

ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]

ceph fs status
cephfs - 1 clients
======
+------+----------------+--------------+---------------+-------+-------+
| Rank |     State      |     MDS      |    Activity   |  dns  |  inos |
+------+----------------+--------------+---------------+-------+-------+
|  0   |     active     | ede-c1-mon01 | Reqs:    0 /s | 17.7k | 16.3k |
| 0-s  | standby-replay | ede-c1-mon02 | Evts:    0 /s |   0   |   0   |
+------+----------------+--------------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  132M |  293G |
|   cephfs_data   |   data   |  306G |  293G |
+-----------------+----------+-------+-------+
+--------------+
|  Standby MDS |
+--------------+
| ede-c1-mon03 |
+--------------+
MDS version: ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)

ceph -v
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)

OS: Ubuntu 16.04

uname -a
Linux ede-c1-adm01 4.13.0-041300-generic #201709031731 SMP Sun Sep 3 21:33:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
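As a sanity check against the `ceph df` numbers above (285G used in cephfs_data while the mounted file system is empty), the data pool can be queried directly with rados. This is a hedged diagnostic sketch, not something that was run here; the pool name is from this cluster and the helper name is illustrative.

```shell
# Count the objects left in a RADOS pool. If the file system were truly
# drained (files deleted and purge queue processed), the object count in
# cephfs_data should trend toward zero.
count_pool_objects() {
    rados -p "$1" ls | wc -l
}

# Example (assumes admin keyring access on this cluster):
# count_pool_objects cephfs_data    # `ceph df detail` above reports 642994 objects
```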
Updated by Patrick Donnelly over 6 years ago
Snapshots are not considered stable (especially with multiple active metadata servers). There are proposed fixes in the works:
https://github.com/ceph/ceph/pull/16779
If you have found a new bug, that's certainly useful. If you're willing: retry with those patches and if you still have a problem, please report back.
Updated by Zheng Yan over 6 years ago
Could you please run 'ceph daemon mds.ede-c1-mon01 dump cache /tmp/cachedump' and upload the cache dump? Also, please set debug_mds=10, restart the MDS, let it run for a few minutes, and upload the MDS log.
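The debug level can also be raised at runtime through the admin socket rather than by editing ceph.conf and restarting; a sketch, where the admin-socket command form is an assumption for this Luminous cluster and the helper name is illustrative:

```shell
# Raise MDS debug logging via the admin socket on the MDS host,
# without a daemon restart.
set_mds_debug() {
    ceph daemon "mds.$1" config set debug_mds "$2"
}

# set_mds_debug ede-c1-mon01 10
# ...reproduce for a few minutes, then collect
#    /var/log/ceph/ceph-mds.ede-c1-mon01.log
# set_mds_debug ede-c1-mon01 1/5    # restore roughly the default level
```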
Updated by Eric Eastman over 6 years ago
The command 'ceph daemon mds.ede-c1-mon01 dump cache /tmp/cachedump' did not give any output, so I ran
ceph daemon mds.ede-c1-mon01 dump cache > cachedump
which created an 83 MB file that I compressed with bzip2 and put on our FTP server.
I set debug_mds=10 in the ceph.conf file, restarted the MDS process, and captured about 7 minutes of the run, which created a 207 MB log file that I also compressed with bzip2.
The files are at:
ftp://ftp.keepertech.com/outgoing/eric/ceph_logs/cachedump.bz2
ftp://ftp.keepertech.com/outgoing/eric/ceph_logs/ceph-mds.ede-c1-mon01.log-debug10.bz2
On Patrick's comment: I am running a single active MDS, with the second one as a standby-replay, using the option:
mds_standby_replay = true
I am more than happy to retry with the patches if they will help on a single-MDS system. Please let me know whether I should apply these patches to 12.2.0, to master, or to something else.
Let me know if you need anything else off the current system.
Updated by Zheng Yan over 6 years ago
There are lots of "mds.0.purge_queue _consume: not readable right now" messages in the log. It looks like the purge queue stayed in a non-readable state.
Please set debug_mds=5 and debug_journaler=10, restart the MDS, let it run for a few minutes, and upload the MDS log.
Updated by Eric Eastman over 6 years ago
I uploaded the new MDS run, captured with
debug_mds=5
debug_journaler=10
to:
Updated by Zheng Yan over 6 years ago
2017-09-26 09:16:41.000627 7f58662b4700 10 mds.0.journaler.pq(rw) _prefetch
2017-09-26 09:16:41.012367 7f58662b4700 10 mds.0.journaler.pq(rw) _finish_read got 1850138846~3743522
2017-09-26 09:16:41.012375 7f58662b4700 10 mds.0.journaler.pq(rw) _assimilate_prefetch 1850138846~3743522
2017-09-26 09:16:41.012376 7f58662b4700 10 mds.0.journaler.pq(rw) _assimilate_prefetch gap of 4194304 from received_pos 1853882368 to first prefetched buffer 1858076672
2017-09-26 09:16:41.012378 7f58662b4700 10 mds.0.journaler.pq(rw) _assimilate_prefetch read_buf now 1850138846~3743522, read pointers 1850138846/1853882368/1895825408
2017-09-26 09:16:41.012416 7f58662b4700 -1 mds.0.journaler.pq(rw) _decode error from assimilate_prefetch
It looks like the purge queue journal is corrupted. When was the file system created? I know of a bug (introduced while developing Luminous) that can cause this corruption, but it was already fixed in ceph version 12.2.0.
Please upload objects 500.00000000 and 500.000001b9, and I will help you recover them.
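Extracting those objects from the metadata pool can be done with `rados get`; a sketch, where the object names are from the request above and the local file naming and helper are assumptions:

```shell
# Fetch a raw RADOS object from the metadata pool into a local file
# named after the object.
fetch_metadata_object() {
    rados -p cephfs_metadata get "$1" "$1.dat"
}

# fetch_metadata_object 500.00000000
# fetch_metadata_object 500.000001b9
# bzip2 500.00000000.dat 500.000001b9.dat    # then upload the .bz2 files
```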
Updated by Zheng Yan over 6 years ago
- Related to Bug #19593: purge queue and standby replay mds added
Updated by Eric Eastman over 6 years ago
This file system was created with Ceph v12.2.0. The cluster was cleanly installed with Ceph v12.2.0 and was never upgraded.
I extracted the two objects from the cephfs_metadata pool and uploaded them to:
ftp://ftp.keepertech.com/outgoing/eric/ceph_logs/500.00000000.dat.bz2
ftp://ftp.keepertech.com/outgoing/eric/ceph_logs/500.000001b9.dat.bz2
This is a test cluster. I can recreate the file system and data easily, so please do not waste time recovering it unless it helps you analyze the issue.
Updated by Zheng Yan over 6 years ago
OK, it is likely caused by http://tracker.ceph.com/issues/19593. Please do not enable standby-replay for now.
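On this cluster, following that advice would mean dropping the mds_standby_replay option Eric quoted earlier; a hedged ceph.conf sketch (the daemon name is from this report, the exact section layout is an assumption):

```ini
; ceph.conf on the MDS hosts: remove the option or set it to false
[mds]
mds_standby_replay = false

; then restart the standby daemon(s), e.g.:
;   systemctl restart ceph-mds@ede-c1-mon02
```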