Bug #51824
pacific scrub ~mds_dir causes stray related ceph_assert, abort and OOM
Description
We are testing the scalability of pacific snapshots, snap trimming, and num_stray accounting and ran into some crashes today.
We started this exercise with a cluster having num_stray = 22M. (To get to this point, we had created lots of files, snapshotted the dirs, and then removed the snapshots).
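For context, the stray count we refer to throughout is the `num_strays` counter in the `mds_cache` section of the MDS perf dump (on a live cluster: `ceph daemon mds.<name> perf dump mds_cache`, with the daemon name substituted). A minimal sketch of extracting it, run here against a sample JSON of the same shape rather than a live daemon:

```shell
# Sample of the mds_cache perf-dump shape (values are illustrative);
# on a real cluster pipe `ceph daemon mds.<name> perf dump mds_cache`
# through the same filter.
sample='{"mds_cache":{"num_strays":22000000,"num_strays_delayed":0}}'
# Extract the first num_strays value; the pattern does not match
# num_strays_delayed because of the trailing colon.
echo "$sample" | grep -o '"num_strays":[0-9]*' | head -1 | cut -d: -f2
```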
Basically, we're trying to understand why num_stray is not decreasing. We tried to list the stray omap entries `rados -p cephfs.meta listomapkeys 600.00000000`, but this is empty.
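Note that each MDS rank has ten stray directories (~mds0/stray0 through stray9), whose backing objects in the metadata pool are 600.00000000 through 609.00000000, so checking only 600.00000000 covers just one of them. A sketch of enumerating all ten (it prints the commands rather than running them, and assumes the pool name `cephfs.meta` from the report):

```shell
# Generate the object names for stray0..stray9 (inodes 0x600..0x609)
# and print the corresponding listomapkeys command for each.
for i in $(seq 0 9); do
  obj=$(printf '%x.00000000' $((0x600 + i)))
  echo "rados -p cephfs.meta listomapkeys $obj"
done
```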
So we decided to perform a read-only scrub of ~mdsdir to see what it would report about the stray dirstats.
1. # ceph tell mds.0 scrub start ~mdsdir recursive
This led to a ceph_assert(o->get_num_ref() == 0):
2021-07-23T09:46:24.969+0200 7fd57f258700  1 mds.cephdata20b-106a771b77 asok_command: scrub start {path=~mdsdir,prefix=scrub start,scrubops=[recursive]} (starting...)
2021-07-23T09:46:24.969+0200 7fd577248700  0 log_channel(cluster) log [INF] : scrub queued for path: ~mds0
2021-07-23T09:46:24.969+0200 7fd577248700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2021-07-23T09:46:24.969+0200 7fd577248700  0 log_channel(cluster) log [INF] : scrub summary: active paths [~mds0]
2021-07-23T09:46:26.420+0200 7fd577248700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/MDCache.cc: In function 'void MDCache::remove_inode(CInode*)' thread 7fd577248700 time 2021-07-23T09:46:26.414153+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/MDCache.cc: 336: FAILED ceph_assert(o->get_num_ref() == 0)
ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7fd586e2bb24]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x276d3e) [0x7fd586e2bd3e]
 3: (MDCache::remove_inode(CInode*)+0x598) [0x55e9b69bfcf8]
 4: (StrayManager::_purge_stray_logged(CDentry*, unsigned long, boost::intrusive_ptr<MutationImpl>&)+0x304) [0x55e9b6a7d354]
 5: (MDSContext::complete(int)+0x56) [0x55e9b6bd7b96]
 6: (MDSIOContextBase::complete(int)+0x5ac) [0x55e9b6bd83cc]
 7: (MDSLogContextBase::complete(int)+0x44) [0x55e9b6bd8644]
 8: (Finisher::finisher_thread_entry()+0x1a5) [0x7fd586ecab65]
 9: /lib64/libpthread.so.0(+0x814a) [0x7fd585bcb14a]
 10: clone()
2. The MDS reconnected; we noticed it was now decreasing num_strays and using a lot of CPU.
We started the scrub again on ~mdsdir and got an abort:
2021-07-23T09:52:22.804+0200 7fecdbda7700  1 mds.cephdata20b-106a771b77 asok_command: scrub start {path=~mdsdir,prefix=scrub start,scrubops=[recursive]} (starting...)
2021-07-23T09:52:22.804+0200 7fecd3d97700  0 log_channel(cluster) log [INF] : scrub queued for path: ~mds0
2021-07-23T09:52:22.804+0200 7fecd3d97700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2021-07-23T09:52:22.804+0200 7fecd3d97700  0 log_channel(cluster) log [INF] : scrub summary: active paths [~mds0]
2021-07-23T09:52:41.494+0200 7fecd4d99700 -1 mds.0.cache.strays Rogue reference after purge to [dentry #0x100/stray6/1002112a666 [62f,head] auth (dversion lock) v=38380708 ino=0x1002112a666 state=1073741860 | inodepin=1 purging=1 dirty=0 0x5621bc8d8780]
2021-07-23T09:52:41.495+0200 7fecd4d99700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/StrayManager.cc: In function 'void StrayManager::_purge_stray_purged(CDentry*, bool)' thread 7fecd4d99700 time 2021-07-23T09:52:41.495501+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/StrayManager.cc: 218: ceph_abort_msg("rogue reference to purging inode")
ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7fece397c264]
 2: (StrayManager::_purge_stray_purged(CDentry*, bool)+0x2db) [0x5621af1667eb]
 3: (MDSContext::complete(int)+0x56) [0x5621af2c2b96]
 4: (MDSIOContextBase::complete(int)+0x5ac) [0x5621af2c33cc]
 5: (Finisher::finisher_thread_entry()+0x1a5) [0x7fece3a1ab65]
 6: /lib64/libpthread.so.0(+0x814a) [0x7fece271b14a]
 7: clone()
3. After this restart, we started a third scrub on ~mdsdir, causing the MDS to miss heartbeats for around a minute. We also noticed that num_strays is now reported as 0, but we're not sure when this changed.
This time the MDS eventually went OOM:
kernel: Out of memory: Killed process 842857 (ceph-mds) total-vm:36420156kB, anon-rss:35940832>
As of now, the MDS is stable and we won't run another scrub.
The mds log is at ceph-post-file: be2e6b89-088c-4cbc-99ed-0a931a60f6a0
We have the coredump from the first assert if it can be useful:
# coredumpctl
TIME                         PID     UID GID SIG COREFILE EXE
Fri 2021-07-23 09:47:04 CEST 1939250 167 167 6   present  /usr/bin/ceph-mds
Related issues
History
#1 Updated by Dan van der Ster over 2 years ago
After a few minutes num_strays is now 19966371 and unchanging.
#2 Updated by Patrick Donnelly over 2 years ago
- Status changed from New to Triaged
- Assignee set to Milind Changire
#3 Updated by Milind Changire over 2 years ago
- Assignee changed from Milind Changire to Dan van der Ster
Dan,
1. How many active MDSs were there in the cluster?
2. Was there any dir pinning active?
3. Could you list any other specific config changes done during these tests?
#4 Updated by Dan van der Ster over 2 years ago
Milind Changire wrote:
Dan,
1. How many active MDSs were there in the cluster?
One.
2. Was there any dir pinning active?
No, we checked that all export_pins are -1.
3. Could you list any other specific config changes done during these tests?
mds advanced mds_beacon_grace 120.000000
mds basic mds_cache_memory_limit 4294967296
mds advanced mds_export_ephemeral_distributed true
mds advanced mds_forward_all_requests_to_auth true
mds dev mds_max_export_size 20971520
mds advanced mds_max_snaps_per_dir 400
mds advanced mds_recall_warning_threshold 500000
#5 Updated by Milind Changire over 2 years ago
- Assignee changed from Dan van der Ster to Milind Changire
#6 Updated by Main Name over 2 years ago
Same issue with roughly 1.6M folders.
- Generated a folder tree with 1611111 folders
- Made a snapshot
- Deleted the folder tree
- A high number of num_strays was reported, so we started scrubbing ~mds_dir
- High CPU load on the MDS killed the service and the standby MDS took over
- Scrubbing was done multiple times, with consequent MDS crashes, until at some point num_strays reached 0
One active MDS and one standby are deployed. No dir pinning active.
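The reproduction steps above can be sketched as a shell loop. This is a scaled-down illustration that builds the tree in a local temp dir so the shape is clear; the snapshot and delete steps only apply on a real CephFS mount, so they are shown as comments (directory names and counts here are illustrative, not from the original report):

```shell
# Build a small directory tree (3 top-level dirs x 3 subdirs = 12 dirs + root).
root=$(mktemp -d)
for a in $(seq 1 3); do
  for b in $(seq 1 3); do
    mkdir -p "$root/dir$a/sub$b"
  done
done
# On a real CephFS mount the next steps would be:
#   mkdir "$root/.snap/before-delete"   # take a snapshot pinning the tree
#   rm -rf "$root"/dir*                 # delete the tree; entries become strays
find "$root" -type d | wc -l
```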
#7 Updated by Main Name over 2 years ago
Main Name wrote:
Same issue with roughly 1.6M folders.
- Generated a folder tree with 1611111 folders
- Made a snapshot
- Deleted the folder tree
- A high number of num_strays was reported, so we started scrubbing ~mds_dir
- High CPU load on the MDS killed the service and the standby MDS took over
- Scrubbing was done multiple times, with consequent MDS crashes, until at some point num_strays reached 0
One active MDS and one standby are deployed. No dir pinning active.
Using Version 16.2.6
#8 Updated by Venky Shankar over 1 year ago
- Category set to Correctness/Safety
- Assignee changed from Milind Changire to Dhairya Parmar
- Target version set to v18.0.0
- Backport set to pacific,quincy
Dhairya is working on this as part of https://github.com/ceph/ceph/pull/47649
#9 Updated by Dhairya Parmar over 1 year ago
I can tell this issue exists even in Quincy. The Ceph environment I used runs `ceph 17.0.0-14883-g10f7d25d`; find below the log of the test run:
http://qa-proxy.ceph.com/teuthology/dparmar-2022-09-16_20:46:54-fs:functional-main-distro-default-smithi/7035975/teuthology.log
The test case aims to evaluate strays by running a recursive scrub on `~mdsdir`. It fails with the exact same stack trace as mentioned in the description, on `scrub status` after performing the recursive scrub. Surprisingly, for this test case `num_stray = 1847`, nowhere near the mentioned value (`num_stray = 22M`); the impact at giga/tera/peta scale should be anticipated. This also confirms the problem is not just Pacific-specific. Maybe it also exists in Octopus? I'll try running the test case on an Octopus cluster to find out.
PS: While working on this issue I also found https://www.spinics.net/lists/ceph-users/msg71128.html on the web; it suggests the problem can also lead to damaged `dir_frags`. I'm unsure about the severity, but thought it worth sharing.
#10 Updated by Venky Shankar over 1 year ago
Dhairya Parmar wrote:
I can tell this issue exists even in Quincy. The Ceph environment I used runs `ceph 17.0.0-14883-g10f7d25d`; find below the log of the test run:
http://qa-proxy.ceph.com/teuthology/dparmar-2022-09-16_20:46:54-fs:functional-main-distro-default-smithi/7035975/teuthology.log
The test case aims to evaluate strays by running a recursive scrub on `~mdsdir`. It fails with the exact same stack trace as mentioned in the description, on `scrub status` after performing the recursive scrub. Surprisingly, for this test case `num_stray = 1847`, nowhere near the mentioned value (`num_stray = 22M`); the impact at giga/tera/peta scale should be anticipated. This also confirms the problem is not just Pacific-specific. Maybe it also exists in Octopus? I'll try running the test case on an Octopus cluster to find out.
Don't bother reproducing it in octopus right now - it most likely exists. Let's get an RCA first.
PS: I also found this: https://www.spinics.net/lists/ceph-users/msg71128.html on web while working on this issue. I.e. It can also lead to damaged `dir_frags`. Unsure about the severity of this but thought of sharing it.
That seems to happen when fetching omap entries for a frag. It needs to be fixed as part of scrubbing ~mdsdir.
#11 Updated by Venky Shankar over 1 year ago
Dhairya,
Were you able to RCA this?
#12 Updated by Dhairya Parmar over 1 year ago
Venky Shankar wrote:
Dhairya,
Were you able to RCA this?
Okay, so the issue seems to be with the inode that is pinned for scrub: when it reaches ceph_assert(o->get_num_ref() == 0) in MDCache::remove_inode, things blow up. This happens when the inode is already being purged and is also pushed onto the scrub stack. This has been taken care of in PR https://github.com/ceph/ceph/pull/47649. Currently the test case deals with only a few thousand strays, but I will soon ramp it up to check whether there is still any breakdown. Will keep this updated.
#13 Updated by Dhairya Parmar about 1 year ago
- Pull request ID set to 47649
#14 Updated by Dhairya Parmar about 1 year ago
- Status changed from Triaged to Fix Under Review
#15 Updated by Venky Shankar almost 1 year ago
- Status changed from Fix Under Review to Resolved
- Target version changed from v18.0.0 to v19.0.0
- Backport changed from pacific,quincy to reef,quincy,pacific
- Component(FS) MDS added
- Labels (FS) crash added
#16 Updated by Dhairya Parmar 12 months ago
- Status changed from Resolved to Pending Backport
#17 Updated by Backport Bot 12 months ago
- Copied to Backport #59263: reef: pacific scrub ~mds_dir causes stray related ceph_assert, abort and OOM added
#18 Updated by Backport Bot 12 months ago
- Copied to Backport #59264: pacific: pacific scrub ~mds_dir causes stray related ceph_assert, abort and OOM added
#19 Updated by Backport Bot 12 months ago
- Copied to Backport #59265: quincy: pacific scrub ~mds_dir causes stray related ceph_assert, abort and OOM added
#20 Updated by Backport Bot 12 months ago
- Tags set to backport_processed