Bug #51824

pacific scrub ~mds_dir causes stray related ceph_assert, abort and OOM

Added by Dan van der Ster over 2 years ago. Updated 12 months ago.

Status:
Pending Backport
Priority:
Normal
Category:
Correctness/Safety
Target version:
% Done:
0%

Source:
Tags:
backport_processed
Backport:
reef,quincy,pacific
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are testing the scalability of pacific snapshots, snap trimming, and num_stray accounting and ran into some crashes today.

We started this exercise with a cluster having num_stray = 22M. (To get to this point, we had created lots of files, snapshotted the dirs, and then removed the snapshots).

Basically, we're trying to understand why num_stray is not decreasing. We tried to list the stray omap entries `rados -p cephfs.meta listomapkeys 600.00000000`, but this is empty.
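For reference, rank 0 actually has ten stray directories (inodes 0x600 through 0x609), so checking only 600.00000000 can miss entries. A minimal sketch of counting the omap keys in all of them, assuming the metadata pool is named cephfs.meta as above (and ignoring any extra objects if the stray dirs are fragmented):

  # count dentries in each of rank 0's stray dirfrag objects (600..609)
  for i in 0 1 2 3 4 5 6 7 8 9; do
      echo -n "stray$i: "
      rados -p cephfs.meta listomapkeys 60$i.00000000 | wc -l
  done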

So we decided to perform a read-only scrub of ~mdsdir to see what it would report about the stray dirstats.

1. # ceph tell mds.0 scrub start ~mdsdir recursive

This led to a ceph_assert(o->get_num_ref() == 0):

2021-07-23T09:46:24.969+0200 7fd57f258700  1 mds.cephdata20b-106a771b77 asok_command: scrub start {path=~mdsdir,prefix=scrub start,scrubops=[recursive]} (starting...)
2021-07-23T09:46:24.969+0200 7fd577248700  0 log_channel(cluster) log [INF] : scrub queued for path: ~mds0
2021-07-23T09:46:24.969+0200 7fd577248700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2021-07-23T09:46:24.969+0200 7fd577248700  0 log_channel(cluster) log [INF] : scrub summary: active paths [~mds0]

2021-07-23T09:46:26.420+0200 7fd577248700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/MDCache.cc: In function 'void MDCache::remove_inode(CInode*)' thread 7fd577248700 time 2021-07-23T09:46:26.414153+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/MDCache.cc: 336: FAILED ceph_assert(o->get_num_ref() == 0)

 ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7fd586e2bb24]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x276d3e) [0x7fd586e2bd3e]
 3: (MDCache::remove_inode(CInode*)+0x598) [0x55e9b69bfcf8]
 4: (StrayManager::_purge_stray_logged(CDentry*, unsigned long, boost::intrusive_ptr<MutationImpl>&)+0x304) [0x55e9b6a7d354]
 5: (MDSContext::complete(int)+0x56) [0x55e9b6bd7b96]
 6: (MDSIOContextBase::complete(int)+0x5ac) [0x55e9b6bd83cc]
 7: (MDSLogContextBase::complete(int)+0x44) [0x55e9b6bd8644]
 8: (Finisher::finisher_thread_entry()+0x1a5) [0x7fd586ecab65]
 9: /lib64/libpthread.so.0(+0x814a) [0x7fd585bcb14a]
 10: clone()

2. The MDS reconnected; we noticed it was now decreasing num_strays and using a lot of CPU.
We started the scrub again on ~mdsdir and got an abort:

2021-07-23T09:52:22.804+0200 7fecdbda7700  1 mds.cephdata20b-106a771b77 asok_command: scrub start {path=~mdsdir,prefix=scrub start,scrubops=[recursive]} (starting...)
2021-07-23T09:52:22.804+0200 7fecd3d97700  0 log_channel(cluster) log [INF] : scrub queued for path: ~mds0
2021-07-23T09:52:22.804+0200 7fecd3d97700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2021-07-23T09:52:22.804+0200 7fecd3d97700  0 log_channel(cluster) log [INF] : scrub summary: active paths [~mds0]

2021-07-23T09:52:41.494+0200 7fecd4d99700 -1 mds.0.cache.strays Rogue reference after purge to [dentry #0x100/stray6/1002112a666 [62f,head] auth (dversion lock) v=38380708 ino=0x1002112a666 state=1073741860 | inodepin=1 purging=1 dirty=0 0x5621bc8d8780]
2021-07-23T09:52:41.495+0200 7fecd4d99700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/StrayManager.cc: In function 'void StrayManager::_purge_stray_purged(CDentry*, bool)' thread 7fecd4d99700 time 2021-07-23T09:52:41.495501+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/StrayManager.cc: 218: ceph_abort_msg("rogue reference to purging inode")

 ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7fece397c264]
 2: (StrayManager::_purge_stray_purged(CDentry*, bool)+0x2db) [0x5621af1667eb]
 3: (MDSContext::complete(int)+0x56) [0x5621af2c2b96]
 4: (MDSIOContextBase::complete(int)+0x5ac) [0x5621af2c33cc]
 5: (Finisher::finisher_thread_entry()+0x1a5) [0x7fece3a1ab65]
 6: /lib64/libpthread.so.0(+0x814a) [0x7fece271b14a]
 7: clone()

3. After this restart, we started a third scrub on ~mdsdir, which caused the MDS to miss heartbeats for around a minute. We also noticed that num_strays is now reported as 0, but we're not sure when this changed.
This time the MDS eventually went OOM:

kernel: Out of memory: Killed process 842857 (ceph-mds) total-vm:36420156kB, anon-rss:35940832>

As of now, the MDS is stable and we won't run another scrub.

The mds log is at ceph-post-file: be2e6b89-088c-4cbc-99ed-0a931a60f6a0
We have the coredump from the first assert in case it is useful:

# coredumpctl 
TIME                            PID   UID   GID SIG COREFILE  EXE
Fri 2021-07-23 09:47:04 CEST 1939250   167   167   6 present   /usr/bin/ceph-mds
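A sketch of how that core could be inspected with coredumpctl (the PID is taken from the listing above; ceph debuginfo packages would be needed for a readable backtrace):

  # show metadata for the recorded crash
  coredumpctl info 1939250
  # open the core in gdb and print the stack ("bt" at the gdb prompt)
  coredumpctl gdb 1939250
  # or write the core to a file for offline analysis
  coredumpctl dump 1939250 -o core.ceph-mds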


Related issues

Copied to CephFS - Backport #59263: reef: pacific scrub ~mds_dir causes stray related ceph_assert, abort and OOM In Progress
Copied to CephFS - Backport #59264: pacific: pacific scrub ~mds_dir causes stray related ceph_assert, abort and OOM Resolved
Copied to CephFS - Backport #59265: quincy: pacific scrub ~mds_dir causes stray related ceph_assert, abort and OOM In Progress

History

#1 Updated by Dan van der Ster over 2 years ago

After a few minutes num_strays is now 19966371 and unchanging.
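For anyone following along, a sketch of how num_strays can be watched via the perf counters (mds.0 stands in for the active MDS; `ceph daemon` on the MDS host works as well, and jq is assumed to be available):

  # poll the stray counter every 10 seconds
  while true; do
      ceph tell mds.0 perf dump | jq '.mds_cache.num_strays'
      sleep 10
  done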

#2 Updated by Patrick Donnelly over 2 years ago

  • Status changed from New to Triaged
  • Assignee set to Milind Changire

#3 Updated by Milind Changire over 2 years ago

  • Assignee changed from Milind Changire to Dan van der Ster

Dan,
1. How many active MDS were there in the cluster?
2. Was there any dir pinning active?
3. Could you list any other specific config changes done during these tests?

#4 Updated by Dan van der Ster over 2 years ago

Milind Changire wrote:

Dan,
1. How many active MDS were there in the cluster?

One

2. Was there any dir pinning active?

No, we checked that all the export_pins are -1.

3. Could you list any other specific config changes done during these tests?

  mds           advanced  mds_beacon_grace                       120.000000             
  mds           basic     mds_cache_memory_limit                 4294967296             
  mds           advanced  mds_export_ephemeral_distributed       true                   
  mds           advanced  mds_forward_all_requests_to_auth       true                   
  mds           dev       mds_max_export_size                    20971520               
  mds           advanced  mds_max_snaps_per_dir                  400                    
  mds           advanced  mds_recall_warning_threshold           500000                 
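For reference, a sketch of how the options above could be applied with `ceph config set` (the exact commands used in the test are an assumption; the values are taken from the dump above):

  ceph config set mds mds_beacon_grace 120
  ceph config set mds mds_cache_memory_limit 4294967296
  ceph config set mds mds_export_ephemeral_distributed true
  ceph config set mds mds_forward_all_requests_to_auth true
  ceph config set mds mds_max_export_size 20971520
  ceph config set mds mds_max_snaps_per_dir 400
  ceph config set mds mds_recall_warning_threshold 500000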

#5 Updated by Milind Changire over 2 years ago

  • Assignee changed from Dan van der Ster to Milind Changire

#6 Updated by Main Name over 2 years ago

Same issue with roughly 1.6M folders.

  • Generated a Folder tree with 1611111 Folders
  • Make snapshot
  • Delete Folder tree
  • High number of nstrays reported > start scrubbing ~mds_dir
    • high CPU load on MDS which kills the service and the standby MDS takes over
    • Scrubbing was done multiple times with consequent MDS crashes until at some point 0 nstrays was reached

One Active MDS and one Standby are deployed. No Dir PINNING active.
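A minimal reproduction sketch based on the steps above, assuming a CephFS mount at /mnt/cephfs (the path, tree size, and mds.0 are placeholders):

  cd /mnt/cephfs
  mkdir tree && for i in $(seq 1 100000); do mkdir "tree/d$i"; done
  # snapshot the parent so the unlinked dirs are retained as strays
  mkdir .snap/snap1
  rm -rf tree
  # num_strays should now be high; scrub the MDS-private directory
  ceph tell mds.0 scrub start ~mdsdir recursive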

#7 Updated by Main Name over 2 years ago

Main Name wrote:

Same issue with roughly 1.6M folders.

  • Generated a Folder tree with 1611111 Folders
  • Make snapshot
  • Delete Folder tree
  • High number of nstrays reported > start scrubbing ~mds_dir
    • high CPU load on MDS which kills the service and the standby MDS takes over
    • Scrubbing was done multiple times with consequent MDS crashes until at some point 0 nstrays was reached

One Active MDS and one Standby are deployed. No Dir PINNING active.

Using Version 16.2.6

#8 Updated by Venky Shankar over 1 year ago

  • Category set to Correctness/Safety
  • Assignee changed from Milind Changire to Dhairya Parmar
  • Target version set to v18.0.0
  • Backport set to pacific,quincy

Dhairya is working on this as part of https://github.com/ceph/ceph/pull/47649

#9 Updated by Dhairya Parmar over 1 year ago

I can tell this issue exists even in Quincy. The Ceph environment I used runs `ceph 17.0.0-14883-g10f7d25d`; find below the log of the test run:
http://qa-proxy.ceph.com/teuthology/dparmar-2022-09-16_20:46:54-fs:functional-main-distro-default-smithi/7035975/teuthology.log

The test case evaluates strays by running a recursive scrub on `~mdsdir`. It fails with the exact same stack trace as in the description, and the failure occurs on `scrub status` after performing the recursive scrub. Surprisingly, this test case only has `num_stray = 1847`, nowhere near the value mentioned above (`num_stray = 22M`), so the impact at giga/tera/peta scale should be anticipated. This also shows the problem is not Pacific-specific. Maybe it also exists in Octopus? I'll try running the test case on an Octopus cluster to find out.
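Roughly the command sequence the test drives (mds.0 is just an example rank/name):

  # queue a recursive scrub of the MDS-private directory
  ceph tell mds.0 scrub start ~mdsdir recursive
  # polling the scrub afterwards is where the crash was observed
  ceph tell mds.0 scrub status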

PS: While working on this issue I also found this on the web: https://www.spinics.net/lists/ceph-users/msg71128.html. That is, it can also lead to damaged `dir_frags`. I'm unsure about the severity of this, but thought it was worth sharing.

#10 Updated by Venky Shankar over 1 year ago

Dhairya Parmar wrote:

I can tell this issue exists even in Quincy. The Ceph environment I used runs `ceph 17.0.0-14883-g10f7d25d`; find below the log of the test run:
http://qa-proxy.ceph.com/teuthology/dparmar-2022-09-16_20:46:54-fs:functional-main-distro-default-smithi/7035975/teuthology.log

The test case evaluates strays by running a recursive scrub on `~mdsdir`. It fails with the exact same stack trace as in the description, and the failure occurs on `scrub status` after performing the recursive scrub. Surprisingly, this test case only has `num_stray = 1847`, nowhere near the value mentioned above (`num_stray = 22M`), so the impact at giga/tera/peta scale should be anticipated. This also shows the problem is not Pacific-specific. Maybe it also exists in Octopus? I'll try running the test case on an Octopus cluster to find out.

Don't bother reproducing it in octopus right now - it most likely exists. Let's get an RCA first.

PS: While working on this issue I also found this on the web: https://www.spinics.net/lists/ceph-users/msg71128.html. That is, it can also lead to damaged `dir_frags`. I'm unsure about the severity of this, but thought it was worth sharing.

That seems to come up when fetching omap entries for a frag. This needs to be fixed as part of scrubbing ~mdsdir.

#11 Updated by Venky Shankar over 1 year ago

Dhairya,

Were you able to RCA this?

#12 Updated by Dhairya Parmar over 1 year ago

Venky Shankar wrote:

Dhairya,

Were you able to RCA this?

Okay, so the issue is with an inode that is pinned for scrub: when it reaches ceph_assert(o->get_num_ref() == 0) in MDCache::remove_inode, the MDS crashes. This happens when the inode is already being purged and is also pushed onto the scrub stack. This has been taken care of in PR https://github.com/ceph/ceph/pull/47649. Currently the test case deals with only a few thousand strays, but I will soon ramp it up to check whether anything still breaks. Will keep this thread updated.

#13 Updated by Dhairya Parmar about 1 year ago

  • Pull request ID set to 47649

#14 Updated by Dhairya Parmar about 1 year ago

  • Status changed from Triaged to Fix Under Review

#15 Updated by Venky Shankar almost 1 year ago

  • Status changed from Fix Under Review to Resolved
  • Target version changed from v18.0.0 to v19.0.0
  • Backport changed from pacific,quincy to reef,quincy,pacific
  • Component(FS) MDS added
  • Labels (FS) crash added

#16 Updated by Dhairya Parmar 12 months ago

  • Status changed from Resolved to Pending Backport

#17 Updated by Backport Bot 12 months ago

  • Copied to Backport #59263: reef: pacific scrub ~mds_dir causes stray related ceph_assert, abort and OOM added

#18 Updated by Backport Bot 12 months ago

  • Copied to Backport #59264: pacific: pacific scrub ~mds_dir causes stray related ceph_assert, abort and OOM added

#19 Updated by Backport Bot 12 months ago

  • Copied to Backport #59265: quincy: pacific scrub ~mds_dir causes stray related ceph_assert, abort and OOM added

#20 Updated by Backport Bot 12 months ago

  • Tags set to backport_processed
