
Bug #51824

pacific scrub ~mds_dir causes stray related ceph_assert, abort and OOM

Added by Dan van der Ster 6 months ago. Updated 3 months ago.

Status:
Triaged
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are testing the scalability of pacific snapshots, snap trimming, and num_stray accounting and ran into some crashes today.

We started this exercise with a cluster having num_stray = 22M. (To get to this point, we had created lots of files, snapshotted the dirs, and then removed the snapshots).

Basically, we're trying to understand why num_stray is not decreasing. We tried to list the stray omap entries `rados -p cephfs.meta listomapkeys 600.00000000`, but this is empty.
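Note that `600.00000000` is only the backing object for stray0; the MDS spreads strays across ten stray directories (inodes 0x600 through 0x609), so the other nine objects are worth checking too. A sketch, assuming the metadata pool is named `cephfs.meta` as above:

```shell
# List omap keys for all ten stray directories (stray0..stray9).
# Their backing objects are named 600.00000000 .. 609.00000000
# (the stray inode numbers 0x600..0x609 rendered in hex).
for i in $(seq 0 9); do
  obj=$(printf '%x.00000000' $((0x600 + i)))
  echo "== $obj =="
  rados -p cephfs.meta listomapkeys "$obj"
done
```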

So we decided to perform a read-only scrub of ~mdsdir to see what it would report about the stray dirstats.

1. # ceph tell mds.0 scrub start ~mdsdir recursive

This led to a ceph_assert(o->get_num_ref() == 0):

2021-07-23T09:46:24.969+0200 7fd57f258700  1 mds.cephdata20b-106a771b77 asok_command: scrub start {path=~mdsdir,prefix=scrub start,scrubops=[recursive]} (starting...)
2021-07-23T09:46:24.969+0200 7fd577248700  0 log_channel(cluster) log [INF] : scrub queued for path: ~mds0
2021-07-23T09:46:24.969+0200 7fd577248700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2021-07-23T09:46:24.969+0200 7fd577248700  0 log_channel(cluster) log [INF] : scrub summary: active paths [~mds0]

2021-07-23T09:46:26.420+0200 7fd577248700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/MDCache.cc: In function 'void MDCache::remove_inode(CInode*)' thread 7fd577248700 time 2021-07-23T09:46:26.414153+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/MDCache.cc: 336: FAILED ceph_assert(o->get_num_ref() == 0)

 ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7fd586e2bb24]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x276d3e) [0x7fd586e2bd3e]
 3: (MDCache::remove_inode(CInode*)+0x598) [0x55e9b69bfcf8]
 4: (StrayManager::_purge_stray_logged(CDentry*, unsigned long, boost::intrusive_ptr<MutationImpl>&)+0x304) [0x55e9b6a7d354]
 5: (MDSContext::complete(int)+0x56) [0x55e9b6bd7b96]
 6: (MDSIOContextBase::complete(int)+0x5ac) [0x55e9b6bd83cc]
 7: (MDSLogContextBase::complete(int)+0x44) [0x55e9b6bd8644]
 8: (Finisher::finisher_thread_entry()+0x1a5) [0x7fd586ecab65]
 9: /lib64/libpthread.so.0(+0x814a) [0x7fd585bcb14a]
 10: clone()

2. The MDS restarted and reconnected; we noticed it was now decreasing num_strays and using a lot of CPU.
We started the scrub again on ~mdsdir and got an abort:

2021-07-23T09:52:22.804+0200 7fecdbda7700  1 mds.cephdata20b-106a771b77 asok_command: scrub start {path=~mdsdir,prefix=scrub start,scrubops=[recursive]} (starting...)
2021-07-23T09:52:22.804+0200 7fecd3d97700  0 log_channel(cluster) log [INF] : scrub queued for path: ~mds0
2021-07-23T09:52:22.804+0200 7fecd3d97700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2021-07-23T09:52:22.804+0200 7fecd3d97700  0 log_channel(cluster) log [INF] : scrub summary: active paths [~mds0]

2021-07-23T09:52:41.494+0200 7fecd4d99700 -1 mds.0.cache.strays Rogue reference after purge to [dentry #0x100/stray6/1002112a666 [62f,head] auth (dversion lock) v=38380708 ino=0x1002112a666 state=1073741860 | inodepin=1 purging=1 dirty=0 0x5621bc8d8780]
2021-07-23T09:52:41.495+0200 7fecd4d99700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/StrayManager.cc: In function 'void StrayManager::_purge_stray_purged(CDentry*, bool)' thread 7fecd4d99700 time 2021-07-23T09:52:41.495501+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/StrayManager.cc: 218: ceph_abort_msg("rogue reference to purging inode")

 ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7fece397c264]
 2: (StrayManager::_purge_stray_purged(CDentry*, bool)+0x2db) [0x5621af1667eb]
 3: (MDSContext::complete(int)+0x56) [0x5621af2c2b96]
 4: (MDSIOContextBase::complete(int)+0x5ac) [0x5621af2c33cc]
 5: (Finisher::finisher_thread_entry()+0x1a5) [0x7fece3a1ab65]
 6: /lib64/libpthread.so.0(+0x814a) [0x7fece271b14a]
 7: clone()

3. After this restart, we started a third scrub on ~mdsdir, causing the MDS to miss heartbeats for around a minute. We also noticed that num_strays is now reported as 0, but we're not sure when this changed.
This time the MDS eventually went OOM:

kernel: Out of memory: Killed process 842857 (ceph-mds) total-vm:36420156kB, anon-rss:35940832>

As of now, the MDS is stable and we won't run another scrub.
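For monitoring between scrub attempts, the scrub state and the stray count can be read without triggering another recursive scrub. A sketch (MDS rank 0 assumed; counter path as exposed by the pacific perf counters):

```shell
# Current scrub state for rank 0:
ceph tell mds.0 scrub status

# Read num_strays straight from the MDS perf counters, no scrub needed:
ceph tell mds.0 perf dump \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["mds_cache"]["num_strays"])'
```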

The mds log is at ceph-post-file: be2e6b89-088c-4cbc-99ed-0a931a60f6a0
We have the coredump from the first assert in case it is useful:

# coredumpctl 
TIME                            PID   UID   GID SIG COREFILE  EXE
Fri 2021-07-23 09:47:04 CEST 1939250   167   167   6 present   /usr/bin/ceph-mds

History

#1 Updated by Dan van der Ster 6 months ago

After a few minutes num_strays is now 19966371 and unchanging.

#2 Updated by Patrick Donnelly 6 months ago

  • Status changed from New to Triaged
  • Assignee set to Milind Changire

#3 Updated by Milind Changire 6 months ago

  • Assignee changed from Milind Changire to Dan van der Ster

Dan,
1. how many active MDS daemons were there in the cluster?
2. was there any dir pinning active?
3. could you list any other specific config changes made during these tests?

#4 Updated by Dan van der Ster 6 months ago

Milind Changire wrote:

Dan,
1. how many active MDS daemons were there in the cluster?

One.

2. was there any dir pinning active?

No, we checked that all export_pins are -1.

3. could you list any other specific config changes made during these tests?

  mds           advanced  mds_beacon_grace                       120.000000             
  mds           basic     mds_cache_memory_limit                 4294967296             
  mds           advanced  mds_export_ephemeral_distributed       true                   
  mds           advanced  mds_forward_all_requests_to_auth       true                   
  mds           dev       mds_max_export_size                    20971520               
  mds           advanced  mds_max_snaps_per_dir                  400                    
  mds           advanced  mds_recall_warning_threshold           500000                 

#5 Updated by Milind Changire 6 months ago

  • Assignee changed from Dan van der Ster to Milind Changire

#6 Updated by Main Name 3 months ago

Same issue with roughly 1.6M folders.

  • Generated a folder tree with 1611111 folders
  • Made a snapshot
  • Deleted the folder tree
  • A high number of num_strays was reported, so we started scrubbing ~mds_dir
    • High CPU load on the MDS, which kills the service; the standby MDS takes over
    • Scrubbing was repeated multiple times, with subsequent MDS crashes, until at some point 0 num_strays was reached

One Active MDS and one Standby are deployed. No Dir PINNING active.
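The steps above can be sketched as a scaled-down script. The mount point, the `straytest` directory name, and the counts are illustrative (not the reporter's exact layout), and the snapshot/delete steps are commented out since they only apply on an actual CephFS mount:

```shell
# Scaled-down sketch of the repro above. Point MNT at a CephFS mount
# for a real run; by default a throwaway temp dir is used so the
# tree-building part can be exercised anywhere.
MNT="${MNT:-$(mktemp -d)}"
mkdir -p "$MNT/straytest"
for i in $(seq 1 100); do          # the reporter used ~1.6M dirs; 100 here
  mkdir "$MNT/straytest/dir$i"
done
# On CephFS, the following turns the deleted dirs into strays that stay
# pinned by the snapshot until it is removed:
#   mkdir "$MNT/straytest/.snap/snap1"
#   rm -rf "$MNT/straytest"/dir*
#   rmdir "$MNT/straytest/.snap/snap1"
echo "created $(ls "$MNT/straytest" | wc -l) dirs under $MNT/straytest"
```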

#7 Updated by Main Name 3 months ago


Using Version 16.2.6
