Bug #50511

open

osd: rmdir .snap/snap triggers snaptrim and then crashes various OSDs

Added by Rainer Stumbaum almost 3 years ago. Updated almost 3 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,
we use http://manpages.ubuntu.com/manpages/artful/man1/cephfs-snap.1.html to create hourly, daily, weekly and monthly snapshots within our CephFS filesystem.
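
(For context, a rough sketch of what such a snapshot rotation boils down to; the path and snapshot names below are only illustrative. CephFS snapshots are plain directories under .snap, so taking one is a mkdir and expiring an old one is an rmdir - and the rmdir is what later triggers snaptrim on the OSDs.)

mkdir /mnt/pve/cephfs/vol/.snap/2021-04-25_110002+0200_hourly    # take the new hourly snapshot
rmdir /mnt/pve/cephfs/vol/.snap/2021-04-18_110002+0200_hourly    # expire the oldest snapshot past the retention count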

Unfortunately the script did not run for two weeks - I fixed that - and then the mass deletion of older snapshots brought the 5-node Ceph cluster down.

I could only fix it by stopping snap trimming, executing
ceph config set osd osd_pg_max_concurrent_snap_trims 0

Various OSDs went down with a crash.

And all the Proxmox KVM guests on top slowed down or crashed.

Please find attached the log file of a crashed OSD.

How can I re-enable snap trimming without breaking Ceph again?
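
(I assume re-enabling is simply the reverse of the command above, with a non-zero value - the value 1 here is only illustrative - but I would like to know how to do this safely:)

ceph config set osd osd_pg_max_concurrent_snap_trims 1    # illustrative non-zero value to resume trimming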

Best regards
Rainer


Files

ceph-osd.0.txt.zip (569 KB) - OSD Logfile - Rainer Stumbaum, 04/25/2021 06:10 PM
ceph-osd-logs.zip (290 KB) - Rainer Stumbaum, 04/27/2021 07:53 AM

Related issues (1 open, 0 closed)

Related to bluestore - Bug #47446: No snap trim progress after removing large snapshots (New)

Actions #1

Updated by Patrick Donnelly almost 3 years ago

 ceph version 15.2.10 (c5cb846e8e85c920ff64c704524419fc9d2b8c34) octopus (stable)
 1: (()+0x12730) [0x7f6f72ba5730]
 2: (rocksdb::autovector<rocksdb::IteratorWrapperBase<rocksdb::Slice>*, 8ul>::operator[](unsigned long)+0) [0x55a4d6e87fd4]
 3: (rocksdb::BinaryHeap<rocksdb::IteratorWrapperBase<rocksdb::Slice>*, rocksdb::MinIteratorComparator>::downheap(unsigned long)+0x24b) [0x55a4d6e87557]
 4: (rocksdb::BinaryHeap<rocksdb::IteratorWrapperBase<rocksdb::Slice>*, rocksdb::MinIteratorComparator>::replace_top(rocksdb::IteratorWrapperBase<rocksdb::Slice>* const&)+0x76) [0x55a4d6e86668]  
 5: (rocksdb::MergingIterator::Next()+0x176) [0x55a4d6e8545e]
 6: (rocksdb::DBIter::FindNextUserEntryInternal(bool, bool)+0xbde) [0x55a4d6d10480]
 7: (rocksdb::DBIter::FindNextUserEntry(bool, bool)+0x71) [0x55a4d6d185a3]
 8: (rocksdb::DBIter::Seek(rocksdb::Slice const&)+0x439) [0x55a4d6d13cf1]
 9: (rocksdb::ArenaWrappedDBIter::Seek(rocksdb::Slice const&)+0x27) [0x55a4d6d186ed]
 10: (RocksDBStore::RocksDBWholeSpaceIteratorImpl::lower_bound(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x46) [0x55a4d6c31476]
 11: (BlueStore::OmapIteratorImpl::OmapIteratorImpl(boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, std::shared_ptr<KeyValueDB::IteratorImpl>)+0x172) [0x55a4d66f6352]
 12: (BlueStore::get_omap_iterator(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&)+0x42f) [0x55a4d674082f]
 13: (OSDriver::get_next(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list>*)+0x45) [0x55a4d6475095]
 14: (SnapMapper::get_next_objects_to_trim(snapid_t, unsigned int, std::vector<hobject_t, std::allocator<hobject_t> >*)+0x5c2) [0x55a4d6477bc2]
 15: (PrimaryLogPG::AwaitAsyncWork::react(PrimaryLogPG::DoSnapWork const&)+0x27e) [0x55a4d63d65de]
 16: (boost::statechart::simple_state<PrimaryLogPG::AwaitAsyncWork, PrimaryLogPG::Trimming, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x119) [0x55a4d644e5b9]
 17: (boost::statechart::state_machine<PrimaryLogPG::SnapTrimmer, PrimaryLogPG::NotTrimming, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6c) [0x55a4d6420fbc]
 18: (PrimaryLogPG::snap_trimmer(unsigned int)+0x171) [0x55a4d6396ec1]
 19: (ceph::osd::scheduler::PGSnapTrim::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1b) [0x55a4d64ddecb]
 20: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12fa) [0x55a4d62b6e8a]
 21: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x55a4d68bab24]
 22: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a4d68bd5a0]
 23: (()+0x7fa3) [0x7f6f72b9afa3]
 24: (clone()+0x3f) [0x7f6f727484cf]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #2

Updated by Patrick Donnelly almost 3 years ago

  • Project changed from CephFS to RADOS
  • Subject changed from cephfs: rmdir .snap/snap triggers snaptrim and then crashes various OSDs to osd: rmdir .snap/snap triggers snaptrim and then crashes various OSDs
  • Component(RADOS) BlueStore, OSD added
Actions #3

Updated by Rainer Stumbaum almost 3 years ago

Hi,
I am able to reproduce this behaviour:
- Create lots of snapshots on CephFS on an active filesystem
- rmdir lots of snapshots

This starts the snaptrim and then crashes one or more OSDs. Attached logs of the crashing OSDs.
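
A rough shell sketch of that reproduction (directory, count and snapshot names are made up for illustration):

for i in $(seq 1 100); do
    mkdir /mnt/pve/cephfs/vol/.snap/test-snap-$i    # create lots of snapshots while the filesystem is in active use
done
for i in $(seq 1 100); do
    rmdir /mnt/pve/cephfs/vol/.snap/test-snap-$i    # bulk removal is what kicks off snaptrim on the OSDs
done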

Tried with the following parameters:

root@proxmox01:~# ceph config show osd.0
NAME                              VALUE                             SOURCE    OVERRIDES  IGNORES
auth_client_required              cephx                             file
auth_cluster_required             cephx                             file
auth_service_required             cephx                             file
bluestore_compression_algorithm   lz4                               file
bluestore_compression_mode        aggressive                        file
cluster_network                   10.50.10.0/24                     file
daemonize                         false                             override
debug_asok                        0/0                               file
debug_auth                        0/0                               file
debug_buffer                      0/0                               file
debug_client                      0/0                               file
debug_context                     0/0                               file
debug_crush                       0/0                               file
debug_filer                       0/0                               file
debug_filestore                   0/0                               file
debug_finisher                    0/0                               file
debug_heartbeatmap                0/0                               file
debug_journal                     0/0                               file
debug_journaler                   0/0                               file
debug_lockdep                     0/0                               file
debug_mds                         0/0                               file
debug_mds_balancer                0/0                               file
debug_mds_locker                  0/0                               file
debug_mds_log                     0/0                               file
debug_mds_log_expire              0/0                               file
debug_mds_migrator                0/0                               file
debug_mon                         0/0                               file
debug_monc                        0/0                               file
debug_ms                          0/0                               file
debug_objclass                    0/0                               file
debug_objectcacher                0/0                               file
debug_objecter                    0/0                               file
debug_optracker                   0/0                               file
debug_osd                         0/0                               file
debug_paxos                       0/0                               file
debug_perfcounter                 0/0                               file
debug_rados                       0/0                               file
debug_rbd                         0/0                               file
debug_rgw                         0/0                               file
debug_throttle                    0/0                               file
debug_timer                       0/0                               file
debug_tp                          0/0                               file
keyring                           $osd_data/keyring                 default
leveldb_log                                                         default
mon_allow_pool_delete             false                             file
mon_host                          10.20.42.1 10.20.42.2 10.20.42.3  file
mon_target_pg_per_osd             100                               mon
no_config_file                    false                             override
osd_journal_size                  5120                              file
osd_max_trimming_pgs              1                                 mon
osd_pg_max_concurrent_snap_trims  0                                 mon
osd_pool_default_min_size         2                                 file
osd_pool_default_size             3                                 file
osd_scrub_begin_hour              23                                mon
osd_scrub_end_hour                5                                 mon
osd_snap_trim_sleep               2.000000                          mon
osd_snap_trim_sleep_ssd           2.000000                          mon
public_network                    10.20.42.0/24                     file
rbd_default_features              61                                default
setgroup                          ceph                              cmdline
setuser                           ceph                              cmdline
root@proxmox01:~#

Best regards
Rainer

Actions #4

Updated by Neha Ojha almost 3 years ago

  • Related to Bug #47446: No snap trim progress after removing large snapshots added
Actions #5

Updated by Neha Ojha almost 3 years ago

  • Status changed from New to Need More Info

osd.0 and osd.7 show different crashes. Would it be possible for you to capture a coredump or OSD logs with debug_osd=20, debug_bluefs=20, debug_bluestore=20, debug_bdev=20, debug_ms=1?

Actions #6

Updated by Rainer Stumbaum almost 3 years ago

Hi,
this is what I did a few days ago, following a recommendation, on all the OSD servers:

ceph config set osd bluefs_buffered_io true
cat /var/lib/ceph/osd/ceph-*/whoami | xargs -I{} ceph daemon osd.{} compact

The latter took about 5-10 minutes per OSD.

And now I just did the following:

ceph config set osd debug_osd 20
ceph config set osd debug_bluefs 20
ceph config set osd debug_bluestore 20
ceph config set osd debug_bdev 20
ceph config set osd debug_ms 1
ceph config set osd osd_pg_max_concurrent_snap_trims 1
rmdir /mnt/pve/cephfs/vol/.snap/2021-03-31_000002+0200_daily
rmdir /mnt/pve/cephfs/vol/.snap/2021-04-01_000002+0200_daily/

There were some short-running snaptrims triggered in "ceph status", but no more OSD crashes.

So I am not able to reproduce the behaviour on the test system anymore...

On the production system I did the same - the snap trims are taking a lot longer:

root@proxmox07:~# ceph status
  cluster:
    id:     80c20b8a-00f0-4e3a-954f-e6940f603ba3
    health: HEALTH_WARN
            noout flag(s) set

  services:
    mon: 5 daemons, quorum proxmox07,proxmox08,proxmox09,proxmox10,proxmox11 (age 8d)
    mgr: proxmox08(active, since 13d), standbys: proxmox07, proxmox10, proxmox11, proxmox09
    mds: cephfs:1 {0=proxmox09=up:active} 4 up:standby
    osd: 30 osds: 30 up (since 4d), 30 in (since 2w)
         flags noout

  data:
    pools:   5 pools, 2186 pgs
    objects: 49.19M objects, 14 TiB
    usage:   35 TiB used, 52 TiB / 87 TiB avail
    pgs:     2058 active+clean
             98   active+clean+snaptrim_wait
             30   active+clean+snaptrim

  io:
    client:   13 MiB/s rd, 46 MiB/s wr, 820 op/s rd, 1.42k op/s wr

root@proxmox07:~#

Still no OSD crashes nor bad side effects.

I am going to leave the snap trims run for the next two hours now and see what happens...

Actions #7

Updated by Rainer Stumbaum almost 3 years ago

So,
the snaptrim of the 128 PGs in that cephfs_data pool took about two hours.

No OSDs and no clients were harmed during that process. It seems that executing "ceph daemon osd.XX compact" for all OSDs did the trick and fixed my problem.

Is it a good idea - when creating and deleting snapshots on an hourly basis - to run the above command weekly?
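
(If one wanted to automate that, a minimal sketch would be the compaction loop from comment #6 above in a weekly cron entry - the cron.d path, schedule and user here are only illustrative:)

# /etc/cron.d/ceph-osd-compact - illustrative weekly online compaction of all OSDs local to this host
0 3 * * 0   root   cat /var/lib/ceph/osd/ceph-*/whoami | xargs -I{} ceph daemon osd.{} compact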

Actions #8

Updated by Igor Fedotov almost 3 years ago

Given all the symptoms I'm pretty sure the root cause of the issue is a "degraded" state of RocksDB after bulk data removal. There are multiple complaints about a dramatic DB performance drop after PG removal/migration, which can be perfectly worked around by DB compaction.

I can see no reason why bulk snaptrim wouldn't suffer from the same issue.

Relevant patches for PG removals are:

https://github.com/ceph/ceph/pull/37314
https://github.com/ceph/ceph/pull/37496

Highly likely we need some redesign of snapshot tracking and removal, to some degree similar to the one for PG removal (https://github.com/ceph/ceph/pull/37496).

Actions #9

Updated by Igor Fedotov almost 3 years ago

  • Project changed from RADOS to bluestore
Actions #10

Updated by Igor Fedotov almost 3 years ago

Rainer Stumbaum wrote:

So,
the snaptrim of the 128 PGs in that cephfs_data pool took about two hours.

No OSDs and no clients were harmed during that process. It seems that executing "ceph daemon osd.XX compact" for all OSDs did the trick and fixed my problem.

Is it a good idea - when creating and deleting snapshots on an hourly basis - to run the above command weekly?

I haven't seen any setups like that, but online DB compaction doesn't look like a harmful operation, so it makes sense to try.

Actions #11

Updated by Ist Gab almost 3 years ago

Had a similar issue with RBD pool removal. The only thing that helped me, as Igor suggested, was to stop the OSD, run

ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-26/ compact

and then restart the OSD.
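
(Spelled out as a full sequence for one OSD - a sketch only; the systemd unit name assumes a standard, non-containerized deployment, and osd.26 is just the id from the comment above:)

systemctl stop ceph-osd@26                                          # take the OSD offline first
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-26/ compact   # offline RocksDB compaction
systemctl start ceph-osd@26                                         # bring the OSD back up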

Actions #12

Updated by Ist Gab almost 3 years ago

A small correction: in some cases this compaction didn't help either, so I had to reinstall the OSD.

Actions #13

Updated by Christian Rohmann almost 3 years ago

Igor Fedotov wrote:

Given all the symptoms I'm pretty sure the root cause of the issue is a "degraded" state of RocksDB after bulk data removal. There are multiple complaints about a dramatic DB performance drop after PG removal/migration, which can be perfectly worked around by DB compaction.

I can see no reason why bulk snaptrim wouldn't suffer from the same issue.

Relevant patches for PG removals are:

https://github.com/ceph/ceph/pull/37314
https://github.com/ceph/ceph/pull/37496

Highly likely we need some redesign of snapshot tracking and removal, to some degree similar to the one for PG removal (https://github.com/ceph/ceph/pull/37496).

Igor, I was about to raise a bug about extreme degradation of the whole cluster during snap trims and then found this very issue, which sounds quite similar. But I want to double-check whether a dedicated bug report might be sensible ... this is what happened:

We upgraded a 4-node HDD-only cluster (I know, I know) from Nautilus to Octopus and also ran the OSD fsck to create the new omap format (as described in https://docs.ceph.com/en/octopus/releases/octopus/#rados).

Things went fine, but when we deleted a few image snapshots (OpenStack cinder backups) we observed no more user I/O being served, flapping OSDs, ... the lot.

We also played around with adjustments to snaptrim such as osd_snap_trim_sleep, but no luck.
Finally we found various hints about RocksDB compaction on the OSDs and ran it on all OSDs in the cluster. When deleting snapshots now, the snaptrim operations are through in seconds and barely visible via ceph status. As we did not have any such issues prior to the Ceph upgrade, could this be due to the omap conversion causing RocksDB to be degraded this heavily?
Maybe something to note in the upgrade guide then?

Are the PRs you are referring to also targeting this issue?

Is there any way to properly monitor a cluster for RocksDB degradation / the need for compaction? Is there anything I missed in the recommendations about running compact manually from time to time? Does this never happen automatically?
I know there are the options:
* mon_compact_on_start
* osd_compact_on_start (with Pacific)

but those require OSDs to be restarted frequently enough.
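
(For what it's worth, enabling the latter would presumably be a one-liner like this - a sketch only, and it still only takes effect whenever an OSD is restarted:)

ceph config set osd osd_compact_on_start true    # Pacific and later: compact the OSD's RocksDB at startup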
