Bug #63658

open

OSD trim_maps - possibly too slow, leading to excessive storage space usage

Added by jianwei zhang 5 months ago. Updated 4 months ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
11/28/2023
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Current osdmap trim code logic in ceph-osd:
1. The OSD receives an MOSDMap message from the mon or another OSD, possibly carrying around 40 osdmaps, and calls OSD::handle_osd_map.
2. OSD::handle_osd_map calls OSD::trim_maps to trim old osdmaps.
3. OSD::trim_maps trims only a bounded batch of osdmap epochs per call, limited by osd_target_transaction_size (default 30).

The number of osdmaps received per message can be large while the number trimmed per call is small. Over time, a large backlog of osdmaps that should be trimmed, but have not yet been, accumulates on the OSD and occupies a large amount of its storage space; a toy model of this accumulation is sketched below.
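
Here is a standalone toy model (not Ceph code) of that accumulation, using the figures from the description above (about 40 epochs received per MOSDMap, at most osd_target_transaction_size = 30 epochs trimmed per handle_osd_map call); all variable names are made up for the example:

#include <algorithm>
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t received_per_message = 40;  // rough batch size per MOSDMap, per the description
  const uint64_t trimmed_per_call = 30;      // osd_target_transaction_size default
  uint64_t untrimmed = 0;                    // trimmable osdmap epochs still kept on disk

  for (int msg = 0; msg < 1000; ++msg) {
    untrimmed += received_per_message;                   // handle_osd_map stores the new epochs
    untrimmed -= std::min(untrimmed, trimmed_per_call);  // one trim_maps call per message
  }
  // Net growth is ~10 epochs per message, so ~10000 stale osdmaps after 1000 messages.
  std::cout << "untrimmed osdmap epochs: " << untrimmed << std::endl;
}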

There is another scenario:
1. A PG remains in an abnormal state (not active+clean) for a long time.
2. While the osdmap keeps changing, ceph-mon accumulates a large number of osdmaps without trimming them or advancing the mon's first osdmap epoch.
3. ceph-osd sees that the mon's first osdmap epoch has not advanced, so trim_maps does not trim anything.
4. Once the PGs return to active+clean, the trim_maps logic above still trims only a small batch of osdmap epochs per call, so the accumulated full and incremental osdmaps can keep occupying OSD storage space for a long time even though they are finally eligible for trimming.

A solution that comes to mind:
should OSD::trim_maps be run periodically from the OSD::tick function? See the sketch below.
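
A minimal sketch of that idea (an illustration only, not taken from the linked PR; the interval bookkeeping is omitted and the existing tick timer is assumed to provide the periodic callback):

void OSD::tick()
{
  ceph_assert(ceph_mutex_is_locked(osd_lock));
  dout(10) << "tick" << dendl;

  // Sketch: trim on the periodic tick as well, not only from handle_osd_map(),
  // using the mon-provided lower bound already stored in the superblock.
  if (!superblock.maps.empty()) {
    trim_maps(superblock.cluster_osdmap_trim_lower_bound);
    pg_num_history.prune(superblock.get_oldest_map());
  }

  // ... existing tick work and timer rescheduling ...
}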

Related issues:
https://tracker.ceph.com/issues/61962

void OSD::handle_osd_map(MOSDMap *m)
{
  // store new maps: queue for disk and put in the osdmap cache
  // first/last below are the epoch range carried by the MOSDMap message
  epoch_t start = std::max(superblock.get_newest_map() + 1, first);
  for (epoch_t e = start; e <= last; e++) {
    // a single message may carry ~40 osdmap epochs to be stored
  }

  ......
  if (!superblock.maps.empty()) {
    trim_maps(m->cluster_osdmap_trim_lower_bound);
    pg_num_history.prune(superblock.get_oldest_map());
  }
  ......
}

# to adjust various transactions that batch smaller items
- name: osd_target_transaction_size
  type: int
  level: advanced
  default: 30
  with_legacy: true

void OSD::trim_maps(epoch_t oldest)
{
  epoch_t min = std::min(oldest, service.map_cache.cached_key_lower_bound());
  dout(20) <<  __func__ << ": min=" << min << " oldest_map=" 
           << superblock.get_oldest_map() << dendl;
  if (min <= superblock.get_oldest_map())
    return;

  // Trim from the superblock's oldest_map up to `min`.
  // Break if we have exceeded the txn target size.
  ObjectStore::Transaction t;
  while (superblock.get_oldest_map() < min &&
         t.get_num_ops() < cct->_conf->osd_target_transaction_size) {
    dout(20) << " removing old osdmap epoch " << superblock.get_oldest_map() << dendl;
    t.remove(coll_t::meta(), get_osdmap_pobject_name(superblock.get_oldest_map()));
    t.remove(coll_t::meta(), get_inc_osdmap_pobject_name(superblock.get_oldest_map()));
    superblock.maps.erase(superblock.get_oldest_map());
  }

  service.publish_superblock(superblock);
  write_superblock(cct, superblock, t);
  int tr = store->queue_transaction(service.meta_ch, std::move(t), nullptr);
  ceph_assert(tr == 0);

  // we should not trim past service.map_cache.cached_key_lower_bound() 
  // as there may still be PGs with those map epochs recorded.
  ceph_assert(min <= service.map_cache.cached_key_lower_bound());
}
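
Note that the loop above is bounded by t.get_num_ops(), and each trimmed epoch queues two remove operations (one for the full map object, one for the incremental map object), so the default osd_target_transaction_size of 30 appears to allow roughly 15 epochs per call. A standalone toy loop (not Ceph code) showing that arithmetic:

#include <iostream>

int main() {
  const int osd_target_transaction_size = 30;  // default from the option shown above
  int num_ops = 0;         // stands in for t.get_num_ops()
  int epochs_trimmed = 0;

  while (num_ops < osd_target_transaction_size) {
    num_ops += 2;          // t.remove() of the full map + t.remove() of the incremental map
    ++epochs_trimmed;      // one osdmap epoch erased from superblock.maps
  }
  std::cout << "epochs trimmed per call: " << epochs_trimmed << std::endl;  // prints 15
}
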
Actions #1

Updated by dongdong tao 5 months ago

@jianwei zhang
Do you have steps to reproduce the osdmap accumulation inside the OSD in your first scenario?

Actions #2

Updated by jianwei zhang 5 months ago

void OSD::handle_osd_map(MOSDMap *m)
{
  ......
  if (superblock.cluster_osdmap_trim_lower_bound <
      m->cluster_osdmap_trim_lower_bound) {
    superblock.cluster_osdmap_trim_lower_bound =
      m->cluster_osdmap_trim_lower_bound;
    dout(10) << " superblock cluster_osdmap_trim_lower_bound new epoch is: " 
             << superblock.cluster_osdmap_trim_lower_bound << dendl;
    ceph_assert(
      superblock.cluster_osdmap_trim_lower_bound >= superblock.get_oldest_map());
  }

  ......

  if (!superblock.maps.empty()) {
    trim_maps(m->cluster_osdmap_trim_lower_bound);
    pg_num_history.prune(superblock.get_oldest_map());
  }

  ......
}

void OSD::tick()
{
  ceph_assert(ceph_mutex_is_locked(osd_lock));
  dout(10) << "tick" << dendl;

  // Proposal: also schedule trim_maps here periodically (e.g. every hour),
  // using superblock.cluster_osdmap_trim_lower_bound as the trim lower bound,
  // so trimming does not depend solely on receiving new MOSDMap messages.
}

Actions #4

Updated by Radoslaw Zarzynski 5 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 54686