Bug #63658
ceph-osd trim_maps - possibly too slow, leading to excessive storage space usage
Description
Current osdmap trim code logic in ceph-osd:
1. The OSD receives an MOSDMap message from a mon or another OSD (possibly containing ~40 osdmap epochs) and calls OSD::handle_osd_map.
2. OSD::handle_osd_map calls OSD::trim_maps to trim old osdmaps.
3. OSD::trim_maps can trim at most osd_target_transaction_size (default 30) osdmap epochs at one time.
The number of received osdmaps is large, but the number of trimmed osdmaps is small.
Over time, a large number of osdmaps that should be trimmed but cannot be will accumulate on the OSD,
occupying a large amount of OSD storage space (see the sketch below).
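For intuition, here is a minimal standalone sketch (not OSD code) of that accumulation; the 40 epochs per MOSDMap and 30 trimmed epochs per call are the illustrative figures from the steps above, not measured values.

// Minimal standalone sketch (not OSD code): models how the stored-osdmap
// backlog grows when each MOSDMap delivers more epochs than one trim_maps
// call removes. The 40 and 30 below are illustrative assumptions.
#include <algorithm>
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t epochs_per_mosdmap = 40;       // assumed epochs received per MOSDMap
  const uint64_t epochs_trimmed_per_call = 30;  // assumed epochs removed per trim_maps call
  uint64_t stored_epochs = 0;

  for (int msg = 0; msg < 1000; ++msg) {
    stored_epochs += epochs_per_mosdmap;  // handle_osd_map stores the new maps
    stored_epochs -= std::min(stored_epochs, epochs_trimmed_per_call);  // one bounded trim per message
  }
  // Net growth is (40 - 30) per message: ~10000 untrimmed epochs after 1000 messages.
  std::cout << "untrimmed epochs: " << stored_epochs << std::endl;
}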
There is another scenario:
1. PGs stay in an abnormal state (not active+clean) for a long time.
2. As the osdmap keeps changing, ceph-mon accumulates a large number of osdmaps without trimming them, so the mon's first osdmap epoch does not advance.
3. ceph-osd sees that the mon's first osdmap epoch has not advanced, so trim_maps does not trim anything.
4. When the PGs return to active+clean, because of the OSD trim_maps logic above, at most 30 osdmap epochs can be trimmed at a time; the intermediate incremental osdmaps may occupy OSD storage space for a long time even though they are now eligible for trimming (a rough drain-time estimate is sketched below).
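To give a feel for how long that backlog can linger, here is a rough drain-time estimate (the backlog size is an assumption; the 30-epochs-per-call figure is the osd_target_transaction_size default quoted above):

// Rough drain-time estimate (illustrative, not OSD code): how many further
// MOSDMap messages are needed before an already-eligible backlog is gone,
// given that each handle_osd_map call trims at most a bounded batch.
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t backlog_epochs = 100000;   // assumed epochs accumulated while PGs were unhealthy
  const uint64_t trimmed_per_mosdmap = 30;  // at most one bounded trim per incoming MOSDMap
  const uint64_t messages_needed =
      (backlog_epochs + trimmed_per_mosdmap - 1) / trimmed_per_mosdmap;

  // Trimming is only triggered by new MOSDMap messages, so on a quiet cluster
  // the remaining space can stay pinned indefinitely.
  std::cout << "MOSDMap messages needed: " << messages_needed << std::endl;  // ~3334
}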
Solution that comes to mind:
Do we need to perform OSD::trim_maps periodically in the OSD::tick function?
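To make the idea concrete, here is a hedged sketch of what a tick-driven trim could look like, written in the same abbreviated style as the snippets in this ticket (not compilable on its own, and not the change under review). Calling trim_maps with superblock.cluster_osdmap_trim_lower_bound as the upper bound follows the comment in jianwei zhang's reply below; everything else is an assumption.

// Hypothetical sketch only: drive a bounded trim from OSD::tick so trimming
// is not solely triggered by incoming MOSDMap messages.
void OSD::tick()
{
  ceph_assert(ceph_mutex_is_locked(osd_lock));
  dout(10) << "tick" << dendl;

  if (!superblock.maps.empty()) {
    // Still bounded by osd_target_transaction_size per call, so each tick
    // only reclaims a small batch of old epochs.
    trim_maps(superblock.cluster_osdmap_trim_lower_bound);
    pg_num_history.prune(superblock.get_oldest_map());
  }

  ......
}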
Related issues:
https://tracker.ceph.com/issues/61962
void OSD::handle_osd_map(MOSDMap *m)
{
  // store new maps: queue for disk and put in the osdmap cache
  epoch_t start = std::max(superblock.get_newest_map() + 1, first);
  for (epoch_t e = start; e <= last; e++) {
    // maybe receives 40 osdmap epochs
  }
  ......
  if (!superblock.maps.empty()) {
    trim_maps(m->cluster_osdmap_trim_lower_bound);
    pg_num_history.prune(superblock.get_oldest_map());
  }
  ......
}

# to adjust various transactions that batch smaller items
- name: osd_target_transaction_size
  type: int
  level: advanced
  default: 30
  with_legacy: true

void OSD::trim_maps(epoch_t oldest)
{
  epoch_t min = std::min(oldest, service.map_cache.cached_key_lower_bound());
  dout(20) << __func__ << ": min=" << min
           << " oldest_map=" << superblock.get_oldest_map() << dendl;
  if (min <= superblock.get_oldest_map())
    return;

  // Trim from the superblock's oldest_map up to `min`.
  // Break if we have exceeded the txn target size.
  ObjectStore::Transaction t;
  while (superblock.get_oldest_map() < min &&
         t.get_num_ops() < cct->_conf->osd_target_transaction_size) {
    dout(20) << " removing old osdmap epoch " << superblock.get_oldest_map() << dendl;
    t.remove(coll_t::meta(), get_osdmap_pobject_name(superblock.get_oldest_map()));
    t.remove(coll_t::meta(), get_inc_osdmap_pobject_name(superblock.get_oldest_map()));
    superblock.maps.erase(superblock.get_oldest_map());
  }

  service.publish_superblock(superblock);
  write_superblock(cct, superblock, t);
  int tr = store->queue_transaction(service.meta_ch, std::move(t), nullptr);
  ceph_assert(tr == 0);

  // we should not trim past service.map_cache.cached_key_lower_bound()
  // as there may still be PGs with those map epochs recorded.
  ceph_assert(min <= service.map_cache.cached_key_lower_bound());
}
Updated by dongdong tao 5 months ago
@jianwei zhang
Do you have steps to reproduce the OSDMap accumulation inside the OSD in your first scenario?
Updated by jianwei zhang 5 months ago
void OSD::handle_osd_map(MOSDMap *m)
{
  ......
  if (superblock.cluster_osdmap_trim_lower_bound <
      m->cluster_osdmap_trim_lower_bound) {
    superblock.cluster_osdmap_trim_lower_bound =
      m->cluster_osdmap_trim_lower_bound;
    dout(10) << " superblock cluster_osdmap_trim_lower_bound new epoch is: "
             << superblock.cluster_osdmap_trim_lower_bound << dendl;
    ceph_assert(superblock.cluster_osdmap_trim_lower_bound >=
                superblock.get_oldest_map());
  }
  ......
  if (!superblock.maps.empty()) {
    trim_maps(m->cluster_osdmap_trim_lower_bound);
    pg_num_history.prune(superblock.get_oldest_map());
  }
  ......
}

void OSD::trim_maps(epoch_t oldest)
{
  epoch_t min = std::min(oldest, service.map_cache.cached_key_lower_bound());
  dout(20) << __func__ << ": min=" << min
           << " oldest_map=" << superblock.get_oldest_map() << dendl;
  if (min <= superblock.get_oldest_map())
    return;

  // Trim from the superblock's oldest_map up to `min`.
  // Break if we have exceeded the txn target size.
  ObjectStore::Transaction t;
  while (superblock.get_oldest_map() < min &&
         t.get_num_ops() < cct->_conf->osd_target_transaction_size) {
    dout(20) << " removing old osdmap epoch " << superblock.get_oldest_map() << dendl;
    t.remove(coll_t::meta(), get_osdmap_pobject_name(superblock.get_oldest_map()));
    t.remove(coll_t::meta(), get_inc_osdmap_pobject_name(superblock.get_oldest_map()));
    superblock.maps.erase(superblock.get_oldest_map());
  }

  service.publish_superblock(superblock);
  write_superblock(cct, superblock, t);
  int tr = store->queue_transaction(service.meta_ch, std::move(t), nullptr);
  ceph_assert(tr == 0);

  // we should not trim past service.map_cache.cached_key_lower_bound()
  // as there may still be PGs with those map epochs recorded.
  ceph_assert(min <= service.map_cache.cached_key_lower_bound());
}

void OSD::tick()
{
  ceph_assert(ceph_mutex_is_locked(osd_lock));
  dout(10) << "tick" << dendl;

  // trim_maps would be scheduled every hour, and the lower boundary is
  // superblock.cluster_osdmap_trim_lower_bound
}
Updated by Radoslaw Zarzynski 5 months ago
- Status changed from New to Fix Under Review
- Pull request ID set to 54686