Project

General

Profile

Actions

Bug #63609

open

osd acquire map_cache_lock high latency

Added by jianwei zhang 6 months ago. Updated 6 months ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
11/22/2023
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description


1 osd, 128 pgs
5 shard-queue/thread, 128/5 = 26 pgs

MOSDMap receive 40 epochs osdmap once, store to bluestore, and consume_map for every pg

2023-11-22T13:02:46.218+0800 7ff03b597700  1 -- v2:127.0.0.1:6821/912494 <== mon.0 v2:127.0.0.1:40545/0 7 ==== osd_map(2571..2610 src has 252..2687) v4 ==== 639477+0+0 (crc 0 0 0) 0x55b311457500 con 0x55b3122cfc00
2023-11-22T13:02:47.078+0800 7ff03c599700 10 osd.0 2570 _committed_osd_maps 2571..2610

pg 1.4 consume osdmap  2571..2610  needs 2.271 seconds

26 pgs * 2.271 = 59.046 seconds  // very high

This will cause osd to be unable to catch up with the latest osdmap epoch in time.

The root cause is
1. The map_cache_lock lock range is too large, it needs to protect 3 lru caches

   //osd map cache (past osd maps)
   ceph::mutex map_cache_lock = ceph::make_mutex("OSDService::map_cache_lock");
   SharedLRU<epoch_t, const OSDMap> map_cache;
   SimpleLRU<epoch_t, ceph::buffer::list> map_bl_cache;
   SimpleLRU<epoch_t, ceph::buffer::list> map_bl_inc_cache;

2. When lru map_cache misses, it will read the disk and OSDMap::decode

2023-11-22T13:02:50.010 - 2023-11-22T13:02:47.739 = 2.271 秒

- OSDService::try_get_map

  - all_lat = 1.765460617

  - map_cache_lock_lat = 1.281302137  // high latency

  - map_cache_lookup_lat = 0.000025702

  - map_cache_key_lower_bound_lat = 0.00001841

  - _get_map_bl_lat = 0.059360301

  - map_decode_lat = 0.424754057
                                                                                                                           all_lat        map_cache_lock_lat        map_cache_lookup_lat        map_cache_key_lower_bound_lat        _get_map_bl_lat        map_decode_lat
2023-11-22T13:02:48.078    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.060132444    map_cache_lock_lat    0.049500227    map_cache_lookup_lat    0.049500927    map_cache_key_lower_bound_lat    0.000000467    _get_map_bl_lat    0.001472938    map_decode_lat    0.009158111
2023-11-22T13:02:48.092    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.009916984    map_cache_lock_lat    0.000000139    map_cache_lookup_lat    0.000000458    map_cache_key_lower_bound_lat    0.000000222    _get_map_bl_lat    0.001354447    map_decode_lat    0.008561857
2023-11-22T13:02:48.162    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.06527137    map_cache_lock_lat    0.054349292    map_cache_lookup_lat    0.054349929    map_cache_key_lower_bound_lat    0.000000302    _get_map_bl_lat    0.00148555    map_decode_lat    0.009435585
2023-11-22T13:02:48.228    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.062032465    map_cache_lock_lat    0.050890993    map_cache_lookup_lat    0.050891679    map_cache_key_lower_bound_lat    0.000000359    _get_map_bl_lat    0.001522906    map_decode_lat    0.009617516
2023-11-22T13:02:48.296    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.064207315    map_cache_lock_lat    0.053334683    map_cache_lookup_lat    0.053335417    map_cache_key_lower_bound_lat    0.000000482    _get_map_bl_lat    0.00156488    map_decode_lat    0.009306537
2023-11-22T13:02:48.356    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.055843525    map_cache_lock_lat    0.037233017    map_cache_lookup_lat    0.037233509    map_cache_key_lower_bound_lat    0.00000034    _get_map_bl_lat    0.001857088    map_decode_lat    0.016752589
2023-11-22T13:02:48.427    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.063933834    map_cache_lock_lat    0.042691372    map_cache_lookup_lat    0.042692229    map_cache_key_lower_bound_lat    0.000001041    _get_map_bl_lat    0.002116326    map_decode_lat    0.019124236
2023-11-22T13:02:48.496    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.061010312    map_cache_lock_lat    0.042960327    map_cache_lookup_lat    0.042961389    map_cache_key_lower_bound_lat    0.000000979    _get_map_bl_lat    0.002153001    map_decode_lat    0.015894946
2023-11-22T13:02:48.567    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.065189451    map_cache_lock_lat    0.050964955    map_cache_lookup_lat    0.050965529    map_cache_key_lower_bound_lat    0.000000656    _get_map_bl_lat    0.001760848    map_decode_lat    0.012462417
2023-11-22T13:02:48.649    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.075792655    map_cache_lock_lat    0.05511919    map_cache_lookup_lat    0.055120368    map_cache_key_lower_bound_lat    0.000000894    _get_map_bl_lat    0.002399989    map_decode_lat    0.018271407
2023-11-22T13:02:48.673    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.016864773    map_cache_lock_lat    0.000000163    map_cache_lookup_lat    0.000000714    map_cache_key_lower_bound_lat    0.000000423    _get_map_bl_lat    0.001980496    map_decode_lat    0.014883139
2023-11-22T13:02:48.746    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.066411771    map_cache_lock_lat    0.053015575    map_cache_lookup_lat    0.053016394    map_cache_key_lower_bound_lat    0.000000381    _get_map_bl_lat    0.001849816    map_decode_lat    0.011545174
2023-11-22T13:02:48.812    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.061037824    map_cache_lock_lat    0.050285898    map_cache_lookup_lat    0.050286617    map_cache_key_lower_bound_lat    0.000000472    _get_map_bl_lat    0.001592858    map_decode_lat    0.009157878
2023-11-22T13:02:48.886    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.070571721    map_cache_lock_lat    0.05933978    map_cache_lookup_lat    0.059340537    map_cache_key_lower_bound_lat    0.000000588    _get_map_bl_lat    0.001702452    map_decode_lat    0.009528146
2023-11-22T13:02:48.971    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.080418512    map_cache_lock_lat    0.060756236    map_cache_lookup_lat    0.060757279    map_cache_key_lower_bound_lat    0.000000724    _get_map_bl_lat    0.002362467    map_decode_lat    0.017298045
2023-11-22T13:02:48.994    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.016526777    map_cache_lock_lat    0.000000174    map_cache_lookup_lat    0.000000666    map_cache_key_lower_bound_lat    0.000000431    _get_map_bl_lat    0.002043063    map_decode_lat    0.014482617
2023-11-22T13:02:49.071    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.071925208    map_cache_lock_lat    0.052028365    map_cache_lookup_lat    0.052029613    map_cache_key_lower_bound_lat    0.000000667    _get_map_bl_lat    0.002546297    map_decode_lat    0.01734863
2023-11-22T13:02:49.157    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.078764386    map_cache_lock_lat    0.058724303    map_cache_lookup_lat    0.058725547    map_cache_key_lower_bound_lat    0.000000754    _get_map_bl_lat    0.002463211    map_decode_lat    0.017574869
2023-11-22T13:02:49.240    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.075854778    map_cache_lock_lat    0.055942215    map_cache_lookup_lat    0.055943493    map_cache_key_lower_bound_lat    0.000000868    _get_map_bl_lat    0.00240916    map_decode_lat    0.017501255
2023-11-22T13:02:49.263    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.016464595    map_cache_lock_lat    0.000000179    map_cache_lookup_lat    0.00000073    map_cache_key_lower_bound_lat    0.000000443    _get_map_bl_lat    0.001978494    map_decode_lat    0.014484928
2023-11-22T13:02:49.346    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.077999257    map_cache_lock_lat    0.058135744    map_cache_lookup_lat    0.058136925    map_cache_key_lower_bound_lat    0.000000484    _get_map_bl_lat    0.002515038    map_decode_lat    0.017346809
2023-11-22T13:02:49.431    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.078063644    map_cache_lock_lat    0.056774575    map_cache_lookup_lat    0.056775816    map_cache_key_lower_bound_lat    0.00000071    _get_map_bl_lat    0.002506333    map_decode_lat    0.018780785
2023-11-22T13:02:49.517    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.078985013    map_cache_lock_lat    0.057551544    map_cache_lookup_lat    0.057552725    map_cache_key_lower_bound_lat    0.000000874    _get_map_bl_lat    0.002575265    map_decode_lat    0.018856149
2023-11-22T13:02:49.597    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.073418237    map_cache_lock_lat    0.053344097    map_cache_lookup_lat    0.053345256    map_cache_key_lower_bound_lat    0.000000835    _get_map_bl_lat    0.002311108    map_decode_lat    0.017761037
2023-11-22T13:02:49.681    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.077397563    map_cache_lock_lat    0.056956548    map_cache_lookup_lat    0.056957699    map_cache_key_lower_bound_lat    0.000001048    _get_map_bl_lat    0.002435201    map_decode_lat    0.018003613
2023-11-22T13:02:49.762    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.074050911    map_cache_lock_lat    0.056898676    map_cache_lookup_lat    0.056899667    map_cache_key_lower_bound_lat    0.000000643    _get_map_bl_lat    0.00209178    map_decode_lat    0.015058822
2023-11-22T13:02:49.846    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.077292934    map_cache_lock_lat    0.057717863    map_cache_lookup_lat    0.057719007    map_cache_key_lower_bound_lat    0.000001004    _get_map_bl_lat    0.002327521    map_decode_lat    0.017245404
2023-11-22T13:02:49.927    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.074901029    map_cache_lock_lat    0.056785829    map_cache_lookup_lat    0.056787081    map_cache_key_lower_bound_lat    0.000000878    _get_map_bl_lat    0.00214289    map_decode_lat    0.015970185
2023-11-22T13:02:49.948    7ff02e57d700    1    osd.0    2610    try_get_map    OSDService::try_get_map    all_lat    0.015181329    map_cache_lock_lat    0.000000178    map_cache_lookup_lat    0.000000639    map_cache_key_lower_bound_lat    0.000000441    _get_map_bl_lat    0.001838878    map_decode_lat    0.013341371
                                                                                                                     1.765460617        1.281302137        1.281327839        0.00001841        0.059360301        0.424754057
OSDMapRef OSDService::try_get_map(epoch_t epoch)
{
  auto ts1 = ceph::real_clock::now();
  std::lock_guard l(map_cache_lock);
  auto ts1_1 = ceph::real_clock::now();
  OSDMapRef retval = map_cache.lookup(epoch);
  if (retval) {
    dout(30) << "get_map " << epoch << " -cached" << dendl;
    logger->inc(l_osd_map_cache_hit);
    return retval;
  }
  auto ts2 = ceph::real_clock::now();
  {
    logger->inc(l_osd_map_cache_miss);
    epoch_t lb = map_cache.cached_key_lower_bound();
    if (epoch < lb) {
      dout(30) << "get_map " << epoch << " - miss, below lower bound" << dendl;
      logger->inc(l_osd_map_cache_miss_low);
      logger->inc(l_osd_map_cache_miss_low_avg, lb - epoch);
    }
  }

  auto ts3 = ceph::real_clock::now();
  OSDMap *map = new OSDMap;
  if (epoch > 0) {
    dout(20) << "get_map " << epoch << " - loading and decoding " << map << dendl;
    bufferlist bl;
    if (!_get_map_bl(epoch, bl) || bl.length() == 0) {
      derr << "failed to load OSD map for epoch " << epoch << ", got " << bl.length() << " bytes" << dendl;
      delete map;
      return OSDMapRef();
    }
    auto ts4 = ceph::real_clock::now();
    map->decode(bl);
    auto ts5 = ceph::real_clock::now();
    auto all_lat = ts5 - ts1;
    auto map_cache_lock_lat = ts1_1 - ts1;
    auto map_cache_lookup_lat = ts2 - ts1_1;
    auto map_cache_key_lower_bound_lat = ts3 - ts2;
    auto _get_map_bl_lat = ts4 - ts3;
    auto map_decode_lat = ts5 - ts4;
    dout(1) << __func__ << " OSDService::try_get_map zhangjianwei2: " 
           << " all_lat=" << all_lat
           << " map_cache_lock_lat=" << map_cache_lock_lat
           << " map_cache_lookup_lat=" << map_cache_lookup_lat
           << " map_cache_key_lower_bound_lat=" << map_cache_key_lower_bound_lat
           << " _get_map_bl_lat=" << _get_map_bl_lat
           << " map_decode_lat=" << map_decode_lat
           << dendl;
  } else {
    dout(20) << "get_map " << epoch << " - return initial " << map << dendl;
  }
  return _add_map(map);
}
Actions

Also available in: Atom PDF