Bug #62918: all write io block but no flush is triggered - RADOS - Ceph

Actions

Copy link

Bug #62918

open

all write io block but no flush is triggered

Added by Jack Lv 8 months ago. Updated 5 months ago.

Status:

Fix Under Review

Priority:

Normal

Assignee:

Category:

Target version:

% Done:

Source:

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(RADOS):

Pull request ID:

53622

Crash signature (v1):

Crash signature (v2):

Description

If memory page is large, _maybe_wait_for_writeback may wait for dirty bufferheads exceed limit, but dirty data size is smaller than target dirty,
so fluser_entry will not run flush, then all write io block.

Actions

Copy link

Updated by Venky Shankar 7 months ago

Project changed from CephFS to RADOS

Jack Lv wrote:

If memory page is large, _maybe_wait_for_writeback may wait for dirty bufferheads exceed limit, but dirty data size is smaller than target dirty,
so fluser_entry will not run flush, then all write io block.

Do you have a reproducer? Going by your description, setting some ObjectCacher config lower possibly reproduce this?

Also, this is more suited for the RADOS project.

Actions

Copy link

Updated by Radoslaw Zarzynski 6 months ago

Status changed from New to Need More Info

Actions

Copy link

Updated by Jack Lv 6 months ago

Hi, run ceph-fuse in OS that have a 64k(or 32k etc) memory page, then do multithread small io(4k) write on ceph-fuse directory, it is easy to reproduce(use lagger client_oc_max_dirty_age).

When bufferheads num of dirty data exceed limit, but dirty data size does not, flusher_entry just do nothing even all ios have block at _maybe_wait_for_writeback, it is perhaps unreasonable.

void ObjectCacher::_maybe_wait_for_writeback(uint64_t len,
                                             ZTracer::Trace *trace)
{ 
  ceph_assert(ceph_mutex_is_locked(lock));
  ceph::mono_time start = ceph::mono_clock::now();
  int blocked = 0;
  // wait for writeback?
  //  - wait for dirty and tx bytes (relative to the max_dirty threshold)
  //  - do not wait for bytes other waiters are waiting on.  this means that
  //    threads do not wait for each other.  this effectively allows the cache
  //    size to balloon proportional to the data that is in flight.

  *uint64_t max_dirty_bh = max_dirty >> BUFFER_MEMORY_WEIGHT;*

Actions

Copy link

Updated by Jack Lv 6 months ago

_maybe_wait_for_writeback calculate max_dirty_bh by max_dirty and BUFFER_MEMORY_WEIGHT(related with memory pages).

Actions

Copy link

Updated by Radoslaw Zarzynski 6 months ago

The BUFFER_MEMORY_WEIGHT is just an alias over CEPH_PAGE_SHIFT:

./osdc/ObjectCacher.cc:#define BUFFER_MEMORY_WEIGHT CEPH_PAGE_SHIFT  // memory usage of BufferHead, count in (1<<n)

which in turn is:

namespace ceph {
  // these are in common/page.cc
  extern unsigned _page_size;
  extern unsigned long _page_mask;
  extern unsigned _page_shift;
}

#endif

#define CEPH_PAGE_SIZE ceph::_page_size
#define CEPH_PAGE_MASK ceph::_page_mask
#define CEPH_PAGE_SHIFT ceph::_page_shift

These globals are calculated the following way:

namespace ceph {

  // page size crap, see page.h
  int _get_bits_of(int v) {
    int n = 0;
    while (v) {
      n++;
      v = v >> 1;
    }
    return n;
  }

  #ifdef _WIN32
  unsigned _get_page_size() {
    SYSTEM_INFO system_info;
    GetSystemInfo(&system_info);
    return system_info.dwPageSize;
  }

  unsigned _page_size = _get_page_size();
  #else
  unsigned _page_size = sysconf(_SC_PAGESIZE);
  #endif
  unsigned long _page_mask = ~(unsigned long)(_page_size - 1);
  unsigned _page_shift = _get_bits_of(_page_size - 1);

}

Could you provide the value of sysconf(_SC_PAGESIZE) maybe?

Actions

Copy link

Updated by Jack Lv 6 months ago

it is 65536 in my test env

Actions

Copy link

Updated by Radoslaw Zarzynski 6 months ago

So your point that the while's body isn't executed when max_dirty_bh is calculated for 64k pages, right?

void ObjectCacher::_maybe_wait_for_writeback(uint64_t len,
                                             ZTracer::Trace *trace)
{
  ceph_assert(ceph_mutex_is_locked(lock));
  ceph::mono_time start = ceph::mono_clock::now();
  int blocked = 0;
  // wait for writeback?
  //  - wait for dirty and tx bytes (relative to the max_dirty threshold)
  //  - do not wait for bytes other waiters are waiting on.  this means that
  //    threads do not wait for each other.  this effectively allows the cache
  //    size to balloon proportional to the data that is in flight.

  uint64_t max_dirty_bh = max_dirty >> BUFFER_MEMORY_WEIGHT;
  while (get_stat_dirty() + get_stat_tx() > 0 &&
         (((uint64_t)(get_stat_dirty() + get_stat_tx()) >=
          max_dirty + get_stat_dirty_waiting()) ||
         (dirty_or_tx_bh.size() >=
          max_dirty_bh + get_stat_nr_dirty_waiters()))) {

    if (blocked == 0) {
      trace->event("start wait for writeback");
    }
    ldout(cct, 10) << __func__ << " waiting for dirty|tx " 
                   << (get_stat_dirty() + get_stat_tx()) << " >= max " 
                   << max_dirty << " + dirty_waiting " 
                   << get_stat_dirty_waiting() << dendl;
    flusher_cond.notify_all();
    stat_dirty_waiting += len;
    ++stat_nr_dirty_waiters;
    std::unique_lock l{lock, std::adopt_lock};
    stat_cond.wait(l);
    l.release();
    stat_dirty_waiting -= len;
    --stat_nr_dirty_waiters;
    ++blocked;
    ldout(cct, 10) << __func__ << " woke up" << dendl;
  }

Actions

Copy link

Updated by Jack Lv 5 months ago

hi，my point is that： while _maybe_wait_for_writeback block for max_dirty_bh(calculated by 64k pages)，fluser_entry do not trigger flush

Actions

Copy link

Updated by Patrick Donnelly 5 months ago

Status changed from Need More Info to Fix Under Review
Pull request ID set to 53622

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » RADOS

Custom queries

Bug #62918

all write io block but no flush is triggered

Updated by Venky Shankar 7 months ago

Updated by Radoslaw Zarzynski 6 months ago

Updated by Jack Lv 6 months ago

Updated by Jack Lv 6 months ago

Updated by Radoslaw Zarzynski 6 months ago

Updated by Jack Lv 6 months ago

Updated by Radoslaw Zarzynski 6 months ago

Updated by Jack Lv 5 months ago

Updated by Patrick Donnelly 5 months ago