Bug #18436: Qemu crash triggered by network issues - rbd - Ceph

Actions

Copy link

Bug #18436

closed

Qemu crash triggered by network issues

Added by Brian Andrus over 7 years ago. Updated over 6 years ago.

Status:

Resolved

Priority:

High

Assignee:

Jason Dillaman

Target version:

% Done:

Source:

Development

Tags:

Backport:

kraken,jewel,hammer

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Noticed in the form of I/O errors and disk corruption at the operating system level. Further investigation uncovered this in the qemu log:

osdc/ObjectCacher.cc: In function 'void ObjectCacher::trim()' thread 7fdb28f8e700 time 2017-01-05 21:56:28.221215 osdc/ObjectCacher.cc: 1001: FAILED assert(bh->is_clean() || bh->is_zero()) ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90) 1: (()+0x16c3db) [0x7fdfe86513db] 2: (()+0x3a2836) [0x7fdfe8887836] 3: (()+0x3b0c7b) [0x7fdfe8895c7b] 4: (()+0x3b7c70) [0x7fdfe889cc70] 5: (()+0x4c7c9) [0x7fdfe85317c9] 6: (()+0x3b4bd4) [0x7fdfe8899bd4] 7: (()+0x3aa12d) [0x7fdfe888f12d] 8: (()+0x3b6ee1) [0x7fdfe889bee1] 9: (()+0x4c7c9) [0x7fdfe85317c9] 10: (()+0xa2105) [0x7fdfe8587105] 11: (()+0x4c7c9) [0x7fdfe85317c9] 12: (()+0x953ad) [0x7fdfeab623ad] 13: (()+0x70469) [0x7fdfeab3d469] 14: (()+0x134928) [0x7fdfeac01928] 15: (()+0x8184) [0x7fdfe43aa184] 16: (clone()+0x6d) [0x7fdfe40d737d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. terminate called after throwing an instance of 'ceph::FailedAssertion' 2017-01-05 21:56:28.636+0000: shutting down

At the time of issues, we were experiencing some network infrastructure issues that were causing intermittent packet loss.

Related issues 3 (0 open — 3 closed)

Actions

Copy link

Updated by Dan Mick over 7 years ago

Category set to librbd
Assignee set to Jason Dillaman
Priority changed from Normal to High
Target version set to v0.94.10
Source set to Development

Jason, Brian filed this after investigating a problem today with docs.ceph.com's data drive (in the Dreamhost iad2 openstack instance). A very similar assertion failure occurred a few days ago with that same host's root drive:

[15:10:23] <fzylogic> and the qemu log shows you hit an assert in I'm guessing librbd
[15:10:33] <fzylogic> osdc/ObjectCacher.cc: In function 'void ObjectCacher::trim()' thread 7f05b57fa700 time 2016-12-31 06:50:42.595707 osdc/ObjectCacher.cc: 1001: FAILED assert(bh->is_clean() || bh->is_zero())

so it seems to be a recurring issue.

I know details are thin here.

Actions

Copy link

Updated by Dan Mick over 7 years ago

I do have a snapshot of the root drive from the failure on 12/31, and looked at it briefly to see if I could spot some kind of corruption based on what fsck said, but had no success. I could conceivably copy that 40GB image somewhere if it were useful in some way.

Actions

Copy link

Updated by Jason Dillaman over 7 years ago

Note: there is a known cache corruption bug [1] that is scheduled to be fixed in 0.94.10, but that wouldn't explain the assertion failure.

[1] http://tracker.ceph.com/issues/16546

Actions

Copy link

Updated by Jason Dillaman over 7 years ago

Status changed from New to Need More Info

@Brian: any chance you had your nodes configured to capture core dumps? That would make solving this relatively simple.

Actions

Copy link

Updated by Jason Dillaman over 7 years ago

Project changed from Ceph to rbd
Category deleted (~~librbd~~)

Actions

Copy link

Updated by Jason Dillaman over 7 years ago

It's possible that somehow a cache read request failed (which shouldn't be possible), which resulted in the bh being marked in the error state [1]. With enough cache pressure, the `trim()` would have attempted to evict a block in the error state and would have failed the assertion.

I still need to verify this assumption but even fixing this assertion failure doesn't get to the root cause of why the read request failed (which would still result in IO errors within your VM).

@Brian: do you have any non-default settings in your Ceph configuration (either client or server side)? OSD timeouts?

[1] https://github.com/ceph/ceph/blob/v0.94.9/src/osdc/ObjectCacher.cc#L807

Actions

Copy link

Updated by Brian Andrus over 7 years ago

Jason Dillaman wrote:

@Brian: any chance you had your nodes configured to capture core dumps? That would make solving this relatively simple.

Provided via private channels due to sensitive information contained within.

@Brian: do you have any non-default settings in your Ceph configuration (either client or server side)? OSD timeouts?

[1] https://github.com/ceph/ceph/blob/v0.94.9/src/osdc/ObjectCacher.cc#L807

[client]
rados mon op timeout = 180
osd op timeout = 180

I realize default is 0, and could explain the bulk of our problems. I'll try to determine when/why this was set in our environment.

Actions

Copy link

Updated by Jason Dillaman over 7 years ago

@Brian: Thanks! That "osd op timeout = 180" setting is the problem -- it would have caused "ETIMEDOUT" errors to bubble up to your VMs and it set the stage for the cache to have bhs marked as error state. I would recommend yanking that setting post haste from any librbd client configs.

Actions

Copy link

Updated by Jason Dillaman over 7 years ago

The provided core dump confirms that the bh state was error with an error code of ETIMEDOUT (-110).

(gdb) print (*('ObjectCacher::BufferHead'*)0x00007f4748ff3380)
$17 = {
  <LRUObject> = {
    lru_next = 0x0, 
    lru_prev = 0x0, 
    lru_pinned = false, 
    lru = 0x0, 
    lru_list = 0x0
  }, 
  members of ObjectCacher::BufferHead: 
  static STATE_MISSING = 0, 
  static STATE_CLEAN = 1, 
  static STATE_ZERO = 2, 
  static STATE_DIRTY = 3, 
  static STATE_RX = 4, 
  static STATE_TX = 5, 
  static STATE_ERROR = 6, 
  state = 6, 
  ref = 0, 
  ex = {
    start = 184320, 
    length = 8192
  }, 
  dontneed = false, 
  ob = 0x7f4748377250, 
  bl = {
    _buffers = empty std::list, 
    _len = 0, 
    _memcopy_count = 0, 
    append_buffer = {
      _raw = 0x0, 
      _off = 0, 
      _len = 0
    }, 
    last_p = {
      bl = 0x7f4748ff33d0, 
      ls = 0x7f4748ff33d0, 
      off = 0, 
      p = {
        _raw = , 
        _off = 0, 
        _len = 0
      }, 
      p_off = 0
    }, 
    static CLAIM_DEFAULT = 0, 
    static CLAIM_ALLOW_NONSHAREABLE = 1
  }, 
  last_write_tid = 0, 
  last_read_tid = 44752963, 
  last_write = {
    tv = {
      tv_sec = 0, 
      tv_nsec = 0
    }
  }, 
  snapc = {
    seq = {
      val = 0
    }, 
    snaps = std::vector of length 0, capacity 0
  }, 
  error = -110, 
  waitfor_read = std::map with 0 elements
}

Actions

Copy link

#10