Project

General

Profile

Bug #18436

Qemu crash triggered by network issues

Added by Brian Andrus 4 months ago. Updated 3 months ago.

Status:
Pending Backport
Priority:
High
Target version:
-
Start date:
01/05/2017
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
kraken,jewel,hammer
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No

Description

Noticed in the form of I/O errors and disk corruption at the operating system level. Further investigation uncovered this in the qemu log:

osdc/ObjectCacher.cc: In function 'void ObjectCacher::trim()' thread 7fdb28f8e700 time 2017-01-05 21:56:28.221215
osdc/ObjectCacher.cc: 1001: FAILED assert(bh->is_clean() || bh->is_zero())
ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
1: (()+0x16c3db) [0x7fdfe86513db]
2: (()+0x3a2836) [0x7fdfe8887836]
3: (()+0x3b0c7b) [0x7fdfe8895c7b]
4: (()+0x3b7c70) [0x7fdfe889cc70]
5: (()+0x4c7c9) [0x7fdfe85317c9]
6: (()+0x3b4bd4) [0x7fdfe8899bd4]
7: (()+0x3aa12d) [0x7fdfe888f12d]
8: (()+0x3b6ee1) [0x7fdfe889bee1]
9: (()+0x4c7c9) [0x7fdfe85317c9]
10: (()+0xa2105) [0x7fdfe8587105]
11: (()+0x4c7c9) [0x7fdfe85317c9]
12: (()+0x953ad) [0x7fdfeab623ad]
13: (()+0x70469) [0x7fdfeab3d469]
14: (()+0x134928) [0x7fdfeac01928]
15: (()+0x8184) [0x7fdfe43aa184]
16: (clone()+0x6d) [0x7fdfe40d737d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
2017-01-05 21:56:28.636+0000: shutting down

At the time of issues, we were experiencing some network infrastructure issues that were causing intermittent packet loss.


Related issues

Copied to Backport #18774: hammer: Qemu crash triggered by network issues In Progress
Copied to Backport #18775: jewel: Qemu crash triggered by network issues Resolved
Copied to Backport #18776: kraken: Qemu crash triggered by network issues Resolved

History

#1 Updated by Dan Mick 4 months ago

  • Category set to librbd
  • Assignee set to Jason Dillaman
  • Priority changed from Normal to High
  • Target version set to v0.94.10
  • Source set to Development

Jason, Brian filed this after investigating a problem today with docs.ceph.com's data drive (in the Dreamhost iad2 openstack instance). A very similar assertion failure occurred a few days ago with that same host's root drive:

[15:10:23] <fzylogic> and the qemu log shows you hit an assert in I'm guessing librbd
[15:10:33] <fzylogic> osdc/ObjectCacher.cc: In function 'void ObjectCacher::trim()' thread 7f05b57fa700 time 2016-12-31 06:50:42.595707 osdc/ObjectCacher.cc: 1001: FAILED assert(bh->is_clean() || bh->is_zero())

so it seems to be a recurring issue.

I know details are thin here.

#2 Updated by Dan Mick 4 months ago

I do have a snapshot of the root drive from the failure on 12/31, and looked at it briefly to see if I could spot some kind of corruption based on what fsck said, but had no success. I could conceivably copy that 40GB image somewhere if it were useful in some way.

#3 Updated by Jason Dillaman 4 months ago

Note: there is a known cache corruption bug [1] that is scheduled to be fixed in 0.94.10, but that wouldn't explain the assertion failure.

[1] http://tracker.ceph.com/issues/16546

#4 Updated by Jason Dillaman 4 months ago

  • Status changed from New to Need More Info

@Brian: any chance you had your nodes configured to capture core dumps? That would make solving this relatively simple.

#5 Updated by Jason Dillaman 4 months ago

  • Project changed from Ceph to rbd
  • Category deleted (librbd)

#6 Updated by Jason Dillaman 4 months ago

It's possible that somehow a cache read request failed (which shouldn't be possible), which resulted in the bh being marked in the error state [1]. With enough cache pressure, the `trim()` would have attempted to evict a block in the error state and would have failed the assertion.

I still need to verify this assumption but even fixing this assertion failure doesn't get to the root cause of why the read request failed (which would still result in IO errors within your VM).

@Brian: do you have any non-default settings in your Ceph configuration (either client or server side)? OSD timeouts?

[1] https://github.com/ceph/ceph/blob/v0.94.9/src/osdc/ObjectCacher.cc#L807

#7 Updated by Brian Andrus 4 months ago

Jason Dillaman wrote:

@Brian: any chance you had your nodes configured to capture core dumps? That would make solving this relatively simple.

Provided via private channels due to sensitive information contained within.

@Brian: do you have any non-default settings in your Ceph configuration (either client or server side)? OSD timeouts?

[1] https://github.com/ceph/ceph/blob/v0.94.9/src/osdc/ObjectCacher.cc#L807

[client]
rados mon op timeout = 180
osd op timeout = 180

I realize default is 0, and could explain the bulk of our problems. I'll try to determine when/why this was set in our environment.

#8 Updated by Jason Dillaman 4 months ago

@Brian: Thanks! That "osd op timeout = 180" setting is the problem -- it would have caused "ETIMEDOUT" errors to bubble up to your VMs and it set the stage for the cache to have bhs marked as error state. I would recommend yanking that setting post haste from any librbd client configs.

#9 Updated by Jason Dillaman 4 months ago

The provided core dump confirms that the bh state was error with an error code of ETIMEDOUT (-110).

(gdb) print (*('ObjectCacher::BufferHead'*)0x00007f4748ff3380)
$17 = {
  <LRUObject> = {
    lru_next = 0x0, 
    lru_prev = 0x0, 
    lru_pinned = false, 
    lru = 0x0, 
    lru_list = 0x0
  }, 
  members of ObjectCacher::BufferHead: 
  static STATE_MISSING = 0, 
  static STATE_CLEAN = 1, 
  static STATE_ZERO = 2, 
  static STATE_DIRTY = 3, 
  static STATE_RX = 4, 
  static STATE_TX = 5, 
  static STATE_ERROR = 6, 
  state = 6, 
  ref = 0, 
  ex = {
    start = 184320, 
    length = 8192
  }, 
  dontneed = false, 
  ob = 0x7f4748377250, 
  bl = {
    _buffers = empty std::list, 
    _len = 0, 
    _memcopy_count = 0, 
    append_buffer = {
      _raw = 0x0, 
      _off = 0, 
      _len = 0
    }, 
    last_p = {
      bl = 0x7f4748ff33d0, 
      ls = 0x7f4748ff33d0, 
      off = 0, 
      p = {
        _raw = , 
        _off = 0, 
        _len = 0
      }, 
      p_off = 0
    }, 
    static CLAIM_DEFAULT = 0, 
    static CLAIM_ALLOW_NONSHAREABLE = 1
  }, 
  last_write_tid = 0, 
  last_read_tid = 44752963, 
  last_write = {
    tv = {
      tv_sec = 0, 
      tv_nsec = 0
    }
  }, 
  snapc = {
    seq = {
      val = 0
    }, 
    snaps = std::vector of length 0, capacity 0
  }, 
  error = -110, 
  waitfor_read = std::map with 0 elements
}

#10 Updated by Jason Dillaman 3 months ago

  • Status changed from Need More Info to In Progress

#11 Updated by Jason Dillaman 3 months ago

  • Status changed from In Progress to Need Review
  • Backport set to kraken,jewel,hammer

#12 Updated by Jason Dillaman 3 months ago

  • Target version deleted (v0.94.10)

#13 Updated by Jason Dillaman 3 months ago

  • Status changed from Need Review to Pending Backport

#14 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #18774: hammer: Qemu crash triggered by network issues added

#15 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #18775: jewel: Qemu crash triggered by network issues added

#16 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #18776: kraken: Qemu crash triggered by network issues added

Also available in: Atom PDF