Bug #19983

osds abort on shutdown with assert(/build/ceph-12.0.2/src/os/bluestore/KernelDevice.cc: 364: FAILED assert(r >= 0))

Added by xw zhang almost 7 years ago. Updated almost 7 years ago.

Status: Closed
Priority: Urgent
Assignee: -
Category: Correctness/Safety
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor
Component(RADOS): BlueStore

Description

version:
root@node0:~# ceph -v
ceph version 12.0.2 (5a1b6b3269da99a18984c138c23935e5eb96f73e)

Setup: BlueStore + RBD + EC with overwrites + iSCSI; EC k/m = 4/2; 12 OSDs.

2017-05-17 17:00:24.446351 7f994a645700 -1 /build/ceph-12.0.2/src/os/bluestore/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f994a645700 time 2017-05-17 17:00:24.442872
/build/ceph-12.0.2/src/os/bluestore/KernelDevice.cc: 364: FAILED assert(r >= 0)

ceph version 12.0.2 (5a1b6b3269da99a18984c138c23935e5eb96f73e)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x563f2f015072]
2: (KernelDevice::_aio_thread()+0x1301) [0x563f2ef9ac61]
3: (KernelDevice::AioCompletionThread::entry()+0xd) [0x563f2ef9d45d]
4: (()+0x76ba) [0x7f99523176ba]
5: (clone()+0x6d) [0x7f995138e82d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
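
For context, the failing assert lives in KernelDevice's aio completion thread, which reaps completed ios from the kernel and treats any negative per-io result as fatal. Below is a minimal, hypothetical C++ sketch of that pattern; the function name reap_completions and the batch size are illustrative assumptions, not Ceph's actual code.

#include <libaio.h>
#include <cassert>

// Hypothetical sketch: reap completed aios and abort on any failed io,
// mirroring the assert(r >= 0) in the report. 'ctx' is assumed to have
// been initialized with io_setup() and to have ios in flight via
// io_submit().
void reap_completions(io_context_t ctx) {
  struct io_event events[16];
  // Block until at least one submitted aio completes.
  int n = io_getevents(ctx, 1, 16, events, nullptr);
  assert(n >= 0);  // the reap call itself failed
  for (int i = 0; i < n; ++i) {
    // res is bytes transferred on success, or a negative errno on
    // failure; a pulled disk typically surfaces here as -EIO.
    long r = static_cast<long>(events[i].res);
    assert(r >= 0);  // the KernelDevice.cc:364-style fatal check
  }
}

Under a pattern like this, yanking the backing disk makes in-flight ios complete with -EIO, so the completion thread hits the assert and the OSD aborts, matching the trace above.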
#1

Updated by Brad Hubbard almost 7 years ago

/a/bhubbard-2017-05-24_05:25:43-rados-wip-badone-testing---basic-smithi/1224591/teuthology.log

2017-05-24T07:27:09.928 INFO:tasks.workunit.client.0.smithi084.stderr:2017-05-24 07:27:09.927439 7f8f003e9700  0 -- 172.21.15.84:0/674029674 >> 172.21.15.84:6801/836721 pipe(0x7f8f0801bd10 sd=9 :56304 s=2 pgs=168 cs=1 l=1 c=0x7f8f0801e360).injecting socket failure
2017-05-24T07:27:09.931 INFO:tasks.workunit.client.0.smithi084.stderr:2017-05-24 07:27:09.927602 7f8f2d86c980  0 -- 172.21.15.84:0/674029674 submit_message osd_op(client.4304.0:127 5.7 5:e49d7777:::benchmark_data_smithi084_844084_object126:head [omap-set-vals] snapc 0=[] ondisk+write+known_if_redirected e40) v8 remote, 172.21.15.84:6801/836721, failed lossy con, dropping message 0x55f4a4642fd0
2017-05-24T07:27:10.052 INFO:tasks.ceph.osd.2.smithi084.stderr:/build/ceph-12.0.2-1221-geb5c02d/src/os/bluestore/BlockDevice.h: In function 'void IOContext::aio_wake()' thread 7f6732579700 time 2017-05-24 07:27:10.052106
2017-05-24T07:27:10.055 INFO:tasks.ceph.osd.2.smithi084.stderr:/build/ceph-12.0.2-1221-geb5c02d/src/os/bluestore/BlockDevice.h: 65: FAILED assert(num_running == 0)
2017-05-24T07:27:10.059 INFO:tasks.ceph.osd.2.smithi084.stderr: ceph version 12.0.2-1221-geb5c02d (eb5c02df634b34ca45ecbf1eaf4440ccf806845f)
2017-05-24T07:27:10.063 INFO:tasks.ceph.osd.2.smithi084.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x56299e8f14e2]
2017-05-24T07:27:10.067 INFO:tasks.ceph.osd.2.smithi084.stderr: 2: (KernelDevice::_aio_thread()+0xd94) [0x56299e898384]
2017-05-24T07:27:10.076 INFO:tasks.ceph.osd.2.smithi084.stderr: 3: (KernelDevice::AioCompletionThread::entry()+0xd) [0x56299e89aa9d]
2017-05-24T07:27:10.079 INFO:tasks.ceph.osd.2.smithi084.stderr: 4: (()+0x770a) [0x7f673d09170a]
2017-05-24T07:27:10.082 INFO:tasks.ceph.osd.2.smithi084.stderr: 5: (clone()+0x6d) [0x7f673c10882d]
2017-05-24T07:27:10.088 INFO:tasks.ceph.osd.2.smithi084.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actually, this may be a different failure, but it appears to be a similar race in the same function?
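
For reference, here is a hypothetical sketch of the counter-and-condvar wake pattern that an assert like assert(num_running == 0) in IOContext::aio_wake() guards. The type IOCtx and the methods aio_wake()/aio_finish() are illustrative assumptions, not Ceph's actual BlockDevice.h; the comments mark one plausible shape for such a race.

#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical io-context: counts in-flight aios; the completion
// thread wakes the submitter once the whole batch has finished.
struct IOCtx {
  std::mutex lock;
  std::condition_variable cond;
  std::atomic<int> num_running{0};

  void aio_wake() {
    std::lock_guard<std::mutex> l(lock);
    // The wake is expected to run only after the last aio of a batch
    // has completed. If the context is reused (a new batch submitted)
    // before this wake from the previous batch executes, num_running
    // is already > 0 here and the assert fires.
    assert(num_running == 0);
    cond.notify_all();
  }

  // Called by the completion thread once per finished aio.
  void aio_finish() {
    if (--num_running == 0)
      aio_wake();
  }
};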

#2

Updated by Brad Hubbard almost 7 years ago

  • Priority changed from Normal to Urgent
#3

Updated by xw zhang almost 7 years ago

I pulled out a disk, and then the problem occurred.

#4

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category set to Correctness/Safety
  • Component(RADOS) BlueStore added
#5

Updated by Sage Weil almost 7 years ago

  • Status changed from New to Need More Info

Do you mean you pulled out the disk, and then ceph-osd crashed? That is normal; the disk is gone!

Or, do you mean that you pulled the disk, rebooted the server, and ceph-osd crashed on startup?

#6

Updated by Sage Weil almost 7 years ago

  • Status changed from Need More Info to Closed
