Bug #48876

closed

osd crash in bluestore code

Added by Jeff Layton over 3 years ago. Updated over 3 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

An OSD crashed during some cephfs testing with experimental MDS and client patches. The build was based on top of commit d20916964984242e51:

 0 0) 0x555eb4ad5680 con 0x555eb8f00880
    -5> 2021-01-14T15:11:54.195+0000 7fc30b3b9700 15 osd.0 16 enqueue_op 0x555f0ef961a0 prio 63 type 42 cost 2906 latency 0.000169 epoch 16 osd_op(mds.0.3:2234971 2.1e 2.31eb5b5e (undecoded) ondisk+write+known_if_redirected+full_force e16) v8
    -4> 2021-01-14T15:11:54.195+0000 7fc30b3b9700 20 osd.0 op_wq(0) _enqueue OpSchedulerItem(2.1e PGOpItem(op=osd_op(mds.0.3:2234971 2.1e 2.31eb5b5e (undecoded) ondisk+write+known_if_redirected+full_force e16) v8) prio 63 cost 2906 e16)
    -3> 2021-01-14T15:11:54.196+0000 7fc30b3b9700  1 -- [v2:192.168.1.3:6802/1394,v1:192.168.1.3:6803/1394] <== mds.0 v2:192.168.1.3:6810/1590666873 328408 ==== osd_op(mds.0.3:2234972 2.1e 2.31eb5b5e (undecoded) ondisk+write+known_if_redirected+full_force e16) v8 ==== 221+0+2361 (crc 0 0 0) 0x555ef98e1400 con 0x555eb8f00880
    -2> 2021-01-14T15:11:54.196+0000 7fc30b3b9700 15 osd.0 16 enqueue_op 0x555ec44dc680 prio 63 type 42 cost 2361 latency 0.000154 epoch 16 osd_op(mds.0.3:2234972 2.1e 2.31eb5b5e (undecoded) ondisk+write+known_if_redirected+full_force e16) v8
    -1> 2021-01-14T15:11:54.196+0000 7fc30b3b9700 20 osd.0 op_wq(0) _enqueue OpSchedulerItem(2.1e PGOpItem(op=osd_op(mds.0.3:2234972 2.1e 2.31eb5b5e (undecoded) ondisk+write+known_if_redirected+full_force e16) v8) prio 63 cost 2361 e16)
     0> 2021-01-14T15:11:54.198+0000 7fc302b1d700 -1 *** Caught signal (Aborted) **
 in thread 7fc302b1d700 thread_name:bstore_aio

 ceph version 16.0.0-8885-ga78b75d9b3e (a78b75d9b3e06d5dc96b4266f8c79f39944b1ccf) pacific (dev)
 1: /ceph/build/bin/ceph-osd(+0x3182de6) [0x555ea0f4ede6]
 2: /lib64/libpthread.so.0(+0x12b20) [0x7fc30e995b20]
 3: gsignal()
 4: abort()
 5: /lib64/libc.so.6(+0x21b09) [0x7fc30d5e7b09]
 6: /lib64/libc.so.6(+0x2fde6) [0x7fc30d5f5de6]
 7: (boost::intrusive::list_impl<boost::intrusive::mhtraits<BlueStore::OpSequencer, boost::intrusive::list_member_hook<>, &BlueStore::OpSequencer::deferred_osr_queue_item>, unsigned long, true, void>::iterator_to(BlueStore::OpSequencer&)+0xdf) [0x555ea0df6499]
 8: (BlueStore::_deferred_aio_finish(BlueStore::OpSequencer*)+0x334) [0x555ea0d96fcc]
 9: (BlueStore::DeferredBatch::aio_finish(BlueStore*)+0x27) [0x555ea0dd5c95]
 10: /ceph/build/bin/ceph-osd(+0x2f73be8) [0x555ea0d3fbe8]
 11: (KernelDevice::_aio_thread()+0x12ac) [0x555ea16232ea]
 12: (KernelDevice::AioCompletionThread::entry()+0x1c) [0x555ea162ce28]
 13: (Thread::entry_wrapper()+0x83) [0x555ea0fd5c51]
 14: (Thread::_entry_func(void*)+0x18) [0x555ea0fd5bc4]
 15: /lib64/libpthread.so.0(+0x814a) [0x7fc30e98b14a]
 16: clone()
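
For anyone comparing this backtrace against the one in a related report, the symbolized frames can be pulled out mechanically. The following is a minimal sketch, not part of the original report: `FRAME_RE` and `parse_frames` are illustrative names, and the pattern assumes the frame layout shown above (`N: (symbol+0xOFFSET) [0xADDRESS]`). Frames without a symbol in parentheses, such as the raw libc and ceph-osd entries, are skipped.

```python
import re

# Assumed frame layout, taken from the trace above:
#   " 8: (BlueStore::_deferred_aio_finish(BlueStore::OpSequencer*)+0x334) [0x555ea0d96fcc]"
FRAME_RE = re.compile(
    r"^\s*(\d+):\s+\((.+)\+0x([0-9a-f]+)\)\s+\[0x([0-9a-f]+)\]\s*$"
)


def parse_frames(trace):
    """Return the symbolized frames of a ceph-osd backtrace as dicts."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            # Unsymbolized frames ("/lib64/libc.so.6(+0x...)", "clone()") land here.
            continue
        frames.append({
            "index": int(match.group(1)),
            "symbol": match.group(2),
            "offset": int(match.group(3), 16),
            "address": int(match.group(4), 16),
        })
    return frames


# A few lines copied from the trace in this report.
SAMPLE = """\
 6: /lib64/libc.so.6(+0x2fde6) [0x7fc30d5f5de6]
 8: (BlueStore::_deferred_aio_finish(BlueStore::OpSequencer*)+0x334) [0x555ea0d96fcc]
 9: (BlueStore::DeferredBatch::aio_finish(BlueStore*)+0x27) [0x555ea0dd5c95]
 16: clone()
"""

if __name__ == "__main__":
    for frame in parse_frames(SAMPLE):
        print(frame["index"], frame["symbol"])
```

Comparing the resulting symbol lists from two reports gives a quick first check of whether they share a code path, although differing symptoms can still share a root cause.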

I was able to start the OSD back up again and it's still running for now.

I'm basically just mounting a cephfs filesystem and running xfstest generic/013 on it (fsstress test). I have a full log, but it's pretty large.


Files

osd.log.crash.gz (241 KB): last 10k lines of log. Jeff Layton, 01/14/2021 03:43 PM

Related issues 1 (0 open, 1 closed)

Is duplicate of bluestore - Bug #48776: ObjectStore/StoreTest hangs (Resolved)

Actions #1

Updated by Neha Ojha over 3 years ago

  • Project changed from RADOS to bluestore
Actions #2

Updated by Igor Fedotov over 3 years ago

@Jeff Lee - could you please share another 10000 lines of log, the ones immediately preceding the segment you've already attached?
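
A segment like the one requested can be cut from a full log without loading the whole file into memory. This is a minimal sketch under stated assumptions: `lines_before_tail` is a hypothetical helper name, the log is plain text or gzip-compressed, and the already-attached segment is exactly the last 10000 lines of the file.

```python
import gzip
from collections import deque


def lines_before_tail(path, skip_last=10000, count=10000):
    """Return up to `count` lines that immediately precede the final `skip_last` lines."""
    # Keep only the trailing window of the file while streaming through it.
    window = deque(maxlen=skip_last + count)
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as log:
        for line in log:
            window.append(line)
    kept = list(window)
    # Drop the final `skip_last` lines, which were already shared.
    return kept[:max(0, len(kept) - skip_last)]


if __name__ == "__main__":
    import sys

    # Usage (the path is whatever full OSD log is available):
    #   python slice_log.py /path/to/osd.log > osd.log.prior
    sys.stdout.writelines(lines_before_tail(sys.argv[1]))
```

If the file holds fewer than `skip_last + count` lines, the function returns whatever precedes the tail, which may be an empty list.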

Actions #3

Updated by Jeff Layton over 3 years ago

Unfortunately, I don't have the rest of the log after all. I'm OOTO for a few days, but should be back on Monday. I'll see if I can reproduce it then and get a larger log segment.

Actions #4

Updated by Igor Fedotov over 3 years ago

  • Status changed from New to Duplicate
  • Parent task set to #48776

Despite the different symptoms, the root cause is pretty much the same: an osr locking regression caused by https://github.com/ceph/ceph/pull/30027

Actions #5

Updated by Igor Fedotov over 3 years ago

  • Parent task deleted (#48776)
Actions #6

Updated by Igor Fedotov over 3 years ago

  • Is duplicate of Bug #48776: ObjectStore/StoreTest hangs added