Bug #48876 (Closed)
osd crash in bluestore code
Description
OSD crash seen while doing cephfs testing with some experimental MDS and client patches. The build was based on top of commit d20916964984242e51:
0 0) 0x555eb4ad5680 con 0x555eb8f00880
    -5> 2021-01-14T15:11:54.195+0000 7fc30b3b9700 15 osd.0 16 enqueue_op 0x555f0ef961a0 prio 63 type 42 cost 2906 latency 0.000169 epoch 16 osd_op(mds.0.3:2234971 2.1e 2.31eb5b5e (undecoded) ondisk+write+known_if_redirected+full_force e16) v8
    -4> 2021-01-14T15:11:54.195+0000 7fc30b3b9700 20 osd.0 op_wq(0) _enqueue OpSchedulerItem(2.1e PGOpItem(op=osd_op(mds.0.3:2234971 2.1e 2.31eb5b5e (undecoded) ondisk+write+known_if_redirected+full_force e16) v8) prio 63 cost 2906 e16)
    -3> 2021-01-14T15:11:54.196+0000 7fc30b3b9700  1 -- [v2:192.168.1.3:6802/1394,v1:192.168.1.3:6803/1394] <== mds.0 v2:192.168.1.3:6810/1590666873 328408 ==== osd_op(mds.0.3:2234972 2.1e 2.31eb5b5e (undecoded) ondisk+write+known_if_redirected+full_force e16) v8 ==== 221+0+2361 (crc 0 0 0) 0x555ef98e1400 con 0x555eb8f00880
    -2> 2021-01-14T15:11:54.196+0000 7fc30b3b9700 15 osd.0 16 enqueue_op 0x555ec44dc680 prio 63 type 42 cost 2361 latency 0.000154 epoch 16 osd_op(mds.0.3:2234972 2.1e 2.31eb5b5e (undecoded) ondisk+write+known_if_redirected+full_force e16) v8
    -1> 2021-01-14T15:11:54.196+0000 7fc30b3b9700 20 osd.0 op_wq(0) _enqueue OpSchedulerItem(2.1e PGOpItem(op=osd_op(mds.0.3:2234972 2.1e 2.31eb5b5e (undecoded) ondisk+write+known_if_redirected+full_force e16) v8) prio 63 cost 2361 e16)
     0> 2021-01-14T15:11:54.198+0000 7fc302b1d700 -1 *** Caught signal (Aborted) **
 in thread 7fc302b1d700 thread_name:bstore_aio

 ceph version 16.0.0-8885-ga78b75d9b3e (a78b75d9b3e06d5dc96b4266f8c79f39944b1ccf) pacific (dev)
 1: /ceph/build/bin/ceph-osd(+0x3182de6) [0x555ea0f4ede6]
 2: /lib64/libpthread.so.0(+0x12b20) [0x7fc30e995b20]
 3: gsignal()
 4: abort()
 5: /lib64/libc.so.6(+0x21b09) [0x7fc30d5e7b09]
 6: /lib64/libc.so.6(+0x2fde6) [0x7fc30d5f5de6]
 7: (boost::intrusive::list_impl<boost::intrusive::mhtraits<BlueStore::OpSequencer, boost::intrusive::list_member_hook<>, &BlueStore::OpSequencer::deferred_osr_queue_item>, unsigned long, true, void>::iterator_to(BlueStore::OpSequencer&)+0xdf) [0x555ea0df6499]
 8: (BlueStore::_deferred_aio_finish(BlueStore::OpSequencer*)+0x334) [0x555ea0d96fcc]
 9: (BlueStore::DeferredBatch::aio_finish(BlueStore*)+0x27) [0x555ea0dd5c95]
 10: /ceph/build/bin/ceph-osd(+0x2f73be8) [0x555ea0d3fbe8]
 11: (KernelDevice::_aio_thread()+0x12ac) [0x555ea16232ea]
 12: (KernelDevice::AioCompletionThread::entry()+0x1c) [0x555ea162ce28]
 13: (Thread::entry_wrapper()+0x83) [0x555ea0fd5c51]
 14: (Thread::_entry_func(void*)+0x18) [0x555ea0fd5bc4]
 15: /lib64/libpthread.so.0(+0x814a) [0x7fc30e98b14a]
 16: clone()
I was able to start the OSD back up again and it's still running for now.
I'm basically just mounting a cephfs filesystem and running xfstests generic/013 (an fsstress test) on it. I have a full log, but it's pretty large.
Updated by Igor Fedotov over 3 years ago
@Jeff Layton - would you please share another 10,000 lines of log prior to the ones you've already attached?
Updated by Jeff Layton over 3 years ago
Unfortunately, I don't have the rest of the log after all. I'm OOTO for a few days, but should be back on Monday. I'll see if I can reproduce it then and get a larger log segment.
Updated by Igor Fedotov over 3 years ago
- Status changed from New to Duplicate
- Parent task set to #48776
Despite the different symptoms, the root cause is the same: an osr locking regression caused by https://github.com/ceph/ceph/pull/30027
Updated by Igor Fedotov over 3 years ago
- Is duplicate of Bug #48776: ObjectStore/StoreTest hangs added