Project

General

Profile

Bug #18375

bluestore: bluefs_preextend_wal_files=true is not crash consistent

Added by Bassam Tabbara over 2 years ago. Updated 3 months ago.

Status:
Resolved
Priority:
Low
Assignee:
Target version:
Start date:
12/31/2016
Due date:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

We're seeing a case when running 11.1.0 bits in Rook where a crash happens (due to some mempool issue) and the OSD fails to start subsequently. The full log is here https://www.dropbox.com/s/9v9xp2x6m1fpzpl/20161230.castle6.log.gz?dl=0 (its large since bluefs logging is set to 20 -- as requested by Sage). Its osd10 in the logs. Here's a summary of events.

First a the mempool crash at around Dec 30 23:32:43

Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167024 I | osd10: 2016-12-30 23:32:43.166982 7f7131d43700 -1 *** Caught signal (Segmentation fault) **
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167046 I | osd10:  in thread 7f7131d43700 thread_name:rookd
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167055 I | osd10:
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167063 I | osd10:  ceph version 11.0.2-2419-g9f96471 (9f964718530f45b7e89e8b5eec9d889f99053f8d)
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167072 I | osd10:  1: [0x185c14e]
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167080 I | osd10:  2: [0xc2cfb0]
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167088 I | osd10:  3: [0xc147ef]
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167096 I | osd10:  4: [0x23967e2]
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167104 I | osd10:  5: [0xef4213]
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167111 I | osd10:  6: [0xf00575]
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167119 I | osd10:  7: [0x1f03a27]
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167126 I | osd10:  8: [0x1f03fd0]
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167133 I | osd10:  9: [0x1d090fb]
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167140 I | osd10:  10: [0x1d12fe5]
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167148 I | osd10:  11: [0x1f0a3b6]
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167156 I | osd10:  12: [0x1f05571]
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167164 I | osd10:  13: [0x22661ff]
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167171 I | osd10:  14: [0xc234b5]
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167178 I | osd10:  15: [0x234be3b]
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167186 I | osd10:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Dec 30 23:32:43 castle6 rkt[1859]: 2016-12-30 23:32:43.167192 I | osd10:

Here's the related backtrace (we're having an issue with symbols):

(gdb) bt
#0  raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x000000000185c2ec in reraise_fatal (signum=11) at /home/rook/go/src/github.com/rook/rook/ceph/src/global/signal_handler.cc:72
#2  handle_fatal_signal (signum=11) at /home/rook/go/src/github.com/rook/rook/ceph/src/global/signal_handler.cc:134
#3  <signal handler called>
#4  0x0000000000c147ef in (anonymous namespace)::do_memalign(unsigned long, unsigned long) ()
#5  0x00000000023967e2 in tc_posix_memalign ()
#6  0x0000000000ef4213 in mempool::pool_allocator<(mempool::pool_index_t)3, char>::allocate_aligned (p=0x0, align=8, n=4096, this=0x32a8a20 <mempool::buffer_data::alloc_char>) at /home/rook/go/src/github.com/rook/rook/ceph/src/include/mempool.h:322
#7  ceph::buffer::raw_combined::create (align=8, len=4008) at /home/rook/go/src/github.com/rook/rook/ceph/src/common/buffer.cc:272
#8  ceph::buffer::list::append (this=this@entry=0x7f7131d3f2f0, data=0x1c6f39c0 "\345\372\223K&J\364>v\313\253\a@\n\300\225\002N\001\321\230\375Y\256u\313cQ\211\324\315\373", len=32) at /home/rook/go/src/github.com/rook/rook/ceph/src/common/buffer.cc:1810
#9  0x0000000000f00575 in CryptoAESKeyHandler::encrypt (this=<optimized out>, in=..., out=..., error=0x0) at /home/rook/go/src/github.com/rook/rook/ceph/src/auth/Crypto.cc:166
#10 0x0000000001f03a27 in CryptoKey::encrypt (error=0x0, cct=<optimized out>, out=..., in=..., this=0x1b3358f8) at /home/rook/go/src/github.com/rook/rook/ceph/src/auth/Crypto.h:110
#11 CephxSessionHandler::_calc_signature (this=this@entry=0x1b3358e0, m=m@entry=0x1b3e6000, psig=psig@entry=0x7f7131d3f4b8) at /home/rook/go/src/github.com/rook/rook/ceph/src/auth/cephx/CephxSessionHandler.cc:52
#12 0x0000000001f03fd0 in CephxSessionHandler::sign_message (this=0x1b3358e0, m=0x1b3e6000) at /home/rook/go/src/github.com/rook/rook/ceph/src/auth/cephx/CephxSessionHandler.cc:77
#13 0x0000000001d090fb in AsyncConnection::write_message (this=this@entry=0x1b24e000, m=m@entry=0x1b3e6000, bl=..., more=<optimized out>) at /home/rook/go/src/github.com/rook/rook/ceph/src/msg/async/AsyncConnection.cc:2174
#14 0x0000000001d12fe5 in AsyncConnection::handle_write (this=0x1b24e000) at /home/rook/go/src/github.com/rook/rook/ceph/src/msg/async/AsyncConnection.cc:2412
#15 0x0000000001f0a3b6 in EventCenter::process_events (this=this@entry=0xc3ace80, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000) at /home/rook/go/src/github.com/rook/rook/ceph/src/msg/async/Event.cc:430
#16 0x0000000001f05571 in NetworkStack::<lambda()>::operator()(void) const (__closure=0xc5e26f8) at /home/rook/go/src/github.com/rook/rook/ceph/src/msg/async/Stack.cc:46
#17 0x00000000022661ff in execute_native_thread_routine ()
#18 0x0000000000c234b5 in start_thread (arg=0x7f7131d43700) at pthread_create.c:333
#19 0x000000000234be3b in clone ()
(gdb)

this leads to the OSD process crashing and then its restarted. When it attempts to open the rocksdb db it runs into "rocksdb: Corruption: missing start of fragmented record(2)" errors:

Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.919280 I | osd10: 2016-12-30 23:32:45.919217 bfb4240 20 bluefs _read left 0x100000 len 0x8000
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.919297 I | osd10: 2016-12-30 23:32:45.919231 bfb4240 20 bluefs _read got 32768
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.919409 I | osd10: 2016-12-30 23:32:45.919381 bfb4240 10 bluefs _read h 0xcff7c00 0x708000~8000 from file(ino 20 size 0x800000 mtime 2016-12-30 23:32:34.858951 bdev 0 extents [0:0x500000+800000])
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.919419 I | osd10: 2016-12-30 23:32:45.919390 bfb4240 20 bluefs _read left 0xf8000 len 0x8000
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.919429 I | osd10: 2016-12-30 23:32:45.919398 bfb4240 20 bluefs _read got 32768
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.919548 I | osd10: 2016-12-30 23:32:45.919524 bfb4240 10 bluefs _read h 0xcff7c00 0x710000~8000 from file(ino 20 size 0x800000 mtime 2016-12-30 23:32:34.858951 bdev 0 extents [0:0x500000+800000])
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.919558 I | osd10: 2016-12-30 23:32:45.919530 bfb4240 20 bluefs _read left 0xf0000 len 0x8000
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.919565 I | osd10: 2016-12-30 23:32:45.919538 bfb4240 20 bluefs _read got 32768
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.919721 I | osd10: 2016-12-30 23:32:45.919693 bfb4240 10 bluefs _read h 0xcff7c00 0x718000~8000 from file(ino 20 size 0x800000 mtime 2016-12-30 23:32:34.858951 bdev 0 extents [0:0x500000+800000])
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.919732 I | osd10: 2016-12-30 23:32:45.919701 bfb4240 20 bluefs _read left 0xe8000 len 0x8000
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.919742 I | osd10: 2016-12-30 23:32:45.919709 bfb4240 20 bluefs _read got 32768
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.919862 I | osd10: 2016-12-30 23:32:45.919838 bfb4240 10 bluefs _read h 0xcff7c00 0x720000~8000 from file(ino 20 size 0x800000 mtime 2016-12-30 23:32:34.858951 bdev 0 extents [0:0x500000+800000])
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.919878 I | osd10: 2016-12-30 23:32:45.919845 bfb4240 20 bluefs _read left 0xe0000 len 0x8000
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.919886 I | osd10: 2016-12-30 23:32:45.919852 bfb4240 20 bluefs _read got 32768
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.920031 I | osd10: 2016-12-30 23:32:45.920008 bfb4240 10 bluefs _read h 0xcff7c00 0x728000~8000 from file(ino 20 size 0x800000 mtime 2016-12-30 23:32:34.858951 bdev 0 extents [0:0x500000+800000])
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.920041 I | osd10: 2016-12-30 23:32:45.920014 bfb4240 20 bluefs _read left 0xd8000 len 0x8000
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.920049 I | osd10: 2016-12-30 23:32:45.920022 bfb4240 20 bluefs _read got 32768
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.920157 I | osd10: 2016-12-30 23:32:45.920134 bfb4240 10 bluefs _read h 0xcff7c00 0x730000~8000 from file(ino 20 size 0x800000 mtime 2016-12-30 23:32:34.858951 bdev 0 extents [0:0x500000+800000])
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.920166 I | osd10: 2016-12-30 23:32:45.920140 bfb4240 20 bluefs _read left 0xd0000 len 0x8000
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.920174 I | osd10: 2016-12-30 23:32:45.920148 bfb4240 20 bluefs _read got 32768
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.920184 I | osd10: 2016-12-30 23:32:45.920154 bfb4240  3 rocksdb: db.wal/000012.log: dropping 549 bytes; Corruption: missing start of fragmented record(2)
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.920425 I | osd10: 2016-12-30 23:32:45.920398 bfb4240 10 bluefs unlock_file 0xc7c9850 on file(ino 2 size 0x0 mtime 2016-12-30 23:30:18.189715 bdev 0 extents [])
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.920474 I | osd10: 2016-12-30 23:32:45.920442 bfb4240 -1 rocksdb: Corruption: missing start of fragmented record(2)
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.920486 I | osd10: 2016-12-30 23:32:45.920450 bfb4240 -1 bluestore(/var/lib/rook/osd10) _open_db erroring opening db:
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.920495 I | osd10: 2016-12-30 23:32:45.920453 bfb4240  1 bluefs umount
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.920505 I | osd10: 2016-12-30 23:32:45.920455 bfb4240 10 bluefs sync_metadata - no pending log events
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.920512 I | osd10: 2016-12-30 23:32:45.920457 bfb4240 10 bluefs _close_writer 0xd0fcd00 type 0
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.920520 I | osd10: 2016-12-30 23:32:45.920461 bfb4240 20 bluefs _stop_alloc
Dec 30 23:32:46 castle6 rkt[1859]: 2016-12-30 23:32:45.921630 I | osd10: 2016-12-30 23:32:45.921572 bfb4240  1 bdev(/var/lib/rook/osd10/block.wal) close

And the OSD never recovers. This is quite reproducible so let me know if you want us to try a patch.

Not sure what lead to the mempool issue in the first place. I can open a different ticket on that if needed.


Related issues

Related to Ceph - Bug #18338: bluestore_preextend_wal_files appears to corrupt logs Duplicate 12/22/2016

History

#1 Updated by Sage Weil over 2 years ago

  • Assignee set to Sage Weil
  • Priority changed from Normal to Immediate

#2 Updated by Sage Weil over 2 years ago

Can you try to reproduce it with bluefs_preextend_wal_files = false ?

#3 Updated by Bassam Tabbara over 2 years ago

we'll give it a try and get back to you here.

#4 Updated by Bassam Tabbara over 2 years ago

we do not see the rocksdb corruption with bluefs_preextend_wal_files set to false. When the OSD crash (memory issue above), they seem to come back up.

#5 Updated by Sage Weil over 2 years ago

  • Subject changed from blustore not crash consistent in 11.1.0 to bluestore: bluefs_preextend_wal_files=true is not crash consistent
  • Status changed from New to Verified
  • Priority changed from Immediate to Urgent

#6 Updated by Sage Weil over 2 years ago

  • Priority changed from Urgent to High

#7 Updated by Sage Weil over 2 years ago

  • Related to Bug #18338: bluestore_preextend_wal_files appears to corrupt logs added

#8 Updated by Greg Farnum over 2 years ago

  • Project changed from Ceph to RADOS
  • Category changed from 107 to Correctness/Safety
  • Component(RADOS) BlueStore added

#9 Updated by Sage Weil about 2 years ago

  • Priority changed from High to Normal

#10 Updated by Greg Farnum almost 2 years ago

  • Project changed from RADOS to bluestore
  • Category deleted (Correctness/Safety)

#11 Updated by Sage Weil over 1 year ago

  • Priority changed from Normal to Low

#12 Updated by Igor Fedotov 3 months ago

  • Pull request ID set to 26909

#13 Updated by Igor Fedotov 3 months ago

  • Status changed from Verified to Resolved

This is also duplicated by https://tracker.ceph.com/issues/38559

Marking this as resolved, backporting to be tracked in #38559.

Also available in: Atom PDF