Bug #24639
[segfault] segfault in BlueFS::read
Status: closed
Description
Using ceph-deploy from my admin host, I created two encrypted bluestore OSDs. Between 4 and 24 hours later, both started flapping persistently, with a segfault recorded in the systemd logs. The crash now happens immediately on launch, 100% of the time, on both OSDs.
The crash happens in the same BlueFS call path (BlueFS::mount -> BlueFS::_replay -> BlueFS::_read) for both ceph-osd and ceph-bluestore-tool.
Realizing the affected host was behind the rest of the recently-upgraded-to-Luminous cluster (exact version unknown, probably the latest package in the Canonical repo for 16.04 LTS), I upgraded it via the Luminous PPA, hoping this was an issue with the experimental bluestore code.
I have now reproduced the bug with Luminous, then upgraded the host to 18.04 LTS and further to Mimic, and have not seen any change in behavior. This leads me to believe these OSDs are now in some data state that reliably reproduces the issue.
Jun 23 20:12:19 Largo systemd[1]: ceph-osd@0.service: Service hold-off time over, scheduling restart.
Jun 23 20:12:19 Largo systemd[1]: ceph-osd@0.service: Scheduled restart job, restart counter is at 51.
Jun 23 20:12:19 Largo systemd[1]: Stopped Ceph object storage daemon osd.0.
Jun 23 20:12:19 Largo systemd[1]: Starting Ceph object storage daemon osd.0...
Jun 23 20:12:19 Largo systemd[1]: Started Ceph object storage daemon osd.0.
Jun 23 20:12:20 Largo ceph-osd[58404]: 2018-06-23 20:12:20.022 7fdb54cc2280 -1 Public network was set, but cluster network was not set
Jun 23 20:12:20 Largo ceph-osd[58404]: 2018-06-23 20:12:20.022 7fdb54cc2280 -1 Using public network also for cluster network
Jun 23 20:12:20 Largo ceph-osd[58404]: starting osd.0 at - osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
Jun 23 20:12:20 Largo ceph-osd[58404]: *** Caught signal (Segmentation fault) **
Jun 23 20:12:20 Largo ceph-osd[58404]: in thread 7fdb54cc2280 thread_name:ceph-osd
Jun 23 20:12:20 Largo ceph-osd[58404]: ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
Jun 23 20:12:20 Largo ceph-osd[58404]: 1: (()+0x915140) [0x55ed327c7140]
Jun 23 20:12:20 Largo ceph-osd[58404]: 2: (()+0x12890) [0x7fdb4a5dc890]
Jun 23 20:12:20 Largo ceph-osd[58404]: 3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x367) [
Jun 23 20:12:20 Largo ceph-osd[58404]: 4: (BlueFS::_replay(bool, bool)+0x214) [0x55ed3277c654]
Jun 23 20:12:20 Largo ceph-osd[58404]: 5: (BlueFS::mount()+0x1f1) [0x55ed32780ea1]
Jun 23 20:12:20 Largo ceph-osd[58404]: 6: (BlueStore::_open_db(bool, bool)+0x1840) [0x55ed326abae0]
Jun 23 20:12:20 Largo ceph-osd[58404]: 7: (BlueStore::_mount(bool, bool)+0x4b7) [0x55ed326db407]
Jun 23 20:12:20 Largo ceph-osd[58404]: 8: (OSD::init()+0x295) [0x55ed32286305]
Jun 23 20:12:20 Largo ceph-osd[58404]: 9: (main()+0x268d) [0x55ed3217507d]
Jun 23 20:12:20 Largo ceph-osd[58404]: 10: (__libc_start_main()+0xe7) [0x7fdb49495b97]
Jun 23 20:12:20 Largo ceph-osd[58404]: 11: (_start()+0x2a) [0x55ed3223d38a]
Jun 23 20:12:20 Largo ceph-osd[58404]: 2018-06-23 20:12:20.318 7fdb54cc2280 -1 *** Caught signal (Segmentation fault) **
Jun 23 20:12:20 Largo ceph-osd[58404]: in thread 7fdb54cc2280 thread_name:ceph-osd
Jun 23 20:12:20 Largo ceph-osd[58404]: ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
Jun 23 20:12:20 Largo ceph-osd[58404]: 1: (()+0x915140) [0x55ed327c7140]
Jun 23 20:12:20 Largo ceph-osd[58404]: 2: (()+0x12890) [0x7fdb4a5dc890]
Jun 23 20:12:20 Largo ceph-osd[58404]: 3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x367) [
Jun 23 20:12:20 Largo ceph-osd[58404]: 4: (BlueFS::_replay(bool, bool)+0x214) [0x55ed3277c654]
Jun 23 20:12:20 Largo ceph-osd[58404]: 5: (BlueFS::mount()+0x1f1) [0x55ed32780ea1]
Jun 23 20:12:20 Largo ceph-osd[58404]: 6: (BlueStore::_open_db(bool, bool)+0x1840) [0x55ed326abae0]
Jun 23 20:12:20 Largo ceph-osd[58404]: 7: (BlueStore::_mount(bool, bool)+0x4b7) [0x55ed326db407]
Jun 23 20:12:20 Largo ceph-osd[58404]: 8: (OSD::init()+0x295) [0x55ed32286305]
Jun 23 20:12:20 Largo ceph-osd[58404]: 9: (main()+0x268d) [0x55ed3217507d]
Jun 23 20:12:20 Largo ceph-osd[58404]: 10: (__libc_start_main()+0xe7) [0x7fdb49495b97]
Jun 23 20:12:20 Largo ceph-osd[58404]: 11: (_start()+0x2a) [0x55ed3223d38a]
Jun 23 20:12:20 Largo ceph-osd[58404]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jun 23 20:12:20 Largo ceph-osd[58404]: -39> 2018-06-23 20:12:20.022 7fdb54cc2280 -1 Public network was set, but cluster network was not set
Jun 23 20:12:20 Largo ceph-osd[58404]: -38> 2018-06-23 20:12:20.022 7fdb54cc2280 -1 Using public network also for cluster network
Jun 23 20:12:20 Largo ceph-osd[58404]: 0> 2018-06-23 20:12:20.318 7fdb54cc2280 -1 *** Caught signal (Segmentation fault) **
Jun 23 20:12:20 Largo ceph-osd[58404]: in thread 7fdb54cc2280 thread_name:ceph-osd
Jun 23 20:12:20 Largo ceph-osd[58404]: ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
Jun 23 20:12:20 Largo ceph-osd[58404]: 1: (()+0x915140) [0x55ed327c7140]
Jun 23 20:12:20 Largo ceph-osd[58404]: 2: (()+0x12890) [0x7fdb4a5dc890]
Jun 23 20:12:20 Largo ceph-osd[58404]: 3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x367) [
Jun 23 20:12:20 Largo ceph-osd[58404]: 4: (BlueFS::_replay(bool, bool)+0x214) [0x55ed3277c654]
Jun 23 20:12:20 Largo ceph-osd[58404]: 5: (BlueFS::mount()+0x1f1) [0x55ed32780ea1]
Jun 23 20:12:20 Largo ceph-osd[58404]: 6: (BlueStore::_open_db(bool, bool)+0x1840) [0x55ed326abae0]
Jun 23 20:12:20 Largo ceph-osd[58404]: 7: (BlueStore::_mount(bool, bool)+0x4b7) [0x55ed326db407]
Jun 23 20:12:20 Largo ceph-osd[58404]: 8: (OSD::init()+0x295) [0x55ed32286305]
Jun 23 20:12:20 Largo ceph-osd[58404]: 9: (main()+0x268d) [0x55ed3217507d]
Jun 23 20:12:20 Largo ceph-osd[58404]: 10: (__libc_start_main()+0xe7) [0x7fdb49495b97]
Jun 23 20:12:20 Largo ceph-osd[58404]: 11: (_start()+0x2a) [0x55ed3223d38a]
Jun 23 20:12:20 Largo ceph-osd[58404]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Jun 23 20:12:20 Largo systemd[1]: ceph-osd@0.service: Main process exited, code=dumped, status=11/SEGV
Jun 23 20:12:20 Largo systemd[1]: ceph-osd@0.service: Failed with result 'core-dump'.
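The journal output above contains the same stack trace three times. As a reading aid, here is a minimal Python sketch that pulls the numbered frames out of such journal lines and drops the repeats. `FRAME_RE` and `extract_frames` are hypothetical helper names, not part of Ceph, and the regular expression assumes the line layout shown in this report.

```python
import re

# Assumed layout of a frame line in the journal, e.g.
#   "... ceph-osd[58404]: 4: (BlueFS::_replay(bool, bool)+0x214) [0x55ed3277c654]"
FRAME_RE = re.compile(r"ceph-osd\[\d+\]:\s+(\d+): \((.*)\+0x([0-9a-f]+)\)")


def extract_frames(journal_lines):
    """Return each distinct (index, symbol, offset) frame once, in first-seen order."""
    seen = set()
    frames = []
    for line in journal_lines:
        m = FRAME_RE.search(line)
        if not m:
            continue  # not a stack frame line (systemd messages, NOTE, etc.)
        frame = (int(m.group(1)), m.group(2), int(m.group(3), 16))
        if frame not in seen:
            seen.add(frame)
            frames.append(frame)
    return frames


# A few lines copied from the log above; frames 3 and 4 appear twice on purpose.
sample = [
    "Jun 23 20:12:20 Largo ceph-osd[58404]: 3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x367) [",
    "Jun 23 20:12:20 Largo ceph-osd[58404]: 4: (BlueFS::_replay(bool, bool)+0x214) [0x55ed3277c654]",
    "Jun 23 20:12:20 Largo ceph-osd[58404]: 5: (BlueFS::mount()+0x1f1) [0x55ed32780ea1]",
    "Jun 23 20:12:20 Largo ceph-osd[58404]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.",
    "Jun 23 20:12:20 Largo ceph-osd[58404]: 3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x367) [",
    "Jun 23 20:12:20 Largo ceph-osd[58404]: 4: (BlueFS::_replay(bool, bool)+0x214) [0x55ed3277c654]",
]

for index, symbol, offset in extract_frames(sample):
    print(index, symbol, hex(offset))
```

Frame 3 is matched even though its address is cut off in the journal, because the pattern only relies on the symbol and offset inside the parentheses.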
Starting program: /usr/bin/ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0 --no-mon-config
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffea952700 (LWP 62958)]
[New Thread 0x7fffe997a700 (LWP 62959)]
[New Thread 0x7fffe9179700 (LWP 62960)]
[New Thread 0x7fffe8978700 (LWP 62961)]
[New Thread 0x7fffe8177700 (LWP 62962)]
[New Thread 0x7fffe7976700 (LWP 62963)]
Thread 1 "ceph-bluestore-" hit Breakpoint 1, BlueFS::_read (this=this@entry=0x55555688c600, h=h@entry=0x555556862e80, buf=buf@entry=0x555556862e88, off=0, len=<optimised out>, outbl=outbl@entry=0x7fffffffaf10, out=0x0) at ./src/os/bluestore/BlueFS.cc:1097
1097 in ./src/os/bluestore/BlueFS.cc
(gdb) bt
#0 BlueFS::_read (this=this@entry=0x55555688c600, h=h@entry=0x555556862e80, buf=buf@entry=0x555556862e88, off=0, len=<optimised out>, outbl=outbl@entry=0x7fffffffaf10, out=0x0) at ./src/os/bluestore/BlueFS.cc:1097
#1 0x00005555557a99c4 in BlueFS::_replay (this=this@entry=0x55555688c600, noop=noop@entry=false, to_stdout=to_stdout@entry=false) at ./src/os/bluestore/BlueFS.cc:596
#2 0x00005555557ae211 in BlueFS::mount (this=0x55555688c600) at ./src/os/bluestore/BlueFS.cc:440
#3 0x0000555555812400 in BlueStore::_open_db (this=this@entry=0x7fffffffc680, create=create@entry=false, to_repair_db=to_repair_db@entry=false) at ./src/os/bluestore/BlueStore.cc:4845
#4 0x0000555555836c71 in BlueStore::_fsck (this=0x7fffffffc680, deep=false, repair=<optimised out>) at ./src/os/bluestore/BlueStore.cc:5867
#5 0x00005555556ccd57 in BlueStore::fsck (deep=<optimised out>, this=0x7fffffffc680) at ./src/os/bluestore/BlueStore.h:2171
#6 main (argc=<optimised out>, argv=<optimised out>) at ./src/os/bluestore/bluestore_tool.cc:306
(gdb) s
Thread 1 "ceph-bluestore-" received signal SIGSEGV, Segmentation fault.
BlueFS::_read (this=this@entry=0x55555688c600, h=h@entry=0x555556862e80, buf=buf@entry=0x555556862e88, off=0, len=<optimised out>, outbl=outbl@entry=0x7fffffffaf10, out=0x0) at ./src/os/bluestore/BlueFS.cc:1097
1097 in ./src/os/bluestore/BlueFS.cc
(gdb)
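To check that the gdb backtrace from ceph-bluestore-tool follows the same path as the ceph-osd crash, the function names can be extracted and compared. The following is a rough Python sketch under stated assumptions: `BT_RE` and `parse_bt` are hypothetical helpers, the pattern assumes the `bt` line layout shown in this session, and `osd_names` is copied by hand from frames 3 to 6 of the ceph-osd trace.

```python
import re

# Assumed layout of a gdb "bt" line, with or without a leading return address:
#   "#1 0x00005555557a99c4 in BlueFS::_replay (...) at ./src/os/bluestore/BlueFS.cc:596"
BT_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(\S+) \(.*\) at (\S+):(\d+)$")


def parse_bt(lines):
    """Return (frame number, function, source file, line number) for each bt line."""
    frames = []
    for line in lines:
        m = BT_RE.match(line.strip())
        if m:
            frames.append((int(m.group(1)), m.group(2), m.group(3), int(m.group(4))))
    return frames


# Innermost four frames copied from the gdb session above.
gdb_bt = [
    "#0 BlueFS::_read (this=this@entry=0x55555688c600, h=h@entry=0x555556862e80, buf=buf@entry=0x555556862e88, off=0, len=<optimised out>, outbl=outbl@entry=0x7fffffffaf10, out=0x0) at ./src/os/bluestore/BlueFS.cc:1097",
    "#1 0x00005555557a99c4 in BlueFS::_replay (this=this@entry=0x55555688c600, noop=noop@entry=false, to_stdout=to_stdout@entry=false) at ./src/os/bluestore/BlueFS.cc:596",
    "#2 0x00005555557ae211 in BlueFS::mount (this=0x55555688c600) at ./src/os/bluestore/BlueFS.cc:440",
    "#3 0x0000555555812400 in BlueStore::_open_db (this=this@entry=0x7fffffffc680, create=create@entry=false, to_repair_db=to_repair_db@entry=false) at ./src/os/bluestore/BlueStore.cc:4845",
]

# Function names of frames 3 to 6 in the ceph-osd journal trace, innermost first.
osd_names = ["BlueFS::_read", "BlueFS::_replay", "BlueFS::mount", "BlueStore::_open_db"]

gdb_names = [function for _, function, _, _ in parse_bt(gdb_bt)]
print("same inner call path:", gdb_names == osd_names)
```

Only the inner frames are compared because the outer callers differ by design: ceph-osd reaches `BlueStore::_open_db` through `BlueStore::_mount`, while the fsck run reaches it through `BlueStore::_fsck`.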