Bug #16400


Ceph OSD crashes suddenly after restart when using bluestore

Added by Yuri Gorshkov almost 8 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
bluestore
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi there.

We've set up a small 4-node Bluestore cluster for testing, and I noticed that sometimes an OSD would not start after a shutdown, throwing an assertion immediately. Here's a snippet from the OSD stdout; the full OSD log is attached:

2016-06-21 18:14:22.956887 7f376e0bd800 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2016-06-21 18:14:22.957140 7f376e0bd800 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2016-06-21 18:14:22.957229 7f376e0bd800 -1 WARNING: experimental feature 'bluestore' is enabled
Please be aware that this feature is experimental, untested,
unsupported, and may result in data corruption, data loss,
and/or irreparable damage to your cluster. Do not use
feature with important data.

starting osd.33 at :/0 osd_data /var/lib/ceph/osd/cephsml-33 /var/lib/ceph/osd/cephsml-33/journal
2016-06-21 18:14:22.979703 7f376e0bd800 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2016-06-21 18:14:23.033441 7f376e0bd800 -1 WARNING: experimental feature 'rocksdb' is enabled
Please be aware that this feature is experimental, untested,
unsupported, and may result in data corruption, data loss,
and/or irreparable damage to your cluster. Do not use
feature with important data.

2016-06-21 18:14:24.024998 7f376e0bd800 -1 WARNING: experimental feature 'rocksdb' is enabled
Please be aware that this feature is experimental, untested,
unsupported, and may result in data corruption, data loss,
and/or irreparable damage to your cluster. Do not use
feature with important data.

osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f376e0bd800 time 2016-06-21 18:14:24.088885
osd/OSD.h: 885: FAILED assert(ret)
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f376eaea5b5]
2: (OSDService::get_map(unsigned int)+0x3d) [0x7f376e4c893d]
3: (OSD::init()+0x1fe2) [0x7f376e47bdb2]
4: (main()+0x2c55) [0x7f376e3dfbe5]
5: (__libc_start_main()+0xf5) [0x7f376afceb15]
6: (()+0x353009) [0x7f376e42a009]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-06-21 18:14:24.090582 7f376e0bd800 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f376e0bd800 time 2016-06-21 18:14:24.088885
osd/OSD.h: 885: FAILED assert(ret)

ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f376eaea5b5]
2: (OSDService::get_map(unsigned int)+0x3d) [0x7f376e4c893d]
3: (OSD::init()+0x1fe2) [0x7f376e47bdb2]
4: (main()+0x2c55) [0x7f376e3dfbe5]
5: (__libc_start_main()+0xf5) [0x7f376afceb15]
6: (()+0x353009) [0x7f376e42a009]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
-665> 2016-06-21 18:14:22.956887 7f376e0bd800 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
-664> 2016-06-21 18:14:22.957140 7f376e0bd800 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
-660> 2016-06-21 18:14:22.957229 7f376e0bd800 -1 WARNING: experimental feature 'bluestore' is enabled
Please be aware that this feature is experimental, untested,
unsupported, and may result in data corruption, data loss,
and/or irreparable damage to your cluster. Do not use
feature with important data.
-650> 2016-06-21 18:14:22.979703 7f376e0bd800 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
-620> 2016-06-21 18:14:23.033441 7f376e0bd800 -1 WARNING: experimental feature 'rocksdb' is enabled
Please be aware that this feature is experimental, untested,
unsupported, and may result in data corruption, data loss,
and/or irreparable damage to your cluster. Do not use
feature with important data.
-162> 2016-06-21 18:14:24.024998 7f376e0bd800 -1 WARNING: experimental feature 'rocksdb' is enabled
Please be aware that this feature is experimental, untested,
unsupported, and may result in data corruption, data loss,
and/or irreparable damage to your cluster. Do not use
feature with important data.
0> 2016-06-21 18:14:24.090582 7f376e0bd800 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f376e0bd800 time 2016-06-21 18:14:24.088885
osd/OSD.h: 885: FAILED assert(ret)
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f376eaea5b5]
2: (OSDService::get_map(unsigned int)+0x3d) [0x7f376e4c893d]
3: (OSD::init()+0x1fe2) [0x7f376e47bdb2]
4: (main()+0x2c55) [0x7f376e3dfbe5]
5: (__libc_start_main()+0xf5) [0x7f376afceb15]
6: (()+0x353009) [0x7f376e42a009]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
    *** Caught signal (Aborted) **
    in thread 7f376e0bd800 thread_name:ceph-osd
    ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
    1: (()+0x91341a) [0x7f376e9ea41a]
    2: (()+0xf100) [0x7f376ca20100]
    3: (gsignal()+0x37) [0x7f376afe25f7]
    4: (abort()+0x148) [0x7f376afe3ce8]
    5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f376eaea797]
    6: (OSDService::get_map(unsigned int)+0x3d) [0x7f376e4c893d]
    7: (OSD::init()+0x1fe2) [0x7f376e47bdb2]
    8: (main()+0x2c55) [0x7f376e3dfbe5]
    9: (__libc_start_main()+0xf5) [0x7f376afceb15]
    10: (()+0x353009) [0x7f376e42a009]
    2016-06-21 18:14:24.095919 7f376e0bd800 -1 *** Caught signal (Aborted) **
    in thread 7f376e0bd800 thread_name:ceph-osd

    ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
    1: (()+0x91341a) [0x7f376e9ea41a]
    2: (()+0xf100) [0x7f376ca20100]
    3: (gsignal()+0x37) [0x7f376afe25f7]
    4: (abort()+0x148) [0x7f376afe3ce8]
    5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f376eaea797]
    6: (OSDService::get_map(unsigned int)+0x3d) [0x7f376e4c893d]
    7: (OSD::init()+0x1fe2) [0x7f376e47bdb2]
    8: (main()+0x2c55) [0x7f376e3dfbe5]
    9: (__libc_start_main()+0xf5) [0x7f376afceb15]
    10: (()+0x353009) [0x7f376e42a009]
    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    0> 2016-06-21 18:14:24.095919 7f376e0bd800 -1 *** Caught signal (Aborted) **
    in thread 7f376e0bd800 thread_name:ceph-osd

    ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
    1: (()+0x91341a) [0x7f376e9ea41a]
    2: (()+0xf100) [0x7f376ca20100]
    3: (gsignal()+0x37) [0x7f376afe25f7]
    4: (abort()+0x148) [0x7f376afe3ce8]
    5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f376eaea797]
    6: (OSDService::get_map(unsigned int)+0x3d) [0x7f376e4c893d]
    7: (OSD::init()+0x1fe2) [0x7f376e47bdb2]
    8: (main()+0x2c55) [0x7f376e3dfbe5]
    9: (__libc_start_main()+0xf5) [0x7f376afceb15]
    10: (()+0x353009) [0x7f376e42a009]
    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Aborted
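
For reference, the "dangerous and experimental features" warnings above come from BlueStore and RocksDB being enabled through the experimental-features switch. A minimal sketch of the jewel-era settings involved (our actual ceph.conf has more in it, and exact values may differ):

# Minimal sketch, not a full config: these are the options that produce
# the "experimental features" warnings seen in the log.
[global]
enable experimental unrecoverable data corrupting features = bluestore rocksdb

[osd]
osd objectstore = bluestore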


Files

cephsml-osd.33.log.gz (21.7 KB) cephsml-osd.33.log.gz Yuri Gorshkov, 06/21/2016 03:13 PM
osdlog.tar (210 KB) osdlog.tar Yuri Gorshkov, 06/22/2016 03:22 PM
cephsml-osd.1.log (311 KB) cephsml-osd.1.log Yuri Gorshkov, 08/11/2016 12:21 PM
Actions #1

Updated by Yuri Gorshkov almost 8 years ago

FYI, I've attached the objdump -rdS output for our version of ceph-osd (taken from the official jewel repositories).
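
For the record, the dump was generated roughly like this (a sketch; the binary path assumes the standard jewel packages and may differ on other installs):

# Sketch of how the disassembly was produced; /usr/bin/ceph-osd is assumed
# from the standard packages - adjust the path for your install.
objdump -rdS /usr/bin/ceph-osd | gzip > ceph-osd.objdump.gz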

Actions #2

Updated by Brad Hubbard almost 8 years ago

Could you try this again with debug osd set to 30?

I know 30 is high, but some of these messages are only output at level 30.

dout(30) << "get_map " << epoch << " - miss, below lower bound" << dendl;

Please upload the log once you have generated it.
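
For example (a sketch using the standard config option and daemon flags; adjust the cluster name and OSD id):

# Persistent: add to ceph.conf on the affected node, then restart the OSD.
[osd]
debug osd = 30/30

# One-off alternative: start the failing OSD in the foreground with the
# debug level bumped for just that run.
ceph-osd --cluster cephsml -i 33 -f --debug-osd 30/30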

Actions #3

Updated by Yuri Gorshkov almost 8 years ago

So, we were (re-)deploying our test cluster with ceph-ansible, and Ansible unexpectedly restarted all the OSDs. After the restart, roughly 10 out of 96 OSDs wouldn't come back up.
Out of these, I picked two and noticed they had different, albeit similar, failure modes. Attached are the logs from the attempts to start the damaged OSDs.

Since I now know how to reproduce the situation, I will try running our remaining OSDs with debug osd = 30 ASAP to see what happens when Ansible stops them.

Actions #4

Updated by Yuri Gorshkov almost 8 years ago

We have now tried to reproduce this by letting Ansible restart all the OSDs, but to no avail - it seems the failure was confined to a specific set of OSDs:

[ygorshkov@bt-mnsk1-mstor03 ~]$ sudo ceph --cluster cephsml osd tree|egrep '(host|down)'
2016-06-22 18:47:21.559526 7f434a5f6700 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2016-06-22 18:47:21.582836 7f434a5f6700 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
-2 0.00220 host bt-mnsk1-mstor04
3 0.00009 osd.3 down 0 1.00000
84 0.00009 osd.84 down 0 1.00000
-3 0.00220 host bt-mnsk1-mstor01
0 0.00009 osd.0 down 0 1.00000
12 0.00009 osd.12 down 0 1.00000
34 0.00009 osd.34 down 0 1.00000
40 0.00009 osd.40 down 0 1.00000
52 0.00009 osd.52 down 0 1.00000
76 0.00009 osd.76 down 0 1.00000
93 0.00009 osd.93 down 0 1.00000
-4 0.00220 host bt-mnsk1-mstor02
2 0.00009 osd.2 down 0 1.00000
13 0.00009 osd.13 down 0 1.00000
17 0.00009 osd.17 down 0 1.00000
29 0.00009 osd.29 down 0 1.00000
53 0.00009 osd.53 down 0 1.00000
59 0.00009 osd.59 down 0 1.00000
71 0.00009 osd.71 down 0 1.00000
-5 0.00220 host bt-mnsk1-mstor03
1 0.00009 osd.1 down 0 1.00000
45 0.00009 osd.45 down 0 1.00000
46 0.00009 osd.46 down 0 1.00000
64 0.00009 osd.64 down 0 1.00000
88 0.00009 osd.88 down 0 1.00000
26 0.00009 osd.26 down 0 1.00000

Actions #5

Updated by Brad Hubbard almost 8 years ago

For osd.0 it looks like some sort of corruption with regard to epoch 845.

2016-06-22 18:14:57.678449 7f8df1d14800 20 osd.0 0 get_map 845 - loading and decoding 0x7f8dfea0fb00
osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f8df1d14800 time 2016-06-22 18:14:57.678744
osd/OSD.h: 885: FAILED assert(ret)
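
For context, the assert fires when the lookup helper (try_get_map()) cannot load and decode the OSDMap for the requested epoch and returns a null reference; get_map() then asserts that the reference is non-null. A minimal self-contained illustration of that pattern (not the actual Ceph source, just the shape of the failure):

#include <cassert>
#include <cstdint>
#include <map>
#include <memory>

// Stand-ins for the real Ceph types, for illustration only.
struct OSDMapStub {};
using OSDMapRef = std::shared_ptr<OSDMapStub>;
using epoch_t = uint32_t;

std::map<epoch_t, OSDMapRef> map_store;  // stand-in for the cached/on-disk maps

// Plays the role of OSDService::try_get_map(): returns a null ref when the
// map for the epoch cannot be found or decoded.
OSDMapRef try_get_map(epoch_t e)
{
  auto it = map_store.find(e);
  return it == map_store.end() ? OSDMapRef() : it->second;
}

// Plays the role of OSDService::get_map(): callers assume the map exists,
// so a null result trips the assert ("osd/OSD.h: 885: FAILED assert(ret)").
OSDMapRef get_map(epoch_t e)
{
  OSDMapRef ret(try_get_map(e));
  assert(ret);
  return ret;
}

int main()
{
  map_store[844] = std::make_shared<OSDMapStub>();
  get_map(844);  // fine
  get_map(845);  // missing/undecodable epoch -> abort, as osd.0 did on startup
  return 0;
}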

As for the log from osd.12, it looks more like an I/O error of some sort.

2016-06-22 18:15:08.266918 7f10c9790800 -1 bluefs mount failed to replay log: (5) Input/output error
2016-06-22 18:15:08.266935 7f10c9790800 -1 bluestore(/var/lib/ceph/osd/cephsml-12) _open_db failed bluefs mount: (5) Input/output error
2016-06-22 18:15:08.266940 7f10c9790800 1 bdev(/var/lib/ceph/osd/cephsml-12/block) close
2016-06-22 18:15:08.508404 7f10c9790800 1 bdev(/var/lib/ceph/osd/cephsml-12/block) close
2016-06-22 18:15:08.734524 7f10c9790800 -1 osd.12 0 OSD:init: unable to mount object store
2016-06-22 18:15:08.734542 7f10c9790800 -1 ** ERROR: osd init failed: (5) Input/output error

IO errors are usually indicative of an issue below the application layer.

Can you double check the storage in use here?

Actions #6

Updated by Yuri Gorshkov almost 8 years ago

Hi.

At first glance, the underlying storage appears to be normal. I will double-check, though.

When our colleague tried to remove the now-defunct OSD (osd.88), another OSD on a different host (osd.63) failed. (I've sanitized the host names in the output below for privacy reasons; the actual hostnames on the hosts themselves weren't changed.)

See the attached file for the output of osd.63.

2016-06-23 13:45:59.807982 mon.2 [INF] from='client.? 192.168.36.7:0/1957761979' entity='client.admin' cmd=[{"prefix": "osd crush remove", "name": "osd.88"}]: dispatch
2016-06-23 13:45:59.808968 mon.0 [INF] from='client.724418 :/0' entity='client.admin' cmd=[{"prefix": "osd crush remove", "name": "osd.88"}]: dispatch
2016-06-23 13:45:59.933294 mon.0 [INF] from='client.724418 :/0' entity='client.admin' cmd='[{"prefix": "osd crush remove", "name": "osd.88"}]': finished
2016-06-23 13:45:59.936040 mon.0 [INF] osdmap e1103: 91 osds: 83 up, 83 in
2016-06-23 13:46:00.233312 mon.1 [INF] from='client.? 192.168.36.7:0/954660534' entity='client.admin' cmd=[{"prefix": "auth del", "entity": "osd.88"}]: dispatch
2016-06-23 13:46:00.234787 mon.0 [INF] from='client.724483 :/0' entity='client.admin' cmd=[{"prefix": "auth del", "entity": "osd.88"}]: dispatch
2016-06-23 13:46:00.237446 mon.0 [INF] from='client.724483 :/0' entity='client.admin' cmd='[{"prefix": "auth del", "entity": "osd.88"}]': finished
2016-06-23 13:46:00.474265 mon.0 [INF] from='client.? 192.168.36.7:0/4208656487' entity='client.admin' cmd=[{"prefix": "osd rm", "ids": ["88"]}]: dispatch
2016-06-23 13:46:01.002949 mon.0 [INF] from='client.? 192.168.36.7:0/4208656487' entity='client.admin' cmd='[{"prefix": "osd rm", "ids": ["88"]}]': finished
2016-06-23 13:46:01.008355 mon.0 [INF] osdmap e1104: 90 osds: 83 up, 83 in
2016-06-23 13:46:13.467462 mon.2 [INF] from='client.? 192.168.36.6:0/3852514611' entity='osd.63' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=test1-stor02", "root=default"], "id": 63, "weight": 0.0001}]: dispatch
2016-06-23 13:46:13.468355 mon.0 [INF] from='client.724421 :/0' entity='osd.63' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=test1-stor02", "root=default"], "id": 63, "weight": 0.0001}]: dispatch
2016-06-23 13:46:15.476682 mon.1 [INF] from='client.? 192.168.36.6:0/365138351' entity='osd.63' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=test1-stor02", "root=default"], "id": 63, "weight": 0.0001}]: dispatch
2016-06-23 13:46:15.477345 mon.0 [INF] from='client.724486 :/0' entity='osd.63' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=test1-stor02", "root=default"], "id": 63, "weight": 0.0001}]: dispatch
2016-06-23 13:46:17.204182 mon.0 [INF] from='client.? 192.168.36.6:0/2354660382' entity='osd.63' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=test1-stor02", "root=default"], "id": 63, "weight": 0.0001}]: dispatch
2016-06-23 13:46:30.260836 mon.0 [INF] osdmap e1106: 90 osds: 82 up, 83 in
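
(For clarity, the removal sequence visible in the monitor log above corresponds to the usual CLI steps, reconstructed here as a sketch - the colleague may have run an equivalent script:)

ceph --cluster cephsml osd crush remove osd.88
ceph --cluster cephsml auth del osd.88
ceph --cluster cephsml osd rm 88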

Actions #7

Updated by Yuri Gorshkov almost 8 years ago

Also, I think that losing 12 OSD drives in a cluster simultaneously due to storage failure is extremely unlikely ;). (Though, given the recent events in my personal life, I wouldn't even be that surprised by it - stuff happens, you know ;))

Actions #8

Updated by Yuri Gorshkov almost 8 years ago

After the cluster started backfilling, we suddenly saw ~22 additional OSDs drop out of the cluster with the same error:

4: (abort()+0x148) [0x7fc24fbc7ce8]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7fc2536ce797]
6: (OSDService::get_map(unsigned int)+0x3d) [0x7fc2530ac93d]

2016-06-23 14:03:19.049344 mon.0 [INF] osd.60 out (down for 301.028768)
2016-06-23 14:03:19.049391 mon.0 [INF] osd.74 out (down for 300.076516)
2016-06-23 14:03:19.116875 mon.0 [INF] osdmap e1253: 96 osds: 63 up, 71 in
2016-06-23 14:03:20.145336 mon.0 [INF] osdmap e1254: 96 osds: 63 up, 71 in
2016-06-23 14:03:21.133691 mon.0 [INF] osdmap e1255: 96 osds: 63 up, 71 in
2016-06-23 14:03:16.743618 osd.85 [WRN] 2 slow requests, 2 included below; oldest blocked for > 30.502325 secs
2016-06-23 14:03:16.743660 osd.85 [WRN] slow request 30.502325 seconds old, received at 2016-06-23 14:02:46.241154: osd_op(client.124348.0:117420 4.3768e783 (undecoded) ondisk+write+known_if_redirected e1249) currently waiting for active
2016-06-23 14:03:16.743666 osd.85 [WRN] slow request 30.338302 seconds old, received at 2016-06-23 14:02:46.405176: osd_op(client.124348.0:117420 4.3768e783 (undecoded) ondisk+retry+write+known_if_redirected e1249) currently waiting for active
2016-06-23 14:03:20.112443 osd.85 [INF] 4.3 starting backfill to osd.49 from (0'0,0'0] MAX to 1089'3230
2016-06-23 14:03:20.118829 osd.85 [INF] 4.3 starting backfill to osd.53 from (0'0,0'0] MAX to 1089'3230

Actions #9

Updated by Brad Hubbard almost 8 years ago

Yuri Gorshkov wrote:

Also, I think that losing 12 OSD drives in a cluster simultaneously due to storage failure is extremely unlikely ;). (Though, given the recent events in my personal life, I wouldn't even be that surprised by it - stuff happens, you know ;))

I'm certainly on the losing end if I attempt to argue 12 concurrent failures :)

Nevertheless, I am interested in the input/output errors and their role in this. How many of these errors were seen, on how many OSDs, and what is the timeline? I'd like to try to establish what "event zero" was. I speculate that something has corrupted a map and that the corrupted map has been propagated to numerous OSDs, causing them to fail on it. Is the last message in all the crashes "get_map 845" if debug is turned up to at least 20?
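
Something along these lines on each storage node would help pin that down (a sketch; log locations assume the default /var/log/ceph layout):

# Which OSD logs contain the assert, and which contain the I/O errors:
grep -l 'FAILED assert(ret)' /var/log/ceph/cephsml-osd.*.log
grep -l 'Input/output error' /var/log/ceph/cephsml-osd.*.log
# First occurrence per log, with timestamps, to look for "event zero":
grep -m1 -H 'FAILED assert(ret)' /var/log/ceph/cephsml-osd.*.log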

Actions #10

Updated by Sage Weil almost 8 years ago

The EIO is coming from BlueFS failing to mount, which is probably a bug in BlueFS or RocksDB and not an error on the media. If you can reproduce the failed restart with debug bluefs = 20, that will give us some clue.

Note that BlueStore on master has already diverged significantly from jewel, and there are lots of known bugs in jewel that we are never going to fix... it just has to be stable enough to benchmark and to give people a taste of what's to come. If the BlueFS mount failure looks concerning, we'll want to try to reproduce it on master...
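
A sketch of the requested setting (standard option names; add on the affected node and retry starting the OSD):

[osd]
debug bluefs = 20/20
debug bluestore = 20/20   # optional, but often useful alongside bluefs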

Actions #11

Updated by Brad Hubbard almost 8 years ago

  • Assignee set to Brad Hubbard
  • Tags set to bluestore
Actions #12

Updated by Nathan Cutler over 7 years ago

  • Target version changed from 519 to v10.2.2
Actions #13

Updated by Nathan Cutler over 7 years ago

  • Target version deleted (v10.2.2)
Actions #14

Updated by Yuri Gorshkov over 7 years ago

We were able to reproduce the bug on our OSDs. Apparently, if

bluestore rocksdb options = recycle_log_file_num=16

is set, then we hit this bug. Removing the option afterwards does not let an affected OSD start normally, but an OSD that has been running without this option from the beginning works normally.

Also, we set

bluestore bluefs = false

so it's almost certainly not a BlueFS bug :)
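
Putting it together, the relevant part of our OSD config during these tests looks roughly like this (a sketch; other options omitted):

# Rough sketch of the relevant OSD settings; other options omitted. With the
# recycle_log_file_num line present, restarted OSDs hit the assert; OSDs that
# never had it behave normally.
[osd]
osd objectstore = bluestore
bluestore bluefs = false
bluestore rocksdb options = recycle_log_file_num=16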

Actions #16

Updated by Brad Hubbard over 7 years ago

$ zgrep bluefs$ cephsml-osd.33.log.gz
1/ 5 bluefs
1/ 5 bluefs

Could we get logs at debug bluefs = 20, as Sage requested, please?

Actions #17

Updated by Brad Hubbard over 7 years ago

Given that the BlueStore format and code have changed substantially since jewel, and in the absence of further information, I'm going to close this out if no one objects.

Actions #18

Updated by Brad Hubbard over 7 years ago

  • Status changed from New to Closed