Bug #41240 (closed)

All of the cluster SSDs aborted at around the same time and will not start.

Added by Troy Ablan over 4 years ago. Updated about 4 years ago.

Status: Can't reproduce
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

There are 14 SSDs and a few hundred HDDs in this cluster.

The SSDs all crashed at around the same time, and they all abort immediately when started again.

They hold the rgw index and log pools, plus a few smaller ones; the bluestore side was nowhere near full.

Example logs via ceph-post-file

36177605-7519-437f-b4d7-0a2d9482d2f6
77bf5a07-1be3-4e77-badd-c6a9780aa011

Here's an exchange from IRC (times UTC-0700)

2019-08-13 17:05:12 < MooingLemur> I'm sorry to have this spill over into -devel, but, I have a cluster which had every device for several pools crash and won't restart. It immediately aborts. http://dpaste.com/0R5RFQ7
2019-08-13 17:05:47 < MooingLemur> I have a feeling this is due to bluefs not being balanced, because when I use ceph-kvstore-tool to compact, it aborts with bluefs enospc
2019-08-13 17:06:33 < MooingLemur> This is on Mimic 13.2.6, each OSD is a single device (not LVM, and no separate wal/db)
2019-08-13 17:07:35 < MooingLemur> Is there any way to verify that bluefs being short on space is what's going on, and if it is, what's my best remedy?
2019-08-13 17:13:42 < joshd> MooingLemur: you can set bluefs_alloc_size smaller than the default 1M, e.g. 256K or 64K to get ceph-kvstore-tool running. ENOSPC from bluefs means there aren't enough free extents of the right size. compaction will end up using more space until it deletes old tables, so that's consistent with your observations
2019-08-13 17:25:08 < MooingLemur> joshd: that option passed on the command line, env var, or ceph.conf? (does ceph-kvstore-tool even read ceph.conf?)
2019-08-13 17:28:55 < MooingLemur> 2019-08-14 00:27:15.146 7ff5de70c700 1 bluefs allocate failed to allocate 0x4150000 on bdev 1, free 0xca0000; fallback to bdev 2
2019-08-13 17:28:56 < joshd> iirc it might only read ceph.conf
2019-08-13 17:29:30 < MooingLemur> looks like that works out to about 13MB free
2019-08-13 17:30:06 < joshd> that's when it'll fallback to getting space from bdev 2 (bluestore's block dev)
2019-08-13 17:30:18 < MooingLemur> 2019-08-14 00:27:15.146 7ff5de70c700 -1 bluefs _allocate failed to allocate 0x4150000 on bdev 2, dne
2019-08-13 17:31:26 < MooingLemur> if there's no separate db device, I would suspect that bdev 1 is the only place
2019-08-13 17:31:58 < joshd> there's still an internal separation between bluefs and bluestore
2019-08-13 17:32:55 < joshd> with separate bdev objects internally
2019-08-13 17:36:28 < MooingLemur> joshd: here's the full output of ceph-kvstore-tool http://dpaste.com/04Y5PHD
2019-08-13 17:37:19 < MooingLemur> (the option was in ceph.conf as well: "bluefs_alloc_size = 65536" under [global])
2019-08-13 17:46:05 < joshd> MooingLemur: I'd suggest trying to use ceph-objectstore-tool's get-map/set-map to replace the corrupted osdmap, without compacting
2019-08-13 17:48:19 < joshd> I've got to go now, good luck
2019-08-13 17:48:40 < MooingLemur> pull the osdmap from another osd?
2019-08-13 17:48:45 < MooingLemur> thanks for the pointers
2019-08-13 17:51:37 < MooingLemur> https://bpaste.net/show/z4y_ when running ceph-objectstore-tool
2019-08-13 17:57:57 < MooingLemur> aw man, can't even run `ceph-bluestore-tool repair` anymore without it aborting due to enospc
2019-08-13 18:33:49 < MooingLemur> welp, it appears that I can't set the osdmap (after getting the osdmap from another osd)
2019-08-13 18:34:02 < MooingLemur> osdmap (#-1:9b48f9f2:::osdmap.81046:0#) does not exist.
2019-08-13 18:34:26 < MooingLemur> perhaps I'm misunderstanding how this tool works, documentation is very sparse
2019-08-13 18:34:39 < MooingLemur> But it also could be that something more serious is amiss here
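
As the allocation failure above shows (0x4150000 ≈ 68 MB requested with only 0xca0000 ≈ 13 MB free on bdev 1), bluefs ran out of suitably sized free extents. The workaround joshd describes amounts to roughly the following (a sketch; the OSD id and data path are illustrative, and ceph-kvstore-tool appears to pick the option up from ceph.conf rather than the command line):

[global]
    bluefs_alloc_size = 65536

$ ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-53 compact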

I tried compacting osd.53 using ceph-kvstore-tool and somehow made this OSD worse. As of creating this issue, I don't believe I've made any other changes to any OSDs.
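
The osdmap replacement joshd suggested amounts to something like the following (a sketch; OSD ids, paths and the epoch are illustrative, and both OSDs need to be stopped first). get-osdmap also accepts --epoch to pull an older map, and set-osdmap may need --force when the target object is missing, which appears to be what the "does not exist" error above is about:

$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-60 --op get-osdmap --file osdmap.81046
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-53 --op set-osdmap --file osdmap.81046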


Related issues 1 (0 open, 1 closed)

Related to RADOS - Bug #39525: lz4 compressor corrupts data when buffers are unaligned (Resolved)

Actions #1

Updated by Troy Ablan over 4 years ago

Additional info that may be useful: the rgw instance has over 1B objects spread across about 66k buckets.

Actions #2

Updated by Troy Ablan over 4 years ago

I have imaged one of the affected OSDs and uploaded a gzipped version of it in the hopes that direct inspection by a dev could be useful.

ceph-post-file: ea05fa1c-a61d-4ff1-9750-6c4f42e2808e

It's a 1TB dd image gzipped down to about 41GB.
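
Something along these lines would produce such an image (a sketch; the device and file names are illustrative):

$ dd if=/dev/sdX bs=4M status=progress | gzip > osd-image.gz
$ ceph-post-file osd-image.gz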

Actions #3

Updated by Troy Ablan over 4 years ago

Output of `ceph report`, as requested by badone on #ceph IRC:

ceph-post-file: b9adf52a-3ce8-4f33-93cb-29bd8daaf947
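
For reference, roughly (file name illustrative):

$ ceph report > ceph-report.txt
$ ceph-post-file ceph-report.txt
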
Actions #4

Updated by Troy Ablan over 4 years ago

With guidance from badone on IRC, all of the osds are running and all of the pgs are active.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036592.html

While the root cause is still an open question, I don't know whether further dev effort would be worth the time investment since it could be hardware, so I consider this issue OK to close.

Actions #5

Updated by Brad Hubbard over 4 years ago

Hi Troy,

Before we close this, I'll take a look at the image you uploaded to see if I can work out the nature of the corruption. Could you upload any copies of the 15 or so maps you still have, so I can use those in my analysis as well (only if they are still available, of course)?

Actions #6

Updated by Brad Hubbard over 4 years ago

  • Status changed from New to Triaged
  • Assignee set to Brad Hubbard
  • Severity changed from 1 - critical to 3 - minor

Reducing severity since the cluster is currently healthy.

Actions #7

Updated by Troy Ablan over 4 years ago

Brad Hubbard wrote:

Hi Troy,

Before we close this, I'll take a look at the image you uploaded to see if I can work out the nature of the corruption. Could you upload any copies of the 15 or so maps you still have, so I can use those in my analysis as well (only if they are still available, of course)?

Do I look like a man who has ever used the rm command? :)

ceph-post-file: 62585fb4-7726-427d-b615-2ecffdad5b2c

Thanks

-Troy

Actions #8

Updated by Brad Hubbard over 4 years ago

  • Status changed from Triaged to 12

I had a chance to get back to this.

I fuse-mounted the uploaded image and copied out the osdmap data for epoch 80890 (after verifying that COT failed to extract it). The good and bad copies have the same number of bytes. I then produced a hexdump of the bad copy, did the same for the good copy you uploaded, and ran a diff on the two.
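
Roughly how the dumps were produced (the exact invocation may have differed; something like hexdump -c, or od -c, yields the octal/character format below, and the input file names are illustrative):

$ hexdump -c osdmap.80890 > osdmap.80890.hexdump
$ hexdump -c osdmap.80890.data > osdmap.80890.data.hexdump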

# diff osdmap.80890.hexdump osdmap.80890.data.hexdump
3904c3904
< 0010010  \0  \0 004  \0  \0  \0 377 377 377 377  \n  \0 004  \0 342 377
---
> 0010010  \0  \0  \0  \0  \0  \0 377 377 377 377  \n  \0 004  \0 342 377
3936c3936
< 0010210  \0  \0 004  \0  \0  \0  \t  \0  \0  \0 016  \0  \0  \0   &  \0
---
> 0010210  \0  \0  \0  \0  \0  \0  \t  \0  \0  \0 016  \0  \0  \0   &  \0
3992c3992
< 0010680  \0  \0   G  \0  \0  \0   H  \0  \0  \0   K  \0  \0  \0 004  \0
---
> 0010680  \0  \0   G  \0  \0  \0   H  \0  \0  \0   K  \0  \0  \0  \0  \0
4024,4025c4024,4025
< 00108a0 004  \0  \0  \0 004  \0  \0  \0 004  \0  \0  \0 004  \0  \0  \0
< 00108b0 004  \0  \0  \0 003  \0  \0  \0 003  \0  \0  \0 003  \0  \0  \0
---
> 00108a0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0
> 00108b0  \0  \0  \0  \0 003  \0  \0  \0 003  \0  \0  \0 003  \0  \0  \0
4041,4042c4041,4042
< 00109d0 006  \0  \0  \0 006  \0  \0  \0 004  \0  \0  \0 004  \0  \0  \0
< 00109e0 004  \0  \0  \0 004  \0  \0  \0 004  \0  \0  \0 004  \0  \0  \0
---
> 00109d0 006  \0  \0  \0 006  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
> 00109e0 004  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0  \0  \0  \0  \0
4063c4063
< 0010b50 004  \0  \0  \0 004  \0  \0  \0 004  \0  \0  \0 003  \0  \0  \0
---
> 0010b50 004  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0 003  \0  \0  \0
4083c4083
< 0010d40 004  \0  \0  \0 004  \0  \0  \0 003  \0  \0  \0 003  \0  \0  \0
---
> 0010d40  \0  \0  \0  \0 004  \0  \0  \0 003  \0  \0  \0 003  \0  \0  \0
4101,4102c4101,4102
< 0010f10  \0  \0 251 001  \0  \0 252 001  \0  \0  \0  \0 004  \0  \0  \0
< 0010f20 004  \0  \0  \0 004  \0  \0  \0 004  \0  \0  \0 003  \0  \0  \0
---
> 0010f10  \0  \0 251 001  \0  \0 252 001  \0  \0  \0  \0  \0  \0  \0  \0
> 0010f20 004  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0 003  \0  \0  \0
4121,4122c4121,4122
< 00110f0  \0  \0 252 001  \0  \0  \0  \0 004  \0  \0  \0 004  \0  \0  \0
< 0011100 004  \0  \0  \0 004  \0  \0  \0 003  \0  \0  \0 003  \0  \0  \0
---
> 00110f0  \0  \0 252 001  \0  \0  \0  \0  \0  \0  \0  \0 004  \0  \0  \0
> 0011100  \0  \0  \0  \0 004  \0  \0  \0 003  \0  \0  \0 003  \0  \0  \0

This is easier to interpret with the benefit of colour in something like vimdiff, but it basically boils down to this: on multiple lines where the value "004" should be present, there is a "\0" null instead. There appear to be 15 occurrences. This looks less than random, but I still need to see whether I can debug it further. The '4' here may be a representation of 'CRUSH_BUCKET_STRAW = 4', which of course is the bucket alg it is erroring on.
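
For reference, the bucket algorithm constants (values from memory; worth confirming against src/crush/crush.h in the tree):

CRUSH_BUCKET_UNIFORM = 1
CRUSH_BUCKET_LIST    = 2
CRUSH_BUCKET_TREE    = 3
CRUSH_BUCKET_STRAW   = 4
CRUSH_BUCKET_STRAW2  = 5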

It seems there are 18 buckets that use the "straw" algorithm, so there is no direct correlation there.

$ jq '.crushmap.buckets[]|select(.alg=="straw")|.alg' ceph-report.txt|wc
     18      18     144
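
The same filter can list which buckets those are, e.g. (assuming each bucket entry in the report carries a name field):

$ jq '.crushmap.buckets[]|select(.alg=="straw")|.name' ceph-report.txt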

I think I need to step through this with the debugger next.

Actions #9

Updated by Neha Ojha over 4 years ago

  • Project changed from bluestore to RADOS
Actions #10

Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New
Actions #11

Updated by Brad Hubbard over 4 years ago

  • Status changed from New to Can't reproduce
Actions #13

Updated by Brad Hubbard about 4 years ago

  • Related to Bug #39525: lz4 compressor corrupts data when buffers are unaligned added