Bug #40300


ceph-osd segfault: "rocksdb: Corruption: file is too short"

Added by Harald Staub almost 5 years ago. Updated about 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous, mimic, nautilus, octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Cluster is Nautilus 14.2.1, 350 OSDs with BlueStore.

Steps that led to the problem:

1. There is a bucket with about 60 million objects, without shards.

2. radosgw-admin bucket reshard --bucket $BIG_BUCKET --num-shards 1024

3. Resharding looked fine at first: it counted up to the number of objects,
but then it hung.

4. 3 OSDs crashed with a segfault: "rocksdb: Corruption: file is too short"

5. Trying to start the OSDs manually led to the same segfaults.

6. ceph-bluestore-tool repair ...

7. The repairs all aborted, with the same rocksdb error as above.

8. Now 1 PG is stale. It belongs to the radosgw bucket index pool, and it contained the index of this big bucket.

Is there any hope of getting these rocksdbs up again?
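For reference, roughly how the state can be inspected (a sketch only; the bucket name and any ids are placeholders, not taken from this cluster):

# Illustrative sketch only: bucket name and ids are placeholders.
radosgw-admin bucket stats --bucket $BIG_BUCKET   # current num_shards / num_objects
radosgw-admin reshard list                        # is the reshard job still pending?
ceph pg dump_stuck stale                          # find the stale bucket index PG
ceph osd tree down                                # list the crashed OSDs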

Some more details that may be of interest:

ceph-bluestore-tool repair says:

2019-06-12 11:15:38.345 7f56269670c0 -1 rocksdb: Corruption: file is too short (6139497190 bytes) to be an sstabledb/079728.sst
2019-06-12 11:15:38.345 7f56269670c0 -1 bluestore(/var/lib/ceph/osd/ceph-49) _open_db erroring opening db:
error from fsck: (5) Input/output error

The repairs also showed several warnings like:

tcmalloc: large alloc 17162051584 bytes == 0x56167918a000 @ 0x7f5626521887 0x56126a287229 0x56126a2873a3 0x56126a5dc1ec 0x56126a584ce2 0x56126a586a05 0x56126a587dd0 0x56126a589344 0x56126a38c3cf 0x56126a2eae94 0x56126a30654e 0x56126a337ae1 0x56126a1a73a1 0x7f561b228b97 0x56126a28077a

The processes showed up with around 45 GB of RAM used. Fortunately, there was no out-of-memory kill.
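For completeness, the repair attempts were invoked roughly like this (a sketch; the OSD id and path follow the log above, the systemd unit name is an assumption):

# Sketch of the repair/fsck attempt (OSD stopped first; id 49 as in the log above).
systemctl stop ceph-osd@49
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-49
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-49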

Harry

Related issues: 1 (0 open, 1 closed)

Related to bluestore - Bug #49170: BlueFS might end up with huge WAL files when upgrading OMAPs (Resolved)

Actions #2

Updated by Valery Tschopp almost 5 years ago

To add some info.

The logs of all 6 of our radosgw instances, behind an nginx load balancer, are full of messages like these:

...
2019-06-12 15:25:55.916 7f176a299700 0 WARNING: curl operation timed out, network average transfer speed less than 1024 Bytes per second during 300 seconds.
2019-06-12 15:26:13.676 7f176a299700 0 WARNING: curl operation timed out, network average transfer speed less than 1024 Bytes per second during 300 seconds.
2019-06-12 15:26:36.247 7f176a299700 0 WARNING: curl operation timed out, network average transfer speed less than 1024 Bytes per second during 300 seconds.
2019-06-12 15:26:36.247 7f15d3353700 1 ====== req done req=0x55aef08f6940 op status=0 http_status=403 latency=600.265s ======
2019-06-12 15:26:50.603 7f176ca9e700 0 WARNING: RGWRados::log_usage(): user name empty (bucket=), skipping
2019-06-12 15:26:51.263 7f176a299700 0 WARNING: curl operation timed out, network average transfer speed less than 1024 Bytes per second during 300 seconds.
2019-06-12 15:26:51.267 7f15d0b4e700 1 ====== req done req=0x55aef0762940 op status=0 http_status=403 latency=601.409s ======
2019-06-12 15:27:20.607 7f176ca9e700 0 WARNING: RGWRados::log_usage(): user name empty (bucket=), skipping
2019-06-12 15:27:35.094 7f15952d7700 1 ====== starting new request req=0x55aef1452940 =====
2019-06-12 15:27:35.102 7f176a299700 0 WARNING: curl operation timed out, network average transfer speed less than 1024 Bytes per second during 300 seconds.
2019-06-12 15:27:35.102 7f176a299700 0 WARNING: curl operation timed out, network average transfer speed less than 1024 Bytes per second during 300 seconds.
2019-06-12 15:27:35.102 7f15ceb4a700 1 ====== req done req=0x55aef06b2940 op status=0 http_status=403 latency=600.549s ======
2019-06-12 15:27:50.606 7f176ca9e700 0 WARNING: RGWRados::log_usage(): user name empty (bucket=), skipping
2019-06-12 15:27:52.766 7f176a299700 0 WARNING: curl operation timed out, network average transfer speed less than 1024 Bytes per second during 300 seconds.
2019-06-12 15:27:52.766 7f15c933f700 1 ====== req done req=0x55aef0444940 op status=0 http_status=403 latency=601.077s ======
2019-06-12 15:28:08.058 7f176a299700 0 WARNING: curl operation timed out, network average transfer speed less than 1024 Bytes per second during 300 seconds.
2019-06-12 15:28:20.605 7f176ca9e700 0 WARNING: RGWRados::log_usage(): user name empty (bucket=), skipping
2019-06-12 15:28:31.709 7f15922d1700 1 ====== starting new request req=0x55aeefa0c940 =====
2019-06-12 15:28:31.713 7f176a299700 0 WARNING: curl operation timed out, network average transfer speed less than 1024 Bytes per second during 300 seconds.
2019-06-12 15:28:31.713 7f15c3b34700 1 ====== req done req=0x55aef0e76940 op status=0 http_status=403 latency=600.193s ======
2019-06-12 15:28:33.189 7f158f2cb700 1 ====== starting new request req=0x55aef1462940 =====
2019-06-12 15:28:45.685 7f176a299700 0 WARNING: curl operation timed out, network average transfer speed less than 1024 Bytes per second during 300 seconds.
2019-06-12 15:28:45.685 7f15a72fb700 1 ====== req done req=0x55aeeff94940 op status=0 http_status=401 latency=300.607s ======
2019-06-12 15:28:48.689 7f176a299700 0 WARNING: curl operation timed out, network average transfer speed less than 1024 Bytes per second during 300 seconds.
2019-06-12 15:28:48.689 7f15c2b32700 1 ====== req done req=0x55aef15a2940 op status=0 http_status=403 latency=601.349s ======
...

Actions #3

Updated by Harald Staub almost 5 years ago

Thanks to the awesome help of several people, we managed to work around this problem.
With

bluestore rocksdb options = "compaction_readahead_size=0"

(a tip from Wido den Hollander), the OSDs came up.

Some more background information in this thread:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-June/035327.html
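
Roughly how the workaround can be applied (a sketch; the [osd] placement and the restart step are assumptions, adapt to your deployment):

# ceph.conf sketch (assumed [osd] placement); note that this string replaces the
# default bluestore rocksdb options, so add back any other options you rely on.
[osd]
bluestore rocksdb options = "compaction_readahead_size=0"

# then restart the affected OSDs, e.g.
systemctl restart ceph-osd@49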

Actions #4

Updated by Harald Staub almost 5 years ago

This was enough to bring up the 3 OSDs, get back the stale PG, and complete the resharding, i.e. get rid of the large omap object.
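
A rough sketch of checks to confirm the recovery (illustrative only; the bucket name is a placeholder):

# Illustrative checks after recovery:
ceph -s                                             # PGs back to active+clean?
ceph health detail | grep -i 'large omap'           # large omap warning cleared?
radosgw-admin reshard status --bucket $BIG_BUCKET   # resharding finished?
radosgw-admin bucket stats --bucket $BIG_BUCKET     # num_shards now 1024?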

Actions #5

Updated by Sage Weil almost 5 years ago

So, I think the question is what we can/should do to avoid this next time. A few options:

- This is a rocksdb readahead problem. Turning off readahead is the correct workaround if this happens again.
- BlueFS failed to behave when rocksdb readahead went off the rails (with a 17 GB read) due to its assumptions about memory management. rocksdb on top of a normal fs worked fine because the page cache wasn't obligated to hold the entire readahead range in memory at once. Fix bluefs readahead to be more "best-effort."
- There is a bluefs overflow bug or similar.

The stack trace I saw failed in tc_new (i.e., tcmalloc), so I think it's not the last one. So... is it worth making bluefs readahead better-behaved when rocksdb readahead goes nuts?

Actions #7

Updated by Igor Fedotov almost 4 years ago

  • Backport set to luminous, mimic, nautilus, octopus
Actions #9

Updated by Igor Fedotov about 3 years ago

  • Related to Bug #49170: BlueFS might end up with huge WAL files when upgrading OMAPs added