Project

General

Profile

Actions

Bug #49170

closed

BlueFS might end-up with huge WAL files when upgrading OMAPs

Added by Igor Fedotov over 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
pacific, octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When performing OMAP on-disk format upgrade BlueStore's fsck might flood BlueFS with converting transactions which results in huge WAL files at BlueFS.
Presumably tend to happen with slow DB drives and large OMAPs...
Symptoms:
Multiple warnings like the following:
fsck warning: #6:05e72f0d:::.dir.334d8ebf-4ddc-4f24-bd2f-9b3bc6fa711e.3268439386.1.633:head# has omap that is not per-pool or pgmeta
...
Multiple indications similar to:
rocksdb: [db/column_family.cc:826] [default] Stalling writes because of estimated pending compaction bytes 69323286921 rate 16777213
or
rocksdb: [db/column_family.cc:826] [default] Stalling writes because of estimated pending compaction bytes 71251735338 rate 16777216

then bulk disk space allocation:
bluefs _allocate unable to allocate 0x43a313b on bdev 2, free 0xffffffffffffffff; fallback to slow device expander
bluefs _add_block_extent bdev 1 0xd770940000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xd7b0940000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xd8600f0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xd8a00f0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xd8e00f0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xd9200f0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xd9814c0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xd9c14c0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xda014c0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xda414c0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xda814c0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xd6f51b0000~38740000 skip 0

followed by bulk RAM allocation:
tcmalloc: large alloc 2105548800 bytes 0x55cabfbc0000 0x7fe00a02af2d 0x7fe00a04af8b 0x55c74f216b69 0x55c74f216ce3 0x55c74fd592a7 0x55c74fda0c2e 0x55c74fda0d3a 0x55c74fbecf8b 0x55c74fbef6cd 0x55c74f7212e3 0x55c74f7239cb 0x55c74f761ff8 0x55c74f726f62 0x55c74f727ff7 0x55c74f735652 0x55c74f28740a 0x55c74f1fb3ca 0x7fe0097ca09b 0x55c74f20de7a
tcmalloc: large alloc 4211089408 bytes 0x55cb3d3c2000
0x7fe00a02af2d 0x7fe00a04af8b 0x55c74f216b69 0x55c74f216ce3 0x55c74fd592a7 0x55c74fda0c2e 0x55c74fda0d3a 0x55c74fbecf8b 0x55c74fbef6cd 0x55c74f7212e3 0x55c74f7239cb 0x55c74f761ff8 0x55c74f726f62 0x55c74f727ff7 0x55c74f735652 0x55c74f28740a 0x55c74f1fb3ca 0x7fe0097ca09b 0x55c74f20de7a
tcmalloc: large alloc 8422170624 bytes == 0x55cc38bc4000 @ 0x7fe00a02af2d 0x7fe00a04af8b 0x55c74f216b69 0x55c74f216ce3 0x55c74fd592a7 0x55c74fda0c2e 0x55c74fda0d3a 0x55c74fbecf8b 0x55c74fbef6cd 0x55c74f7212e3 0x55c74f7239cb 0x55c74f761ff8 0x55c74f726f62 0x55c74f727ff7 0x55c74f735652 0x55c74f28740a 0x55c74f1fb3ca 0x7fe0097ca09b 0x55c74f20de7a

Finally this tend to end up with RocksDB corruption error:
rocksdb: [db/column_family.cc:781] [default] Stalling writes because we have 3 immutable memtables (waiting for flush), max_write_buffer_number is set to 4 rate 16777216
rocksdb: [db/db_impl_compaction_flush.cc:2123] Waiting after background flush error: Corruption: file is too short (6731261008 bytes) to be an sstabledb/767009.sstAccumulated background error counts: 1
rocksdb: [db/column_family.cc:781] [default] Stalling writes because we have 3 immutable memtables (waiting for flush), max_write_buffer_number is set to 4 rate 16777216
rocksdb: [db/column_family.cc:781] [default] Stalling writes because we have 3 immutable memtables (waiting for flush), max_write_buffer_number is set to 4 rate 16777216
rocksdb: submit_common error: Corruption: file is too short (6731261008 bytes) to be an sstabledb/767009.sst code = 2 Rocksdb transaction:
Delete( Prefix = M key = 0x0000000012161b12'-')
Delete( Prefix = M key = 0x0000000012161b12'.xxxxxxx')
Delete( Prefix = M key = 0x0000000012161b12'.xxxxxxx')
....


Related issues 4 (1 open3 closed)

Related to bluestore - Bug #49168: Bluefs improperly handles huge (>4GB) writes which causes data corruptionResolved

Actions
Related to bluestore - Bug #40300: ceph-osd segfault: "rocksdb: Corruption: file is too short"New

Actions
Copied to bluestore - Backport #52432: octopus: BlueFS might end-up with huge WAL files when upgrading OMAPsResolvedActions
Copied to bluestore - Backport #52433: pacific: BlueFS might end-up with huge WAL files when upgrading OMAPsResolvedActions
Actions

Also available in: Atom PDF