Bug #49170

BlueFS might end up with huge WAL files when upgrading OMAPs

Added by Igor Fedotov 6 months ago. Updated 4 months ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
pacific, octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When performing an OMAP on-disk format upgrade, BlueStore's fsck might flood BlueFS with conversion transactions, which results in huge WAL files at BlueFS.
This presumably tends to happen with slow DB drives and large OMAPs.
Symptoms:
Multiple warnings like the following:
fsck warning: #6:05e72f0d:::.dir.334d8ebf-4ddc-4f24-bd2f-9b3bc6fa711e.3268439386.1.633:head# has omap that is not per-pool or pgmeta
...
Multiple indications similar to:
rocksdb: [db/column_family.cc:826] [default] Stalling writes because of estimated pending compaction bytes 69323286921 rate 16777213
or
rocksdb: [db/column_family.cc:826] [default] Stalling writes because of estimated pending compaction bytes 71251735338 rate 16777216

then bulk disk space allocation:
bluefs _allocate unable to allocate 0x43a313b on bdev 2, free 0xffffffffffffffff; fallback to slow device expander
bluefs _add_block_extent bdev 1 0xd770940000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xd7b0940000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xd8600f0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xd8a00f0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xd8e00f0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xd9200f0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xd9814c0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xd9c14c0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xda014c0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xda414c0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xda814c0000~40000000 skip 0
bluefs _add_block_extent bdev 1 0xd6f51b0000~38740000 skip 0

followed by bulk RAM allocation:
tcmalloc: large alloc 2105548800 bytes 0x55cabfbc0000 0x7fe00a02af2d 0x7fe00a04af8b 0x55c74f216b69 0x55c74f216ce3 0x55c74fd592a7 0x55c74fda0c2e 0x55c74fda0d3a 0x55c74fbecf8b 0x55c74fbef6cd 0x55c74f7212e3 0x55c74f7239cb 0x55c74f761ff8 0x55c74f726f62 0x55c74f727ff7 0x55c74f735652 0x55c74f28740a 0x55c74f1fb3ca 0x7fe0097ca09b 0x55c74f20de7a
tcmalloc: large alloc 4211089408 bytes 0x55cb3d3c2000
0x7fe00a02af2d 0x7fe00a04af8b 0x55c74f216b69 0x55c74f216ce3 0x55c74fd592a7 0x55c74fda0c2e 0x55c74fda0d3a 0x55c74fbecf8b 0x55c74fbef6cd 0x55c74f7212e3 0x55c74f7239cb 0x55c74f761ff8 0x55c74f726f62 0x55c74f727ff7 0x55c74f735652 0x55c74f28740a 0x55c74f1fb3ca 0x7fe0097ca09b 0x55c74f20de7a
tcmalloc: large alloc 8422170624 bytes == 0x55cc38bc4000 @ 0x7fe00a02af2d 0x7fe00a04af8b 0x55c74f216b69 0x55c74f216ce3 0x55c74fd592a7 0x55c74fda0c2e 0x55c74fda0d3a 0x55c74fbecf8b 0x55c74fbef6cd 0x55c74f7212e3 0x55c74f7239cb 0x55c74f761ff8 0x55c74f726f62 0x55c74f727ff7 0x55c74f735652 0x55c74f28740a 0x55c74f1fb3ca 0x7fe0097ca09b 0x55c74f20de7a
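The sizes in these log lines are consistent with runaway WAL growth: each `bluefs _add_block_extent` above grabs 0x40000000 bytes (1 GiB) from the slow device, and the tcmalloc allocations roughly double each time, crossing the 4 GiB mark on the last step. A quick sanity check of those figures (values copied verbatim from the log lines above):

```python
# Extent size from the bluefs _add_block_extent lines: 0x40000000 bytes.
assert 0x40000000 == 2**30  # 1 GiB per fallback extent on the slow device

# Allocation sizes from the tcmalloc "large alloc" lines, converted to GiB.
allocs = [2105548800, 4211089408, 8422170624]
gib = [round(a / 2**30, 2) for a in allocs]
print(gib)  # [1.96, 3.92, 7.84] -- the buffer doubles right past 4 GiB
```

That last doubling is what pushes a single BlueFS write beyond the 4 GiB boundary mishandled in related Bug #49168.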

Finally, this tends to end up with a RocksDB corruption error:
rocksdb: [db/column_family.cc:781] [default] Stalling writes because we have 3 immutable memtables (waiting for flush), max_write_buffer_number is set to 4 rate 16777216
rocksdb: [db/db_impl_compaction_flush.cc:2123] Waiting after background flush error: Corruption: file is too short (6731261008 bytes) to be an sstabledb/767009.sstAccumulated background error counts: 1
rocksdb: [db/column_family.cc:781] [default] Stalling writes because we have 3 immutable memtables (waiting for flush), max_write_buffer_number is set to 4 rate 16777216
rocksdb: [db/column_family.cc:781] [default] Stalling writes because we have 3 immutable memtables (waiting for flush), max_write_buffer_number is set to 4 rate 16777216
rocksdb: submit_common error: Corruption: file is too short (6731261008 bytes) to be an sstabledb/767009.sst code = 2 Rocksdb transaction:
Delete( Prefix = M key = 0x0000000012161b12'-')
Delete( Prefix = M key = 0x0000000012161b12'.xxxxxxx')
Delete( Prefix = M key = 0x0000000012161b12'.xxxxxxx')
....


Related issues

Related to bluestore - Bug #49168: Bluefs improperly handles huge (>4GB) writes which causes data corruption Resolved
Related to bluestore - Bug #40300: ceph-osd segfault: "rocksdb: Corruption: file is too short" New

History

#1 Updated by Igor Fedotov 6 months ago

  • Related to Bug #49168: Bluefs improperly handles huge (>4GB) writes which causes data corruption added

#2 Updated by Igor Fedotov 6 months ago

Huge WAL files aren't good by themselves, but BlueFS also has a bug in handling >4GB writes, which presumably causes data corruption.
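A minimal sketch of why a >4GB write is hazardous, assuming a write length somewhere passes through a 32-bit field (a hypothetical illustration only, not the actual BlueFS code):

```python
# Hypothetical illustration (not BlueFS code): if a 64-bit write length
# is ever recorded in a 32-bit field, only the low 32 bits survive.
pending = (1 << 32) + 0x1000       # one transaction of ~4 GiB + 4 KiB
truncated = pending & 0xFFFFFFFF   # what a u32 length field would store
assert truncated == 0x1000         # ~4 GiB of data silently unaccounted for
```

Any on-disk metadata derived from such a truncated length would no longer match the data actually written, which is one plausible route to the "file is too short" corruption seen above.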

#3 Updated by Igor Fedotov 6 months ago

  • Related to Bug #40300: ceph-osd segfault: "rocksdb: Corruption: file is too short" added

#4 Updated by Igor Fedotov 6 months ago

And I believe we've heard of some more cases where BlueFS got corrupted after upgrading to Octopus.

#5 Updated by Igor Fedotov 6 months ago

Here is a sample of such a huge (7+GB) WAL file:

0x0:  op_file_update  file(ino 431989 size 0x1cf0c7888 mtime 2021-02-04T09:55:30.729220+0100 allocated 1cf0d0000 extents 
...
0x0: op_dir_link db.wal/542175.log to 431989
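The `size` field in the `op_file_update` record above confirms the "7+GB" figure:

```python
size = 0x1cf0c7888  # "size" from the op_file_update record above
print(f"{size / 2**30:.2f} GiB")  # → 7.24 GiB
```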

#6 Updated by Igor Fedotov 6 months ago

  • Status changed from New to In Progress
  • Backport set to pacific, octopus, nautilus, mimic
  • Pull request ID set to 39377

#7 Updated by Igor Fedotov 6 months ago

  • Backport changed from pacific, octopus, nautilus, mimic to pacific, octopus

#8 Updated by Igor Fedotov 4 months ago

  • Status changed from In Progress to Pending Backport

#9 Updated by Neha Ojha 4 months ago

  • Status changed from Pending Backport to Fix Under Review
