Bug #45613
ObjectStore/StoreTestSpecificAUSize.SyntheticMatrixCompression/2 failed
0%
Description
2020-05-19T09:58:54.096 INFO:teuthology.orchestra.run.smithi135.stdout:[ RUN ] ObjectStore/StoreTestSpecificAUSize.SyntheticMatrixCompression/2 2020-05-19T09:58:56.724 INFO:teuthology.orchestra.run.smithi135.stdout:---------------------- 1 / 8 ---------------------- 2020-05-19T09:58:56.724 INFO:teuthology.orchestra.run.smithi135.stdout: bluestore_min_alloc_size = 4096 2020-05-19T09:58:56.724 INFO:teuthology.orchestra.run.smithi135.stdout: max_write = 1048576 2020-05-19T09:58:56.725 INFO:teuthology.orchestra.run.smithi135.stdout: max_size = 4194304 2020-05-19T09:58:56.725 INFO:teuthology.orchestra.run.smithi135.stdout: alignment = 65536 2020-05-19T09:58:56.725 INFO:teuthology.orchestra.run.smithi135.stdout: bluestore_compression_mode = force 2020-05-19T09:58:56.725 INFO:teuthology.orchestra.run.smithi135.stdout: bluestore_default_buffered_write = false 2020-05-19T09:58:56.725 INFO:teuthology.orchestra.run.smithi135.stdout: bluestore_sync_submit_transaction = true 2020-05-19T09:58:56.726 INFO:teuthology.orchestra.run.smithi135.stderr:seeding object 0 2020-05-19T09:58:56.862 INFO:teuthology.orchestra.run.smithi135.stderr:seeding object 500 2020-05-19T09:58:56.988 INFO:teuthology.orchestra.run.smithi135.stderr:Op 0 2020-05-19T09:58:56.988 INFO:teuthology.orchestra.run.smithi135.stderr:available_objects: 1000 in_flight_objects: 0 total objects: 1000 in_flight 0 2020-05-19T09:59:24.332 INFO:teuthology.orchestra.run.smithi135.stderr:Op 1000 2020-05-19T09:59:24.352 INFO:teuthology.orchestra.run.smithi135.stderr:available_objects: 1004 in_flight_objects: 0 total objects: 1004 in_flight 0 2020-05-19T09:59:55.582 INFO:teuthology.orchestra.run.smithi135.stderr:Op 2000 2020-05-19T09:59:55.583 INFO:teuthology.orchestra.run.smithi135.stderr:available_objects: 1011 in_flight_objects: 1 total objects: 1012 in_flight 1 2020-05-19T10:00:14.822 INFO:teuthology.orchestra.run.smithi135.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-1670-g1f885b6/rpm/el8/BUILD/ceph-16.0.0-1670-g1f885b6/src/test/objectstore/store_test.cc: In function 'void SyntheticWorkloadState::fsck(bool)' thread 7f8dfa42f240 time 2020-05-19T10:00:14.820590+0000 2020-05-19T10:00:14.823 INFO:teuthology.orchestra.run.smithi135.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-1670-g1f885b6/rpm/el8/BUILD/ceph-16.0.0-1670-g1f885b6/src/test/objectstore/store_test.cc: 4550: FAILED ceph_assert(r == 0 || r == -95) 2020-05-19T10:00:14.824 INFO:teuthology.orchestra.run.smithi135.stderr: ceph version 16.0.0-1670-g1f885b6 (1f885b6c2fdde1f3ab249a56f7b4b763556f4373) pacific (dev) 2020-05-19T10:00:14.824 INFO:teuthology.orchestra.run.smithi135.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f8def662d90] 2020-05-19T10:00:14.824 INFO:teuthology.orchestra.run.smithi135.stderr: 2: (()+0x275faa) [0x7f8def662faa] 2020-05-19T10:00:14.825 INFO:teuthology.orchestra.run.smithi135.stderr: 3: (StoreTest::doSyntheticTest(int, unsigned long, unsigned long, unsigned long)+0x15af) [0x5570f36b107f] 2020-05-19T10:00:14.825 INFO:teuthology.orchestra.run.smithi135.stderr: 4: (StoreTestSpecificAUSize::do_matrix_choose(char const* (*) [10], int, int, int, std::function<void (unsigned long, unsigned long, unsigned long, unsigned long)>)+0x1f6) [0x5570f36f7486] 2020-05-19T10:00:14.825 INFO:teuthology.orchestra.run.smithi135.stderr: 5: (StoreTestSpecificAUSize::do_matrix_choose(char const* (*) [10], int, int, int, std::function<void (unsigned long, unsigned long, unsigned long, unsigned long)>)+0x148) [0x5570f36f73d8] 2020-05-19T10:00:14.825 INFO:teuthology.orchestra.run.smithi135.stderr: 6: (StoreTestSpecificAUSize::do_matrix_choose(char const* (*) [10], int, int, int, std::function<void (unsigned long, unsigned long, unsigned long, unsigned long)>)+0x148) [0x5570f36f73d8] 2020-05-19T10:00:14.825 INFO:teuthology.orchestra.run.smithi135.stderr: 7: (StoreTestSpecificAUSize::do_matrix_choose(char const* (*) [10], int, int, int, std::function<void (unsigned long, unsigned long, unsigned long, unsigned long)>)+0x148) [0x5570f36f73d8] 2020-05-19T10:00:14.826 INFO:teuthology.orchestra.run.smithi135.stderr: 8: (StoreTestSpecificAUSize::do_matrix_choose(char const* (*) [10], int, int, int, std::function<void (unsigned long, unsigned long, unsigned long, unsigned long)>)+0x148) [0x5570f36f73d8] 2020-05-19T10:00:14.826 INFO:teuthology.orchestra.run.smithi135.stderr: 9: (StoreTestSpecificAUSize::do_matrix_choose(char const* (*) [10], int, int, int, std::function<void (unsigned long, unsigned long, unsigned long, unsigned long)>)+0x148) [0x5570f36f73d8] 2020-05-19T10:00:14.826 INFO:teuthology.orchestra.run.smithi135.stderr: 10: (StoreTestSpecificAUSize::do_matrix_choose(char const* (*) [10], int, int, int, std::function<void (unsigned long, unsigned long, unsigned long, unsigned long)>)+0x148) [0x5570f36f73d8] 2020-05-19T10:00:14.826 INFO:teuthology.orchestra.run.smithi135.stderr: 11: (StoreTestSpecificAUSize::do_matrix(char const* (*) [10], std::function<void (unsigned long, unsigned long, unsigned long, unsigned long)>)+0x1ee) [0x5570f36f7d1e] 2020-05-19T10:00:14.827 INFO:teuthology.orchestra.run.smithi135.stderr: 12: (StoreTestSpecificAUSize_SyntheticMatrixCompression_Test::TestBody()+0x20b) [0x5570f3641aeb] 2020-05-19T10:00:14.827 INFO:teuthology.orchestra.run.smithi135.stderr: 13: (void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x8f) [0x5570f39a9e5f] 2020-05-19T10:00:14.827 INFO:teuthology.orchestra.run.smithi135.stderr: 14: (testing::Test::Run()+0xcb) [0x5570f399f7ab] 2020-05-19T10:00:14.827 INFO:teuthology.orchestra.run.smithi135.stderr: 15: (testing::TestInfo::Run()+0x135) [0x5570f399f915] 2020-05-19T10:00:14.827 INFO:teuthology.orchestra.run.smithi135.stderr: 16: (testing::TestSuite::Run()+0xb5) [0x5570f399f9f5] 2020-05-19T10:00:14.828 INFO:teuthology.orchestra.run.smithi135.stderr: 17: (testing::internal::UnitTestImpl::RunAllTests()+0x425) [0x5570f399feb5] 2020-05-19T10:00:14.828 INFO:teuthology.orchestra.run.smithi135.stderr: 18: (bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x8f) [0x5570f39aa3cf] 2020-05-19T10:00:14.828 INFO:teuthology.orchestra.run.smithi135.stderr: 19: (testing::UnitTest::Run()+0xa0) [0x5570f39a0160] 2020-05-19T10:00:14.828 INFO:teuthology.orchestra.run.smithi135.stderr: 20: (main()+0x700) [0x5570f3620bf0] 2020-05-19T10:00:14.828 INFO:teuthology.orchestra.run.smithi135.stderr: 21: (__libc_start_main()+0xf3) [0x7f8ded9a2873] 2020-05-19T10:00:14.828 INFO:teuthology.orchestra.run.smithi135.stderr: 22: (_start()+0x2e) [0x5570f363faae]
/a/nojha-2020-05-19_00:53:41-rados-wip-revert-34894-distro-basic-smithi/5068255
Related issues
History
#1 Updated by Igor Fedotov almost 4 years ago
store_test's log contains:
-17> 2020-05-19T10:00:14.304+0000 7f8dfa42f240 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000037.log: dropping 534816 bytes; Corruption: error in middle of record
-16> 2020-05-19T10:00:14.304+0000 7f8dfa42f240 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000037.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
-15> 2020-05-19T10:00:14.304+0000 7f8dfa42f240 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000037.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
-14> 2020-05-19T10:00:14.304+0000 7f8dfa42f240 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000037.log: dropping 27445 bytes; Corruption: missing start of fragmented record(2)
which looks to some degree similar to the report at:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/CX5PRFGL6UBFMOJC6CLUMLPMT4B2CXVQ/
So I presume this is a recent regression unrelated to compression or test case itself.
#2 Updated by Neha Ojha almost 4 years ago
Reproduces pretty easily on master: /a/nojha-2020-05-20_19:45:21-rados-master-distro-basic-smithi/5073678. Haven't been able to reproduce it on octopus yet.
#3 Updated by Brad Hubbard almost 4 years ago
/a/bhubbard-2020-05-20_04:16:41-rados-wip-badone-testing-7-distro-basic-smithi/5071455
#4 Updated by Brad Hubbard almost 4 years ago
[Redacted]
#5 Updated by Igor Fedotov almost 4 years ago
- Status changed from New to Fix Under Review
- Backport set to octopus
- Pull request ID set to 35201
#6 Updated by Igor Fedotov almost 4 years ago
So the bug is caused by submitting overlapping write requests for BlueFS WAL via libaio. BlueFS::_flush_range might extend the first request with zero filled data block to fill tail of RocksDB WAL when preexteded WALs are enabled. This request might race with the subsequent request adding data to a WAL since it looks like libaio might execute input requests in arbitrary order. As a result valid WAL data might be overwritten with zeros.
Aside from WAL preextending the final trigger for the bug to appear was recent disabling bluefs buffer io which brought libaio write path into an action.
#7 Updated by Dan van der Ster almost 4 years ago
@Igor: So setting bluefs_preextend_wal_files=false and/or bluefs_buffered_io=true should workaround the issue until the patch is published?
#8 Updated by Igor Fedotov almost 4 years ago
@Dan - right. Setting bluefs_preextend_wal_files seems better to me (one can even try to do that on the fly) but I'd like to hear about the actual experience from the field - still not 100% certain it's absolutely safe.
#9 Updated by Igor Fedotov almost 4 years ago
At the same time IIUC it's OSD restart which reveals data corruption - while OSD is running it doesn't read from WAL and hence corruption isn't effective. OSD even can heal itself by recycling WAL file.
#10 Updated by Dan van der Ster almost 4 years ago
OK thanks. I'm also just confirming that nautilus would be immune, even though 14.2.10 will change bluefs_bufferio_io to false. (Because bluefs_preextend_wal_files still defaults to false).
#11 Updated by Neha Ojha almost 4 years ago
- Priority changed from Normal to Urgent
#12 Updated by Neha Ojha almost 4 years ago
https://github.com/ceph/ceph/pull/35293 - Fix for octopus has been merged and released in 15.2.3.
#13 Updated by Igor Fedotov almost 4 years ago
- Status changed from Fix Under Review to Resolved
#14 Updated by Igor Fedotov almost 4 years ago
- Status changed from Resolved to Fix Under Review
#15 Updated by Igor Fedotov almost 4 years ago
- Pull request ID changed from 35201 to 35497
Original fix was https://github.com/ceph/ceph/pull/35201 but it was decided to kill WAL preextending instead, see https://github.com/ceph/ceph/pull/35497
#16 Updated by Igor Fedotov over 3 years ago
- Status changed from Fix Under Review to Pending Backport
#17 Updated by Nathan Cutler over 3 years ago
- Copied to Backport #46350: octopus: ObjectStore/StoreTestSpecificAUSize.SyntheticMatrixCompression/2 failed added
#18 Updated by Nathan Cutler over 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
#19 Updated by Neha Ojha about 3 years ago
- Duplicated by Bug #48827: Ceph Bluestore OSDs fail to start on WAL corruption added