Project

General

Profile

Bug #45613

ObjectStore/StoreTestSpecificAUSize.SyntheticMatrixCompression/2 failed

Added by Neha Ojha about 2 months ago. Updated 11 days ago.

Status:
Pending Backport
Priority:
Urgent
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

2020-05-19T09:58:54.096 INFO:teuthology.orchestra.run.smithi135.stdout:[ RUN      ] ObjectStore/StoreTestSpecificAUSize.SyntheticMatrixCompression/2
2020-05-19T09:58:56.724 INFO:teuthology.orchestra.run.smithi135.stdout:---------------------- 1 / 8 ----------------------
2020-05-19T09:58:56.724 INFO:teuthology.orchestra.run.smithi135.stdout:  bluestore_min_alloc_size = 4096
2020-05-19T09:58:56.724 INFO:teuthology.orchestra.run.smithi135.stdout:  max_write = 1048576
2020-05-19T09:58:56.725 INFO:teuthology.orchestra.run.smithi135.stdout:  max_size = 4194304
2020-05-19T09:58:56.725 INFO:teuthology.orchestra.run.smithi135.stdout:  alignment = 65536
2020-05-19T09:58:56.725 INFO:teuthology.orchestra.run.smithi135.stdout:  bluestore_compression_mode = force
2020-05-19T09:58:56.725 INFO:teuthology.orchestra.run.smithi135.stdout:  bluestore_default_buffered_write = false
2020-05-19T09:58:56.725 INFO:teuthology.orchestra.run.smithi135.stdout:  bluestore_sync_submit_transaction = true
2020-05-19T09:58:56.726 INFO:teuthology.orchestra.run.smithi135.stderr:seeding object 0
2020-05-19T09:58:56.862 INFO:teuthology.orchestra.run.smithi135.stderr:seeding object 500
2020-05-19T09:58:56.988 INFO:teuthology.orchestra.run.smithi135.stderr:Op 0
2020-05-19T09:58:56.988 INFO:teuthology.orchestra.run.smithi135.stderr:available_objects: 1000 in_flight_objects: 0 total objects: 1000 in_flight 0
2020-05-19T09:59:24.332 INFO:teuthology.orchestra.run.smithi135.stderr:Op 1000
2020-05-19T09:59:24.352 INFO:teuthology.orchestra.run.smithi135.stderr:available_objects: 1004 in_flight_objects: 0 total objects: 1004 in_flight 0
2020-05-19T09:59:55.582 INFO:teuthology.orchestra.run.smithi135.stderr:Op 2000
2020-05-19T09:59:55.583 INFO:teuthology.orchestra.run.smithi135.stderr:available_objects: 1011 in_flight_objects: 1 total objects: 1012 in_flight 1
2020-05-19T10:00:14.822 INFO:teuthology.orchestra.run.smithi135.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-1670-g1f885b6/rpm/el8/BUILD/ceph-16.0.0-1670-g1f885b6/src/test/objectstore/store_test.cc: In function 'void SyntheticWorkloadState::fsck(bool)' thread 7f8dfa42f240 time 2020-05-19T10:00:14.820590+0000
2020-05-19T10:00:14.823 INFO:teuthology.orchestra.run.smithi135.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-1670-g1f885b6/rpm/el8/BUILD/ceph-16.0.0-1670-g1f885b6/src/test/objectstore/store_test.cc: 4550: FAILED ceph_assert(r == 0 || r == -95)
2020-05-19T10:00:14.824 INFO:teuthology.orchestra.run.smithi135.stderr: ceph version 16.0.0-1670-g1f885b6 (1f885b6c2fdde1f3ab249a56f7b4b763556f4373) pacific (dev)
2020-05-19T10:00:14.824 INFO:teuthology.orchestra.run.smithi135.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f8def662d90]
2020-05-19T10:00:14.824 INFO:teuthology.orchestra.run.smithi135.stderr: 2: (()+0x275faa) [0x7f8def662faa]
2020-05-19T10:00:14.825 INFO:teuthology.orchestra.run.smithi135.stderr: 3: (StoreTest::doSyntheticTest(int, unsigned long, unsigned long, unsigned long)+0x15af) [0x5570f36b107f]
2020-05-19T10:00:14.825 INFO:teuthology.orchestra.run.smithi135.stderr: 4: (StoreTestSpecificAUSize::do_matrix_choose(char const* (*) [10], int, int, int, std::function<void (unsigned long, unsigned long, unsigned long, unsigned long)>)+0x1f6) [0x5570f36f7486]
2020-05-19T10:00:14.825 INFO:teuthology.orchestra.run.smithi135.stderr: 5: (StoreTestSpecificAUSize::do_matrix_choose(char const* (*) [10], int, int, int, std::function<void (unsigned long, unsigned long, unsigned long, unsigned long)>)+0x148) [0x5570f36f73d8]
2020-05-19T10:00:14.825 INFO:teuthology.orchestra.run.smithi135.stderr: 6: (StoreTestSpecificAUSize::do_matrix_choose(char const* (*) [10], int, int, int, std::function<void (unsigned long, unsigned long, unsigned long, unsigned long)>)+0x148) [0x5570f36f73d8]
2020-05-19T10:00:14.825 INFO:teuthology.orchestra.run.smithi135.stderr: 7: (StoreTestSpecificAUSize::do_matrix_choose(char const* (*) [10], int, int, int, std::function<void (unsigned long, unsigned long, unsigned long, unsigned long)>)+0x148) [0x5570f36f73d8]
2020-05-19T10:00:14.826 INFO:teuthology.orchestra.run.smithi135.stderr: 8: (StoreTestSpecificAUSize::do_matrix_choose(char const* (*) [10], int, int, int, std::function<void (unsigned long, unsigned long, unsigned long, unsigned long)>)+0x148) [0x5570f36f73d8]
2020-05-19T10:00:14.826 INFO:teuthology.orchestra.run.smithi135.stderr: 9: (StoreTestSpecificAUSize::do_matrix_choose(char const* (*) [10], int, int, int, std::function<void (unsigned long, unsigned long, unsigned long, unsigned long)>)+0x148) [0x5570f36f73d8]
2020-05-19T10:00:14.826 INFO:teuthology.orchestra.run.smithi135.stderr: 10: (StoreTestSpecificAUSize::do_matrix_choose(char const* (*) [10], int, int, int, std::function<void (unsigned long, unsigned long, unsigned long, unsigned long)>)+0x148) [0x5570f36f73d8]
2020-05-19T10:00:14.826 INFO:teuthology.orchestra.run.smithi135.stderr: 11: (StoreTestSpecificAUSize::do_matrix(char const* (*) [10], std::function<void (unsigned long, unsigned long, unsigned long, unsigned long)>)+0x1ee) [0x5570f36f7d1e]
2020-05-19T10:00:14.827 INFO:teuthology.orchestra.run.smithi135.stderr: 12: (StoreTestSpecificAUSize_SyntheticMatrixCompression_Test::TestBody()+0x20b) [0x5570f3641aeb]
2020-05-19T10:00:14.827 INFO:teuthology.orchestra.run.smithi135.stderr: 13: (void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x8f) [0x5570f39a9e5f]
2020-05-19T10:00:14.827 INFO:teuthology.orchestra.run.smithi135.stderr: 14: (testing::Test::Run()+0xcb) [0x5570f399f7ab]
2020-05-19T10:00:14.827 INFO:teuthology.orchestra.run.smithi135.stderr: 15: (testing::TestInfo::Run()+0x135) [0x5570f399f915]
2020-05-19T10:00:14.827 INFO:teuthology.orchestra.run.smithi135.stderr: 16: (testing::TestSuite::Run()+0xb5) [0x5570f399f9f5]
2020-05-19T10:00:14.828 INFO:teuthology.orchestra.run.smithi135.stderr: 17: (testing::internal::UnitTestImpl::RunAllTests()+0x425) [0x5570f399feb5]
2020-05-19T10:00:14.828 INFO:teuthology.orchestra.run.smithi135.stderr: 18: (bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x8f) [0x5570f39aa3cf]
2020-05-19T10:00:14.828 INFO:teuthology.orchestra.run.smithi135.stderr: 19: (testing::UnitTest::Run()+0xa0) [0x5570f39a0160]
2020-05-19T10:00:14.828 INFO:teuthology.orchestra.run.smithi135.stderr: 20: (main()+0x700) [0x5570f3620bf0]
2020-05-19T10:00:14.828 INFO:teuthology.orchestra.run.smithi135.stderr: 21: (__libc_start_main()+0xf3) [0x7f8ded9a2873]
2020-05-19T10:00:14.828 INFO:teuthology.orchestra.run.smithi135.stderr: 22: (_start()+0x2e) [0x5570f363faae]

/a/nojha-2020-05-19_00:53:41-rados-wip-revert-34894-distro-basic-smithi/5068255


Related issues

Copied to bluestore - Backport #46350: octopus: ObjectStore/StoreTestSpecificAUSize.SyntheticMatrixCompression/2 failed New

History

#1 Updated by Igor Fedotov about 2 months ago

store_test's log contains:
-17> 2020-05-19T10:00:14.304+0000 7f8dfa42f240 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000037.log: dropping 534816 bytes; Corruption: error in middle of record
-16> 2020-05-19T10:00:14.304+0000 7f8dfa42f240 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000037.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
-15> 2020-05-19T10:00:14.304+0000 7f8dfa42f240 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000037.log: dropping 32757 bytes; Corruption: missing start of fragmented record(1)
-14> 2020-05-19T10:00:14.304+0000 7f8dfa42f240 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000037.log: dropping 27445 bytes; Corruption: missing start of fragmented record(2)

which looks to some degree similar to the report at:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/CX5PRFGL6UBFMOJC6CLUMLPMT4B2CXVQ/

So I presume this is a recent regression unrelated to compression or test case itself.

#2 Updated by Neha Ojha about 2 months ago

Reproduces pretty easily on master: /a/nojha-2020-05-20_19:45:21-rados-master-distro-basic-smithi/5073678. Haven't been able to reproduce it on octopus yet.

#3 Updated by Brad Hubbard about 2 months ago

/a/bhubbard-2020-05-20_04:16:41-rados-wip-badone-testing-7-distro-basic-smithi/5071455

#4 Updated by Brad Hubbard about 2 months ago

[Redacted]

#5 Updated by Igor Fedotov about 2 months ago

  • Status changed from New to Fix Under Review
  • Backport set to octopus
  • Pull request ID set to 35201

#6 Updated by Igor Fedotov about 2 months ago

So the bug is caused by submitting overlapping write requests for BlueFS WAL via libaio. BlueFS::_flush_range might extend the first request with zero filled data block to fill tail of RocksDB WAL when preexteded WALs are enabled. This request might race with the subsequent request adding data to a WAL since it looks like libaio might execute input requests in arbitrary order. As a result valid WAL data might be overwritten with zeros.
Aside from WAL preextending the final trigger for the bug to appear was recent disabling bluefs buffer io which brought libaio write path into an action.

#7 Updated by Dan van der Ster about 2 months ago

@Igor: So setting bluefs_preextend_wal_files=false and/or bluefs_buffered_io=true should workaround the issue until the patch is published?

#8 Updated by Igor Fedotov about 2 months ago

@Dan - right. Setting bluefs_preextend_wal_files seems better to me (one can even try to do that on the fly) but I'd like to hear about the actual experience from the field - still not 100% certain it's absolutely safe.

#9 Updated by Igor Fedotov about 2 months ago

At the same time IIUC it's OSD restart which reveals data corruption - while OSD is running it doesn't read from WAL and hence corruption isn't effective. OSD even can heal itself by recycling WAL file.

#10 Updated by Dan van der Ster about 2 months ago

OK thanks. I'm also just confirming that nautilus would be immune, even though 14.2.10 will change bluefs_bufferio_io to false. (Because bluefs_preextend_wal_files still defaults to false).

#11 Updated by Neha Ojha about 2 months ago

  • Priority changed from Normal to Urgent

#12 Updated by Neha Ojha about 1 month ago

https://github.com/ceph/ceph/pull/35293 - Fix for octopus has been merged and released in 15.2.3.

#13 Updated by Igor Fedotov about 1 month ago

  • Status changed from Fix Under Review to Resolved

#14 Updated by Igor Fedotov about 1 month ago

  • Status changed from Resolved to Fix Under Review

#15 Updated by Igor Fedotov about 1 month ago

  • Pull request ID changed from 35201 to 35497

Original fix was https://github.com/ceph/ceph/pull/35201 but it was decided to kill WAL preextending instead, see https://github.com/ceph/ceph/pull/35497

#16 Updated by Igor Fedotov 11 days ago

  • Status changed from Fix Under Review to Pending Backport

#17 Updated by Nathan Cutler 10 days ago

  • Copied to Backport #46350: octopus: ObjectStore/StoreTestSpecificAUSize.SyntheticMatrixCompression/2 failed added

Also available in: Atom PDF