Bug #37452 (closed)

FAILED ceph_assert(prealloc_left == (int64_t)need)

Added by Kefu Chai over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Administration/Usability
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Component(RADOS): BlueStore
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2018-11-29T02:27:57.041 INFO:tasks.ceph.osd.7.smithi201.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.1-1243-gc61da20/rpm/el7/BUILD/ceph-14.0.1-1243-gc61da20/src/os/bluestore/BlueStore.cc: 11579: FAILED ceph_assert(prealloc_left == (int64_t)need)
2018-11-29T02:27:57.047 INFO:tasks.ceph.osd.7.smithi201.stderr: ceph version 14.0.1-1243-gc61da20 (c61da203adbbda46dfa3a14e474e2c81748e948f) nautilus (dev)
2018-11-29T02:27:57.047 INFO:tasks.ceph.osd.7.smithi201.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x5562363d3d92]
2018-11-29T02:27:57.047 INFO:tasks.ceph.osd.7.smithi201.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x5562363d3f60]
2018-11-29T02:27:57.047 INFO:tasks.ceph.osd.7.smithi201.stderr: 3: (BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*)+0x355c) [0x55623698c46c]
2018-11-29T02:27:57.047 INFO:tasks.ceph.osd.7.smithi201.stderr: 4: (BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)+0x2ad) [0x55623699df0d]
2018-11-29T02:27:57.048 INFO:tasks.ceph.osd.7.smithi201.stderr: 5: (BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)+0xcb) [0x55623699edcb]
2018-11-29T02:27:57.048 INFO:tasks.ceph.osd.7.smithi201.stderr: 6: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x13e6) [0x5562369aa116]
2018-11-29T02:27:57.048 INFO:tasks.ceph.osd.7.smithi201.stderr: 7: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x51c) [0x5562369ac3cc]
2018-11-29T02:27:57.048 INFO:tasks.ceph.osd.7.smithi201.stderr: 8: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x556236703634]
2018-11-29T02:27:57.048 INFO:tasks.ceph.osd.7.smithi201.stderr: 9: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x6b3) [0x5562367fec03]
2018-11-29T02:27:57.049 INFO:tasks.ceph.osd.7.smithi201.stderr: 10: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xe22) [0x5562366636a2]
2018-11-29T02:27:57.049 INFO:tasks.ceph.osd.7.smithi201.stderr: 11: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xf9e) [0x5562366bd1fe]
2018-11-29T02:27:57.050 INFO:tasks.ceph.osd.7.smithi201.stderr: 12: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x3955) [0x5562366c1475]
2018-11-29T02:27:57.050 INFO:tasks.ceph.osd.7.smithi201.stderr: 13: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xbd4) [0x5562366c2f04]
2018-11-29T02:27:57.050 INFO:tasks.ceph.osd.7.smithi201.stderr: 14: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a9) [0x55623650bc79]
2018-11-29T02:27:57.051 INFO:tasks.ceph.osd.7.smithi201.stderr: 15: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x556236796fc2]
2018-11-29T02:27:57.051 INFO:tasks.ceph.osd.7.smithi201.stderr: 16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa0c) [0x556236525c3c]
2018-11-29T02:27:57.051 INFO:tasks.ceph.osd.7.smithi201.stderr: 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x556236b45bf3]
2018-11-29T02:27:57.051 INFO:tasks.ceph.osd.7.smithi201.stderr: 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556236b507b0]
2018-11-29T02:27:57.051 INFO:tasks.ceph.osd.7.smithi201.stderr: 19: (()+0x7e25) [0x7f82b9ab8e25]
2018-11-29T02:27:57.051 INFO:tasks.ceph.osd.7.smithi201.stderr: 20: (clone()+0x6d) [0x7f82b8981bad]

It was found in a test batch. None of the included PRs touch BlueStore or the underlying allocator; the only related change is the one that switches the default objectstore type from "filestore" to "bluestore", which means quite a few tests are now exercised with the BlueStore backend.
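For readers decoding the backtrace: _do_alloc_write reserves the full number of bytes it needs from the allocator before writing and asserts that it got all of them. A minimal sketch of that pattern, with a simplified, hypothetical allocator (not the actual BlueStore code), shows why a nearly full device trips exactly this assert:

#include <cassert>
#include <cstdint>

// Hypothetical, simplified allocator used only for illustration.
struct ToyAllocator {
  uint64_t free_bytes;
  // Returns how much it could actually reserve, which may be less than
  // 'want' when the device is running out of space.
  int64_t allocate(uint64_t want) {
    uint64_t got = want <= free_bytes ? want : free_bytes;
    free_bytes -= got;
    return static_cast<int64_t>(got);
  }
};

void do_alloc_write(ToyAllocator& alloc, uint64_t need) {
  int64_t prealloc_left = alloc.allocate(need);
  // With only ~210 MB left on a 10 GB device, a large enough write makes
  // the allocator return less than 'need', and this assert fires.
  assert(prealloc_left == static_cast<int64_t>(need));
  // ... carve the reserved space into extents and queue the write ...
}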

$ file remote/*/coredump/*
remote/smithi154/coredump/1543458487.34631.core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'ceph-osd -f --cluster ceph -i 1'
remote/smithi154/coredump/1543459146.34634.core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'ceph-osd -f --cluster ceph -i 3'
remote/smithi154/coredump/1543459748.34630.core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'ceph-osd -f --cluster ceph -i 0'

Three OSDs crashed due to the same failure.

http://pulpito.ceph.com/kchai-2018-11-29_01:22:06-rados-wip-kefu2-testing-2018-11-29-0053-distro-basic-smithi/3289820/

Actions #1

Updated by Kefu Chai over 5 years ago

  • Description updated (diff)
Actions #2

Updated by Igor Fedotov over 5 years ago

The most probable root cause for this issue is lack of free space on the BlueStore main device. It is 10 GB by default and, as far as I can see from the logs, that is the only device per OSD; there are no standalone DB or WAL devices.
Here is the store_statfs dump from one of the OSDs 5 seconds before the crash:
2018-11-29 02:49:03.284 7f10f7595700 5 osd.0 51 heartbeat osd_stat(store_statfs(0xd258000/0x40000000/0x280000000, data 0x232c2e47f/0x232da4000, compress 0x0/0x0/0x0, omap 0x0, meta 0x40000000), peers [2,6] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9])

where:
0xd258000 - free space left
0x280000000 - total disk size
0x232c2e47f/0x232da4000 - data stored/allocated

So one needs to either increase the BlueStore volume size or reduce the test case duration.
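For the test-suite side, one way to enlarge the main device on the file-backed BlueStore used by these QA jobs is to raise bluestore_block_size in the job's ceph.conf; a hedged sketch (the exact override mechanism depends on the suite, and 20 GiB is only an illustrative value):

[osd]
# The file-backed BlueStore main device defaults to 10 GiB (the
# 0x280000000 seen above). Doubling it gives long rados runs more headroom.
bluestore block size = 21474836480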

Actions #3

Updated by Kefu Chai over 5 years ago

  • Category changed from Correctness/Safety to Tests

Igor, thanks for looking into it.

So before the OSD crashed, we had allocated 8.79 GB out of 10 GB, and only about 210 MB of free space was left by then. I will update the involved PR (https://github.com/ceph/ceph/pull/24642), which switches the default objectstore from filestore to bluestore.
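(A quick conversion of the statfs hex values above into human-readable sizes, for reference:)

#include <cstdint>
#include <cstdio>

int main() {
  // statfs values from the osd.0 dump quoted above
  const uint64_t free_bytes  = 0xd258000;    // available
  const uint64_t total_bytes = 0x280000000;  // main device size
  const uint64_t alloc_bytes = 0x232da4000;  // allocated for data
  std::printf("free      %.1f MB\n", free_bytes  / (1024.0 * 1024));         // ~210.4 MB
  std::printf("total     %.1f GB\n", total_bytes / (1024.0 * 1024 * 1024));  // 10.0 GB
  std::printf("allocated %.2f GB\n", alloc_bytes / (1024.0 * 1024 * 1024));  // ~8.79 GB
  return 0;
}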

But in the meantime, can we have a more user-friendly error message in this case? I can hardly tell we are out of space just by reading the log or the backtrace.

Actions #4

Updated by Kefu Chai over 5 years ago

  • Status changed from New to Rejected
Actions #5

Updated by Kefu Chai over 5 years ago

  • Category changed from Tests to Administration/Usability
  • Status changed from Rejected to New

I'd like to keep this open as a usability issue.

Actions #6

Updated by Igor Fedotov over 5 years ago

Kefu Chai wrote:

But in the meantime, can we have a more user-friendly error message in this case? I can hardly tell we are out of space just by reading the log or the backtrace.

Here it is: https://github.com/ceph/ceph/pull/25335
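For context, the shape of the improvement being asked for, sketched with hypothetical names (an illustration of the idea only, not the actual code from that PR):

#include <cerrno>
#include <cstdint>
#include <cstdio>

// Idea: when the allocator comes up short, report how much was needed,
// how much was reserved, and how much the device has left, then return
// -ENOSPC to the caller instead of tripping an opaque assert.
int check_prealloc(uint64_t need, uint64_t reserved, uint64_t device_free) {
  if (reserved < need) {
    std::fprintf(stderr,
                 "_do_alloc_write: failed to reserve 0x%llx bytes "
                 "(got 0x%llx, device free 0x%llx): out of space\n",
                 (unsigned long long)need,
                 (unsigned long long)reserved,
                 (unsigned long long)device_free);
    return -ENOSPC;
  }
  return 0;  // proceed with the write
}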

Actions #7

Updated by Kefu Chai over 5 years ago

  • Status changed from New to Resolved

Thanks, Igor!
