Bug #42099
OSDs crashed during the fio test

Added by Alex Litvak over 4 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: OSD
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I had a number of problems during a fio benchmark.

osd.17 crashed with the following logs:

Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 2019-09-30 22:52:58.867 7f093d71e700 1 bdev(0x55b72c156000 /var/lib/ceph/osd/ceph-17/block) aio_submit retries 16
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 2019-09-30 22:52:58.867 7f093d71e700 -1 bdev(0x55b72c156000 /var/lib/ceph/osd/ceph-17/block) aio submit got (11) Resource temporarily unavailable
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/os/bluestore/KernelDevice.cc: In fun
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/os/bluestore/KernelDevice.cc: 757: F
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55b71f668cf4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55b71f668ec2]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 3: (KernelDevice::aio_submit(IOContext*)+0x701) [0x55b71fd61ca1]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 4: (BlueStore::_txc_aio_submit(BlueStore::TransContext*)+0x42) [0x55b71fc29892]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 5: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x42b) [0x55b71fc496ab]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 6: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::T
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 7: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x55b71f9b1b84]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 8: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, s
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 9: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xf12) [0x55b71f90e322]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 10: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xfae) [0x55b71f969b7e]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 11: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x3965) [0x55b71f96de15]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xbd4) [0x55b71f96f8a4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a9) [0x55b71f7a9ea9]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 14: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55b71fa475d2]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x55b71f7c6ef4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55b71fdc5ce3]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b71fdc8d80]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 18: (()+0x7dd5) [0x7f0971da9dd5]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 19: (clone()+0x6d) [0x7f0970c7002d]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 2019-09-30 22:52:58.879 7f093d71e700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/os/bluestore/KernelDevice.cc: 757: F
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55b71f668cf4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55b71f668ec2]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 3: (KernelDevice::aio_submit(IOContext*)+0x701) [0x55b71fd61ca1]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 4: (BlueStore::_txc_aio_submit(BlueStore::TransContext*)+0x42) [0x55b71fc29892]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 5: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x42b) [0x55b71fc496ab]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 6: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::T
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 7: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x55b71f9b1b84]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 8: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, s
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 9: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xf12) [0x55b71f90e322]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 10: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xfae) [0x55b71f969b7e]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 11: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x3965) [0x55b71f96de15]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xbd4) [0x55b71f96f8a4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a9) [0x55b71f7a9ea9]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 14: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55b71fa475d2]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x55b71f7c6ef4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55b71fdc5ce3]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b71fdc8d80]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 18: (()+0x7dd5) [0x7f0971da9dd5]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 19: (clone()+0x6d) [0x7f0970c7002d]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: *** Caught signal (Aborted) **
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: in thread 7f093d71e700 thread_name:tp_osd_tp
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 1: (()+0xf5d0) [0x7f0971db15d0]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 2: (gsignal()+0x37) [0x7f0970ba82c7]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 3: (abort()+0x148) [0x7f0970ba99b8]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x55b71f668d43]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55b71f668ec2]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 6: (KernelDevice::aio_submit(IOContext*)+0x701) [0x55b71fd61ca1]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 7: (BlueStore::_txc_aio_submit(BlueStore::TransContext*)+0x42) [0x55b71fc29892]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 8: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x42b) [0x55b71fc496ab]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 9: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::T
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 10: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x55b71f9b1b84]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 11: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&,
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 12: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xf12) [0x55b71f90e322]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 13: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xfae) [0x55b71f969b7e]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 14: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x3965) [0x55b71f96de15]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 15: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xbd4) [0x55b71f96f8a4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a9) [0x55b71f7a9ea9]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 17: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55b71fa475d2]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 18: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x55b71f7c6ef4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 19: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55b71fdc5ce3]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 20: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b71fdc8d80]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 21: (()+0x7dd5) [0x7f0971da9dd5]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 22: (clone()+0x6d) [0x7f0970c7002d]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 2019-09-30 22:52:58.883 7f093d71e700 -1 *** Caught signal (Aborted) **
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: in thread 7f093d71e700 thread_name:tp_osd_tp
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 1: (()+0xf5d0) [0x7f0971db15d0]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 2: (gsignal()+0x37) [0x7f0970ba82c7]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 3: (abort()+0x148) [0x7f0970ba99b8]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x55b71f668d43]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55b71f668ec2]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 6: (KernelDevice::aio_submit(IOContext*)+0x701) [0x55b71fd61ca1]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 7: (BlueStore::_txc_aio_submit(BlueStore::TransContext*)+0x42) [0x55b71fc29892]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 8: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x42b) [0x55b71fc496ab]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 9: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::T
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 10: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x55b71f9b1b84]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 11: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&,
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 12: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xf12) [0x55b71f90e322]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 13: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xfae) [0x55b71f969b7e]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 14: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x3965) [0x55b71f96de15]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 15: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xbd4) [0x55b71f96f8a4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a9) [0x55b71f7a9ea9]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 17: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55b71fa475d2]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 18: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x55b71f7c6ef4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 19: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55b71fdc5ce3]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 20: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b71fdc8d80]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 21: (()+0x7dd5) [0x7f0971da9dd5]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 22: (clone()+0x6d) [0x7f0970c7002d]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: --- begin dump of recent events ---
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9999> 2019-09-30 20:46:02.076 7f0937f13700 5 osd.17 6328 heartbeat osd_stat(store_statfs(0x1a485594000/0x40000000/0x1bf00000000, data 0x19cfcadc2a/0x1a3aa68000, compress 0x0/0x0/0x0, omap 0xcdefc78, meta 0x3321038
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9998> 2019-09-30 20:46:03.776 7f0937f13700 5 osd.17 6328 heartbeat osd_stat(store_statfs(0x1a485594000/0x40000000/0x1bf00000000, data 0x19cfcadc2a/0x1a3aa68000, compress 0x0/0x0/0x0, omap 0xcdefc78, meta 0x3321038
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9997> 2019-09-30 20:46:04.277 7f0937f13700 5 osd.17 6328 heartbeat osd_stat(store_statfs(0x1a485594000/0x40000000/0x1bf00000000, data 0x19cfcadc2a/0x1a3aa68000, compress 0x0/0x0/0x0, omap 0xcdefc78, meta 0x3321038
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9996> 2019-09-30 20:46:04.777 7f0937f13700 5 osd.17 6328 heartbeat osd_stat(store_statfs(0x1a485594000/0x40000000/0x1bf00000000, data 0x19cfcadc2a/0x1a3aa68000, compress 0x0/0x0/0x0, omap 0xcdefc78, meta 0x3321038
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9995> 2019-09-30 20:46:04.905 7f0950744700 5 bluestore.MempoolThread(0x55b72c210a88) _tune_cache_size target: 8485076992 heap: 398680064 unmapped: 9035776 mapped: 389644288 old cache_size: 5064831794 new cache siz
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9994> 2019-09-30 20:46:04.905 7f0950744700 5 bluestore.MempoolThread(0x55b72c210a88) _trim_shards cache_size: 5064831794 kv_alloc: 1979711488 kv_used: 120635251 meta_alloc: 1979711488 meta_used: 117486636 data_all
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9993> 2019-09-30 20:46:05.813 7f094ccd2700 10 monclient: tick
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9992> 2019-09-30 20:46:05.813 7f094ccd2700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-09-30 20:45:35.819641)
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9991> 2019-09-30 20:46:09.913 7f0950744700 5 bluestore.MempoolThread(0x55b72c210a88) _tune_cache_size target: 8485076992 heap: 398680064 unmapped: 9035776 mapped: 389644288 old cache_size: 5064831794 new cache siz
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9990> 2019-09-30 20:46:09.913 7f0950744700 5 bluestore.MempoolThread(0x55b72c210a88) _trim_shards cache_size: 5064831794 kv_alloc: 1979711488 kv_used: 120635251 meta_alloc: 1979711488 meta_used: 117486836 data_all
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9989> 2019-09-30 20:46:10.681 7f0937f13700 5 osd.17 6328 heartbeat osd_stat(store_statfs(0x1a485594000/0x40000000/0x1bf00000000, data 0x19cfcadc2a/0x1a3aa68000, compress 0x0/0x0/0x0, omap 0xcdefc78, meta 0x3321038
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9988> 2019-09-30 20:46:14.569 7f094dcd4700 2 osd.17 6328 ms_handle_reset con 0x55b72c797000 session 0x55b7401f1800
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9987> 2019-09-30 20:46:14.569 7f096da92700 10 monclient: handle_auth_request added challenge on 0x55b733bf0400
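The `aio submit got (11) Resource temporarily unavailable` line is the key symptom: errno 11 is EAGAIN from `io_submit(2)`, and after the logged 16 retries KernelDevice hits the assert at KernelDevice.cc:757 and aborts. A minimal triage sketch (a hypothetical helper, not Ceph code; the sample line is truncated from the journal above) that extracts the errno from such a line and resolves its symbolic name:

```python
import errno
import re

# Sample bdev error line from the journal above (truncated to the relevant part).
line = ("bdev(0x55b72c156000 /var/lib/ceph/osd/ceph-17/block) "
        "aio submit got (11) Resource temporarily unavailable")

# Ceph prints the raw errno in parentheses: "aio submit got (11) ...".
m = re.search(r"aio submit got \((\d+)\)", line)
err = int(m.group(1))

# Map the number to its symbolic name; on Linux, 11 is EAGAIN
# (EWOULDBLOCK shares the same value).
print(err, errno.errorcode[err])
```

If the errno is EAGAIN, one commonly checked knob (an assumption here, not something this log confirms) is the kernel AIO limit, e.g. `sysctl fs.aio-max-nr`, since exhausting it makes `io_submit` fail exactly this way.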

osd.34 was timing out with:

Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.700 7f1ee93a8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1eb582d700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.700 7f1ee93a8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ebb038700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.700 7f1ee93a8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1eb582d700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.700 7f1ee93a8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ebb038700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.824 7f1ee8ba7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1eb582d700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.824 7f1ee8ba7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ebb038700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.824 7f1ee8ba7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1eb582d700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.824 7f1ee8ba7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ebb038700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.836 7f1ee83a6700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1eb582d700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.836 7f1ee83a6700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ebb038700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.836 7f1ee93a8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1eb582d700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.836 7f1ee93a8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ebb038700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.864 7f1ee8ba7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1eb582d700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.864 7f1ee8ba7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ebb038700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.864 7f1ee8ba7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1eb582d700' had timed out after 60

....

Sep 30 22:57:33 storage2n2-la ceph-osd-3410431: 2019-09-30 22:57:33.978 7f1ec604e700 0 bluestore(/var/lib/ceph/osd/ceph-34) _txc_committed_kv slow operation observed for l_bluestore_commit_lat, latency = 183.468s, txc = 0x559ef65a0300
Sep 30 22:57:33 storage2n2-la ceph-osd-3410431: 2019-09-30 22:57:33.978 7f1ec604e700 0 bluestore(/var/lib/ceph/osd/ceph-34) _txc_committed_kv slow operation observed for l_bluestore_commit_lat, latency = 183.395s, txc = 0x559f42720900
Sep 30 22:57:33 storage2n2-la ceph-osd-3410431: 2019-09-30 22:57:33.978 7f1ec604e700 0 bluestore(/var/lib/ceph/osd/ceph-34) _txc_committed_kv slow operation observed for l_bluestore_commit_lat, latency = 183.46s, txc = 0x559ef3990900
Sep 30 22:57:33 storage2n2-la ceph-osd-3410431: 2019-09-30 22:57:33.982 7f1ec504c700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.34 down, but it is still running
Sep 30 22:57:33 storage2n2-la ceph-osd-3410431: 2019-09-30 22:57:33.982 7f1ec504c700 0 log_channel(cluster) log [DBG] : map e6334 wrongly marked me down at e6334
Sep 30 22:57:33 storage2n2-la ceph-osd-3410431: 2019-09-30 22:57:33.982 7f1ec504c700 1 osd.34 6334 start_waiting_for_healthy
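The two osd.34 symptoms above (op-thread heartbeat timeouts and slow `_txc_committed_kv` commits) can be summarized from the journal with a short script. This is a hypothetical triage helper, not part of Ceph; the sample lines are copied from the excerpt above:

```python
import re
from collections import Counter

log = """\
2019-09-30 22:55:29.700 7f1ee93a8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1eb582d700' had timed out after 60
2019-09-30 22:55:29.700 7f1ee93a8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ebb038700' had timed out after 60
2019-09-30 22:57:33.978 7f1ec604e700 0 bluestore(/var/lib/ceph/osd/ceph-34) _txc_committed_kv slow operation observed for l_bluestore_commit_lat, latency = 183.468s, txc = 0x559ef65a0300
2019-09-30 22:57:33.978 7f1ec604e700 0 bluestore(/var/lib/ceph/osd/ceph-34) _txc_committed_kv slow operation observed for l_bluestore_commit_lat, latency = 183.395s, txc = 0x559f42720900
"""

# Count heartbeat timeouts per worker thread.
timeouts = Counter(re.findall(r"thread (0x[0-9a-f]+)' had timed out", log))

# Collect BlueStore commit latencies (in seconds) from the slow-op warnings.
lats = [float(s) for s in re.findall(r"latency = ([\d.]+)s", log)]

print(timeouts)
print(max(lats))
```

Commit latencies above 180 s on the same host where osd.17 aborted point at a shared device or kernel-level bottleneck rather than two independent OSD faults, which matches the observation below.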

Both OSDs were on the same node.


Files

osd.17.log.txt (18.5 KB) osd.17.log.txt Alex Litvak, 10/01/2019 07:09 AM
#1

Updated by Alex Litvak over 4 years ago

Alex Litvak wrote:

I had number of problems during fio bench mark.

osd.17 crashed showing following logs

2:52:58 storage2n2-la ceph-osd-1710770: 2019-09-30 22:52:58.867 7f093d71e700 1 bdev(0x55b72c156000 /var/lib/ceph/osd/ceph-17/block) aio_submit retries 16
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 2019-09-30 22:52:58.867 7f093d71e700 -1 bdev(0x55b72c156000 /var/lib/ceph/osd/ceph-17/block) aio submit got (11) Resource temporarily unavailable
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/os/bluestore/KernelDevice.cc: In fun
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/os/bluestore/KernelDevice.cc: 757: F
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55b71f668cf4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55b71f668ec2]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 3: (KernelDevice::aio_submit(IOContext*)+0x701) [0x55b71fd61ca1]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 4: (BlueStore::_txc_aio_submit(BlueStore::TransContext*)+0x42) [0x55b71fc29892]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 5: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x42b) [0x55b71fc496ab]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 6: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::T
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 7: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x55b71f9b1b84]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 8: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, s
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 9: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xf12) [0x55b71f90e322]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 10: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xfae) [0x55b71f969b7e]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 11: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x3965) [0x55b71f96de15]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xbd4) [0x55b71f96f8a4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a9) [0x55b71f7a9ea9]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 14: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55b71fa475d2]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x55b71f7c6ef4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55b71fdc5ce3]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b71fdc8d80]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 18: (()+0x7dd5) [0x7f0971da9dd5]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 19: (clone()+0x6d) [0x7f0970c7002d]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 2019-09-30 22:52:58.879 7f093d71e700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/os/bluestore/KernelDevice.cc: 757: F
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55b71f668cf4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55b71f668ec2]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 3: (KernelDevice::aio_submit(IOContext*)+0x701) [0x55b71fd61ca1]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 4: (BlueStore::_txc_aio_submit(BlueStore::TransContext*)+0x42) [0x55b71fc29892]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 5: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x42b) [0x55b71fc496ab]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 6: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::T
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 7: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x55b71f9b1b84]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 8: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, s
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 9: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xf12) [0x55b71f90e322]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 10: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xfae) [0x55b71f969b7e]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 11: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x3965) [0x55b71f96de15]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xbd4) [0x55b71f96f8a4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a9) [0x55b71f7a9ea9]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 14: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55b71fa475d2]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x55b71f7c6ef4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55b71fdc5ce3]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b71fdc8d80]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 18: (()+0x7dd5) [0x7f0971da9dd5]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 19: (clone()+0x6d) [0x7f0970c7002d]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: * Caught signal (Aborted) *
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: in thread 7f093d71e700 thread_name:tp_osd_tp
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 1: (()+0xf5d0) [0x7f0971db15d0]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 2: (gsignal()+0x37) [0x7f0970ba82c7]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 3: (abort()+0x148) [0x7f0970ba99b8]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 4: (ceph::__ceph_assert_fail(char const
, char const*, int, char const*)+0x199) [0x55b71f668d43]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55b71f668ec2]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 6: (KernelDevice::aio_submit(IOContext*)+0x701) [0x55b71fd61ca1]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 7: (BlueStore::_txc_aio_submit(BlueStore::TransContext*)+0x42) [0x55b71fc29892]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 8: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x42b) [0x55b71fc496ab]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 9: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::T
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 10: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x55b71f9b1b84]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 11: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&,
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 12: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xf12) [0x55b71f90e322]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 13: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xfae) [0x55b71f969b7e]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 14: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x3965) [0x55b71f96de15]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 15: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xbd4) [0x55b71f96f8a4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a9) [0x55b71f7a9ea9]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 17: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55b71fa475d2]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 18: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x55b71f7c6ef4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 19: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55b71fdc5ce3]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 20: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b71fdc8d80]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 21: (()+0x7dd5) [0x7f0971da9dd5]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 22: (clone()+0x6d) [0x7f0970c7002d]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 2019-09-30 22:52:58.883 7f093d71e700 -1 *** Caught signal (Aborted) **
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: in thread 7f093d71e700 thread_name:tp_osd_tp
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 1: (()+0xf5d0) [0x7f0971db15d0]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 2: (gsignal()+0x37) [0x7f0970ba82c7]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 3: (abort()+0x148) [0x7f0970ba99b8]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x55b71f668d43]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55b71f668ec2]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 6: (KernelDevice::aio_submit(IOContext*)+0x701) [0x55b71fd61ca1]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 7: (BlueStore::_txc_aio_submit(BlueStore::TransContext*)+0x42) [0x55b71fc29892]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 8: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x42b) [0x55b71fc496ab]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 9: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::T
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 10: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x55b71f9b1b84]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 11: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&,
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 12: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xf12) [0x55b71f90e322]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 13: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xfae) [0x55b71f969b7e]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 14: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x3965) [0x55b71f96de15]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 15: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xbd4) [0x55b71f96f8a4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a9) [0x55b71f7a9ea9]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 17: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55b71fa475d2]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 18: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x55b71f7c6ef4]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 19: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55b71fdc5ce3]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 20: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b71fdc8d80]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 21: (()+0x7dd5) [0x7f0971da9dd5]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: 22: (clone()+0x6d) [0x7f0970c7002d]
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: --- begin dump of recent events ---
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9999> 2019-09-30 20:46:02.076 7f0937f13700 5 osd.17 6328 heartbeat osd_stat(store_statfs(0x1a485594000/0x40000000/0x1bf00000000, data 0x19cfcadc2a/0x1a3aa68000, compress 0x0/0x0/0x0, omap 0xcdefc78, meta 0x3321038
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9998> 2019-09-30 20:46:03.776 7f0937f13700 5 osd.17 6328 heartbeat osd_stat(store_statfs(0x1a485594000/0x40000000/0x1bf00000000, data 0x19cfcadc2a/0x1a3aa68000, compress 0x0/0x0/0x0, omap 0xcdefc78, meta 0x3321038
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9997> 2019-09-30 20:46:04.277 7f0937f13700 5 osd.17 6328 heartbeat osd_stat(store_statfs(0x1a485594000/0x40000000/0x1bf00000000, data 0x19cfcadc2a/0x1a3aa68000, compress 0x0/0x0/0x0, omap 0xcdefc78, meta 0x3321038
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9996> 2019-09-30 20:46:04.777 7f0937f13700 5 osd.17 6328 heartbeat osd_stat(store_statfs(0x1a485594000/0x40000000/0x1bf00000000, data 0x19cfcadc2a/0x1a3aa68000, compress 0x0/0x0/0x0, omap 0xcdefc78, meta 0x3321038
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9995> 2019-09-30 20:46:04.905 7f0950744700 5 bluestore.MempoolThread(0x55b72c210a88) _tune_cache_size target: 8485076992 heap: 398680064 unmapped: 9035776 mapped: 389644288 old cache_size: 5064831794 new cache siz
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9994> 2019-09-30 20:46:04.905 7f0950744700 5 bluestore.MempoolThread(0x55b72c210a88) _trim_shards cache_size: 5064831794 kv_alloc: 1979711488 kv_used: 120635251 meta_alloc: 1979711488 meta_used: 117486636 data_all
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9993> 2019-09-30 20:46:05.813 7f094ccd2700 10 monclient: tick
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9992> 2019-09-30 20:46:05.813 7f094ccd2700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-09-30 20:45:35.819641)
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9991> 2019-09-30 20:46:09.913 7f0950744700 5 bluestore.MempoolThread(0x55b72c210a88) _tune_cache_size target: 8485076992 heap: 398680064 unmapped: 9035776 mapped: 389644288 old cache_size: 5064831794 new cache siz
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9990> 2019-09-30 20:46:09.913 7f0950744700 5 bluestore.MempoolThread(0x55b72c210a88) _trim_shards cache_size: 5064831794 kv_alloc: 1979711488 kv_used: 120635251 meta_alloc: 1979711488 meta_used: 117486836 data_all
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9989> 2019-09-30 20:46:10.681 7f0937f13700 5 osd.17 6328 heartbeat osd_stat(store_statfs(0x1a485594000/0x40000000/0x1bf00000000, data 0x19cfcadc2a/0x1a3aa68000, compress 0x0/0x0/0x0, omap 0xcdefc78, meta 0x3321038
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9988> 2019-09-30 20:46:14.569 7f094dcd4700 2 osd.17 6328 ms_handle_reset con 0x55b72c797000 session 0x55b7401f1800
Sep 30 22:52:58 storage2n2-la ceph-osd-1710770: -9987> 2019-09-30 20:46:14.569 7f096da92700 10 monclient: handle_auth_request added challenge on 0x55b733bf0400

osd.34 was timing out with:

Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.700 7f1ee93a8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1eb582d700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.700 7f1ee93a8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ebb038700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.700 7f1ee93a8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1eb582d700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.700 7f1ee93a8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ebb038700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.824 7f1ee8ba7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1eb582d700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.824 7f1ee8ba7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ebb038700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.824 7f1ee8ba7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1eb582d700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.824 7f1ee8ba7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ebb038700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.836 7f1ee83a6700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1eb582d700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.836 7f1ee83a6700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ebb038700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.836 7f1ee93a8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1eb582d700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.836 7f1ee93a8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ebb038700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.864 7f1ee8ba7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1eb582d700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.864 7f1ee8ba7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1ebb038700' had timed out after 60
Sep 30 22:55:29 storage2n2-la ceph-osd-3410431: 2019-09-30 22:55:29.864 7f1ee8ba7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1eb582d700' had timed out after 60

....

Sep 30 22:57:33 storage2n2-la ceph-osd-3410431: 2019-09-30 22:57:33.978 7f1ec604e700 0 bluestore(/var/lib/ceph/osd/ceph-34) _txc_committed_kv slow operation observed for l_bluestore_commit_lat, latency = 183.468s, txc = 0x559ef65a0300
Sep 30 22:57:33 storage2n2-la ceph-osd-3410431: 2019-09-30 22:57:33.978 7f1ec604e700 0 bluestore(/var/lib/ceph/osd/ceph-34) _txc_committed_kv slow operation observed for l_bluestore_commit_lat, latency = 183.395s, txc = 0x559f42720900
Sep 30 22:57:33 storage2n2-la ceph-osd-3410431: 2019-09-30 22:57:33.978 7f1ec604e700 0 bluestore(/var/lib/ceph/osd/ceph-34) _txc_committed_kv slow operation observed for l_bluestore_commit_lat, latency = 183.46s, txc = 0x559ef3990900
Sep 30 22:57:33 storage2n2-la ceph-osd-3410431: 2019-09-30 22:57:33.982 7f1ec504c700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.34 down, but it is still running
Sep 30 22:57:33 storage2n2-la ceph-osd-3410431: 2019-09-30 22:57:33.982 7f1ec504c700 0 log_channel(cluster) log [DBG] : map e6334 wrongly marked me down at e6334
Sep 30 22:57:33 storage2n2-la ceph-osd-3410431: 2019-09-30 22:57:33.982 7f1ec504c700 1 osd.34 6334 start_waiting_for_healthy

Both OSDs were on the same node.
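For the "aio submit got (11) Resource temporarily unavailable" error on osd.17, one kernel-side limit worth ruling out is the system-wide AIO request cap: io_submit() returns EAGAIN once fs.aio-nr reaches fs.aio-max-nr. A quick check (the 1048576 value is only an illustrative target, not a tuned recommendation):

```shell
# Allocated kernel AIO requests vs. the ceiling; io_submit() starts
# failing with EAGAIN once the ceiling is reached.
cat /proc/sys/fs/aio-nr
cat /proc/sys/fs/aio-max-nr

# If aio-nr sits near aio-max-nr, raising the cap (as root) may help:
# sysctl -w fs.aio-max-nr=1048576
```

Nodes running many BlueStore OSDs can plausibly exhaust the default cap, since each OSD sets up its own AIO contexts.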

#2

Updated by Alex Litvak over 4 years ago

I attached a text file with the log; hopefully it displays correctly.

#3

Updated by Alex Litvak over 4 years ago

I found a hardware issue that caused the assert on osd.17. I updated the firmware on the controller and the SSDs to see whether the issue returns on that node. However, while running the stress tests, at some point the timeouts reappeared on osd.34:

Oct 01 19:35:10 storage2n2-la ceph-osd-3411188: 2019-10-01 19:35:10.157 7f8d03150700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8cd45df700' had timed out after 60
Oct 01 19:35:10 storage2n2-la ceph-osd-3411188: 2019-10-01 19:35:10.157 7f8d03150700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8ccedd4700' had timed out after 60
Oct 01 19:35:10 storage2n2-la ceph-osd-3411188: 2019-10-01 19:35:10.157 7f8d03150700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8ccf5d5700' had timed out after 60
Oct 01 19:35:10 storage2n2-la ceph-osd-3411188: 2019-10-01 19:35:10.157 7f8d03150700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8ccfdd6700' had timed out after 60
Oct 01 19:35:10 storage2n2-la ceph-osd-3411188: 2019-10-01 19:35:10.157 7f8d03150700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8cd0dd8700' had timed out after 60
Oct 01 19:35:10 storage2n2-la ceph-osd-3411188: 2019-10-01 19:35:10.157 7f8d03150700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8cd2ddc700' had timed out after 60
Oct 01 19:35:10 storage2n2-la ceph-osd-3411188: 2019-10-01 19:35:10.157 7f8d03150700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8cd35dd700' had timed out after 60
Oct 01 19:35:10 storage2n2-la ceph-osd-3411188: 2019-10-01 19:35:10.157 7f8d03150700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8cd3dde700' had timed out after 60
Oct 01 19:35:10 storage2n2-la ceph-osd-3411188: 2019-10-01 19:35:10.157 7f8d03150700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8cd4de0700' had timed out after 60
Oct 01 19:35:10 storage2n2-la ceph-osd-3411188: 2019-10-01 19:35:10.157 7f8d03150700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8cd45df700' had timed out after 60
Oct 01 19:35:10 storage2n2-la ceph-osd-3411188: 2019-10-01 19:35:10.165 7f8d0214e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8ccedd4700' had timed out after 60
Oct 01 19:35:10 storage2n2-la ceph-osd-3411188: 2019-10-01 19:35:10.165 7f8d0214e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8ccf5d5700' had timed out after 60

#4

Updated by Greg Farnum over 4 years ago

  • Status changed from New to Closed

Timeouts are common if you manage to overload the cluster. We're continuously working to improve QoS and input limits so that doesn't happen, but we'd need more info about what you're doing to see whether it's a known case or to have any idea about improving it. One of the mailing lists is a better forum for working through that. :)
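For reference, the kind of detail that makes a report like this actionable is the exact benchmark definition, e.g. the fio job file that was run. A hypothetical example of the shape such a job might take (pool, image, and all tuning values below are placeholders, not the job that triggered the crash):

```ini
[global]
; rbd engine options (clientname/pool/rbdname) are placeholders
ioengine=rbd
clientname=admin
pool=bench
rbdname=bench-img
direct=1
runtime=300
time_based=1

[randwrite-4k]
rw=randwrite
bs=4k
iodepth=32
numjobs=4
```

Alongside the job file, the pool's replication/EC profile and the number of concurrent fio clients help distinguish a known overload pattern from a genuine bug.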

