Bug #40642 (closed): Bluestore crash due to mass activation on another pool

Added by Nigel Williams almost 5 years ago. Updated almost 5 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: -
Regression: No
Severity: 2 - major
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(RADOS): -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

Newly deployed Nautilus cluster with SSD and HDD pools on Ubuntu 18.04.2 with kernel 4.15.0-54.

When adding a dozen HDD OSDs at once via:

ceph-volume lvm prepare --bluestore --data /dev/sdh --block.wal /dev/sdt1 --block.db /dev/sdt2 --crush-device-class hdd
...repeated, with the appropriate devices, for each of the new OSDs...

ceph-volume lvm activate --all
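
The per-device prepare commands lend themselves to a small loop. A minimal sketch, shown as a dry run that only prints the commands it would issue (the HDD list and the WAL/DB partition numbering on /dev/sdt are illustrative assumptions, not taken from the report):

```shell
#!/bin/sh
# Dry run: print one prepare command per HDD, pairing each disk with a WAL
# partition and a DB partition on the shared SSD /dev/sdt, then print the
# final activate. Device names are hypothetical; substitute your own disks.
i=1
for dev in /dev/sdh /dev/sdi /dev/sdj; do
    echo ceph-volume lvm prepare --bluestore --data "$dev" \
        --block.wal "/dev/sdt$((2 * i - 1))" --block.db "/dev/sdt$((2 * i))" \
        --crush-device-class hdd
    i=$((i + 1))
done
echo ceph-volume lvm activate --all
```

Dropping the `echo` would execute the commands for real; as written, the script is safe to run and inspect first.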

OSDs in the SSD pool crashed with:
2019-07-03 14:23:04.285 7f8d7150b700 1 osd.69 pg_epoch: 4852 pg[4.0( v 4067'1233 (0'0,4067'1233] lb MIN (bitwise) local-lis/les=4081/4082 n=0 ec=739/739 lis/c 4778/4778 les/c/f 4779/4779/0 4847/4852/4773) [189,171,173] r=-1 lpr=4852 pi=[3972,4852)/8 crt=4067'1233 lcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
2019-07-03 14:23:05.069 7f8d7150b700 -1 bluestore(/var/lib/ceph/osd/ceph-69) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
2019-07-03 14:23:05.069 7f8d7150b700 0 bluestore(/var/lib/ceph/osd/ceph-69) _dump_transaction transaction dump: {
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "4.0_head",
            "oid": "#4:00000000::::head#"
        },
        {
            "op_num": 1,
            "op_name": "rmcoll",
            "collection": "4.0_head"
        }
    ]
}

2019-07-03 14:23:05.069 7f8d7150b700 -1 /build/ceph-14.2.1/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f8d7150b700 time 2019-07-03 14:23:05.070284
/build/ceph-14.2.1/src/os/bluestore/BlueStore.cc: 11089: abort()

ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x55d265b50d67]
2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x284c) [0x55d26616d05c]
3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x42a) [0x55d26616fd0a]
4: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x80) [0x55d265d5faf0]
5: (PG::_delete_some(ObjectStore::Transaction*)+0x96d) [0x55d265dbcdbd]
6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x71) [0x55d265dbd811]
7: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x211) [0x55d265e0fe71]
8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x4b) [0x55d265dd89fb]
9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2b6) [0x55d265dc94e6]
10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x211) [0x55d265d08a11]
11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xd2) [0x55d265d08cd2]
12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d0) [0x55d265cfcff0]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x55d2662d2574]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d2662d45d0]
15: (()+0x76db) [0x7f8d91db96db]
16: (clone()+0x3f) [0x7f8d90b5988f]

2019-07-03 14:23:05.073 7f8d7150b700 -1 *** Caught signal (Aborted) **
in thread 7f8d7150b700 thread_name:tp_osd_tp

ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
1: (()+0x12890) [0x7f8d91dc4890]
2: (gsignal()+0xc7) [0x7f8d90a76e97]
3: (abort()+0x141) [0x7f8d90a78801]
4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x19d) [0x55d265b50e2a]
5: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x284c) [0x55d26616d05c]
6: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x42a) [0x55d26616fd0a]
7: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x80) [0x55d265d5faf0]
8: (PG::_delete_some(ObjectStore::Transaction*)+0x96d) [0x55d265dbcdbd]
9: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x71) [0x55d265dbd811]
10: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x211) [0x55d265e0fe71]
11: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x4b) [0x55d265dd89fb]
12: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x2b6) [0x55d265dc94e6]
13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x211) [0x55d265d08a11]
14: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xd2) [0x55d265d08cd2]
15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d0) [0x55d265cfcff0]
16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x55d2662d2574]
17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d2662d45d0]
18: (()+0x76db) [0x7f8d91db96db]
19: (clone()+0x3f) [0x7f8d90b5988f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-1608> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command assert hook 0x55d2705c4270
-1607> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command abort hook 0x55d2705c4270
-1606> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command perfcounters_dump hook 0x55d2705c4270
-1605> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command 1 hook 0x55d2705c4270
-1604> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command perf dump hook 0x55d2705c4270
-1603> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command perfcounters_schema hook 0x55d2705c4270
-1602> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command perf histogram dump hook 0x55d2705c4270
-1601> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command 2 hook 0x55d2705c4270
-1600> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command perf schema hook 0x55d2705c4270
-1599> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command perf histogram schema hook 0x55d2705c4270
-1598> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command perf reset hook 0x55d2705c4270
-1597> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command config show hook 0x55d2705c4270
-1596> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command config help hook 0x55d2705c4270
-1595> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command config set hook 0x55d2705c4270
-1594> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command config unset hook 0x55d2705c4270
-1593> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command config get hook 0x55d2705c4270
-1592> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command config diff hook 0x55d2705c4270
-1591> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command config diff get hook 0x55d2705c4270
-1590> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command log flush hook 0x55d2705c4270
-1589> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command log dump hook 0x55d2705c4270
-1588> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command log reopen hook 0x55d2705c4270
-1587> 2019-07-03 14:20:28.940 7f8d945f0f00 5 asok(0x55d27095e000) register_command dump_mempools hook 0x55d27145e068
-1586> 2019-07-03 14:20:28.948 7f8d945f0f00 10 monclient: get_monmap_and_config
-1585> 2019-07-03 14:20:28.948 7f8d945f0f00 10 monclient: build_initial_monmap
-1584> 2019-07-03 14:20:28.948 7f8d945f0f00 10 monclient: monmap:
epoch 0
fsid d737d3f9-1b19-4ca9-a355-587cea34612e
last_changed 2019-07-03 14:20:28.950873
created 2019-07-03 14:20:28.950873
min_mon_release 0 (unknown)
0: v1:172.27.51.20:6789/0 mon.noname-a-legacy
1: v2:172.27.51.20:3300/0 mon.noname-a
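
Error (39) in the log above is ENOTEMPTY, the errno POSIX rmdir() returns when a directory still contains entries; the failing rmcoll operation hit the equivalent condition on the 4.0_head collection while an object apparently remained in it. A minimal shell analogue of that condition (the temp paths are illustrative only, not Ceph's on-disk layout):

```shell
#!/bin/sh
# Create a directory that still contains an entry, then try to remove it:
# rmdir fails with "Directory not empty" (errno 39, ENOTEMPTY, on Linux),
# the same error code the rmcoll operation reported.
d=$(mktemp -d)
mkdir "$d/4.0_head"
touch "$d/4.0_head/obj"
if ! rmdir "$d/4.0_head" 2>/dev/null; then
    echo "rmdir refused: collection not empty"
fi
rm -rf "$d"
```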


Files

cnx-04-ceph-osd.69.log.gz (641 KB) - OSD log file showing Bluestore crash - Nigel Williams, 07/03/2019 04:54 AM

Related issues 1 (0 open, 1 closed)

Is duplicate of RADOS - Bug #38724: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) - Resolved

Actions #1

Updated by Igor Fedotov almost 5 years ago

  • Project changed from bluestore to RADOS
  • Status changed from New to Duplicate
  • Parent task set to #38724

Looks like a duplicate of https://tracker.ceph.com/issues/38724

Actions #2

Updated by Igor Fedotov almost 5 years ago

  • Parent task deleted (#38724)
Actions #3

Updated by Igor Fedotov almost 5 years ago

  • Is duplicate of Bug #38724: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) added