Bug #38724

_txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)

Added by Sage Weil over 1 year ago. Updated 11 months ago.

Status:
Pending Backport
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

100%

Source:
Tags:
Backport:
nautilus, mimic, luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

   -22> 2019-03-13 11:16:04.949 7f6815aa5700  5 osd.48 pg_epoch: 19907 pg[7.1( v 18204'27406 (17926'24297,18204'27406] lb MIN (bitwise) local-lis/les=18205/18206 n=20 ec=18092/14826 lis/c 18205/18205 les/c/f 18206/18206/0 19906/19906/19658) [25,14] r=-1 lpr=19906 pi=[18205,19906)/2 crt=18204'27406 lcod 0'0 unknown NOTIFY mbc={}] exit Started/Stray 1.090630 5 0.000104
   -21> 2019-03-13 11:16:04.949 7f6815aa5700  5 osd.48 pg_epoch: 19907 pg[7.1( v 18204'27406 (17926'24297,18204'27406] lb MIN (bitwise) local-lis/les=18205/18206 n=20 ec=18092/14826 lis/c 18205/18205 les/c/f 18206/18206/0 19906/19906/19658) [25,14] r=-1 lpr=19906 pi=[18205,19906)/2 crt=18204'27406 lcod 0'0 unknown NOTIFY mbc={}] enter Started/ToDelete
   -20> 2019-03-13 11:16:04.949 7f6815aa5700  5 osd.48 pg_epoch: 19907 pg[7.1( v 18204'27406 (17926'24297,18204'27406] lb MIN (bitwise) local-lis/les=18205/18206 n=20 ec=18092/14826 lis/c 18205/18205 les/c/f 18206/18206/0 19906/19906/19658) [25,14] r=-1 lpr=19906 pi=[18205,19906)/2 crt=18204'27406 lcod 0'0 unknown NOTIFY mbc={}] enter Started/ToDelete/WaitDeleteReseved
   -17> 2019-03-13 11:16:04.953 7f6815aa5700 -1 bluestore(/var/lib/ceph/osd/ceph-48) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
   -16> 2019-03-13 11:16:04.953 7f6815aa5700  0 bluestore(/var/lib/ceph/osd/ceph-48) _dump_transaction transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "7.1_head",
            "oid": "#7:80000000::::head#" 
        },
        {
            "op_num": 1,
            "op_name": "rmcoll",
            "collection": "7.1_head" 
        }
    ]
}
    -9> 2019-03-13 11:16:04.957 7f6815aa5700 -1 /build/ceph-14.1.0-597-g342814f/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f6815aa5700 time 2019-03-13 11:16:04.955344
/build/ceph-14.1.0-597-g342814f/src/os/bluestore/BlueStore.cc: 11069: abort()

 ceph version 14.1.0-597-g342814f (342814f60900a7eaf5bc3573fabb3a5acd940716) nautilus (rc)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x55b1156c4e45]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x296a) [0x55b115cb971a]
 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x5e6) [0x55b115cbdc86]
 4: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0x55b1158751ef]
 5: (PG::_delete_some(ObjectStore::Transaction*)+0x710) [0x55b1158d8d70]
 6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x71) [0x55b1158d9b31]
 7: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x131) [0x55b115921ed1]
 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x55b1158f631b]
 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x122) [0x55b1158e5be2]
 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x55b115820f94]
 11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xd2) [0x55b115821272]
 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbf5) [0x55b115814f25]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x55b115e3577c]

Hit this on my personal cluster. Log attached.
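To make the failure mode concrete: the dumped transaction removes one object from collection 7.1_head (op 0) and then removes the collection itself (op 1, rmcoll). BlueStore returns ENOTEMPTY (39) from rmcoll when objects are still present, and _txc_add_transaction treats the unexpected error as fatal and aborts the OSD. A toy Python model of that ordering problem (illustrative only, not Ceph code; `apply_transaction` and the second oid are made up):

```python
import errno

def apply_transaction(collections, ops):
    """Toy model: apply remove/rmcoll ops against dicts of object-id sets."""
    for op in ops:
        coll = collections[op["collection"]]
        if op["op_name"] == "remove":
            coll.discard(op["oid"])
        elif op["op_name"] == "rmcoll":
            if coll:                      # leftover objects -> ENOTEMPTY
                return -errno.ENOTEMPTY   # this is where BlueStore aborts
            del collections[op["collection"]]
    return 0

# The transaction from the dump above:
txn = [
    {"op_num": 0, "op_name": "remove", "collection": "7.1_head",
     "oid": "#7:80000000::::head#"},
    {"op_num": 1, "op_name": "rmcoll", "collection": "7.1_head"},
]

# If the PG still holds an object the delete pass missed, rmcoll fails:
colls = {"7.1_head": {"#7:80000000::::head#", "#7:other::::head#"}}
print(apply_transaction(colls, txn))  # -39 (ENOTEMPTY)
```

The fix in PR 27929 is along the lines of making this case non-fatal for PG deletion rather than asserting out.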

ceph-osd.5.clippedtoupload.log View - OSD 5 crashed (77.4 KB) Edward Kalk, 07/10/2019 02:46 PM

ceph-osd.21.clippedtoupload.log View - OSD 21 crashed (17.7 KB) Edward Kalk, 07/10/2019 02:49 PM

osd 10 will not start 07112019.txt View (222 KB) Edward Kalk, 07/11/2019 04:39 PM


Related issues

Related to RADOS - Bug #39152: nautilus osd crash: Caught signal (Aborted) tp_osd_tp Duplicate 04/09/2019
Duplicated by RADOS - Bug #40642: Bluestore crash due to mass activation on another pool Duplicate 07/03/2019
Copied to RADOS - Backport #39692: mimic: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) Resolved
Copied to RADOS - Backport #39693: nautilus: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) Resolved
Copied to RADOS - Backport #39694: luminous: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) Need More Info

History

#1 Updated by Sage Weil over 1 year ago

ceph-post-file: 26dab2cb-36c9-40de-8455-1379406477e8

#2 Updated by Sage Weil over 1 year ago

  • Status changed from 12 to Need More Info
  • Priority changed from Urgent to High

logging level isn't high enough to tell what data is in this pg. :(

#3 Updated by Greg Farnum over 1 year ago

  • Status changed from Need More Info to 12

A user saw this and uploaded logs with debug 20 on OSD and bluestore: 2d8d22f4-580b-4b57-a13a-f49dade34ba7

#4 Updated by Sage Weil over 1 year ago

  • Status changed from 12 to Fix Under Review

#5 Updated by Sage Weil over 1 year ago

  • Related to Bug #39152: nautilus osd crash: Caught signal (Aborted) tp_osd_tp added

#6 Updated by Lenz Grimmer over 1 year ago

  • Pull request ID set to 27929

#7 Updated by Sage Weil over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to nautilus, mimic, luminous

#8 Updated by Nathan Cutler over 1 year ago

  • Copied to Backport #39692: mimic: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) added

#9 Updated by Nathan Cutler over 1 year ago

  • Copied to Backport #39693: nautilus: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) added

#10 Updated by Nathan Cutler over 1 year ago

  • Copied to Backport #39694: luminous: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) added

#11 Updated by Igor Fedotov about 1 year ago

  • Duplicated by Bug #40642: Bluestore crash due to mass activation on another pool added

#12 Updated by Edward Kalk about 1 year ago

We have hit this as well, it was triggered when I rebooted a node. A few OSD on other hosts crashed. Here's some log:
Jul 08 16:00:54 synergy0 ceph-osd44682: 2019-07-08 16:00:54.962 7fe236221700 -1 bluestore(/var/lib/ceph/osd/ceph-12) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
Jul 08 16:00:54 synergy0 ceph-osd44682: /build/ceph-14.2.1/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7fe236221700 time 2019-07-08 16:00:54.965119
Jul 08 16:00:54 synergy0 ceph-osd44682: /build/ceph-14.2.1/src/os/bluestore/BlueStore.cc: 11089: abort()
Jul 08 16:00:54 synergy0 ceph-osd44682: ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
Jul 08 16:00:54 synergy0 ceph-osd44682: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x850a01]
Jul 08 16:00:54 synergy0 ceph-osd44682: 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x296a) [0xe44a2a]
Jul 08 16:00:54 synergy0 ceph-osd44682: 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x5e6) [0xe48f96]
Jul 08 16:00:54 synergy0 ceph-osd44682: 4: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0xa0099f]
Jul 08 16:00:54 synergy0 ceph-osd44682: 5: (PG::_delete_some(ObjectStore::Transaction*)+0x710) [0xa64b30]
Jul 08 16:00:54 synergy0 ceph-osd44682: 6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x71) [0xa658f1]
Jul 08 16:00:54 synergy0 ceph-osd44682: 7: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x131) [0xaaf151]
Jul 08 16:00:54 synergy0 ceph-osd44682: 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0xa82b0b]
Jul 08 16:00:54 synergy0 ceph-osd44682: 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x122) [0xa72052]
Jul 08 16:00:54 synergy0 ceph-osd44682: 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x9ac6a4]
Jul 08 16:00:54 synergy0 ceph-osd44682: 11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xd2) [0x9ac982]
Jul 08 16:00:54 synergy0 ceph-osd44682: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x9a07dd]
Jul 08 16:00:54 synergy0 ceph-osd44682: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xfc37bc]
Jul 08 16:00:54 synergy0 ceph-osd44682: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xfc6970]
Jul 08 16:00:54 synergy0 ceph-osd44682: 15: (()+0x76ba) [0x7fe2557446ba]
Jul 08 16:00:54 synergy0 ceph-osd44682: 16: (clone()+0x6d) [0x7fe254d4b41d]
Jul 08 16:00:54 synergy0 ceph-osd44682: 2019-07-08 16:00:54.974 7fe236221700 -1 /build/ceph-14.2.1/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7fe236221700 time 2019-07-08 16:00:54.965119
Jul 08 16:00:54 synergy0 ceph-osd44682: /build/ceph-14.2.1/src/os/bluestore/BlueStore.cc: 11089: abort()
Jul 08 16:00:54 synergy0 ceph-osd44682: ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
Jul 08 16:00:54 synergy0 ceph-osd44682: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x850a01]
Jul 08 16:00:54 synergy0 ceph-osd44682: 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x296a) [0xe44a2a]
Jul 08 16:00:54 synergy0 ceph-osd44682: 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x5e6) [0xe48f96]
Jul 08 16:00:54 synergy0 ceph-osd44682: 4: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0xa0099f]
Jul 08 16:00:54 synergy0 ceph-osd44682: 5: (PG::_delete_some(ObjectStore::Transaction*)+0x710) [0xa64b30]
Jul 08 16:00:54 synergy0 ceph-osd44682: 6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x71) [0xa658f1]
Jul 08 16:00:54 synergy0 ceph-osd44682: 7: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x131) [0xaaf151]
Jul 08 16:00:54 synergy0 ceph-osd44682: 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0xa82b0b]
Jul 08 16:00:54 synergy0 ceph-osd44682: 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x122) [0xa72052]
Jul 08 16:00:54 synergy0 ceph-osd44682: 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x9ac6a4]
Jul 08 16:00:54 synergy0 ceph-osd44682: 11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xd2) [0x9ac982]
Jul 08 16:00:54 synergy0 ceph-osd44682: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x9a07dd]
Jul 08 16:00:54 synergy0 ceph-osd44682: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xfc37bc]
Jul 08 16:00:54 synergy0 ceph-osd44682: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xfc6970]
Jul 08 16:00:54 synergy0 ceph-osd44682: 15: (()+0x76ba) [0x7fe2557446ba]
Jul 08 16:00:54 synergy0 ceph-osd44682: 16: (clone()+0x6d) [0x7fe254d4b41d]
Jul 08 16:00:54 synergy0 ceph-osd44682: *** Caught signal (Aborted) **
Jul 08 16:00:54 synergy0 ceph-osd44682: in thread 7fe236221700 thread_name:tp_osd_tp
Jul 08 16:00:54 synergy0 ceph-osd44682: ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
Jul 08 16:00:54 synergy0 ceph-osd44682: 1: (()+0x11390) [0x7fe25574e390]
Jul 08 16:00:54 synergy0 ceph-osd44682: 2: (gsignal()+0x38) [0x7fe254c79428]
Jul 08 16:00:54 synergy0 ceph-osd44682: 3: (abort()+0x16a) [0x7fe254c7b02a]
Jul 08 16:00:54 synergy0 ceph-osd44682: 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1a0) [0x850ac7]
Jul 08 16:00:54 synergy0 ceph-osd44682: 5: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x296a) [0xe44a2a]
Jul 08 16:00:54 synergy0 ceph-osd44682: 6: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x5e6) [0xe48f96]
Jul 08 16:00:54 synergy0 ceph-osd44682: 7: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0xa0099f]
Jul 08 16:00:54 synergy0 ceph-osd44682: 8: (PG::_delete_some(ObjectStore::Transaction*)+0x710) [0xa64b30]
Jul 08 16:00:54 synergy0 ceph-osd44682: 9: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x71) [0xa658f1]
Jul 08 16:00:54 synergy0 ceph-osd44682: 10: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x131) [0xaaf151]
Jul 08 16:00:54 synergy0 ceph-osd44682: 11: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0xa82b0b]
Jul 08 16:00:54 synergy0 ceph-osd44682: 12: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x122) [0xa72052]
Jul 08 16:00:54 synergy0 ceph-osd44682: 13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x9ac6a4]
Jul 08 16:00:54 synergy0 ceph-osd44682: 14: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xd2) [0x9ac982]
Jul 08 16:00:54 synergy0 ceph-osd44682: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x9a07dd]
Jul 08 16:00:54 synergy0 ceph-osd44682: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xfc37bc]
Jul 08 16:00:54 synergy0 ceph-osd44682: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xfc6970]
Jul 08 16:00:54 synergy0 ceph-osd44682: 18: (()+0x76ba) [0x7fe2557446ba]
Jul 08 16:00:54 synergy0 ceph-osd44682: 19: (clone()+0x6d) [0x7fe254d4b41d]
Jul 08 16:00:54 synergy0 ceph-osd44682: 2019-07-08 16:00:54.986 7fe236221700 -1 *** Caught signal (Aborted) **
Jul 08 16:00:54 synergy0 ceph-osd44682: in thread 7fe236221700 thread_name:tp_osd_tp
Jul 08 16:00:54 synergy0 ceph-osd44682: ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
Jul 08 16:00:54 synergy0 ceph-osd44682: 1: (()+0x11390) [0x7fe25574e390]
Jul 08 16:00:54 synergy0 ceph-osd44682: 2: (gsignal()+0x38) [0x7fe254c79428]
Jul 08 16:00:54 synergy0 ceph-osd44682: 3: (abort()+0x16a) [0x7fe254c7b02a]
Jul 08 16:00:54 synergy0 ceph-osd44682: 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1a0) [0x850ac7]
Jul 08 16:00:54 synergy0 ceph-osd44682: 5: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x296a) [0xe44a2a]
Jul 08 16:00:54 synergy0 ceph-osd44682: 6: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x5e6) [0xe48f96]
Jul 08 16:00:54 synergy0 ceph-osd44682: 7: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0xa0099f]
Jul 08 16:00:54 synergy0 ceph-osd44682: 8: (PG::_delete_some(ObjectStore::Transaction*)+0x710) [0xa64b30]
Jul 08 16:00:54 synergy0 ceph-osd44682: 9: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x71) [0xa658f1]
Jul 08 16:00:54 synergy0 ceph-osd44682: 10: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&...

#13 Updated by Edward Kalk about 1 year ago

I am confused by the "Copied to RADOS - Backport #39693: nautilus" status and the "Pending Backport 07/03/2019" note.
Does this mean it has been fixed and there is a patch? Is there a workaround or code change I can apply?

#15 Updated by Edward Kalk about 1 year ago

We hit this bug again: "2019-07-10 09:16:27.728 7f73b844c700 -1 bluestore(/var/lib/ceph/osd/ceph-5) _txc_add_transaction error (39) Directory not empty not handled on operation 21"
This time it was triggered when I added an OSD to a node.
Logs attached.

#16 Updated by Edward Kalk about 1 year ago

Found a few things on GitHub that look like fixes for this: https://github.com/ceph/ceph/pull/27929/commits

#17 Updated by Edward Kalk about 1 year ago

OSDs 9, 15, 10, and 13 crashed this AM.

ceph.log:2019-07-11 09:15:15.501601 mon.synergy0 (mon.0) 4248 : cluster [INF] osd.10 [v2:216.106.44.201:6802/25418,v1:216.106.44.201:6803/25418] boot
ceph.log:2019-07-11 09:15:20.561861 mon.synergy0 (mon.0) 4257 : cluster [INF] osd.13 [v2:216.106.44.201:6818/25427,v1:216.106.44.201:6820/25427] boot
ceph.log:2019-07-11 09:15:23.620936 mon.synergy0 (mon.0) 4263 : cluster [INF] osd.15 [v2:216.106.44.200:6802/17179,v1:216.106.44.200:6803/17179] boot
ceph.log:2019-07-11 09:15:39.440562 mon.synergy0 (mon.0) 4277 : cluster [INF] osd.9 [v2:216.106.44.200:6814/17315,v1:216.106.44.200:6815/17315] boot

#18 Updated by Edward Kalk about 1 year ago

This was bug #38724 again:

ceph-osd.9.log: -3> 2019-07-11 09:15:13.569 7fc7b8243700 -1 bluestore(/var/lib/ceph/osd/ceph-9) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
ceph-osd.15.log:2019-07-11 09:15:08.593 7fbe46b8b700 -1 bluestore(/var/lib/ceph/osd/ceph-15) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
ceph-osd.13.log:2019-07-11 09:15:08.627 7fc575850700 -1 bluestore(/var/lib/ceph/osd/ceph-13) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
ceph-osd.10.log:2019-07-11 09:15:08.543 7f0ef2a03700 -1 bluestore(/var/lib/ceph/osd/ceph-10) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)

We are running 14.2.1 on Ubuntu 16.04.

#19 Updated by Edward Kalk about 1 year ago

Sometimes when this happens, the OSDs crash repeatedly and systemd prevents them from being started again for about 10 minutes (too many start attempts):
15:24:13 synergy0 systemd[1]: : Start request repeated too quickly.
Jul 08 15:24:13 synergy0 systemd[1]: Failed to start Ceph object storage daemon osd.12.

#20 Updated by Edward Kalk about 1 year ago

^^ This results in production VMs becoming unresponsive, as their disks are unavailable while multiple OSDs are down on multiple hosts. (We run 2 copies.) I've seen three OSDs down at the same time on different hosts due to this bug; that's when we are really unlucky. (3 copies would not have saved us from that.)

#21 Updated by Edward Kalk about 1 year ago

Rebooted node 4; on nodes 1 and 2, two OSDs each crashed and will not start.
The logs are similar; this seems to be bug #38724 again.
Tried to manually start the OSDs; that failed. It's been about 20 minutes with prod down.

#22 Updated by Nathan Fish about 1 year ago

Edward Kalk wrote:

Sometimes when this happens, the OSDs crash repeatedly and systemd prevents them from being started again for about 10 minutes (too many start attempts):
15:24:13 synergy0 systemd[1]: : Start request repeated too quickly.
Jul 08 15:24:13 synergy0 systemd[1]: Failed to start Ceph object storage daemon osd.12.

General systemd tip: you can reset the failure limit with 'systemctl reset-failed'. Then you can 'systemctl restart ceph-osd@12'.
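Another generic systemd option (a sketch, not from this ticket; the drop-in filename is arbitrary, and option names vary by version) is to raise the start-rate limit for the OSD units via a drop-in, so crashed OSDs can be retried without waiting out the limit:

```
# /etc/systemd/system/ceph-osd@.service.d/startlimit.conf  (arbitrary drop-in name)
[Unit]
# Allow up to 10 start attempts per 30 minutes before systemd gives up.
# systemd >= 230 syntax; older versions (e.g. stock Ubuntu 16.04) spell this
# StartLimitInterval/StartLimitBurst under [Service] instead.
StartLimitIntervalSec=30min
StartLimitBurst=10
```

Run `systemctl daemon-reload` after adding the drop-in.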

#23 Updated by Nathan Fish about 1 year ago

It seems this will be in 14.2.3. When the fix comes out, will my crashed OSDs work again, or should I just purge and recreate them now?

#24 Updated by Nathan Fish about 1 year ago

Just lost another OSD on 14.2.2. The cluster is still mostly empty, but a large parallel cp is going on, so rebuild will take a while. This is somewhat worrying. Any update on this?

#25 Updated by Nathan Fish about 1 year ago

That OSD failure seems to have caused a cascade. Several more OSDs have crashed. 12% of objects were degraded, and I had to create new 'ssd' class OSDs to get enough failure domains. I cancelled the cp to prioritize recovery. Is there any workaround to repair the OSDs and get them to restart properly? They just crash again every time I restart them.

Can this bug please be set to a higher priority? This has caused an outage for myself and Edward above, and threatens data loss. That warrants at least Major.

#26 Updated by Florian Haas about 1 year ago

  • Affected Versions v12.2.12 added

So we're seeing a similar error after upgrading a Jewel cluster to Luminous (12.2.12), and then adding BlueStore OSDs to it:

    -3> 2019-08-09 11:17:04.772976 7fb41585d700 -1 bluestore(/var/lib/ceph/osd/ceph-78) _txc_add_transaction error (7) Argument list too long not handled on operation 12 (op 2, counting from 0)
    -2> 2019-08-09 11:17:04.772993 7fb41585d700 -1 bluestore(/var/lib/ceph/osd/ceph-78) unexpected error code
    -1> 2019-08-09 11:17:04.773013 7fb41585d700  0 bluestore(/var/lib/ceph/osd/ceph-78)  transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "2.293_head",
            "oid": "#-4:c96337db:::temp_recovering_2.293_11123'6472830_288833_head:head#" 
        },
        {
            "op_num": 1,
            "op_name": "touch",
            "collection": "2.293_head",
            "oid": "#-4:c96337db:::temp_recovering_2.293_11123'6472830_288833_head:head#" 
        },
        {
            "op_num": 2,
            "op_name": "truncate",
            "collection": "2.293_head",
            "oid": "#-4:c96337db:::temp_recovering_2.293_11123'6472830_288833_head:head#",
            "offset": 4457615932
        },
        {
            "op_num": 3,
            "op_name": "op_setallochint",
            "collection": "2.293_head",
            "oid": "#-4:c96337db:::temp_recovering_2.293_11123'6472830_288833_head:head#",
            "expected_object_size": "0",
            "expected_write_size": "0" 
        },
        {
            "op_num": 4,
            "op_name": "write",
            "collection": "2.293_head",
            "oid": "#-4:c96337db:::temp_recovering_2.293_11123'6472830_288833_head:head#",
            "length": 8388608,
            "offset": 0,
            "bufferlist length": 8388608
        },
        {
            "op_num": 5,
            "op_name": "setattrs",
            "collection": "2.293_head",
            "oid": "#-4:c96337db:::temp_recovering_2.293_11123'6472830_288833_head:head#",
            "attr_lens": {
                "_": 254,
                "snapset": 31
            }
        }
    ]
}

In our case, that's an E2BIG as opposed to ENOTEMPTY, but it does look awfully similar, at least superficially.

#27 Updated by Sage Weil about 1 year ago

Florian Haas wrote:

So we're seeing a similar error after upgrading a Jewel cluster to Luminous (12.2.12), and then adding BlueStore OSDs to it:

[...]

In our case, that's an E2BIG as opposed to ENOTEMPTY, but it does look awfully similar, at least superficially.

BlueStore returns E2BIG because the object is > 4 GB, which is not supported (the size is stored in a u32). This cluster has some big RADOS objects.
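That limit is visible directly in the transaction dump above: the truncate offset 4457615932 exceeds 2**32 - 1 = 4294967295. An illustrative check (not BlueStore's actual code; `OBJECT_MAX_SIZE` and `check_offset` are made-up names):

```python
import errno

# Illustrative only: object sizes live in a 32-bit field, so any
# offset/length past 2**32 - 1 is rejected with E2BIG.
OBJECT_MAX_SIZE = 2**32 - 1  # u32 max = 4294967295

def check_offset(offset):
    return -errno.E2BIG if offset > OBJECT_MAX_SIZE else 0

print(check_offset(4457615932))  # the truncate offset from the dump: -7 (E2BIG)
print(check_offset(4096))        # a normal offset: 0
```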

#28 Updated by Brad Hubbard 11 months ago

  • Priority changed from Normal to High

Bumping priority based on community feedback.
