Bug #38724

closed

_txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)

Added by Sage Weil about 5 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 100%
Source:
Tags:
Backport: nautilus, mimic, luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions: v12.2.12
ceph-qa-suite:
Component(RADOS):
Pull request ID: 27929
Crash signature (v1):
Crash signature (v2):

Description

   -22> 2019-03-13 11:16:04.949 7f6815aa5700  5 osd.48 pg_epoch: 19907 pg[7.1( v 18204'27406 (17926'24297,18204'27406] lb MIN (bitwise) local-lis/les=18205/18206 n=20 ec=18092/14826 lis/c 18205/18205 les/c/f 18206/18206/0 19906/19906/19658) [25,14] r=-1 lpr=19906 pi=[18205,19906)/2 crt=18204'27406 lcod 0'0 unknown NOTIFY mbc={}] exit Started/Stray 1.090630 5 0.000104
   -21> 2019-03-13 11:16:04.949 7f6815aa5700  5 osd.48 pg_epoch: 19907 pg[7.1( v 18204'27406 (17926'24297,18204'27406] lb MIN (bitwise) local-lis/les=18205/18206 n=20 ec=18092/14826 lis/c 18205/18205 les/c/f 18206/18206/0 19906/19906/19658) [25,14] r=-1 lpr=19906 pi=[18205,19906)/2 crt=18204'27406 lcod 0'0 unknown NOTIFY mbc={}] enter Started/ToDelete
   -20> 2019-03-13 11:16:04.949 7f6815aa5700  5 osd.48 pg_epoch: 19907 pg[7.1( v 18204'27406 (17926'24297,18204'27406] lb MIN (bitwise) local-lis/les=18205/18206 n=20 ec=18092/14826 lis/c 18205/18205 les/c/f 18206/18206/0 19906/19906/19658) [25,14] r=-1 lpr=19906 pi=[18205,19906)/2 crt=18204'27406 lcod 0'0 unknown NOTIFY mbc={}] enter Started/ToDelete/WaitDeleteReseved
   -17> 2019-03-13 11:16:04.953 7f6815aa5700 -1 bluestore(/var/lib/ceph/osd/ceph-48) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
   -16> 2019-03-13 11:16:04.953 7f6815aa5700  0 bluestore(/var/lib/ceph/osd/ceph-48) _dump_transaction transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "7.1_head",
            "oid": "#7:80000000::::head#" 
        },
        {
            "op_num": 1,
            "op_name": "rmcoll",
            "collection": "7.1_head" 
        }
    ]
}
    -9> 2019-03-13 11:16:04.957 7f6815aa5700 -1 /build/ceph-14.1.0-597-g342814f/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f6815aa5700 time 2019-03-13 11:16:04.955344
/build/ceph-14.1.0-597-g342814f/src/os/bluestore/BlueStore.cc: 11069: abort()

 ceph version 14.1.0-597-g342814f (342814f60900a7eaf5bc3573fabb3a5acd940716) nautilus (rc)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x55b1156c4e45]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x296a) [0x55b115cb971a]
 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x5e6) [0x55b115cbdc86]
 4: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0x55b1158751ef]
 5: (PG::_delete_some(ObjectStore::Transaction*)+0x710) [0x55b1158d8d70]
 6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x71) [0x55b1158d9b31]
 7: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x131) [0x55b115921ed1]
 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x55b1158f631b]
 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x122) [0x55b1158e5be2]
 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x55b115820f94]
 11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xd2) [0x55b115821272]
 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbf5) [0x55b115814f25]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x55b115e3577c]

Hit this on my personal cluster. Log attached.
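
For readers parsing the transaction dump above: op 0 removes the one object the deletion pass still knows about, and op 1 (rmcoll) removes the collection itself. If any object is in fact still present in the collection, rmcoll comes back with ENOTEMPTY, an error code _txc_add_transaction has no handler for, so the OSD aborts. The sketch below is a minimal, self-contained model of that failure path, not Ceph code; ToyStore and its methods are hypothetical names used only for illustration.

```
#include <cerrno>
#include <cstdlib>
#include <iostream>
#include <map>
#include <set>
#include <string>

// Toy model of the failing transaction (hypothetical ToyStore, not BlueStore).
struct ToyStore {
  std::map<std::string, std::set<std::string>> colls;  // collection -> objects

  int remove_object(const std::string& coll, const std::string& oid) {
    colls[coll].erase(oid);
    return 0;
  }

  int remove_collection(const std::string& coll) {
    // A collection can only be dropped once every object in it is gone.
    if (!colls[coll].empty())
      return -ENOTEMPTY;  // errno 39, "Directory not empty"
    colls.erase(coll);
    return 0;
  }
};

int main() {
  ToyStore store;
  // PG 7.1 still holds an object the deletion pass did not account for.
  store.colls["7.1_head"] = {"#7:80000000::::head#", "#7:8...:leftover:head#"};

  int r = store.remove_object("7.1_head", "#7:80000000::::head#");  // op 0: remove
  if (r == 0)
    r = store.remove_collection("7.1_head");                        // op 1: rmcoll

  if (r == -ENOTEMPTY) {
    // There is no handler for ENOTEMPTY at this point, so the process aborts,
    // which corresponds to the crash captured in the backtrace above.
    std::cerr << "error (39) Directory not empty not handled\n";
    std::abort();
  }
  return 0;
}
```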

Files

ceph-osd.5.clippedtoupload.log (77.4 KB) - OSD 5 crashed - Edward Kalk, 07/10/2019 02:46 PM
ceph-osd.21.clippedtoupload.log (17.7 KB) - OSD 21 crashed - Edward Kalk, 07/10/2019 02:49 PM
osd 10 will not start 07112019.txt (222 KB) - Edward Kalk, 07/11/2019 04:39 PM

Related issues (5: 0 open, 5 closed)

Related to: RADOS - Bug #39152: nautilus osd crash: Caught signal (Aborted) tp_osd_tp (Duplicate, 04/09/2019)
Has duplicate: RADOS - Bug #40642: Bluestore crash due to mass activation on another pool (Duplicate, 07/03/2019)
Copied to: RADOS - Backport #39692: mimic: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) (Resolved, Nathan Cutler)
Copied to: RADOS - Backport #39693: nautilus: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) (Resolved, Sage Weil)
Copied to: RADOS - Backport #39694: luminous: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) (Rejected)

Actions #1

Updated by Sage Weil about 5 years ago

ceph-post-file: 26dab2cb-36c9-40de-8455-1379406477e8

Actions #2

Updated by Sage Weil about 5 years ago

  • Status changed from 12 to Need More Info
  • Priority changed from Urgent to High

logging level isn't high enough to tell what data is in this pg. :(

Actions #3

Updated by Greg Farnum almost 5 years ago

  • Status changed from Need More Info to 12

A user saw this and uploaded logs with debug 20 on osd and bluestore: 2d8d22f4-580b-4b57-a13a-f49dade34ba7

Actions #4

Updated by Sage Weil almost 5 years ago

  • Status changed from 12 to Fix Under Review
Actions #5

Updated by Sage Weil almost 5 years ago

  • Related to Bug #39152: nautilus osd crash: Caught signal (Aborted) tp_osd_tp added
Actions #6

Updated by Lenz Grimmer almost 5 years ago

  • Pull request ID set to 27929
Actions #7

Updated by Sage Weil almost 5 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to nautilus, mimic, luminous
Actions #8

Updated by Nathan Cutler almost 5 years ago

  • Copied to Backport #39692: mimic: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) added
Actions #9

Updated by Nathan Cutler almost 5 years ago

  • Copied to Backport #39693: nautilus: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) added
Actions #10

Updated by Nathan Cutler almost 5 years ago

  • Copied to Backport #39694: luminous: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) added
Actions #11

Updated by Igor Fedotov almost 5 years ago

  • Has duplicate Bug #40642: Bluestore crash due to mass activation on another pool added
Actions #12

Updated by Edward Kalk almost 5 years ago

We have hit this as well; it was triggered when I rebooted a node. A few OSDs on other hosts crashed. Here's some of the log:
Jul 08 16:00:54 synergy0 ceph-osd44682: 2019-07-08 16:00:54.962 7fe236221700 -1 bluestore(/var/lib/ceph/osd/ceph-12) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
Jul 08 16:00:54 synergy0 ceph-osd44682: /build/ceph-14.2.1/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7fe236221700 time 2019-07-08 16:00:54.965119
Jul 08 16:00:54 synergy0 ceph-osd44682: /build/ceph-14.2.1/src/os/bluestore/BlueStore.cc: 11089: abort()
Jul 08 16:00:54 synergy0 ceph-osd44682: ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
Jul 08 16:00:54 synergy0 ceph-osd44682: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x850a01]
Jul 08 16:00:54 synergy0 ceph-osd44682: 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x296a) [0xe44a2a]
Jul 08 16:00:54 synergy0 ceph-osd44682: 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x5e6) [0xe48f96]
Jul 08 16:00:54 synergy0 ceph-osd44682: 4: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0xa0099f]
Jul 08 16:00:54 synergy0 ceph-osd44682: 5: (PG::_delete_some(ObjectStore::Transaction*)+0x710) [0xa64b30]
Jul 08 16:00:54 synergy0 ceph-osd44682: 6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x71) [0xa658f1]
Jul 08 16:00:54 synergy0 ceph-osd44682: 7: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x131) [0xaaf151]
Jul 08 16:00:54 synergy0 ceph-osd44682: 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0xa82b0b]
Jul 08 16:00:54 synergy0 ceph-osd44682: 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x122) [0xa72052]
Jul 08 16:00:54 synergy0 ceph-osd44682: 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x9ac6a4]
Jul 08 16:00:54 synergy0 ceph-osd44682: 11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xd2) [0x9ac982]
Jul 08 16:00:54 synergy0 ceph-osd44682: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x9a07dd]
Jul 08 16:00:54 synergy0 ceph-osd44682: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xfc37bc]
Jul 08 16:00:54 synergy0 ceph-osd44682: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xfc6970]
Jul 08 16:00:54 synergy0 ceph-osd44682: 15: (()+0x76ba) [0x7fe2557446ba]
Jul 08 16:00:54 synergy0 ceph-osd44682: 16: (clone()+0x6d) [0x7fe254d4b41d]
Jul 08 16:00:54 synergy0 ceph-osd44682: 2019-07-08 16:00:54.974 7fe236221700 -1 /build/ceph-14.2.1/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7fe236221700 time 2019-07-08 16:00:54.965119
Jul 08 16:00:54 synergy0 ceph-osd44682: /build/ceph-14.2.1/src/os/bluestore/BlueStore.cc: 11089: abort()
Jul 08 16:00:54 synergy0 ceph-osd44682: ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
Jul 08 16:00:54 synergy0 ceph-osd44682: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x850a01]
Jul 08 16:00:54 synergy0 ceph-osd44682: 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x296a) [0xe44a2a]
Jul 08 16:00:54 synergy0 ceph-osd44682: 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x5e6) [0xe48f96]
Jul 08 16:00:54 synergy0 ceph-osd44682: 4: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0xa0099f]
Jul 08 16:00:54 synergy0 ceph-osd44682: 5: (PG::_delete_some(ObjectStore::Transaction*)+0x710) [0xa64b30]
Jul 08 16:00:54 synergy0 ceph-osd44682: 6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x71) [0xa658f1]
Jul 08 16:00:54 synergy0 ceph-osd44682: 7: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x131) [0xaaf151]
Jul 08 16:00:54 synergy0 ceph-osd44682: 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0xa82b0b]
Jul 08 16:00:54 synergy0 ceph-osd44682: 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x122) [0xa72052]
Jul 08 16:00:54 synergy0 ceph-osd44682: 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x9ac6a4]
Jul 08 16:00:54 synergy0 ceph-osd44682: 11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xd2) [0x9ac982]
Jul 08 16:00:54 synergy0 ceph-osd44682: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x9a07dd]
Jul 08 16:00:54 synergy0 ceph-osd44682: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xfc37bc]
Jul 08 16:00:54 synergy0 ceph-osd44682: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xfc6970]
Jul 08 16:00:54 synergy0 ceph-osd44682: 15: (()+0x76ba) [0x7fe2557446ba]
Jul 08 16:00:54 synergy0 ceph-osd44682: 16: (clone()+0x6d) [0x7fe254d4b41d]
Jul 08 16:00:54 synergy0 ceph-osd44682: *** Caught signal (Aborted) ***
Jul 08 16:00:54 synergy0 ceph-osd44682: in thread 7fe236221700 thread_name:tp_osd_tp
Jul 08 16:00:54 synergy0 ceph-osd44682: ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
Jul 08 16:00:54 synergy0 ceph-osd44682: 1: (()+0x11390) [0x7fe25574e390]
Jul 08 16:00:54 synergy0 ceph-osd44682: 2: (gsignal()+0x38) [0x7fe254c79428]
Jul 08 16:00:54 synergy0 ceph-osd44682: 3: (abort()+0x16a) [0x7fe254c7b02a]
Jul 08 16:00:54 synergy0 ceph-osd44682: 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1a0) [0x850ac7]
Jul 08 16:00:54 synergy0 ceph-osd44682: 5: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x296a) [0xe44a2a]
Jul 08 16:00:54 synergy0 ceph-osd44682: 6: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x5e6) [0xe48f96]
Jul 08 16:00:54 synergy0 ceph-osd44682: 7: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0xa0099f]
Jul 08 16:00:54 synergy0 ceph-osd44682: 8: (PG::_delete_some(ObjectStore::Transaction*)+0x710) [0xa64b30]
Jul 08 16:00:54 synergy0 ceph-osd44682: 9: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x71) [0xa658f1]
Jul 08 16:00:54 synergy0 ceph-osd44682: 10: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x131) [0xaaf151]
Jul 08 16:00:54 synergy0 ceph-osd44682: 11: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0xa82b0b]
Jul 08 16:00:54 synergy0 ceph-osd44682: 12: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x122) [0xa72052]
Jul 08 16:00:54 synergy0 ceph-osd44682: 13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x9ac6a4]
Jul 08 16:00:54 synergy0 ceph-osd44682: 14: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xd2) [0x9ac982]
Jul 08 16:00:54 synergy0 ceph-osd44682: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x9a07dd]
Jul 08 16:00:54 synergy0 ceph-osd44682: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xfc37bc]
Jul 08 16:00:54 synergy0 ceph-osd44682: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xfc6970]
Jul 08 16:00:54 synergy0 ceph-osd44682: 18: (()+0x76ba) [0x7fe2557446ba]
Jul 08 16:00:54 synergy0 ceph-osd44682: 19: (clone()+0x6d) [0x7fe254d4b41d]
Jul 08 16:00:54 synergy0 ceph-osd44682: 2019-07-08 16:00:54.986 7fe236221700 -1 *** Caught signal (Aborted) ***
Jul 08 16:00:54 synergy0 ceph-osd44682: in thread 7fe236221700 thread_name:tp_osd_tp
Jul 08 16:00:54 synergy0 ceph-osd44682: ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
Jul 08 16:00:54 synergy0 ceph-osd44682: 1: (()+0x11390) [0x7fe25574e390]
Jul 08 16:00:54 synergy0 ceph-osd44682: 2: (gsignal()+0x38) [0x7fe254c79428]
Jul 08 16:00:54 synergy0 ceph-osd44682: 3: (abort()+0x16a) [0x7fe254c7b02a]
Jul 08 16:00:54 synergy0 ceph-osd44682: 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1a0) [0x850ac7]
Jul 08 16:00:54 synergy0 ceph-osd44682: 5: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x296a) [0xe44a2a]
Jul 08 16:00:54 synergy0 ceph-osd44682: 6: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x5e6) [0xe48f96]
Jul 08 16:00:54 synergy0 ceph-osd44682: 7: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0xa0099f]
Jul 08 16:00:54 synergy0 ceph-osd44682: 8: (PG::_delete_some(ObjectStore::Transaction*)+0x710) [0xa64b30]
Jul 08 16:00:54 synergy0 ceph-osd44682: 9: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x71) [0xa658f1]
Jul 08 16:00:54 synergy0 ceph-osd44682: 10: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&...

Actions #13

Updated by Edward Kalk almost 5 years ago

I am confused by the "Copied to RADOS - Backport #39693: nautilus" entry and its "Pending Backport 07/03/2019" status.
Does this mean it has been fixed and there is a patch? Is there a workaround or code change I can apply?

Updated by Edward Kalk almost 5 years ago

We hit this bug again: "2019-07-10 09:16:27.728 7f73b844c700 -1 bluestore(/var/lib/ceph/osd/ceph-5) _txc_add_transaction error (39) Directory not empty not handled on operation 21"
This time it was triggered when I added an OSD to a node.
Logs attached.

Actions #16

Updated by Edward Kalk almost 5 years ago

Found a few things on GitHub that look like fixes for this: https://github.com/ceph/ceph/pull/27929/commits

Actions #17

Updated by Edward Kalk almost 5 years ago

OSDs 9, 15, 10, and 13 crashed this AM.
```
ceph.log:2019-07-11 09:15:15.501601 mon.synergy0 (mon.0) 4248 : cluster [INF] osd.10 [v2:216.106.44.201:6802/25418,v1:216.106.44.201:6803/25418] boot
ceph.log:2019-07-11 09:15:20.561861 mon.synergy0 (mon.0) 4257 : cluster [INF] osd.13 [v2:216.106.44.201:6818/25427,v1:216.106.44.201:6820/25427] boot
ceph.log:2019-07-11 09:15:23.620936 mon.synergy0 (mon.0) 4263 : cluster [INF] osd.15 [v2:216.106.44.200:6802/17179,v1:216.106.44.200:6803/17179] boot
ceph.log:2019-07-11 09:15:39.440562 mon.synergy0 (mon.0) 4277 : cluster [INF] osd.9 [v2:216.106.44.200:6814/17315,v1:216.106.44.200:6815/17315] boot
```

Actions #18

Updated by Edward Kalk almost 5 years ago

It was Bug #38724:
```
ceph-osd.9.log: -3> 2019-07-11 09:15:13.569 7fc7b8243700 -1 bluestore(/var/lib/ceph/osd/ceph-9) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
ceph-osd.15.log:2019-07-11 09:15:08.593 7fbe46b8b700 -1 bluestore(/var/lib/ceph/osd/ceph-15) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
ceph-osd.13.log:2019-07-11 09:15:08.627 7fc575850700 -1 bluestore(/var/lib/ceph/osd/ceph-13) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
ceph-osd.10.log:2019-07-11 09:15:08.543 7f0ef2a03700 -1 bluestore(/var/lib/ceph/osd/ceph-10) _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)
```

We are on 14.2.1 on Ubuntu 16.04.

Actions #19

Updated by Edward Kalk almost 5 years ago

Sometimes when this happens, the OSDs repeatedly crash and systemd prevents them from being started. It takes 10 minutes before it allows them to start again (too many start attempts).
15:24:13 synergy0 systemd1: : Start request repeated too quickly.
Jul 08 15:24:13 synergy0 systemd1: Failed to start Ceph object storage daemon osd.12.

Actions #20

Updated by Edward Kalk almost 5 years ago

^^ This results in the production VMs becoming unresponsive, as their disks are unavailable when we have multiple OSDs down on multiple hosts (we are doing 2-copy replication). I've seen 3 OSDs down at the same time on different hosts due to this bug; that's when we are really unlucky with it (3-copy would not have saved us from that).

Actions #21

Updated by Edward Kalk almost 5 years ago

Rebooted node 4; on nodes 1 and 2, two OSDs each crashed and will not start.
The logs are similar; it seems to be the bug related to #38724.
Tried to manually start the OSDs; that failed. It's been about 20 minutes with prod down.

Actions #22

Updated by Nathan Fish almost 5 years ago

Edward Kalk wrote:

Sometimes when this happens, the OSDs repeatedly crash and systemd prevents them from being started. It takes 10 minutes before it allows them to start again (too many start attempts).
15:24:13 synergy0 systemd1: : Start request repeated too quickly.
Jul 08 15:24:13 synergy0 systemd1: Failed to start Ceph object storage daemon osd.12.

General systemd tip: you can reset the failure limit with 'systemctl reset-failed'. Then you can 'systemctl restart ceph-osd@12'.

Actions #23

Updated by Nathan Fish almost 5 years ago

It seems this will be in 14.2.3. When the fix comes out, will my crashed OSDs work again, or should I just purge and recreate them now?

Actions #24

Updated by Nathan Fish over 4 years ago

Just lost another OSD on 14.2.2. The cluster is still mostly empty, but a large parallel cp is going on, so rebuild will take a while. This is somewhat worrying. Any update on this?

Actions #25

Updated by Nathan Fish over 4 years ago

That OSD failure seems to have caused a cascade. Several more OSDs have crashed. 12% of objects were degraded, and I had to create new 'ssd' class OSDs to get enough failure domains. I cancelled the cp to prioritize recovery. Is there any workaround to repair the OSDs and get them to restart properly? They just crash again every time I restart them.

Can this bug please be set to a higher priority? This has caused an outage for myself and Edward above, and threatens data loss. That warrants at least Major.

Actions #26

Updated by Florian Haas over 4 years ago

  • Affected Versions v12.2.12 added

So we're seeing a similar error after upgrading a Jewel cluster to Luminous (12.2.12), and then adding BlueStore OSDs to it:

    -3> 2019-08-09 11:17:04.772976 7fb41585d700 -1 bluestore(/var/lib/ceph/osd/ceph-78) _txc_add_transaction error (7) Argument list too long not handled on operation 12 (op 2, counting from 0)
    -2> 2019-08-09 11:17:04.772993 7fb41585d700 -1 bluestore(/var/lib/ceph/osd/ceph-78) unexpected error code
    -1> 2019-08-09 11:17:04.773013 7fb41585d700  0 bluestore(/var/lib/ceph/osd/ceph-78)  transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "2.293_head",
            "oid": "#-4:c96337db:::temp_recovering_2.293_11123'6472830_288833_head:head#" 
        },
        {
            "op_num": 1,
            "op_name": "touch",
            "collection": "2.293_head",
            "oid": "#-4:c96337db:::temp_recovering_2.293_11123'6472830_288833_head:head#" 
        },
        {
            "op_num": 2,
            "op_name": "truncate",
            "collection": "2.293_head",
            "oid": "#-4:c96337db:::temp_recovering_2.293_11123'6472830_288833_head:head#",
            "offset": 4457615932
        },
        {
            "op_num": 3,
            "op_name": "op_setallochint",
            "collection": "2.293_head",
            "oid": "#-4:c96337db:::temp_recovering_2.293_11123'6472830_288833_head:head#",
            "expected_object_size": "0",
            "expected_write_size": "0" 
        },
        {
            "op_num": 4,
            "op_name": "write",
            "collection": "2.293_head",
            "oid": "#-4:c96337db:::temp_recovering_2.293_11123'6472830_288833_head:head#",
            "length": 8388608,
            "offset": 0,
            "bufferlist length": 8388608
        },
        {
            "op_num": 5,
            "op_name": "setattrs",
            "collection": "2.293_head",
            "oid": "#-4:c96337db:::temp_recovering_2.293_11123'6472830_288833_head:head#",
            "attr_lens": {
                "_": 254,
                "snapset": 31
            }
        }
    ]
}

In our case, that's an E2BIG as opposed to ENOTEMPTY, but it does look awfully similar, at least superficially.

Actions #27

Updated by Sage Weil over 4 years ago

Florian Haas wrote:

So we're seeing a similar error after upgrading a Jewel cluster to Luminous (12.2.12), and then adding BlueStore OSDs to it:

[...]

In our case, that's an E2BIG as opposed to ENOTEMPTY, but it does look awfully similar, at least superficially.

BlueStore returns E2BIG because the object is > 4GB, which is not supported (the size is stored in a u32). This cluster has some big rados objects.
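
A quick check of the arithmetic (illustrative snippet only, not Ceph code): the truncate offset in the dump above, 4457615932, does not fit in a 32-bit object size, so an op at that size cannot be represented and is rejected with E2BIG.

```
#include <cstdint>
#include <iostream>

int main() {
  // "offset" from op_num 2 in the transaction dump above.
  const uint64_t truncate_offset = 4457615932ULL;
  // Largest object size representable in a 32-bit field.
  const uint64_t u32_max = UINT32_MAX;  // 4294967295

  std::cout << "truncate offset " << truncate_offset
            << (truncate_offset > u32_max ? " exceeds " : " fits within ")
            << "the u32 limit of " << u32_max << "\n";
  return 0;
}
```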

Actions #28

Updated by Brad Hubbard over 4 years ago

  • Priority changed from Normal to High

Bumping priority based on community feedback.

Actions #29

Updated by Nathan Cutler about 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
