Bug #39152 (closed)

nautilus osd crash: Caught signal (Aborted) tp_osd_tp

Added by Wen Wei about 5 years ago. Updated over 4 years ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

OSD continuously crashed

-1> 2019-04-08 17:47:06.615 7f3f3ef62700 -1 /build/ceph-14.2.0/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f3f3ef62700 time 2019-04-08 17:47:06.607260
/build/ceph-14.2.0/src/os/bluestore/BlueStore.cc: 11069: abort()
ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x850261]
2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x296a) [0xe42aaa]
3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x5e6) [0xe47016]
4: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0xa0021f]
5: (PG::_delete_some(ObjectStore::Transaction*)+0x710) [0xa64220]
6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x71) [0xa64fe1]
7: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x131) [0xaaded1]
8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0xa81a7b]
9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x122) [0xa71092]
10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x9abf74]
11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xd2) [0x9ac252]
12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x9a00ad]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xfc0c1c]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xfc3dd0]
15: (()+0x76ba) [0x7f3f5e4846ba]
16: (clone()+0x6d) [0x7f3f5da8b41d]
0> 2019-04-08 17:47:06.623 7f3f3ef62700 -1 *** Caught signal (Aborted) **
in thread 7f3f3ef62700 thread_name:tp_osd_tp
ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)
1: (()+0x11390) [0x7f3f5e48e390]
2: (gsignal()+0x38) [0x7f3f5d9b9428]
3: (abort()+0x16a) [0x7f3f5d9bb02a]
4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1a0) [0x850327]
5: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x296a) [0xe42aaa]
6: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x5e6) [0xe47016]
7: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0xa0021f]
8: (PG::_delete_some(ObjectStore::Transaction*)+0x710) [0xa64220]
9: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x71) [0xa64fe1]
10: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x131) [0xaaded1]
11: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0xa81a7b]
12: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x122) [0xa71092]
13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x9abf74]
14: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xd2) [0x9ac252]
15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbed) [0x9a00ad]
16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0xfc0c1c]
17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xfc3dd0]
18: (()+0x76ba) [0x7f3f5e4846ba]
19: (clone()+0x6d) [0x7f3f5da8b41d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
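
For anyone trying to follow the NOTE above, the raw frame addresses can be resolved against the matching binary. A minimal sketch, assuming the exact 14.2.0 ceph-osd build with debug symbols is installed at /usr/bin/ceph-osd:

$ objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump   # full disassembly with source interleaved, as the NOTE suggests
$ addr2line -Cfe /usr/bin/ceph-osd 0xe42aaa           # resolve a single frame address, e.g. the _txc_add_transaction frame above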

Logs/configs attached

Thanks!


Files

98.config (62.3 KB) 98.config Wen Wei, 04/09/2019 01:02 AM
ceph.conf (299 Bytes) ceph.conf Wen Wei, 04/09/2019 01:02 AM
ceph-osd.98.log.zip (375 KB) ceph-osd.98.log.zip Wen Wei, 04/09/2019 01:03 AM
ceph-osd.12.log.zip (242 KB) ceph-osd.12.log.zip K Jarrett, 05/05/2019 09:05 PM

Related issues 2 (0 open, 2 closed)

Related to RADOS - Bug #38724: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) - Resolved

Related to RADOS - Backport #39693: nautilus: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) - Resolved (Sage Weil)
Actions #1

Updated by Greg Farnum about 5 years ago

  • Project changed from Ceph to bluestore
Actions #2

Updated by Nathan Cutler about 5 years ago

  • Backport set to nautilus
Actions #3

Updated by Neha Ojha almost 5 years ago

  • Project changed from bluestore to RADOS
Actions #4

Updated by Neha Ojha almost 5 years ago

  • Priority changed from Normal to Urgent

A similar issue was reported on ceph-users: "Nautilus (14.2.0) OSDs crashing at startup after removing a pool containing a PG with an unrepairable error"

-13> 2019-04-26 19:23:05.199 7fb2667de700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0/rpm/el7/BUILD/ceph-14.2.0/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7fb2667de700 time 2019-04-26 19:23:05.193826
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.0/rpm/el7/BUILD/ceph-14.2.0/src/os/bluestore/BlueStore.cc: 11069: abort()

 ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xd8) [0x7c1454ee40]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x2a85) [0x7c14b2d5f5]
 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x526) [0x7c14b2e366]
 4: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x7f) [0x7c1470a81f]
 5: (PG::_delete_some(ObjectStore::Transaction*)+0x83d) [0x7c1476d70d]
 6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38) [0x7c1476e528]
 7: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x16a) [0x7c147acc8a]
 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5a) [0x7c1478a91a]
 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x119) [0x7c14779c99]
 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x7c146b4494]
 11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0x234) [0x7c146b48d4]
 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x7c146a8c14]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x7c14ca0f43]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7c14ca3fe0]
 15: (()+0x7dd5) [0x7fb284e3bdd5]
 16: (clone()+0x6d) [0x7fb283d01ead]

ceph-post-file: 2d8d22f4-580b-4b57-a13a-f49dade34ba7

Actions #5

Updated by Sage Weil almost 5 years ago

  • Related to Bug #38724: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) added
Actions #6

Updated by Sage Weil almost 5 years ago

I'm guessing this is a dup of #38724

Wen, can you tell us what the cluster workload was? rgw? rbd? cephfs? Thanks!

Actions #7

Updated by K Jarrett almost 5 years ago

Sage Weil wrote:

I'm guessing this is a dup of #38724

Wen, can you tell us what the cluster workload was? rgw? rbd? cephfs? Thanks!

I'm not the original reporter, but I am also seeing this issue affecting a single OSD on a Nautilus 14.2.1 cluster. The workload in my case is predominantly CephFS on an EC pool, but with some RBD. The RBD workload is mainly the device health metrics pool, but in the past there was a single-replica RBD pool which did have an unrepairable PG similar to the post on the ceph-users mailing list.

I've attached the OSD's log and versions are below. I don't have a core/crash dump in /var/lib/ceph/crash.

If there's anything I can do to provide more information on this, please let me know. At the moment, the OSD repeatedly crashes and restarts until systemd marks it as failed; however, if I then manually reset the failed state and try to start it again, the OSD does start (see the commands after the version listing below). Any activity that results in a recovery operation seems to trigger the crash, or at least significantly increase the chance of it happening.

{
    "mon": {
        "ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)": 3
    },
    "osd": {
        "ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)": 19
    },
    "mds": {
        "ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)": 3
    },
    "overall": {
        "ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)": 28
    }
}
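
For reference, the manual reset described above amounts to roughly the following. A sketch, assuming the affected daemon is osd.12 as in the attached log:

$ systemctl reset-failed ceph-osd@12   # clear the "failed" state systemd set after repeated crashes
$ systemctl start ceph-osd@12          # start the OSD again
$ systemctl status ceph-osd@12         # check whether it stays up
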
Actions #8

Updated by Sage Weil almost 5 years ago

Once this is backported and released (#39693), we should confirm it fixes the problematic OSD.
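
Once the backported release is installed, one way to confirm the affected OSD actually picked it up (a sketch, assuming osd.98 from the original report):

$ ceph tell osd.98 version   # version the running daemon reports
$ ceph versions              # confirm the whole cluster is on the fixed release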

Actions #9

Updated by Greg Farnum over 4 years ago

  • Status changed from New to Pending Backport
Actions #11

Updated by Nathan Cutler over 4 years ago

  • Status changed from Pending Backport to New
  • Backport deleted (nautilus)

This is problematic to backport because the "Pull request ID" field is not populated and none of the notes mention a PR or commit SHA1.

I grepped the git history for the issue number to no avail:

$ git checkout master
$ git pull
$ git status
On branch master
Your branch is up to date with 'ceph/master'.

nothing to commit, working tree clean
$ git log --grep 39152
$

Based on #39152-8, my suspicion is this might be a duplicate of #39693 - please correct me if I'm wrong (and populate the "Pull request ID" field). Thanks!

Actions #12

Updated by Sage Weil over 4 years ago

  • Status changed from New to Duplicate
  • Pull request ID set to 27929

yep, dup of #39693

Actions #13

Updated by Sage Weil over 4 years ago

  • Related to Backport #39693: nautilus: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) added