Bug #25001

closed

Crashing OSDs after going from 12.2.5 -> 12.2.6 -> 13.2.0

Added by Troy Ablan almost 6 years ago. Updated over 5 years ago.

Status: Can't reproduce
Priority: High
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This bug was opened as a follow-up to http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-July/028232.html

At some point in June, I updated my entire system to 12.2.5. Over the next few weeks I started noticing PGs would randomly go inconsistent. I would kick off the repair and it would come back clean.
Last weekend, I did a yum update, saw that there were new packages, and updated the rest of the cluster.
I went to bed, woke up, and found things in a really bad state. VMs had hit IO errors, and most had remounted their disks read-only. I looked for 12.2.6 release notes, found none on the website, and switched to the Mimic repo because I could think of no other option. When that didn't fix the problem, I started looking around on the mailing list, and only then did I understand that I should not have panicked and should have waited for 12.2.7.

I was mistaken when I mentioned on the ML that these were SSDs crashing. They're SATA drives. Not all of them are crashing, but the ones that do crash do so repeatedly.

Currently, most VMs cannot start because of read errors, and the ones that can start suffer long pauses because of the OSD churn.

I captured a core dump and a log of one complete ceph-osd run with debug bluestore = 20 set.

Since the full log and core files are too large to attach, I've hosted them at https://mooinglemur.com/2018-07-ceph/

Just the trace is below:

ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
1: (()+0x8e1870) [0x559e3b6e9870]
2: (()+0xf6d0) [0x7f55434b76d0]
3: (gsignal()+0x37) [0x7f55424d8277]
4: (abort()+0x148) [0x7f55424d9968]
5: (BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >)+0xdea) [0x559e3b5d4cda]
6: (BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >)+0x13d) [0x559e3b5e3ead]
7: (BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>)+0xbf) [0x559e3b5e467f]
8: (BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)+0x60) [0x559e3b5e5e50]
9: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x105d) [0x559e3b5f066d]
10: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x519) [0x559e3b5f27b9]
11: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x80) [0x559e3b2038d0]
12: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x58) [0x559e3b19a788]
13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xfe) [0x559e3b1c823e]
14: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x559e3b41f820]
15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x592) [0x559e3b1d2e02]
16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3) [0x7f554695d333]
17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f554695df20]
18: (()+0x7e25) [0x7f55434afe25]
19: (clone()+0x6d) [0x7f55425a0bad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
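
For anyone who wants to line the frames up against a disassembly of the executable, here is a minimal Python sketch that splits trace lines of the form shown above into frame number, symbol, offset, and address. The helper name `parse_frames` and the regular expression are illustrative assumptions, not part of Ceph, and each frame is assumed to sit on a single line.

```python
import re

# Assumed frame shape, based on the trace above:
#   "<n>: (<symbol>+0x<offset>) [0x<address>]"
# Unnamed frames appear as "(()+0x...)".
FRAME_RE = re.compile(
    r"^\s*(?P<index>\d+):\s+"
    r"\((?P<symbol>.*)\+(?P<offset>0x[0-9a-fA-F]+)\)\s+"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]\s*$"
)


def parse_frames(trace_text):
    """Return one dict per backtrace frame; lines that do not match are skipped."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            continue  # version banner, NOTE line, or a frame split across lines
        symbol = match.group("symbol")
        frames.append({
            "index": int(match.group("index")),
            # "()" means the symbol was not resolved in the trace
            "symbol": None if symbol == "()" else symbol,
            "offset": int(match.group("offset"), 16),
            "address": int(match.group("address"), 16),
        })
    return frames


# Three frames copied from the trace in this report.
SAMPLE = """\
1: (()+0x8e1870) [0x559e3b6e9870]
3: (gsignal()+0x37) [0x7f55424d8277]
8: (BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)+0x60) [0x559e3b5e5e50]
"""

if __name__ == "__main__":
    for frame in parse_frames(SAMPLE):
        name = frame["symbol"] or "<unresolved>"
        print(f'{frame["index"]:>2}  +{frame["offset"]:#x}  {name}')
```

The unresolved frames carry only an offset into the binary or library, which is why the executable or its `objdump -rdS` output is still needed to interpret them.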

Thanks!


Files

gdb.txt.gz (6.38 KB) Brad Hubbard, 07/19/2018 11:01 PM