Bug #17814

closed

coredump in ceph-osd

Added by Jeff Layton over 7 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I was running the fs teuthology suite on my test branch and hit a segfault in ceph-osd:

http://qa-proxy.ceph.com/teuthology/jlayton-2016-11-07_19:48:50-fs-wip-jlayton-fsync---basic-mira/531006/remote/mira107/coredump/
$ file 1478552714.27870.core 
1478552714.27870.core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'ceph-osd -f --cluster ceph -i 1'

The branch here is based on commit ef087a488d709e9ec96809dbde02fa0460b73e9a, with a few cephfs client and mds patches on top. Nothing that should affect the osd code.

Actions #1

Updated by Jeff Layton over 7 years ago

  • Subject changed from segfault in ceph-osd to coredump in ceph-osd

Note too that the mira107 console logs show a kernel BUG:

http://qa-proxy.ceph.com/teuthology/jlayton-2016-11-07_19:48:50-fs-wip-jlayton-fsync---basic-mira/531006/console_logs/mira107.log

...it's possible that the core is from that process being killed. The BUG occurred in wb_throttle and not in ceph-osd though, so I'm not sure if it's related.

Actions #2

Updated by Jeff Layton over 7 years ago

I reserved mira068, installed the right debuginfo there and pulled down the core. Here's the backtrace:

(gdb) bt
#0  0x00007f00333681fb in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x000055b422442e75 in ?? ()
#2  <signal handler called>
#3  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#4  0x00007f00325624dc in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x000055b4225e558d in BackoffThrottle::get(unsigned long) ()
#6  0x000055b4222566db in FileStore::op_queue_reserve_throttle(FileStore::Op*) ()
#7  0x000055b42226e58c in FileStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*) ()
#8  0x000055b42214071c in ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<OpRequest>) ()
#9  0x000055b422209716 in ReplicatedBackend::submit_transaction(hobject_t const&, eversion_t const&, std::unique_ptr<PGBackend::PGTransaction, std::default_delete<PGBackend::PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*, unsigned long, osd_reqid_t, std::shared_ptr<OpRequest>) ()
#10 0x000055b4220cd5ac in ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, ReplicatedPG::OpContext*) ()
#11 0x000055b422121075 in ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*) ()
#12 0x000055b4221241dd in ReplicatedPG::do_op(std::shared_ptr<OpRequest>&) ()
#13 0x000055b4220e065c in ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&) ()
#14 0x000055b421f940f5 in OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&) ()
#15 0x000055b421f9431d in PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&) ()
#16 0x000055b421fb54fc in OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) ()
#17 0x000055b4225f15a7 in ShardedThreadPool::shardedthreadpool_worker(unsigned int) ()
#18 0x000055b4225f3700 in ShardedThreadPool::WorkThreadSharded::entry() ()
#19 0x00007f0033360184 in start_thread (arg=0x7f001761c700) at pthread_create.c:312
#20 0x00007f0031ccd37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Signal 6 is SIGABRT, which probably means an assertion failed. It looks like it died inside pthread_cond_wait though, so I'm not sure whether this thread is the one that fired off the signal or whether it just happened to catch it.
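To illustrate the signal number only: a failed assertion in C or C++ normally ends in abort(), which raises SIGABRT. The sketch below is not ceph code. It starts a throwaway child process that calls os.abort() and reports which signal killed it. The helper name abort_exit_signal is made up for this example, and it assumes a POSIX system.

```python
import signal
import subprocess
import sys


def abort_exit_signal() -> int:
    """Run a child Python process that calls os.abort() and
    return the number of the signal that terminated it."""
    child = subprocess.run(
        [sys.executable, "-c", "import os; os.abort()"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # On POSIX, a returncode of -N means the child was killed by signal N.
    return -child.returncode


if __name__ == "__main__":
    sig = abort_exit_signal()
    print(sig, signal.Signals(sig).name)
```

This only shows where signal 6 usually comes from. It says nothing about which thread in the core raised it.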

I'll hold on to mira068 for a few days in case anyone with more of a clue about the OSD code wants to take a look at the core.

Actions #3

Updated by Sage Weil over 7 years ago

Yep, looks like a buggy upstream kernel. Note that wb_throttle is a ceph-osd thread (we name most of them via a Thread class ctor argument).
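Because the kernel reports the thread name rather than the process name, it can help to map thread names back to a process. A minimal sketch, assuming a Linux host with /proc mounted: the helper thread_names is hypothetical and simply reads /proc/<pid>/task/*/comm.

```python
import os
from pathlib import Path


def thread_names(pid: int) -> dict:
    """Map thread ID -> thread name for one process (Linux only)."""
    names = {}
    task_dir = Path("/proc") / str(pid) / "task"
    if not task_dir.is_dir():
        # Not Linux, or the process has already gone away.
        return names
    for entry in task_dir.iterdir():
        try:
            names[int(entry.name)] = (entry / "comm").read_text().strip()
        except (OSError, ValueError):
            # The thread exited while we were reading; skip it.
            continue
    return names


if __name__ == "__main__":
    # Inspect this interpreter's own threads as a demonstration.
    for tid, name in sorted(thread_names(os.getpid()).items()):
        print(tid, name)
```

Run against the PID of a live ceph-osd, a listing like this would be expected to include a thread named wb_throttle, going by the comment above.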

Actions #4

Updated by Jeff Layton over 7 years ago

Ok, good to know. That kernel is basically a v4.8.0 kernel with a pile of patches on top. Might be good to rebase those patches onto the latest v4.8.z stable series kernel?

Actions #5

Updated by Jeff Layton over 7 years ago

  • Category deleted (OSD)
  • Assignee set to Jeff Layton
  • Priority changed from Urgent to Normal

I'll go ahead and grab this for now. I don't see any other reports of that crash, so I plan to send it to fsdevel later today as a report.

Actions #6

Updated by Jeff Layton over 7 years ago

Ilya says:

It's been fixed by d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node") in 4.9-rc.

...so I think we just need to get the kernel rebased onto v4.8.4 or later, or just move it to v4.9.
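A rough way to check whether a test node's kernel is new enough is sketched below. It takes the v4.8.4 threshold from the comment above as an assumption. The names FIXED_IN, parse_release and probably_has_fix are made up for this example. A version comparison is only a heuristic: a distro kernel may carry the fix as a backport, and an early 4.9-rc build from before the fix commit would be wrongly reported as fixed.

```python
import platform
import re

# Assumption from the comment above: the fix is in v4.8.4 and later.
FIXED_IN = (4, 8, 4)


def parse_release(release: str) -> tuple:
    """Extract (major, minor, patch) from a string like '4.8.0-ceph-g1234'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A missing patch level (e.g. '4.9-rc3') is treated as 0.
    return (int(major), int(minor), int(patch or 0))


def probably_has_fix(release: str) -> bool:
    """True if the release number is at or above the assumed fixed version."""
    return parse_release(release) >= FIXED_IN


if __name__ == "__main__":
    running = platform.release()
    print(running, probably_has_fix(running))
```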

Actions #7

Updated by Jeff Layton over 7 years ago

  • Status changed from New to Resolved

I haven't seen this in quite some time, so I think it has been resolved by moving to later kernels.
