Bug #17814: coredump in ceph-osd - Ceph - Ceph

Bug #17814

closed

coredump in ceph-osd

Added by Jeff Layton over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assignee: Jeff Layton
Source: other
Severity: 2 - major

Description

I was running the fs teuthology suite on my test branch and hit a segfault in ceph-osd:

http://qa-proxy.ceph.com/teuthology/jlayton-2016-11-07_19:48:50-fs-wip-jlayton-fsync---basic-mira/531006/remote/mira107/coredump/

$ file 1478552714.27870.core 
1478552714.27870.core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'ceph-osd -f --cluster ceph -i 1'
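
As a rough illustration of what `file` is reporting here, the fields it prints ("ELF 64-bit LSB core file") come from the first bytes of the ELF header. The sketch below checks those same fields; the helper name `is_elf64_lsb_core` is hypothetical and it only inspects the header, so it does not validate the rest of the core.

```python
import struct

ET_CORE = 4  # e_type value for core files in the ELF specification


def is_elf64_lsb_core(header: bytes) -> bool:
    """Return True if the given header bytes describe a 64-bit LSB ELF core."""
    # Need the 16-byte e_ident array plus the 2-byte e_type field.
    if len(header) < 18 or header[:4] != b"\x7fELF":
        return False
    is_64bit = header[4] == 2          # EI_CLASS == ELFCLASS64
    is_little_endian = header[5] == 1  # EI_DATA == ELFDATA2LSB
    if not (is_64bit and is_little_endian):
        return False
    # e_type sits right after e_ident, at offset 16, in file byte order.
    (e_type,) = struct.unpack_from("<H", header, 16)
    return e_type == ET_CORE
```

To try it on a real file, read the first 18 bytes in binary mode, for example `open(path, "rb").read(18)`, and pass them in.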

The branch here is based on commit ef087a488d709e9ec96809dbde02fa0460b73e9a, with a few cephfs client and mds patches on top. Nothing that should affect the osd code.

Updated by Jeff Layton over 7 years ago

Subject changed from segfault in ceph-osd to coredump in ceph-osd

Note too that the mira107 console logs show a kernel BUG:

http://qa-proxy.ceph.com/teuthology/jlayton-2016-11-07_19:48:50-fs-wip-jlayton-fsync---basic-mira/531006/console_logs/mira107.log

...it's possible that the core is from that process being killed. The BUG occurred in wb_throttle and not in ceph-osd though, so I'm not sure if it's related.

Updated by Jeff Layton over 7 years ago

I reserved mira068, installed the right debuginfo there and pulled down the core. Here's the backtrace:

(gdb) bt
#0  0x00007f00333681fb in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x000055b422442e75 in ?? ()
#2  <signal handler called>
#3  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#4  0x00007f00325624dc in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x000055b4225e558d in BackoffThrottle::get(unsigned long) ()
#6  0x000055b4222566db in FileStore::op_queue_reserve_throttle(FileStore::Op*) ()
#7  0x000055b42226e58c in FileStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*) ()
#8  0x000055b42214071c in ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<OpRequest>) ()
#9  0x000055b422209716 in ReplicatedBackend::submit_transaction(hobject_t const&, eversion_t const&, std::unique_ptr<PGBackend::PGTransaction, std::default_delete<PGBackend::PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*, unsigned long, osd_reqid_t, std::shared_ptr<OpRequest>) ()
#10 0x000055b4220cd5ac in ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, ReplicatedPG::OpContext*) ()
#11 0x000055b422121075 in ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*) ()
#12 0x000055b4221241dd in ReplicatedPG::do_op(std::shared_ptr<OpRequest>&) ()
#13 0x000055b4220e065c in ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&) ()
#14 0x000055b421f940f5 in OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&) ()
#15 0x000055b421f9431d in PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&) ()
#16 0x000055b421fb54fc in OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) ()
#17 0x000055b4225f15a7 in ShardedThreadPool::shardedthreadpool_worker(unsigned int) ()
#18 0x000055b4225f3700 in ShardedThreadPool::WorkThreadSharded::entry() ()
#19 0x00007f0033360184 in start_thread (arg=0x7f001761c700) at pthread_create.c:312
#20 0x00007f0031ccd37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Signal 6 is SIGABRT, which probably means an assertion? Looks like it died down inside pthread_cond_wait though so I'm not sure whether this thread is the one that fired off the signal or whether it just happened to catch it.
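
For reference, the link between an abort and signal 6 can be checked with a small sketch. The helper name `abort_exit_signal` is hypothetical: it starts a child Python process that calls `os.abort()` and reports which signal terminated it. It only shows that an abort surfaces as SIGABRT on POSIX systems; it says nothing about which thread in this particular core raised the signal.

```python
import signal
import subprocess
import sys


def abort_exit_signal() -> int:
    """Run a child that calls os.abort() and return the signal that killed it."""
    child = subprocess.run(
        [sys.executable, "-c", "import os; os.abort()"],
        stderr=subprocess.DEVNULL,
    )
    # On POSIX, a return code of -N means the child was terminated by signal N.
    return -child.returncode


if __name__ == "__main__":
    sig = abort_exit_signal()
    print(sig, signal.Signals(sig).name)
```

A failed `assert()` in C or C++ ends in the same `abort()` call, which is why a SIGABRT core usually points at an assertion.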

I'll hold on to mira068 for a few days in case anyone with more of a clue about the OSD code wants to take a look at the core.

Updated by Sage Weil over 7 years ago

Yep, looks like a buggy upstream kernel. Note that wb_throttle is a ceph-osd thread (we name most of them via a Thread class ctor argument).
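
To illustrate why the console log names `wb_throttle` rather than `ceph-osd`: on Linux each thread carries its own kernel-visible name (its "comm"), and that per-thread name is what kernel messages print. The sketch below is Linux-only and is not Ceph's Thread class; it assumes `prctl(PR_SET_NAME)` is reachable through ctypes, and the helper names are hypothetical.

```python
import ctypes
import threading

PR_SET_NAME = 15  # option values from <linux/prctl.h>
PR_GET_NAME = 16

_libc = ctypes.CDLL(None, use_errno=True)


def current_thread_comm() -> str:
    """Return the kernel-visible name (comm) of the calling thread."""
    buf = ctypes.create_string_buffer(16)  # comm is at most 15 bytes plus NUL
    if _libc.prctl(PR_GET_NAME, buf, 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_GET_NAME) failed")
    return buf.value.decode()


def comm_of_named_worker(name: str) -> str:
    """Start a thread, give it a kernel-visible name, and report that name."""
    seen = []

    def worker() -> None:
        # PR_SET_NAME renames only the calling thread, not the whole process.
        if _libc.prctl(PR_SET_NAME, name.encode(), 0, 0, 0) != 0:
            raise OSError(ctypes.get_errno(), "prctl(PR_SET_NAME) failed")
        seen.append(current_thread_comm())

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return seen[0]


if __name__ == "__main__":
    print("worker comm:", comm_of_named_worker("wb_throttle"))
    print("main comm:  ", current_thread_comm())
```

The main thread keeps its original name, so a kernel BUG hit on the worker would be reported under the worker's name even though it belongs to the same process.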

Updated by Jeff Layton over 7 years ago

Ok, good to know. That kernel is basically a v4.8.0 kernel with a pile of patches on top. Might be good to rebase those patches onto the latest v4.8.z stable series kernel?

Updated by Jeff Layton over 7 years ago

Category deleted (OSD)
Assignee set to Jeff Layton
Priority changed from Urgent to Normal

I'll go ahead and grab this for now. I don't see any other reports of that crash, so I'll plan to send it to fsdevel later today as a report.

Updated by Jeff Layton over 7 years ago

Ilya says:

It's been fixed by d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node") in 4.9-rc.

...so I think we just need to get the kernel rebased onto v4.8.4 or later, or just move it to v4.9.
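
A minimal sketch of that version check, assuming the thresholds stated in this comment (v4.8.4 or later in the 4.8.z series, or any v4.9 and newer). The function name `has_shadow_entry_fix` is hypothetical, and a release string cannot reveal vendor backports or tell early 4.9 release candidates apart, so treat the result as a hint only.

```python
import re


def has_shadow_entry_fix(release: str) -> bool:
    """Guess from a kernel release string whether the fix should be present.

    Assumption taken from the thread: fixed in v4.8.4+ and in v4.9 and later.
    Older series return False only because the thread says nothing about them.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    patch = int(match.group(3) or 0)
    if (major, minor) == (4, 8):
        return patch >= 4
    # Tuple comparison handles both 4.10+ and later major versions.
    return (major, minor) >= (4, 9)


if __name__ == "__main__":
    import platform

    release = platform.release()
    print(release, has_shadow_entry_fix(release))
```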

Updated by Jeff Layton over 7 years ago

Status changed from New to Resolved

I haven't seen this in quite some time, so I think it's resolved by moving to later kernels.
