Support #10486

OSD Keeps Going Down

Added by Shun Mok Bhark over 9 years ago. Updated over 9 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

Hello,

I am encountering an issue where an OSD (OSD.1) keeps going down and out.
Initially I thought it was a drive problem; however, the drive is mounted and in a healthy state.
I tried starting the OSD again, but after around 15 minutes it went down and was dropped again.
I cannot figure out why this is occurring and would appreciate help solving this problem.
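
For reference, this is roughly how I checked the drive and restarted the OSD (the data path /var/lib/ceph/osd/ceph-1 and the sysvinit-style service command are assumptions from my setup and may differ elsewhere):

$ ceph -s                                # overall cluster health
$ ceph osd tree                          # osd.1 shows as down/out here
$ df -h /var/lib/ceph/osd/ceph-1         # confirm the data partition is still mounted
$ dmesg | tail                           # look for kernel I/O errors from the drive
$ service ceph start osd.1               # bring the OSD back up
$ tail -f /var/log/ceph/ceph-osd.1.log   # watch the log until it drops again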

Here is an excerpt of the log for the event that caused this OSD to be dropped.

0> 2015-01-08 11:20:14.843782 7fb1c4179700 -1 *** Caught signal (Aborted) **
in thread 7fb1c4179700

ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
1: /usr/bin/ceph-osd() [0xa88332]
2: (()+0xf130) [0x7fb1e4664130]
3: (gsignal()+0x39) [0x7fb1e307e5c9]
4: (abort()+0x148) [0x7fb1e307fcd8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fb1e39829d5]
6: (()+0x5e946) [0x7fb1e3980946]
7: (()+0x5e973) [0x7fb1e3980973]
8: (()+0x5eb9f) [0x7fb1e3980b9f]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xb7ae4a]
10: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, bool)+0xbd4) [0x8e9bb4]
11: (ReplicatedBackend::build_push_op(ObjectRecoveryInfo const&, ObjectRecoveryProgress const&, ObjectRecoveryProgress*, PushOp*, object_stat_sum_t*)+0x5e9) [0x829949]
12: (ReplicatedBackend::prep_push(std::tr1::shared_ptr<ObjectContext>, hobject_t const&, pg_shard_t, eversion_t, interval_set<unsigned long>&, std::map<hobject_t, interval_set<unsigned long>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_set<unsigned long> > > >&, PushOp*)+0x40c) [0x82aabc]
13: (ReplicatedBackend::prep_push_to_replica(std::tr1::shared_ptr<ObjectContext>, hobject_t const&, pg_shard_t, PushOp*)+0x567) [0x82ee57]
14: (ReplicatedBackend::start_pushes(hobject_t const&, std::tr1::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0x1bf) [0x831e8f]
15: (ReplicatedBackend::recover_object(hobject_t const&, eversion_t, std::tr1::shared_ptr<ObjectContext>, std::tr1::shared_ptr<ObjectContext>, PGBackend::RecoveryHandle*)+0xf3) [0x9df463]
16: (ReplicatedPG::prep_object_replica_pushes(hobject_t const&, eversion_t, PGBackend::RecoveryHandle*)+0x86b) [0x84c87b]
17: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0xa68) [0x84dc38]
18: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*, ThreadPool::TPHandle&, int*)+0x5db) [0x87831b]
19: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x2c3) [0x6784e3]
20: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0x27) [0x6dadb7]
21: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa66) [0xb6b966]
22: (ThreadPool::WorkThread::entry()+0x10) [0xb6c9f0]
23: (()+0x7df3) [0x7fb1e465cdf3]
24: (clone()+0x6d) [0x7fb1e313f01d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.1.log
--- end dump of recent events ---
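
If a resolved trace would help, I can run something like the following against the installed binary (just a sketch; the matching debuginfo package is probably needed for useful symbols):

$ objdump -rdS /usr/bin/ceph-osd > /tmp/ceph-osd.dis   # full disassembly, as the NOTE above suggests
$ addr2line -Cfe /usr/bin/ceph-osd 0x8e9bb4            # resolve a single frame address, e.g. the FileStore::read one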

#1

Updated by Sage Weil over 9 years ago

If you look further up in the log, you should see why it is crashing: a segfault message or a failed assert or something?

#2

Updated by Shun Mok Bhark over 9 years ago

Looking further up the log, it has osd_ping stamps.

Isn't this a failed assert?

9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xb7ae4a]
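
For what it's worth, this is roughly how I pulled the context above the crash out of the log (log path taken from the dump above):

$ grep -n -B 40 'Caught signal (Aborted)' /var/log/ceph/ceph-osd.1.log | less
$ grep -n 'FAILED assert' /var/log/ceph/ceph-osd.1.log   # the assert message, if any, should appear here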

#3

Updated by Shun Mok Bhark over 9 years ago

Shun Mok Bhark wrote:

Looking further up the log, it has osd_ping stamps.
