Support #10486

OSD Keeps Going Down

Added by Shun Mok Bhark over 9 years ago. Updated over 9 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

Hello,

I am encountering an issue where an OSD (OSD.1) keeps going down and out.
Initially I thought it was a drive problem; however, the drive is mounted and in a healthy state.
I tried starting the OSD again, but after around 15 minutes it went down and was dropped again.
I cannot figure out why this is occurring and would appreciate help solving this problem.
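
For reference, this is roughly how I checked the drive and restarted the OSD (the data path /var/lib/ceph/osd/ceph-1 and the sysvinit-style service command are assumptions from my setup and may differ elsewhere):

$ ceph -s                                # overall cluster health
$ ceph osd tree                          # osd.1 shows as down/out here
$ df -h /var/lib/ceph/osd/ceph-1         # confirm the data partition is still mounted
$ dmesg | tail                           # look for kernel I/O errors from the drive
$ service ceph start osd.1               # bring the OSD back up
$ tail -f /var/log/ceph/ceph-osd.1.log   # watch the log until it drops again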

Here is an excerpt of the log for the event that caused this OSD to be dropped.

0> 2015-01-08 11:20:14.843782 7fb1c4179700 -1 *** Caught signal (Aborted) **
in thread 7fb1c4179700

ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
1: /usr/bin/ceph-osd() [0xa88332]
2: (()+0xf130) [0x7fb1e4664130]
3: (gsignal()+0x39) [0x7fb1e307e5c9]
4: (abort()+0x148) [0x7fb1e307fcd8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fb1e39829d5]
6: (()+0x5e946) [0x7fb1e3980946]
7: (()+0x5e973) [0x7fb1e3980973]
8: (()+0x5eb9f) [0x7fb1e3980b9f]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xb7ae4a]
10: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, bool)+0xbd4) [0x8e9bb4]
11: (ReplicatedBackend::build_push_op(ObjectRecoveryInfo const&, ObjectRecoveryProgress const&, ObjectRecoveryProgress*, PushOp*, object_stat_sum_t*)+0x5e9) [0x829949]
12: (ReplicatedBackend::prep_push(std::tr1::shared_ptr<ObjectContext>, hobject_t const&, pg_shard_t, eversion_t, interval_set<unsigned long>&, std::map<hobject_t, interval_set<unsigned long>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_set<unsigned long> > > >&, PushOp*)+0x40c) [0x82aabc]
13: (ReplicatedBackend::prep_push_to_replica(std::tr1::shared_ptr<ObjectContext>, hobject_t const&, pg_shard_t, PushOp*)+0x567) [0x82ee57]
14: (ReplicatedBackend::start_pushes(hobject_t const&, std::tr1::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0x1bf) [0x831e8f]
15: (ReplicatedBackend::recover_object(hobject_t const&, eversion_t, std::tr1::shared_ptr<ObjectContext>, std::tr1::shared_ptr<ObjectContext>, PGBackend::RecoveryHandle*)+0xf3) [0x9df463]
16: (ReplicatedPG::prep_object_replica_pushes(hobject_t const&, eversion_t, PGBackend::RecoveryHandle*)+0x86b) [0x84c87b]
17: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0xa68) [0x84dc38]
18: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*, ThreadPool::TPHandle&, int*)+0x5db) [0x87831b]
19: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x2c3) [0x6784e3]
20: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0x27) [0x6dadb7]
21: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa66) [0xb6b966]
22: (ThreadPool::WorkThread::entry()+0x10) [0xb6c9f0]
23: (()+0x7df3) [0x7fb1e465cdf3]
24: (clone()+0x6d) [0x7fb1e313f01d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.1.log
--- end dump of recent events ---
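
If a resolved trace would help, I can run something like the following against the installed binary (just a sketch; the matching debuginfo package is probably needed for useful symbols):

$ objdump -rdS /usr/bin/ceph-osd > /tmp/ceph-osd.dis   # full disassembly, as the NOTE above suggests
$ addr2line -Cfe /usr/bin/ceph-osd 0x8e9bb4            # resolve a single frame address, e.g. the FileStore::read one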

#1

Updated by Sage Weil over 9 years ago

If you look further up in the log, you should see why it is crashing: a segfault message or a failed assert or something?

#2

Updated by Shun Mok Bhark over 9 years ago

Looking further up the log, it has osd_ping stamps.

Isn't this a failed assert?

9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xb7ae4a]
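
For what it's worth, this is roughly how I pulled the context above the crash out of the log (log path taken from the dump above):

$ grep -n -B 40 'Caught signal (Aborted)' /var/log/ceph/ceph-osd.1.log | less
$ grep -n 'FAILED assert' /var/log/ceph/ceph-osd.1.log   # the assert message, if any, should appear here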

#3

Updated by Shun Mok Bhark over 9 years ago

Shun Mok Bhark wrote:

Looking further up the log, it has osd_ping stamps.
