Bug #613

OSD crash: FAILED assert(recovery_oids.count(soid) == 0)

Added by John Leach over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 11/29/2010
Due date:
% Done: 0%
Spent time:
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

I'm running a script that reads and writes random objects using librados (creating a new pool once in a while). Running it on my 3-node cluster for a while resulted in a crash of an osd (which I'll report separately if I can reproduce), but it has left the cluster in a state where a random osd crashes on startup.

So now, starting cosd on all the nodes results in one of them crashing randomly. In this case, osd1 has crashed:

2010-11-29 14:54:52.851389    pg v6440: 676 pgs: 1 creating, 648 active+clean+degraded, 27 degraded+peering; 1726 MB data, 10386 MB used, 176 GB / 197 GB avail; 107963/214281 degraded (50.384%)
2010-11-29 14:54:52.852872   mds e221: 1/1/1 up {0=up:active}, 1 up:standby
2010-11-29 14:54:52.852977   osd e1902: 3 osds: 1 up, 2 in -- 3 blacklisted MDSes
2010-11-29 14:54:52.853123   log 2010-11-29 14:53:02.470127 mon0 10.135.211.78:6789/0 25 : [INF] osd1 10.106.124.118:6801/15793 failed (by osd0 10.135.211.78:6801/25723)
2010-11-29 14:54:52.853262   mon e1: 2 mons at {0=10.135.211.78:6789/0,1=10.106.124.118:6789/0}
osd/OSD.cc: In function 'void OSD::start_recovery_op(PG*, const sobject_t&)':
osd/OSD.cc:4389: FAILED assert(recovery_oids.count(soid) == 0)
 ceph version 0.23.1 (commit:868665d5f2c79ff0cc338339d25579f846794d93)
 1: (OSD::start_recovery_op(PG*, sobject_t const&)+0xfd) [0x4d951d]
 2: (ReplicatedPG::recover_object_replicas(sobject_t const&)+0xbe) [0x49c84e]
 3: (ReplicatedPG::recover_replicas(int)+0x29b) [0x49cd3b]
 4: (ReplicatedPG::start_recovery_ops(int)+0x84) [0x49d1d4]
 5: (OSD::do_recovery(PG*)+0x1a0) [0x4d4590]
 6: (ThreadPool::worker()+0x191) [0x5cefb1]
 7: (ThreadPool::WorkThread::entry()+0xd) [0x4ffefd]
 8: (Thread::_entry_func(void*)+0xa) [0x471c8a]
 9: (()+0x69ca) [0x7f091ee089ca]
 10: (clone()+0x6d) [0x7f091ddd570d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
 ceph version 0.23.1 (commit:868665d5f2c79ff0cc338339d25579f846794d93)
 1: (sigabrt_handler(int)+0xde) [0x5e0aae]
 2: (()+0x33af0) [0x7f091dd22af0]
 3: (gsignal()+0x35) [0x7f091dd22a75]
 4: (abort()+0x180) [0x7f091dd265c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f091e5d88e5]
 6: (()+0xcad16) [0x7f091e5d6d16]
 7: (()+0xcad43) [0x7f091e5d6d43]
 8: (()+0xcae3e) [0x7f091e5d6e3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x69c) [0x5cde6c]
 10: (OSD::start_recovery_op(PG*, sobject_t const&)+0xfd) [0x4d951d]
 11: (ReplicatedPG::recover_object_replicas(sobject_t const&)+0xbe) [0x49c84e]
 12: (ReplicatedPG::recover_replicas(int)+0x29b) [0x49cd3b]
 13: (ReplicatedPG::start_recovery_ops(int)+0x84) [0x49d1d4]
 14: (OSD::do_recovery(PG*)+0x1a0) [0x4d4590]
 15: (ThreadPool::worker()+0x191) [0x5cefb1]
 16: (ThreadPool::WorkThread::entry()+0xd) [0x4ffefd]
 17: (Thread::_entry_func(void*)+0xa) [0x471c8a]
 18: (()+0x69ca) [0x7f091ee089ca]
 19: (clone()+0x6d) [0x7f091ddd570d]

Attached is the debug log output from osd1 (starting 1000 lines above the crash, which I hope is enough) and the objdump output of the cosd binary.

I'm running the binaries from the ceph Debian packages.

crashing.osd.1.log.gz (14.6 KB) John Leach, 11/29/2010 07:12 AM

objdump.cosd.gz (3.72 MB) John Leach, 11/29/2010 07:12 AM

History

#1 Updated by Sage Weil over 8 years ago

  • Status changed from New to Resolved
  • Assignee set to Sage Weil
  • Target version set to v0.23.2

This was actually a problem with the debug sanity checks. Fixed by commit 1b06332de69b332092d115451efbd29afec79269.
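The failed assertion guards against starting a second recovery operation for an object whose recovery is already in flight. As a rough illustration of what such a debug sanity check does (a simplified sketch modeled on the identifiers in the trace, not the actual Ceph source; `RecoveryTracker` and the `std::string` stand-in for `sobject_t` are invented here for brevity):

```cpp
#include <cassert>
#include <set>
#include <string>

// Simplified stand-in for Ceph's sobject_t object identifier.
using sobject_t = std::string;

// Sketch of a duplicate-recovery sanity check: the OSD keeps a set of
// objects with recovery in flight and asserts that no object is
// entered twice -- the check behind
// "FAILED assert(recovery_oids.count(soid) == 0)".
struct RecoveryTracker {
    std::set<sobject_t> recovery_oids;

    void start_recovery_op(const sobject_t& soid) {
        // This is the assertion that fired in this bug: a recovery op
        // was (apparently) started for an object already being recovered.
        assert(recovery_oids.count(soid) == 0);
        recovery_oids.insert(soid);
    }

    void finish_recovery_op(const sobject_t& soid) {
        // The converse check: only in-flight objects may be finished.
        assert(recovery_oids.count(soid) == 1);
        recovery_oids.erase(soid);
    }
};
```

Per the resolution above, the fault was in the bookkeeping of the sanity check itself rather than in the recovery path it was guarding, so the fix adjusted the check, not the recovery logic.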
