Bug #613
OSD crash: FAILED assert(recovery_oids.count(soid) == 0)
% Done:
0%
Regression:
No
Severity:
3 - minor
Description
I'm running a script that reads and writes random objects using librados (creating a new pool once in a while). Running it on my 3-node cluster for a while resulted in a crash of an OSD (which I'll report separately if I can reproduce it), but it has left the cluster in a state that crashes a random OSD on startup.
So now, starting cosd on all the nodes results in one of them crashing randomly. In this case, osd1 has crashed:
2010-11-29 14:54:52.851389    pg v6440: 676 pgs: 1 creating, 648 active+clean+degraded, 27 degraded+peering; 1726 MB data, 10386 MB used, 176 GB / 197 GB avail; 107963/214281 degraded (50.384%)
2010-11-29 14:54:52.852872   mds e221: 1/1/1 up {0=up:active}, 1 up:standby
2010-11-29 14:54:52.852977   osd e1902: 3 osds: 1 up, 2 in -- 3 blacklisted MDSes
2010-11-29 14:54:52.853123   log 2010-11-29 14:53:02.470127 mon0 10.135.211.78:6789/0 25 : [INF] osd1 10.106.124.118:6801/15793 failed (by osd0 10.135.211.78:6801/25723)
2010-11-29 14:54:52.853262   mon e1: 2 mons at {0=10.135.211.78:6789/0,1=10.106.124.118:6789/0}
osd/OSD.cc: In function 'void OSD::start_recovery_op(PG*, const sobject_t&)':
osd/OSD.cc:4389: FAILED assert(recovery_oids.count(soid) == 0)
 ceph version 0.23.1 (commit:868665d5f2c79ff0cc338339d25579f846794d93)
 1: (OSD::start_recovery_op(PG*, sobject_t const&)+0xfd) [0x4d951d]
 2: (ReplicatedPG::recover_object_replicas(sobject_t const&)+0xbe) [0x49c84e]
 3: (ReplicatedPG::recover_replicas(int)+0x29b) [0x49cd3b]
 4: (ReplicatedPG::start_recovery_ops(int)+0x84) [0x49d1d4]
 5: (OSD::do_recovery(PG*)+0x1a0) [0x4d4590]
 6: (ThreadPool::worker()+0x191) [0x5cefb1]
 7: (ThreadPool::WorkThread::entry()+0xd) [0x4ffefd]
 8: (Thread::_entry_func(void*)+0xa) [0x471c8a]
 9: (()+0x69ca) [0x7f091ee089ca]
 10: (clone()+0x6d) [0x7f091ddd570d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
 ceph version 0.23.1 (commit:868665d5f2c79ff0cc338339d25579f846794d93)
 1: (sigabrt_handler(int)+0xde) [0x5e0aae]
 2: (()+0x33af0) [0x7f091dd22af0]
 3: (gsignal()+0x35) [0x7f091dd22a75]
 4: (abort()+0x180) [0x7f091dd265c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f091e5d88e5]
 6: (()+0xcad16) [0x7f091e5d6d16]
 7: (()+0xcad43) [0x7f091e5d6d43]
 8: (()+0xcae3e) [0x7f091e5d6e3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x69c) [0x5cde6c]
 10: (OSD::start_recovery_op(PG*, sobject_t const&)+0xfd) [0x4d951d]
 11: (ReplicatedPG::recover_object_replicas(sobject_t const&)+0xbe) [0x49c84e]
 12: (ReplicatedPG::recover_replicas(int)+0x29b) [0x49cd3b]
 13: (ReplicatedPG::start_recovery_ops(int)+0x84) [0x49d1d4]
 14: (OSD::do_recovery(PG*)+0x1a0) [0x4d4590]
 15: (ThreadPool::worker()+0x191) [0x5cefb1]
 16: (ThreadPool::WorkThread::entry()+0xd) [0x4ffefd]
 17: (Thread::_entry_func(void*)+0xa) [0x471c8a]
 18: (()+0x69ca) [0x7f091ee089ca]
 19: (clone()+0x6d) [0x7f091ddd570d]
Attached are the debug log output from osd1 (starting 1000 lines above the crash; I hope that's enough) and the objdump output of the cosd binary.
I'm running the binaries from the Ceph Debian packages.
History
#1 Updated by Sage Weil over 13 years ago
- Status changed from New to Resolved
- Assignee set to Sage Weil
- Target version set to v0.23.2
This was actually a problem with the debug sanity checks themselves, not with recovery. Fixed by commit 1b06332de69b332092d115451efbd29afec79269.