Bug #586
OSD: Crash during scheduled scrub
Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
After I reported #585 I didn't pay much attention to my cluster, until I found out that only one OSD was left online.
It seems all the OSDs crashed with the same backtrace:
2010-11-17 12:32:14.563031 7fc2af6c2710 osd1 5562 pg[9.0( v 1983'18202 (1983'18200,1983'18202]+backlog n=420 ec=1693 les=5117 4780/5116/4506) [1,3,6] r=0 rops=1 lcod 0'0 mlcod 0'0 active+] recover_replicas - nothing to do!
2010-11-17 12:32:14.794561 7fc2af6c2710 osd1 5562 pg[9.0( v 1983'18202 (1983'18200,1983'18202]+backlog n=420 ec=1693 les=5117 4780/5116/4506) [1,3,6] r=0 lcod 0'0 mlcod 0'0 active+] recover_replicas
2010-11-17 12:32:14.922299 7fc2af6c2710 osd1 5562 pg[9.0( v 1983'18202 (1983'18200,1983'18202]+backlog n=420 ec=1693 les=5117 4780/5116/4506) [1,3,6] r=0 rops=1 lcod 0'0 mlcod 0'0 active+] recover_replicas - nothing to do!
2010-11-17 12:32:15.064489 7fc2af6c2710 osd1 5562 pg[9.0( v 1983'18202 (1983'18200,1983'18202]+backlog n=420 ec=1693 les=5117 4780/5116/4506) [1,3,6] r=0 lcod 0'0 mlcod 0'0 active+] recover_replicas
2010-11-17 12:32:15.082332 7fc2af6c2710 osd1 5562 pg[9.0( v 1983'18202 (1983'18200,1983'18202]+backlog n=420 ec=1693 les=5117 4780/5116/4506) [1,3,6] r=0 rops=1 lcod 0'0 mlcod 0'0 active+] recover_replicas - nothing to do!
osd/OSD.cc: In function 'PG* OSD::_lookup_lock_pg(pg_t)':
osd/OSD.cc:954: FAILED assert(pg_map.count(pgid))
ceph version 0.24~rc (commit:7f38858c0c19db36c5ecf36cb4d333579981c811)
 1: (OSD::_lookup_lock_pg(pg_t)+0x18e) [0x4c084e]
 2: (OSD::sched_scrub()+0x29d) [0x4c40bd]
 3: (OSD::tick()+0x62e) [0x4f574e]
 4: (SafeTimer::timer_thread()+0x22c) [0x5c726c]
 5: (SafeTimerThread::entry()+0xd) [0x5c932d]
 6: (Thread::_entry_func(void*)+0xa) [0x46ee8a]
 7: (()+0x69ca) [0x7fc2ba7869ca]
 8: (clone()+0x6d) [0x7fc2b94ec70d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

*** Caught signal (ABRT) ***
ceph version 0.24~rc (commit:7f38858c0c19db36c5ecf36cb4d333579981c811)
 1: (sigabrt_handler(int)+0x7d) [0x5de20d]
 2: (()+0x33af0) [0x7fc2b9439af0]
 3: (gsignal()+0x35) [0x7fc2b9439a75]
 4: (abort()+0x180) [0x7fc2b943d5c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fc2b9cef8e5]
 6: (()+0xcad16) [0x7fc2b9cedd16]
 7: (()+0xcad43) [0x7fc2b9cedd43]
 8: (()+0xcae3e) [0x7fc2b9cede3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5cb058]
10: (OSD::_lookup_lock_pg(pg_t)+0x18e) [0x4c084e]
11: (OSD::sched_scrub()+0x29d) [0x4c40bd]
12: (OSD::tick()+0x62e) [0x4f574e]
13: (SafeTimer::timer_thread()+0x22c) [0x5c726c]
14: (SafeTimerThread::entry()+0xd) [0x5c932d]
15: (Thread::_entry_func(void*)+0xa) [0x46ee8a]
16: (()+0x69ca) [0x7fc2ba7869ca]
17: (clone()+0x6d) [0x7fc2b94ec70d]
To me it seems this was triggered by a scheduled scrub that had just started?
I ran debugpack on a few machines and placed the data in logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_scrub