Bug #586 (closed)

OSD: Crash during scheduled scrub

Added by Wido den Hollander over 13 years ago. Updated over 13 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%


Description

After I reported #585, I didn't pay much attention to my cluster, until I found out that only one OSD was left online.

It seems all the OSDs crashed with the same backtrace:

2010-11-17 12:32:14.563031 7fc2af6c2710 osd1 5562 pg[9.0( v 1983'18202 (1983'18200,1983'18202]+backlog n=420 ec=1693 les=5117 4780/5116/4506) [1,3,6] r=0 rops=1 lcod 0'0 mlcod 0'0 active+] recover_replicas - nothing to do!
2010-11-17 12:32:14.794561 7fc2af6c2710 osd1 5562 pg[9.0( v 1983'18202 (1983'18200,1983'18202]+backlog n=420 ec=1693 les=5117 4780/5116/4506) [1,3,6] r=0 lcod 0'0 mlcod 0'0 active+] recover_replicas
2010-11-17 12:32:14.922299 7fc2af6c2710 osd1 5562 pg[9.0( v 1983'18202 (1983'18200,1983'18202]+backlog n=420 ec=1693 les=5117 4780/5116/4506) [1,3,6] r=0 rops=1 lcod 0'0 mlcod 0'0 active+] recover_replicas - nothing to do!
2010-11-17 12:32:15.064489 7fc2af6c2710 osd1 5562 pg[9.0( v 1983'18202 (1983'18200,1983'18202]+backlog n=420 ec=1693 les=5117 4780/5116/4506) [1,3,6] r=0 lcod 0'0 mlcod 0'0 active+] recover_replicas
2010-11-17 12:32:15.082332 7fc2af6c2710 osd1 5562 pg[9.0( v 1983'18202 (1983'18200,1983'18202]+backlog n=420 ec=1693 les=5117 4780/5116/4506) [1,3,6] r=0 rops=1 lcod 0'0 mlcod 0'0 active+] recover_replicas - nothing to do!
osd/OSD.cc: In function 'PG* OSD::_lookup_lock_pg(pg_t)':
osd/OSD.cc:954: FAILED assert(pg_map.count(pgid))
 ceph version 0.24~rc (commit:7f38858c0c19db36c5ecf36cb4d333579981c811)
 1: (OSD::_lookup_lock_pg(pg_t)+0x18e) [0x4c084e]
 2: (OSD::sched_scrub()+0x29d) [0x4c40bd]
 3: (OSD::tick()+0x62e) [0x4f574e]
 4: (SafeTimer::timer_thread()+0x22c) [0x5c726c]
 5: (SafeTimerThread::entry()+0xd) [0x5c932d]
 6: (Thread::_entry_func(void*)+0xa) [0x46ee8a]
 7: (()+0x69ca) [0x7fc2ba7869ca]
 8: (clone()+0x6d) [0x7fc2b94ec70d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
 ceph version 0.24~rc (commit:7f38858c0c19db36c5ecf36cb4d333579981c811)
 1: (sigabrt_handler(int)+0x7d) [0x5de20d]
 2: (()+0x33af0) [0x7fc2b9439af0]
 3: (gsignal()+0x35) [0x7fc2b9439a75]
 4: (abort()+0x180) [0x7fc2b943d5c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fc2b9cef8e5]
 6: (()+0xcad16) [0x7fc2b9cedd16]
 7: (()+0xcad43) [0x7fc2b9cedd43]
 8: (()+0xcae3e) [0x7fc2b9cede3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5cb058]
 10: (OSD::_lookup_lock_pg(pg_t)+0x18e) [0x4c084e]
 11: (OSD::sched_scrub()+0x29d) [0x4c40bd]
 12: (OSD::tick()+0x62e) [0x4f574e]
 13: (SafeTimer::timer_thread()+0x22c) [0x5c726c]
 14: (SafeTimerThread::entry()+0xd) [0x5c932d]
 15: (Thread::_entry_func(void*)+0xa) [0x46ee8a]
 16: (()+0x69ca) [0x7fc2ba7869ca]
 17: (clone()+0x6d) [0x7fc2b94ec70d]

It seems to me this was triggered by an automated scrub that had just started?

I ran debugpack on a few machines and placed the data in logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_scrub

Update #1

Updated by Sage Weil over 13 years ago

  • Status changed from New to Resolved

This was fixed in the commit immediately after the one you were running: 556ba7397c352f5a6cb7fe03087c6e2f51dbce32.

