Bug #585

OSD: ReplicatedPG::pull

Added by Wido den Hollander over 13 years ago. Updated over 13 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%


Description

On two OSDs (osd5 and osd10) I'm seeing the same crash; it happens almost immediately after starting them.

I cranked up the logging and this showed me:

2010-11-17 11:06:31.000850 7fa02755d710 osd5 4524 heartbeat_dispatch 0x4d8ba80
2010-11-17 11:06:31.000908 7fa02755d710 osd5 4524 handle_osd_ping from osd0 got stat stat(2010-11-17 11:06:32.279841 oprate=1.30399 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-17 11:06:31.000950 7fa02755d710 osd5 4524 _share_map_incoming osd0 [2001:16f8:10:2::c3c3:8f6b]:6801/13094 4524
2010-11-17 11:06:31.000973 7fa02755d710 osd5 4524 take_peer_stat peer osd0 stat(2010-11-17 11:06:32.279841 oprate=1.30399 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-17 11:06:31.031790 7fa02755d710 osd5 4524 heartbeat_dispatch 0x3d41a80
2010-11-17 11:06:31.031807 7fa02755d710 osd5 4524 handle_osd_ping from osd2 got stat stat(2010-11-17 11:06:31.064110 oprate=0.0945064 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-17 11:06:31.031832 7fa02755d710 osd5 4524 _share_map_incoming osd2 [2001:16f8:10:2::c3c3:4a8c]:6801/25154 4524
2010-11-17 11:06:31.031848 7fa02755d710 osd5 4524 take_peer_stat peer osd2 stat(2010-11-17 11:06:31.064110 oprate=0.0945064 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-17 11:06:31.189498 7fa02cd68710 osd5 4524 tick
2010-11-17 11:06:31.189532 7fa025458710 osd5 4524 do_recovery starting 1 (0/1 rops) on pg[4.0( v 1319'1 lc 0'0 (0'0,1319'1]+backlog n=1 ec=1054 les=4524 4523/4523/4523) [5] r=0 mlcod 0'0 active+degraded+ m=1 u=1]
2010-11-17 11:06:31.189577 7fa025458710 osd5 4524   active was 
2010-11-17 11:06:31.189592 7fa025458710 osd5 4524 pg[4.0( v 1319'1 lc 0'0 (0'0,1319'1]+backlog n=1 ec=1054 les=4524 4523/4523/4523) [5] r=0 mlcod 0'0 active+degraded+ m=1 u=1] recover_primary pulling 0 in pg
2010-11-17 11:06:31.189614 7fa025458710 osd5 4524 pg[4.0( v 1319'1 lc 0'0 (0'0,1319'1]+backlog n=1 ec=1054 les=4524 4523/4523/4523) [5] r=0 mlcod 0'0 active+degraded+ m=1 u=1] recover_primary missing(1)
2010-11-17 11:06:31.189638 7fa02cd68710 osd5 4524 scrub_should_schedule loadavg 1.48 >= max 0.5 = no, load too high
2010-11-17 11:06:31.189665 7fa02cd68710 osd5 4524 do_mon_report
2010-11-17 11:06:31.189680 7fa02cd68710 osd5 4524 send_alive up_thru currently 4523 want 4523
2010-11-17 11:06:31.189695 7fa02cd68710 osd5 4524 send_pg_stats
2010-11-17 11:06:31.189715 7fa025458710 osd5 4524 pg[4.0( v 1319'1 lc 0'0 (0'0,1319'1]+backlog n=1 ec=1054 les=4524 4523/4523/4523) [5] r=0 mlcod 0'0 active+degraded+ m=1 u=1] recover_primary .users.email/head 1319'1 (missing) (missing head)
osd/ReplicatedPG.cc: In function 'bool ReplicatedPG::pull(const sobject_t&)':
osd/ReplicatedPG.cc:2783: FAILED assert(missing_loc.count(soid))
 ceph version 0.24~rc (commit:7f38858c0c19db36c5ecf36cb4d333579981c811)
 1: (ReplicatedPG::pull(sobject_t const&)+0x4d2) [0x499902]
 2: (ReplicatedPG::recover_primary(int)+0x72b) [0x49e1fb]
 3: (ReplicatedPG::start_recovery_ops(int)+0xda) [0x49ee7a]
 4: (OSD::do_recovery(PG*)+0x19c) [0x4c288c]
 5: (ThreadPool::worker()+0x28f) [0x5cb71f]
 6: (ThreadPool::WorkThread::entry()+0xd) [0x4fdb3d]
 7: (Thread::_entry_func(void*)+0xa) [0x46ee8a]
 8: (()+0x69ca) [0x7fa03051c9ca]
 9: (clone()+0x6d) [0x7fa02f28270d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
 ceph version 0.24~rc (commit:7f38858c0c19db36c5ecf36cb4d333579981c811)
 1: (sigabrt_handler(int)+0x7d) [0x5de20d]
 2: (()+0x33af0) [0x7fa02f1cfaf0]
 3: (gsignal()+0x35) [0x7fa02f1cfa75]
 4: (abort()+0x180) [0x7fa02f1d35c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fa02fa858e5]
 6: (()+0xcad16) [0x7fa02fa83d16]
 7: (()+0xcad43) [0x7fa02fa83d43]
 8: (()+0xcae3e) [0x7fa02fa83e3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5cb058]
 10: (ReplicatedPG::pull(sobject_t const&)+0x4d2) [0x499902]
 11: (ReplicatedPG::recover_primary(int)+0x72b) [0x49e1fb]
 12: (ReplicatedPG::start_recovery_ops(int)+0xda) [0x49ee7a]
 13: (OSD::do_recovery(PG*)+0x19c) [0x4c288c]
 14: (ThreadPool::worker()+0x28f) [0x5cb71f]
 15: (ThreadPool::WorkThread::entry()+0xd) [0x4fdb3d]
 16: (Thread::_entry_func(void*)+0xa) [0x46ee8a]
 17: (()+0x69ca) [0x7fa03051c9ca]
 18: (clone()+0x6d) [0x7fa02f28270d]

I've gathered all the information with cdebugpack and uploaded it to logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_replicated_pull
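For context, the failing line is assert(missing_loc.count(soid)): the object being pulled must have at least one known source OSD recorded in missing_loc. Below is a minimal, simplified sketch of that invariant (not the actual Ceph source; all names besides missing_loc/pull are illustrative placeholders):

#include <cassert>
#include <map>
#include <set>
#include <string>

using sobject = std::string;                   // stand-in for sobject_t
std::map<sobject, std::set<int>> missing_loc;  // missing object -> OSDs believed to hold a copy

bool pull(const sobject& soid) {
    // The failing assert corresponds to this check: the object must have at
    // least one known source, otherwise there is nowhere to pull it from.
    assert(missing_loc.count(soid));
    int source_osd = *missing_loc[soid].begin();
    (void)source_osd;  // a real implementation would send a pull request to source_osd
    return true;
}

int main() {
    missing_loc[".users.email/head"].insert(10);  // a source is known: pull succeeds
    pull(".users.email/head");
    // pull("some-object-with-no-known-source");  // would abort, as in the backtrace above
    return 0;
}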

Actions #1

Updated by Sage Weil over 13 years ago

  • Status changed from New to Closed

This one should also be fixed in the latest unstable. Probably. The recovery code is still being worked on a bit, but this error at least shouldn't come up.

Actions #2

Updated by Wido den Hollander over 13 years ago

Well, it did show up again:

2010-11-18 08:53:48.672291 7f996f3c8710 osd0 5679 pg[2.215( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] _finish_recovery
2010-11-18 08:53:48.672363 7f996f3c8710 osd0 5679 pg[2.215( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] purge_strays 1
2010-11-18 08:53:48.672378 7f996f3c8710 osd0 5679 pg[2.215( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] sending PGRemove to osd1
2010-11-18 08:53:48.672425 7f996f3c8710 osd0 5679 pg[2.215( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] update_stats 5665'1641
2010-11-18 08:53:48.672441 7f996dbc5710 filestore(/srv/ceph/osd.0) collection_setattr /srv/ceph/osd.0/current/2.234_head 'info' len 309
2010-11-18 08:53:48.672457 7f9968aba710 osd0 5679 pg[0.252( v 1486'456 lc 0'0 (1486'454,1486'456]+backlog n=453 ec=2 les=5666 5665/5665/5665) [0,1] r=0 mlcod 0'0 active+degraded m=453 u=453] recover_primary 1000000079f.00000010/head 1057'2 (unfound) (missing) (missing head)
2010-11-18 08:53:48.672992 7f996f3c8710 osd0 5679 pg[2.234( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] _finish_recovery
2010-11-18 08:53:48.673021 7f996f3c8710 osd0 5679 pg[2.234( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] purge_strays 1
2010-11-18 08:53:48.673033 7f996f3c8710 osd0 5679 pg[2.234( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] sending PGRemove to osd1
2010-11-18 08:53:48.673050 7f996f3c8710 osd0 5679 pg[2.234( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] update_stats 5665'1834
osd/ReplicatedPG.cc: In function 'bool ReplicatedPG::pull(const sobject_t&)':
osd/ReplicatedPG.cc:2783: FAILED assert(missing_loc.count(soid))
 ceph version 0.24~rc (commit:4adfdee7f179083ad0b27ce935302ce0b83630cc)
 1: (ReplicatedPG::pull(sobject_t const&)+0x4c2) [0x498992]
 2: (ReplicatedPG::recover_primary(int)+0x7e1) [0x499c31]
 3: (ReplicatedPG::start_recovery_ops(int)+0x4a) [0x49a75a]
 4: (OSD::do_recovery(PG*)+0x19c) [0x4c278c]
 5: (ThreadPool::worker()+0x28f) [0x5cc52f]
 6: (ThreadPool::WorkThread::entry()+0xd) [0x4feead]
 7: (Thread::_entry_func(void*)+0xa) [0x46f8ba]
 8: (()+0x69ca) [0x7f9973b7e9ca]
 9: (clone()+0x6d) [0x7f99728e470d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
 ceph version 0.24~rc (commit:4adfdee7f179083ad0b27ce935302ce0b83630cc)
 1: (sigabrt_handler(int)+0x7d) [0x5deb7d]
 2: (()+0x33af0) [0x7f9972831af0]
 3: (gsignal()+0x35) [0x7f9972831a75]
 4: (abort()+0x180) [0x7f99728355c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f99730e78e5]
 6: (()+0xcad16) [0x7f99730e5d16]
 7: (()+0xcad43) [0x7f99730e5d43]
 8: (()+0xcae3e) [0x7f99730e5e3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5cbc38]
 10: (ReplicatedPG::pull(sobject_t const&)+0x4c2) [0x498992]
 11: (ReplicatedPG::recover_primary(int)+0x7e1) [0x499c31]
 12: (ReplicatedPG::start_recovery_ops(int)+0x4a) [0x49a75a]
 13: (OSD::do_recovery(PG*)+0x19c) [0x4c278c]
 14: (ThreadPool::worker()+0x28f) [0x5cc52f]
 15: (ThreadPool::WorkThread::entry()+0xd) [0x4feead]
 16: (Thread::_entry_func(void*)+0xa) [0x46f8ba]
 17: (()+0x69ca) [0x7f9973b7e9ca]
 18: (clone()+0x6d) [0x7f99728e470d]

As you can see, I'm running the latest unstable.

It seems that all of my OSDs went down with this backtrace.
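Note the "(unfound)" tag on 1000000079f.00000010/head in the log above: the object is known to be missing, but no peer OSD is recorded as holding a copy, so there is nothing to pull it from. A minimal sketch of a recovery loop that distinguishes that case rather than calling pull() on it (purely illustrative; this is not the code from the eventual fix):

#include <map>
#include <set>
#include <string>

using sobject = std::string;

// Only the missing/missing_loc distinction is taken from the report; the
// function shape and names here are hypothetical.
void recover_primary(const std::set<sobject>& missing,
                     const std::map<sobject, std::set<int>>& missing_loc) {
    for (const auto& soid : missing) {
        if (missing_loc.find(soid) == missing_loc.end()) {
            // "unfound": missing, but no reachable peer holds a copy, so it
            // cannot be pulled right now and must be handled separately.
            continue;
        }
        // pull(soid);  // safe: missing_loc has at least one source for soid
    }
}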

Actions #3

Updated by Colin McCabe over 13 years ago

  • Status changed from Closed to In Progress

Need to see what's going on with this.

Actions #4

Updated by Colin McCabe over 13 years ago

  • Status changed from In Progress to Resolved

Fixed by commit:82f1de8c0d6e7817ca7d6dd710e3176b2a549e12
