Bug #585

OSD: ReplicatedPG::pull

Added by Wido den Hollander over 13 years ago. Updated over 13 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%


Description

On two OSDs (osd5 and osd10) I'm seeing the same crash; it happens almost immediately after starting them.

I cranked up the logging and this showed me:

2010-11-17 11:06:31.000850 7fa02755d710 osd5 4524 heartbeat_dispatch 0x4d8ba80
2010-11-17 11:06:31.000908 7fa02755d710 osd5 4524 handle_osd_ping from osd0 got stat stat(2010-11-17 11:06:32.279841 oprate=1.30399 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-17 11:06:31.000950 7fa02755d710 osd5 4524 _share_map_incoming osd0 [2001:16f8:10:2::c3c3:8f6b]:6801/13094 4524
2010-11-17 11:06:31.000973 7fa02755d710 osd5 4524 take_peer_stat peer osd0 stat(2010-11-17 11:06:32.279841 oprate=1.30399 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-17 11:06:31.031790 7fa02755d710 osd5 4524 heartbeat_dispatch 0x3d41a80
2010-11-17 11:06:31.031807 7fa02755d710 osd5 4524 handle_osd_ping from osd2 got stat stat(2010-11-17 11:06:31.064110 oprate=0.0945064 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-17 11:06:31.031832 7fa02755d710 osd5 4524 _share_map_incoming osd2 [2001:16f8:10:2::c3c3:4a8c]:6801/25154 4524
2010-11-17 11:06:31.031848 7fa02755d710 osd5 4524 take_peer_stat peer osd2 stat(2010-11-17 11:06:31.064110 oprate=0.0945064 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-17 11:06:31.189498 7fa02cd68710 osd5 4524 tick
2010-11-17 11:06:31.189532 7fa025458710 osd5 4524 do_recovery starting 1 (0/1 rops) on pg[4.0( v 1319'1 lc 0'0 (0'0,1319'1]+backlog n=1 ec=1054 les=4524 4523/4523/4523) [5] r=0 mlcod 0'0 active+degraded+ m=1 u=1]
2010-11-17 11:06:31.189577 7fa025458710 osd5 4524   active was 
2010-11-17 11:06:31.189592 7fa025458710 osd5 4524 pg[4.0( v 1319'1 lc 0'0 (0'0,1319'1]+backlog n=1 ec=1054 les=4524 4523/4523/4523) [5] r=0 mlcod 0'0 active+degraded+ m=1 u=1] recover_primary pulling 0 in pg
2010-11-17 11:06:31.189614 7fa025458710 osd5 4524 pg[4.0( v 1319'1 lc 0'0 (0'0,1319'1]+backlog n=1 ec=1054 les=4524 4523/4523/4523) [5] r=0 mlcod 0'0 active+degraded+ m=1 u=1] recover_primary missing(1)
2010-11-17 11:06:31.189638 7fa02cd68710 osd5 4524 scrub_should_schedule loadavg 1.48 >= max 0.5 = no, load too high
2010-11-17 11:06:31.189665 7fa02cd68710 osd5 4524 do_mon_report
2010-11-17 11:06:31.189680 7fa02cd68710 osd5 4524 send_alive up_thru currently 4523 want 4523
2010-11-17 11:06:31.189695 7fa02cd68710 osd5 4524 send_pg_stats
2010-11-17 11:06:31.189715 7fa025458710 osd5 4524 pg[4.0( v 1319'1 lc 0'0 (0'0,1319'1]+backlog n=1 ec=1054 les=4524 4523/4523/4523) [5] r=0 mlcod 0'0 active+degraded+ m=1 u=1] recover_primary .users.email/head 1319'1 (missing) (missing head)
osd/ReplicatedPG.cc: In function 'bool ReplicatedPG::pull(const sobject_t&)':
osd/ReplicatedPG.cc:2783: FAILED assert(missing_loc.count(soid))
 ceph version 0.24~rc (commit:7f38858c0c19db36c5ecf36cb4d333579981c811)
 1: (ReplicatedPG::pull(sobject_t const&)+0x4d2) [0x499902]
 2: (ReplicatedPG::recover_primary(int)+0x72b) [0x49e1fb]
 3: (ReplicatedPG::start_recovery_ops(int)+0xda) [0x49ee7a]
 4: (OSD::do_recovery(PG*)+0x19c) [0x4c288c]
 5: (ThreadPool::worker()+0x28f) [0x5cb71f]
 6: (ThreadPool::WorkThread::entry()+0xd) [0x4fdb3d]
 7: (Thread::_entry_func(void*)+0xa) [0x46ee8a]
 8: (()+0x69ca) [0x7fa03051c9ca]
 9: (clone()+0x6d) [0x7fa02f28270d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
 ceph version 0.24~rc (commit:7f38858c0c19db36c5ecf36cb4d333579981c811)
 1: (sigabrt_handler(int)+0x7d) [0x5de20d]
 2: (()+0x33af0) [0x7fa02f1cfaf0]
 3: (gsignal()+0x35) [0x7fa02f1cfa75]
 4: (abort()+0x180) [0x7fa02f1d35c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fa02fa858e5]
 6: (()+0xcad16) [0x7fa02fa83d16]
 7: (()+0xcad43) [0x7fa02fa83d43]
 8: (()+0xcae3e) [0x7fa02fa83e3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5cb058]
 10: (ReplicatedPG::pull(sobject_t const&)+0x4d2) [0x499902]
 11: (ReplicatedPG::recover_primary(int)+0x72b) [0x49e1fb]
 12: (ReplicatedPG::start_recovery_ops(int)+0xda) [0x49ee7a]
 13: (OSD::do_recovery(PG*)+0x19c) [0x4c288c]
 14: (ThreadPool::worker()+0x28f) [0x5cb71f]
 15: (ThreadPool::WorkThread::entry()+0xd) [0x4fdb3d]
 16: (Thread::_entry_func(void*)+0xa) [0x46ee8a]
 17: (()+0x69ca) [0x7fa03051c9ca]
 18: (clone()+0x6d) [0x7fa02f28270d]

I've gathered all the information with cdebugpack and uploaded it to logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_replicated_pull
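For context, the failing line is assert(missing_loc.count(soid)): the object being pulled must have at least one known source OSD recorded in missing_loc. Below is a minimal, simplified sketch of that invariant (not the actual Ceph source; all names besides missing_loc/pull are illustrative placeholders):

#include <cassert>
#include <map>
#include <set>
#include <string>

using sobject = std::string;                   // stand-in for sobject_t
std::map<sobject, std::set<int>> missing_loc;  // missing object -> OSDs believed to hold a copy

bool pull(const sobject& soid) {
    // The failing assert corresponds to this check: the object must have at
    // least one known source, otherwise there is nowhere to pull it from.
    assert(missing_loc.count(soid));
    int source_osd = *missing_loc[soid].begin();
    (void)source_osd;  // a real implementation would send a pull request to source_osd
    return true;
}

int main() {
    missing_loc[".users.email/head"].insert(10);  // a source is known: pull succeeds
    pull(".users.email/head");
    // pull("some-object-with-no-known-source");  // would abort, as in the backtrace above
    return 0;
}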

Actions #1

Updated by Sage Weil over 13 years ago

  • Status changed from New to Closed

This one should also be fixed in the latest unstable. Probably. The recovery code is still being worked on a bit, but this error at least shouldn't come up.

Actions #2

Updated by Wido den Hollander over 13 years ago

Well, it did show up again:

2010-11-18 08:53:48.672291 7f996f3c8710 osd0 5679 pg[2.215( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] _finish_recovery
2010-11-18 08:53:48.672363 7f996f3c8710 osd0 5679 pg[2.215( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] purge_strays 1
2010-11-18 08:53:48.672378 7f996f3c8710 osd0 5679 pg[2.215( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] sending PGRemove to osd1
2010-11-18 08:53:48.672425 7f996f3c8710 osd0 5679 pg[2.215( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] update_stats 5665'1641
2010-11-18 08:53:48.672441 7f996dbc5710 filestore(/srv/ceph/osd.0) collection_setattr /srv/ceph/osd.0/current/2.234_head 'info' len 309
2010-11-18 08:53:48.672457 7f9968aba710 osd0 5679 pg[0.252( v 1486'456 lc 0'0 (1486'454,1486'456]+backlog n=453 ec=2 les=5666 5665/5665/5665) [0,1] r=0 mlcod 0'0 active+degraded m=453 u=453] recover_primary 1000000079f.00000010/head 1057'2 (unfound) (missing) (missing head)
2010-11-18 08:53:48.672992 7f996f3c8710 osd0 5679 pg[2.234( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] _finish_recovery
2010-11-18 08:53:48.673021 7f996f3c8710 osd0 5679 pg[2.234( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] purge_strays 1
2010-11-18 08:53:48.673033 7f996f3c8710 osd0 5679 pg[2.234( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] sending PGRemove to osd1
2010-11-18 08:53:48.673050 7f996f3c8710 osd0 5679 pg[2.234( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] update_stats 5665'1834
osd/ReplicatedPG.cc: In function 'bool ReplicatedPG::pull(const sobject_t&)':
osd/ReplicatedPG.cc:2783: FAILED assert(missing_loc.count(soid))
 ceph version 0.24~rc (commit:4adfdee7f179083ad0b27ce935302ce0b83630cc)
 1: (ReplicatedPG::pull(sobject_t const&)+0x4c2) [0x498992]
 2: (ReplicatedPG::recover_primary(int)+0x7e1) [0x499c31]
 3: (ReplicatedPG::start_recovery_ops(int)+0x4a) [0x49a75a]
 4: (OSD::do_recovery(PG*)+0x19c) [0x4c278c]
 5: (ThreadPool::worker()+0x28f) [0x5cc52f]
 6: (ThreadPool::WorkThread::entry()+0xd) [0x4feead]
 7: (Thread::_entry_func(void*)+0xa) [0x46f8ba]
 8: (()+0x69ca) [0x7f9973b7e9ca]
 9: (clone()+0x6d) [0x7f99728e470d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
 ceph version 0.24~rc (commit:4adfdee7f179083ad0b27ce935302ce0b83630cc)
 1: (sigabrt_handler(int)+0x7d) [0x5deb7d]
 2: (()+0x33af0) [0x7f9972831af0]
 3: (gsignal()+0x35) [0x7f9972831a75]
 4: (abort()+0x180) [0x7f99728355c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f99730e78e5]
 6: (()+0xcad16) [0x7f99730e5d16]
 7: (()+0xcad43) [0x7f99730e5d43]
 8: (()+0xcae3e) [0x7f99730e5e3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5cbc38]
 10: (ReplicatedPG::pull(sobject_t const&)+0x4c2) [0x498992]
 11: (ReplicatedPG::recover_primary(int)+0x7e1) [0x499c31]
 12: (ReplicatedPG::start_recovery_ops(int)+0x4a) [0x49a75a]
 13: (OSD::do_recovery(PG*)+0x19c) [0x4c278c]
 14: (ThreadPool::worker()+0x28f) [0x5cc52f]
 15: (ThreadPool::WorkThread::entry()+0xd) [0x4feead]
 16: (Thread::_entry_func(void*)+0xa) [0x46f8ba]
 17: (()+0x69ca) [0x7f9973b7e9ca]
 18: (clone()+0x6d) [0x7f99728e470d]

As you can see, I'm running the latest unstable.

It seems that all of my OSDs went down with this backtrace.
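Note the "(unfound)" tag on 1000000079f.00000010/head in the log above: the object is known to be missing, but no peer OSD is recorded as holding a copy, so there is nothing to pull it from. A minimal sketch of a recovery loop that distinguishes that case rather than calling pull() on it (purely illustrative; this is not the code from the eventual fix):

#include <map>
#include <set>
#include <string>

using sobject = std::string;

// Only the missing/missing_loc distinction is taken from the report; the
// function shape and names here are hypothetical.
void recover_primary(const std::set<sobject>& missing,
                     const std::map<sobject, std::set<int>>& missing_loc) {
    for (const auto& soid : missing) {
        if (missing_loc.find(soid) == missing_loc.end()) {
            // "unfound": missing, but no reachable peer holds a copy, so it
            // cannot be pulled right now and must be handled separately.
            continue;
        }
        // pull(soid);  // safe: missing_loc has at least one source for soid
    }
}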

Actions #3

Updated by Colin McCabe over 13 years ago

  • Status changed from Closed to In Progress

Need to see what's going on with this.

Actions #4

Updated by Colin McCabe over 13 years ago

  • Status changed from In Progress to Resolved

Fixed by commit:82f1de8c0d6e7817ca7d6dd710e3176b2a549e12
