Bug #585 (closed): OSD: ReplicatedPG::pull
Status: Resolved
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Description
On two OSDs (osd5 and osd10) I'm seeing the same crash, occurring almost directly after starting them.
I cranked up the logging, and this is what it showed me:
2010-11-17 11:06:31.000850 7fa02755d710 osd5 4524 heartbeat_dispatch 0x4d8ba80
2010-11-17 11:06:31.000908 7fa02755d710 osd5 4524 handle_osd_ping from osd0 got stat stat(2010-11-17 11:06:32.279841 oprate=1.30399 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-17 11:06:31.000950 7fa02755d710 osd5 4524 _share_map_incoming osd0 [2001:16f8:10:2::c3c3:8f6b]:6801/13094 4524
2010-11-17 11:06:31.000973 7fa02755d710 osd5 4524 take_peer_stat peer osd0 stat(2010-11-17 11:06:32.279841 oprate=1.30399 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-17 11:06:31.031790 7fa02755d710 osd5 4524 heartbeat_dispatch 0x3d41a80
2010-11-17 11:06:31.031807 7fa02755d710 osd5 4524 handle_osd_ping from osd2 got stat stat(2010-11-17 11:06:31.064110 oprate=0.0945064 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-17 11:06:31.031832 7fa02755d710 osd5 4524 _share_map_incoming osd2 [2001:16f8:10:2::c3c3:4a8c]:6801/25154 4524
2010-11-17 11:06:31.031848 7fa02755d710 osd5 4524 take_peer_stat peer osd2 stat(2010-11-17 11:06:31.064110 oprate=0.0945064 qlen=0 recent_qlen=0 rdlat=0 / 0 fshedin=0)
2010-11-17 11:06:31.189498 7fa02cd68710 osd5 4524 tick
2010-11-17 11:06:31.189532 7fa025458710 osd5 4524 do_recovery starting 1 (0/1 rops) on pg[4.0( v 1319'1 lc 0'0 (0'0,1319'1]+backlog n=1 ec=1054 les=4524 4523/4523/4523) [5] r=0 mlcod 0'0 active+degraded+ m=1 u=1]
2010-11-17 11:06:31.189577 7fa025458710 osd5 4524 active was
2010-11-17 11:06:31.189592 7fa025458710 osd5 4524 pg[4.0( v 1319'1 lc 0'0 (0'0,1319'1]+backlog n=1 ec=1054 les=4524 4523/4523/4523) [5] r=0 mlcod 0'0 active+degraded+ m=1 u=1] recover_primary pulling 0 in pg
2010-11-17 11:06:31.189614 7fa025458710 osd5 4524 pg[4.0( v 1319'1 lc 0'0 (0'0,1319'1]+backlog n=1 ec=1054 les=4524 4523/4523/4523) [5] r=0 mlcod 0'0 active+degraded+ m=1 u=1] recover_primary missing(1)
2010-11-17 11:06:31.189638 7fa02cd68710 osd5 4524 scrub_should_schedule loadavg 1.48 >= max 0.5 = no, load too high
2010-11-17 11:06:31.189665 7fa02cd68710 osd5 4524 do_mon_report
2010-11-17 11:06:31.189680 7fa02cd68710 osd5 4524 send_alive up_thru currently 4523 want 4523
2010-11-17 11:06:31.189695 7fa02cd68710 osd5 4524 send_pg_stats
2010-11-17 11:06:31.189715 7fa025458710 osd5 4524 pg[4.0( v 1319'1 lc 0'0 (0'0,1319'1]+backlog n=1 ec=1054 les=4524 4523/4523/4523) [5] r=0 mlcod 0'0 active+degraded+ m=1 u=1] recover_primary .users.email/head 1319'1 (missing) (missing head)
osd/ReplicatedPG.cc: In function 'bool ReplicatedPG::pull(const sobject_t&)':
osd/ReplicatedPG.cc:2783: FAILED assert(missing_loc.count(soid))
ceph version 0.24~rc (commit:7f38858c0c19db36c5ecf36cb4d333579981c811)
 1: (ReplicatedPG::pull(sobject_t const&)+0x4d2) [0x499902]
 2: (ReplicatedPG::recover_primary(int)+0x72b) [0x49e1fb]
 3: (ReplicatedPG::start_recovery_ops(int)+0xda) [0x49ee7a]
 4: (OSD::do_recovery(PG*)+0x19c) [0x4c288c]
 5: (ThreadPool::worker()+0x28f) [0x5cb71f]
 6: (ThreadPool::WorkThread::entry()+0xd) [0x4fdb3d]
 7: (Thread::_entry_func(void*)+0xa) [0x46ee8a]
 8: (()+0x69ca) [0x7fa03051c9ca]
 9: (clone()+0x6d) [0x7fa02f28270d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
ceph version 0.24~rc (commit:7f38858c0c19db36c5ecf36cb4d333579981c811)
 1: (sigabrt_handler(int)+0x7d) [0x5de20d]
 2: (()+0x33af0) [0x7fa02f1cfaf0]
 3: (gsignal()+0x35) [0x7fa02f1cfa75]
 4: (abort()+0x180) [0x7fa02f1d35c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fa02fa858e5]
 6: (()+0xcad16) [0x7fa02fa83d16]
 7: (()+0xcad43) [0x7fa02fa83d43]
 8: (()+0xcae3e) [0x7fa02fa83e3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5cb058]
 10: (ReplicatedPG::pull(sobject_t const&)+0x4d2) [0x499902]
 11: (ReplicatedPG::recover_primary(int)+0x72b) [0x49e1fb]
 12: (ReplicatedPG::start_recovery_ops(int)+0xda) [0x49ee7a]
 13: (OSD::do_recovery(PG*)+0x19c) [0x4c288c]
 14: (ThreadPool::worker()+0x28f) [0x5cb71f]
 15: (ThreadPool::WorkThread::entry()+0xd) [0x4fdb3d]
 16: (Thread::_entry_func(void*)+0xa) [0x46ee8a]
 17: (()+0x69ca) [0x7fa03051c9ca]
 18: (clone()+0x6d) [0x7fa02f28270d]
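For context, the failed assertion expects that every object the primary records as missing also has at least one known source OSD in missing_loc before pull() is called. A minimal sketch of that invariant, using hypothetical simplified types (the real sobject_t and per-PG bookkeeping live in the Ceph OSD sources, not shown here):

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for Ceph's sobject_t, for illustration only.
using sobject_t = std::string;

struct RecoveryState {
    // Objects the primary still needs to recover.
    std::set<sobject_t> missing;
    // For each missing object, the set of OSD ids known to hold a copy.
    std::map<sobject_t, std::set<int>> missing_loc;

    // Mirrors the precondition asserted at osd/ReplicatedPG.cc:2783:
    // pull() must only be attempted for objects with a known location.
    bool can_pull(const sobject_t& soid) const {
        return missing.count(soid) != 0 && missing_loc.count(soid) != 0;
    }
};
```

An object logged as "(unfound)" is exactly the case that violates this: it appears in missing but has no entry in missing_loc, so assert(missing_loc.count(soid)) fires and aborts the OSD.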
I've gathered all the information with cdebugpack and uploaded it to logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_replicated_pull
Updated by Sage Weil over 13 years ago
- Status changed from New to Closed
This one should also be fixed in the latest unstable. Probably. The recovery code is still being worked on a bit, but this error at least shouldn't come up.
Updated by Wido den Hollander over 13 years ago
Well, it did show up again:
2010-11-18 08:53:48.672291 7f996f3c8710 osd0 5679 pg[2.215( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] _finish_recovery
2010-11-18 08:53:48.672363 7f996f3c8710 osd0 5679 pg[2.215( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] purge_strays 1
2010-11-18 08:53:48.672378 7f996f3c8710 osd0 5679 pg[2.215( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] sending PGRemove to osd1
2010-11-18 08:53:48.672425 7f996f3c8710 osd0 5679 pg[2.215( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] update_stats 5665'1641
2010-11-18 08:53:48.672441 7f996dbc5710 filestore(/srv/ceph/osd.0) collection_setattr /srv/ceph/osd.0/current/2.234_head 'info' len 309
2010-11-18 08:53:48.672457 7f9968aba710 osd0 5679 pg[0.252( v 1486'456 lc 0'0 (1486'454,1486'456]+backlog n=453 ec=2 les=5666 5665/5665/5665) [0,1] r=0 mlcod 0'0 active+degraded m=453 u=453] recover_primary 1000000079f.00000010/head 1057'2 (unfound) (missing) (missing head)
2010-11-18 08:53:48.672992 7f996f3c8710 osd0 5679 pg[2.234( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] _finish_recovery
2010-11-18 08:53:48.673021 7f996f3c8710 osd0 5679 pg[2.234( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] purge_strays 1
2010-11-18 08:53:48.673033 7f996f3c8710 osd0 5679 pg[2.234( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] sending PGRemove to osd1
2010-11-18 08:53:48.673050 7f996f3c8710 osd0 5679 pg[2.234( empty n=0 ec=2 les=5666 5665/5665/5665) [0] r=0 mlcod 0'0 active+clean+degraded] update_stats 5665'1834
osd/ReplicatedPG.cc: In function 'bool ReplicatedPG::pull(const sobject_t&)':
osd/ReplicatedPG.cc:2783: FAILED assert(missing_loc.count(soid))
ceph version 0.24~rc (commit:4adfdee7f179083ad0b27ce935302ce0b83630cc)
 1: (ReplicatedPG::pull(sobject_t const&)+0x4c2) [0x498992]
 2: (ReplicatedPG::recover_primary(int)+0x7e1) [0x499c31]
 3: (ReplicatedPG::start_recovery_ops(int)+0x4a) [0x49a75a]
 4: (OSD::do_recovery(PG*)+0x19c) [0x4c278c]
 5: (ThreadPool::worker()+0x28f) [0x5cc52f]
 6: (ThreadPool::WorkThread::entry()+0xd) [0x4feead]
 7: (Thread::_entry_func(void*)+0xa) [0x46f8ba]
 8: (()+0x69ca) [0x7f9973b7e9ca]
 9: (clone()+0x6d) [0x7f99728e470d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
ceph version 0.24~rc (commit:4adfdee7f179083ad0b27ce935302ce0b83630cc)
 1: (sigabrt_handler(int)+0x7d) [0x5deb7d]
 2: (()+0x33af0) [0x7f9972831af0]
 3: (gsignal()+0x35) [0x7f9972831a75]
 4: (abort()+0x180) [0x7f99728355c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f99730e78e5]
 6: (()+0xcad16) [0x7f99730e5d16]
 7: (()+0xcad43) [0x7f99730e5d43]
 8: (()+0xcae3e) [0x7f99730e5e3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5cbc38]
 10: (ReplicatedPG::pull(sobject_t const&)+0x4c2) [0x498992]
 11: (ReplicatedPG::recover_primary(int)+0x7e1) [0x499c31]
 12: (ReplicatedPG::start_recovery_ops(int)+0x4a) [0x49a75a]
 13: (OSD::do_recovery(PG*)+0x19c) [0x4c278c]
 14: (ThreadPool::worker()+0x28f) [0x5cc52f]
 15: (ThreadPool::WorkThread::entry()+0xd) [0x4feead]
 16: (Thread::_entry_func(void*)+0xa) [0x46f8ba]
 17: (()+0x69ca) [0x7f9973b7e9ca]
 18: (clone()+0x6d) [0x7f99728e470d]
As you can see, I'm running the latest unstable.
It seems that all my OSDs went down with this backtrace.
Updated by Colin McCabe over 13 years ago
- Status changed from Closed to In Progress
Need to see what's going on with this.
Updated by Colin McCabe over 13 years ago
- Status changed from In Progress to Resolved
Fixed by commit:82f1de8c0d6e7817ca7d6dd710e3176b2a549e12
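The referenced commit itself is not quoted in this ticket. As a general illustration of the pattern such a fix usually takes (a hypothetical sketch, not the actual patch), the recovery loop can defer objects that are missing but have no known source, instead of asserting and aborting the daemon:

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for Ceph's sobject_t, for illustration only.
using sobject_t = std::string;

// Sketch of a defensive recovery pass: collect only the objects that
// actually have a known source OSD; "unfound" objects (missing, but no
// entry in missing_loc) are skipped until a peer reports a copy.
std::vector<sobject_t> select_pullable(
    const std::set<sobject_t>& missing,
    const std::map<sobject_t, std::set<int>>& missing_loc)
{
    std::vector<sobject_t> pulls;
    for (const auto& soid : missing) {
        auto it = missing_loc.find(soid);
        if (it == missing_loc.end() || it->second.empty())
            continue;  // unfound: defer instead of asserting
        pulls.push_back(soid);
    }
    return pulls;
}
```

With this shape of check in place, the "(unfound) (missing)" objects from the logs above would simply stay queued rather than tripping assert(missing_loc.count(soid)).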