Bug #611: OSD: OSDMap::get_cluster_inst
Status: Closed
% Done: 0%
Description
Before the upgrade, 10 of the 12 OSDs were online. After upgrading to the latest unstable, one OSD crashed. I couldn't reproduce it; the OSD didn't crash again.
Logging was set to 20, which showed:
2010-11-24 11:58:55.153334 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] clear_prior
2010-11-24 11:58:55.153352 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] cancel_generate_backlog
2010-11-24 11:58:55.153373 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] write_info info 309
2010-11-24 11:58:55.153426 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] write_info bigbl 3556
2010-11-24 11:58:55.153446 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] write_log
2010-11-24 11:58:55.153473 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] write_log to 0~294
2010-11-24 11:58:55.153491 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] clean_up_local
2010-11-24 11:58:55.153509 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate - snap_trimq []
2010-11-24 11:58:55.153526 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate - not complete, missing(1)
2010-11-24 11:58:55.153543 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate - complete_to = 4532'2713
2010-11-24 11:58:55.153568 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate - starting recovery
2010-11-24 11:58:55.153594 7f2ece3a8710 osd5 7048 queue_for_recovery queued pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1]
2010-11-24 11:58:55.153616 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd0 is up to date, but sending pg_log anyway
2010-11-24 11:58:55.153641 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd0 sending log(0'0,0'0] missing(0)
2010-11-24 11:58:55.153664 7f2ece3a8710 -- [::]:6801/13134 --> osd0 [2001:16f8:10:2::c3c3:8f6b]:6801/22472 -- pg_log(3.10b e7048) v1 -- ?+0 0x5b96000
2010-11-24 11:58:55.153689 7f2ecbaa2710 osd5 7048 _recover_now defer until 2010-11-24 11:59:10.153587
2010-11-24 11:58:55.153739 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd0 already uptodate, 3.10b( v 4532'2713 (4476'2710,4532'2713] n=10 ec=2 les=7036 7037/7037/7037)
2010-11-24 11:58:55.153767 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd4 is up to date, but sending pg_log anyway
2010-11-24 11:58:55.153785 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd4 sending log(0'0,0'0] missing(0)
2010-11-24 11:58:55.153803 7f2ece3a8710 -- [::]:6801/13134 --> osd4 [2001:16f8:10:2::c3c3:fa1b]:6801/26124 -- pg_log(3.10b e7048) v1 -- ?+0 0x52b2800
2010-11-24 11:58:55.153824 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd4 3.10b( v 4532'2713 lc 4522'2711 (4476'2708,4532'2713] n=10 ec=2 les=7036 7037/7037/7037) missing missing(1)
2010-11-24 11:58:55.153847 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] is_all_uptodate: primary has 1
2010-11-24 11:58:55.153865 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] is_all_uptodate: osd4 has 1 missing
2010-11-24 11:58:55.153882 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate not all replicas are uptodate, queueing recovery
2010-11-24 11:58:55.153903 7f2ece3a8710 osd5 7048 queue_for_recovery already queued pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1]
2010-11-24 11:58:55.153930 7f2ecbaa2710 osd5 7048 _recover_now defer until 2010-11-24 11:59:10.153587
2010-11-24 11:58:55.153954 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] update_stats 7037'5665
2010-11-24 11:58:55.153976 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] discover_all_missing 1 missing, 1 unfound
2010-11-24 11:58:55.154002 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] update_stats 7037'5666
2010-11-24 11:58:55.154028 7f2ece3a8710 osd5 7048 do_queries querying osd1 on 1 PGs
2010-11-24 11:58:55.154040 7f2ece3a8710 -- [::]:6801/13134 --> osd1 [2001:16f8:10:2::c3c3:a24f]:6801/3750 -- PGq v1 -- ?+0 0x1582700
2010-11-24 11:58:55.154062 7f2ece3a8710 osd5 7048 do_queries querying osd2 on 1 PGs
2010-11-24 11:58:55.154072 7f2ece3a8710 -- [::]:6801/13134 --> osd2 [2001:16f8:10:2::c3c3:4a8c]:6801/9520 -- PGq v1 -- ?+0 0x1582a80
2010-11-24 11:58:55.154093 7f2ece3a8710 osd5 7048 do_queries querying osd3 on 1 PGs
2010-11-24 11:58:55.154103 7f2ece3a8710 -- [::]:6801/13134 --> osd3 [2001:16f8:10:2::c3c3:2e3a]:6801/3084 -- PGq v1 -- ?+0 0x1582000
2010-11-24 11:58:55.154124 7f2ece3a8710 osd5 7048 do_queries querying osd7 on 1 PGs
osd/OSDMap.h: In function 'entity_inst_t OSDMap::get_cluster_inst(int)':
osd/OSDMap.h:479: FAILED assert(is_up(osd))
ceph version 0.24~rc (commit:d6e8e8d15d22b51ec86bc5687336c3d50d9b3a5d)
1: (OSDMap::get_cluster_inst(int)+0x159) [0x4a40b9]
2: (OSD::do_queries(std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > >, std::less<int>, std::allocator<std::pair<int const, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > > > > >&)+0x12b) [0x4c7adb]
3: (OSD::_process_pg_info(unsigned int, int, PG::Info&, PG::Log&, PG::Missing*, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*, int&)+0x96a) [0x4e515a]
4: (OSD::handle_pg_log(MOSDPGLog*)+0xb1) [0x4f3071]
5: (OSD::_dispatch(Message*)+0x42d) [0x4f59ad]
6: (OSD::ms_dispatch(Message*)+0x39) [0x4f62d9]
7: (SimpleMessenger::dispatch_entry()+0x759) [0x463b29]
8: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45a99c]
9: (Thread::_entry_func(void*)+0xa) [0x47015a]
10: (()+0x69ca) [0x7f2ed83699ca]
11: (clone()+0x6d) [0x7f2ed70cf70d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
ceph version 0.24~rc (commit:d6e8e8d15d22b51ec86bc5687336c3d50d9b3a5d)
1: (sigabrt_handler(int)+0x7d) [0x5dec4d]
2: (()+0x33af0) [0x7f2ed701caf0]
3: (gsignal()+0x35) [0x7f2ed701ca75]
4: (abort()+0x180) [0x7f2ed70205c0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f2ed78d28e5]
6: (()+0xcad16) [0x7f2ed78d0d16]
7: (()+0xcad43) [0x7f2ed78d0d43]
8: (()+0xcae3e) [0x7f2ed78d0e3e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5cbc88]
10: (OSDMap::get_cluster_inst(int)+0x159) [0x4a40b9]
11: (OSD::do_queries(std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > >, std::less<int>, std::allocator<std::pair<int const, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > > > > >&)+0x12b) [0x4c7adb]
12: (OSD::_process_pg_info(unsigned int, int, PG::Info&, PG::Log&, PG::Missing*, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*, int&)+0x96a) [0x4e515a]
13: (OSD::handle_pg_log(MOSDPGLog*)+0xb1) [0x4f3071]
14: (OSD::_dispatch(Message*)+0x42d) [0x4f59ad]
15: (OSD::ms_dispatch(Message*)+0x39) [0x4f62d9]
16: (SimpleMessenger::dispatch_entry()+0x759) [0x463b29]
17: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45a99c]
18: (Thread::_entry_func(void*)+0xa) [0x47015a]
19: (()+0x69ca) [0x7f2ed83699ca]
20: (clone()+0x6d) [0x7f2ed70cf70d]
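If I read the trace right, do_queries asked OSDMap::get_cluster_inst for the cluster address of a peer (osd7, per the last log line before the abort) that the current map no longer marks as up, which trips the assert at osd/OSDMap.h:479. Below is a minimal standalone C++ sketch of that pattern; it is not the actual Ceph code, the types, members, and addresses are stand-ins, and the commented-out guard is just one obvious defensive fix.

// Standalone sketch (not Ceph source) of the failure suggested by the
// backtrace: a query loop resolves each peer's address via a lookup that
// asserts the peer is still up, and aborts when one has gone down.
#include <cassert>
#include <cstdio>
#include <map>
#include <set>
#include <string>

struct entity_inst_t { std::string addr; };  // stand-in for Ceph's type

struct OSDMap {
  std::map<int, entity_inst_t> up;  // hypothetical: only up OSDs present

  bool is_up(int osd) const { return up.count(osd) != 0; }

  // Mirrors the assert at osd/OSDMap.h:479: callers must check is_up() first.
  entity_inst_t get_cluster_inst(int osd) const {
    assert(is_up(osd));  // aborts when asked about a down OSD, as in the crash
    return up.at(osd);
  }
};

// Hypothetical query loop: walks the peers recorded during peering, even if
// the newest map has since marked one of them (osd7 below) down.
void do_queries(const OSDMap& map, const std::set<int>& peers) {
  for (int who : peers) {
    // One defensive fix would be to skip peers that are no longer up:
    // if (!map.is_up(who)) continue;
    entity_inst_t inst = map.get_cluster_inst(who);  // fires the assert for osd7
    std::printf("querying osd%d at %s\n", who, inst.addr.c_str());
  }
}

int main() {
  OSDMap map;
  map.up[1] = {"[2001:16f8:10:2::c3c3:a24f]:6801"};
  map.up[2] = {"[2001:16f8:10:2::c3c3:4a8c]:6801"};
  map.up[3] = {"[2001:16f8:10:2::c3c3:2e3a]:6801"};
  // osd7 is absent (down), so the loop aborts on it, matching the report.
  do_queries(map, {1, 2, 3, 7});
}

Built with g++ -std=c++11, the sketch aborts on the osd7 lookup just as the daemon did.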
I've run cdebugpack; the gathered data is at logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_get_cluster_inst