Bug #611
OSD: OSDMap::get_cluster_inst
Status: Closed
Description
After upgrading to the latest unstable, one OSD crashed; before the upgrade, 10 of the 12 OSDs were online. I couldn't reproduce the crash, and it didn't occur again.
Logging was set to 20, which showed:
2010-11-24 11:58:55.153334 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] clear_prior
2010-11-24 11:58:55.153352 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] cancel_generate_backlog
2010-11-24 11:58:55.153373 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] write_info info 309
2010-11-24 11:58:55.153426 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] write_info bigbl 3556
2010-11-24 11:58:55.153446 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] write_log
2010-11-24 11:58:55.153473 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] write_log to 0~294
2010-11-24 11:58:55.153491 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] clean_up_local
2010-11-24 11:58:55.153509 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate - snap_trimq []
2010-11-24 11:58:55.153526 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate - not complete, missing(1)
2010-11-24 11:58:55.153543 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate - complete_to = 4532'2713
2010-11-24 11:58:55.153568 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate - starting recovery
2010-11-24 11:58:55.153594 7f2ece3a8710 osd5 7048 queue_for_recovery queued pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1]
2010-11-24 11:58:55.153616 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd0 is up to date, but sending pg_log anyway
2010-11-24 11:58:55.153641 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd0 sending log(0'0,0'0] missing(0)
2010-11-24 11:58:55.153664 7f2ece3a8710 -- [::]:6801/13134 --> osd0 [2001:16f8:10:2::c3c3:8f6b]:6801/22472 -- pg_log(3.10b e7048) v1 -- ?+0 0x5b96000
2010-11-24 11:58:55.153689 7f2ecbaa2710 osd5 7048 _recover_now defer until 2010-11-24 11:59:10.153587
2010-11-24 11:58:55.153739 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd0 already uptodate, 3.10b( v 4532'2713 (4476'2710,4532'2713] n=10 ec=2 les=7036 7037/7037/7037)
2010-11-24 11:58:55.153767 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd4 is up to date, but sending pg_log anyway
2010-11-24 11:58:55.153785 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd4 sending log(0'0,0'0] missing(0)
2010-11-24 11:58:55.153803 7f2ece3a8710 -- [::]:6801/13134 --> osd4 [2001:16f8:10:2::c3c3:fa1b]:6801/26124 -- pg_log(3.10b e7048) v1 -- ?+0 0x52b2800
2010-11-24 11:58:55.153824 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd4 3.10b( v 4532'2713 lc 4522'2711 (4476'2708,4532'2713] n=10 ec=2 les=7036 7037/7037/7037) missing missing(1)
2010-11-24 11:58:55.153847 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] is_all_uptodate: primary has 1
2010-11-24 11:58:55.153865 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] is_all_uptodate: osd4 has 1 missing
2010-11-24 11:58:55.153882 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate not all replicas are uptodate, queueing recovery
2010-11-24 11:58:55.153903 7f2ece3a8710 osd5 7048 queue_for_recovery already queued pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1]
2010-11-24 11:58:55.153930 7f2ecbaa2710 osd5 7048 _recover_now defer until 2010-11-24 11:59:10.153587
2010-11-24 11:58:55.153954 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] update_stats 7037'5665
2010-11-24 11:58:55.153976 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] discover_all_missing 1 missing, 1 unfound
2010-11-24 11:58:55.154002 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] update_stats 7037'5666
2010-11-24 11:58:55.154028 7f2ece3a8710 osd5 7048 do_queries querying osd1 on 1 PGs
2010-11-24 11:58:55.154040 7f2ece3a8710 -- [::]:6801/13134 --> osd1 [2001:16f8:10:2::c3c3:a24f]:6801/3750 -- PGq v1 -- ?+0 0x1582700
2010-11-24 11:58:55.154062 7f2ece3a8710 osd5 7048 do_queries querying osd2 on 1 PGs
2010-11-24 11:58:55.154072 7f2ece3a8710 -- [::]:6801/13134 --> osd2 [2001:16f8:10:2::c3c3:4a8c]:6801/9520 -- PGq v1 -- ?+0 0x1582a80
2010-11-24 11:58:55.154093 7f2ece3a8710 osd5 7048 do_queries querying osd3 on 1 PGs
2010-11-24 11:58:55.154103 7f2ece3a8710 -- [::]:6801/13134 --> osd3 [2001:16f8:10:2::c3c3:2e3a]:6801/3084 -- PGq v1 -- ?+0 0x1582000
2010-11-24 11:58:55.154124 7f2ece3a8710 osd5 7048 do_queries querying osd7 on 1 PGs
osd/OSDMap.h: In function 'entity_inst_t OSDMap::get_cluster_inst(int)':
osd/OSDMap.h:479: FAILED assert(is_up(osd))
ceph version 0.24~rc (commit:d6e8e8d15d22b51ec86bc5687336c3d50d9b3a5d)
1: (OSDMap::get_cluster_inst(int)+0x159) [0x4a40b9]
2: (OSD::do_queries(std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > >, std::less<int>, std::allocator<std::pair<int const, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > > > > >&)+0x12b) [0x4c7adb]
3: (OSD::_process_pg_info(unsigned int, int, PG::Info&, PG::Log&, PG::Missing*, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*, int&)+0x96a) [0x4e515a]
4: (OSD::handle_pg_log(MOSDPGLog*)+0xb1) [0x4f3071]
5: (OSD::_dispatch(Message*)+0x42d) [0x4f59ad]
6: (OSD::ms_dispatch(Message*)+0x39) [0x4f62d9]
7: (SimpleMessenger::dispatch_entry()+0x759) [0x463b29]
8: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45a99c]
9: (Thread::_entry_func(void*)+0xa) [0x47015a]
10: (()+0x69ca) [0x7f2ed83699ca]
11: (clone()+0x6d) [0x7f2ed70cf70d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
ceph version 0.24~rc (commit:d6e8e8d15d22b51ec86bc5687336c3d50d9b3a5d)
1: (sigabrt_handler(int)+0x7d) [0x5dec4d]
2: (()+0x33af0) [0x7f2ed701caf0]
3: (gsignal()+0x35) [0x7f2ed701ca75]
4: (abort()+0x180) [0x7f2ed70205c0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f2ed78d28e5]
6: (()+0xcad16) [0x7f2ed78d0d16]
7: (()+0xcad43) [0x7f2ed78d0d43]
8: (()+0xcae3e) [0x7f2ed78d0e3e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5cbc88]
10: (OSDMap::get_cluster_inst(int)+0x159) [0x4a40b9]
11: (OSD::do_queries(std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > >, std::less<int>, std::allocator<std::pair<int const, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > > > > >&)+0x12b) [0x4c7adb]
12: (OSD::_process_pg_info(unsigned int, int, PG::Info&, PG::Log&, PG::Missing*, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*, int&)+0x96a) [0x4e515a]
13: (OSD::handle_pg_log(MOSDPGLog*)+0xb1) [0x4f3071]
14: (OSD::_dispatch(Message*)+0x42d) [0x4f59ad]
15: (OSD::ms_dispatch(Message*)+0x39) [0x4f62d9]
16: (SimpleMessenger::dispatch_entry()+0x759) [0x463b29]
17: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45a99c]
18: (Thread::_entry_func(void*)+0xa) [0x47015a]
19: (()+0x69ca) [0x7f2ed83699ca]
20: (clone()+0x6d) [0x7f2ed70cf70d]
I've run cdebugpack and gathered the data at logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_get_cluster_inst
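The failed assert in the trace is easiest to read as a precondition check: get_cluster_inst() refuses to return an address for an OSD the map considers down. A minimal sketch of that pattern (invented names and types, not the actual Ceph source):

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy stand-in for the OSDMap precondition seen in the trace (invented
// names, not the real Ceph code): asking for the cluster address of an
// OSD the map considers down trips the assert and aborts the daemon.
struct ToyOSDMap {
  std::map<int, std::string> up;  // osd id -> cluster address, up OSDs only

  bool is_up(int osd) const { return up.count(osd) > 0; }

  std::string get_cluster_inst(int osd) const {
    assert(is_up(osd));  // mirrors "FAILED assert(is_up(osd))" in the trace
    return up.at(osd);
  }
};
```

Per the trace, osd5 hit this assert while do_queries was about to contact osd7, which suggests osd7 was not up in the map osd5 was using.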
Updated by Greg Farnum over 13 years ago
- Assignee set to Samuel Just
Sam said he'd look at this since it's in the background scrubbing bits that he and Josh worked on.
Updated by Greg Farnum over 13 years ago
- Assignee changed from Samuel Just to Colin McCabe
Okay, I somehow mixed this bug up with another one when commenting and setting the assignee. Whoops, sorry guys!
This looks like the OSD is assembling a list of missing-object queries and then sending them out without checking whether the recipient OSD is up. It needs to filter the list somewhere, but I'm not sure where is appropriate, and this is your code, so it's yours! :)
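The filtering described above could be sketched as follows (names and types are invented for illustration; this is not the actual patch that landed): prune the per-OSD query map against the current set of up OSDs before sending anything, so the address lookup that asserts is_up(osd) is never reached for a down peer.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Illustrative sketch of the suggested filtering (invented names, not the
// actual Ceph fix): before dispatching the per-OSD query map, drop every
// entry whose target OSD is not up in the current map.
using QueryMap = std::map<int, std::vector<std::string>>;  // osd -> queries

QueryMap filter_down_targets(const QueryMap& queries,
                             const std::set<int>& up_osds) {
  QueryMap kept;
  for (const auto& entry : queries) {
    if (up_osds.count(entry.first))  // keep only targets that are still up
      kept.insert(entry);
  }
  return kept;
}
```

Whether such a filter belongs in the sending path itself or in the code that builds the query map is exactly the open question raised here.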
Updated by Colin McCabe over 13 years ago
- Status changed from New to 7
Commit da087e47c21190f9cbde4d24182b7dfe581cd069 should resolve this.