Bug #611

OSD: OSDMap::get_cluster_inst

Added by Wido den Hollander over 13 years ago. Updated over 13 years ago.

Status: Resolved
Priority: Normal
Category: OSD
% Done: 0%

Description

After upgrading to the latest unstable, one OSD crashed. Before the upgrade, 10 of the 12 OSDs were online.

The crash happened only once, during the upgrade itself; I couldn't reproduce it, and the OSD wouldn't crash again afterwards.

Logging was set to level 20, which showed:

2010-11-24 11:58:55.153334 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] clear_prior
2010-11-24 11:58:55.153352 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] cancel_generate_backlog
2010-11-24 11:58:55.153373 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] write_info info 309
2010-11-24 11:58:55.153426 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] write_info bigbl 3556
2010-11-24 11:58:55.153446 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] write_log
2010-11-24 11:58:55.153473 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] write_log to 0~294
2010-11-24 11:58:55.153491 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] clean_up_local
2010-11-24 11:58:55.153509 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate - snap_trimq []
2010-11-24 11:58:55.153526 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate - not complete, missing(1)
2010-11-24 11:58:55.153543 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate -     complete_to = 4532'2713
2010-11-24 11:58:55.153568 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate - starting recovery
2010-11-24 11:58:55.153594 7f2ece3a8710 osd5 7048 queue_for_recovery queued pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1]
2010-11-24 11:58:55.153616 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd0 is up to date, but sending pg_log anyway
2010-11-24 11:58:55.153641 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd0 sending log(0'0,0'0] missing(0)
2010-11-24 11:58:55.153664 7f2ece3a8710 -- [::]:6801/13134 --> osd0 [2001:16f8:10:2::c3c3:8f6b]:6801/22472 -- pg_log(3.10b e7048) v1 -- ?+0 0x5b96000
2010-11-24 11:58:55.153689 7f2ecbaa2710 osd5 7048 _recover_now defer until 2010-11-24 11:59:10.153587
2010-11-24 11:58:55.153739 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd0 already uptodate, 3.10b( v 4532'2713 (4476'2710,4532'2713] n=10 ec=2 les=7036 7037/7037/7037)
2010-11-24 11:58:55.153767 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd4 is up to date, but sending pg_log anyway
2010-11-24 11:58:55.153785 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd4 sending log(0'0,0'0] missing(0)
2010-11-24 11:58:55.153803 7f2ece3a8710 -- [::]:6801/13134 --> osd4 [2001:16f8:10:2::c3c3:fa1b]:6801/26124 -- pg_log(3.10b e7048) v1 -- ?+0 0x52b2800
2010-11-24 11:58:55.153824 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd4 3.10b( v 4532'2713 lc 4522'2711 (4476'2708,4532'2713] n=10 ec=2 les=7036 7037/7037/7037) missing missing(1)
2010-11-24 11:58:55.153847 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] is_all_uptodate: primary has 1
2010-11-24 11:58:55.153865 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] is_all_uptodate: osd4 has 1 missing
2010-11-24 11:58:55.153882 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate not all replicas are uptodate, queueing recovery
2010-11-24 11:58:55.153903 7f2ece3a8710 osd5 7048 queue_for_recovery already queued pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1]
2010-11-24 11:58:55.153930 7f2ecbaa2710 osd5 7048 _recover_now defer until 2010-11-24 11:59:10.153587
2010-11-24 11:58:55.153954 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] update_stats 7037'5665
2010-11-24 11:58:55.153976 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] discover_all_missing 1 missing, 1 unfound
2010-11-24 11:58:55.154002 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] update_stats 7037'5666
2010-11-24 11:58:55.154028 7f2ece3a8710 osd5 7048 do_queries querying osd1 on 1 PGs
2010-11-24 11:58:55.154040 7f2ece3a8710 -- [::]:6801/13134 --> osd1 [2001:16f8:10:2::c3c3:a24f]:6801/3750 -- PGq v1 -- ?+0 0x1582700
2010-11-24 11:58:55.154062 7f2ece3a8710 osd5 7048 do_queries querying osd2 on 1 PGs
2010-11-24 11:58:55.154072 7f2ece3a8710 -- [::]:6801/13134 --> osd2 [2001:16f8:10:2::c3c3:4a8c]:6801/9520 -- PGq v1 -- ?+0 0x1582a80
2010-11-24 11:58:55.154093 7f2ece3a8710 osd5 7048 do_queries querying osd3 on 1 PGs
2010-11-24 11:58:55.154103 7f2ece3a8710 -- [::]:6801/13134 --> osd3 [2001:16f8:10:2::c3c3:2e3a]:6801/3084 -- PGq v1 -- ?+0 0x1582000
2010-11-24 11:58:55.154124 7f2ece3a8710 osd5 7048 do_queries querying osd7 on 1 PGs
osd/OSDMap.h: In function 'entity_inst_t OSDMap::get_cluster_inst(int)':
osd/OSDMap.h:479: FAILED assert(is_up(osd))
 ceph version 0.24~rc (commit:d6e8e8d15d22b51ec86bc5687336c3d50d9b3a5d)
 1: (OSDMap::get_cluster_inst(int)+0x159) [0x4a40b9]
 2: (OSD::do_queries(std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > >, std::less<int>, std::allocator<std::pair<int const, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > > > > >&)+0x12b) [0x4c7adb]
 3: (OSD::_process_pg_info(unsigned int, int, PG::Info&, PG::Log&, PG::Missing*, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*, int&)+0x96a) [0x4e515a]
 4: (OSD::handle_pg_log(MOSDPGLog*)+0xb1) [0x4f3071]
 5: (OSD::_dispatch(Message*)+0x42d) [0x4f59ad]
 6: (OSD::ms_dispatch(Message*)+0x39) [0x4f62d9]
 7: (SimpleMessenger::dispatch_entry()+0x759) [0x463b29]
 8: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45a99c]
 9: (Thread::_entry_func(void*)+0xa) [0x47015a]
 10: (()+0x69ca) [0x7f2ed83699ca]
 11: (clone()+0x6d) [0x7f2ed70cf70d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
 ceph version 0.24~rc (commit:d6e8e8d15d22b51ec86bc5687336c3d50d9b3a5d)
 1: (sigabrt_handler(int)+0x7d) [0x5dec4d]
 2: (()+0x33af0) [0x7f2ed701caf0]
 3: (gsignal()+0x35) [0x7f2ed701ca75]
 4: (abort()+0x180) [0x7f2ed70205c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f2ed78d28e5]
 6: (()+0xcad16) [0x7f2ed78d0d16]
 7: (()+0xcad43) [0x7f2ed78d0d43]
 8: (()+0xcae3e) [0x7f2ed78d0e3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5cbc88]
 10: (OSDMap::get_cluster_inst(int)+0x159) [0x4a40b9]
 11: (OSD::do_queries(std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > >, std::less<int>, std::allocator<std::pair<int const, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > > > > >&)+0x12b) [0x4c7adb]
 12: (OSD::_process_pg_info(unsigned int, int, PG::Info&, PG::Log&, PG::Missing*, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*, int&)+0x96a) [0x4e515a]
 13: (OSD::handle_pg_log(MOSDPGLog*)+0xb1) [0x4f3071]
 14: (OSD::_dispatch(Message*)+0x42d) [0x4f59ad]
 15: (OSD::ms_dispatch(Message*)+0x39) [0x4f62d9]
 16: (SimpleMessenger::dispatch_entry()+0x759) [0x463b29]
 17: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45a99c]
 18: (Thread::_entry_func(void*)+0xa) [0x47015a]
 19: (()+0x69ca) [0x7f2ed83699ca]
 20: (clone()+0x6d) [0x7f2ed70cf70d]

I've run cdebugpack and put the gathered data at logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_get_cluster_inst
