Bug #611

OSD: OSDMap::get_cluster_inst

Added by Wido den Hollander over 13 years ago. Updated over 13 years ago.

Status: Resolved
Priority: Normal
Assignee: Colin McCabe
Category: OSD
Target version: v0.24
% Done: 0%

Description

After upgrading to the latest unstable, one OSD crashed. Before the upgrade, 10 of the 12 OSDs were online.

The crash happened only once, during the upgrade itself; I couldn't reproduce it afterwards.

Logging was set to level 20, which showed:

2010-11-24 11:58:55.153334 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] clear_prior
2010-11-24 11:58:55.153352 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] cancel_generate_backlog
2010-11-24 11:58:55.153373 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] write_info info 309
2010-11-24 11:58:55.153426 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] write_info bigbl 3556
2010-11-24 11:58:55.153446 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] write_log
2010-11-24 11:58:55.153473 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] write_log to 0~294
2010-11-24 11:58:55.153491 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] clean_up_local
2010-11-24 11:58:55.153509 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate - snap_trimq []
2010-11-24 11:58:55.153526 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate - not complete, missing(1)
2010-11-24 11:58:55.153543 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate -     complete_to = 4532'2713
2010-11-24 11:58:55.153568 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate - starting recovery
2010-11-24 11:58:55.153594 7f2ece3a8710 osd5 7048 queue_for_recovery queued pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1]
2010-11-24 11:58:55.153616 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd0 is up to date, but sending pg_log anyway
2010-11-24 11:58:55.153641 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd0 sending log(0'0,0'0] missing(0)
2010-11-24 11:58:55.153664 7f2ece3a8710 -- [::]:6801/13134 --> osd0 [2001:16f8:10:2::c3c3:8f6b]:6801/22472 -- pg_log(3.10b e7048) v1 -- ?+0 0x5b96000
2010-11-24 11:58:55.153689 7f2ecbaa2710 osd5 7048 _recover_now defer until 2010-11-24 11:59:10.153587
2010-11-24 11:58:55.153739 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd0 already uptodate, 3.10b( v 4532'2713 (4476'2710,4532'2713] n=10 ec=2 les=7036 7037/7037/7037)
2010-11-24 11:58:55.153767 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd4 is up to date, but sending pg_log anyway
2010-11-24 11:58:55.153785 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd4 sending log(0'0,0'0] missing(0)
2010-11-24 11:58:55.153803 7f2ece3a8710 -- [::]:6801/13134 --> osd4 [2001:16f8:10:2::c3c3:fa1b]:6801/26124 -- pg_log(3.10b e7048) v1 -- ?+0 0x52b2800
2010-11-24 11:58:55.153824 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate peer osd4 3.10b( v 4532'2713 lc 4522'2711 (4476'2708,4532'2713] n=10 ec=2 les=7036 7037/7037/7037) missing missing(1)
2010-11-24 11:58:55.153847 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] is_all_uptodate: primary has 1
2010-11-24 11:58:55.153865 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] is_all_uptodate: osd4 has 1 missing
2010-11-24 11:58:55.153882 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] activate not all replicas are uptodate, queueing recovery
2010-11-24 11:58:55.153903 7f2ece3a8710 osd5 7048 queue_for_recovery already queued pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1]
2010-11-24 11:58:55.153930 7f2ecbaa2710 osd5 7048 _recover_now defer until 2010-11-24 11:59:10.153587
2010-11-24 11:58:55.153954 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] update_stats 7037'5665
2010-11-24 11:58:55.153976 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] discover_all_missing 1 missing, 1 unfound
2010-11-24 11:58:55.154002 7f2ece3a8710 osd5 7048 pg[3.10b( v 4532'2713 lc 4524'2712 (4476'2710,4532'2713] n=10 ec=2 les=7048 7037/7037/7037) [5,0,4] r=0 lcod 0'0 mlcod 0'0 active m=1 u=1] update_stats 7037'5666
2010-11-24 11:58:55.154028 7f2ece3a8710 osd5 7048 do_queries querying osd1 on 1 PGs
2010-11-24 11:58:55.154040 7f2ece3a8710 -- [::]:6801/13134 --> osd1 [2001:16f8:10:2::c3c3:a24f]:6801/3750 -- PGq v1 -- ?+0 0x1582700
2010-11-24 11:58:55.154062 7f2ece3a8710 osd5 7048 do_queries querying osd2 on 1 PGs
2010-11-24 11:58:55.154072 7f2ece3a8710 -- [::]:6801/13134 --> osd2 [2001:16f8:10:2::c3c3:4a8c]:6801/9520 -- PGq v1 -- ?+0 0x1582a80
2010-11-24 11:58:55.154093 7f2ece3a8710 osd5 7048 do_queries querying osd3 on 1 PGs
2010-11-24 11:58:55.154103 7f2ece3a8710 -- [::]:6801/13134 --> osd3 [2001:16f8:10:2::c3c3:2e3a]:6801/3084 -- PGq v1 -- ?+0 0x1582000
2010-11-24 11:58:55.154124 7f2ece3a8710 osd5 7048 do_queries querying osd7 on 1 PGs
osd/OSDMap.h: In function 'entity_inst_t OSDMap::get_cluster_inst(int)':
osd/OSDMap.h:479: FAILED assert(is_up(osd))
 ceph version 0.24~rc (commit:d6e8e8d15d22b51ec86bc5687336c3d50d9b3a5d)
 1: (OSDMap::get_cluster_inst(int)+0x159) [0x4a40b9]
 2: (OSD::do_queries(std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > >, std::less<int>, std::allocator<std::pair<int const, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > > > > >&)+0x12b) [0x4c7adb]
 3: (OSD::_process_pg_info(unsigned int, int, PG::Info&, PG::Log&, PG::Missing*, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*, int&)+0x96a) [0x4e515a]
 4: (OSD::handle_pg_log(MOSDPGLog*)+0xb1) [0x4f3071]
 5: (OSD::_dispatch(Message*)+0x42d) [0x4f59ad]
 6: (OSD::ms_dispatch(Message*)+0x39) [0x4f62d9]
 7: (SimpleMessenger::dispatch_entry()+0x759) [0x463b29]
 8: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45a99c]
 9: (Thread::_entry_func(void*)+0xa) [0x47015a]
 10: (()+0x69ca) [0x7f2ed83699ca]
 11: (clone()+0x6d) [0x7f2ed70cf70d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
 ceph version 0.24~rc (commit:d6e8e8d15d22b51ec86bc5687336c3d50d9b3a5d)
 1: (sigabrt_handler(int)+0x7d) [0x5dec4d]
 2: (()+0x33af0) [0x7f2ed701caf0]
 3: (gsignal()+0x35) [0x7f2ed701ca75]
 4: (abort()+0x180) [0x7f2ed70205c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f2ed78d28e5]
 6: (()+0xcad16) [0x7f2ed78d0d16]
 7: (()+0xcad43) [0x7f2ed78d0d43]
 8: (()+0xcae3e) [0x7f2ed78d0e3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x448) [0x5cbc88]
 10: (OSDMap::get_cluster_inst(int)+0x159) [0x4a40b9]
 11: (OSD::do_queries(std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > >, std::less<int>, std::allocator<std::pair<int const, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > > > > >&)+0x12b) [0x4c7adb]
 12: (OSD::_process_pg_info(unsigned int, int, PG::Info&, PG::Log&, PG::Missing*, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*, int&)+0x96a) [0x4e515a]
 13: (OSD::handle_pg_log(MOSDPGLog*)+0xb1) [0x4f3071]
 14: (OSD::_dispatch(Message*)+0x42d) [0x4f59ad]
 15: (OSD::ms_dispatch(Message*)+0x39) [0x4f62d9]
 16: (SimpleMessenger::dispatch_entry()+0x759) [0x463b29]
 17: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45a99c]
 18: (Thread::_entry_func(void*)+0xa) [0x47015a]
 19: (()+0x69ca) [0x7f2ed83699ca]
 20: (clone()+0x6d) [0x7f2ed70cf70d]

I've run cdebugpack and gathered the data at logger.ceph.widodh.nl:/srv/ceph/issues/osd_crash_get_cluster_inst
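
For illustration only, a minimal self-contained sketch of the precondition that fails here. MiniOSDMap and its members are hypothetical stand-ins, not the real Ceph classes; only the assert(is_up(osd)) check is taken from osd/OSDMap.h:479 in the trace above:

#include <cassert>
#include <set>
#include <string>

// Stand-in for OSDMap: tracks which OSDs are marked up in this map epoch.
struct MiniOSDMap {
  std::set<int> up;

  bool is_up(int osd) const { return up.count(osd) > 0; }

  // Stand-in for OSDMap::get_cluster_inst(): only meaningful for an up OSD.
  std::string get_cluster_inst(int osd) const {
    assert(is_up(osd));  // osd/OSDMap.h:479: FAILED assert(is_up(osd))
    return "cluster address of osd" + std::to_string(osd);
  }
};

int main() {
  MiniOSDMap map{{0, 1, 2, 3, 4, 5}};  // suppose osd7 is not up in this epoch
  map.get_cluster_inst(5);             // fine: osd5 is up
  // map.get_cluster_inst(7);          // would abort, as do_queries() did above
  return 0;
}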

Actions #1

Updated by Greg Farnum over 13 years ago

  • Assignee set to Samuel Just

Sam said he'd look at this since it's in the background scrubbing bits that he and Josh did.

Actions #2

Updated by Greg Farnum over 13 years ago

  • Assignee changed from Samuel Just to Colin McCabe

Okay, I somehow mixed this bug up with another one when I commented and set the assignee. Whoops, sorry guys!
This looks like the OSD is assembling a list of queries for its missing objects and then sending them out without checking whether the recipient OSD is up. It needs to filter that list somewhere, but I'm not sure where is appropriate, and this is your code, so it's yours! :)
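
As a rough sketch of that filtering, with stand-in types (filter_queries, Query, and pg_t below are placeholders, not the actual Ceph code): drop query destinations that the current map no longer marks up, before anything asks for their cluster address.

#include <iostream>
#include <map>
#include <set>
#include <string>

struct Query {};                 // placeholder for PG::Query
using pg_t = std::string;        // placeholder for the real pg_t

// Remove queries addressed to OSDs that are not up in the current map, so a
// later get_cluster_inst()-style lookup never sees a down OSD and asserts.
void filter_queries(const std::set<int>& up_osds,
                    std::map<int, std::map<pg_t, Query>>& query_map) {
  for (auto it = query_map.begin(); it != query_map.end();) {
    if (up_osds.count(it->first) == 0) {
      std::cout << "dropping queries for down osd" << it->first << "\n";
      it = query_map.erase(it);
    } else {
      ++it;
    }
  }
}

int main() {
  std::set<int> up_osds = {0, 1, 2, 3, 4, 5};      // osd7 is not up
  std::map<int, std::map<pg_t, Query>> query_map;
  query_map[1]["3.10b"] = Query{};
  query_map[7]["3.10b"] = Query{};                 // would have hit the assert
  filter_queries(up_osds, query_map);              // leaves only osd1's queries
  return 0;
}

Whether that filtering belongs in do_queries() itself or earlier, when the query map is assembled, is exactly the open question above.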

Actions #3

Updated by Colin McCabe over 13 years ago

I'll take a look.

Actions #4

Updated by Colin McCabe over 13 years ago

  • Status changed from New to 7
Actions #5

Updated by Sage Weil over 13 years ago

  • Target version set to v0.24
Actions #6

Updated by Sage Weil over 13 years ago

  • Status changed from 7 to Resolved