Bug #566

closed

osd: build_prior needs to be wary of nonexistent osds

Added by Sage Weil over 13 years ago. Updated over 13 years ago.

Status: Resolved
Priority: Urgent
Category: OSD
% Done: 0%

Description

2010-11-08 22:34:35.332280 7f84cec17710 osd0 50 pg[3.0p3( empty n=0 ec=2 les=23 50/50/50) [0,1] r=0 mlcod 0'0 !hml crashed+peering] build_prior interval(21-34 []/[3,0] maybe_went_rw)
2010-11-08 22:34:35.332298 7f84cec17710 filestore(/data/osd0) read /data/osd0/current/meta/osdmap.34_0 0~0
2010-11-08 22:34:35.332329 7f84cec17710 filestore(/data/osd0) read /data/osd0/current/meta/osdmap.34_0 0~2538 = 2538
osd/OSDMap.h: In function 'osd_info_t& OSDMap::get_info(int)':
osd/OSDMap.h:490: FAILED assert(osd < max_osd)
 ceph version 0.22.1 (commit:7464f9688001aa89f9673ba14e6d075d0ee33541)
 1: (PG::peer(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > >, std::less<int>, std::allocator<std::pair<int const, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > > > > >&, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*)+0x8e0) [0x54a560]
 2: (OSD::activate_map(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&)+0x47d) [0x4e3bdd]
 3: (OSD::handle_osd_map(MOSDMap*)+0x2815) [0x4f6795]
 4: (OSD::_dispatch(Message*)+0x2ab) [0x4f89bb]
 5: (OSD::ms_dispatch(Message*)+0x39) [0x4f9429]
 6: (SimpleMessenger::dispatch_entry()+0x79b) [0x46a2db]
 7: (SimpleMessenger::DispatchThread::entry()+0x1f) [0x45d53f]
 8: (Thread::_entry_func(void*)+0xa) [0x470caa]
 9: (()+0x7971) [0x7f84d64f2971]
 10: (clone()+0x6d) [0x7f84d572391d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
*** Caught signal (ABRT) ***
 ceph version 0.22.1 (commit:7464f9688001aa89f9673ba14e6d075d0ee33541)
 1: (sigabrt_handler(int)+0xde) [0x5e06de]
 2: (()+0x33c20) [0x7f84d5670c20]
 3: (gsignal()+0x35) [0x7f84d5670ba5]
 4: (abort()+0x180) [0x7f84d56746b0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f84d5f146bd]
 6: (()+0xb9906) [0x7f84d5f12906]
 7: (()+0xb9933) [0x7f84d5f12933]
 8: (()+0xb9a3e) [0x7f84d5f12a3e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x69c) [0x5ce05c]
 10: (PG::build_prior()+0xa64) [0x546e64]
 11: (PG::peer(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > >, std::less<int>, std::allocator<std::pair<int const, std::map<pg_t, PG::Query, std::less<pg_t>, std::allocator<std::pair<pg_t const, PG::Query> > > > > >&, std::map<int, MOSDPGInfo*, std::less<int>, std::allocator<std::pair<int const, MOSDPGInfo*> > >*)+0x8e0) [0x54a560]
 12: (OSD::activate_map(ObjectStore::Transaction&, std::list<Context*, std::allocator<Context*> >&)+0x47d) [0x4e3bdd]
 13: (OSD::handle_osd_map(MOSDMap*)+0x2815) [0x4f6795]
 14: (OSD::_dispatch(Message*)+0x2ab) [0x4f89bb]
 15: (OSD::ms_dispatch(Message*)+0x39) [0x4f9429]
 16: (SimpleMessenger::dispatch_entry()+0x79b) [0x46a2db]
 17: (SimpleMessenger::DispatchThread::entry()+0x1f) [0x45d53f]
 18: (Thread::_entry_func(void*)+0xa) [0x470caa]
 19: (()+0x7971) [0x7f84d64f2971]
(gdb) up
#11 PG::build_prior (this=0x23126e0) at osd/PG.cc:949
warning: Source file is more recent than executable.
949           const osd_info_t& pinfo = osd->osdmap->get_info(o);
(gdb) p o
$2 = 3
(gdb) p osd->osdmap->epoch
$3 = 50
(gdb) list
944         }
945
946         // consider ACTING osds
947         for (unsigned i=0; i<interval.acting.size(); i++) {
948           int o = interval.acting[i];
949           const osd_info_t& pinfo = osd->osdmap->get_info(o);
950
951           // if the osd restarted after this interval but is not known to have
952           // cleanly survived through this interval, we mark the pg crashed.
953           if (pinfo.up_from > interval.last &&
(gdb) p o
$4 = 3

and the map in use is epoch 50:

root@cephdisk02:~# osdmaptool -p /data/osd0/current/meta/osdmap.50_0
osdmaptool: osdmap file '/data/osd0/current/meta/osdmap.50_0'
epoch 50
fsid e93c55d3-7255-edf2-4603-41bff032e92e
created 2010-10-29 16:19:56.133231
modifed 2010-11-08 22:36:02.107112
flags

pg_pool 0 'data' pg_pool(rep pg_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 lpg_num 2 lpgp_num 2 last_change 1 owner 0)
pg_pool 1 'metadata' pg_pool(rep pg_size 2 crush_ruleset 1 object_hash rjenkins pg_num 256 pgp_num 256 lpg_num 2 lpgp_num 2 last_change 1 owner 0)
pg_pool 2 'casdata' pg_pool(rep pg_size 2 crush_ruleset 2 object_hash rjenkins pg_num 256 pgp_num 256 lpg_num 2 lpgp_num 2 last_change 1 owner 0)
pg_pool 3 'rbd' pg_pool(rep pg_size 2 crush_ruleset 3 object_hash rjenkins pg_num 256 pgp_num 256 lpg_num 2 lpgp_num 2 last_change 1 owner 0)

max_osd 3
osd0 in weight 1 up   (up_from 50 up_thru 30 down_at 49 last_clean 21-37) 192.168.100.15:6804/23979 192.168.100.15:6805/23979
osd1 in weight 1 up   (up_from 3 up_thru 3 down_at 0 last_clean 0-0) 192.168.100.16:6803/2171 192.168.100.16:6804/2171
osd2 in weight 1 up   (up_from 41 up_thru 33 down_at 40 last_clean 26-35) 192.168.100.17:6801/2279 192.168.100.17:6802/2279

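i.e., max_osd went down to 3, so osd3 (still listed in the acting set [3,0] of the old interval 21-34) no longer exists in the epoch 50 map, and get_info(3) trips assert(osd < max_osd).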
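
One possible way to guard against this, sketched with simplified stand-in types rather than the real Ceph classes: check that an osd id from an old interval still exists in the current map before asking for its info. MiniOSDMap, OsdInfo, and surviving_acting are hypothetical names made up for this illustration, and the exists() here only checks the id range. The actual fix in the tree may look different.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for osd_info_t (illustration only).
struct OsdInfo {
  int up_from = 0;
};

// Hypothetical, simplified stand-in for OSDMap (illustration only).
struct MiniOSDMap {
  int max_osd = 0;
  std::vector<OsdInfo> info;

  // Simplified existence check: only verifies the id is in range.
  bool exists(int osd) const { return osd >= 0 && osd < max_osd; }

  // Mirrors the assert that fired in the backtrace.
  const OsdInfo& get_info(int osd) const {
    assert(osd < max_osd);
    return info[osd];
  }
};

// Walk the acting set of a past interval and return only the osds that
// still exist in the given map, skipping ids at or beyond max_osd.
std::vector<int> surviving_acting(const MiniOSDMap& map,
                                  const std::vector<int>& acting) {
  std::vector<int> out;
  for (int o : acting) {
    if (!map.exists(o))
      continue;  // osd was removed (max_osd shrank); do not call get_info()
    const OsdInfo& pinfo = map.get_info(o);
    (void)pinfo;  // a real build_prior would inspect up_from etc. here
    out.push_back(o);
  }
  return out;
}
```

With max_osd set to 3 and an acting set of [3,0], as in the log, this sketch skips osd3 and only looks up osd0 instead of asserting.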
