Bug #3477 (closed)

mon: 'entity_inst_t OSDMap::get_inst(int)' fails on 'assert(is_up(osd))'

Added by Joao Eduardo Luis over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: High
Category: Monitor
Target version: -
% Done: 0%
Source: Community (user)

Description

Reported on the mailing list by Stefan Priebe:

http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/10470

  -22> 2012-11-11 22:45:10.941641 7f5d49168700  1 mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable now=2012-11-11 22:45:10.941643 lease_expire=2012-11-11 22:45:15.313577 has v1017 lc 11925
   -21> 2012-11-11 22:45:11.017199 7f5d49168700  1 -- 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 123 ==== paxos(logm lease_ack lc 14415 fc 13913 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (1926207536 0 0) 0x58fe840 con 0x2c38c60
   -20> 2012-11-11 22:45:11.084775 7f5d49168700  1 -- 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 124 ==== paxos(mdsmap lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (2037515134 0 0) 0x4e59080 con 0x2c38c60
   -19> 2012-11-11 22:45:11.084812 7f5d49168700  1 -- 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 125 ==== paxos(monmap lease_ack lc 1 fc 1 pn 0 opn 0 gv {}) v2 ==== 88+0+0 (3822314369 0 0) 0x4e58580 con 0x2c38c60
   -18> 2012-11-11 22:45:11.230970 7f5d48065700  2 -- 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789 pgs=16 cs=2 l=0).connect error 10.255.0.100:6789/0, 111: Connection refused
   -17> 2012-11-11 22:45:11.231004 7f5d48065700  2 -- 10.255.0.101:6789/0 >> 10.255.0.100:6789/0 pipe(0x2cc9b40 sd=17 :6789 pgs=16 cs=2 l=0).fault 111: Connection refused
   -16> 2012-11-11 22:45:11.238427 7f5d49168700  1 -- 10.255.0.101:6789/0 <== osd.33 10.255.0.102:6806/3537 1315 ==== pg_stats(478 pgs tid 185 v 1017) v1 ==== 158856+0+0 (1263639313 0 0) 0x2ca5b40 con 0x2ca4dc0
   -15> 2012-11-11 22:45:11.238451 7f5d49168700  1 mon.b@1(leader).paxos(pgmap active c 11425..11925) is_readable now=2012-11-11 22:45:11.238452 lease_expire=2012-11-11 22:45:15.313577 has v1017 lc 11925
   -14> 2012-11-11 22:45:11.287036 7f5d49969700  5 mon.b@1(leader).paxos(pgmap active c 11425..11925) propose_new_value 11926 311714 bytes, gv 28550
   -13> 2012-11-11 22:45:11.296084 7f5d49168700  1 -- 10.255.0.101:6789/0 <== osd.41 10.255.0.103:6800/3324 1086 ==== pg_stats(440 pgs tid 185 v 1017) v1 ==== 146276+0+0 (1903337195 0 0) 0x39bad80 con 0x2ca4c60
   -12> 2012-11-11 22:45:11.350697 7f5d49969700  1 -- 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap begin lc 11925 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e60b00
   -11> 2012-11-11 22:45:11.350790 7f5d49168700  1 mon.b@1(leader).paxos(pgmap updating c 11425..11925) is_readable now=2012-11-11 22:45:11.350793 lease_expire=2012-11-11 22:45:15.313577 has v1017 lc 11925
   -10> 2012-11-11 22:45:11.435570 7f5d49168700  1 -- 10.255.0.101:6789/0 <== mon.2 10.255.0.102:6789/0 126 ==== paxos(pgmap accept lc 11925 fc 0 pn 3501 opn 0 gv {}) v2 ==== 88+0+0 (3719346061 0 0) 0x4e60b00 con 0x2c38c60
    -9> 2012-11-11 22:45:11.476091 7f5d49168700  1 -- 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap commit lc 11926 fc 0 pn 3501 opn 0 gv {11926=28550}) v2 -- ?+0 0x4e58580
    -8> 2012-11-11 22:45:11.476125 7f5d49168700  1 -- 10.255.0.101:6789/0 --> mon.2 10.255.0.102:6789/0 -- paxos(pgmap lease lc 11926 fc 11425 pn 0 opn 0 gv {}) v2 -- ?+0 0x4e59080
    -7> 2012-11-11 22:45:11.593945 7f5d49168700  0 log [INF] : pgmap v11926: 7032 pgs: 5977 active+clean, 62 active+remapped+wait_backfill, 586 active+degraded+wait_backfill, 32 active+remapped+backfilling, 18 active+degraded+backfilling, 343 active+degraded+remapped+wait_backfill, 2 remapped+peering, 12 active+degraded+remapped+backfilling; 61211 MB data, 61662 MB used, 4185 GB / 4245 GB avail; 5419/31221 degraded (17.357%)
    -6> 2012-11-11 22:45:11.593954 7f5d49168700  1 -- 10.255.0.101:6789/0 --> mon.1 10.255.0.101:6789/0 -- log(1 entries) v1 -- ?+0 0x40c7200
    -5> 2012-11-11 22:45:11.593969 7f5d49168700  1 -- 10.255.0.101:6789/0 --> 10.255.0.103:6803/3461 -- pg_stats_ack(460 pgs tid 185) v1 -- ?+0 0x3ab5a80 con 0x3bd5420
    -4> 2012-11-11 22:45:11.594050 7f5d49168700  1 -- 10.255.0.101:6789/0 --> 10.255.0.102:6806/3537 -- pg_stats_ack(478 pgs tid 185) v1 -- ?+0 0x4f0b8c0 con 0x2ca4dc0
    -3> 2012-11-11 22:45:11.594359 7f5d49168700  1 mon.b@1(leader).paxos(pgmap active c 11426..11926) is_readable now=2012-11-11 22:45:11.594363 lease_expire=2012-11-11 22:45:16.476123 has v1017 lc 11926
    -2> 2012-11-11 22:45:11.594736 7f5d49168700  1 -- 10.255.0.101:6789/0 <== mon.1 10.255.0.101:6789/0 0 ==== log(1 entries) v1 ==== 0+0+0 (0 0 0) 0x40c7200 con 0x2c382c0
    -1> 2012-11-11 22:45:11.595139 7f5d49969700  1 mon.b@1(leader).osd e1017  we have enough reports/reporters to mark osd.53 down
     0> 2012-11-11 22:45:11.595601 7f5d49969700 -1 ./osd/OSDMap.h: In function 'entity_inst_t OSDMap::get_inst(int) const' thread 7f5d49969700 time 2012-11-11 22:45:11.595149
./osd/OSDMap.h: 345: FAILED assert(is_up(osd))

 ceph version 0.53-758-g26e5f2d (26e5f2d63f569b955bb07b50aa8f930ed9450bc4)
 1: (OSDMonitor::check_failure(utime_t, int, failure_info_t&)+0xc09) [0x4b14e9]
 2: (OSDMonitor::check_failures(utime_t)+0x3a) [0x4b16ca]
 3: (OSDMonitor::tick()+0xab) [0x4b213b]
 4: (Monitor::tick()+0x6d) [0x477fcd]
 5: (SafeTimer::timer_thread()+0x453) [0x58f203]
 6: (SafeTimerThread::entry()+0xd) [0x5913dd]
 7: (()+0x68ca) [0x7f5d4dbce8ca]
 8: (clone()+0x6d) [0x7f5d4c456bfd] 
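
For context, the assert in the backtrace lives in OSDMap::get_inst(), which may only be called for an OSD the map still considers up; the log above suggests check_failures() reached it for osd.53 after the map had already marked that OSD down. The helper is roughly of this shape (a simplified sketch for illustration, not the verbatim Ceph source):

  // Simplified sketch (illustrative, not the verbatim Ceph source).
  // Callers are expected to have already verified the OSD is up; if it
  // is not, the assert below aborts the monitor.
  entity_inst_t get_inst(int osd) const {
    assert(is_up(osd));   // OSDMap.h:345 -- the assert that failed
    return entity_inst_t(entity_name_t::OSD(osd), get_addr(osd));
  }
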
Actions #1

Updated by Sage Weil over 11 years ago

I think the fix is to make a clean_osd_failures() function that removes records for down OSDs, and call that from update_from_paxos() and on_active().
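
A minimal sketch of what such a helper could look like, assuming the OSDMonitor keeps its pending reports in a map<int, failure_info_t> keyed by OSD id (the member name failure_info and the loop below are assumptions for illustration, not the actual change in wip-3477):

  // Illustrative sketch only -- assumes a member like
  //   map<int, failure_info_t> failure_info;   // pending failure reports, by OSD id
  // Names are assumptions, not the actual wip-3477 patch.
  void OSDMonitor::clean_osd_failures()
  {
    map<int, failure_info_t>::iterator p = failure_info.begin();
    while (p != failure_info.end()) {
      if (!osdmap.is_up(p->first)) {
        // this OSD is already down in the current map; drop its stale
        // record so check_failure() never calls get_inst() on a down OSD
        failure_info.erase(p++);
      } else {
        ++p;
      }
    }
  }

Calling this from update_from_paxos() and on_active(), as suggested, would drop stale entries before the next check_failures() tick can reach them.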

Actions #2

Updated by Joao Eduardo Luis over 11 years ago

Sage pushed a couple of commits last night that should solve this issue; they can be found in wip-3477. I haven't tested them yet, nor have I been able to reproduce the crash.

Actions #3

Updated by Joao Eduardo Luis over 11 years ago

  • Status changed from New to 7
Actions #4

Updated by Stefan Priebe over 11 years ago

Can't reproduce the crash with this branch.

But when I trigger

ceph osd out 51; ceph osd out 52; ceph osd out 53; ceph osd out 54; sleep 20; ceph osd in 51; ceph osd in 52; ceph osd in 53; ceph osd in 54;

the log fills up with:
2012-11-13 20:58:06.780608 osd.53 [WRN] slow request 133.034698 seconds old, received at 2012-11-13 20:55:53.745877: osd_sub_op(client.28708.0:1688868 3.3ab 447263ab/rbd_data.702655e03efa.00000000000003e6/head//3 [] v 87'1933 snapset=0=[]:[] snapc=0=[]) v7 currently started
2012-11-13 20:58:06.780609 osd.53 [WRN] slow request 133.017665 seconds old, received at 2012-11-13 20:55:53.762910: osd_sub_op(client.28708.0:1688978 3.2f7 a54bf2f7/rbd_data.702655e03efa.000000000000099e/head//3 [] v 87'2204 snapset=0=[]:[] snapc=0=[]) v7 currently started
2012-11-13 20:58:06.780611 osd.53 [WRN] slow request 133.016105 seconds old, received at 2012-11-13 20:55:53.764470: osd_sub_op(client.28708.0:1688997 3.2b5 67ddc2b5/rbd_data.702655e03efa.0000000000000ab2/head//3 [] v 87'2158 snapset=0=[]:[] snapc=0=[]) v7 currently started
2012-11-13 20:58:06.780613 osd.53 [WRN] slow request 133.008344 seconds old, received at 2012-11-13 20:55:53.772231: osd_sub_op(client.28708.0:1689100 3.a64 f9c15a64/rbd_data.702655e03efa.0000000000001143/head//3 [] v 87'2036 snapset=0=[]:[] snapc=0=[]) v7 currently started
2012-11-13 20:58:07.780700 osd.53 [WRN] 82 slow requests, 5 included below; oldest blocked for > 140.734619 secs
2012-11-13 20:58:07.780703 osd.53 [WRN] slow request 134.007326 seconds old, received at 2012-11-13 20:55:53.773346: osd_sub_op(client.28708.0:1689114 3.cdf 56c89cdf/rbd_data.702655e03efa.0000000000001225/head//3 [] v 87'1764 snapset=0=[]:[] snapc=0=[]) v7 currently started
2012-11-13 20:58:07.780704 osd.53 [WRN] slow request 133.999470 seconds old, received at 2012-11-13 20:55:53.781202: osd_sub_op(client.28708.0:1689204 3.c14 fb991c14/rbd_data.702655e03efa.00000000000016a2/head//3 [] v 87'2224 snapset=0=[]:[] snapc=0=[]) v7 currently started
2012-11-13 20:58:07.780706 osd.53 [WRN] slow request 133.999322 seconds old, received at 2012-11-13 20:55:53.781350: osd_sub_op(client.28708.0:1689159 3.7fa 1dd8c7fa/rbd_data.702655e03efa.0000000000001478/head//3 [] v 87'6406 snapset=0=[]:[] snapc=0=[]) v7 currently started
2012-11-13 20:58:07.780707 osd.53 [WRN] slow request 133.992281 seconds old, received at 2012-11-13 20:55:53.788391: osd_sub_op(client.28708.0:1689314 3.c53 324f3c53/rbd_data.702655e03efa.0000000000001cb1/head//3 [] v 87'1867 snapset=0=[]:[] snapc=0=[]) v7 currently started
2012-11-13 20:58:07.780709 osd.53 [WRN] slow request 133.761128 seconds old, received at 2012-11-13 20:55:54.019544: osd_sub_op(client.28708.0:1689561 3.de5 b3b3fde5/rbd_data.702655e03efa.0000000000000cda/head//3 [] v 87'3475 snapset=0=[]:[] snapc=0=[]) v7 currently started

Is this OK?

Actions #5

Updated by Joao Eduardo Luis over 11 years ago

  • Status changed from 7 to Resolved

AFAICT, that's the OSDs reporting slow requests; it doesn't have anything to do directly with the monitor or with this bug.

I'm marking this resolved for now, as you haven't been able to trigger it with the new patches. If you hit it again, feel free to update the issue and/or poke us directly ;)
