Project

General

Profile

Actions

Bug #10367

closed

0.89 EC+ KV OSDs crashing

Added by Kenneth Waegeman over 9 years ago. Updated over 9 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

after issue http://tracker.ceph.com/issues/10119, I tried the same but with the latest version 0.89

Unfortunately, the EC KV osds started crashing after about 10 minutes, without stacktraces. Tried to restart them a few times but with same results.
Now, I put the keyvalue_debug level to 20/20 and stopped and started them all again. Within a minute, an osd goes down without stacktrace:

2014-12-18 12:22:58.670875 7f0c7d58a700 15 getattr 2.2ds0_head/857612ad/10000001c79.00000000/snapdir//2/18446744073709551615/0 'snapset'
2014-12-18 12:22:58.670894 7f0c7d58a700 20 lookup_strip_header failed to get strip_header cid 2.2ds0_head oid 857612ad/10000001c79.00000000/snapdir//2/18446744073709551615/0
2014-12-18 12:22:58.670896 7f0c7d58a700 10 getattr lookup_strip_header failed: r =-2
2014-12-18 12:22:58.670904 7f0c7d58a700 15 getattr 2.2ds0_head/857612ad/10000001c79.00000000/snapdir//2/18446744073709551615/0 '_'
2014-12-18 12:22:58.670924 7f0c7d58a700 20 lookup_strip_header failed to get strip_header cid 2.2ds0_head oid 857612ad/10000001c79.00000000/snapdir//2/18446744073709551615/0
2014-12-18 12:22:58.670927 7f0c7d58a700 10 getattr lookup_strip_header failed: r =-2
2014-12-18 12:22:58.670991 7f0c7d58a700 15 getattr 2.2ds0_head/94dde42d/10000002539.00000000/head//2/18446744073709551615/0 '_'
2014-12-18 12:22:58.671012 7f0c7d58a700 20 lookup_strip_header failed to get strip_header cid 2.2ds0_head oid 94dde42d/10000002539.00000000/head//2/18446744073709551615/0
2014-12-18 12:22:58.671017 7f0c7d58a700 10 getattr lookup_strip_header failed: r =-2
2014-12-18 12:22:58.671019 7f0c7d58a700 15 getattr 2.2ds0_head/94dde42d/10000002539.00000000/head//2/18446744073709551615/0 'snapset'
2014-12-18 12:22:58.671035 7f0c7d58a700 20 lookup_strip_header failed to get strip_header cid 2.2ds0_head oid 94dde42d/10000002539.00000000/head//2/18446744073709551615/0
2014-12-18 12:22:58.671037 7f0c7d58a700 10 getattr lookup_strip_header failed: r =-2
2014-12-18 12:22:58.671039 7f0c7d58a700 15 getattr 2.2ds0_head/94dde42d/10000002539.00000000/snapdir//2/18446744073709551615/0 'snapset'
2014-12-18 12:22:58.671056 7f0c7d58a700 20 lookup_strip_header failed to get strip_header cid 2.2ds0_head oid 94dde42d/10000002539.00000000/snapdir//2/18446744073709551615/0
2014-12-18 12:22:58.671058 7f0c7d58a700 10 getattr lookup_strip_header failed: r =-2
2014-12-18 12:22:58.671064 7f0c7d58a700 15 getattr 2.2ds0_head/94dde42d/10000002539.00000000/snapdir//2/18446744073709551615/0 '_'
2014-12-18 12:22:58.671084 7f0c7d58a700 20 lookup_strip_header failed to get strip_header cid 2.2ds0_head oid 94dde42d/10000002539.00000000/snapdir//2/18446744073709551615/0
2014-12-18 12:22:58.671086 7f0c7d58a700 10 getattr lookup_strip_header failed: r =-2
2014-12-18 12:22:58.671147 7f0c7d58a700 15 getattr 2.2ds0_head/108912ad/10000000bee.00000000/head//2/18446744073709551615/0 '_'
2014-12-18 12:22:58.671176 7f0c7d58a700 20 lookup_strip_header failed to get strip_header cid 2.2ds0_head oid 108912ad/10000000bee.00000000/head//2/18446744073709551615/0
2014-12-18 12:22:58.671182 7f0c7d58a700 10 getattr lookup_strip_header failed: r =-2
2014-12-18 12:22:58.671185 7f0c7d58a700 15 getattr 2.2ds0_head/108912ad/10000000bee.00000000/head//2/18446744073709551615/0 'snapset'
2014-12-18 12:22:58.671206 7f0c7d58a700 20 lookup_strip_header failed to get strip_header cid 2.2ds0_head oid 108912ad/10000000bee.00000000/head//2/18446744073709551615/0
2014-12-18 12:22:58.671210 7f0c7d58a700 10 getattr lookup_strip_header failed: r =-2
2014-12-18 12:22:58.671212 7f0c7d58a700 15 getattr 2.2ds0_head/108912ad/10000000bee.00000000/snapdir//2/18446744073709551615/0 'snapset'
2014-12-18 12:22:58.671232 7f0c7d58a700 20 lookup_strip_header failed to get strip_header cid 2.2ds0_head oid 108912ad/10000000bee.00000000/snapdir//2/18446744073709551615/0
2014-12-18 12:22:58.671235 7f0c7d58a700 10 getattr lookup_strip_header failed: r =-2
2014-12-18 12:22:58.671247 7f0c7d58a700 15 getattr 2.2ds0_head/108912ad/10000000bee.00000000/snapdir//2/18446744073709551615/0 '_'
2014-12-18 12:22:58.671265 7f0c7d58a700 20 lookup_strip_header failed to get strip_header cid 2.2ds0_head oid 108912ad/10000000bee.00000000/snapdir//2/18446744073709551615/0
2014-12-18 12:22:58.671267 7f0c7d58a700 10 getattr lookup_strip_header failed: r =-2
2014-12-18 12:22:58.671352 7f0c7fd8f700 15 getattr 2.2ds0_head/107f12ad/10000000448.0000078b/head//2/18446744073709551615/0 '_'
2014-12-18 12:22:58.671405 7f0c7fd8f700 20 lookup_strip_header failed to get strip_header cid 2.2ds0_head oid 107f12ad/10000000448.0000078b/head//2/18446744073709551615/0
2014-12-18 12:22:58.671410 7f0c7fd8f700 10 getattr lookup_strip_header failed: r =-2
2014-12-18 12:22:58.671413 7f0c7fd8f700 15 getattr 2.2ds0_head/107f12ad/10000000448.0000078b/head//2/18446744073709551615/0 'snapset'
2014-12-18 12:22:58.671435 7f0c7fd8f700 20 lookup_strip_header failed to get strip_header cid 2.2ds0_head oid 107f12ad/10000000448.0000078b/head//2/18446744073709551615/0
2014-12-18 12:22:58.671438 7f0c7fd8f700 10 getattr lookup_strip_header failed: r =-2
2014-12-18 12:22:58.671440 7f0c7fd8f700 15 getattr 2.2ds0_head/107f12ad/10000000448.0000078b/snapdir//2/18446744073709551615/0 'snapset'
2014-12-18 12:22:58.671469 7f0c7fd8f700 20 lookup_strip_header failed to get strip_header cid 2.2ds0_head oid 107f12ad/10000000448.0000078b/snapdir//2/18446744073709551615/0
2014-12-18 12:22:58.671472 7f0c7fd8f700 10 getattr lookup_strip_header failed: r =-2
2014-12-18 12:22:58.671486 7f0c7fd8f700 15 getattr 2.2ds0_head/107f12ad/10000000448.0000078b/snapdir//2/18446744073709551615/0 '_'
2014-12-18 12:22:58.671506 7f0c7fd8f700 20 lookup_strip_header failed to get strip_header cid 2.2ds0_head oid 107f12ad/10000000448.0000078b/snapdir//2/18446744073709551615/0
2014-12-18 12:22:58.671509 7f0c7fd8f700 10 getattr lookup_strip_header failed: r =-2
2014-12-18 12:22:58.671952 7f0c80590700 5 queue_transactions existing osr(2.2ds0 0x3ca8550)/0x3ca8550
2014-12-18 12:22:58.671959 7f0c80590700 5 queue_transactions (trailing journal)
2014-12-18 12:22:58.671962 7f0c80590700 5 queue_op 0x480fda0 seq 8 osr(2.2ds0 0x3ca8550) 989 bytes (queue has 1 ops and 989 bytes)
2014-12-18 12:22:58.671980 7f0c8db9e700 5 _do_op 0x480fda0 seq 8 osr(2.2ds0 0x3ca8550)/0x3ca8550 start
2014-12-18 12:22:58.671991 7f0c8db9e700 10 _do_transaction on 0x594e240
2014-12-18 12:22:58.672001 7f0c8db9e700 15 _omap_setkeys meta/16ef7597/infos/head//-1
2014-12-18 12:22:58.680367 7f0c78580700 10 stat 2.2ds0_head/54a0742d/100000022a3.00000000/head//2/18446744073709551615/0
2014-12-18 12:22:58.680416 7f0c78580700 20 lookup_strip_header failed to get strip_header cid 2.2ds0_head oid 54a0742d/100000022a3.00000000/head//2/18446744073709551615/0
2014-12-18 12:22:58.680422 7f0c78580700 10 stat 2.2ds0_head/54a0742d/100000022a3.00000000/head//2/18446744073709551615/0=-2
2014-12-18 12:22:58.685900 7f0c78580700 5 queue_transactions existing osr(2.2ds0 0x3ca8550)/0x3ca8550
2014-12-18 12:22:58.685907 7f0c78580700 5 queue_transactions (trailing journal)
2014-12-18 12:22:58.685909 7f0c78580700 5 queue_op 0x6e628e0 seq 9 osr(2.2ds0 0x3ca8550) 80703 bytes (queue has 2 ops and 81692 bytes)
2014-12-18 12:22:58.686974 7f0c78580700 10 stat 2.2ds0_head/87c3942d/100000037b5.00000000/head//2/18446744073709551615/0
2014-12-18 12:22:58.687011 7f0c78580700 20 lookup_strip_header failed to get strip_header cid 2.2ds0_head oid 87c3942d/100000037b5.00000000/head//2/18446744073709551615/0
2014-12-18 12:22:58.687016 7f0c78580700 10 stat 2.2ds0_head/87c3942d/100000037b5.00000000/head//2/18446744073709551615/0=-2
2014-12-18 12:22:58.687433 7f0c78580700 5 queue_transactions existing osr(2.2ds0 0x3ca8550)/0x3ca8550
2014-12-18 12:22:58.687439 7f0c78580700 5 queue_transactions (trailing journal)
2014-12-18 12:22:58.687440 7f0c78580700 5 queue_op 0x6e61e00 seq 10 osr(2.2ds0 0x3ca8550) 3331 bytes (queue has 3 ops and 85023 bytes)
2014-12-18 12:22:58.687713 7f0c78580700 10 stat 2.2ds0_head/f434e42d/10000003655.00000000/head//2/18446744073709551615/0
2014-12-18 12:22:58.687744 7f0c78580700 20 lookup_strip_header failed to get strip_header cid 2.2ds0_head oid f434e42d/10000003655.00000000/head//2/18446744073709551615/0
2014-12-18 12:22:58.687749 7f0c78580700 10 stat 2.2ds0_head/f434e42d/10000003655.00000000/head//2/18446744073709551615/0=-2

But right after that osd, other ones start crashing to, and I've found a trace:

45> 2014-12-18 12:23:05.011043 7f5d6057b700  5 - op tracker -- seq: 836, time: 2014-12-18 12:23:05.010989, event: header_read, op: osd_sub_op(unknown.0.0:0 2.43s7 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
44> 2014-12-18 12:23:05.011048 7f5d6057b700 5 - op tracker -- seq: 836, time: 2014-12-18 12:23:05.010990, event: throttled, op: osd_sub_op(unknown.0.0:0 2.43s7 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
43> 2014-12-18 12:23:05.011055 7f5d6057b700 5 - op tracker -- seq: 836, time: 2014-12-18 12:23:05.011020, event: all_read, op: osd_sub_op(unknown.0.0:0 2.43s7 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
42> 2014-12-18 12:23:05.011061 7f5d6057b700 5 - op tracker -- seq: 836, time: 0.000000, event: dispatched, op: osd_sub_op(unknown.0.0:0 2.43s7 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
41> 2014-12-18 12:23:05.011098 7f5d69291700 5 - op tracker -- seq: 836, time: 2014-12-18 12:23:05.011098, event: reached_pg, op: osd_sub_op(unknown.0.0:0 2.43s7 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
40> 2014-12-18 12:23:05.011118 7f5d69291700 5 - op tracker -- seq: 836, time: 2014-12-18 12:23:05.011118, event: started, op: osd_sub_op(unknown.0.0:0 2.43s7 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
39> 2014-12-18 12:23:05.011129 7f5d69291700 5 - op tracker -- seq: 836, time: 2014-12-18 12:23:05.011129, event: done, op: osd_sub_op(unknown.0.0:0 2.43s7 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
38> 2014-12-18 12:23:05.014391 7f5d5e663700 1 - 10.143.8.181:6822/4567 <== osd.13 10.143.8.180:6816/64760 48 ==== osd_sub_op(unknown.0.0:0 2.29s4 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[]) v11 ==== 1202+0+0 (313601380
4 0 0) 0x7f95800 con 0x49065c0
37> 2014-12-18 12:23:05.014416 7f5d5e663700 5 - op tracker -- seq: 837, time: 2014-12-18 12:23:05.014306, event: header_read, op: osd_sub_op(unknown.0.0:0 2.29s4 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
36> 2014-12-18 12:23:05.014424 7f5d5e663700 5 - op tracker -- seq: 837, time: 2014-12-18 12:23:05.014308, event: throttled, op: osd_sub_op(unknown.0.0:0 2.29s4 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
35> 2014-12-18 12:23:05.014430 7f5d5e663700 5 - op tracker -- seq: 837, time: 2014-12-18 12:23:05.014386, event: all_read, op: osd_sub_op(unknown.0.0:0 2.29s4 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
34> 2014-12-18 12:23:05.014436 7f5d5e663700 5 - op tracker -- seq: 837, time: 0.000000, event: dispatched, op: osd_sub_op(unknown.0.0:0 2.29s4 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
33> 2014-12-18 12:23:05.014493 7f5d6728d700 5 - op tracker -- seq: 837, time: 2014-12-18 12:23:05.014493, event: reached_pg, op: osd_sub_op(unknown.0.0:0 2.29s4 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
32> 2014-12-18 12:23:05.014511 7f5d6728d700 5 - op tracker -- seq: 837, time: 2014-12-18 12:23:05.014511, event: started, op: osd_sub_op(unknown.0.0:0 2.29s4 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
31> 2014-12-18 12:23:05.014524 7f5d6728d700 1 - 10.143.8.181:6822/4567 --> 10.143.8.180:6816/64760 -- osd_sub_op_reply(unknown.0.0:0 2.29s0 0//0//-1 [scrub-reserve] ack, result = 0) v2 -- ?+1 0x406e300 con 0x49065c0
30> 2014-12-18 12:23:05.014543 7f5d6728d700 5 - op tracker -- seq: 837, time: 2014-12-18 12:23:05.014542, event: done, op: osd_sub_op(unknown.0.0:0 2.29s4 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
29> 2014-12-18 12:23:05.022730 7f5d5e663700 1 - 10.143.8.181:6822/4567 <== osd.13 10.143.8.180:6816/64760 49 ==== osd_sub_op(unknown.0.0:0 2.29s4 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[]) v11 ==== 1202+0+0 (2268516
487 0 0) 0x7f96e00 con 0x49065c0
28> 2014-12-18 12:23:05.022750 7f5d5e663700 5 - op tracker -- seq: 838, time: 2014-12-18 12:23:05.022671, event: header_read, op: osd_sub_op(unknown.0.0:0 2.29s4 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
27> 2014-12-18 12:23:05.022764 7f5d5e663700 5 - op tracker -- seq: 838, time: 2014-12-18 12:23:05.022672, event: throttled, op: osd_sub_op(unknown.0.0:0 2.29s4 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
26> 2014-12-18 12:23:05.022773 7f5d5e663700 5 - op tracker -- seq: 838, time: 2014-12-18 12:23:05.022725, event: all_read, op: osd_sub_op(unknown.0.0:0 2.29s4 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
25> 2014-12-18 12:23:05.022784 7f5d5e663700 5 - op tracker -- seq: 838, time: 0.000000, event: dispatched, op: osd_sub_op(unknown.0.0:0 2.29s4 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
24> 2014-12-18 12:23:05.022820 7f5d69a92700 5 - op tracker -- seq: 838, time: 2014-12-18 12:23:05.022820, event: reached_pg, op: osd_sub_op(unknown.0.0:0 2.29s4 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
23> 2014-12-18 12:23:05.022841 7f5d69a92700 5 - op tracker -- seq: 838, time: 2014-12-18 12:23:05.022840, event: started, op: osd_sub_op(unknown.0.0:0 2.29s4 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
22> 2014-12-18 12:23:05.022853 7f5d69a92700 5 - op tracker -- seq: 838, time: 2014-12-18 12:23:05.022853, event: done, op: osd_sub_op(unknown.0.0:0 2.29s4 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
21> 2014-12-18 12:23:05.181010 7f5d5e663700 1 - 10.143.8.181:6822/4567 <== osd.13 10.143.8.180:6816/64760 50 ==== MOSDECSubOpRead(2.24s4 408 ECSubRead(tid=43, to_read={e2ffe9a4/10000000447.00000229/head//2=0,839072}, attrs_to_read=))
v1 ==== 140+0+0 (822879689 0 0) 0x7a6b180 con 0x49065c0
20> 2014-12-18 12:23:05.181035 7f5d5e663700 5 - op tracker -- seq: 839, time: 2014-12-18 12:23:05.180962, event: header_read, op: MOSDECSubOpRead(2.24s4 408 ECSubRead(tid=43, to_read={e2ffe9a4/10000000447.00000229/head//2=0,839072},
attrs_to_read=))
19> 2014-12-18 12:23:05.181041 7f5d5e663700 5 - op tracker -- seq: 839, time: 2014-12-18 12:23:05.180964, event: throttled, op: MOSDECSubOpRead(2.24s4 408 ECSubRead(tid=43, to_read={e2ffe9a4/10000000447.00000229/head//2=0,839072}, at
trs_to_read=))
18> 2014-12-18 12:23:05.181046 7f5d5e663700 5 - op tracker -- seq: 839, time: 2014-12-18 12:23:05.181005, event: all_read, op: MOSDECSubOpRead(2.24s4 408 ECSubRead(tid=43, to_read={e2ffe9a4/10000000447.00000229/head//2=0,839072}, att
rs_to_read=))
17> 2014-12-18 12:23:05.181050 7f5d5e663700 5 - op tracker -- seq: 839, time: 0.000000, event: dispatched, op: MOSDECSubOpRead(2.24s4 408 ECSubRead(tid=43, to_read={e2ffe9a4/10000000447.00000229/head//2=0,839072}, attrs_to_read=))
16> 2014-12-18 12:23:05.181149 7f5d6728d700 5 - op tracker -- seq: 839, time: 2014-12-18 12:23:05.181149, event: reached_pg, op: MOSDECSubOpRead(2.24s4 408 ECSubRead(tid=43, to_read={e2ffe9a4/10000000447.00000229/head//2=0,839072}, a
ttrs_to_read=))
15> 2014-12-18 12:23:05.181171 7f5d6728d700 15 read 2.24s4_head/e2ffe9a4/10000000447.00000229/head//2/18446744073709551615/4 0~839072
-14> 2014-12-18 12:23:05.181256 7f5d6728d700 10 lookup_strip_header done cid 2.24s4_head oid e2ffe9a4/10000000447.00000229/head//2/18446744073709551615/4
-13> 2014-12-18 12:23:05.181265 7f5d6728d700 10 file_to_extents done
-12> 2014-12-18 12:23:05.183213 7f5d6728d700 10 _generic_read 2.24s4_head/e2ffe9a4/10000000447.00000229/head//2/18446744073709551615/4 0~419744/419744 r = 0
-11> 2014-12-18 12:23:05.183292 7f5d6728d700 1 -
10.143.8.181:6822/4567 --> 10.143.8.180:6816/64760 -- MOSDECSubOpReadReply(2.24s0 408 ECSubReadReply(tid=43, attrs_read=0)) v1 -- ?+0 0x6ba2a80 con 0x49065c0
10> 2014-12-18 12:23:05.183311 7f5d6728d700 5 - op tracker -- seq: 839, time: 2014-12-18 12:23:05.183311, event: done, op: MOSDECSubOpRead(2.24s4 408 ECSubRead(tid=43, to_read={e2ffe9a4/10000000447.00000229/head//2=0,839072}, attrs_t
o_read=))
9> 2014-12-18 12:23:05.186959 7f5d5bc39700 1 - 10.143.8.181:6822/4567 <== osd.11 10.143.8.180:6830/1237 18 ==== osd_sub_op_reply(unknown.0.0:0 2.2cs0 0//0//2 [scrub-reserve] ack, result = 0) v2 ==== 176+0+1 (1905698453 0 4067132163)
0x406f640 con 0x4a591e0
8> 2014-12-18 12:23:05.186983 7f5d5bc39700 5 - op tracker -- seq: 840, time: 2014-12-18 12:23:05.186921, event: header_read, op: osd_sub_op_reply(unknown.0.0:0 2.2cs0 0//0//2 [scrub-reserve] ack, result = 0)
7> 2014-12-18 12:23:05.186990 7f5d5bc39700 5 - op tracker -- seq: 840, time: 2014-12-18 12:23:05.186923, event: throttled, op: osd_sub_op_reply(unknown.0.0:0 2.2cs0 0//0//2 [scrub-reserve] ack, result = 0)
6> 2014-12-18 12:23:05.186998 7f5d5bc39700 5 - op tracker -- seq: 840, time: 2014-12-18 12:23:05.186955, event: all_read, op: osd_sub_op_reply(unknown.0.0:0 2.2cs0 0//0//2 [scrub-reserve] ack, result = 0)
5> 2014-12-18 12:23:05.187006 7f5d5bc39700 5 - op tracker -- seq: 840, time: 0.000000, event: dispatched, op: osd_sub_op_reply(unknown.0.0:0 2.2cs0 0//0//2 [scrub-reserve] ack, result = 0)
4> 2014-12-18 12:23:05.187066 7f5d65a8a700 5 - op tracker -- seq: 840, time: 2014-12-18 12:23:05.187066, event: reached_pg, op: osd_sub_op_reply(unknown.0.0:0 2.2cs0 0//0//2 [scrub-reserve] ack, result = 0)
3> 2014-12-18 12:23:05.187092 7f5d65a8a700 5 - op tracker -- seq: 840, time: 2014-12-18 12:23:05.187092, event: started, op: osd_sub_op_reply(unknown.0.0:0 2.2cs0 0//0//2 [scrub-reserve] ack, result = 0)
2> 2014-12-18 12:23:05.187108 7f5d65a8a700  5 - op tracker -- seq: 840, time: 2014-12-18 12:23:05.187108, event: done, op: osd_sub_op_reply(unknown.0.0:0 2.2cs0 0//0//2 [scrub-reserve] ack, result = 0)
-1> 2014-12-18 12:23:05.187154 7f5d64a88700 10 collection_list_partial 2.2cs0_head start:0//0//-1/0/0 is_max:0
0> 2014-12-18 12:23:05.189744 7f5d64a88700 -1 os/GenericObjectMap.cc: In function 'int GenericObjectMap::list_objects(const coll_t&, ghobject_t, int, std::vector&lt;ghobject_t&gt;*, ghobject_t*)' thread 7f5d64a88700 time 2014-12-18 12:23:05.187287
os/GenericObjectMap.cc: 1098: FAILED assert(start <= header.oid)
ceph version 0.89 (68fdc0f68e6a04e283d2c5140832a3175b4f9840)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xb91845]
2: (GenericObjectMap::list_objects(coll_t const&, ghobject_t, int, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >, ghobject_t)+0x47c) [0xa6815c]
3: (KeyValueStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >, ghobject_t)+0x274) [0x92a4f4]
4: (PGBackend::objects_list_partial(hobject_t const&, int, int, snapid_t, std::vector&lt;hobject_t, std::allocator&lt;hobject_t&gt; >, hobject_t)+0x1b9) [0x8c6709]
5: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x970) [0x7ce3a0]
6: (PG::scrub(ThreadPool::TPHandle&)+0x316) [0x7cf416]
7: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x29) [0x6b09f9]
8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa66) [0xb82576]
9: (ThreadPool::WorkThread::entry()+0x10) [0xb83600]
10: (()+0x7df3) [0x7f5d81e17df3]
11: (clone()+0x6d) [0x7f5d808de01d]
NOTE: a copy of the executable, or `objdump rdS &lt;executable&gt;` is needed to interpret this.
...
20/20 keyvaluestore
log_file /var/log/ceph/ceph-osd.24.log
--
end dump of recent events ---
2014-12-18 12:23:05.214575 7f5d69a92700 15 read 2.24s4_head/7ca2f9a4/10000000447.00000a1e/head//2/18446744073709551615/4 0~839072
2014-12-18 12:23:05.287210 7f5d64a88700 -1 ** Caught signal (Aborted) *
in thread 7f5d64a88700
ceph version 0.89 (68fdc0f68e6a04e283d2c5140832a3175b4f9840)
1: /usr/bin/ceph-osd() [0xa9fab2]
2: (()+0xf130) [0x7f5d81e1f130]
3: (gsignal()+0x39) [0x7f5d8081d5c9]
4: (abort()+0x148) [0x7f5d8081ecd8]
5: (_gnu_cxx::_verbose_terminate_handler()+0x165) [0x7f5d811309d5]
6: (()+0x5e946) [0x7f5d8112e946]
7: (()+0x5e973) [0x7f5d8112e973]
8: (()+0x5eb9f) [0x7f5d8112eb9f]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xb91a3a]
10: (GenericObjectMap::list_objects(coll_t const&, ghobject_t, int, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >, ghobject_t)+0x47c) [0xa6815c]
11: (KeyValueStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector&lt;ghobject_t, std::allocator&lt;ghobject_t&gt; >, ghobject_t)+0x274) [0x92a4f4]
12: (PGBackend::objects_list_partial(hobject_t const&, int, int, snapid_t, std::vector&lt;hobject_t, std::allocator&lt;hobject_t&gt; >, hobject_t)+0x1b9) [0x8c6709]
13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x970) [0x7ce3a0]
14: (PG::scrub(ThreadPool::TPHandle&)+0x316) [0x7cf416]
15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x29) [0x6b09f9]
16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa66) [0xb82576]
17: (ThreadPool::WorkThread::entry()+0x10) [0xb83600]
18: (()+0x7df3) [0x7f5d81e17df3]
19: (clone()+0x6d) [0x7f5d808de01d]
NOTE: a copy of the executable, or `objdump -rdS &lt;executable&gt;` is needed to interpret this.

--- begin dump of recent events ---
11> 2014-12-18 12:23:05.214440 7f5d5e663700 1 - 10.143.8.181:6822/4567 <== osd.13 10.143.8.180:6816/64760 51 ==== MOSDECSubOpRead(2.24s4 408 ECSubRead(tid=44, to_read={7ca2f9a4/10000000447.00000a1e/head//2=0,839072,841a3da4/100000011ca.00000000/head//2=0,839072}, attrs_to_read=)) v1 ==== 219+0+0 (1216366573 0 0) 0x7a686c0 con 0x49065c0
10> 2014-12-18 12:23:05.214482 7f5d5e663700 5 - op tracker -- seq: 841, time: 2014-12-18 12:23:05.214395, event: header_read, op: MOSDECSubOpRead(2.24s4 408 ECSubRead(tid=44, to_read={7ca2f9a4/10000000447.00000a1e/head//2=0,839072,841a3da4/100000011ca.00000000/head//2=0,839072}, attrs_to_read=))
9> 2014-12-18 12:23:05.214491 7f5d5e663700 5 - op tracker -- seq: 841, time: 2014-12-18 12:23:05.214396, event: throttled, op: MOSDECSubOpRead(2.24s4 408 ECSubRead(tid=44, to_read={7ca2f9a4/10000000447.00000a1e/head//2=0,839072,841a3da4/100000011ca.00000000/head//2=0,839072}, attrs_to_read=))
8> 2014-12-18 12:23:05.214498 7f5d5e663700 5 - op tracker -- seq: 841, time: 2014-12-18 12:23:05.214433, event: all_read, op: MOSDECSubOpRead(2.24s4 408 ECSubRead(tid=44, to_read={7ca2f9a4/10000000447.00000a1e/head//2=0,839072,841a3da4/100000011ca.00000000/head//2=0,839072}, attrs_to_read=))
7> 2014-12-18 12:23:05.214510 7f5d5e663700 5 - op tracker -- seq: 841, time: 0.000000, event: dispatched, op: MOSDECSubOpRead(2.24s4 408 ECSubRead(tid=44, to_read={7ca2f9a4/10000000447.00000a1e/head//2=0,839072,841a3da4/100000011ca.00000000/head//2=0,839072}, attrs_to_read=))
6> 2014-12-18 12:23:05.214555 7f5d69a92700 5 - op tracker -- seq: 841, time: 2014-12-18 12:23:05.214554, event: reached_pg, op: MOSDECSubOpRead(2.24s4 408 ECSubRead(tid=44, to_read={7ca2f9a4/10000000447.00000a1e/head//2=0,839072,841a3da4/100000011ca.00000000/head//2=0,839072}, attrs_to_read=))
5> 2014-12-18 12:23:05.214575 7f5d69a92700 15 read 2.24s4_head/7ca2f9a4/10000000447.00000a1e/head//2/18446744073709551615/4 0~839072
-4> 2014-12-18 12:23:05.222838 7f5d6ea9c700 1 -
10.143.8.181:6823/4567 <== osd.29 10.141.8.181:0/4886 6 ==== osd_ping(ping e408 stamp 2014-12-18 12:23:05.222328) v2 ==== 47+0+0 (903585933 0 0) 0x4278f00 con 0x5cd0840
3> 2014-12-18 12:23:05.222865 7f5d6ea9c700 1 - 10.143.8.181:6823/4567 --> 10.141.8.181:0/4886 -- osd_ping(ping_reply e408 stamp 2014-12-18 12:23:05.222328) v2 -- ?+0 0x58523a0 con 0x5cd0840
2> 2014-12-18 12:23:05.223259 7f5d7029f700 1 - 10.141.8.181:6823/4567 <== osd.29 10.141.8.181:0/4886 6 ==== osd_ping(ping e408 stamp 2014-12-18 12:23:05.222328) v2 ==== 47+0+0 (903585933 0 0) 0x630bc00 con 0x5206e00
1> 2014-12-18 12:23:05.223284 7f5d7029f700 1 - 10.141.8.181:6823/4567 --> 10.141.8.181:0/4886 -- osd_ping(ping_reply e408 stamp 2014-12-18 12:23:05.222328) v2 -- ?+0 0x427f440 con 0x5206e00
0> 2014-12-18 12:23:05.287210 7f5d64a88700 -1 ** Caught signal (Aborted) *
in thread 7f5d64a88700
ceph version 0.89 (68fdc0f68e6a04e283d2c5140832a3175b4f9840)
1: /usr/bin/ceph-osd() [0xa9fab2]
2: (()+0xf130) [0x7f5d81e1f130]
3: (gsignal()+0x39) [0x7f5d8081d5c9]
4: (abort()+0x148) [0x7f5d8081ecd8]
5: (_gnu_cxx::_verbose_terminate_handler()+0x165) [0x7f5d811309d5]
6: (()+0x5e946) [0x7f5d8112e946]
7: (()+0x5e973) [0x7f5d8112e973]
8: (()+0x5eb9f) [0x7f5d8112eb9f]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xb91a3a]
10: (GenericObjectMap::list_objects(coll_t const&, ghobject_t, int, std::vector<ghobject_t, std::allocator<ghobject_t> >, ghobject_t)+0x47c) [0xa6815c]
11: (KeyValueStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >, ghobject_t)+0x274) [0x92a4f4]
12: (PGBackend::objects_list_partial(hobject_t const&, int, int, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >, hobject_t)+0x1b9) [0x8c6709]
13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x970) [0x7ce3a0]
14: (PG::scrub(ThreadPool::TPHandle&)+0x316) [0x7cf416]
15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x29) [0x6b09f9]
16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa66) [0xb82576]
17: (ThreadPool::WorkThread::entry()+0x10) [0xb83600]
18: (()+0x7df3) [0x7f5d81e17df3]
19: (clone()+0x6d) [0x7f5d808de01d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

What could here be the problem?

Thanks!

Actions #1

Updated by Haomai Wang over 9 years ago

Hi, thanks for your report.

I checked your crash OSD commit version(68fdc0f68e6a04e283d2c5140832a3175b4f9840) and found that it doesn't contain my fix (#10119). It should be in the next dev release.

Actions #2

Updated by Haomai Wang over 9 years ago

  • Status changed from New to Rejected
Actions #3

Updated by Kenneth Waegeman over 9 years ago

Can I check if that fix is in the latest release ?

Retested it with the new 0.90 release. The OSDs are again crashing after about 10 minutes. There are no stacktraces in the log, even with debuglevel 20/20.
Is there a way to trigger the stacktraces ?

The output of the osds before they crash:

2014-12-23 15:39:06.146017 7f3c8d2c4700 15 getattr 2.2s0_head/64771b02/100000077e3.00000000/snapdir//2/18446744073709551615/0 '_'
2014-12-23 15:39:06.146042 7f3c8d2c4700 20 lookup_strip_header failed to get strip_header  cid 2.2s0_head oid 64771b02/100000077e3.00000000/snapdir//2/18446744073709551615/0
2014-12-23 15:39:06.146045 7f3c8d2c4700 10 getattr lookup_strip_header failed: r =-2
2014-12-23 15:39:06.146117 7f3c8fac9700 15 getattr 2.2s0_head/17e6f502/10000007d9c.00000000/head//2/18446744073709551615/0 '_'
2014-12-23 15:39:06.146189 7f3c8fac9700 20 lookup_strip_header failed to get strip_header  cid 2.2s0_head oid 17e6f502/10000007d9c.00000000/head//2/18446744073709551615/0
2014-12-23 15:39:06.146196 7f3c8fac9700 10 getattr lookup_strip_header failed: r =-2
2014-12-23 15:39:06.146199 7f3c8fac9700 15 getattr 2.2s0_head/17e6f502/10000007d9c.00000000/snapdir//2/18446744073709551615/0 '_'
2014-12-23 15:39:06.146229 7f3c8fac9700 20 lookup_strip_header failed to get strip_header  cid 2.2s0_head oid 17e6f502/10000007d9c.00000000/snapdir//2/18446744073709551615/0
2014-12-23 15:39:06.146235 7f3c8fac9700 10 getattr lookup_strip_header failed: r =-2
2014-12-23 15:39:06.146275 7f3c8d2c4700 15 getattr 2.2s0_head/64771b02/100000077e3.00000000/head//2/18446744073709551615/0 '_'
2014-12-23 15:39:06.146303 7f3c8d2c4700 20 lookup_strip_header failed to get strip_header  cid 2.2s0_head oid 64771b02/100000077e3.00000000/head//2/18446744073709551615/0
2014-12-23 15:39:06.146307 7f3c8d2c4700 10 getattr lookup_strip_header failed: r =-2
2014-12-23 15:39:06.146342 7f3c8fac9700 15 getattr 2.2s0_head/17e6f502/10000007d9c.00000000/head//2/18446744073709551615/0 '_'
2014-12-23 15:39:06.146381 7f3c8fac9700 20 lookup_strip_header failed to get strip_header  cid 2.2s0_head oid 17e6f502/10000007d9c.00000000/head//2/18446744073709551615/0
2014-12-23 15:39:06.146386 7f3c8fac9700 10 getattr lookup_strip_header failed: r =-2
2014-12-23 15:39:06.146421 7f3c8d2c4700 15 getattr 2.2s0_head/4daecb02/10000007c6c.00000000/head//2/18446744073709551615/0 '_'
2014-12-23 15:39:06.146449 7f3c8d2c4700 20 lookup_strip_header failed to get strip_header  cid 2.2s0_head oid 4daecb02/10000007c6c.00000000/head//2/18446744073709551615/0
2014-12-23 15:39:06.146456 7f3c8d2c4700 10 getattr lookup_strip_header failed: r =-2
2014-12-23 15:39:06.146458 7f3c8d2c4700 15 getattr 2.2s0_head/4daecb02/10000007c6c.00000000/snapdir//2/18446744073709551615/0 '_'
2014-12-23 15:39:06.146481 7f3c8d2c4700 20 lookup_strip_header failed to get strip_header  cid 2.2s0_head oid 4daecb02/10000007c6c.00000000/snapdir//2/18446744073709551615/0
2014-12-23 15:39:06.146483 7f3c8d2c4700 10 getattr lookup_strip_header failed: r =-2
2014-12-23 15:39:06.146520 7f3c8fac9700 15 getattr 2.2s0_head/a4e4c802/10000008006.00000000/head//2/18446744073709551615/0 '_'
2014-12-23 15:39:06.146556 7f3c8fac9700 20 lookup_strip_header failed to get strip_header  cid 2.2s0_head oid a4e4c802/10000008006.00000000/head//2/18446744073709551615/0
2014-12-23 15:39:06.146561 7f3c8fac9700 10 getattr lookup_strip_header failed: r =-2
2014-12-23 15:39:06.146563 7f3c8fac9700 15 getattr 2.2s0_head/a4e4c802/10000008006.00000000/snapdir//2/18446744073709551615/0 '_'
2014-12-23 15:39:06.146592 7f3c8fac9700 20 lookup_strip_header failed to get strip_header  cid 2.2s0_head oid a4e4c802/10000008006.00000000/snapdir//2/18446744073709551615/0
2014-12-23 15:39:06.146594 7f3c8fac9700 10 getattr lookup_strip_header failed: r =-2
2014-12-23 15:39:06.146629 7f3c8d2c4700 15 getattr 2.2s0_head/4daecb02/10000007c6c.00000000/head//2/18446744073709551615/0 '_'
2014-12-23 15:39:06.146658 7f3c8d2c4700 20 lookup_strip_header failed to get strip_header  cid 2.2s0_head oid 4daecb02/10000007c6c.00000000/head//2/18446744073709551615/0
2014-12-23 15:39:06.146663 7f3c8d2c4700 10 getattr lookup_strip_header failed: r =-2
2014-12-23 15:39:06.146694 7f3c8fac9700 15 getattr 2.2s0_head/a4e4c802/10000008006.00000000/head//2/18446744073709551615/0 '_'
2014-12-23 15:39:06.146724 7f3c8fac9700 20 lookup_strip_header failed to get strip_header  cid 2.2s0_head oid a4e4c802/10000008006.00000000/head//2/18446744073709551615/0
2014-12-23 15:39:06.146726 7f3c8fac9700 10 getattr lookup_strip_header failed: r =-2
2014-12-23 15:39:06.146760 7f3c8d2c4700 15 getattr 2.2s0_head/f5371c02/10000008207.00000000/head//2/18446744073709551615/0 '_'
2014-12-23 15:39:06.146790 7f3c8d2c4700 20 lookup_strip_header failed to get strip_header  cid 2.2s0_head oid f5371c02/10000008207.00000000/head//2/18446744073709551615/0
2014-12-23 15:39:06.146796 7f3c8d2c4700 10 getattr lookup_strip_header failed: r =-2
2014-12-23 15:39:06.146798 7f3c8d2c4700 15 getattr 2.2s0_head/f5371c02/10000008207.00000000/snapdir//2/18446744073709551615/0 '_'
2014-12-23 15:39:06.146819 7f3c8d2c4700 20 lookup_strip_header failed to get strip_header  cid 2.2s0_head oid f5371c02/10000008207.00000000/snapdir//2/18446744073709551615/0
2014-12-23 15:39:06.146822 7f3c8d2c4700 10 getattr lookup_strip_header failed: r =-2
2014-12-23 15:39:06.146858 7f3c8fac9700 15 getattr 2.2s0_head/f5371c02/10000008207.00000000/head//2/18446744073709551615/0 '_'
2014-12-23 15:39:06.146896 7f3c8fac9700 20 lookup_strip_header failed to get strip_header  cid 2.2s0_head oid f5371c02/10000008207.00000000/head//2/18446744073709551615/0
2014-12-23 15:39:06.146902 7f3c8fac9700 10 getattr lookup_strip_header failed: r =-2
2014-12-23 15:39:06.146942 7f3c8d2c4700 15 getattr 2.2s0_head/1b75fc02/1000000834d.00000000/head//2/18446744073709551615/0 '_'
2014-12-23 15:39:06.146969 7f3c8d2c4700 20 lookup_strip_header failed to get strip_header  cid 2.2s0_head oid 1b75fc02/1000000834d.00000000/head//2/18446744073709551615/0
2014-12-23 15:39:06.146973 7f3c8d2c4700 10 getattr lookup_strip_header failed: r =-2
2014-12-23 15:39:06.146975 7f3c8d2c4700 15 getattr 2.2s0_head/1b75fc02/1000000834d.00000000/snapdir//2/18446744073709551615/0 '_'
2014-12-23 15:39:06.146996 7f3c8d2c4700 20 lookup_strip_header failed to get strip_header  cid 2.2s0_head oid 1b75fc02/1000000834d.00000000/snapdir//2/18446744073709551615/0
2014-12-23 15:39:06.146998 7f3c8d2c4700 10 getattr lookup_strip_header failed: r =-2
2014-12-23 15:39:06.147031 7f3c8fac9700 15 getattr 2.2s0_head/1b75fc02/1000000834d.00000000/head//2/18446744073709551615/0 '_'
2014-12-23 15:39:06.147064 7f3c8fac9700 20 lookup_strip_header failed to get strip_header  cid 2.2s0_head oid 1b75fc02/1000000834d.00000000/head//2/18446744073709551615/0
2014-12-23 15:39:06.147070 7f3c8fac9700 10 getattr lookup_strip_header failed: r =-2
2014-12-23 15:39:06.147224 7f3c890b6700 10 stat 2.2s0_head/cbb27202/10000002011.00000000/head//2/18446744073709551615/0

Actions #4

Updated by Haomai Wang over 9 years ago

Do you rebuild your new cluster? It's a incompatible change for 0.90 and I test EC+KeyvalueStore locally successfully.

Actions #5

Updated by Kenneth Waegeman over 9 years ago

Yes, I reinstalled the whole cluster..

Actions

Also available in: Atom PDF