Bug #6476
closedlost around 30 OSDs (of 75) at once
0%
Description
Our dumpling cluster (0.67.3) with 64 SATA and 10 SSD OSDs on 10 servers suddenly experienced a crash of the majority of OSDs
I looked at two logfiles, and they had the same crash:
101> 2013-10-04 15:40:00.702072 7f2598a5b700 10 monclient: _send_mon_message to mon.s2 at [2001:620:0:6::108]:6789/0
-100> 2013-10-04 15:40:00.702086 7f2598a5b700 1 - [2001:620:0:6::10c]:6816/5598 --> [2001:620:0:6::108]:6789/0 -- osd_pgtemp(e83264 {0.9b=[60,20,37],0.b11=[60,29,34],0.c2c=[60,16,18,35],11.b06=[60,34]} v83264) v1 -- ?+0 0x1f3425a0 con 0x20518f20
99> 2013-10-04 15:40:00.702213 7f259925c700 5 osd.60 pg_epoch: 83264 pg[1.192( v 27209'206 (0'0,27209'206] local-les=83242 n=1 ec=1 les/c 83242/83242 83264/83264/79540) [8,6,60] r=2 lpr=83264 pi=33428-83263/117 lcod 0'0 inactive NOTIFY] exit Reset 0.000234 1 0.000073
-98> 2013-10-04 15:40:00.702230 7f259925c700 5 osd.60 pg_epoch: 83264 pg[1.192( v 27209'206 (0'0,27209'206] local-les=83242 n=1 ec=1 les/c 83242/83242 83264/83264/79540) [8,6,60] r=2 lpr=83264 pi=33428-83263/117 lcod 0'0 inactive NOTIFY] enter Started
-97> 2013-10-04 15:40:00.702242 7f259925c700 5 osd.60 pg_epoch: 83264 pg[1.192( v 27209'206 (0'0,27209'206] local-les=83242 n=1 ec=1 les/c 83242/83242 83264/83264/79540) [8,6,60] r=2 lpr=83264 pi=33428-83263/117 lcod 0'0 inactive NOTIFY] enter Start
-96> 2013-10-04 15:40:00.702253 7f259925c700 1 osd.60 pg_epoch: 83264 pg[1.192( v 27209'206 (0'0,27209'206] local-les=83242 n=1 ec=1 les/c 83242/83242 83264/83264/79540) [8,6,60] r=2 lpr=83264 pi=33428-83263/117 lcod 0'0 inactive NOTIFY] state<Start>: transitioning to Stray
-95> 2013-10-04 15:40:00.702266 7f259925c700 5 osd.60 pg_epoch: 83264 pg[1.192( v 27209'206 (0'0,27209'206] local-les=83242 n=1 ec=1 les/c 83242/83242 83264/83264/79540) [8,6,60] r=2 lpr=83264 pi=33428-83263/117 lcod 0'0 inactive NOTIFY] exit Start 0.000024 0 0.000000
-94> 2013-10-04 15:40:00.702278 7f259925c700 5 osd.60 pg_epoch: 83264 pg[1.192( v 27209'206 (0'0,27209'206] local-les=83242 n=1 ec=1 les/c 83242/83242 83264/83264/79540) [8,6,60] r=2 lpr=83264 pi=33428-83263/117 lcod 0'0 inactive NOTIFY] enter Started/Stray
-93> 2013-10-04 15:40:00.702421 7f259925c700 1 - [2001:620:0:6::10c]:6817/5598 --> [2001:620:0:6::108]:6808/5799 -- osd_map(83264..83264 src has 81624..83264) v3 -- ?+0 0x250bd8c0 con 0x1ee84f20
92> 2013-10-04 15:40:00.702442 7f259925c700 1 - [2001:620:0:6::10c]:6817/5598 --> [2001:620:0:6::108]:6808/5799 -- pg_notify(1.192(117) epoch 83264) v4 -- ?+0 0x1f75ae00 con 0x1ee84f20
91> 2013-10-04 15:40:00.702619 7f2598a5b700 10 monclient: _send_mon_message to mon.s2 at [2001:620:0:6::108]:6789/0
-90> 2013-10-04 15:40:00.702630 7f2598a5b700 1 - [2001:620:0:6::10c]:6816/5598 --> [2001:620:0:6::108]:6789/0 -- osd_pgtemp(e83264 {0.9b=[60,20,37],0.b11=[60,29,34],0.c2c=[60,16,18,35],11.b06=[60,34]} v83264) v1 -- ?+0 0x1f342960 con 0x20518f20
89> 2013-10-04 15:40:00.702620 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.193( v 82349'7542 (14513'4540,82349'7542] local-les=83237 n=2154 ec=1 les/c 83088/83092 83263/83263/79544) [8,60] r=1 lpr=83263 pi=27025-83262/147 lcod 28139'7541 inactive NOTIFY] exit Started/Stray 21.076883 3 0.000531
-88> 2013-10-04 15:40:00.702646 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.193( v 82349'7542 (14513'4540,82349'7542] local-les=83237 n=2154 ec=1 les/c 83088/83092 83263/83263/79544) [8,60] r=1 lpr=83263 pi=27025-83262/147 lcod 28139'7541 inactive NOTIFY] exit Started 21.076962 0 0.000000
-87> 2013-10-04 15:40:00.702668 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.193( v 82349'7542 (14513'4540,82349'7542] local-les=83237 n=2154 ec=1 les/c 83088/83092 83263/83263/79544) [8,60] r=1 lpr=83263 pi=27025-83262/147 lcod 28139'7541 inactive NOTIFY] enter Reset
-86> 2013-10-04 15:40:00.702922 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.193( v 82349'7542 (14513'4540,82349'7542] local-les=83237 n=2154 ec=1 les/c 83088/83092 83264/83264/79544) [8,6,60]/[8,60,35] r=1 lpr=83264 pi=27025-83263/148 lcod 28139'7541 remapped NOTIFY] exit Reset 0.000254 1 0.000110
-85> 2013-10-04 15:40:00.702922 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[0.728( v 82349'7820 (14521'4818,82349'7820] local-les=83091 n=2205 ec=1 les/c 83091/83091 83263/83263/80733) [17,60] r=1 lpr=83263 pi=79947-83262/28 lcod 28082'7819 inactive] exit Started/Stray 21.077013 3 0.000212
-84> 2013-10-04 15:40:00.702950 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.193( v 82349'7542 (14513'4540,82349'7542] local-les=83237 n=2154 ec=1 les/c 83088/83092 83264/83264/79544) [8,6,60]/[8,60,35] r=1 lpr=83264 pi=27025-83263/148 lcod 28139'7541 remapped NOTIFY] enter Started
-83> 2013-10-04 15:40:00.702960 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.193( v 82349'7542 (14513'4540,82349'7542] local-les=83237 n=2154 ec=1 les/c 83088/83092 83264/83264/79544) [8,6,60]/[8,60,35] r=1 lpr=83264 pi=27025-83263/148 lcod 28139'7541 remapped NOTIFY] enter Start
-82> 2013-10-04 15:40:00.702960 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[0.728( v 82349'7820 (14521'4818,82349'7820] local-les=83091 n=2205 ec=1 les/c 83091/83091 83263/83263/80733) [17,60] r=1 lpr=83263 pi=79947-83262/28 lcod 28082'7819 inactive] exit Started 21.077116 0 0.000000
-81> 2013-10-04 15:40:00.702974 7f259925c700 1 osd.60 pg_epoch: 83264 pg[0.193( v 82349'7542 (14513'4540,82349'7542] local-les=83237 n=2154 ec=1 les/c 83088/83092 83264/83264/79544) [8,6,60]/[8,60,35] r=1 lpr=83264 pi=27025-83263/148 lcod 28139'7541 remapped NOTIFY] state<Start>: transitioning to Stray
-80> 2013-10-04 15:40:00.702979 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[0.728( v 82349'7820 (14521'4818,82349'7820] local-les=83091 n=2205 ec=1 les/c 83091/83091 83263/83263/80733) [17,60] r=1 lpr=83263 pi=79947-83262/28 lcod 28082'7819 inactive] enter Reset
-79> 2013-10-04 15:40:00.703001 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.193( v 82349'7542 (14513'4540,82349'7542] local-les=83237 n=2154 ec=1 les/c 83088/83092 83264/83264/79544) [8,6,60]/[8,60,35] r=1 lpr=83264 pi=27025-83263/148 lcod 28139'7541 remapped NOTIFY] exit Start 0.000042 0 0.000000
-78> 2013-10-04 15:40:00.703030 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.193( v 82349'7542 (14513'4540,82349'7542] local-les=83237 n=2154 ec=1 les/c 83088/83092 83264/83264/79544) [8,6,60]/[8,60,35] r=1 lpr=83264 pi=27025-83263/148 lcod 28139'7541 remapped NOTIFY] enter Started/Stray
-77> 2013-10-04 15:40:00.703049 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[0.728( v 82349'7820 (14521'4818,82349'7820] local-les=83091 n=2205 ec=1 les/c 83091/83091 83264/83264/80733) [17,60,6]/[17,60,33] r=1 lpr=83264 pi=79947-83263/29 lcod 28082'7819 remapped] exit Reset 0.000071 1 0.000127
-76> 2013-10-04 15:40:00.703078 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[0.728( v 82349'7820 (14521'4818,82349'7820] local-les=83091 n=2205 ec=1 les/c 83091/83091 83264/83264/80733) [17,60,6]/[17,60,33] r=1 lpr=83264 pi=79947-83263/29 lcod 28082'7819 remapped] enter Started
-75> 2013-10-04 15:40:00.703099 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[0.728( v 82349'7820 (14521'4818,82349'7820] local-les=83091 n=2205 ec=1 les/c 83091/83091 83264/83264/80733) [17,60,6]/[17,60,33] r=1 lpr=83264 pi=79947-83263/29 lcod 28082'7819 remapped] enter Start
-74> 2013-10-04 15:40:00.703119 7f2598a5b700 1 osd.60 pg_epoch: 83264 pg[0.728( v 82349'7820 (14521'4818,82349'7820] local-les=83091 n=2205 ec=1 les/c 83091/83091 83264/83264/80733) [17,60,6]/[17,60,33] r=1 lpr=83264 pi=79947-83263/29 lcod 28082'7819 remapped] state<Start>: transitioning to Stray
-73> 2013-10-04 15:40:00.703139 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[0.728( v 82349'7820 (14521'4818,82349'7820] local-les=83091 n=2205 ec=1 les/c 83091/83091 83264/83264/80733) [17,60,6]/[17,60,33] r=1 lpr=83264 pi=79947-83263/29 lcod 28082'7819 remapped] exit Start 0.000040 0 0.000000
-72> 2013-10-04 15:40:00.703159 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[0.728( v 82349'7820 (14521'4818,82349'7820] local-les=83091 n=2205 ec=1 les/c 83091/83091 83264/83264/80733) [17,60,6]/[17,60,33] r=1 lpr=83264 pi=79947-83263/29 lcod 28082'7819 remapped] enter Started/Stray
-71> 2013-10-04 15:40:00.703271 7f259925c700 1 osd.60 pg_epoch: 83264 pg[0.d90( v 82349'7589 (14513'4587,82349'7589] local-les=83092 n=2210 ec=1 les/c 83092/82890 83263/83263/83239) [60,49]/[60,49,5] r=0 lpr=83263 pi=82889-83262/17 lcod 28199'7588 mlcod 0'0 remapped+peering] state<Started/Primary/Peering>: Peering, affected_by_map, going to Reset
-70> 2013-10-04 15:40:00.703296 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[1.727( v 79513'845 (0'0,79513'845] local-les=83246 n=1 ec=1 les/c 83239/83239 83263/83263/80733) [17,60] r=1 lpr=83263 pi=79947-83262/19 lcod 0'0 inactive] exit Started/Stray 21.077129 3 0.000149
-69> 2013-10-04 15:40:00.703303 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.d90( v 82349'7589 (14513'4587,82349'7589] local-les=83092 n=2210 ec=1 les/c 83092/82890 83263/83263/83239) [60,49]/[60,49,5] r=0 lpr=83263 pi=82889-83262/17 lcod 28199'7588 mlcod 0'0 remapped+peering] exit Started/Primary/Peering/WaitUpThru 20.907000 3 0.000184
-68> 2013-10-04 15:40:00.703328 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[1.727( v 79513'845 (0'0,79513'845] local-les=83246 n=1 ec=1 les/c 83239/83239 83263/83263/80733) [17,60] r=1 lpr=83263 pi=79947-83262/19 lcod 0'0 inactive] exit Started 21.077225 0 0.000000
-67> 2013-10-04 15:40:00.703340 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.d90( v 82349'7589 (14513'4587,82349'7589] local-les=83092 n=2210 ec=1 les/c 83092/82890 83263/83263/83239) [60,49]/[60,49,5] r=0 lpr=83263 pi=82889-83262/17 lcod 28199'7588 mlcod 0'0 remapped+peering] exit Started/Primary/Peering 21.077170 0 0.000000
-66> 2013-10-04 15:40:00.703351 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[1.727( v 79513'845 (0'0,79513'845] local-les=83246 n=1 ec=1 les/c 83239/83239 83263/83263/80733) [17,60] r=1 lpr=83263 pi=79947-83262/19 lcod 0'0 inactive] enter Reset
-65> 2013-10-04 15:40:00.703366 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.d90( v 82349'7589 (14513'4587,82349'7589] local-les=83092 n=2210 ec=1 les/c 83092/82890 83263/83263/83239) [60,49]/[60,49,5] r=0 lpr=83263 pi=82889-83262/17 lcod 28199'7588 mlcod 0'0 remapped] exit Started/Primary 21.077209 0 0.000000
-64> 2013-10-04 15:40:00.703388 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.d90( v 82349'7589 (14513'4587,82349'7589] local-les=83092 n=2210 ec=1 les/c 83092/82890 83263/83263/83239) [60,49]/[60,49,5] r=0 lpr=83263 pi=82889-83262/17 lcod 28199'7588 mlcod 0'0 remapped] exit Started 21.077293 0 0.000000
-63> 2013-10-04 15:40:00.703393 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[1.727( v 79513'845 (0'0,79513'845] local-les=83246 n=1 ec=1 les/c 83239/83239 83264/83264/80733) [17,60,6] r=1 lpr=83264 pi=79947-83263/20 lcod 0'0 inactive] exit Reset 0.000042 1 0.000098
-62> 2013-10-04 15:40:00.703418 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.d90( v 82349'7589 (14513'4587,82349'7589] local-les=83092 n=2210 ec=1 les/c 83092/82890 83263/83263/83239) [60,49]/[60,49,5] r=0 lpr=83263 pi=82889-83262/17 lcod 28199'7588 mlcod 0'0 remapped] enter Reset
-61> 2013-10-04 15:40:00.703423 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[1.727( v 79513'845 (0'0,79513'845] local-les=83246 n=1 ec=1 les/c 83239/83239 83264/83264/80733) [17,60,6] r=1 lpr=83264 pi=79947-83263/20 lcod 0'0 inactive] enter Started
-60> 2013-10-04 15:40:00.703457 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[1.727( v 79513'845 (0'0,79513'845] local-les=83246 n=1 ec=1 les/c 83239/83239 83264/83264/80733) [17,60,6] r=1 lpr=83264 pi=79947-83263/20 lcod 0'0 inactive] enter Start
-59> 2013-10-04 15:40:00.703463 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.d90( v 82349'7589 (14513'4587,82349'7589] local-les=83092 n=2210 ec=1 les/c 83092/82890 83264/83264/83239) [6,60,49]/[60,6,49,5] r=0 lpr=83264 pi=82889-83263/18 lcod 28199'7588 mlcod 0'0 remapped] exit Reset 0.000045 1 0.000193
-58> 2013-10-04 15:40:00.703477 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.d90( v 82349'7589 (14513'4587,82349'7589] local-les=83092 n=2210 ec=1 les/c 83092/82890 83264/83264/83239) [6,60,49]/[60,6,49,5] r=0 lpr=83264 pi=82889-83263/18 lcod 28199'7588 mlcod 0'0 remapped] enter Started
-57> 2013-10-04 15:40:00.703475 7f2598a5b700 1 osd.60 pg_epoch: 83264 pg[1.727( v 79513'845 (0'0,79513'845] local-les=83246 n=1 ec=1 les/c 83239/83239 83264/83264/80733) [17,60,6] r=1 lpr=83264 pi=79947-83263/20 lcod 0'0 inactive] state<Start>: transitioning to Stray
-56> 2013-10-04 15:40:00.703492 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.d90( v 82349'7589 (14513'4587,82349'7589] local-les=83092 n=2210 ec=1 les/c 83092/82890 83264/83264/83239) [6,60,49]/[60,6,49,5] r=0 lpr=83264 pi=82889-83263/18 lcod 28199'7588 mlcod 0'0 remapped] enter Start
-55> 2013-10-04 15:40:00.703496 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[1.727( v 79513'845 (0'0,79513'845] local-les=83246 n=1 ec=1 les/c 83239/83239 83264/83264/80733) [17,60,6] r=1 lpr=83264 pi=79947-83263/20 lcod 0'0 inactive] exit Start 0.000040 0 0.000000
-54> 2013-10-04 15:40:00.703509 7f259925c700 1 osd.60 pg_epoch: 83264 pg[0.d90( v 82349'7589 (14513'4587,82349'7589] local-les=83092 n=2210 ec=1 les/c 83092/82890 83264/83264/83239) [6,60,49]/[60,6,49,5] r=0 lpr=83264 pi=82889-83263/18 lcod 28199'7588 mlcod 0'0 remapped] state<Start>: transitioning to Primary
-53> 2013-10-04 15:40:00.703514 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[1.727( v 79513'845 (0'0,79513'845] local-les=83246 n=1 ec=1 les/c 83239/83239 83264/83264/80733) [17,60,6] r=1 lpr=83264 pi=79947-83263/20 lcod 0'0 inactive] enter Started/Stray
-52> 2013-10-04 15:40:00.703520 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.d90( v 82349'7589 (14513'4587,82349'7589] local-les=83092 n=2210 ec=1 les/c 83092/82890 83264/83264/83239) [6,60,49]/[60,6,49,5] r=0 lpr=83264 pi=82889-83263/18 lcod 28199'7588 mlcod 0'0 remapped] exit Start 0.000028 0 0.000000
-51> 2013-10-04 15:40:00.703559 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.d90( v 82349'7589 (14513'4587,82349'7589] local-les=83092 n=2210 ec=1 les/c 83092/82890 83264/83264/83239) [6,60,49]/[60,6,49,5] r=0 lpr=83264 pi=82889-83263/18 lcod 28199'7588 mlcod 0'0 remapped] enter Started/Primary
-50> 2013-10-04 15:40:00.703574 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.d90( v 82349'7589 (14513'4587,82349'7589] local-les=83092 n=2210 ec=1 les/c 83092/82890 83264/83264/83239) [6,60,49]/[60,6,49,5] r=0 lpr=83264 pi=82889-83263/18 lcod 28199'7588 mlcod 0'0 remapped] enter Started/Primary/Peering
-49> 2013-10-04 15:40:00.703587 7f259925c700 5 osd.60 pg_epoch: 83264 pg[0.d90( v 82349'7589 (14513'4587,82349'7589] local-les=83092 n=2210 ec=1 les/c 83092/82890 83264/83264/83239) [6,60,49]/[60,6,49,5] r=0 lpr=83264 pi=82889-83263/18 lcod 28199'7588 mlcod 0'0 remapped+peering] enter Started/Primary/Peering/GetInfo
-48> 2013-10-04 15:40:00.703650 7f259925c700 1 - [2001:620:0:6::10c]:6817/5598 --> [2001:620:0:6::108]:6808/5799 -- pg_notify(0.193(148) epoch 83264) v4 -- ?+0 0x210f1340 con 0x1ee84f20
47> 2013-10-04 15:40:00.703674 7f259925c700 1 - [2001:620:0:6::10c]:6817/5598 --> [2001:620:0:6::104]:6809/4986 -- osd_map(83264..83264 src has 81624..83264) v3 -- ?+0 0x24df4900 con 0x201ab8c0
46> 2013-10-04 15:40:00.703697 7f259925c700 1 - [2001:620:0:6::10c]:6817/5598 --> [2001:620:0:6::104]:6809/4986 -- pg_query(0.d90 epoch 83264) v2 -- ?+0 0x1ecc7860 con 0x201ab8c0
45> 2013-10-04 15:40:00.703829 7f259925c700 1 - [2001:620:0:6::10c]:6817/5598 --> [2001:620:0:6::104]:6817/1005230 -- osd_map(83263..83264 src has 81624..83264) v3 -- ?+0 0x27092fc0 con 0x20567b80
44> 2013-10-04 15:40:00.703848 7f259925c700 1 - [2001:620:0:6::10c]:6817/5598 --> [2001:620:0:6::104]:6817/1005230 -- pg_query(0.d90 epoch 83264) v2 -- ?+0 0x25fe2000 con 0x20567b80
43> 2013-10-04 15:40:00.703978 7f2598a5b700 10 monclient: _send_mon_message to mon.s2 at [2001:620:0:6::108]:6789/0
-42> 2013-10-04 15:40:00.703989 7f2598a5b700 1 - [2001:620:0:6::10c]:6816/5598 --> [2001:620:0:6::108]:6789/0 -- osd_pgtemp(e83264 {0.9b=[60,20,37],0.b11=[60,29,34],0.c2c=[60,16,18,35],11.b06=[60,34]} v83264) v1 -- ?+0 0x1f342000 con 0x20518f20
41> 2013-10-04 15:40:00.704415 7f2598a5b700 10 monclient: _send_mon_message to mon.s2 at [2001:620:0:6::108]:6789/0
-40> 2013-10-04 15:40:00.704430 7f2598a5b700 1 - [2001:620:0:6::10c]:6816/5598 --> [2001:620:0:6::108]:6789/0 -- osd_pgtemp(e83264 {0.9b=[60,20,37],0.b11=[60,29,34],0.c2c=[60,16,18,35],11.b06=[60,34]} v83264) v1 -- ?+0 0x20ada3c0 con 0x20518f20
39> 2013-10-04 15:40:00.704923 7f2598a5b700 10 monclient: _send_mon_message to mon.s2 at [2001:620:0:6::108]:6789/0
-38> 2013-10-04 15:40:00.704936 7f2598a5b700 1 - [2001:620:0:6::10c]:6816/5598 --> [2001:620:0:6::108]:6789/0 -- osd_pgtemp(e83264 {0.9b=[60,20,37],0.b11=[60,29,34],0.c2c=[60,16,18,35],11.b06=[60,34]} v83264) v1 -- ?+0 0x20adb860 con 0x20518f20
37> 2013-10-04 15:40:00.705152 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[1.cc2( v 79513'1188 (0'0,79513'1188] local-les=83231 n=1 ec=1 les/c 83088/83090 83263/83263/82324) [56,60] r=1 lpr=83263 pi=80186-83262/20 lcod 0'0 inactive] exit Started/Stray 21.077678 3 0.000139
-36> 2013-10-04 15:40:00.705175 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[1.cc2( v 79513'1188 (0'0,79513'1188] local-les=83231 n=1 ec=1 les/c 83088/83090 83263/83263/82324) [56,60] r=1 lpr=83263 pi=80186-83262/20 lcod 0'0 inactive] exit Started 21.077764 0 0.000000
-35> 2013-10-04 15:40:00.705201 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[1.cc2( v 79513'1188 (0'0,79513'1188] local-les=83231 n=1 ec=1 les/c 83088/83090 83263/83263/82324) [56,60] r=1 lpr=83263 pi=80186-83262/20 lcod 0'0 inactive] enter Reset
-34> 2013-10-04 15:40:00.705245 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[1.cc2( v 79513'1188 (0'0,79513'1188] local-les=83231 n=1 ec=1 les/c 83088/83090 83264/83264/82324) [56,60,6]/[56,60,22] r=1 lpr=83264 pi=80186-83263/21 lcod 0'0 remapped] exit Reset 0.000043 1 0.000092
-33> 2013-10-04 15:40:00.705268 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[1.cc2( v 79513'1188 (0'0,79513'1188] local-les=83231 n=1 ec=1 les/c 83088/83090 83264/83264/82324) [56,60,6]/[56,60,22] r=1 lpr=83264 pi=80186-83263/21 lcod 0'0 remapped] enter Started
-32> 2013-10-04 15:40:00.705283 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[1.cc2( v 79513'1188 (0'0,79513'1188] local-les=83231 n=1 ec=1 les/c 83088/83090 83264/83264/82324) [56,60,6]/[56,60,22] r=1 lpr=83264 pi=80186-83263/21 lcod 0'0 remapped] enter Start
-31> 2013-10-04 15:40:00.705295 7f2598a5b700 1 osd.60 pg_epoch: 83264 pg[1.cc2( v 79513'1188 (0'0,79513'1188] local-les=83231 n=1 ec=1 les/c 83088/83090 83264/83264/82324) [56,60,6]/[56,60,22] r=1 lpr=83264 pi=80186-83263/21 lcod 0'0 remapped] state<Start>: transitioning to Stray
-30> 2013-10-04 15:40:00.705312 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[1.cc2( v 79513'1188 (0'0,79513'1188] local-les=83231 n=1 ec=1 les/c 83088/83090 83264/83264/82324) [56,60,6]/[56,60,22] r=1 lpr=83264 pi=80186-83263/21 lcod 0'0 remapped] exit Start 0.000029 0 0.000000
-29> 2013-10-04 15:40:00.705322 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[1.cc2( v 79513'1188 (0'0,79513'1188] local-les=83231 n=1 ec=1 les/c 83088/83090 83264/83264/82324) [56,60,6]/[56,60,22] r=1 lpr=83264 pi=80186-83263/21 lcod 0'0 remapped] enter Started/Stray
-28> 2013-10-04 15:40:00.705328 7f257d703700 2 - [2001:620:0:6::10c]:6817/5598 >> [2001:620:0:6::104]:6817/1005230 pipe(0x21054280 sd=106 :56037 s=1 pgs=0 cs=0 l=0 c=0x20567b80). got newly_acked_seq 100 vs out_seq 0
27> 2013-10-04 15:40:00.705346 7f257d703700 2 - [2001:620:0:6::10c]:6817/5598 >> [2001:620:0:6::104]:6817/1005230 pipe(0x21054280 sd=106 :56037 s=1 pgs=0 cs=0 l=0 c=0x20567b80). discarding previously sent 0 osd_map(83263..83264 src has 81624..83264) v3
26> 2013-10-04 15:40:00.705358 7f257d703700 2 - [2001:620:0:6::10c]:6817/5598 >> [2001:620:0:6::104]:6817/1005230 pipe(0x21054280 sd=106 :56037 s=1 pgs=0 cs=0 l=0 c=0x20567b80). discarding previously sent 0 pg_query(0.d90 epoch 83264) v2
25> 2013-10-04 15:40:00.705429 7f259925c700 1 - [2001:620:0:6::10c]:6817/5598 --> [2001:620:0:6::10e]:6820/17512 -- osd_map(83264..83264 src has 81624..83264) v3 -- ?+0 0x259d2fc0 con 0x1eef69a0
24> 2013-10-04 15:40:00.705470 7f259925c700 1 - [2001:620:0:6::10c]:6817/5598 --> [2001:620:0:6::10e]:6820/17512 -- pg_query(0.d90 epoch 83264) v2 -- ?+0 0x25fe2b40 con 0x1eef69a0
23> 2013-10-04 15:40:00.705690 7f259925c700 1 - [2001:620:0:6::10c]:6817/5598 --> [2001:620:0:6::112]:6833/28295 -- osd_map(83264..83264 src has 81624..83264) v3 -- ?+0 0x21296480 con 0x1ee9a580
22> 2013-10-04 15:40:00.705721 7f259925c700 1 - [2001:620:0:6::10c]:6817/5598 --> [2001:620:0:6::112]:6833/28295 -- pg_query(0.d90 epoch 83264) v2 -- ?+0 0x25fe32c0 con 0x1ee9a580
-21> 2013-10-04 15:40:00.705735 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[0.cc3( v 82349'7706 (14513'4704,82349'7706] local-les=83091 n=2225 ec=1 les/c 83091/83065 83263/83263/82465) [56,60]/[56,60,5] r=1 lpr=83263 pi=80186-83262/30 lcod 28197'7705 remapped] exit Started/Stray 21.077960 3 0.000153
-20> 2013-10-04 15:40:00.705765 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[0.cc3( v 82349'7706 (14513'4704,82349'7706] local-les=83091 n=2225 ec=1 les/c 83091/83065 83263/83263/82465) [56,60]/[56,60,5] r=1 lpr=83263 pi=80186-83262/30 lcod 28197'7705 remapped] exit Started 21.078064 0 0.000000
-19> 2013-10-04 15:40:00.705776 7f259925c700 1 osd.60 pg_epoch: 83264 pg[1.d8f( v 28074'892 (0'0,28074'892] local-les=83242 n=0 ec=1 les/c 83242/83246 83263/83263/83263) [60,49] r=0 lpr=83263 pi=83238-83262/2 lcod 0'0 mlcod 0'0 peering] state<Started/Primary/Peering>: Peering, affected_by_map, going to Reset
-18> 2013-10-04 15:40:00.705787 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[0.cc3( v 82349'7706 (14513'4704,82349'7706] local-les=83091 n=2225 ec=1 les/c 83091/83065 83263/83263/82465) [56,60]/[56,60,5] r=1 lpr=83263 pi=80186-83262/30 lcod 28197'7705 remapped] enter Reset
-17> 2013-10-04 15:40:00.705803 7f259925c700 5 osd.60 pg_epoch: 83264 pg[1.d8f( v 28074'892 (0'0,28074'892] local-les=83242 n=0 ec=1 les/c 83242/83246 83263/83263/83263) [60,49] r=0 lpr=83263 pi=83238-83262/2 lcod 0'0 mlcod 0'0 peering] exit Started/Primary/Peering/WaitUpThru 20.909290 3 0.000176
-16> 2013-10-04 15:40:00.705823 7f259925c700 5 osd.60 pg_epoch: 83264 pg[1.d8f( v 28074'892 (0'0,28074'892] local-les=83242 n=0 ec=1 les/c 83242/83246 83263/83263/83263) [60,49] r=0 lpr=83263 pi=83238-83262/2 lcod 0'0 mlcod 0'0 peering] exit Started/Primary/Peering 21.079355 0 0.000000
-15> 2013-10-04 15:40:00.705831 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[0.cc3( v 82349'7706 (14513'4704,82349'7706] local-les=83091 n=2225 ec=1 les/c 83091/83065 83264/83264/82465) [56,60,6]/[56,60,6,5] r=1 lpr=83264 pi=80186-83263/31 lcod 28197'7705 remapped] exit Reset 0.000045 1 0.000098
-14> 2013-10-04 15:40:00.705844 7f259925c700 5 osd.60 pg_epoch: 83264 pg[1.d8f( v 28074'892 (0'0,28074'892] local-les=83242 n=0 ec=1 les/c 83242/83246 83263/83263/83263) [60,49] r=0 lpr=83263 pi=83238-83262/2 lcod 0'0 mlcod 0'0 inactive] exit Started/Primary 21.079390 0 0.000000
-13> 2013-10-04 15:40:00.705850 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[0.cc3( v 82349'7706 (14513'4704,82349'7706] local-les=83091 n=2225 ec=1 les/c 83091/83065 83264/83264/82465) [56,60,6]/[56,60,6,5] r=1 lpr=83264 pi=80186-83263/31 lcod 28197'7705 remapped] enter Started
-12> 2013-10-04 15:40:00.705863 7f259925c700 5 osd.60 pg_epoch: 83264 pg[1.d8f( v 28074'892 (0'0,28074'892] local-les=83242 n=0 ec=1 les/c 83242/83246 83263/83263/83263) [60,49] r=0 lpr=83263 pi=83238-83262/2 lcod 0'0 mlcod 0'0 inactive] exit Started 21.079468 0 0.000000
-11> 2013-10-04 15:40:00.705885 7f259925c700 5 osd.60 pg_epoch: 83264 pg[1.d8f( v 28074'892 (0'0,28074'892] local-les=83242 n=0 ec=1 les/c 83242/83246 83263/83263/83263) [60,49] r=0 lpr=83263 pi=83238-83262/2 lcod 0'0 mlcod 0'0 inactive] enter Reset
-10> 2013-10-04 15:40:00.705946 7f259925c700 5 osd.60 pg_epoch: 83264 pg[1.d8f( v 28074'892 (0'0,28074'892] local-les=83242 n=0 ec=1 les/c 83242/83246 83264/83264/83264) [6,60,49] r=1 lpr=83264 pi=83238-83263/3 lcod 0'0 inactive NOTIFY] exit Reset 0.000060 1 0.000158
-9> 2013-10-04 15:40:00.705970 7f259925c700 5 osd.60 pg_epoch: 83264 pg[1.d8f( v 28074'892 (0'0,28074'892] local-les=83242 n=0 ec=1 les/c 83242/83246 83264/83264/83264) [6,60,49] r=1 lpr=83264 pi=83238-83263/3 lcod 0'0 inactive NOTIFY] enter Started
-8> 2013-10-04 15:40:00.705988 7f259925c700 5 osd.60 pg_epoch: 83264 pg[1.d8f( v 28074'892 (0'0,28074'892] local-les=83242 n=0 ec=1 les/c 83242/83246 83264/83264/83264) [6,60,49] r=1 lpr=83264 pi=83238-83263/3 lcod 0'0 inactive NOTIFY] enter Start
-7> 2013-10-04 15:40:00.706004 7f259925c700 1 osd.60 pg_epoch: 83264 pg[1.d8f( v 28074'892 (0'0,28074'892] local-les=83242 n=0 ec=1 les/c 83242/83246 83264/83264/83264) [6,60,49] r=1 lpr=83264 pi=83238-83263/3 lcod 0'0 inactive NOTIFY] state<Start>: transitioning to Stray
-6> 2013-10-04 15:40:00.706021 7f259925c700 5 osd.60 pg_epoch: 83264 pg[1.d8f( v 28074'892 (0'0,28074'892] local-les=83242 n=0 ec=1 les/c 83242/83246 83264/83264/83264) [6,60,49] r=1 lpr=83264 pi=83238-83263/3 lcod 0'0 inactive NOTIFY] exit Start 0.000033 0 0.000000
-5> 2013-10-04 15:40:00.706039 7f259925c700 5 osd.60 pg_epoch: 83264 pg[1.d8f( v 28074'892 (0'0,28074'892] local-les=83242 n=0 ec=1 les/c 83242/83246 83264/83264/83264) [6,60,49] r=1 lpr=83264 pi=83238-83263/3 lcod 0'0 inactive NOTIFY] enter Started/Stray
-4> 2013-10-04 15:40:00.705864 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[0.cc3( v 82349'7706 (14513'4704,82349'7706] local-les=83091 n=2225 ec=1 les/c 83091/83065 83264/83264/82465) [56,60,6]/[56,60,6,5] r=1 lpr=83264 pi=80186-83263/31 lcod 28197'7705 remapped] enter Start
-3> 2013-10-04 15:40:00.706150 7f2598a5b700 1 osd.60 pg_epoch: 83264 pg[0.cc3( v 82349'7706 (14513'4704,82349'7706] local-les=83091 n=2225 ec=1 les/c 83091/83065 83264/83264/82465) [56,60,6]/[56,60,6,5] r=1 lpr=83264 pi=80186-83263/31 lcod 28197'7705 remapped] state<Start>: transitioning to Stray
-2> 2013-10-04 15:40:00.706175 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[0.cc3( v 82349'7706 (14513'4704,82349'7706] local-les=83091 n=2225 ec=1 les/c 83091/83065 83264/83264/82465) [56,60,6]/[56,60,6,5] r=1 lpr=83264 pi=80186-83263/31 lcod 28197'7705 remapped] exit Start 0.000311 0 0.000000
-1> 2013-10-04 15:40:00.706191 7f2598a5b700 5 osd.60 pg_epoch: 83264 pg[0.cc3( v 82349'7706 (14513'4704,82349'7706] local-les=83091 n=2225 ec=1 les/c 83091/83065 83264/83264/82465) [56,60,6]/[56,60,6,5] r=1 lpr=83264 pi=80186-83263/31 lcod 28197'7705 remapped] enter Started/Stray
0> 2013-10-04 15:40:00.706465 7f257d703700 -1 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7f257d703700 time 2013-10-04 15:40:00.705417
msg/Pipe.cc: 1043: FAILED assert(m)
ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)
1: (Pipe::connect()+0x3b25) [0xa7d3f5]
2: (Pipe::writer()+0xb3d) [0xa7e2bd]
3: (Pipe::Writer::entry()+0xd) [0xa879ad]
4: (()+0x7f8e) [0x7f25ba2a1f8e]
5: (clone()+0x6d) [0x7f25b8553e1d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
2/-2 (syslog threshold) end dump of recent events ---
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.60.log
--
2013-10-04 15:40:00.919172 7f257d703700 -1 ** Caught signal (Aborted) *
in thread 7f257d703700
ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)
1: /usr/bin/ceph-osd() [0x8e90e0]
2: (()+0xfbd0) [0x7f25ba2a9bd0]
3: (gsignal()+0x37) [0x7f25b8491037]
4: (abort()+0x148) [0x7f25b8494698]
5: (_gnu_cxx::_verbose_terminate_handler()+0x11d) [0x7f25b8d9de8d]
6: (()+0x5ef76) [0x7f25b8d9bf76]
7: (()+0x5efa3) [0x7f25b8d9bfa3]
8: (()+0x5f1de) [0x7f25b8d9c1de]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x43d) [0x9ab40d]
10: (Pipe::connect()+0x3b25) [0xa7d3f5]
11: (Pipe::writer()+0xb3d) [0xa7e2bd]
12: (Pipe::Writer::entry()+0xd) [0xa879ad]
13: (()+0x7f8e) [0x7f25ba2a1f8e]
14: (clone()+0x6d) [0x7f25b8553e1d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
5> 2013-10-04 15:40:00.804751 7f259eb68700 1 - [2001:620:0:6::10c]:6816/5598 <== mon.1 [2001:620:0:6::108]:6789/0 30 ==== osd_map(83261..83264 src has 81624..83264) v3 ==== 25523+0+0 (26236143 0 0) 0x2068c240 con 0x20518f20
4> 2013-10-04 15:40:00.804817 7f259eb68700 3 osd.60 83264 handle_osd_map epochs [83261,83264], i have 83264, src has [81624,83264] [2001:620:0:6::10c]:6819/5598 <== osd.35 [2001:620:0:6::106]:0/5309 27085 ==== osd_ping(ping e83264 stamp 2013-10-04 15:40:00.817483) v2 ==== 47+0+0 (3777669719 0 0) 0x263b9500 con 0x209fef20
-3> 2013-10-04 15:40:00.821351 7f259c363700 1 -
2> 2013-10-04 15:40:00.821386 7f259c363700 1 - [2001:620:0:6::10c]:6819/5598 --> [2001:620:0:6::106]:0/5309 -- osd_ping(ping_reply e83264 stamp 2013-10-04 15:40:00.817483) v2 -- ?+0 0x21797dc0 con 0x209fef20
1> 2013-10-04 15:40:00.821391 7f259b361700 1 - [2001:620:0:6::10c]:6818/5598 <== osd.35 [2001:620:0:6::106]:0/5309 27085 ==== osd_ping(ping e83264 stamp 2013-10-04 15:40:00.817483) v2 ==== 47+0+0 (3777669719 0 0) 0x25052700 con 0x20f11b80
0> 2013-10-04 15:40:00.919172 7f257d703700 -1 ** Caught signal (Aborted) *
in thread 7f257d703700
ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a)
1: /usr/bin/ceph-osd() [0x8e90e0]
2: (()+0xfbd0) [0x7f25ba2a9bd0]
3: (gsignal()+0x37) [0x7f25b8491037]
4: (abort()+0x148) [0x7f25b8494698]
5: (_gnu_cxx::_verbose_terminate_handler()+0x11d) [0x7f25b8d9de8d]
6: (()+0x5ef76) [0x7f25b8d9bf76]
7: (()+0x5efa3) [0x7f25b8d9bfa3]
8: (()+0x5f1de) [0x7f25b8d9c1de]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x43d) [0x9ab40d]
10: (Pipe::connect()+0x3b25) [0xa7d3f5]
11: (Pipe::writer()+0xb3d) [0xa7e2bd]
12: (Pipe::Writer::entry()+0xd) [0xa879ad]
13: (()+0x7f8e) [0x7f25ba2a1f8e]
14: (clone()+0x6d) [0x7f25b8553e1d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
2/-2 (syslog threshold) end dump of recent events ---
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.60.log
--
Interestingly, only the SATA backed OSD crashed, the SSDs (which all are backing a specific pool) stayed up - even on those servers that lost all SATA OSDs.
The OSDs are restarting fine...
Updated by Sage Weil over 10 years ago
Jens, any osd logs you can attach to 6429 would be helpful. thanks!