Bug #10119

closed

0.88 EC+ KV OSDs crashing

Added by Kenneth Waegeman over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
OSD
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,
I am continuing to test the EC + KV setup, and the OSDs crashed again, so I updated ticket #9727.
The OSDs initially crashed without logging anything. I then tried restarting them a few times, but they crash again immediately, now with this error:

  -26> 2014-11-17 15:23:00.614802 7fca3492f700  1 -- 10.141.8.181:6825/1741 <== osd.5 10.141.8.180:0/33493 96 ==== osd_ping(ping e205 stamp 2014-11-17 15:23:00.614191) v2 ==== 47+0+0 (4154056460 0 0) 0x51850a0 con 0x5ff5280
   -25> 2014-11-17 15:23:00.614833 7fca3492f700  1 -- 10.141.8.181:6825/1741 --> 10.141.8.180:0/33493 -- osd_ping(ping_reply e205 stamp 2014-11-17 15:23:00.614191) v2 -- ?+0 0xa6185a0 con 0x5ff5280
   -24> 2014-11-17 15:23:00.614860 7fca3312c700  1 -- 10.143.8.181:6825/1741 <== osd.5 10.141.8.180:0/33493 96 ==== osd_ping(ping e205 stamp 2014-11-17 15:23:00.614191) v2 ==== 47+0+0 (4154056460 0 0) 0x10779fe0 con 0x5ff4fc0
   -23> 2014-11-17 15:23:00.614877 7fca3312c700  1 -- 10.143.8.181:6825/1741 --> 10.141.8.180:0/33493 -- osd_ping(ping_reply e205 stamp 2014-11-17 15:23:00.614191) v2 -- ?+0 0xa8b9680 con 0x5ff4fc0
   -22> 2014-11-17 15:23:00.850437 7fca3492f700  1 -- 10.141.8.181:6825/1741 <== osd.44 10.143.8.182:0/35567 108 ==== osd_ping(ping e205 stamp 2014-11-17 15:23:00.849840) v2 ==== 47+0+0 (3661645159 0 0) 0xa8205a0 con 0x63aa3c0
   -21> 2014-11-17 15:23:00.850471 7fca3492f700  1 -- 10.141.8.181:6825/1741 --> 10.143.8.182:0/35567 -- osd_ping(ping_reply e205 stamp 2014-11-17 15:23:00.849840) v2 -- ?+0 0x51850a0 con 0x63aa3c0
   -20> 2014-11-17 15:23:00.850485 7fca3312c700  1 -- 10.143.8.181:6825/1741 <== osd.44 10.143.8.182:0/35567 108 ==== osd_ping(ping e205 stamp 2014-11-17 15:23:00.849840) v2 ==== 47+0+0 (3661645159 0 0) 0x5184ce0 con 0x5e95ac0
   -19> 2014-11-17 15:23:00.850503 7fca3312c700  1 -- 10.143.8.181:6825/1741 --> 10.143.8.182:0/35567 -- osd_ping(ping_reply e205 stamp 2014-11-17 15:23:00.849840) v2 -- ?+0 0x10779fe0 con 0x5e95ac0
   -18> 2014-11-17 15:23:00.926742 7fca3312c700  1 -- 10.143.8.181:6825/1741 <== osd.28 10.143.8.181:0/60984 106 ==== osd_ping(ping e205 stamp 2014-11-17 15:23:00.926191) v2 ==== 47+0+0 (377939256 0 0) 0xa6c2b20 con 0x63b6ca0
   -17> 2014-11-17 15:23:00.926765 7fca3312c700  1 -- 10.143.8.181:6825/1741 --> 10.143.8.181:0/60984 -- osd_ping(ping_reply e205 stamp 2014-11-17 15:23:00.926191) v2 -- ?+0 0x5184ce0 con 0x63b6ca0
   -16> 2014-11-17 15:23:00.926861 7fca3492f700  1 -- 10.141.8.181:6825/1741 <== osd.28 10.143.8.181:0/60984 106 ==== osd_ping(ping e205 stamp 2014-11-17 15:23:00.926191) v2 ==== 47+0+0 (377939256 0 0) 0xa42c740 con 0x63b7640
   -15> 2014-11-17 15:23:00.926878 7fca3492f700  1 -- 10.141.8.181:6825/1741 --> 10.143.8.181:0/60984 -- osd_ping(ping_reply e205 stamp 2014-11-17 15:23:00.926191) v2 -- ?+0 0xa8205a0 con 0x63b7640
   -14> 2014-11-17 15:23:01.081963 7fca3312c700  1 -- 10.143.8.181:6825/1741 <== osd.40 10.141.8.182:0/32395 108 ==== osd_ping(ping e205 stamp 2014-11-17 15:23:01.081429) v2 ==== 47+0+0 (2256845560 0 0) 0xa8b4560 con 0x63b6720
   -13> 2014-11-17 15:23:01.081994 7fca3312c700  1 -- 10.143.8.181:6825/1741 --> 10.141.8.182:0/32395 -- osd_ping(ping_reply e205 stamp 2014-11-17 15:23:01.081429) v2 -- ?+0 0xa6c2b20 con 0x63b6720
   -12> 2014-11-17 15:23:01.082029 7fca3492f700  1 -- 10.141.8.181:6825/1741 <== osd.40 10.141.8.182:0/32395 108 ==== osd_ping(ping e205 stamp 2014-11-17 15:23:01.081429) v2 ==== 47+0+0 (2256845560 0 0) 0xa61e180 con 0x64e2c00
   -11> 2014-11-17 15:23:01.082055 7fca3492f700  1 -- 10.141.8.181:6825/1741 --> 10.141.8.182:0/32395 -- osd_ping(ping_reply e205 stamp 2014-11-17 15:23:01.081429) v2 -- ?+0 0xa42c740 con 0x64e2c00
   -10> 2014-11-17 15:23:01.124234 7fca1f87c700  1 -- 10.143.8.181:6824/1741 <== osd.29 10.143.8.181:6818/1234 117 ==== osd_sub_op(unknown.0.0:0 2.7ds4 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[]) v11 ==== 1202+0+0 (724286469 0 0) 0xd5eac00 con 0x5ff35a0
    -9> 2014-11-17 15:23:01.124258 7fca1f87c700  5 -- op tracker -- seq: 1721, time: 2014-11-17 15:23:01.124154, event: header_read, op: osd_sub_op(unknown.0.0:0 2.7ds4 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
    -8> 2014-11-17 15:23:01.124265 7fca1f87c700  5 -- op tracker -- seq: 1721, time: 2014-11-17 15:23:01.124156, event: throttled, op: osd_sub_op(unknown.0.0:0 2.7ds4 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
    -7> 2014-11-17 15:23:01.124272 7fca1f87c700  5 -- op tracker -- seq: 1721, time: 2014-11-17 15:23:01.124228, event: all_read, op: osd_sub_op(unknown.0.0:0 2.7ds4 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
    -6> 2014-11-17 15:23:01.124284 7fca1f87c700  5 -- op tracker -- seq: 1721, time: 0.000000, event: dispatched, op: osd_sub_op(unknown.0.0:0 2.7ds4 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
    -5> 2014-11-17 15:23:01.124323 7fca2c11e700  5 -- op tracker -- seq: 1721, time: 2014-11-17 15:23:01.124323, event: reached_pg, op: osd_sub_op(unknown.0.0:0 2.7ds4 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
    -4> 2014-11-17 15:23:01.124333 7fca2c11e700  5 -- op tracker -- seq: 1721, time: 2014-11-17 15:23:01.124333, event: started, op: osd_sub_op(unknown.0.0:0 2.7ds4 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
    -3> 2014-11-17 15:23:01.124342 7fca2c11e700  1 -- 10.143.8.181:6824/1741 --> 10.143.8.181:6818/1234 -- osd_sub_op_reply(unknown.0.0:0 2.7ds0 0//0//-1 [scrub-reserve] ack, result = 0) v2 -- ?+1 0x8661080 con 0x5ff35a0
    -2> 2014-11-17 15:23:01.124356 7fca2c11e700  5 -- op tracker -- seq: 1721, time: 2014-11-17 15:23:01.124356, event: done, op: osd_sub_op(unknown.0.0:0 2.7ds4 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
    -1> 2014-11-17 15:23:01.125415 7fca37134700  1 -- 10.143.8.181:6824/1741 <== osd.29 10.143.8.181:6818/1234 118 ==== replica scrub(pg: 2.7ds4,from:0'0,to:116'42,epoch:205,start:0//0//-1,end:3cb6e67d//0//-1,chunky:1,deep:0,version:5) v5 ==== 126+0+0 (2329518172 0 0) 0x95dd780 con 0x5ff35a0
     0> 2014-11-17 15:23:01.128900 7fca29118700 -1 os/GenericObjectMap.cc: In function 'int GenericObjectMap::list_objects(const coll_t&, ghobject_t, int, std::vector<ghobject_t>*, ghobject_t*)' thread 7fca29118700 time 2014-11-17 15:23:01.125860
os/GenericObjectMap.cc: 1098: FAILED assert(start <= header.oid)

 ceph version 0.88 (4be687bf4480474117f56c387febc75c904036be)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xb8b095]
 2: (GenericObjectMap::list_objects(coll_t const&, ghobject_t, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x474) [0xa60384]
 3: (KeyValueStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x274) [0x923624]
 4: (KeyValueStore::collection_list_range(coll_t, ghobject_t, ghobject_t, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*)+0x164) [0x947d24]
 5: (PGBackend::objects_list_range(hobject_t const&, hobject_t const&, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, std::vector<ghobject_t, std::allocator<ghobject_t> >*)+0x106) [0x8c12c6]
 6: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x268) [0x7c1f18]
 7: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x502) [0x7c2832]
 8: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xda) [0x6c804a]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa66) [0xb7bdd6]
 10: (ThreadPool::WorkThread::entry()+0x10) [0xb7ce60]
 11: (()+0x7df3) [0x7fca49a87df3]
 12: (clone()+0x6d) [0x7fca4854e01d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -22> 2014-11-17 15:23:01.282625 7fca24606700  2 -- 10.143.8.181:6824/1741 >> 10.143.8.180:6816/36202 pipe(0x5d25800 sd=23 :46066 s=2 pgs=375 cs=1 l=0 c=0x5d3ad60).reader couldn't read tag, (11) Resource temporarily unavailable
   -21> 2014-11-17 15:23:01.282681 7fca24606700  2 -- 10.143.8.181:6824/1741 >> 10.143.8.180:6816/36202 pipe(0x5d25800 sd=23 :46066 s=2 pgs=375 cs=1 l=0 c=0x5d3ad60).fault (11) Resource temporarily unavailable
   -20> 2014-11-17 15:23:01.282697 7fca24606700  0 -- 10.143.8.181:6824/1741 >> 10.143.8.180:6816/36202 pipe(0x5d25800 sd=23 :46066 s=2 pgs=375 cs=1 l=0 c=0x5d3ad60).fault with nothing to send, going to standby
   -19> 2014-11-17 15:23:01.282747 7fca1ea6e700  2 -- 10.143.8.181:0/1741 >> 10.141.8.180:6817/36202 pipe(0x6232000 sd=66 :36760 s=2 pgs=67 cs=1 l=1 c=0x5ff7220).reader couldn't read tag, (0) Success
   -18> 2014-11-17 15:23:01.282788 7fca1ea6e700  2 -- 10.143.8.181:0/1741 >> 10.141.8.180:6817/36202 pipe(0x6232000 sd=66 :36760 s=2 pgs=67 cs=1 l=1 c=0x5ff7220).fault (0) Success
   -17> 2014-11-17 15:23:01.282771 7fca1e96d700  2 -- 10.143.8.181:0/1741 >> 10.143.8.180:6817/36202 pipe(0x6220000 sd=65 :45986 s=2 pgs=67 cs=1 l=1 c=0x5ff6880).reader couldn't read tag, (0) Success
   -16> 2014-11-17 15:23:01.282808 7fca1e96d700  2 -- 10.143.8.181:0/1741 >> 10.143.8.180:6817/36202 pipe(0x6220000 sd=65 :45986 s=2 pgs=67 cs=1 l=1 c=0x5ff6880).fault (0) Success
   -15> 2014-11-17 15:23:01.282828 7fca35931700  1 -- 10.143.8.181:0/1741 mark_down 0x5ff6880 -- 0x6220000
   -14> 2014-11-17 15:23:01.282854 7fca39e99700  2 -- 10.143.8.181:6825/1741 >> 10.143.8.180:0/36202 pipe(0x569c800 sd=119 :6825 s=2 pgs=475 cs=1 l=1 c=0x63b3c80).reader couldn't read tag, (0) Success
   -13> 2014-11-17 15:23:01.282875 7fca39e99700  2 -- 10.143.8.181:6825/1741 >> 10.143.8.180:0/36202 pipe(0x569c800 sd=119 :6825 s=2 pgs=475 cs=1 l=1 c=0x63b3c80).fault (0) Success
   -12> 2014-11-17 15:23:01.283024 7fca39d98700  2 -- 10.141.8.181:6825/1741 >> 10.143.8.180:0/36202 pipe(0x6475800 sd=118 :6825 s=2 pgs=476 cs=1 l=1 c=0x63a61a0).reader couldn't read tag, (0) Success
   -11> 2014-11-17 15:23:01.283046 7fca39d98700  2 -- 10.141.8.181:6825/1741 >> 10.143.8.180:0/36202 pipe(0x6475800 sd=118 :6825 s=2 pgs=476 cs=1 l=1 c=0x63a61a0).fault (0) Success
   -10> 2014-11-17 15:23:01.283535 7fca39c97700  2 -- 10.143.8.181:0/1741 >> 10.141.8.180:6817/36202 pipe(0x6220000 sd=66 :0 s=1 pgs=0 cs=0 l=1 c=0x5d3f7a0).connect error 10.141.8.180:6817/36202, (111) Connection refused
    -9> 2014-11-17 15:23:01.283619 7fca39d98700  2 -- 10.143.8.181:0/1741 >> 10.143.8.180:6817/36202 pipe(0x6475800 sd=65 :0 s=1 pgs=0 cs=0 l=1 c=0x5d3f4e0).connect error 10.143.8.180:6817/36202, (111) Connection refused
    -8> 2014-11-17 15:23:01.283656 7fca39d98700  2 -- 10.143.8.181:0/1741 >> 10.143.8.180:6817/36202 pipe(0x6475800 sd=65 :0 s=1 pgs=0 cs=0 l=1 c=0x5d3f4e0).fault (111) Connection refused
    -7> 2014-11-17 15:23:01.283685 7fca39d98700  0 -- 10.143.8.181:0/1741 >> 10.143.8.180:6817/36202 pipe(0x6475800 sd=65 :0 s=1 pgs=0 cs=0 l=1 c=0x5d3f4e0).fault
    -6> 2014-11-17 15:23:01.283751 7fca39c97700  2 -- 10.143.8.181:0/1741 >> 10.141.8.180:6817/36202 pipe(0x6220000 sd=66 :0 s=1 pgs=0 cs=0 l=1 c=0x5d3f7a0).fault (111) Connection refused
    -5> 2014-11-17 15:23:01.283777 7fca39c97700  0 -- 10.143.8.181:0/1741 >> 10.141.8.180:6817/36202 pipe(0x6220000 sd=66 :0 s=1 pgs=0 cs=0 l=1 c=0x5d3f7a0).fault
    -4> 2014-11-17 15:23:01.283833 7fca39d98700  2 -- 10.143.8.181:0/1741 >> 10.143.8.180:6817/36202 pipe(0x6475800 sd=65 :0 s=1 pgs=0 cs=0 l=1 c=0x5d3f4e0).connect error 10.143.8.180:6817/36202, (111) Connection refused
    -3> 2014-11-17 15:23:01.284116 7fca39d98700  2 -- 10.143.8.181:0/1741 >> 10.143.8.180:6817/36202 pipe(0x6475800 sd=65 :0 s=1 pgs=0 cs=0 l=1 c=0x5d3f4e0).fault (111) Connection refused
    -2> 2014-11-17 15:23:01.284135 7fca39c97700  2 -- 10.143.8.181:0/1741 >> 10.141.8.180:6817/36202 pipe(0x6220000 sd=66 :0 s=1 pgs=0 cs=0 l=1 c=0x5d3f7a0).connect error 10.141.8.180:6817/36202, (111) Connection refused
    -1> 2014-11-17 15:23:01.284168 7fca39c97700  2 -- 10.143.8.181:0/1741 >> 10.141.8.180:6817/36202 pipe(0x6220000 sd=66 :0 s=1 pgs=0 cs=0 l=1 c=0x5d3f7a0).fault (111) Connection refused
     0> 2014-11-17 15:23:01.290314 7fca29118700 -1 *** Caught signal (Aborted) **
 in thread 7fca29118700

 ceph version 0.88 (4be687bf4480474117f56c387febc75c904036be)
 1: /usr/bin/ceph-osd() [0xa97bb2]
 2: (()+0xf130) [0x7fca49a8f130]
 3: (gsignal()+0x39) [0x7fca4848d5c9]
 4: (abort()+0x148) [0x7fca4848ecd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fca48da09d5]
 6: (()+0x5e946) [0x7fca48d9e946]
 7: (()+0x5e973) [0x7fca48d9e973]
 8: (()+0x5eb9f) [0x7fca48d9eb9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xb8b28a]
 10: (GenericObjectMap::list_objects(coll_t const&, ghobject_t, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x474) [0xa60384]
 11: (KeyValueStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x274) [0x923624]
 12: (KeyValueStore::collection_list_range(coll_t, ghobject_t, ghobject_t, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*)+0x164) [0x947d24]
 13: (PGBackend::objects_list_range(hobject_t const&, hobject_t const&, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, std::vector<ghobject_t, std::allocator<ghobject_t> >*)+0x106) [0x8c12c6]
 14: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x268) [0x7c1f18]
 15: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x502) [0x7c2832]
 16: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xda) [0x6c804a]
 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa66) [0xb7bdd6]
 18: (ThreadPool::WorkThread::entry()+0x10) [0xb7ce60]
 19: (()+0x7df3) [0x7fca49a87df3]
 20: (clone()+0x6d) [0x7fca4854e01d]


The OSDs will not start again.

Because this is a different Ceph version and a different error, I opened a new ticket.

Thanks!
Kenneth
