Bug #10988

closed

Multiple OSDs getting mark_down: common/Thread.cc: 128: FAILED assert(ret == 0)

Added by karan singh about 9 years ago. Updated about 9 years ago.

Status:
Rejected
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
osd mark_down
Backport:
0.80.7
Regression:
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This weekend I saw some weird behaviour in my cluster: more than 50% of the OSDs are down and out. The problem occurred when I increased the pg_num and pgp_num values for a pool.
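
For context, the pg_num/pgp_num increase was done with the usual pool-set commands, along the lines of the following (the pool name and target count here are placeholders, not the exact values I used):

# ceph osd pool set <pool-name> pg_num 4096
# ceph osd pool set <pool-name> pgp_num 4096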

The cluster is almost hung.

# ceph -s
2015-03-02 18:41:03.308460 7feb1affd700  1 monclient(hunting): found mon.pouta-s01
2015-03-02 18:41:03.308537 7feb21bac700  5 monclient: authenticate success, global_id 17199
    cluster 2bd3283d-67ef-4316-8b7e-d8f4747eae33
     health HEALTH_WARN 4764 pgs degraded; 1397 pgs down; 1401 pgs peering; 1423 pgs stale; 1401 pgs stuck inactive; 1423 pgs stuck stale; 9230 pgs stuck unclean; 7 requests are blocked > 32 sec; recovery 4899/30477 objects degraded (16.074%)
     monmap e3: 3 mons at {pouta-s01=10.xxx.xx.1:6789/0,pouta-s02=10.xxx.xx.2:6789/0,pouta-s03=10.xxx.xx.3:6789/0}, election epoch 22, quorum 0,1,2 pouta-s01,pouta-s02,pouta-s03
     osdmap e3979: 240 osds: 105 up, 105 in
      pgmap v24883: 17408 pgs, 13 pools, 41533 MB data, 10159 objects
            164 GB used, 381 TB / 381 TB avail
            4899/30477 objects degraded (16.074%)
                   6 stale+active+clean
                 502 active
                   1 peering
                   8 stale+down+remapped+peering
                1072 active+degraded+remapped
                8171 active+clean
                  55 down+remapped+peering
                1079 stale+active+degraded
                  94 stale+active+remapped
                 152 stale+down+peering
                2541 active+degraded
                   1 active+clean+replay
                2460 active+remapped
                1182 down+peering
                   9 stale+active
                   3 stale+peering
                  72 stale+active+degraded+remapped
recovery io 66096 kB/s, 16 objects/s
#

After increasing the debug level on the OSDs, I found the messages below on multiple OSDs.


--- begin dump of recent events ---

   -17> 2015-03-02 17:22:12.096104 7fb790400700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fb790400700 time 2015-03-02 17:22:12.092732
common/Thread.cc: 128: FAILED assert(ret == 0)

 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: (Thread::create(unsigned long)+0x8a) [0xaf36fa]
 2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xae7a1a]
 3: (Accepter::entry()+0x265) [0xb5bb65]
 4: /lib64/libpthread.so.0() [0x3c8a6079d1]
 5: (clone()+0x6d) [0x3c8a2e89dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

   -16> 2015-03-02 17:22:12.111211 7fb790c01700  1 -- 10.100.50.1:6908/17254 <== osd.159 10.100.50.4:0/21437 118 ==== osd_ping(ping e3527 stamp 2015-03-02 17:22:12.106135) v2 ==== 47+0+0 (2059412944 0 0) 0xbf16200 con 0x9d502c0
   -15> 2015-03-02 17:22:12.111246 7fb790c01700  1 -- 10.100.50.1:6908/17254 --> 10.100.50.4:0/21437 -- osd_ping(ping_reply e3527 stamp 2015-03-02 17:22:12.106135) v2 -- ?+0 0xcc10700 con 0x9d502c0
   -14> 2015-03-02 17:22:12.112992 7fb78fbff700  1 -- 10.100.50.1:6892/17254 <== osd.159 10.100.50.4:0/21437 118 ==== osd_ping(ping e3527 stamp 2015-03-02 17:22:12.106135) v2 ==== 47+0+0 (2059412944 0 0) 0xbef96c0 con 0x9de2520
   -13> 2015-03-02 17:22:12.164890 7fb74bc48700  1 -- 10.100.50.1:6870/17254 >> :/0 pipe(0xa20df00 sd=509 :6870 s=0 pgs=0 cs=0 l=0 c=0x3fa4200).accept sd=509 10.100.50.2:52081/0
   -12> 2015-03-02 17:22:12.175014 7fb74bc48700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7fb74bc48700 time 2015-03-02 17:22:12.174041
common/Thread.cc: 128: FAILED assert(ret == 0)

 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: (Thread::create(unsigned long)+0x8a) [0xaf36fa]
 2: (Pipe::accept()+0x4ac5) [0xb47a85]
 3: (Pipe::reader()+0x1bae) [0xb4a8ce]
 4: (Pipe::Reader::entry()+0xd) [0xb4cdad]
 5: /lib64/libpthread.so.0() [0x3c8a6079d1]
 6: (clone()+0x6d) [0x3c8a2e89dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

   -11> 2015-03-02 17:22:12.228677 7fb792404700  1 -- 10.100.50.1:6870/17254 <== osd.49 10.100.50.1:6873/29869 457 ==== osd_map(3528..3528 src has 1256..3528) v3 ==== 2603+0+0 (3804348554 0 0) 0xc3e5340 con 0x96644c0
   -10> 2015-03-02 17:22:12.228856 7fb792404700  3 osd.4 3527 handle_osd_map epochs [3528,3528], i have 3527, src has [1256,3528]
    -9> 2015-03-02 17:22:12.236390 7fb792404700  1 -- 10.100.50.1:6870/17254 mark_down 10.100.50.2:6944/6477 -- 0x92b5f00
    -8> 2015-03-02 17:22:12.594614 7fb744fdd700  0 -- 10.100.50.1:6870/17254 >> 10.100.50.1:6854/17443 pipe(0x944be80 sd=268 :56120 s=2 pgs=114 cs=1 l=0 c=0xab6d280).fault with nothing to send, going to standby
    -7> 2015-03-02 17:22:12.657433 7fb73df6d700  0 -- 10.100.50.1:6870/17254 >> 10.100.50.1:6817/1851 pipe(0xa20be80 sd=135 :6870 s=2 pgs=208 cs=1 l=0 c=0xf228dc0).fault with nothing to send, going to standby
    -6> 2015-03-02 17:22:12.664885 7fb793406700  1 -- 10.100.50.1:6821/17254 <== mon.1 10.100.50.2:6789/0 8 ==== osd_map(3528..3528 src has 1256..3528) v3 ==== 2603+0+0 (3804348554 0 0) 0xc3e7980 con 0x408d6a0
    -5> 2015-03-02 17:22:12.730904 7fb77c244700  0 -- 10.100.50.1:6870/17254 >> 10.100.50.1:6809/45513 pipe(0x92b7800 sd=235 :47560 s=2 pgs=137 cs=1 l=0 c=0x9352520).fault with nothing to send, going to standby
    -4> 2015-03-02 17:22:12.883314 7fb7521ad700  1 -- 10.100.50.1:6870/17254 >> :/0 pipe(0x92b0c80 sd=71 :6870 s=0 pgs=0 cs=0 l=0 c=0x3fa2940).accept sd=71 10.100.50.2:52810/0
    -3> 2015-03-02 17:22:12.903263 7fb7636c1700  0 -- 10.100.50.1:6870/17254 >> 10.100.50.1:6815/54102 pipe(0x9977300 sd=256 :6870 s=2 pgs=48 cs=1 l=0 c=0xa712ec0).fault with nothing to send, going to standby
    -2> 2015-03-02 17:22:12.940200 7fb76bc46700  0 -- 10.100.50.1:6870/17254 >> 10.100.50.1:6848/26104 pipe(0x92b6e00 sd=653 :6870 s=2 pgs=17 cs=1 l=0 c=0x9282100).fault with nothing to send, going to standby
    -1> 2015-03-02 17:22:12.944702 7fb7521ad700  1 -- 10.100.50.1:6870/17254 >> :/0 pipe(0x92b6900 sd=71 :6870 s=0 pgs=0 cs=0 l=0 c=0x3fa3860).accept sd=71 10.100.50.4:35590/0
     0> 2015-03-02 17:22:12.979390 7fb775ee8700  0 -- 10.100.50.1:6870/17254 >> 10.100.50.1:6849/6761 pipe(0x944e680 sd=290 :6870 s=2 pgs=134 cs=1 l=0 c=0x9661080).fault with nothing to send, going to standby
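
From what I can tell, the "FAILED assert(ret == 0)" in Thread::create() means pthread_create() returned an error, typically EAGAIN when the process or system hits its thread/task limits (the backtrace shows the failure while spawning new Accepter/Pipe threads, and a pg_num increase triggers a lot of re-peering connections at once). As a rough diagnostic sketch, not the exact session I ran, the relevant limits can be checked with something like:

# sysctl kernel.pid_max kernel.threads-max
# ulimit -u
# grep Threads /proc/$(pidof -s ceph-osd)/status
# sysctl vm.max_map_count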

  • At the time of recovery I observed CPU, memory, and network usage; everything seemed normal.
  • I tried restarting all the OSDs with
     service ceph restart osd -a 
    After a few minutes, multiple OSDs again started to go down and out (see the monitoring sketch below).
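
To keep an eye on whether the OSDs keep flapping after a restart, something along these lines can be used (a general sketch, not the exact commands from my session):

# ceph -w
# watch -n 5 'ceph osd stat'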

ceph version 0.80.7
CentOS 6.5
Kernel 3.17.2-1.el6.elrepo.x86_64

Could you please suggest how to fix this?
