Support #27203


osd down while bucket is deleting

Added by 伟杰 谭 over 5 years ago. Updated over 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Component(RADOS): OSD
Pull request ID:
Description

My environment is:
[tanweijie@gz-ceph-52-202 ~]$ ceph --version
ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
[tanweijie@gz-ceph-52-202 ~]$ cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)

I am deleting buckets with a Python script; 24 buckets are being deleted at the same time. My Python script is:
==============================================================================================
#!/bin/python
import os
import boto
import time
import boto.s3.connection
from boto.s3.key import Key
from multiprocessing import Process
from multiprocessing import Queue as bigqueue
from Queue import Queue

acckey = ''
seckey = ''
mhost = '172.25.52.202'

def get_buckets(all_buckets):
    '''
    Get buckets from radosgw
    '''
    # bucket names look like "YYYY-mm-dd-HH", built from dates 30-41 days back
    for y in ['41','40','39','38','37','36','35','34','33','32','31','30']:
        last_day = os.popen('date -d "'+y+'day" +"%Y-%m-%d-"').readline().strip()
        for x in ['06','07','08','09','10','11']:
            all_buckets.put(last_day+x)
            print last_day+x
    return all_buckets.qsize()

class Delete_buckets(Process):
    '''
    Delete specified bucket
    '''
    def __init__(self, bucketname, mhost):
        Process.__init__(self)
        self.bucketname = bucketname
        self.mhost = mhost

    def run(self):
        while True:
            self.nowbucket = ''
            try:
                self.nowbucket = self.bucketname.get(timeout=2)
                print 'get bucket ' + self.nowbucket
            except:
                print 'Exiting by queue is empty!'
                return
            # list all object names in the bucket via radosgw-admin
            self.bucket_file = 'sudo radosgw-admin bi list --bucket='+self.nowbucket+'|grep \'"name"\' |cut -d ":" -f2 |cut -d \'"\' -f 2'
            self.conn = boto.connect_s3(aws_access_key_id=acckey,
                                        aws_secret_access_key=seckey,
                                        host=self.mhost,
                                        port=7480,
                                        is_secure=False,
                                        calling_format=boto.s3.connection.OrdinaryCallingFormat())
            self.dbucket = self.conn.get_bucket(self.nowbucket)
            self.all_files = os.popen(self.bucket_file).readlines()
            # delete every object, then the bucket itself
            for x in self.all_files:
                self.dbucket.delete_key(x.strip())
            try:
                self.conn.delete_bucket(self.nowbucket)
                print self.nowbucket + ' has been deleted.'
            except Exception, e:
                print str(e)
                pass

if __name__ == '__main__':
    procs = 6
    all_proc = Queue()
    del_buckets = bigqueue()
    #cmd = 'sudo radosgw-admin bucket list |grep \'python-\'|awk -F \'"\' \'{print $2}\''
    bcount = get_buckets(all_buckets=del_buckets)
    print 'begin to delete {bcount} bucket'.format(bcount=str(bcount))
    # start the initial pool of worker processes
    for x in xrange(0, procs):
        tempp = Delete_buckets(bucketname=del_buckets, mhost=mhost)
        tempp.daemon = True
        all_proc.put(tempp)
        tempp.start()
        time.sleep(10)
    # keep the pool filled while buckets remain in the queue
    while all_proc.qsize() > 0:
        y = all_proc.get(timeout=10)
        if not y.is_alive():
            y.join()
            if del_buckets.qsize() > 2:
                tempp = Delete_buckets(bucketname=del_buckets, mhost=mhost)
                tempp.daemon = True
                all_proc.put(tempp)
                tempp.start()
        else:
            all_proc.put(y)
        time.sleep(30)
==============================================================================================
Meanwhile, my GC args had been changed:
sudo ceph tell osd.* injectargs '--rgw_gc_max_objs 4096 --rgw_gc_obj_min_wait 1 --rgw_gc_processor_max_time 1 --rgw_gc_processor_period 7200'
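A note on verifying the injected values: the rgw_gc_* options are consumed by the radosgw daemons rather than the OSDs, so it may be worth confirming where they actually took effect. A minimal sketch, assuming the admin socket is reachable on the local host; osd.168 is just an example daemon, and <rgw-socket-name> is a placeholder for the local radosgw admin socket name:

# check what a given OSD daemon currently reports (osd.168 is only an example)
sudo ceph daemon osd.168 config get rgw_gc_max_objs

# the GC settings that matter are the ones seen by radosgw itself;
# <rgw-socket-name> is a placeholder to substitute
sudo ceph daemon <rgw-socket-name> config show | grep rgw_gc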

Then, after the deleting script had been running for a day, my ceph cluster ran into trouble: two OSDs got into trouble. Their logs are:
------------------------------------------------------------------------------------------------------------------------------------
2018-08-23 12:26:44.407350 7f1890d82700 -1 osd.168 2863 heartbeat_check: no reply from 172.25.52.201:6803 osd.169 since back 2018-08-23 12:26:06.744007 front 2018-08-23 12:26:06.744007 (cutoff 2018-08-23 12:26:24.407349)
2018-08-23 12:26:44.407364 7f1890d82700 -1 osd.168 2863 heartbeat_check: no reply from 172.25.52.203:6891 osd.172 since back 2018-08-23 12:26:06.744007 front 2018-08-23 12:26:06.744007 (cutoff 2018-08-23 12:26:24.407349)
2018-08-23 12:26:45.407450 7f1890d82700 -1 osd.168 2863 heartbeat_check: no reply from 172.25.52.201:6803 osd.169 since back 2018-08-23 12:26:06.744007 front 2018-08-23 12:26:06.744007 (cutoff 2018-08-23 12:26:25.407449)
2018-08-23 12:26:45.407463 7f1890d82700 -1 osd.168 2863 heartbeat_check: no reply from 172.25.52.203:6891 osd.172 since back 2018-08-23 12:26:06.744007 front 2018-08-23 12:26:06.744007 (cutoff 2018-08-23 12:26:25.407449)
2018-08-23 12:26:46.407551 7f1890d82700 -1 osd.168 2863 heartbeat_check: no reply from 172.25.52.201:6803 osd.169 since back 2018-08-23 12:26:06.744007 front 2018-08-23 12:26:06.744007 (cutoff 2018-08-23 12:26:26.407550)
2018-08-23 12:26:46.407568 7f1890d82700 -1 osd.168 2863 heartbeat_check: no reply from 172.25.52.203:6891 osd.172 since back 2018-08-23 12:26:06.744007 front 2018-08-23 12:26:06.744007 (cutoff 2018-08-23 12:26:26.407550)
2018-08-23 12:26:47.084155 7f1877d50700 1 osd.168 pg_epoch: 2864 pg[7.8c( v 2858'36111104 (2858'36109600,2858'36111104] local-lis/les=2861/2862 n=194 ec=700/700 lis/c 2861/2861 les/c/f 2862/2862/0 2864/2864/2854) [168] r=0 lpr=2864 pi=[2861,2864)/1 luod=0'0 crt=2858'36111104 lcod 2858'36111103 mlcod 0'0 active] start_peering_interval up [168,172] -> [168], acting [168,172] -> [168], acting_primary 168 -> 168, up_primary 168 -> 168, role 0 -> 0, features acting 2305244844532236283 upacting 2305244844532236283
2018-08-23 12:26:47.084264 7f1878551700 1 osd.168 pg_epoch: 2864 pg[7.57( v 2858'34530914 (2858'34529400,2858'34530914] local-lis/les=2861/2862 n=206 ec=700/700 lis/c 2861/2861 les/c/f 2862/2862/0 2864/2864/2854) [168] r=0 lpr=2864 pi=[2861,2864)/1 luod=0'0 crt=2858'34530914 lcod 2858'34530913 mlcod 0'0 active] start_peering_interval up [168,172] -> [168], acting [168,172] -> [168], acting_primary 168 -> 168, up_primary 168 -> 168, role 0 -> 0, features acting 2305244844532236283 upacting 2305244844532236283
2018-08-23 12:26:47.084281 7f1877d50700 1 osd.168 pg_epoch: 2864 pg[7.8c( v 2858'36111104 (2858'36109600,2858'36111104] local-lis/les=2861/2862 n=194 ec=700/700 lis/c 2861/2861 les/c/f 2862/2862/0 2864/2864/2854) [168] r=0 lpr=2864 pi=[2861,2864)/1 crt=2858'36111104 lcod 2858'36111103 mlcod 0'0 unknown] state<Start>: transitioning to Primary
2018-08-23 12:26:47.084364 7f1878551700 1 osd.168 pg_epoch: 2864 pg[7.57( v 2858'34530914 (2858'34529400,2858'34530914] local-lis/les=2861/2862 n=206 ec=700/700 lis/c 2861/2861 les/c/f 2862/2862/0 2864/2864/2854) [168] r=0 lpr=2864 pi=[2861,2864)/1 crt=2858'34530914 lcod 2858'34530913 mlcod 0'0 unknown] state<Start>: transitioning to Primary
2018-08-23 12:26:47.084501 7f1877d50700 1 osd.168 pg_epoch: 2864 pg[7.84( v 2863'38123537 (2863'38122000,2863'38123537] local-lis/les=2861/2862 n=209 ec=700/700 lis/c 2861/2861 les/c/f 2862/2862/0 2864/2864/2864) [168] r=0 lpr=2864 pi=[2861,2864)/1 luod=0'0 crt=2863'38123537 lcod 2863'38123536 mlcod 0'0 active] start_peering_interval up [172,168] -> [168], acting [172,168] -> [168], acting_primary 172 -> 168, up_primary 172 -> 168, role 1 -> 0, features acting 2305244844532236283 upacting 2305244844532236283
......
2018-08-23 12:30:09.463261 7f1894d43700 0 -- 172.25.59.201:6802/7906426 >> 172.25.59.201:6834/1162079 conn(0x7f192c2ac800 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 2 vs existing csq=1 existing_state=STATE_STANDBY
2018-08-23 12:30:09.463294 7f1895544700 0 -- 172.25.59.201:6802/7906426 >> 172.25.59.201:6878/1162054 conn(0x7f193b2df000 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 1 vs existing csq=1 existing_state=STATE_STANDBY
2018-08-23 12:30:09.463295 7f1894542700 0 -- 172.25.59.201:6802/7906426 >> 172.25.59.201:6824/1162010 conn(0x7f19236e9000 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 2 vs existing csq=1 existing_state=STATE_STANDBY
2018-08-23 12:30:09.463405 7f1895544700 0 -- 172.25.59.201:6802/7906426 >> 172.25.59.201:6864/1162064 conn(0x7f1926a7f000 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
2018-08-23 12:30:09.463454 7f1895544700 0 -- 172.25.59.201:6802/7906426 >> 172.25.59.201:6878/1162054 conn(0x7f193b2df000 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 2 vs existing csq=1 existing_state=STATE_STANDBY
2018-08-23 12:30:24.254324 7f1895544700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:24.254469 7f1894d43700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:24.270953 7f1894d43700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:24.270961 7f1894542700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:24.471704 7f1894d43700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:24.471928 7f1894d43700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:25.773539 7f1894542700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:25.773544 7f1894d43700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:26.251446 7f1895544700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:26.251474 7f1894d43700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:26.273827 7f1894d43700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:26.273834 7f1894542700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:26.963525 7f1895544700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:26.963541 7f1894542700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:27.129003 7f18935ce700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:28.574164 7f1894542700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:28.574181 7f1894d43700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
------------------------------------------------------------------------------------------------------------------------------------
And my cluster also shows errors:
===================================================================================================================
[tanweijie@gz-ceph-52-202 ~]$ sudo ceph -s
  cluster:
    id:     b028cb6a-1a20-45ac-be31-08946fc137b7
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
            1 osds down
            Reduced data availability: 65 pgs inactive, 33 pgs down, 30 pgs peering
            Degraded data redundancy: 5937/5382785588 objects degraded (0.000%), 83 pgs degraded
            255 slow requests are blocked > 32 sec

  services:
    mon: 3 daemons, quorum gz-ceph-52-201,gz-ceph-52-202,gz-ceph-52-203
    mgr: gz-ceph-52-203(active), standbys: gz-ceph-52-204
    osd: 176 osds: 175 up, 176 in
         flags noscrub,nodeep-scrub
    rgw: 4 daemons active

  data:
    pools:   6 pools, 17152 pgs
    objects: 2566M objects, 460 TB
    usage:   1093 TB used, 438 TB / 1531 TB avail
    pgs:     0.601% pgs not active
             5937/5382785588 objects degraded (0.000%)
             16953 active+clean
             77    active+undersized+degraded
             38    peering
             33    down
             30    activating
             13    active+undersized
             2     active
             2     activating+degraded
             2     active+recovery_wait+degraded
             1     active+degraded
             1     active+recovering+degraded

  io:
    client: 420 B/s rd, 0 op/s rd, 0 op/s wr
===================================================================================================================

Then, when I stop my deleting script, ceph goes back to normal. So what is going on?