Support #27203: osd down while bucket is deleting

Added by 伟杰 谭 over 5 years ago. Updated over 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Component(RADOS): OSD
Pull request ID:

Description

My environment is:
[tanweijie@gz-ceph-52-202 ~]$ ceph --version
ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
[tanweijie@gz-ceph-52-202 ~]$ cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)

I am deleting buckets with a Python script; it deletes 24 buckets at the same time. My Python script is:
==============================================================================================
#!/bin/python
import os
import boto
import time
import boto.s3.connection
from boto.s3.key import Key
from multiprocessing import Process
from multiprocessing import Queue as bigqueue
from Queue import Queue

acckey = ''
seckey = ''
mhost = '172.25.52.202'


def get_buckets(all_buckets):
    '''
    Get buckets from radosgw
    '''
    # Bucket names look like "YYYY-MM-DD-HH" for dates 30-41 days ago.
    for y in ['41', '40', '39', '38', '37', '36', '35', '34', '33', '32', '31', '30']:
        last_day = os.popen('date -d "-' + y + 'day" +"%Y-%m-%d-"').readline().strip()
        for x in ['06', '07', '08', '09', '10', '11']:
            all_buckets.put(last_day + x)
            print last_day + x
    return all_buckets.qsize()


class Delete_buckets(Process):
    '''
    Delete specified bucket
    '''
    def __init__(self, bucketname, mhost):
        Process.__init__(self)
        self.bucketname = bucketname
        self.mhost = mhost

    def run(self):
        while True:
            self.nowbucket = ''
            try:
                self.nowbucket = self.bucketname.get(timeout=2)
                print 'get bucket ' + self.nowbucket
            except:
                print 'Exiting by queue is empty!'
                return
            # List every object name in the bucket via the bucket index.
            self.bucket_file = 'sudo radosgw-admin bi list --bucket=' + self.nowbucket + '|grep \'"name"\' |cut -d ":" -f2 |cut -d \'"\' -f 2'
            self.conn = boto.connect_s3(aws_access_key_id=acckey,
                                        aws_secret_access_key=seckey,
                                        host=self.mhost,
                                        port=7480,
                                        is_secure=False,
                                        calling_format=boto.s3.connection.OrdinaryCallingFormat())
            self.dbucket = self.conn.get_bucket(self.nowbucket)
            self.all_files = os.popen(self.bucket_file).readlines()
            # Delete the objects one by one, then the bucket itself.
            for x in self.all_files:
                self.dbucket.delete_key(x.strip())
            try:
                self.conn.delete_bucket(self.nowbucket)
                print self.nowbucket + ' has been deleted.'
            except Exception, e:
                print str(e)
                pass


if __name__ == '__main__':
    procs = 6
    all_proc = Queue()
    del_buckets = bigqueue()
    #cmd = 'sudo radosgw-admin bucket list |grep \'python-\'|awk -F \'"\' \'{print $2}\''
    bcount = get_buckets(all_buckets=del_buckets)
    print 'begin to delete {bcount} bucket'.format(bcount=str(bcount))
    # Start the initial pool of delete workers.
    for x in xrange(0, procs):
        tempp = Delete_buckets(bucketname=del_buckets, mhost=mhost)
        tempp.daemon = True
        all_proc.put(tempp)
        tempp.start()
        time.sleep(10)
    # Keep up to `procs` workers running while buckets remain in the queue.
    while all_proc.qsize() > 0:
        y = all_proc.get(timeout=10)
        if not y.is_alive():
            y.join()
            if del_buckets.qsize() > 2:
                tempp = Delete_buckets(bucketname=del_buckets, mhost=mhost)
                tempp.daemon = True
                all_proc.put(tempp)
                tempp.start()
        else:
            all_proc.put(y)
        time.sleep(30)
==============================================================================================
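For reference, roughly the same cleanup could be done purely through the S3 API instead of shelling out to radosgw-admin for every bucket. The sketch below uses boto's multi-object delete to remove keys in batches; the access keys and the batch size are placeholders.

# Sketch only: drain one bucket with S3 multi-object deletes (boto 2).
import boto
import boto.s3.connection

conn = boto.connect_s3(aws_access_key_id='ACCESS_KEY',        # placeholder
                       aws_secret_access_key='SECRET_KEY',    # placeholder
                       host='172.25.52.202',
                       port=7480,
                       is_secure=False,
                       calling_format=boto.s3.connection.OrdinaryCallingFormat())

def drain_bucket(name, batch_size=1000):
    bucket = conn.get_bucket(name)
    batch = []
    # bucket.list() pages through every key; delete_keys() sends one
    # multi-object delete request per batch instead of one request per object.
    for key in bucket.list():
        batch.append(key.name)
        if len(batch) >= batch_size:
            bucket.delete_keys(batch)
            batch = []
    if batch:
        bucket.delete_keys(batch)
    conn.delete_bucket(name)

Fewer, larger delete requests at the S3 layer do not change how RGW garbage-collects the underlying RADOS objects, but they avoid forking a radosgw-admin pipeline for every bucket.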
Meanwhile, my GC arguments have been changed:
sudo ceph tell osd.* injectargs '--rgw_gc_max_objs 4096 --rgw_gc_obj_min_wait 1 --rgw_gc_processor_max_time 1 --rgw_gc_processor_period 7200'
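As a side note, injectargs only changes the running daemons; one way to confirm which values a daemon is actually using is its admin socket. A minimal sketch (the daemon name is a placeholder, and it has to be run on the host that carries that daemon):

import json
import subprocess

def running_value(daemon, option):
    # "ceph daemon <name> config show" dumps the daemon's running
    # configuration over its local admin socket as JSON.
    out = subprocess.check_output(['sudo', 'ceph', 'daemon', daemon, 'config', 'show'])
    return json.loads(out).get(option)

# Placeholder daemon name; run on the host where it lives.
print running_value('osd.168', 'rgw_gc_max_objs')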

Then, after the deleting script had run for a day, my Ceph cluster got into trouble: two OSDs went down. Their logs are:
------------------------------------------------------------------------------------------------------------------------------------
2018-08-23 12:26:44.407350 7f1890d82700 -1 osd.168 2863 heartbeat_check: no reply from 172.25.52.201:6803 osd.169 since back 2018-08-23 12:26:06.744007 front 2018-08-23 12:26:06.744007 (cutoff 2018-08-23 12:26:24.407349)
2018-08-23 12:26:44.407364 7f1890d82700 -1 osd.168 2863 heartbeat_check: no reply from 172.25.52.203:6891 osd.172 since back 2018-08-23 12:26:06.744007 front 2018-08-23 12:26:06.744007 (cutoff 2018-08-23 12:26:24.407349)
2018-08-23 12:26:45.407450 7f1890d82700 -1 osd.168 2863 heartbeat_check: no reply from 172.25.52.201:6803 osd.169 since back 2018-08-23 12:26:06.744007 front 2018-08-23 12:26:06.744007 (cutoff 2018-08-23 12:26:25.407449)
2018-08-23 12:26:45.407463 7f1890d82700 -1 osd.168 2863 heartbeat_check: no reply from 172.25.52.203:6891 osd.172 since back 2018-08-23 12:26:06.744007 front 2018-08-23 12:26:06.744007 (cutoff 2018-08-23 12:26:25.407449)
2018-08-23 12:26:46.407551 7f1890d82700 -1 osd.168 2863 heartbeat_check: no reply from 172.25.52.201:6803 osd.169 since back 2018-08-23 12:26:06.744007 front 2018-08-23 12:26:06.744007 (cutoff 2018-08-23 12:26:26.407550)
2018-08-23 12:26:46.407568 7f1890d82700 -1 osd.168 2863 heartbeat_check: no reply from 172.25.52.203:6891 osd.172 since back 2018-08-23 12:26:06.744007 front 2018-08-23 12:26:06.744007 (cutoff 2018-08-23 12:26:26.407550)
2018-08-23 12:26:47.084155 7f1877d50700 1 osd.168 pg_epoch: 2864 pg[7.8c( v 2858'36111104 (2858'36109600,2858'36111104] local-lis/les=2861/2862 n=194 ec=700/700 lis/c 2861/2861 les/c/f 2862/2862/0 2864/2864/2854) [168] r=0 lpr=2864 pi=[2861,2864)/1 luod=0'0 crt=2858'36111104 lcod 2858'36111103 mlcod 0'0 active] start_peering_interval up [168,172] -> [168], acting [168,172] -> [168], acting_primary 168 -> 168, up_primary 168 -> 168, role 0 -> 0, features acting 2305244844532236283 upacting 2305244844532236283
2018-08-23 12:26:47.084264 7f1878551700 1 osd.168 pg_epoch: 2864 pg[7.57( v 2858'34530914 (2858'34529400,2858'34530914] local-lis/les=2861/2862 n=206 ec=700/700 lis/c 2861/2861 les/c/f 2862/2862/0 2864/2864/2854) [168] r=0 lpr=2864 pi=[2861,2864)/1 luod=0'0 crt=2858'34530914 lcod 2858'34530913 mlcod 0'0 active] start_peering_interval up [168,172] -> [168], acting [168,172] -> [168], acting_primary 168 -> 168, up_primary 168 -> 168, role 0 -> 0, features acting 2305244844532236283 upacting 2305244844532236283
2018-08-23 12:26:47.084281 7f1877d50700 1 osd.168 pg_epoch: 2864 pg[7.8c( v 2858'36111104 (2858'36109600,2858'36111104] local-lis/les=2861/2862 n=194 ec=700/700 lis/c 2861/2861 les/c/f 2862/2862/0 2864/2864/2854) [168] r=0 lpr=2864 pi=[2861,2864)/1 crt=2858'36111104 lcod 2858'36111103 mlcod 0'0 unknown] state<Start>: transitioning to Primary
2018-08-23 12:26:47.084364 7f1878551700 1 osd.168 pg_epoch: 2864 pg[7.57( v 2858'34530914 (2858'34529400,2858'34530914] local-lis/les=2861/2862 n=206 ec=700/700 lis/c 2861/2861 les/c/f 2862/2862/0 2864/2864/2854) [168] r=0 lpr=2864 pi=[2861,2864)/1 crt=2858'34530914 lcod 2858'34530913 mlcod 0'0 unknown] state<Start>: transitioning to Primary
2018-08-23 12:26:47.084501 7f1877d50700 1 osd.168 pg_epoch: 2864 pg[7.84( v 2863'38123537 (2863'38122000,2863'38123537] local-lis/les=2861/2862 n=209 ec=700/700 lis/c 2861/2861 les/c/f 2862/2862/0 2864/2864/2864) [168] r=0 lpr=2864 pi=[2861,2864)/1 luod=0'0 crt=2863'38123537 lcod 2863'38123536 mlcod 0'0 active] start_peering_interval up [172,168] -> [168], acting [172,168] -> [168], acting_primary 172 -> 168, up_primary 172 -> 168, role 1 -> 0, features acting 2305244844532236283 upacting 2305244844532236283
......
2018-08-23 12:30:09.463261 7f1894d43700 0 -- 172.25.59.201:6802/7906426 >> 172.25.59.201:6834/1162079 conn(0x7f192c2ac800 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 2 vs existing csq=1 existing_state=STATE_STANDBY
2018-08-23 12:30:09.463294 7f1895544700 0 -- 172.25.59.201:6802/7906426 >> 172.25.59.201:6878/1162054 conn(0x7f193b2df000 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 1 vs existing csq=1 existing_state=STATE_STANDBY
2018-08-23 12:30:09.463295 7f1894542700 0 -- 172.25.59.201:6802/7906426 >> 172.25.59.201:6824/1162010 conn(0x7f19236e9000 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 2 vs existing csq=1 existing_state=STATE_STANDBY
2018-08-23 12:30:09.463405 7f1895544700 0 -- 172.25.59.201:6802/7906426 >> 172.25.59.201:6864/1162064 conn(0x7f1926a7f000 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
2018-08-23 12:30:09.463454 7f1895544700 0 -- 172.25.59.201:6802/7906426 >> 172.25.59.201:6878/1162054 conn(0x7f193b2df000 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 2 vs existing csq=1 existing_state=STATE_STANDBY
2018-08-23 12:30:24.254324 7f1895544700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:24.254469 7f1894d43700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:24.270953 7f1894d43700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:24.270961 7f1894542700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:24.471704 7f1894d43700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:24.471928 7f1894d43700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:25.773539 7f1894542700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:25.773544 7f1894d43700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:26.251446 7f1895544700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:26.251474 7f1894d43700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:26.273827 7f1894d43700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:26.273834 7f1894542700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:26.963525 7f1895544700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:26.963541 7f1894542700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:27.129003 7f18935ce700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:28.574164 7f1894542700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
2018-08-23 12:30:28.574181 7f1894d43700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1871543700' had timed out after 15
------------------------------------------------------------------------------------------------------------------------------------
And my cluster also shows errors:
===================================================================================================================
[tanweijie@gz-ceph-52-202 ~]$ sudo ceph -s
cluster:
id: b028cb6a-1a20-45ac-be31-08946fc137b7
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
1 osds down
Reduced data availability: 65 pgs inactive, 33 pgs down, 30 pgs peering
Degraded data redundancy: 5937/5382785588 objects degraded (0.000%), 83 pgs degraded
255 slow requests are blocked > 32 sec

services:
mon: 3 daemons, quorum gz-ceph-52-201,gz-ceph-52-202,gz-ceph-52-203
mgr: gz-ceph-52-203(active), standbys: gz-ceph-52-204
osd: 176 osds: 175 up, 176 in
flags noscrub,nodeep-scrub
rgw: 4 daemons active
data:
pools: 6 pools, 17152 pgs
objects: 2566M objects, 460 TB
usage: 1093 TB used, 438 TB / 1531 TB avail
pgs: 0.601% pgs not active
5937/5382785588 objects degraded (0.000%)
16953 active+clean
77 active+undersized+degraded
38 peering
33 down
30 activating
13 active+undersized
2 active
2 activating+degraded
2 active+recovery_wait+degraded
1 active+degraded
1 active+recovering+degraded
io:
client: 420 B/s rd, 0 op/s rd, 0 op/s wr
===================================================================================================================

Then, when I stop my deleting script, Ceph goes back to normal. So what is going on?
Actions #1

Updated by 伟杰 谭 over 5 years ago

The formatting is ugly; my fault.

Actions #2

Updated by 伟杰 谭 over 5 years ago

Actually, this issue still upsets me:

-2> 2018-08-23 16:14:52.673287 7f3aeb536700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3acacae700' had timed out after 15
-1> 2018-08-23 16:14:52.673307 7f3aeb536700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3acacae700' had suicide timed out after 150
0> 2018-08-23 16:14:52.691775 7f3acacae700 -1 *** Caught signal (Aborted) **
in thread 7f3acacae700 thread_name:tp_osd_tp
ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
1: (()+0xa41df1) [0x7f3af30dddf1]
2: (()+0xf100) [0x7f3aefb4e100]
3: (pread64()+0x33) [0x7f3aefb4da93]
4: (KernelDevice::direct_read_unaligned(unsigned long, unsigned long, char*)+0x88) [0x7f3af30ba5b8]
5: (KernelDevice::read_random(unsigned long, unsigned long, char*, bool)+0x5c2) [0x7f3af30bb0c2]
6: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)+0x128) [0x7f3af308f038]
7: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x20) [0x7f3af30b7160]
8: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x2fd) [0x7f3af348210d]
9: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::ImmutableCFOptions const&, bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&)+0x2ae) [0x7f3af345e0fe]
10: (()+0xdb3767) [0x7f3af344f767]
11: (rocksdb::BlockBasedTable::MaybeLoadDataBlockToCache(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::Slice, rocksdb::BlockBasedTable::CachableEntry<rocksdb::Block>*, bool)+0x358) [0x7f3af3451a48]
12: (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::BlockIter*, bool, rocksdb::Status)+0x136) [0x7f3af3451d96]
13: (rocksdb::BlockBasedTable::BlockEntryIteratorState::NewSecondaryIterator(rocksdb::Slice const&)+0x98) [0x7f3af345a6f8]
14: (()+0xde102e) [0x7f3af347d02e]
15: (()+0xde12fc) [0x7f3af347d2fc]
16: (()+0xde1396) [0x7f3af347d396]
17: (rocksdb::MergingIterator::Next()+0x275) [0x7f3af34643c5]
18: (rocksdb::DBIter::FindNextUserEntryInternal(bool, bool)+0x133) [0x7f3af34e72c3]
19: (rocksdb::DBIter::Seek(rocksdb::Slice const&)+0x294) [0x7f3af34e8fa4]
20: (RocksDBStore::RocksDBWholeSpaceIteratorImpl::lower_bound(std::string const&, std::string const&)+0xa2) [0x7f3af3032942]
21: (RocksDBStore::RocksDBWholeSpaceIteratorImpl::upper_bound(std::string const&, std::string const&)+0x30) [0x7f3af3034490]
22: (BlueStore::OmapIteratorImpl::upper_bound(std::string const&)+0x20b) [0x7f3af2f8ba0b]
23: (PrimaryLogPG::do_osd_ops(PrimaryLogPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x6b36) [0x7f3af2d33166]
24: (cls_cxx_map_get_vals(void*, std::string const&, std::string const&, unsigned long, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >, bool)+0xf8) [0x7f3af2e5ae88]
25: (()+0xbbfb5) [0x7f3adc7fefb5]
26: (ClassHandler::ClassMethod::exec(void*, ceph::buffer::list&, ceph::buffer::list&)+0x34) [0x7f3af2c109a4]
27: (PrimaryLogPG::do_osd_ops(PrimaryLogPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x124f) [0x7f3af2d2d87f]
28: (PrimaryLogPG::prepare_transaction(PrimaryLogPG::OpContext*)+0xbf) [0x7f3af2d3e5df]
29: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x743) [0x7f3af2d3f353]
30: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x30ff) [0x7f3af2d43c2f]
31: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xebb) [0x7f3af2d00f3b]
32: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x7f3af2b84069]
33: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x7f3af2e06ee7]
34: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x7f3af2bb2f7e]
35: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x7f3af3122299]
36: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f3af3124230]
37: (()+0x7dc5) [0x7f3aefb46dc5]
38: (clone()+0x6d) [0x7f3aeec3a21d]
Actions #3

Updated by Josh Durgin over 5 years ago

  • Tracker changed from Bug to Support

The heartbeat timing out like that means the OSD is overloaded - in particular delete operations for RGW can overwhelm rocksdb compaction, especially with a small rocksdb cache size or slow disk.
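If the deletes are overwhelming the cluster, one blunt way to act on that is to pace the delete workers off cluster health, so compaction can catch up. A rough sketch that shells out to the standard ceph CLI (the pause interval is an arbitrary placeholder):

import json
import subprocess
import time

def cluster_is_calm():
    # "ceph health --format json" reports the active health checks;
    # anything other than HEALTH_OK (e.g. slow requests) means back off.
    out = subprocess.check_output(['sudo', 'ceph', 'health', '--format', 'json'])
    return json.loads(out).get('status') == 'HEALTH_OK'

def paced_delete(bucket, keys, pause=30):
    # Delete keys one at a time, pausing whenever the cluster complains.
    for key in keys:
        while not cluster_is_calm():
            time.sleep(pause)
        bucket.delete_key(key)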

