Project

General

Profile

Bug #23085

Assertion failure in Bluestore _kv_sync_thread()

Added by pratyush ranjan over 3 years ago. Updated over 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While perf-testing a small Ceph cluster running 12.2.2 with 12 OSD, I found following assertion failure -
Looks like a error during rocksdb compaction -

assertion failure code-line-
RocksDB
https://github.com/ceph/rocksdb/blob/62782e7357ff569d836b0ad91e5776fc1d6137e5/db/db_impl_compaction_flush.cc#L1272

Bluestore
https://github.com/ceph/ceph/blob/luminous/src/os/bluestore/BlueStore.cc#L8547

 -7> 2018-02-21 12:26:47.657326 7fc073bc8700  1 -- 10.34.233.252:6801/15110 <== osd.20 10.34.157.176:6809/93929 20741 ==== osd_repop(client.5233.0:1981405 18.5 e262/174) v2 ==== 1076+0+129661 (2918758053 0 841501392) 0x7fc0888d6c00 con 0x7fc0877d2000
    -6> 2018-02-21 12:26:47.657406 7fc05e3c1700  5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 262'20705, trimmed: , trimmed_dups: , clear_divergent_priors: 0
    -5> 2018-02-21 12:26:47.657442 7fc0653cf700  1 -- 10.34.233.252:6800/15110 --> 10.33.142.115:0/4087232167 -- osd_op_reply(1981354 fb303f39-d803-44f8-9ef1-88f661b87aff.5233.165_myobjects1307 [create,setxattr user.rgw.idtag (49),setxattr user.rgw.tail_tag (49),writefull 0~128000,setxattr user.rgw.manifest (322),setxattr user.rgw.acl (177),setxattr user.rgw.content_type (25),setxattr user.rgw.etag (33),call rgw.obj_store_pg_ver,setxattr user.rgw.source_zone (4)] v262'20700 uv16218 ondisk = -17 ((17) File exists)) v8 -- 0x7fc0b54bf9c0 con 0
    -4> 2018-02-21 12:26:47.657611 7fc065bd0700 -1 rocksdb: submit_transaction error: Corruption: Bad table magic number code = 2 Rocksdb transaction:
Put( Prefix = M key = 0x0000000000000555'.0000000262.00000000000000020603' Value size = 203)
Put( Prefix = M key = 0x0000000000000555'._fastinfo' Value size = 186)
Put( Prefix = O key = 0x7f800000000000001226adcf87216662'303f39-d803-44f8-9ef1-88f661b87aff.5233.120_myobjects1908!='0xfffffffffffffffeffffffffffffffff'o' Value size = 1506)
Merge( Prefix = b key = 0x000000030c000000 Value size = 16)
Merge( Prefix = T key = 'bluestore_statfs' Value size = 40)
    -3> 2018-02-21 12:26:47.658798 7fc0733c7700  5 -- 10.34.233.252:6800/15110 >> 10.33.142.115:0/4087232167 conn(0x7fc08653a800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 cs=1 l=1). rx client.5233 seq 21438 0x7fc0b54bf9c0 osd_op(client.5233.0:1981411 18.0 18.50b61c60 (undecoded) ondisk+read+known_if_redirected e262) v8
    -2> 2018-02-21 12:26:47.658824 7fc0733c7700  1 -- 10.34.233.252:6800/15110 <== client.5233 10.33.142.115:0/4087232167 21438 ==== osd_op(client.5233.0:1981411 18.0 18.50b61c60 (undecoded) ondisk+read+known_if_redirected e262) v8 ==== 306+0+0 (1814146792 0 0) 0x7fc0b54bf9c0 con 0x7fc08653a800
    -1> 2018-02-21 12:26:47.658960 7fc05e3c1700  1 -- 10.34.233.252:6800/15110 --> 10.33.142.115:0/4087232167 -- osd_op_reply(1981411 fb303f39-d803-44f8-9ef1-88f661b87aff.5233.165_myobjects1307 [getxattrs,stat] v0'0 uv16218 ondisk = 0) v8 -- 0x7fc0b4a5cdc0 con 0
     0> 2018-02-21 12:26:47.659595 7fc065bd0700 -1 /mnt/varada/build/ceph/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread()' thread 7fc065bd0700 time 2018-02-21 12:26:47.657633
/mnt/varada/build/ceph/src/os/bluestore/BlueStore.cc: 8506: FAILED assert(r == 0)

 ceph version 12.2.2-42-ga749f88 (a749f88ecc6c9a1c4d1e42630c250fdce0741de4) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fc07a7519d2]
 2: (BlueStore::_kv_sync_thread()+0x34b0) [0x7fc07a5f2500]
 3: (BlueStore::KVSyncThread::entry()+0xd) [0x7fc07a62fd4d]
 4: (()+0x8064) [0x7fc077266064]
 5: (clone()+0x6d) [0x7fc07635a62d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.6.log
--- end dump of recent events ---
2018-02-21 12:26:47.703829 7fc065bd0700 -1 *** Caught signal (Aborted) **
 in thread 7fc065bd0700 thread_name:bstore_kv_sync

 ceph version 12.2.2-42-ga749f88 (a749f88ecc6c9a1c4d1e42630c250fdce0741de4) luminous (stable)
 1: (()+0xa4461c) [0x7fc07a70e61c]
 2: (()+0xf890) [0x7fc07726d890]
 3: (gsignal()+0x37) [0x7fc0762a7067]
 4: (abort()+0x148) [0x7fc0762a8448]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27f) [0x7fc07a751b4f]
 6: (BlueStore::_kv_sync_thread()+0x34b0) [0x7fc07a5f2500]
 7: (BlueStore::KVSyncThread::entry()+0xd) [0x7fc07a62fd4d]
 8: (()+0x8064) [0x7fc077266064]
9: (clone()+0x6d) [0x7fc07635a62d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -24> 2018-02-21 12:26:47.666096 7fc0733c7700  5 -- 10.34.233.252:6800/15110 >> 10.33.142.115:0/4087232167 conn(0x7fc08653a800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 cs=1 l=1). rx client.5233 seq 21439 0x7fc0b4a5cdc0 osd_op(client.5233.0:1981443 18.0 18.50b61c60 (undecoded) ondisk+write+known_if_redirected e262) v8
   -23> 2018-02-21 12:26:47.666117 7fc0733c7700  1 -- 10.34.233.252:6800/15110 <== client.5233 10.33.142.115:0/4087232167 21439 ==== osd_op(client.5233.0:1981443 18.0 18.50b61c60 (undecoded) ondisk+write+known_if_redirected e262) v8 ==== 686+0+128920 (1041487758 0 3354636632) 0x7fc0b4a5cdc0 con 0x7fc08653a800
   -22> 2018-02-21 12:26:47.666415 7fc05e3c1700  1 -- 10.34.233.252:6801/15110 --> 10.34.157.176:6805/93299 -- osd_repop(client.5233.0:1981443 18.0 e262/174 18:06386d0a:::fb303f39-d803-44f8-9ef1-88f661b87aff.5233.165_myobjects1307:head v 262'20701) v2 -- 0x7fc0aaeb8a00 con 0
   -21> 2018-02-21 12:26:47.666472 7fc05e3c1700  1 -- 10.34.233.252:6801/15110 --> 10.33.77.153:6809/89759 -- osd_repop(client.5233.0:1981443 18.0 e262/174 18:06386d0a:::fb303f39-d803-44f8-9ef1-88f661b87aff.5233.165_myobjects1307:head v 262'20701) v2 -- 0x7fc0a6e02c00 con 0

History

#1 Updated by pratyush ranjan over 3 years ago

The OSD fails to restart after several attempts.

I also searched for similar issues and found following trackers
with BlueStore::_kv_sync_thread() in call stack -
http://tracker.ceph.com/issues/22510
https://tracker.ceph.com/issues/22539

#2 Updated by pratyush ranjan over 3 years ago

I marked the OSD out and running tests from the top, the cluster is doing fine, no more osd failures even after 2 hours of test run.

Also available in: Atom PDF