Bug #36303 (closed)

luminous: 12.2.8 - FAILED assert(0 == "put on missing extent (nothing before)")

Added by Ricardo Barberis over 5 years ago. Updated over 5 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Same as or similar to what is reported in http://tracker.ceph.com/issues/24715: after removing several rbd snapshots I'm getting an inconsistent PG on an SSD OSD.

When I do a repair, the primary OSD of that PG crashes with 'FAILED assert(0 == "put on missing extent (nothing before)")'.

With the OSD down/out, the cluster rebalances, the affected PG is re-replicated and gets a new primary, and a deep-scrub doesn't find any issues.

When I start the OSD, it becomes primary for the affected PG again and eventually crashes once more, usually after we run a deep-scrub/repair.

The cluster consists of:
- 12 nodes (CentOS 7.5) with a total of 60 SSD OSDs; 5 of these nodes are also mon+mgr
- 3 nodes (CentOS 7.5) with a total of 15 HDD OSDs
- All of the OSDs have used bluestore from the start
- The cluster started on luminous 12.2.4 -> 12.2.5 -> 12.2.8
- The crashing OSD was installed recently with luminous 12.2.8

I have the full log and objdump of ceph-osd if needed, but they're too big to upload via this form (11 MB each, compressed).

Extract from the log:

2018-10-02 15:05:16.054453 7f6da2769700 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.8/rpm/el7/BUILD/ceph-12.2.8/src/rocksdb/db/db_impl_files.cc:242] adding log 2000 to recycle list

2018-10-02 15:05:16.054458 7f6da2769700 4 rocksdb: (Original Log Time 2018/10/02-15:05:16.054175) [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.8/rpm/el7/BUILD/ceph-12.2.8/src/rocksdb/db/memtable_list.cc:360] [default] Level-0 commit table #2016 started
2018-10-02 15:05:16.054460 7f6da2769700 4 rocksdb: (Original Log Time 2018/10/02-15:05:16.054430) [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.8/rpm/el7/BUILD/ceph-12.2.8/src/rocksdb/db/memtable_list.cc:383] [default] Level-0 commit table #2016: memtable #1 done
2018-10-02 15:05:16.054462 7f6da2769700 4 rocksdb: (Original Log Time 2018/10/02-15:05:16.054437) EVENT_LOG_v1 {"time_micros": 1538503516054433, "job": 741, "event": "flush_finished", "lsm_state": [1, 1, 17, 0, 0, 0, 0], "immutable_memtables": 0}
2018-10-02 15:05:16.054463 7f6da2769700 4 rocksdb: (Original Log Time 2018/10/02-15:05:16.054448) [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.8/rpm/el7/BUILD/ceph-12.2.8/src/rocksdb/db/db_impl_compaction_flush.cc:132] [default] Level summary: base level 1 max bytes base 268435456 files[1 1 17 0 0 0 0] max score 0.38

2018-10-02 15:05:16.054467 7f6da2769700 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.8/rpm/el7/BUILD/ceph-12.2.8/src/rocksdb/db/db_impl_files.cc:388] [JOB 741] Try to delete WAL files size 254587152, prev total WAL file size 254602986, number of live WAL files 2.

2018-10-02 15:06:27.859289 7f6da9f78700 1 osd.65 pg_epoch: 36826 pg[2.6e( v 36825'1895260 (36776'1893728,36825'1895260] local-lis/les=36184/36185 n=2884 ec=257/257 lis/c 36184/36184 les/c/f 36185/36185/0 36184/36184/35329) [65,57,3] r=0 lpr=36184 crt=36825'1895260 lcod 36824'1895259 mlcod 36824'1895259 active+clean+snaptrim snaptrimq=[30bc~1,364e~1]] removing snap head
2018-10-02 15:06:28.060717 7f6da6f72700 -1 osd.65 pg_epoch: 36826 pg[2.124( v 36825'1268666 (36667'1267120,36825'1268666] local-lis/les=36429/36430 n=2991 ec=257/257 lis/c 36429/36429 les/c/f 36430/36430/0 36429/36429/30097) [65,39,72] r=0 lpr=36429 crt=36825'1268666 lcod 36825'1268665 mlcod 36825'1268665 active+clean+snaptrim snaptrimq=[30bc~1,364e~1]] removing snap head
2018-10-02 15:06:34.075450 7f6dc5772700 0 -- 172.17.31.12:6802/46047 >> 172.17.30.2:0/873499296 conn(0x5566f5566000 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
2018-10-02 15:07:29.173300 7f6da5f70700 0 log_channel(cluster) log [DBG] : 2.29e deep-scrub starts
2018-10-02 15:08:06.832087 7f6da9f78700 -1 bluestore(/var/lib/ceph/osd/ceph-65) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x0, expected 0x16b45a6c, device location [0x0~1000], logical extent 0x1d0000~1000, object #2:7973464c:::rbd_data.ed92691a07f806.0000000000000237:head#
2018-10-02 15:08:07.149865 7f6da5f70700 -1 log_channel(cluster) log [ERR] : 2.29e shard 65: soid 2:7973464c:::rbd_data.ed92691a07f806.0000000000000237:head candidate had a read error
2018-10-02 15:08:15.835174 7f6da5f70700 -1 log_channel(cluster) log [ERR] : 2.29e deep-scrub 0 missing, 1 inconsistent objects
2018-10-02 15:08:15.835179 7f6da5f70700 -1 log_channel(cluster) log [ERR] : 2.29e deep-scrub 1 errors
2018-10-02 15:09:31.944655 7f6da9f78700 -1 osd.65 pg_epoch: 36829 pg[2.29e( v 36827'2056552 (36761'2054979,36827'2056552] local-lis/les=36417/36418 n=2855 ec=257/257 lis/c 36417/36417 les/c/f 36418/36418/0 36417/36417/30473) [65,51,63] r=0 lpr=36417 crt=36827'2056552 lcod 36827'2056551 mlcod 36827'2056551 active+clean+inconsistent+snaptrim snaptrimq=[30be~1,3650~1]] removing snap head
2018-10-02 15:09:31.965843 7f6daaf7a700 -1 osd.65 pg_epoch: 36829 pg[2.2d4( v 36828'2187015 (36758'2185510,36828'2187015] local-lis/les=36546/36547 n=2877 ec=257/257 lis/c 36546/36546 les/c/f 36547/36547/0 36546/36546/29796) [65,18,0] r=0 lpr=36546 crt=36828'2187015 lcod 36828'2187014 mlcod 36828'2187014 active+clean+snaptrim snaptrimq=[30be~1,3650~1]] removing snap head
2018-10-02 15:12:35.212301 7f6da5f70700 0 log_channel(cluster) log [DBG] : 2.29e repair starts
2018-10-02 15:13:10.634990 7f6da5f70700 -1 bluestore(/var/lib/ceph/osd/ceph-65) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x0, expected 0x16b45a6c, device location [0x0~1000], logical extent 0x1d0000~1000, object #2:7973464c:::rbd_data.ed92691a07f806.0000000000000237:head#
2018-10-02 15:13:10.825472 7f6da5f70700 -1 log_channel(cluster) log [ERR] : 2.29e shard 65: soid 2:7973464c:::rbd_data.ed92691a07f806.0000000000000237:head candidate had a read error
2018-10-02 15:13:19.122893 7f6da5f70700 -1 log_channel(cluster) log [ERR] : 2.29e repair 0 missing, 1 inconsistent objects
2018-10-02 15:13:19.122904 7f6da5f70700 -1 log_channel(cluster) log [ERR] : 2.29e repair 1 errors, 1 fixed
2018-10-02 15:13:19.179577 7f6da5f70700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.8/rpm/el7/BUILD/ceph-12.2.8/src/os/bluestore/bluestore_types.cc: In function 'void bluestore_extent_ref_map_t::put(uint64_t, uint32_t, PExtentVector*, bool*)' thread 7f6da5f70700 time 2018-10-02 15:13:19.174814
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.8/rpm/el7/BUILD/ceph-12.2.8/src/os/bluestore/bluestore_types.cc: 222: FAILED assert(0 == "put on missing extent (nothing before)")

ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x5566d9557d90]
2: (bluestore_bdev_label_t::generate_test_instances(std::list<bluestore_bdev_label_t*, std::allocator<bluestore_bdev_label_t*> >&)+0) [0x5566d9456e80]
3: (BlueStore::SharedBlob::put_ref(unsigned long, unsigned int, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >*)+0x7c) [0x5566d93ccfcc]
4: (BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >*)+0x799) [0x5566d93fceb9]
5: (BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >*)+0x2c2) [0x5566d94164b2]
6: (BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>)+0xc6) [0x5566d9416df6]
7: (BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)+0x94) [0x5566d9418864]
8: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x15af) [0x5566d942b86f]
9: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x3a0) [0x5566d942c790]
10: (PrimaryLogPG::queue_transaction(ObjectStore::Transaction&&, boost::intrusive_ptr<OpRequest>)+0xa5) [0x5566d919f545]
11: (ReplicatedBackend::_do_pull_response(boost::intrusive_ptr<OpRequest>)+0x8f8) [0x5566d92a7fa8]
12: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x224) [0x5566d92a86a4]
13: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x5566d91ccc60]
14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x59c) [0x5566d9137d2c]
15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x5566d8fb8f19]
16: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x5566d923f6a7]
17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x5566d8fe7e5e]
18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x5566d955d8a9]
19: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5566d955f840]
20: (()+0x7e25) [0x7f6dc8053e25]
21: (clone()+0x6d) [0x7f6dc7144bad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
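
For intuition only, here is a minimal, self-contained sketch (hypothetical code, NOT Ceph's bluestore_types.cc) of the bookkeeping the failing assert guards: put() releases a reference on an extent of a shared blob, and the "nothing before" branch fires when the ref map has no tracked extent at or before the requested offset, i.e. the on-disk shared-blob metadata claims space the ref map knows nothing about. The ToyExtentRefMap type and its get()/put() helpers are made up for illustration:

// Hypothetical toy model of a ref-counted extent map; it only reproduces
// the failure mode, not the real data structure or merging logic.
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

struct ToyExtentRefMap {
  // offset -> (length, refcount)
  std::map<uint64_t, std::pair<uint32_t, int>> ref_map;

  void get(uint64_t offset, uint32_t length) {
    auto it = ref_map.find(offset);
    if (it == ref_map.end())
      ref_map.emplace(offset, std::make_pair(length, 1));
    else
      ++it->second.second;
  }

  void put(uint64_t offset, uint32_t /*length*/) {
    auto p = ref_map.lower_bound(offset);
    if (p == ref_map.end() || p->first > offset) {
      // No tracked extent starts at this offset; the only candidate is an
      // earlier extent that might cover it.  If there is no earlier extent
      // at all, the caller is releasing space this map never tracked.
      assert(p != ref_map.begin() &&
             "put on missing extent (nothing before)");
      --p;
    }
    if (--p->second.second == 0)
      ref_map.erase(p);
  }
};

int main() {
  ToyExtentRefMap m;
  m.get(0x1000, 0x1000);  // track one extent at offset 0x1000
  m.put(0x1000, 0x1000);  // ok: releases the tracked extent
  m.get(0x2000, 0x1000);
  m.put(0x0, 0x1000);     // aborts: nothing tracked at or before offset 0x0
  return 0;
}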

Related issues: 1 (0 open, 1 closed)

Is duplicate of bluestore - Bug #36526: segv in BlueStore::OldExtent::create (Resolved, 10/18/2018)

#1

Updated by Josh Durgin over 5 years ago

  • Is duplicate of Bug #36526: segv in BlueStore::OldExtent::create added
#2

Updated by Josh Durgin over 5 years ago

  • Status changed from New to Duplicate