Project

General

Profile

Bug #19555

BlueStore.cc: 1941: FAILED assert(!force)

Added by Jason Dillaman 4 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
04/09/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No

Description

2017-04-08T14:50:01.172 INFO:tasks.ceph.cluster1.osd.1.smithi182.stderr:2017-04-08 14:50:01.173459 7fd02c356700 -1 bluestore.extentmap(0x5564ccb17350) update  encode_some needs reshard
2017-04-08T14:50:01.175 INFO:tasks.ceph.cluster1.osd.2.smithi182.stderr:2017-04-08 14:50:01.177995 7f0b044d9700 -1 bluestore.extentmap(0x56415fbea250) update  encode_some needs reshard
2017-04-08T14:50:01.787 INFO:tasks.ceph.cluster1.osd.1.smithi182.stderr:/build/ceph-12.0.0-2361-gbfd5fc0/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::ExtentMap::update(KeyValueDB::Transaction, bool)' thread 7fd02c356700 time 2017-04-08 14:50:01.173460
2017-04-08T14:50:01.787 INFO:tasks.ceph.cluster1.osd.1.smithi182.stderr:/build/ceph-12.0.0-2361-gbfd5fc0/src/os/bluestore/BlueStore.cc: 1941: FAILED assert(!force)
2017-04-08T14:50:01.787 INFO:tasks.ceph.cluster1.osd.2.smithi182.stderr:/build/ceph-12.0.0-2361-gbfd5fc0/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::ExtentMap::update(KeyValueDB::Transaction, bool)' thread 7f0b044d9700 time 2017-04-08 14:50:01.177996
2017-04-08T14:50:01.787 INFO:tasks.ceph.cluster1.osd.2.smithi182.stderr:/build/ceph-12.0.0-2361-gbfd5fc0/src/os/bluestore/BlueStore.cc: 1941: FAILED assert(!force)

http://qa-proxy.ceph.com/teuthology/jdillaman-2017-04-08_09:59:00-rbd-wip-jd-testing-2-distro-basic-smithi/1001772/teuthology.log

ceph-osd.3.rar (848 KB) tao ning, 05/23/2017 01:04 AM

History

#1 Updated by Nick Fisk 2 months ago

I believe I may have just hit this. Was doing 4kb seq writes from fio to a EC pool with overwrites enabled, RBD image had previously been initialised with 4MB writes without problem.. Occurred about 1-2s into the test.


-2> 2017-05-22 15:58:24.506512 7f8ceec80700  1 -- 10.2.0.9:6802/3868 <== osd.4 10.2.0.9:6818/3870 233 ==== MOSDECSubOpWrite(4.ccs2 442/441 ECSubWrite(tid=1, req
id=client.24276.0:896, at_version=442'881, trim_to=0'0, roll_forward_to=0'0)) v2 ==== 6539+0+0 (343262927 0 0) 0x55b3bd5c3800 con 0x55b3bd127000
    -1> 2017-05-22 15:58:24.508798 7f8cd7c7a700 -1 bluestore.extentmap(0x55b3b406ddd0) update  encode_some needs reshard
     0> 2017-05-22 15:58:24.512854 7f8cd7c7a700 -1 /build/ceph-12.0.2-1259-gc946455/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::ExtentMap::update(Ke
yValueDB::Transaction, bool)' thread 7f8cd7c7a700 time 2017-05-22 15:58:24.508817
/build/ceph-12.0.2-1259-gc946455/src/os/bluestore/BlueStore.cc: 1942: FAILED assert(!force)

 ceph version 12.0.2-1259-gc946455 (c94645567be6186595a1d5ad2c56a6425951ec7d)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55b3a9b6d742]
 2: (BlueStore::ExtentMap::update(std::shared_ptr<KeyValueDB::TransactionImpl>, bool)+0x689) [0x55b3a99dde09]
 3: (BlueStore::_txc_write_nodes(BlueStore::TransContext*, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x168b) [0x55b3a9a16a6b]
 4: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<
TrackedOp>, ThreadPool::TPHandle*)+0x57d) [0x55b3a9a338fd]
 5: (PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x66) [0x5
5b3a977e0f6]
 6: (ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, Context*)+0x886) [0x55b3a98a5176]
 7: (ECBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x331) [0x55b3a98c0701]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x541) [0x55b3a971ef41]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1c9) [0x55b3a95b22a9]
 10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55b3a95b2727]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x55b3a95d879e]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x7f9) [0x55b3a9b724f9]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55b3a9b758b0]
 14: (()+0x76ba) [0x7f8cf1ff46ba]
 15: (clone()+0x6d) [0x7f8cf106b82d]

#2 Updated by Sage Weil 2 months ago

Nick, is this reproducible? Any chance you can reproduce with debug bluestore = 20?

#3 Updated by Sage Weil 2 months ago

or debug bluestore = 30 would be even better :)

#4 Updated by tao ning 2 months ago

0> 2017-05-19 14:44:41.128625 7f8a6cc28700 -1 ** Caught signal (Aborted) *
in thread 7f8a6cc28700 thread_name:tp_osd

ceph version 12.0.2 (5a1b6b3269da99a18984c138c23935e5eb96f73e)
1: (()+0xcab9b2) [0x55e39f8cb9b2]
2: (()+0x11390) [0x7f8a83121390]
3: (gsignal()+0x38) [0x7f8a820bd428]
4: (abort()+0x16a) [0x7f8a820bf02a]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55e39f92f1fe]
6: (BlueStore::ExtentMap::update(std::shared_ptr<KeyValueDB::TransactionImpl>, bool)+0x689) [0x55e39f776839]
7: (BlueStore::_txc_write_nodes(BlueStore::TransContext*, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x172b) [0x55e39f7b0d6b]
8: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x57c) [0x55e39f7cd75c]
9: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x17c) [0x55e39f3b638c]
10: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x6c) [0x55e39f344e2c]
11: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x451) [0x55e39f390b91]
12: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x27) [0x55e39f3d6357]
13: (ThreadPool::worker(ThreadPool::WorkThread*)+0xf2d) [0x55e39f93625d]
14: (ThreadPool::WorkThread::entry()+0x10) [0x55e39f9373c0]
15: (()+0x76ba) [0x7f8a831176ba]
16: (clone()+0x6d) [0x7f8a8218e82d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

See annex for details,Hope to help

#5 Updated by tao ning 2 months ago

blob id more than the maximum

BlueStore.cc :2256
dout(20) << func << " adding spanning " << *b << dendl;
dout(1) << func << " adding spanning " << b->id << dendl;

The log is as follows:
-49> 2017-05-25 19:04:48.737586 7f5dc5eca700 20 bluestore.extentmap(0x5630abb27b10) reshard adding spanning Blob(0x5630a3d4c6f0 spanning 32767 blob([0x11a7f80000~80000] has_unused+shared none/0x1 unused=0xffc3) use_tracker(0x80000 0x1000) SharedBlob(0x5630a3d4bff0 loaded sbid 0x19e50 ref_map(ref_map(0x11a7f80000~80000=2))))
-48> 2017-05-25 19:04:48.737590 7f5dc5eca700 1 bluestore.extentmap(0x5630abb27b10) reshard adding spanning 32767
-38> 2017-05-25 19:04:48.737609 7f5dc5eca700 20 bluestore.extentmap(0x5630abb27b10) reshard adding spanning Blob(0x5630a3d4c7d0 blob([0x116dc80000~80000] has_unused+shared none/0x1 unused=0xff05) use_tracker(0x80000 0x1000) SharedBlob(0x5630a3d4c450 loaded sbid 0x19e5f ref_map(ref_map(0x116dc80000~80000=2))))
-37> 2017-05-25 19:04:48.737611 7f5dc5eca700 1 bluestore.extentmap(0x5630abb27b10) reshard adding spanning -32768

#7 Updated by Nick Fisk about 2 months ago

Sage Weil wrote:

or debug bluestore = 30 would be even better :)

The motherboard in my test box seems to have died over the weekend. Trying to get it back up and running, if I can I will turn up the debugging.

#8 Updated by Xiaoyan Li about 1 month ago

I use bluestore fio plugin to run IO test, this problem occurs after about 3 mins. To set debug bluestore=30 easily cause too much log, and it is hard to reproduce the problem.

I updated dout in reshard(), update(), and encode_some() to lower levels, and had the log in attached file.

The process:
Update, reshard, and then update force:
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L7665-L7667

encode_some returns must_reshard as true when updating force:
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L2317-L2321

Look into the log:
2017-06-09 12:01:12.897087 7fe84d72d700 5 bluestore.extentmap(0x1ffc8b0) reshard adding spanning Blob(0x7fe8308a2e70 blob([!~4000,0x23f0000~4000] mutable+csum+has_unused crc32c/0x1000 unused=0x3ff) use_tracker(0x2*0x4000 0x[0,1000]) SharedBlob(0x7fe8308a2ee0 sbid 0x0))
2017-06-09 12:01:12.897673 7fe84d72d700 5 bluestore.extentmap(0x1ffc8b0) encode_some 0x12235000~4ed000 hit new spanning blob 0x12235000~1000: 0x5000~1000 Blob(0x7fe8308a2e70 blob([!~4000,0x23f0000~4000] mutable+csum+has_unused crc32c/0x1000 unused=0x3ff) use_tracker(0x2*0x4000 0x[0,1000]) SharedBlob(0x7fe8308a2ee0 sbid 0x0))

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L1658
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L2274-L2276

In the time 2017-06-09 12:01:12.897087, it adds spanning blob, but blod id is less than 0. As no "spanning" is printed in this log. It leads later encode_some returns true.

The log is very big and put in google drive:
https://drive.google.com/file/d/0B6jqFc7e2yxVVUQ4Y0VlYm15b00/view?usp=sharing

#9 Updated by Xiaoyan Li about 1 month ago

I add codes to print the spanning_blob_map when blob->id is less than 0, and finally I think I find the problem.

2017-06-13 06:32:12.565807 7fe89a07f700 5 bluestore.extentmap(0x32228c0) reshard 32764 blobBlob(0x7fe8830a6340 spanning 32764 blob([0x3874000~4000] mutable+csum+has_unused crc32c/0x1000 unused=0xff0f) use_tracker(0x4000 0x1000) SharedBlob(0x7fe882e8dd10 sbid 0x0))
2017-06-13 06:32:12.565812 7fe89a07f700 5 bluestore.extentmap(0x32228c0) reshard 32765 blobBlob(0x7fe88005a2a0 spanning 32765 blob([!~c000,0x73a0000~4000] mutable+csum+has_unused crc32c/0x1000 unused=0xfff) use_tracker(0x4*0x4000 0x[0,0,0,1000]) SharedBlob(0x7fe880605070 sbid 0x0))
2017-06-13 06:32:12.565817 7fe89a07f700 5 bluestore.extentmap(0x32228c0) reshard 32766 blobBlob(0x7fe882405900 spanning 32766 blob([0x9b80000~4000] mutable+csum+has_unused crc32c/0x1000 unused=0xfff) use_tracker(0x4000 0x1000) SharedBlob(0x7fe8800c1890 sbid 0x0))
2017-06-13 06:32:12.565823 7fe89a07f700 5 bluestore.extentmap(0x32228c0) reshard 32767 blobBlob(0x7fe883e3e760 spanning 32767 blob([!~8000,0x7a20000~4000] mutable+csum+has_unused crc32c/0x1000 unused=0x3ff) use_tracker(0x3*0x4000 0x[0,0,1000]) SharedBlob(0x7fe881f0af50 sbid 0x0))

The type of blob id is int16_t(short int), which has maximum data 32768!

New log is at
https://drive.google.com/open?id=0B6jqFc7e2yxVLXViMFBNT09VUzQ

#10 Updated by Xiaoyan Li about 1 month ago

#11 Updated by Kefu Chai about 1 month ago

  • Status changed from New to Need Review

#12 Updated by Sage Weil about 1 month ago

  • Status changed from Need Review to Resolved

Also available in: Atom PDF