Bug #23540

closed

FAILED assert(0 == "can't mark unloaded shard dirty") with compression enabled

Added by Francisco Freire about 6 years ago. Updated over 5 years ago.

Status: Resolved
Priority: High
Assignee: Igor Fedotov
Target version: -
% Done: 0%
Source:
Tags:
Backport: mimic,luminous
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are using the latest Ceph Luminous version (12.2.4), and we have a SATA pool tiered by an SSD pool, all using BlueStore; this bug only occurs on the SSD pool. I changed some OSDs to filestore and everything works fine there. I get this error 2 or 3 times a day on EACH OSD, causing them to go down and restart. I have to keep the noout flag set on the cluster to keep everything running.
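(For reference, a minimal sketch of the noout workaround mentioned above; these are standard ceph CLI calls:)

    # keep the flapping OSDs from being marked out while they restart
    ceph osd set noout
    # clear the flag once the OSDs are stable again
    ceph osd unset noout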

This Ceph cluster is used by OpenStack for VM disks (Nova) and volumes (Cinder).

Thanks!

/build/ceph-12.2.4/src/os/bluestore/BlueStore.cc: 2714: FAILED assert(0 == "can't mark unloaded shard dirty")

ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x561c785ce872]
2: (BlueStore::ExtentMap::dirty_range(unsigned int, unsigned int)+0x54a) [0x561c7841927a]
3: (BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)+0x4d9) [0x561c7847e4b9]
4: (BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int)+0xfc) [0x561c7847ef9c]
5: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1b34) [0x561c78485ea4]
6: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x52e) [0x561c7848702e]
7: (PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x66) [0x561c781ae256]
8: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xc34) [0x561c782d37a4]
9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x294) [0x561c782dc834]
10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x561c781ebca0]
11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x543) [0x561c781509d3]
12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x561c77fca3b9]
13: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x561c7826d047]
14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x561c77ff29ae]
15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x561c785d3664]
16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x561c785d66a0]
17: (()+0x76ba) [0x7f040b0e96ba]
18: (clone()+0x6d) [0x7f040a16041d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
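(A rough sketch of the note above; the binary path is an assumption and the matching debug-symbols package must be installed:)

    # produce annotated disassembly to map the addresses in the trace to source lines
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump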

Files

ceph-osd.15.zip (197 KB), Yohay Azulay, 05/14/2018 01:50 PM

Related issues 2 (0 open, 2 closed)

Copied to bluestore - Backport #24798: luminous: FAILED assert(0 == "can't mark unloaded shard dirty") with compression enabled (Resolved, Igor Fedotov)
Copied to bluestore - Backport #24799: mimic: FAILED assert(0 == "can't mark unloaded shard dirty") with compression enabled (Resolved, Igor Fedotov)
Actions #1

Updated by Igor Fedotov about 6 years ago

Hi Francisco,
wondering if you have compression enabled for any of your pools or for BlueStore as a whole?
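(A minimal sketch of how this can be checked; the pool name and OSD id are placeholders:)

    # per-pool compression setting
    ceph osd pool get <pool> compression_mode
    # store-wide BlueStore setting, queried from a running OSD on its host
    ceph daemon osd.0 config get bluestore_compression_mode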

Actions #2

Updated by Igor Fedotov about 6 years ago

  • Project changed from Ceph to bluestore
  • Category deleted (OSD)
Actions #3

Updated by Francisco Freire about 6 years ago

Yeah. The whole cluster has compression enabled.

Actions #4

Updated by Francisco Freire about 6 years ago

I disabled compression for a while and no OSDs got the error. After enabling it again, they went back to having the problem; this is definitely caused by compression.
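(For anyone following along, a minimal sketch of toggling this per pool; the cluster-wide equivalent is the bluestore_compression_mode option in ceph.conf. The pool name and modes are placeholders:)

    # turn compression off for a pool
    ceph osd pool set <pool> compression_mode none
    # turn it back on with whatever mode was used before (e.g. aggressive or force)
    ceph osd pool set <pool> compression_mode aggressive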

Actions #5

Updated by Igor Fedotov about 6 years ago

Francisco,
thanks for the update, much appreciated.

Curious whether you can collect a log from the crashing OSD with debug bluestore set to 20. It would be very helpful for troubleshooting.
This might impact cluster performance though...
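(A minimal sketch of raising the debug level; the OSD id is a placeholder:)

    # raise BlueStore logging to 20 on a running OSD
    ceph tell osd.15 injectargs '--debug_bluestore 20/20'
    # or persistently, in ceph.conf under [osd]:
    #   debug bluestore = 20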

Actions #6

Updated by Igor Fedotov about 6 years ago

  • Assignee set to Igor Fedotov
Actions #7

Updated by Francisco Freire about 6 years ago

Hello,

We have been running for a few days without problems (with compression disabled). To capture a debug log I need to enable compression again; I'll do it in a scheduled timeframe and post the results here.

Thanks

Actions #8

Updated by Sage Weil about 6 years ago

  • Priority changed from Normal to High
Actions #9

Updated by Sage Weil almost 6 years ago

  • Subject changed from FAILED assert(0 == "can't mark unloaded shard dirty") to FAILED assert(0 == "can't mark unloaded shard dirty") with compression enabled
  • Status changed from New to Need More Info

Francisco, any update?

Actions #10

Updated by Yohay Azulay almost 6 years ago

I had the same issue with 3 new clusters, compression set to FORCE. Once I changed it to none and restarted the whole cluster, the OSDs stopped flapping.

Any logs I can send that can help?

Actions #11

Updated by Igor Fedotov almost 6 years ago

Hi Yohay,
could you please collect a log for the crash with debug bluestore set to 20?

Actions #12

Updated by Yohay Azulay almost 6 years ago

That can be a problem, because I disabled compression and the cluster is running in production; if I enable compression it will crash again.. :(

The log file I have is 110 MB (8 MB compressed). It repeats itself, so I attached a sample of the log.

Igor Fedotov wrote:

Hi Yohay,
could you please collect a log for the crash with debug bluestore set to 20?

Actions #13

Updated by Sage Weil almost 6 years ago

Hi everyone,

Is someone willing to enable compression on a BlueStore OSD with debugging enabled (debug bluestore = 20) so that we can capture a complete log leading up to the crash? That would be extremely helpful. Igor has been unable to reproduce this in his own environment, so we currently don't have much to go on. Perhaps enabling debugging and compression on only a handful of OSDs would be sufficient, with a small impact on the production environment (performance, /var/log/ceph disk usage, etc.).
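(One possible way to limit this to a handful of OSDs, as suggested, is a per-daemon section in ceph.conf; osd.11 is just an example id, and the OSD needs a restart to pick the settings up:)

    [osd.11]
        debug bluestore = 20
        bluestore compression mode = force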

Actions #14

Updated by Yohay Azulay almost 6 years ago

Got it.. here it is: http://77.247.180.45/ceph-osd.11.log.debug.gz

Sage Weil wrote:

Hi everyone,

Is someone willing to enable compression on a BlueStore OSD with debugging enabled (debug bluestore = 20) so that we can capture a complete log leading up to the crash? That would be extremely helpful. Igor has been unable to reproduce this in his own environment, so we currently don't have much to go on. Perhaps enabling debugging and compression on only a handful of OSDs would be sufficient, with a small impact on the production environment (performance, /var/log/ceph disk usage, etc.).

Actions #15

Updated by Igor Fedotov almost 6 years ago

Yohay,
I can't access the file at the link you provided; "Not found" is returned.

Actions #16

Updated by Yohay Azulay almost 6 years ago

Arghh, my mistake: http://77.247.180.45/download/ceph-osd.11.log.debug.gz

Igor Fedotov wrote:

Yohay,
I can't access the file at the link you provided, "Not found" returned..

Actions #17

Updated by Igor Fedotov almost 6 years ago

Much better now :)
Thanks a lot!!!

Actions #18

Updated by Peter Gervai almost 6 years ago

ceph-post-file: 1b1d42bb-6cae-430a-8fe7-974ce077b8dc
It may (or may not) help; it's around log level 5, I guess.
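(For reference, the tag above is what ceph-post-file prints after uploading a file to the Ceph developers' drop point; the log path here is a placeholder:)

    ceph-post-file /var/log/ceph/ceph-osd.15.log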

Actions #19

Updated by Igor Fedotov almost 6 years ago

  • Status changed from Need More Info to Fix Under Review
Actions #20

Updated by Sage Weil almost 6 years ago

  • Backport set to mimic,luminous
Actions #21

Updated by Igor Fedotov almost 6 years ago

  • Related to Backport #24798: luminous: FAILED assert(0 == "can't mark unloaded shard dirty") with compression enabled added
Actions #22

Updated by Igor Fedotov almost 6 years ago

  • Related to Backport #24799: mimic: FAILED assert(0 == "can't mark unloaded shard dirty") with compression enabled added
Actions #23

Updated by Sage Weil almost 6 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #26

Updated by Nathan Cutler almost 6 years ago

  • Related to deleted (Backport #24798: luminous: FAILED assert(0 == "can't mark unloaded shard dirty") with compression enabled)
Actions #27

Updated by Nathan Cutler almost 6 years ago

  • Related to deleted (Backport #24799: mimic: FAILED assert(0 == "can't mark unloaded shard dirty") with compression enabled)
Actions #28

Updated by Nathan Cutler almost 6 years ago

  • Copied to Backport #24798: luminous: FAILED assert(0 == "can't mark unloaded shard dirty") with compression enabled added
Actions #29

Updated by Nathan Cutler almost 6 years ago

  • Copied to Backport #24799: mimic: FAILED assert(0 == "can't mark unloaded shard dirty") with compression enabled added
Actions #30

Updated by Nathan Cutler almost 6 years ago

@Igor Gajowiak - please use "Copied To" instead of "Related To" for the links to the backport issues. All good otherwise, thanks!

Note: you can automate creation of backport issues for issues in Pending Backport status using src/script/backport-create-issue (but this script currently does not know how to limit itself to a single issue - it loops through them all).

Actions #32

Updated by Igor Fedotov over 5 years ago

  • Status changed from Pending Backport to Resolved