Bug #10138 (closed)

osd: crash in SnapSet::from_snap_set

Added by Paul Emmerich over 9 years ago. Updated over 8 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are running ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578) in a 3-server setup with 18 OSDs (4 HDDs and 2 SSDs per server).
We created two pools: 'data' on the HDD OSDs and 'ssd-cache' on the SSD OSDs.

We tried to use the 'ssd-cache' pool as a caching tier for the 'data' pool with the following configuration:

writeback mode, bloom hit set, hit_set_count = 1, hit_set_period = 3600, target_max_bytes = 700 GB
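
For reference, a tier like this is typically set up with commands along the following lines. This is a sketch, not the exact commands we ran, and the target_max_bytes value shown is 700 GiB expressed in bytes:

    # attach 'ssd-cache' as a writeback tier in front of 'data'
    ceph osd tier add data ssd-cache
    ceph osd tier cache-mode ssd-cache writeback
    ceph osd tier set-overlay data ssd-cache

    # hit set and sizing parameters described above
    ceph osd pool set ssd-cache hit_set_type bloom
    ceph osd pool set ssd-cache hit_set_count 1
    ceph osd pool set ssd-cache hit_set_period 3600
    ceph osd pool set ssd-cache target_max_bytes 751619276800  # 700 GiB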

The setup works as expected for a few hours under moderate load (a few test VMs), but an OSD then crashes with the following log:

    -6> 2014-11-19 13:00:41.552841 7fe7a890e700  5 -- op tracker -- seq: 131507156, time: 2014-11-19 13:00:41.552841, event: write_thread_in_journal_buffer, op: osd_sub_op(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] v 3771'221148 snapset=0=[]:[] snapc=0=[])
    -5> 2014-11-19 13:00:41.553071 7fe7a810d700  5 -- op tracker -- seq: 131507156, time: 2014-11-19 13:00:41.553070, event: journaled_completion_queued, op: osd_sub_op(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] v 3771'221148 snapset=0=[]:[] snapc=0=[])
    -4> 2014-11-19 13:00:41.553126 7fe7a5908700  5 -- op tracker -- seq: 131507156, time: 2014-11-19 13:00:41.553125, event: commit_sent, op: osd_sub_op(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] v 3771'221148 snapset=0=[]:[] snapc=0=[])
    -3> 2014-11-19 13:00:41.553146 7fe7a5908700  1 -- 172.16.200.11:6816/26754 --> 172.16.200.12:6858/14093 -- osd_sub_op_reply(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] ondisk, result = 0) v2 -- ?+0 0x12fd0580 con 0xf87a420
    -2> 2014-11-19 13:00:41.553541 7fe7a6109700  5 -- op tracker -- seq: 131507156, time: 2014-11-19 13:00:41.553540, event: sub_op_applied, op: osd_sub_op(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] v 3771'221148 snapset=0=[]:[] snapc=0=[])
    -1> 2014-11-19 13:00:41.553565 7fe7a6109700  5 -- op tracker -- seq: 131507156, time: 2014-11-19 13:00:41.553564, event: done, op: osd_sub_op(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] v 3771'221148 snapset=0=[]:[] snapc=0=[])
     0> 2014-11-19 13:00:41.572467 7fe78f602700 -1 *** Caught signal (Aborted) **
 in thread 7fe78f602700

 ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
 1: /usr/bin/ceph-osd() [0xba6e6c]
 2: (()+0xf0a0) [0x7fe7b2db40a0]
 3: (gsignal()+0x35) [0x7fe7b171d165]
 4: (abort()+0x180) [0x7fe7b17203e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe7b1f7489d]
 6: (()+0x63996) [0x7fe7b1f72996]
 7: (()+0x639c3) [0x7fe7b1f729c3]
 8: (()+0x63bee) [0x7fe7b1f72bee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xc87260]
 10: (SnapSet::from_snap_set(librados::snap_set_t const&)+0x8c4) [0x8a3d54]
 11: (ReplicatedPG::finish_promote(int, std::tr1::shared_ptr<OpRequest>, ReplicatedPG::CopyResults*, std::tr1::shared_ptr<ObjectContext>)+0x670) [0x9796c0]
 12: (PromoteCallback::finish(boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>)+0x75) [0x9d3305]
 13: (GenContext<boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> >::complete(boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>)+0x15) [0x9a1d35]
 14: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0x451) [0x98bd31]
 15: (C_Copyfrom::finish(int)+0x86) [0x9d3f16]
 16: (Context::complete(int)+0x9) [0x7cd499]
 17: (Finisher::finisher_thread_entry()+0x148) [0xbc7238]
 18: (()+0x6b50) [0x7fe7b2dabb50]
 19: (clone()+0x6d) [0x7fe7b17c77bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The crashing OSDs are the SSD OSDs, one per server (the ssd pool is currently configured with 3 replicas, one per server).
Restarting the affected OSDs causes them to crash again almost immediately after startup.

Changing the cache mode to forward fixes the problem.
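
For reference, that mode change is a single command on the tier (pool name as above):

    ceph osd tier cache-mode ssd-cache forward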

Thanks for your help!

Actions #2

Updated by Sage Weil over 9 years ago

  • Status changed from New to Rejected

Please try using the latest firefly release. 0.87 is comparatively old; many bugs were fixed in 0.80.5 and again in 0.80.7. Thanks!

Actions #3

Updated by Paul Emmerich over 9 years ago

Sage Weil wrote:

Please try using the latest firefly release. 0.87 is comparatively old; many bugs were fixed in 0.80.5 and again in 0.80.7. Thanks!

Uhm, so you are saying that 0.80.x firefly is newer than 0.87 giant?

We actually upgraded from 0.80.7 firefly to 0.87 giant recently due to issue #9765.

Actions #4

Updated by Sage Weil over 9 years ago

  • Subject changed from OSD crashes when enabling a writeback cache tier to osd: crash in SnapSet::from_snap_set
  • Category set to OSD
  • Status changed from Rejected to 12
  • Priority changed from Normal to Urgent

Whoops! I misread the version. Do you have a core file?
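
A full backtrace can be pulled out of a core file along these lines (paths are examples):

    gdb /usr/bin/ceph-osd /path/to/core
    (gdb) thread apply all bt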

Actions #5

Updated by Sage Weil over 9 years ago

  • Status changed from 12 to Need More Info
Actions #6

Updated by Sage Weil about 9 years ago

  • Priority changed from Urgent to High
Actions #7

Updated by Paul Emmerich over 8 years ago

We have not encountered this again since upgrading to 0.9x, so this can be closed.

Actions #8

Updated by Sage Weil over 8 years ago

  • Status changed from Need More Info to Can't reproduce
  • Regression set to No