Bug #10138 (closed)

osd: crash in SnapSet::from_snap_set

Added by Paul Emmerich over 9 years ago. Updated over 8 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are running ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578) in a 3-server setup with 18 OSDs (4 HDDs and 2 SSDs per server).
We created two pools: 'data' on the HDD OSDs and 'ssd-cache' on the SSD OSDs.

We tried to use the 'ssd-cache' pool as a caching tier for the 'data' pool with the following configuration:

writeback mode, bloom hit set, hit_set_count = 1, hit_set_period = 3600, target_max_bytes = 700 GB
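
For reference, a tier like this is typically set up with commands along the following lines. This is a sketch, not the exact commands we ran, and the target_max_bytes value shown is 700 GiB expressed in bytes:

    # attach 'ssd-cache' as a writeback tier in front of 'data'
    ceph osd tier add data ssd-cache
    ceph osd tier cache-mode ssd-cache writeback
    ceph osd tier set-overlay data ssd-cache

    # hit set and sizing parameters described above
    ceph osd pool set ssd-cache hit_set_type bloom
    ceph osd pool set ssd-cache hit_set_count 1
    ceph osd pool set ssd-cache hit_set_period 3600
    ceph osd pool set ssd-cache target_max_bytes 751619276800  # 700 GiB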

The setup works as expected for a few hours under moderate load (a few test VMs), but an OSD then crashes with the following log:

    -6> 2014-11-19 13:00:41.552841 7fe7a890e700  5 -- op tracker -- seq: 131507156, time: 2014-11-19 13:00:41.552841, event: write_thread_in_journal_buffer, op: osd_sub_op(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] v 3771'221148 snapset=0=[]:[] snapc=0=[])
    -5> 2014-11-19 13:00:41.553071 7fe7a810d700  5 -- op tracker -- seq: 131507156, time: 2014-11-19 13:00:41.553070, event: journaled_completion_queued, op: osd_sub_op(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] v 3771'221148 snapset=0=[]:[] snapc=0=[])
    -4> 2014-11-19 13:00:41.553126 7fe7a5908700  5 -- op tracker -- seq: 131507156, time: 2014-11-19 13:00:41.553125, event: commit_sent, op: osd_sub_op(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] v 3771'221148 snapset=0=[]:[] snapc=0=[])
    -3> 2014-11-19 13:00:41.553146 7fe7a5908700  1 -- 172.16.200.11:6816/26754 --> 172.16.200.12:6858/14093 -- osd_sub_op_reply(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] ondisk, result = 0) v2 -- ?+0 0x12fd0580 con 0xf87a420
    -2> 2014-11-19 13:00:41.553541 7fe7a6109700  5 -- op tracker -- seq: 131507156, time: 2014-11-19 13:00:41.553540, event: sub_op_applied, op: osd_sub_op(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] v 3771'221148 snapset=0=[]:[] snapc=0=[])
    -1> 2014-11-19 13:00:41.553565 7fe7a6109700  5 -- op tracker -- seq: 131507156, time: 2014-11-19 13:00:41.553564, event: done, op: osd_sub_op(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] v 3771'221148 snapset=0=[]:[] snapc=0=[])
     0> 2014-11-19 13:00:41.572467 7fe78f602700 -1 *** Caught signal (Aborted) **
 in thread 7fe78f602700

 ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
 1: /usr/bin/ceph-osd() [0xba6e6c]
 2: (()+0xf0a0) [0x7fe7b2db40a0]
 3: (gsignal()+0x35) [0x7fe7b171d165]
 4: (abort()+0x180) [0x7fe7b17203e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe7b1f7489d]
 6: (()+0x63996) [0x7fe7b1f72996]
 7: (()+0x639c3) [0x7fe7b1f729c3]
 8: (()+0x63bee) [0x7fe7b1f72bee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xc87260]
 10: (SnapSet::from_snap_set(librados::snap_set_t const&)+0x8c4) [0x8a3d54]
 11: (ReplicatedPG::finish_promote(int, std::tr1::shared_ptr<OpRequest>, ReplicatedPG::CopyResults*, std::tr1::shared_ptr<ObjectContext>)+0x670) [0x9796c0]
 12: (PromoteCallback::finish(boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>)+0x75) [0x9d3305]
 13: (GenContext<boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> >::complete(boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>)+0x15) [0x9a1d35]
 14: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0x451) [0x98bd31]
 15: (C_Copyfrom::finish(int)+0x86) [0x9d3f16]
 16: (Context::complete(int)+0x9) [0x7cd499]
 17: (Finisher::finisher_thread_entry()+0x148) [0xbc7238]
 18: (()+0x6b50) [0x7fe7b2dabb50]
 19: (clone()+0x6d) [0x7fe7b17c77bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The crashing OSDs are the SSD OSDs, one per server (the ssd pool is currently configured with 3 replicas, one per server).
Restarting the affected OSDs causes them to crash again almost immediately after startup.

Changing the cache mode to forward fixes the problem.
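
For reference, that mode change is a single command on the tier (pool name as above):

    ceph osd tier cache-mode ssd-cache forward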

Thanks for your help!

Actions #2

Updated by Sage Weil over 9 years ago

  • Status changed from New to Rejected

Please try using the latest firefly release. 0.87 is comparatively old; many bugs were fixed in 0.80.5 and again in 0.80.7. Thanks!

Actions #3

Updated by Paul Emmerich over 9 years ago

Sage Weil wrote:

Please try using the latest firefly release. 0.87 is comparatively old; many bugs were fixed in 0.80.5 and again in 0.80.7. Thanks!

Uhm, so you are saying that 0.80.x firefly is newer than 0.87 giant?

We actually upgraded from 0.80.7 firefly to 0.87 giant recently due to issue #9765.

Actions #4

Updated by Sage Weil over 9 years ago

  • Subject changed from OSD crashes when enabling a writeback cache tier to osd: crash in SnapSet::from_snap_set
  • Category set to OSD
  • Status changed from Rejected to 12
  • Priority changed from Normal to Urgent

Whoops! I misread the version. Do you have a core file?
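
A full backtrace can be pulled out of a core file along these lines (paths are examples):

    gdb /usr/bin/ceph-osd /path/to/core
    (gdb) thread apply all bt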

Actions #5

Updated by Sage Weil over 9 years ago

  • Status changed from 12 to Need More Info
Actions #6

Updated by Sage Weil about 9 years ago

  • Priority changed from Urgent to High
Actions #7

Updated by Paul Emmerich over 8 years ago

We have not encountered this again since upgrading to 0.9x, so this can be closed.

Actions #8

Updated by Sage Weil over 8 years ago

  • Status changed from Need More Info to Can't reproduce
  • Regression set to No