Bug #10138
osd: crash in SnapSet::from_snap_set
Status: Closed
Description
We are running ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578) in a 3-server setup with 18 OSDs (4 HDDs and 2 SSDs per server).
We created two pools: 'data' on the HDD OSDs and 'ssd-cache' on the SSD OSDs.
We tried to use the 'ssd-cache' pool as a caching tier for the 'data' pool with the following configuration:
cache-mode writeback, hit_set_type bloom, hit_set_count = 1, hit_set_period = 3600, target_max_bytes = 700 GB
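For reference, a tier with these parameters is typically set up with commands along the following lines (pool names as above; the exact sequence we used may have differed slightly, and target_max_bytes is 700 GB expressed in bytes):
# attach 'ssd-cache' as a cache tier in front of 'data' and route client I/O through it
ceph osd tier add data ssd-cache
ceph osd tier cache-mode ssd-cache writeback
ceph osd tier set-overlay data ssd-cache
# hit-set and sizing parameters as described above
ceph osd pool set ssd-cache hit_set_type bloom
ceph osd pool set ssd-cache hit_set_count 1
ceph osd pool set ssd-cache hit_set_period 3600
ceph osd pool set ssd-cache target_max_bytes 751619276800   # 700 * 1024^3 bytes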
The setup works as expected for a few hours under moderate load (a few test VMs), but then it crashes with the following OSD log:
6> 2014-11-19 13:00:41.552841 7fe7a890e700 5 - op tracker -- seq: 131507156, time: 2014-11-19 13:00:41.552841, event: write_thread_in_journal_buffer, op: osd_sub_op(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] v 3771'221148 snapset=0=[]:[] snapc=0=[])
5> 2014-11-19 13:00:41.553071 7fe7a810d700 5 - op tracker -- seq: 131507156, time: 2014-11-19 13:00:41.553070, event: journaled_completion_queued, op: osd_sub_op(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] v 3771'221148 snapset=0=[]:[] snapc=0=[])
4> 2014-11-19 13:00:41.553126 7fe7a5908700 5 - op tracker -- seq: 131507156, time: 2014-11-19 13:00:41.553125, event: commit_sent, op: osd_sub_op(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] v 3771'221148 snapset=0=[]:[] snapc=0=[])
3> 2014-11-19 13:00:41.553146 7fe7a5908700 1 - 172.16.200.11:6816/26754 --> 172.16.200.12:6858/14093 -- osd_sub_op_reply(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] ondisk, result = 0) v2 -- ?+0 0x12fd0580 con 0xf87a420
2> 2014-11-19 13:00:41.553541 7fe7a6109700 5 - op tracker -- seq: 131507156, time: 2014-11-19 13:00:41.553540, event: sub_op_applied, op: osd_sub_op(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] v 3771'221148 snapset=0=[]:[] snapc=0=[])
1> 2014-11-19 13:00:41.553565 7fe7a6109700 5 - op tracker -- seq: 131507156, time: 2014-11-19 13:00:41.553564, event: done, op: osd_sub_op(client.844955.0:352418449 7.2c 7bd64cec/rbd_data.5797d2ae8944a.0000000000064882/head//7 [] v 3771'221148 snapset=0=[]:[] snapc=0=[])
0> 2014-11-19 13:00:41.572467 7fe78f602700 -1 *** Caught signal (Aborted) **
in thread 7fe78f602700
ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
1: /usr/bin/ceph-osd() [0xba6e6c]
2: (()+0xf0a0) [0x7fe7b2db40a0]
3: (gsignal()+0x35) [0x7fe7b171d165]
4: (abort()+0x180) [0x7fe7b17203e0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe7b1f7489d]
6: (()+0x63996) [0x7fe7b1f72996]
7: (()+0x639c3) [0x7fe7b1f729c3]
8: (()+0x63bee) [0x7fe7b1f72bee]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xc87260]
10: (SnapSet::from_snap_set(librados::snap_set_t const&)+0x8c4) [0x8a3d54]
11: (ReplicatedPG::finish_promote(int, std::tr1::shared_ptr<OpRequest>, ReplicatedPG::CopyResults*, std::tr1::shared_ptr<ObjectContext>)+0x670) [0x9796c0]
12: (PromoteCallback::finish(boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>)+0x75) [0x9d3305]
13: (GenContext<boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> >::complete(boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>)+0x15) [0x9a1d35]
14: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0x451) [0x98bd31]
15: (C_Copyfrom::finish(int)+0x86) [0x9d3f16]
16: (Context::complete(int)+0x9) [0x7cd499]
17: (Finisher::finisher_thread_entry()+0x148) [0xbc7238]
18: (()+0x6b50) [0x7fe7b2dabb50]
19: (clone()+0x6d) [0x7fe7b17c77bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
The crashing OSDs are SSD OSDs, one per server (the 'ssd-cache' pool is currently configured with 3 replicas, one per server).
Restarting the affected OSDs causes them to crash again almost immediately after startup.
Changing the cache mode to forward fixes the problem.
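For completeness, and assuming the pool names above, the mode change was of the form:
ceph osd tier cache-mode ssd-cache forward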
Thanks for your help!