Bug #15943

crash adding snap to purged_snaps in ReplicatedPG::WaitingOnReplicas

Added by Samuel Just over 1 year ago. Updated 7 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
Start date: 05/19/2016
Due date:
% Done: 0%
Source: other
Tags:
Backport: jewel,hammer
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No

Description

-16> 2016-05-19 14:51:36.794141 7fbb03926700 10 filestore(/var/lib/ceph/osd/ceph-5) _do_transaction on 0x7fbb1e61ee00
-15> 2016-05-19 14:51:36.794139 7fbb03125700 10 filestore oid: #1:88000000::::head# not skipping op, spos 3617.0.0
-14> 2016-05-19 14:51:36.794146 7fbb03125700 10 filestore > header.spos 0.0.0
-13> 2016-05-19 14:51:36.794139 7fbaf3f7d700 20 osd.5 pg_epoch: 290 pg[1.15( v 196'73 (0'0,196'73] local-les=288 n=1 ec=22 les/c/f 288/288/0 287/287/277) [1,5] r=1 lpr=287 pi=8-286/5 luod=0'0 crt=196'73 lcod 0'0 active NIBBLEWISE] agent_stop
-12> 2016-05-19 14:51:36.794121 7fbaee772700 -1 *** Caught signal (Aborted) **
in thread 7fbaee772700 thread_name:tp_osd_tp

ceph version 10.2.0-1069-g3362c8d (3362c8dd2718b1ff61a18bc7f49474e6808c2fc7)
1: (()+0x904ca2) [0x7fbb120eaca2]
2: (()+0x10340) [0x7fbb1048f340]
3: (gsignal()+0x39) [0x7fbb0e4f1cc9]
4: (abort()+0x148) [0x7fbb0e4f50d8]
5: (()+0x2fb86) [0x7fbb0e4eab86]
6: (()+0x2fc32) [0x7fbb0e4eac32]
7: (ReplicatedPG::WaitingOnReplicas::react(ReplicatedPG::SnapTrim const&)+0xf79) [0x7fbb11d04ce9]
8: (boost::statechart::simple_state<ReplicatedPG::WaitingOnReplicas, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb4) [0x7fbb11d34264]
9: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x12b) [0x7fbb11d20afb]
10: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x84) [0x7fbb11d20cc4]
11: (ReplicatedPG::snap_trimmer(unsigned int)+0x46b) [0x7fbb11c9f73b]
12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x8e3) [0x7fbb11b7c1b3]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x877) [0x7fbb121d68f7]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fbb121d8820]
15: (()+0x8182) [0x7fbb10487182]
16: (clone()+0x6d) [0x7fbb0e5b547d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2016-05-19 14:51:32.521796 7fbb03125700 15 filestore(/var/lib/ceph/osd/ceph-5) write meta/#-1:a68b6935:::osdmap.286:0# 0~6055
2016-05-19 14:51:32.521797 7fbaf3f7d700 10 osd.5 pg_epoch: 198 pg[1.1d( v 173'94 (0'0,173'94] local-les=192 n=0 ec=191 les/c/f 192/192/0 177/192/179) [1,5] r=1 lpr=198 crt=173'94 lcod 0'0 inactive NOTIFY NIBBLEWISE] handle_advance_map [3]/[3] -- 3/3
2016-05-19 14:51:32.521806 7fbaf3f7d700 20 PGPool::update cached_removed_snaps [1~9d,a0~1,a2~1,a5~1] newly_removed_snaps [] snapc a7=[a7,a6,a4,a3,a1,9f,9e] (no change)
2016-05-19 14:51:32.521810 7fbaf3f7d700 10 osd.5 pg_epoch: 272 pg[1.1d( v 173'94 (0'0,173'94] local-les=192 n=0 ec=191 les/c/f 192/192/0 177/192/179) [1,5] r=1 lpr=198 crt=173'94 lcod 0'0 inactive NOTIFY NIBBLEWISE] state<Reset>: Reset advmap
2016-05-19 14:51:32.521815 7fbaf3f7d700 10 osd.5 pg_epoch: 272 pg[1.1d( v 173'94 (0'0,173'94] local-les=192 n=0 ec=191 les/c/f 192/192/0 177/192/179) [1,5] r=1 lpr=198 crt=173'94 lcod 0'0 inactive NOTIFY NIBBLEWISE] _calc_past_interval_range start epoch 272 >= end epoch 192, nothing to do
2016-05-19 14:51:32.521816 7fbaf377c700 20 osd.5 272 get_map 229 - loading and decoding 0x7fbb1e834880
2016-05-19 14:51:32.521820 7fbaf3f7d700 20 osd.5 pg_epoch: 272 pg[1.1d( v 173'94 (0'0,173'94] local-les=192 n=0 ec=191 les/c/f 192/192/0 177/192/179) [1,5] r=1 lpr=198 crt=173'94 lcod 0'0 inactive NOTIFY NIBBLEWISE] new interval newup [3] newacting [3]
2016-05-19 14:51:32.521825 7fbaf3f7d700 10 osd.5 pg_epoch: 272 pg[1.1d( v 173'94 (0'0,173'94] local-les=192 n=0 ec=191 les/c/f 192/192/0 177/192/179) [1,5] r=1 lpr=198 crt=173'94 lcod 0'0 inactive NOTIFY NIBBLEWISE] state<Reset>: should restart peering, calling start_peering_interval again
2016-05-19 14:51:32.521829 7fbaf3f7d700 20 osd.5 pg_epoch: 272 pg[1.1d( v 173'94 (0'0,173'94] local-les=192 n=0 ec=191 les/c/f 192/192/0 177/192/179) [1,5] r=1 lpr=198 crt=173'94 lcod 0'0 inactive NOTIFY NIBBLEWISE] set_last_peering_reset 272
2016-05-19 14:51:32.521829 7fbaf377c700 15 filestore(/var/lib/ceph/osd/ceph-5) read meta/#-1:ac5ce935:::osdmap.229:0# 0~0
2016-05-19 14:51:32.521833 7fbaf3f7d700 10 osd.5 pg_epoch: 272 pg[1.1d( v 173'94 (0'0,173'94] local-les=192 n=0 ec=191 les/c/f 192/192/0 177/192/179) [1,5] r=1 lpr=272 crt=173'94 lcod 0'0 inactive NOTIFY NIBBLEWISE] Clearing blocked outgoing recovery messages
2016-05-19 14:51:32.521837 7fbaf3f7d700 10 osd.5 pg_epoch: 272 pg[1.1d( v 173'94 (0'0,173'94] local-les=192 n=0 ec=191 les/c/f 192/192/0 177/192/179) [1,5] r=1 lpr=272 crt=173'94 lcod 0'0 inactive NOTIFY NIBBLEWISE] Not blocking outgoing recovery messages

Starting an OSD after a map gap can leave the cached removed-snaps set in PGPool (cached_removed_snaps) out of date, so the snap trimmer can later try to add an already-purged snap to purged_snaps and assert.


Related issues

Copied to Ceph - Backport #16150: jewel: crash adding snap to purged_snaps in ReplicatedPG::WaitingOnReplicas Resolved
Copied to Ceph - Backport #16151: hammer: crash adding snap to purged_snaps in ReplicatedPG::WaitingOnReplicas Resolved

History

#1 Updated by Samuel Just over 1 year ago

  • Backport set to jewel,hammer

#2 Updated by Samuel Just over 1 year ago

sjust@teuthology:/a/samuelj-2016-05-18_16:52:37-rados-wip-sam-testing-distro-basic-smithi/200274

#3 Updated by Sage Weil about 1 year ago

  • Status changed from Testing to Pending Backport

#5 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #16150: jewel: crash adding snap to purged_snaps in ReplicatedPG::WaitingOnReplicas added

#6 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #16151: hammer: crash adding snap to purged_snaps in ReplicatedPG::WaitingOnReplicas added

#7 Updated by Nathan Cutler 9 months ago

  • Status changed from Pending Backport to Resolved
  • Needs Doc set to No

#8 Updated by Samuel Just 8 months ago

  • Status changed from Resolved to In Progress

Bah, I don't think my fix was right.

#9 Updated by Samuel Just 8 months ago

  • Status changed from In Progress to Pending Backport

This needs to be backported again to both jewel and hammer.

#11 Updated by Nathan Cutler 7 months ago

  • Status changed from Pending Backport to Resolved
