Bug #15943

crash adding snap to purged_snaps in ReplicatedPG::WaitingOnReplicas

Added by Samuel Just over 1 year ago. Updated 7 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
Start date: 05/19/2016
Due date:
% Done: 0%
Source: other
Tags:
Backport: jewel,hammer
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No

Description

-16> 2016-05-19 14:51:36.794141 7fbb03926700 10 filestore(/var/lib/ceph/osd/ceph-5) _do_transaction on 0x7fbb1e61ee00
-15> 2016-05-19 14:51:36.794139 7fbb03125700 10 filestore oid: #1:88000000::::head# not skipping op, spos 3617.0.0
-14> 2016-05-19 14:51:36.794146 7fbb03125700 10 filestore > header.spos 0.0.0
-13> 2016-05-19 14:51:36.794139 7fbaf3f7d700 20 osd.5 pg_epoch: 290 pg[1.15( v 196'73 (0'0,196'73] local-les=288 n=1 ec=22 les/c/f 288/288/0 287/287/277) [1,5] r=1 lpr=287 pi=8-286/5 luod=0'0 crt=196'73 lcod 0'0 active NIBBLEWISE] agent_stop
-12> 2016-05-19 14:51:36.794121 7fbaee772700 -1 *** Caught signal (Aborted) **
in thread 7fbaee772700 thread_name:tp_osd_tp

ceph version 10.2.0-1069-g3362c8d (3362c8dd2718b1ff61a18bc7f49474e6808c2fc7)
1: (()+0x904ca2) [0x7fbb120eaca2]
2: (()+0x10340) [0x7fbb1048f340]
3: (gsignal()+0x39) [0x7fbb0e4f1cc9]
4: (abort()+0x148) [0x7fbb0e4f50d8]
5: (()+0x2fb86) [0x7fbb0e4eab86]
6: (()+0x2fc32) [0x7fbb0e4eac32]
7: (ReplicatedPG::WaitingOnReplicas::react(ReplicatedPG::SnapTrim const&)+0xf79) [0x7fbb11d04ce9]
8: (boost::statechart::simple_state<ReplicatedPG::WaitingOnReplicas, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb4) [0x7fbb11d34264]
9: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x12b) [0x7fbb11d20afb]
10: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x84) [0x7fbb11d20cc4]
11: (ReplicatedPG::snap_trimmer(unsigned int)+0x46b) [0x7fbb11c9f73b]
12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x8e3) [0x7fbb11b7c1b3]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x877) [0x7fbb121d68f7]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fbb121d8820]
15: (()+0x8182) [0x7fbb10487182]
16: (clone()+0x6d) [0x7fbb0e5b547d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2016-05-19 14:51:32.521796 7fbb03125700 15 filestore(/var/lib/ceph/osd/ceph-5) write meta/#-1:a68b6935:::osdmap.286:0# 0~6055
2016-05-19 14:51:32.521797 7fbaf3f7d700 10 osd.5 pg_epoch: 198 pg[1.1d( v 173'94 (0'0,173'94] local-les=192 n=0 ec=191 les/c/f 192/192/0 177/192/179) [1,5] r=1 lpr=198 crt=173'94 lcod 0'0 inactive NOTIFY NIBBLEWISE] handle_advance_map [3]/[3] -- 3/3
2016-05-19 14:51:32.521806 7fbaf3f7d700 20 PGPool::update cached_removed_snaps [1~9d,a0~1,a2~1,a5~1] newly_removed_snaps [] snapc a7=[a7,a6,a4,a3,a1,9f,9e] (no change)
2016-05-19 14:51:32.521810 7fbaf3f7d700 10 osd.5 pg_epoch: 272 pg[1.1d( v 173'94 (0'0,173'94] local-les=192 n=0 ec=191 les/c/f 192/192/0 177/192/179) [1,5] r=1 lpr=198 crt=173'94 lcod 0'0 inactive NOTIFY NIBBLEWISE] state<Reset>: Reset advmap
2016-05-19 14:51:32.521815 7fbaf3f7d700 10 osd.5 pg_epoch: 272 pg[1.1d( v 173'94 (0'0,173'94] local-les=192 n=0 ec=191 les/c/f 192/192/0 177/192/179) [1,5] r=1 lpr=198 crt=173'94 lcod 0'0 inactive NOTIFY NIBBLEWISE] _calc_past_interval_range start epoch 272 >= end epoch 192, nothing to do
2016-05-19 14:51:32.521816 7fbaf377c700 20 osd.5 272 get_map 229 - loading and decoding 0x7fbb1e834880
2016-05-19 14:51:32.521820 7fbaf3f7d700 20 osd.5 pg_epoch: 272 pg[1.1d( v 173'94 (0'0,173'94] local-les=192 n=0 ec=191 les/c/f 192/192/0 177/192/179) [1,5] r=1 lpr=198 crt=173'94 lcod 0'0 inactive NOTIFY NIBBLEWISE] new interval newup [3] newacting [3]
2016-05-19 14:51:32.521825 7fbaf3f7d700 10 osd.5 pg_epoch: 272 pg[1.1d( v 173'94 (0'0,173'94] local-les=192 n=0 ec=191 les/c/f 192/192/0 177/192/179) [1,5] r=1 lpr=198 crt=173'94 lcod 0'0 inactive NOTIFY NIBBLEWISE] state<Reset>: should restart peering, calling start_peering_interval again
2016-05-19 14:51:32.521829 7fbaf3f7d700 20 osd.5 pg_epoch: 272 pg[1.1d( v 173'94 (0'0,173'94] local-les=192 n=0 ec=191 les/c/f 192/192/0 177/192/179) [1,5] r=1 lpr=198 crt=173'94 lcod 0'0 inactive NOTIFY NIBBLEWISE] set_last_peering_reset 272
2016-05-19 14:51:32.521829 7fbaf377c700 15 filestore(/var/lib/ceph/osd/ceph-5) read meta/#-1:ac5ce935:::osdmap.229:0# 0~0
2016-05-19 14:51:32.521833 7fbaf3f7d700 10 osd.5 pg_epoch: 272 pg[1.1d( v 173'94 (0'0,173'94] local-les=192 n=0 ec=191 les/c/f 192/192/0 177/192/179) [1,5] r=1 lpr=272 crt=173'94 lcod 0'0 inactive NOTIFY NIBBLEWISE] Clearing blocked outgoing recovery messages
2016-05-19 14:51:32.521837 7fbaf3f7d700 10 osd.5 pg_epoch: 272 pg[1.1d( v 173'94 (0'0,173'94] local-les=192 n=0 ec=191 les/c/f 192/192/0 177/192/179) [1,5] r=1 lpr=272 crt=173'94 lcod 0'0 inactive NOTIFY NIBBLEWISE] Not blocking outgoing recovery messages

Starting an OSD after a map gap can leave the cached removed-snaps set in PGPool (cached_removed_snaps) out of date, so the snap trimmer can later try to add an already-purged snap to purged_snaps and assert.


Related issues

Copied to Ceph - Backport #16150: jewel: crash adding snap to purged_snaps in ReplicatedPG::WaitingOnReplicas Resolved
Copied to Ceph - Backport #16151: hammer: crash adding snap to purged_snaps in ReplicatedPG::WaitingOnReplicas Resolved

History

#1 Updated by Samuel Just over 1 year ago

  • Backport set to jewel,hammer

#2 Updated by Samuel Just over 1 year ago

sjust@teuthology:/a/samuelj-2016-05-18_16:52:37-rados-wip-sam-testing-distro-basic-smithi/200274

#3 Updated by Sage Weil about 1 year ago

  • Status changed from Testing to Pending Backport

#5 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #16150: jewel: crash adding snap to purged_snaps in ReplicatedPG::WaitingOnReplicas added

#6 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #16151: hammer: crash adding snap to purged_snaps in ReplicatedPG::WaitingOnReplicas added

#7 Updated by Nathan Cutler 9 months ago

  • Status changed from Pending Backport to Resolved
  • Needs Doc set to No

#8 Updated by Samuel Just 8 months ago

  • Status changed from Resolved to In Progress

Bah, I don't think my fix was right.

#9 Updated by Samuel Just 8 months ago

  • Status changed from In Progress to Pending Backport

This needs to be backported again to both jewel and hammer.

#11 Updated by Nathan Cutler 7 months ago

  • Status changed from Pending Backport to Resolved
