Bug #38124: OSD down on snaptrim. - RADOS - Ceph

Actions

Copy link

Bug #38124

closed

OSD down on snaptrim.

Added by Darius Kasparavičius about 5 years ago. Updated almost 5 years ago.

Status:

Resolved

Priority:

High

Assignee:

David Zafman

Category:

Target version:

% Done:

Source:

Tags:

Backport:

mimic, nautilus

Regression:

Severity:

2 - major

Reviewed:

Affected Versions:

Ceph - v13.2.4

ceph-qa-suite:

Component(RADOS):

Pull request ID:

27830

Crash signature (v1):

Crash signature (v2):

Description

All of ceph cluster OSD's crash when ceph runs snaptrim.

The particular error osd is throwing before crashing is:

2019-01-31 10:46:01.310 7fbb2fd45700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.4/rpm/el7/BUILD/ceph-13.2.4/src/osd/PrimaryLogPG.h: In function 'Pri
maryLogPG::Trimming::Trimming(boost::statechart::state<PrimaryLogPG::Trimming, PrimaryLogPG::SnapTrimmer, PrimaryLogPG::WaitReservation>::my_context)' thread 7fbb2fd45700 time 2019-01-31 10:46:01.306356
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.4/rpm/el7/BUILD/ceph-13.2.4/src/osd/PrimaryLogPG.h: 1571: FAILED assert(context< SnapTrimmer >().can_trim())

ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7fbb59f3716f]
 2: (()+0x25a337) [0x7fbb59f37337]
 3: (PrimaryLogPG::NotTrimming::react(PrimaryLogPG::KickTrim const&)+0x783) [0x559a606efbc3]
 4: (boost::statechart::simple_state&lt;PrimaryLogPG::NotTrimming, PrimaryLogPG::SnapTrimmer, boost::mpl::list&lt;mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na
, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa9) [0x559a6073b629]
 5: (boost::statechart::state_machine&lt;PrimaryLogPG::SnapTrimmer, PrimaryLogPG::NotTrimming, std::allocator&lt;void&gt;, boost::statechart::null_exception_translator>::process_queued_events()+0xb3) [0x559a60715f23]
 6: (boost::statechart::state_machine&lt;PrimaryLogPG::SnapTrimmer, PrimaryLogPG::NotTrimming, std::allocator&lt;void&gt;, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x87) [0x559a60716187]
 7: (_ZZN12PrimaryLogPG13WaitTrimTimerC4EN5boost10statechart5stateIS0_NS_8TrimmingENS1_3mpl4listIN4mpl_2naES8_S8_S8_S8_S8_S8_S8_S8_S8_S8_S8_S8_S8_S8_S8_S8_S8_S8_S8_EELNS2_12history_modeE0EE10my_contextEEN7OnTimer6finishEi()+0xb2) [0x559a607163a2]
 8: (Context::complete(int)+0x9) [0x559a60580e49]
 9: (SafeTimer::timer_thread()+0x18b) [0x7fbb59f33a8b]
 10: (SafeTimerThread::entry()+0xd) [0x7fbb59f3504d]
 11: (()+0x7dd5) [0x7fbb56ab8dd5]
 12: (clone()+0x6d) [0x7fbb55ba8ead]
 NOTE: a copy of the executable, or `objdump -rdS &lt;executable&gt;` is needed to interpret this.

The rest of the log is attached with the file.

Files

Download all files

ceph-log.zip (282 KB) ceph-log.zip		Darius Kasparavičius, 01/31/2019 11:58 AM
ceph-osd.tar.gz (754 KB) ceph-osd.tar.gz		Darius Kasparavičius, 02/04/2019 10:19 PM

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by Greg Farnum about 5 years ago

Project changed from Ceph to RADOS

Actions

Copy link

Updated by Darius Kasparavičius about 5 years ago

File ceph-osd.tar.gz ceph-osd.tar.gz added

Hello,

I have collected additional information Sage asked. Attached log has debug_osd=20 set.

How this happened:
1. One of the nodes had all it's osd's set to out. To clean them up for replacement.
2. Noticed that a lot of snaptrim was running.
3. Set nosnaptrim flag on the cluster.
4. Once mon_osd_snap_trim_queue_warn_on appeared. Removed nosnaptrim flag.
5. All osds on the cluster crashed and started flapping. Set nosnaptrim flag back on.

Actions

Copy link

Updated by Neha Ojha about 5 years ago

Priority changed from Normal to High

Actions

Copy link

Updated by Greg Farnum about 5 years ago

I was theorizing in a bug scrub that maybe the PG was running behind on OSDMaps and so missing the nosnaptrim flag update, but that isn't the case — the OSD doesn't look at it directly at all, just the PG when it activates a map.

However, since the crash came from the WaitTrimTimer state's timer triggering a transition into NotTrimming and posting a KickTrim event, I think it's safe to say there's some race or missed timer cleanup that causes this when the flag changes state. Since the timer is cleaned up when you exit the WaitTrimTimer state, that also seems a bit odd, but maybe...oh, in fact, I don't see anything that directly kills it. Maybe we do a wide reset() somewhere? Kinda looks like we only do that in PrimaryLogPG::on_change() and that is only triggered in interval changes, though.

(Also interesting: the WaitRepops state goes into WaitTrimTimer if !can_trim() when the replies come back.)

Actions

Copy link

Updated by Darius Kasparavičius about 5 years ago

Hello,

any updates regarding this bug? I would love a patch to resolve this issue ASAP. One of my monitors just died and I can't add new one. As it's throwing slow io errors while trying to synchronise.

Actions

Copy link

Updated by Erikas Kučinskis about 5 years ago

Hello any updates about this?

Actions

Copy link

Updated by Erikas Kučinskis about 5 years ago

Hello it's been two months now is there any update about this bug?

Actions

Copy link

Updated by David Zafman almost 5 years ago

Status changed from New to In Progress
Assignee set to David Zafman

I am able to reproduce this, so I'll work on a fix.

Actions

Copy link

Updated by David Zafman almost 5 years ago

The following script sometimes hits the race and crashes an OSD. I've removed the assert and the script has been running in a loop without seeing any other core dumps.

#! /bin/bash -x

../src/stop.sh
MGR=1 MON=1 MDS=0 OSD=5 ../src/vstart.sh -l -d -n -o osd_snap_trim_sleep=5.0 2> /dev/null
sleep 5
bin/ceph osd pool create test 1 1 2> /dev/null
sleep 5

sleep 2
bin/ceph pg dump pgs 2> /dev/null

for s in $(seq 1 20)
do
  dd if=/dev/urandom bs=1m count=1 of=data
  for i in $(seq 1 100)
  do
    bin/rados -p test put obj$i data 2> /dev/null
  done
  bin/rados -p test mksnap snap${s} 2> /dev/null
done

while(true); do bin/ceph osd set nosnaptrim; sleep 1; bin/ceph osd unset nosnaptrim; done &

for s in $(seq 1 20)
do
  bin/rados -p test rmsnap snap$s
  sleep 3
done

sleep 60
bin/ceph status
bin/ceph osd dump

kill %%
wait
bin/ceph status
bin/ceph osd dump

Actions

Copy link

#10

Updated by David Zafman almost 5 years ago

Pull request ID set to 27830

Actions

Copy link

#11

Updated by David Zafman almost 5 years ago

Status changed from In Progress to Fix Under Review

Actions

Copy link

#12

Updated by Erikas Kučinskis almost 5 years ago

Hi is there any ETA when the bug will be live?

Actions

Copy link

#13

Updated by Erikas Kučinskis almost 5 years ago

Erikas Kučinskis wrote:

Hi is there any ETA when the bug fix will be live?

Actions

Copy link

#14

Updated by Greg Farnum almost 5 years ago

Status changed from Fix Under Review to Pending Backport
Backport set to mimic, nautilus

No ETA; it'll have to wend its way through the backports process. I don't think any releases are imminent so it should be the next point release though.

Actions

Copy link

#15

Updated by Erikas Kučinskis almost 5 years ago

Greg Farnum wrote:

No ETA; it'll have to wend its way through the backports process. I don't think any releases are imminent so it should be the next point release though.

Thank you for the information.

Actions

Copy link

#16

Updated by Nathan Cutler almost 5 years ago

Copied to Backport #39698: mimic: OSD down on snaptrim. added

Actions

Copy link

#17

Updated by Nathan Cutler almost 5 years ago

Copied to Backport #39699: nautilus: OSD down on snaptrim. added

Actions

Copy link

#18

Updated by Nathan Cutler almost 5 years ago

Status changed from Pending Backport to Resolved

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » RADOS

Custom queries

Bug #38124

OSD down on snaptrim.

Updated by Greg Farnum about 5 years ago

Updated by Darius Kasparavičius about 5 years ago

Updated by Neha Ojha about 5 years ago

Updated by Greg Farnum about 5 years ago

Updated by Darius Kasparavičius about 5 years ago

Updated by Erikas Kučinskis about 5 years ago

Updated by Erikas Kučinskis about 5 years ago

Updated by David Zafman almost 5 years ago

Updated by David Zafman almost 5 years ago

Updated by David Zafman almost 5 years ago

Updated by David Zafman almost 5 years ago

Updated by Erikas Kučinskis almost 5 years ago

Updated by Erikas Kučinskis almost 5 years ago

Updated by Greg Farnum almost 5 years ago

Updated by Erikas Kučinskis almost 5 years ago

Updated by Nathan Cutler almost 5 years ago

Updated by Nathan Cutler almost 5 years ago

Updated by Nathan Cutler almost 5 years ago