Bug #6101

closed

ceph-osd crash on corrupted store

Added by Mikaël Cluseau over 10 years ago. Updated over 5 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
v10.2.9
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I see a problem with one of my Dumpling OSDs under Debian, on a compressed btrfs. I think my btrfs is corrupted or somewhat inconsistent (it's common, at least with my workload).

But this corruption makes the OSD crash, so I think it's worth reporting as a bug. It seems that the OSD doesn't crash when set "out".

 ceph version 0.67.1 (e23b817ad0cf1ea19c0a7b7c9999b30bed37d533)
 1: ceph-osd() [0x8fd702]
 2: (()+0xf030) [0x7fc5a59b3030]
 3: (gsignal()+0x35) [0x7fc5a3e5d475]
 4: (abort()+0x180) [0x7fc5a3e606f0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc5a46b289d]
 6: (()+0x63996) [0x7fc5a46b0996]
 7: (()+0x639c3) [0x7fc5a46b09c3]
 8: (()+0x63bee) [0x7fc5a46b0bee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x9bf04a]
 10: (ReplicatedPG::trim_object(hobject_t const&)+0x150) [0x6f8500]
 11: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x249) [0x701ba9]
 12: (boost::statechart::detail::reaction_result boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list<boost::statechart::custom_reaction<ReplicatedPG::SnapTrim>, boost::statechart::transition<ReplicatedPG::Reset, ReplicatedPG::NotTrimming, boost::statechart::detail::no_context<ReplicatedPG::Reset>, &boost::statechart::detail::no_context<ReplicatedPG::Reset>::no_function>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x96) [0x7540c6]
 13: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x137) [0x72ed47]
 14: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x26) [0x72ff06]
 15: (ReplicatedPG::snap_trimmer()+0x423) [0x6ca773]
 16: (OSD::SnapTrimWQ::_process(PG*)+0x14) [0x7a33f4]
 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x9b38ea]
 18: (ThreadPool::WorkThread::entry()+0x10) [0x9b4b40]
 19: (()+0x6b50) [0x7fc5a59aab50]
 20: (clone()+0x6d) [0x7fc5a3f05a7d]

Files

ceph-osd.1.log.xz (1.53 MB) ceph-osd.1.log.xz osd.1 log Mikaël Cluseau, 08/22/2013 08:53 PM
ceph-mon.dbo-prod-mx-bdd1.log.xz (2.46 MB) ceph-mon.dbo-prod-mx-bdd1.log.xz Mikaël Cluseau, 08/27/2013 01:11 AM
ceph-osd.1.log.xz (3.31 MB) ceph-osd.1.log.xz Mikaël Cluseau, 08/27/2013 01:11 AM
ceph-osd.5.log.xz (345 KB) ceph-osd.5.log.xz Mikaël Cluseau, 08/27/2013 01:11 AM
ceph.log.xz (1.31 MB) ceph.log.xz Mikaël Cluseau, 08/27/2013 01:11 AM
ceph-mon.dsi-sei-dev-3.log.xz (6.49 MB) ceph-mon.dsi-sei-dev-3.log.xz Mikaël Cluseau, 08/27/2013 01:11 AM
ceph-osd.0.log.xz (1.62 MB) ceph-osd.0.log.xz Mikaël Cluseau, 08/27/2013 01:11 AM
ceph-osd.2.log.xz (38.9 MB) ceph-osd.2.log.xz Mikaël Cluseau, 08/27/2013 01:11 AM
0001-Don-t-assert-if-trimming-of-object-fails.patch (1.32 KB) 0001-Don-t-assert-if-trimming-of-object-fails.patch Daniel Poelzleithner, 02/10/2014 06:44 PM
0001-Don-t-assert-if-trimming-of-object-fails.patch (1.34 KB) 0001-Don-t-assert-if-trimming-of-object-fails.patch Daniel Poelzleithner, 02/11/2014 12:48 PM
ceph-trim-missing-snapshot.patch (2.24 KB) ceph-trim-missing-snapshot.patch Nathan O'Sullivan, 02/18/2015 11:07 PM
Actions #1

Updated by Samuel Just over 10 years ago

  • Project changed from rbd to Ceph
  • Status changed from New to Can't reproduce
  • Target version deleted (v0.67 - Dumpling)

Our handling of this situation could be better, but basically it's crashing because it got messed-up information out of the fs. It's not crashing when marked out because out OSDs typically don't get to be primary for their PGs and therefore don't trim. You probably want to let the cluster recover and then replace the OSD.

Actions #2

Updated by Mikaël Cluseau over 10 years ago

After some work this weekend, I tried quite a few things to keep the problem contained in the cluster (to avoid my users running into it again), but without success. The crash finally appeared on another OSD (osd.0), on the second node, while letting the cluster recover. It was on btrfs too, so I cleaned up osd.1 and reformatted it with XFS. Letting the cluster recover ended in the same crash under XFS.

I tried an infinite loop to restart the OSD, hoping to leave the "bad area" behind at some point. It seemed to work a bit, but it finally ended up crashing after a few milliseconds of service. So I had to rebuild the whole cluster to give the service back to the developers.

IMHO, Ceph shouldn't crash when getting corrupted data from the disk, so I filed a bug. The fact that the error seemed to be transferred between the OSDs, crashing the other ones, looks very wrong to me. I'll keep the logs for a few weeks; if someone decides to reopen this issue I'll upload them.

Actions #3

Updated by Samuel Just over 10 years ago

  • Status changed from Can't reproduce to New

Ah. That's less good. The logs would be the place to start. What kernel version are you running?

Actions #4

Updated by Samuel Just over 10 years ago

52f622a2/rbd_data.22d4c74b0dc51.0000000000002853/c4//2 seems to be the missing object.

Can you locate that object and its other snaps on each node (probably only exists on a few nodes) and attach a dump of the xattrs from each one?

You should be able to find each clone of that object with find . -name "*52f622a2*" in each osd's current/2.22_head/ directory.

If you already killed the cluster, any other crash logs would be good.
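
For reference, the search and xattr dump could look something like this on each OSD host (a sketch only; the data path and OSD id below are assumptions, adjust them to your deployment):

 cd /var/lib/ceph/osd/ceph-1/current/2.22_head
 find . -name "*52f622a2*"
 # dump the xattrs of every clone that turns up
 find . -name "*52f622a2*" -exec getfattr -d {} \;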

Actions #5

Updated by Mikaël Cluseau over 10 years ago

Hi, thanks for following the issue :) I couldn't upload the following logs from where I was, so it took me some time, sorry about that. The cluster was already dead, yes, because I had to keep the devs busy.

The logs are as follows (source_host/logname):
2,5M dbo-prod-mx-bdd1/ceph-mon.dbo-prod-mx-bdd1.log.xz
3,4M dbo-prod-mx-bdd1/ceph-osd.1.log.xz
348K dbo-prod-mx-bdd1/ceph-osd.5.log.xz
1,4M dsi-sei-dev-3/ceph.log.xz
6,5M dsi-sei-dev-3/ceph-mon.dsi-sei-dev-3.log.xz
1,7M dsi-sei-dev-3/ceph-osd.0.log.xz
39M dsi-sei-dev-3/ceph-osd.2.log.xz

Actions #6

Updated by Mikaël Cluseau over 10 years ago

Samuel Just wrote:

What kernel version are you running?

The Debian wheezy kernel package, so it's a 3.2 (linux-image-3.2.0-4-amd64_3.2.41-2+deb7u2_amd64.deb).

Actions #7

Updated by Daniel Poelzleithner about 10 years ago

Today I came across a cluster that suffers from the same problem.

ceph version 0.67.5 (a60ac9194718083a4b6a225fc17cad6096c69bd1)

1: /usr/bin/ceph-osd() [0x7f9bba]
2: (()+0xfcb0) [0x7fb29b121cb0]
3: (gsignal()+0x35) [0x7fb299892425]
4: (abort()+0x17b) [0x7fb299895b8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fb29a1e569d]
6: (()+0xb5846) [0x7fb29a1e3846]
7: (()+0xb5873) [0x7fb29a1e3873]
8: (()+0xb596e) [0x7fb29a1e396e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x8bf3df]
10: (ReplicatedPG::trim_object(hobject_t const&)+0x150) [0x5e9400]
11: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x423) [0x618613]
12: (boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xbc) [0x646eec]
13: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xfb) [0x62d8ab]
14: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x1e) [0x62da3e]
15: (ReplicatedPG::snap_trimmer()+0x516) [0x5cb1f6]
16: (OSD::SnapTrimWQ::_process(PG*)+0x14) [0x69a3b4]
17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x8b24f6]
18: (ThreadPool::WorkThread::entry()+0x10) [0x8b4300]
19: (()+0x7e9a) [0x7fb29b119e9a]
20: (clone()+0x6d) [0x7fb2999503fd]

One OSD is constantly offline. After the node rejoins the cluster, one corrupted PG causes the assert in trim_object() to trigger:


  int r = find_object_context(coid, &obc, false, NULL);
  if (r == -ENOENT || coid.snap != obc->obs.oi.soid.snap) {
    derr << __func__ << "could not find coid " << coid << dendl;
    assert(0);
  }

gdb> print coid

$2 = (const hobject_t &) @0x214dd848: {oid = {name = {static npos = <optimized out>,
      _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
        _M_p = 0x20ea6608 "rbd_data.5d5c0d2ae8944a.", '0' <repeats 13 times>, "c89"}}}, snap = {val = 4947}, hash = 1936754598, max = false,
  pool = 2, nspace = {static npos = <optimized out>,
    _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0xd27b38 ""}}, key = {
    static npos = <optimized out>,
    _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0xd27b38 ""}}}

We tracked the object down to a block device in RADOS that contained a Windows VM and was a clone of a template image.

While debugging, it seemed like find_object_context(coid, &obc, false, NULL) was called from other places without a problem; only when called from trim_object() does it crash the server.

Before the problem started, there was a problem with the router that connects the Ceph cluster, leading to a network downtime of ~2 minutes. I don't know whether only one or many nodes were affected by the network outage. Each node runs ~7 ceph-osd instances, with 4 nodes in total.

Please mark this bug as more urgent, because this is a case where data corruption occurred and causes permanent node failure.

Our customer will now try to remove the rados image to see if this solves the problem.

Actions #8

Updated by Daniel Poelzleithner about 10 years ago

The nodes use XFS, so I don't think that filesystem corruption (the original assumption) is the only way to trigger this problem.

Actions #9

Updated by Daniel Poelzleithner about 10 years ago

find_object_context() in trim_object returned -ENOENT.

But I couldn't track down why; optimized builds were driving me nuts, and every time the debugger stopped, the node got into a strange state when paused for about 5 seconds: constantly reconnecting errors...

Actions #10

Updated by Daniel Poelzleithner about 10 years ago

The initial problem was thought to be related to snapshots, so they cleaned all snapshots on the cluster, but this did not fix it.

The identified VM was a clone of a template machine.

While we were debugging the problem, I set a breakpoint on find_object_context(), filtered to the PG that caused the crash. The function seemed to return valid results from other contexts of the state machine; only when trim_object() calls it does it return "not found".

Actions #11

Updated by Daniel Poelzleithner about 10 years ago

We deleted the rados block that was in the stack trace, but deleting it did not solve the problem.
The node still crashes when syncing the cluster.

Actions #12

Updated by Daniel Poelzleithner about 10 years ago

The patch removes the assert in case the trim event is delivered for a non-existing object context.
I think this is safe and should only happen in case of this strange corruption.

Could someone who knows this code more deeply have a look at it, please?

Actions #13

Updated by Florian Haas about 10 years ago

Daniel Poelzleithner wrote:

We deleted the rados block that was in the stack trace, but deleting it did not solve the problem.
The node still crashes when syncing the cluster.

Just a hunch:

You know what the offending object is, right? Map it with ceph osd map <poolname> <objname>. Then look into the OSDs that have it, find the file, and run getfattr -d <filename> on all your copies. Let me know if any of your copies returns -E2BIG (i.e. bash telling you "argument list too long").
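
For example (the pool name is a placeholder, the OSD data path is illustrative, and the object name is taken from the gdb dump above):

 ceph osd map <poolname> rbd_data.5d5c0d2ae8944a.0000000000000c89
 # on each OSD listed in the mapping, locate the on-disk file and dump its xattrs
 find /var/lib/ceph/osd/ceph-*/current -name "*5d5c0d2ae8944a*"
 getfattr -d <filename>    # E2BIG shows up as "Argument list too long"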

Actions #14

Updated by Daniel Poelzleithner about 10 years ago

(18:02:26) sage: poelzi: looks reasonable. can you make it log to the cluster log about the error, though? IIRC it is clog->error() << ... or similar.

Updated patch to address this.
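
In rough terms, the updated patch does something like the sketch below (illustrative only; the exact return value and cluster-log call differ between releases — see the attached 0001-Don-t-assert-if-trimming-of-object-fails.patch for the real code):

  int r = find_object_context(coid, &obc, false, NULL);
  if (r == -ENOENT || coid.snap != obc->obs.oi.soid.snap) {
    // instead of assert(0): report the inconsistency and skip this object,
    // so a single missing clone no longer takes the whole OSD down
    derr << __func__ << " could not find coid " << coid << dendl;
    osd->clog.error() << "trim_object: could not find " << coid
                      << ", skipping snap trim for this object\n";
    return NULL;  // assumption: a NULL result makes the caller move on to the next object
  }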

Actions #15

Updated by Daniel Poelzleithner about 10 years ago

We applied the fix last week on the crashing node.
The OSD node now runs again, but we see that the missing objects get trim events roughly every 1-2 seconds; they fail, of course, but do not seem to do any harm anymore.
Any idea what causes this and how it can be fixed?

Actions #16

Updated by Loïc Dachary over 9 years ago

  • Status changed from New to 12
  • Priority changed from Low to High

Bumping so it does not get to the bottom of the list for the next bug scrub.

Actions #17

Updated by Samuel Just over 9 years ago

  • Status changed from 12 to Can't reproduce
Actions #18

Updated by Nathan O'Sullivan about 9 years ago

Running 0.80.8, we experienced this problem on a few OSDs after performing a full system reset when a host went OOM during a long-running recovery operation. The OSD would log (for example):

snaptrimq=[87bf~368,8b28~4f7]] trim_objectcould not find coid 73284b5b/rbd_data.c3571a2ae8944a.000000000000123d/87bf//4
osd/ReplicatedPG.cc: 2301: FAILED assert(0)

( https://github.com/ceph/ceph/blob/firefly/src/osd/ReplicatedPG.cc#L2301 )

Presumably, the snap trim was mid-flight when the host system was reset.

The general pattern of data I observed is:
1. The object's SnapSet xattr had been correctly updated, as though the snap object had been deleted
2. Generally, the snap object itself was still on disk. I removed these manually, since they would trigger other asserts
3. The snap ID was still in the snaptrimq

Thus every time the snap trimmer tried to process the already-deleted snap, it would assert as above.

One can temporarily prevent the assert from triggering by setting:
osd_pg_max_concurrent_snap_trims = 0

This of course results in no snaps being deleted, but at least keeps the OSD online.
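
For example, one way to apply that setting (illustrative; injectargs changes the running OSDs immediately, while the ceph.conf entry persists across restarts):

 ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 0'

 # or, in ceph.conf under [osd]:
 osd pg max concurrent snap trims = 0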

I've attached a patch based on Daniel Poelzleithner's work and a suggestion made by Sage Weil on the ceph-devel mailing list to allow the snap trimmer to skip over deleting the snap, while still removing the snap from the snaptrimq.

After restarting the OSD with the patch, instead of the assert we see one entry:

log [ERR] : TrimmingObjects cannot find snap, dropping from snaptrimq: 73284b5b/rbd_data.c3571a2ae8944a.000000000000123d/87bf//4

Snap trimming then proceeds as normal.

Actions #19

Updated by Christian Theune over 6 years ago

Hey there,

we're currently using Nathan's patch to work around this issue that we encountered on v0.94.10. The patch still applies cleanly there.

Looking at trunk, I can see that the trim code seems to have evolved so that it no longer asserts on missing objects.

However, looking to upgrade towards Jewel, this seems to become an issue again: the trim code isn't robust in Jewel yet (v10.2.10, see ReplicatedPG.cc:3375).

Any chance to either forward-port Nathan's patch or backport the evolution from trunk?

Actions #20

Updated by Christian Theune over 6 years ago

Quick update: the patch applied cleanly for 0.94.10 but needed a change in the logging close.

Actions #21

Updated by Christian Theune over 6 years ago

Christian Theune wrote:

Quick update: the patch applied cleanly for 0.94.10 but needed a change in the logging close.

s/close/code/ obviously.

Actions #22

Updated by Nathan Cutler over 6 years ago

@Christian: For details see https://github.com/ceph/ceph/blob/master/SubmittingPatches.rst but, to summarize, to get the patch into the git repo, the standard procedure is to open a GitHub PR targeting the branch "master", and then ask for the patch to be backported to the stable branches (currently luminous and jewel).

If it is confirmed that the bug is already fixed in master and the master fix is not backportable (e.g. because the backport would be too complicated), then it is possible to consider a special fix for the stable branch(es) in question.

This policy exists to avoid exactly the situation you describe - an issue is fixed in a stable branch but not in master, so as soon as the next release comes out "a regression is born".

Actions #23

Updated by Christian Theune over 6 years ago

Thanks for that input. I guess this is a bit of a pickle in any case: the patch from Hammer likely won't apply at all and the change in Jewel is so substantial that it doesn't map backwards either. I guess it was fixed "along the way", but I'll try to find out.

Actions #24

Updated by Nathan Cutler over 6 years ago

@Christian: I just noticed this is fixing a FileStore-on-BTRFS issue. As of luminous BTRFS is no longer recommended for FileStore - maybe you could consider migrating your OSDs to XFS when you upgrade to Jewel and take care of the issue that way?

Actions #25

Updated by Christian Theune over 6 years ago

Maybe it's intended to fix something on BTRFS, but this has been happening to us on XFS, and the patch I'm talking about isn't really backend-specific at all; it touches an area where Ceph-internal data structures can become corrupted and then have no chance to recover in Hammer and Jewel (as I said, Luminous seems fine from looking at the code).

Actions #26

Updated by Nathan Cutler over 6 years ago

Christian Theune wrote:

Maybe it's intended to fix something on BTRFS, but this has been happening to us on XFS, and the patch I'm talking about isn't really backend-specific at all; it touches an area where Ceph-internal data structures can become corrupted and then have no chance to recover in Hammer and Jewel (as I said, Luminous seems fine from looking at the code).

Hammer is EOL, but perhaps you could open a PR targeting the jewel branch, then? It would be fine if the commit message contained a sentence or two explaining why the fix cannot be cherry-picked from master/luminous. In a pinch, this could be just a copy-paste of what you already wrote, above.

Actions #27

Updated by Christian Theune over 5 years ago

FYI, we're finally moving to Jewel (and hopefully Luminous after that), so I took a look at this patch again. I think I managed to create a viable version of the Hammer patch, mixed with some inspiration from the newer code in Luminous. Note that Luminous seems to have fixed the issue by now, so porting the Hammer patch to Jewel is sufficient without further work on the newer branches. I guess the Kraken branch doesn't matter anyway.

Thanks for taking a look at https://github.com/ceph/ceph/pull/24423, even if Jewel might already be EOL as well ... (Hopefully we can catch up to Luminous this time without running into an already EOLed version yet again ... ;) )

Actions #29

Updated by Kefu Chai over 5 years ago

  • Affected Versions v10.2.9 added