Bug #4937


osd/ReplicatedPG.cc: 1379: FAILED assert(0)

Added by Denis kaganovich almost 11 years ago. Updated almost 11 years ago.

Status: Can't reproduce
Priority: Urgent
Category: OSD
Target version: -
% Done: 0%
Source: Development
Severity: 3 - minor

Description

Two OSDs went down with the following messages (this is the info from one OSD):

0> 2013-05-08 15:20:25.634740 7fe69d736700 -1 osd/ReplicatedPG.cc: In function 'ReplicatedPG::RepGather* ReplicatedPG::trim_object(const hobject_t&)' thread 7fe69d736700 time 2013-05-08 15:20:25.632849
osd/ReplicatedPG.cc: 1379: FAILED assert(0)
ceph version 0.61-127-gf0c0997 (f0c0997cb86b20fbc2613102fc58de7d64b861f4)
1: (ReplicatedPG::trim_object(hobject_t const&)+0x16d) [0x59718d]
2: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x450) [0x59d000]
3: (boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x13c) [0x5f086c]
4: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x12b) [0x5e115b]
5: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x2b) [0x5e12cb]
6: (ReplicatedPG::snap_trimmer()+0x4f7) [0x579d27]
7: (OSD::SnapTrimWQ::_process(PG*)+0x14) [0x652b94]
8: (ThreadPool::worker(ThreadPool::WorkThread*)+0x579) [0x8aaf79]
9: (ThreadPool::WorkThread::entry()+0x10) [0x8ad0e0]
10: (()+0x84f8) [0x7fe6c37ec4f8]
11: (clone()+0x6d) [0x7fe6c164976d]

And later:
ceph version 0.61-127-gf0c0997 (f0c0997cb86b20fbc2613102fc58de7d64b861f4)
1: /usr/bin/ceph-osd() [0x7d4617]
2: (()+0x10c70) [0x7fe6c37f4c70]
3: (gsignal()+0x35) [0x7fe6c153d275]
4: (abort()+0x139) [0x7fe6c153eb69]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe6c1ef8c9d]
6: (()+0xb9e56) [0x7fe6c1ef6e56]
7: (()+0xb9e83) [0x7fe6c1ef6e83]
8: (()+0xb9f7e) [0x7fe6c1ef6f7e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1d3) [0x8b7a23]
10: (ReplicatedPG::trim_object(hobject_t const&)+0x16d) [0x59718d]
11: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x450) [0x59d000]
12: (boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x13c) [0x5f086c]
13: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x12b) [0x5e115b]
14: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x2b) [0x5e12cb]
15: (ReplicatedPG::snap_trimmer()+0x4f7) [0x579d27]
16: (OSD::SnapTrimWQ::_process(PG*)+0x14) [0x652b94]
17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x579) [0x8aaf79]
18: (ThreadPool::WorkThread::entry()+0x10) [0x8ad0e0]
19: (()+0x84f8) [0x7fe6c37ec4f8]
20: (clone()+0x6d) [0x7fe6c164976d]

Attached is this part of the log, plus both full logs, compressed.


Files

ceph-osd.3.log1 (398 KB) Denis kaganovich, 05/08/2013 05:48 AM
osdbug.tar.gz (8.29 MB) Denis kaganovich, 05/08/2013 05:48 AM
ceph-osd.5.log.gz (21.6 MB) Denis kaganovich, 05/10/2013 06:59 AM
trim-lost.patch (742 Bytes) Denis kaganovich, 05/14/2013 07:58 PM
Actions #1

Updated by Denis kaganovich almost 11 years ago

It may be related to my attempts to resolve the following:

2013-05-07 07:16:03.399196 7f65257fa700 0 log [ERR] : scrub 2.81 e2080881/rb.0.1ee4.238e1f29.000000001300/head//2 on disk size (4194304) does not match object info size (2132480)
2013-05-07 07:16:03.399217 7f65257fa700 0 log [ERR] : scrub 2.81 e2080881/rb.0.1ee4.238e1f29.000000001300/54e//2 found clone without head
2013-05-07 07:16:05.576715 7f65257fa700 0 log [ERR] : 2.81 scrub stat mismatch, got 355/357 objects, 40/41 clones, 1377152000/1379554816 bytes.
2013-05-07 07:16:05.576730 7f65257fa700 0 log [ERR] : 2.81 scrub 3 errors
2013-05-07 13:03:24.760779 7f08517fa700 0 log [ERR] : repair 2.5 867ccc05/rb.0.1ee4.238e1f29.0000000002fc/54e//2 found clone without head
2013-05-07 13:03:25.722655 7f08517fa700 0 log [ERR] : 2.5 repair stat mismatch, got 339/341 objects, 34/35 clones, 1335987712/1344376320 bytes.
2013-05-07 13:03:25.722763 7f08517fa700 0 log [ERR] : 2.5 repair 2 errors, 1 fixed
2013-05-07 13:03:54.163826 7f08517fa700 0 log [ERR] : repair 2.81 e2080881/rb.0.1ee4.238e1f29.000000001300/head//2 on disk size (4194304) does not match object info size (2132480)
2013-05-07 13:03:54.163852 7f08517fa700 0 log [ERR] : repair 2.81 e2080881/rb.0.1ee4.238e1f29.000000001300/54e//2 found clone without head
2013-05-07 13:04:10.275608 7f08517fa700 0 log [ERR] : 2.81 repair stat mismatch, got 358/360 objects, 42/43 clones, 1389734912/1392137728 bytes.
2013-05-07 13:04:10.275674 7f08517fa700 0 log [ERR] : 2.81 repair 3 errors, 1 fixed
2013-05-07 13:42:20.850601 7f48f95be700 0 log [ERR] : repair 2.5 867ccc05/rb.0.1ee4.238e1f29.0000000002fc/54e//2 found clone without head
2013-05-07 13:42:21.570201 7f48f95be700 0 log [ERR] : 2.5 repair 1 errors, 0 fixed
2013-05-07 13:42:43.829766 7f48f95be700 0 log [ERR] : repair 2.81 e2080881/rb.0.1ee4.238e1f29.000000001300/head//2 on disk size (4194304) does not match object info size (2132480)
2013-05-07 13:42:43.829778 7f48f95be700 0 log [ERR] : repair 2.81 e2080881/rb.0.1ee4.238e1f29.000000001300/54e//2 found clone without head
2013-05-07 13:42:58.346838 7f48f95be700 0 log [ERR] : 2.81 repair 2 errors, 0 fixed
2013-05-07 13:45:10.794511 7f48f95be700 0 log [ERR] : repair 2.5 867ccc05/rb.0.1ee4.238e1f29.0000000002fc/54e//2 found clone without head
2013-05-07 13:45:10.890852 7f48f95be700 0 log [ERR] : 2.5 repair 1 errors, 0 fixed
2013-05-07 13:45:15.056808 7f48f95be700 0 log [ERR] : repair 2.81 e2080881/rb.0.1ee4.238e1f29.000000001300/head//2 on disk size (4194304) does not match object info size (2132480)
2013-05-07 13:45:15.056820 7f48f95be700 0 log [ERR] : repair 2.81 e2080881/rb.0.1ee4.238e1f29.000000001300/54e//2 found clone without head
2013-05-07 13:45:16.817478 7f48f95be700 0 log [ERR] : 2.81 repair 2 errors, 0 fixed
2013-05-07 13:49:08.386591 7f48f95be700 0 log [ERR] : repair 2.5 867ccc05/rb.0.1ee4.238e1f29.0000000002fc/54e//2 found clone without head
2013-05-07 13:49:09.184725 7f48f95be700 0 log [ERR] : 2.5 repair 1 errors, 0 fixed
2013-05-07 13:49:22.670380 7f48f95be700 0 log [ERR] : repair 2.81 e2080881/rb.0.1ee4.238e1f29.000000001300/head//2 on disk size (4194304) does not match object info size (2132480)
2013-05-07 13:49:22.670400 7f48f95be700 0 log [ERR] : repair 2.81 e2080881/rb.0.1ee4.238e1f29.000000001300/54e//2 found clone without head
2013-05-07 13:49:28.829483 7f48f95be700 0 log [ERR] : 2.81 repair 2 errors, 0 fixed

I copied this PG head from another OSD, but now it is truncated to 0 again - the same problem.

Actions #2

Updated by Denis kaganovich almost 11 years ago

I tried to remove this lost part of the snapshots with:

rados rm rb.0.1ee4.238e1f29.000000001300 -p rbd

and by removing the files from the OSDs, but it was re-created (with the same problem) and killed my OSDs again.

Help!

Actions #3

Updated by Anonymous almost 11 years ago

  • Priority changed from Normal to Urgent
Actions #4

Updated by Ian Colle almost 11 years ago

  • Assignee set to Samuel Just
Actions #5

Updated by Samuel Just almost 11 years ago

Can you confirm that all of your OSDs are running Cuttlefish? (It should work anyway if some are still on Bobtail, but it will help me narrow down the issue.)
-Sam

Actions #6

Updated by Denis kaganovich almost 11 years ago

Yes. The whole time since the start of my reports in this tracker, I have had 3 nodes on roughly the same (±1 day) git snapshot. It is still the same now.

The source of this may be previous bugs, or the power loss a day ago (plus a UPS failure) on 2 nodes. Or something during the nightly snapshot -> backup/diff -> snapshot-delete process.

As a temporary workaround I commented out "snap_trimmer_machine.initiate();". If you cannot suggest a good solution, in the next few days I will replace the assert(0) with a "return NULL"/"return 0" chain and "if"s (even if that leaves some lost clusters for a while) to simply skip trimming on these nodes.
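For illustration only, here is a minimal self-contained sketch of that pattern: return NULL instead of asserting, and let the caller skip the broken object. The types below are toy stand-ins, not the actual Ceph classes; in the real code trim_object() takes an hobject_t and returns a RepGather*:

#include <cstdio>
#include <map>
#include <string>

struct RepGather {};  // toy stand-in for the real replicated-op type

// Toy stand-in for ReplicatedPG::trim_object(): look up the clone's head.
// The crashing path did assert(0) when the head was missing
// ("found clone without head").
RepGather *trim_object(const std::map<std::string, int> &heads,
                       const std::string &clone)
{
  if (heads.find(clone) == heads.end()) {
    // Workaround idea: report and bail out instead of assert(0).
    fprintf(stderr, "trim_object: no head for %s, skipping\n", clone.c_str());
    return NULL;
  }
  return new RepGather;  // normal trim path
}

// Toy stand-in for the TrimmingObjects::react() caller: tolerate a NULL
// repop by skipping the object instead of crashing the whole OSD.
void snap_trim_one(const std::map<std::string, int> &heads,
                   const std::string &clone)
{
  RepGather *repop = trim_object(heads, clone);
  if (!repop)
    return;  // leave the damaged clone alone, keep the OSD up
  // ... queue the repop as usual ...
  delete repop;
}

int main()
{
  std::map<std::string, int> heads;  // simulate: no head object on disk
  snap_trim_one(heads, "rb.0.1ee4.238e1f29.000000001300");
  return 0;
}

Skipping like this leaves the orphaned clone on disk, which matches the "lost clusters for a while" trade-off described above.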

Actions #7

Updated by Samuel Just almost 11 years ago

Can you reproduce with

debug osd = 20
debug filestore = 20
debug ms = 1

and post the logs?
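For reference, these settings would typically go in the [osd] section of ceph.conf on the affected hosts:

[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1

They can also be injected into a running daemon with something like "ceph osd tell <id> injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'" (the exact tell syntax varies by release).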

It appears that you have corrupt on-disk state, probably as a result of the power failures. What file system are you using?
-Sam

Actions #8

Updated by Denis kaganovich almost 11 years ago

OK.
I have xfs on the OSDs and reiserfs on the system and monitor disks. Now I have more of these objects. My understanding: IMHO these are lost snapshot objects. So I removed all backup snapshots, and now I have "HEALTH_ERR 5 pgs inconsistent; 8 scrub errors" (the count is increasing; I cannot say precisely that it started after removing the snapshots, but it looks that way).

In any case, Ceph (and I) need a regular mechanism to resolve these failures...

Actions #9

Updated by Denis kaganovich almost 11 years ago

PS: This can differ in details from a natural failure: I copied in another copy of the problem files, so the size, for example, may (or may not) be 0; now it is something else. But the overall dump and assertion are the same.

Actions #10

Updated by Denis kaganovich almost 11 years ago

1) Just one more symptom: "assert(soid.snap == *curclone);" (IMHO it is too similar to the others, including "clone without head", to warrant a separate big report, but you can ask for more);

2) Can you check the attached patch and suggest, for example, a different return action for react(), or is this approach completely dangerous? (I don't know how clones are chained, so could it cause data loss?)

3) Or should I just wait some more?

Actions #11

Updated by Denis kaganovich almost 11 years ago

Done. I reinstalled Ceph and repaired from backups - and had the same trouble with monitor re-initialization as before. A freshly wiped mon would not re-init and stayed in client mode forever, and then the other mons died too after a simple restart.

Actions #12

Updated by Olivier Bonvalet almost 11 years ago

I also have scrub errors with this message: "found clone without head".
Shouldn't a "ceph pg repair" fix that kind of error?
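For reference, repair is issued per placement group; using one of the PG ids from the logs above, it would look like:

ceph pg repair 2.81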

Actions #13

Updated by David Zafman almost 11 years ago

Olivier Bonvalet wrote 2 days ago:

I also have scrub errors with this message: "found clone without head".
Shouldn't a "ceph pg repair" fix that kind of error?

I filed bug #5141 for this issue.

Actions #14

Updated by Samuel Just almost 11 years ago

  • Status changed from New to Can't reproduce

This was caused by corruption of some kind. That corruption may have been a bug.
