Bug #2132

FAILED assert(!missing.is_missing(soid))

Added by Matthew Roy about 12 years ago. Updated about 12 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: OSD
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Possibly a duplicate of Issue #1191 or Issue #339 (both closed as "could not reproduce").

Prior to this assert the cluster had been running cleanly for about 3 days. The ceph code on the server that asserted was updated to v0.43 at 18:45:52, and a soft reboot of this server, which was running osd.0, mds.a (previously active), and mon.a, was initiated at 18:48:08 and completed at 18:52:14. During this reboot mds.b remained in standby_replay and never went active. An active CephFS client (still on v0.42) had hung waiting for writes.

At the time of the assert (2012-03-02 18:57:46.845105) mds.a was active but still laggy, and was complaining about dentry mismatches. osd.0 had just complained about a bunch of 'old request osd_op' messages.

2012-03-02 18:57:46.845105 7f6322e4d700 data -> (3.0)
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::sub_op_modify(OpRequest*)' thread 7f6325e53700 time 2012-03-02 18:57:47.014783
osd/ReplicatedPG.cc: 3795: FAILED assert(!missing.is_missing(soid))
 ceph version 0.43 (commit:9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
 1: (ReplicatedPG::sub_op_modify(OpRequest*)+0x993) [0x529c63]
 2: (ReplicatedPG::do_sub_op(OpRequest*)+0xab) [0x539fdb]
 3: (OSD::dequeue_op(PG*)+0x111) [0x59a251]
 4: (ThreadPool::worker()+0x6b4) [0x671c54]
 5: (ThreadPool::WorkThread::entry()+0xd) [0x5bd82d]
 6: (()+0x7efc) [0x7f6335c3aefc]
 7: (clone()+0x6d) [0x7f633426b89d]
 ceph version 0.43 (commit:9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
 1: (ReplicatedPG::sub_op_modify(OpRequest*)+0x993) [0x529c63]
 2: (ReplicatedPG::do_sub_op(OpRequest*)+0xab) [0x539fdb]
 3: (OSD::dequeue_op(PG*)+0x111) [0x59a251]
 4: (ThreadPool::worker()+0x6b4) [0x671c54]
 5: (ThreadPool::WorkThread::entry()+0xd) [0x5bd82d]
 6: (()+0x7efc) [0x7f6335c3aefc]
 7: (clone()+0x6d) [0x7f633426b89d]
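
For context, the failed assert in ReplicatedPG::sub_op_modify() encodes an invariant on the replica's "missing" set: a replica should never be asked to apply a modify sub-op for an object it still considers missing (i.e. not yet recovered locally). The following is only an illustrative sketch of that invariant, not the actual Ceph code; MissingSet, hobject, and sub_op_modify_sketch are made-up names.

// Illustrative sketch only -- not Ceph's implementation. It shows the shape of
// the invariant behind "FAILED assert(!missing.is_missing(soid))": the set of
// objects a replica still needs to recover must not contain an object the
// primary is asking it to modify.
#include <cassert>
#include <set>
#include <string>

using hobject = std::string;   // stand-in for Ceph's hobject_t

struct MissingSet {
  std::set<hobject> objects;   // objects this replica has not recovered yet
  bool is_missing(const hobject& soid) const {
    return objects.count(soid) != 0;
  }
};

// A replica applying a modify sub-op. If the object is still in the missing
// set, replication and recovery are out of sync (for example because two OSDs
// speak incompatible versions), and the daemon aborts rather than apply an
// update to an object it does not actually have.
void sub_op_modify_sketch(const MissingSet& missing, const hobject& soid) {
  assert(!missing.is_missing(soid));   // the assert that fired in this report
  // ... apply the update to the local copy of soid ...
}

int main() {
  MissingSet missing;
  missing.objects.insert("rb.0.1234");            // hypothetical object name
  sub_op_modify_sketch(missing, "other.object");  // ok: not in the missing set
  // sub_op_modify_sketch(missing, "rb.0.1234");  // would trip the assert
  return 0;
}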

logsNearAssert1875.grep - cluster log files clipped near time of assert (1.23 MB) Matthew Roy, 03/02/2012 09:59 PM

osd.0.log - complete osd0 log file (assert is at line 1735) (343 KB) Matthew Roy, 03/02/2012 09:59 PM

core.osdAssert1857.gz - core file from assert (26.9 MB) Matthew Roy, 03/02/2012 09:59 PM

osd.0.log.gz (21.6 MB) Josh Durgin, 03/13/2012 01:56 PM

osd.2.log.bz2 (19.6 MB) Oliver Francke, 03/16/2012 05:36 AM

History

#1 Updated by Sage Weil about 12 years ago

  • Priority changed from Normal to Urgent

#2 Updated by Josh Durgin about 12 years ago

stxShadow saw this as well.

#3 Updated by Sage Weil about 12 years ago

  • Status changed from New to 12

Josh Durgin wrote:

stxShadow saw this as well.

It looks like in stxshadow's case, it was a version mismatch (crashing osd.0 ran 0.43 + omap stuff).

That was not the case for Roy's crash, though, which was on v0.43 (at least on the crashing node). Roy, is it possible that any other ceph-osds in the cluster were running non-v0.43 code?

#4 Updated by Samuel Just about 12 years ago

  • Status changed from 12 to Need More Info

Matthew Roy: What was the nature of the workload? rbd? ceph fs?

#5 Updated by Matthew Roy about 12 years ago

All CephFS workload. It could be a versioning issue; I don't have the syslogs anymore that would show when I updated the machines, but the log from the other OSD (osd.1) shows a start event at:
2012-03-02 17:43... ceph version 0.42.2 (commit:732f3ec94e39d458230b7728b2a936d431e19322)
and then later a start event:
2012-03-02 19:51:32... ceph version 0.43 (commit:9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)

The OSD that asserted (osd.0) has a start event at:
2012-03-02 18:51... ceph version 0.43 (commit:9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
and previously had a start event at:
2012-02-29 17:38... ceph version 0.42.2 (commit:732f3ec94e39d458230b7728b2a936d431e19322)

So yeah, the two OSDs probably had different versions running at the time of the assert.

#6 Updated by Sage Weil about 12 years ago

  • Status changed from Need More Info to 12
  • Target version set to v0.44

Aha, that explains it... the 0.42.2 and 0.43 interaction looks like the culprit here. We should have made them explicitly incompatible (see the sketch below). We'll make sure that's addressed for 0.44.

Thanks!
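
As a purely hypothetical illustration of what "explicitly incompatible" could mean here (this is not the actual 0.44 fix nor Ceph's real handshake code; PeerInfo, MIN_COMPAT_*, and handle_peer_hello are invented names): a daemon can exchange its version during the handshake and refuse to peer with anything below a required minimum, instead of silently trading sub-ops it may misinterpret.

// Hypothetical sketch of a version gate at connection time -- not Ceph's code.
#include <cstdint>
#include <iostream>

struct PeerInfo {
  uint32_t version_major;
  uint32_t version_minor;
};

// Require at least v0.43 on the wire; older peers are rejected rather than
// allowed to exchange messages they may interpret differently.
constexpr uint32_t MIN_COMPAT_MAJOR = 0;
constexpr uint32_t MIN_COMPAT_MINOR = 43;

bool peer_is_compatible(const PeerInfo& p) {
  if (p.version_major != MIN_COMPAT_MAJOR)
    return p.version_major > MIN_COMPAT_MAJOR;
  return p.version_minor >= MIN_COMPAT_MINOR;
}

bool handle_peer_hello(const PeerInfo& p) {
  if (!peer_is_compatible(p)) {
    std::cerr << "rejecting peer v" << p.version_major << "."
              << p.version_minor << ": below minimum compatible version\n";
    return false;  // drop the connection instead of risking protocol skew
  }
  return true;
}

int main() {
  // With this gate, a v0.42.2 osd (like osd.1 in this report) would be
  // refused by a v0.43 osd instead of triggering the assert later.
  std::cout << handle_peer_hello({0, 42}) << "\n";  // 0: rejected
  std::cout << handle_peer_hello({0, 43}) << "\n";  // 1: accepted
  return 0;
}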

#7 Updated by Sage Weil about 12 years ago

  • Status changed from 12 to Resolved

#8 Updated by Oliver Francke about 12 years ago

Well,

it's tagged as resolved, but today another node died...:

osd/ReplicatedPG.cc: In function 'void ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)' thread 7fdf4c487700 time 2012-03-16 12:22:47.778029
osd/ReplicatedPG.cc: 3381: FAILED assert(obc->watchers.size() == 0)
ceph version 0.43 (9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
1: (ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)+0x2bd) [0x50b9cd]
2: (ReplicatedPG::_applied_recovered_object(ObjectStore::Transaction*, ReplicatedPG::ObjectContext*)+0x1b1) [0x51cb81]
3: (Finisher::finisher_thread_entry()+0x1b8) [0x679f08]
4: (()+0x68ca) [0x7fdf578fe8ca]
5: (clone()+0x6d) [0x7fdf55f8286d]
ceph version 0.43 (9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
1: (ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)+0x2bd) [0x50b9cd]
2: (ReplicatedPG::_applied_recovered_object(ObjectStore::Transaction*, ReplicatedPG::ObjectContext*)+0x1b1) [0x51cb81]
3: (Finisher::finisher_thread_entry()+0x1b8) [0x679f08]
4: (()+0x68ca) [0x7fdf578fe8ca]
5: (clone()+0x6d) [0x7fdf55f8286d]
*** Caught signal (Aborted) **
 in thread 7fdf4c487700
 ceph version 0.43 (9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
 1: /usr/bin/ceph-osd() [0x6588e9]
 2: (()+0xeff0) [0x7fdf57906ff0]
 3: (gsignal()+0x35) [0x7fdf55ee51b5]
 4: (abort()+0x180) [0x7fdf55ee7fc0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fdf56779dc5]
 6: (()+0xcb166) [0x7fdf56778166]
 7: (()+0xcb193) [0x7fdf56778193]
 8: (()+0xcb28e) [0x7fdf5677828e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80e) [0x65e0ee]
 10: (ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)+0x2bd) [0x50b9cd]
 11: (ReplicatedPG::_applied_recovered_object(ObjectStore::Transaction*, ReplicatedPG::ObjectContext*)+0x1b1) [0x51cb81]
 12: (Finisher::finisher_thread_entry()+0x1b8) [0x679f08]
 13: (()+0x68ca) [0x7fdf578fe8ca]
 14: (clone()+0x6d) [0x7fdf55f8286d]

osd.2.log is attached; the core file is too big (~120 MB). If you need it, we could provide ftp access...

Comments welcome,

Oliver.

#9 Updated by Sage Weil about 12 years ago

Oliver Francke wrote:

it's tagged as resolved, but today another node died...:

osd/ReplicatedPG.cc: In function 'void ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)' thread 7fdf4c487700 time 2012-03-16 12:22:47.778029
osd/ReplicatedPG.cc: 3381: FAILED assert(obc->watchers.size() == 0)
[...]

That's a different crash; just opened #2080.
