Bug #2132
FAILED assert(!missing.is_missing(soid))
Description
Possibly a duplicate of Issue #1191 or Issue #339 (both closed as 'could not reproduce').
Prior to this assert the cluster had been running cleanly for about 3 days. The ceph code on the server that asserted was updated to v0.43 at 18:45:52, and a soft reboot of this server, which was running osd.0, mds.a (previously active), and mon.a, was initiated at 18:48:08 and completed at 18:52:14. During this reboot mds.b remained in standby_replay and never went active. There was an active CephFS client that had hung waiting for writes (the client was still on v0.42).
At the time of the assert (2012-03-02 18:57:46.845105) mds.a was active but still laggy and was complaining about dentry mismatches. osd.0 had just complained about a bunch of 'old request osd_op' warnings.
2012-03-02 18:57:46.845105 7f6322e4d700 data -> (3.0)
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::sub_op_modify(OpRequest*)' thread 7f6325e53700 time 2012-03-02 18:57:47.014783
osd/ReplicatedPG.cc: 3795: FAILED assert(!missing.is_missing(soid))
ceph version 0.43 (commit:9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
1: (ReplicatedPG::sub_op_modify(OpRequest*)+0x993) [0x529c63]
2: (ReplicatedPG::do_sub_op(OpRequest*)+0xab) [0x539fdb]
3: (OSD::dequeue_op(PG*)+0x111) [0x59a251]
4: (ThreadPool::worker()+0x6b4) [0x671c54]
5: (ThreadPool::WorkThread::entry()+0xd) [0x5bd82d]
6: (()+0x7efc) [0x7f6335c3aefc]
7: (clone()+0x6d) [0x7f633426b89d]
ceph version 0.43 (commit:9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
1: (ReplicatedPG::sub_op_modify(OpRequest*)+0x993) [0x529c63]
2: (ReplicatedPG::do_sub_op(OpRequest*)+0xab) [0x539fdb]
3: (OSD::dequeue_op(PG*)+0x111) [0x59a251]
4: (ThreadPool::worker()+0x6b4) [0x671c54]
5: (ThreadPool::WorkThread::entry()+0xd) [0x5bd82d]
6: (()+0x7efc) [0x7f6335c3aefc]
7: (clone()+0x6d) [0x7f633426b89d]
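The assert fires in the replica-write path: sub_op_modify() handles a replicated write forwarded by the PG's primary, and the invariant it checks is that the replica does not still have that object in its missing (not-yet-recovered) set. Below is a minimal illustrative sketch of that invariant using simplified stand-in types, not the actual Ceph code:

// Illustrative sketch only: simplified stand-ins for the real Ceph types.
#include <cassert>
#include <map>
#include <string>

struct hobject_t { std::string name; };                      // stand-in for the object id
inline bool operator<(const hobject_t &a, const hobject_t &b) { return a.name < b.name; }

struct pg_missing_t {                                        // stand-in for the PG's missing set
  std::map<hobject_t, unsigned> missing;                     // object -> version still needed
  bool is_missing(const hobject_t &soid) const { return missing.count(soid) > 0; }
};

void sub_op_modify_sketch(const hobject_t &soid, const pg_missing_t &missing) {
  // The invariant behind the failed assert: a replica should never be asked to
  // apply a replicated write to an object it still records as missing. If a peer
  // running incompatible code sends such a sub-op anyway, the OSD aborts here.
  assert(!missing.is_missing(soid));
  // ... apply the replicated transaction, update the PG log, ack the primary ...
}

int main() {
  pg_missing_t missing;
  missing.missing[{"rb.0.1.obj"}] = 4;                       // pretend this object is still being recovered
  sub_op_modify_sketch({"rb.0.2.obj"}, missing);             // fine: object is not missing
  // sub_op_modify_sketch({"rb.0.1.obj"}, missing);          // would trip the assert, as in the crash above
}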
History
#1 Updated by Sage Weil over 11 years ago
- Priority changed from Normal to Urgent
#3 Updated by Sage Weil over 11 years ago
- Status changed from New to 12
Josh Durgin wrote:
stxShadow saw this as well.
It looks like in stxshadow's case, it was a version mismatch (crashing osd.0 ran 0.43 + omap stuff).
That was not the case for Roy's crash, though, which was on v0.43 (at least on the crashing node). Roy, is it possible that any other ceph-osds in the cluster were running non-v0.43 code?
#4 Updated by Samuel Just over 11 years ago
- Status changed from 12 to Need More Info
Matthew Roy: What was the nature of the workload? rbd? ceph fs?
#5 Updated by Matthew Roy over 11 years ago
The workload was all CephFS. It could be a versioning issue; I don't have the syslogs anymore that would show when I updated the machines, but the log from the other OSD (1) shows a start event at:
2012-03-02 17:43... ceph version 0.42.2 (commit:732f3ec94e39d458230b7728b2a936d431e19322)
and then later a start event:
2012-03-02 19:51:32... ceph version 0.43 (commit:9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
The OSD that asserted (0) has a start event at:
2012-03-02 18:51... ceph version 0.43 (commit:9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
and previously had a start event at:
2012-02-29 17:38... ceph version 0.42.2 (commit:732f3ec94e39d458230b7728b2a936d431e19322)
So yeah, the two OSDs probably had different versions running at the time of the assert.
#6 Updated by Sage Weil over 11 years ago
- Status changed from Need More Info to 12
- Target version set to v0.44
Aha, that explains it... the 0.42.2 and 0.43 interaction looks like the culprit here. We should have made them explicitly incompatible. We'll make sure that's addressed for 0.44.
Thanks!
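A rough sketch of the kind of explicit incompatibility gate described above: peers advertise feature bits at connect time and are rejected up front if a required bit is missing, instead of asserting later in the write path. The feature-bit constant and struct below are hypothetical, not Ceph's actual wire protocol:

#include <cstdint>
#include <iostream>

constexpr uint64_t FEATURE_OMAP_SUBOPS = 1ULL << 7;   // hypothetical feature bit, not a real Ceph constant

struct PeerInfo {
  uint64_t features;                                   // feature bits advertised by the peer at connect time
};

bool peer_is_compatible(const PeerInfo &peer, uint64_t required) {
  return (peer.features & required) == required;
}

int main() {
  PeerInfo old_osd{0};                                 // e.g. a 0.42.2 daemon that lacks the new feature
  if (!peer_is_compatible(old_osd, FEATURE_OMAP_SUBOPS)) {
    std::cout << "rejecting peer: missing required features\n";  // fail fast instead of asserting mid-write
  }
  return 0;
}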
#7 Updated by Sage Weil over 11 years ago
- Status changed from 12 to Resolved
#8 Updated by Oliver Francke over 11 years ago
- File osd.2.log.bz2 added
Well,
it's tagged as resolved, but today another node died...:
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)' thread 7fdf4c487700 time 2012-03-16 12:22:47.778029
osd/ReplicatedPG.cc: 3381: FAILED assert(obc->watchers.size() == 0)
ceph version 0.43 (9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
1: (ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)+0x2bd) [0x50b9cd]
2: (ReplicatedPG::_applied_recovered_object(ObjectStore::Transaction*, ReplicatedPG::ObjectContext*)+0x1b1) [0x51cb81]
3: (Finisher::finisher_thread_entry()+0x1b8) [0x679f08]
4: (()+0x68ca) [0x7fdf578fe8ca]
5: (clone()+0x6d) [0x7fdf55f8286d]
ceph version 0.43 (9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
1: (ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)+0x2bd) [0x50b9cd]
2: (ReplicatedPG::_applied_recovered_object(ObjectStore::Transaction*, ReplicatedPG::ObjectContext*)+0x1b1) [0x51cb81]
3: (Finisher::finisher_thread_entry()+0x1b8) [0x679f08]
4: (()+0x68ca) [0x7fdf578fe8ca]
5: (clone()+0x6d) [0x7fdf55f8286d]
*** Caught signal (Aborted) **
in thread 7fdf4c487700
ceph version 0.43 (9fa8781c0147d66fcef7c2dd0e09cd3c69747d37)
1: /usr/bin/ceph-osd() [0x6588e9]
2: (()+0xeff0) [0x7fdf57906ff0]
3: (gsignal()+0x35) [0x7fdf55ee51b5]
4: (abort()+0x180) [0x7fdf55ee7fc0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fdf56779dc5]
6: (()+0xcb166) [0x7fdf56778166]
7: (()+0xcb193) [0x7fdf56778193]
8: (()+0xcb28e) [0x7fdf5677828e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80e) [0x65e0ee]
10: (ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)+0x2bd) [0x50b9cd]
11: (ReplicatedPG::_applied_recovered_object(ObjectStore::Transaction*, ReplicatedPG::ObjectContext*)+0x1b1) [0x51cb81]
12: (Finisher::finisher_thread_entry()+0x1b8) [0x679f08]
13: (()+0x68ca) [0x7fdf578fe8ca]
14: (clone()+0x6d) [0x7fdf55f8286d]
osd.2.log is attached; the core file is too big (120M), but if you want it we could provide FTP access...
Comments welcome,
Oliver.
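For context, this second assert guards a different invariant: when populate_obc_watchers() rebuilds an object context after recovery, the in-memory watcher map is expected to be empty before the watchers recorded on disk are loaded back in. A minimal sketch with simplified stand-in types, not the real ObjectContext:

#include <cassert>
#include <map>
#include <string>

struct watch_info_t { int cookie; };                   // stand-in for the per-watch metadata

struct ObjectContextSketch {
  std::map<std::string, watch_info_t> watchers;        // watcher entity -> watch info
};

void populate_obc_watchers_sketch(ObjectContextSketch *obc,
                                  const std::map<std::string, watch_info_t> &on_disk) {
  assert(obc->watchers.size() == 0);                   // the assert that fired in osd.2's log
  obc->watchers = on_disk;                             // reload the watchers recorded with the object
}

int main() {
  ObjectContextSketch obc;                             // a freshly built context: watchers must start empty
  populate_obc_watchers_sketch(&obc, {{"client.4101", {12345}}});
  return 0;
}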
#9 Updated by Sage Weil over 11 years ago
Oliver Francke wrote:
Well,
it's tagged as resolved, but today another node died...:
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)' thread 7fdf4c487700 time 2012-03-16 12:22:47.778029
osd/ReplicatedPG.cc: 3381: FAILED assert(obc->watchers.size() == 0)
[...]
That's a different crash; just opened #2080.