Bug #2749

osd/ReplicatedPG.cc: 3436: FAILED assert(last_update_applied < repop->v)

Added by Samuel Just over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Saw on master

osd/ReplicatedPG.cc: 3436: FAILED assert(last_update_applied < repop->v)

ceph version 0.48argonaut-247-g668ce00 (commit:668ce00a559b1a7512ed4539e706e4e2dc4c7c28)
1: (ReplicatedPG::op_applied(ReplicatedPG::RepGather*)+0xa00) [0x553560]
2: (C_OSD_OpApplied::finish(int)+0x11) [0x5a3bb1]
3: (Finisher::finisher_thread_entry()+0x1d8) [0x72c678]
4: (()+0x7e9a) [0x7f04a8726e9a]
5: (clone()+0x6d) [0x7f04a6cdb4bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- end dump of recent events ---
2012-07-06 12:20:09.408570 7f049c9bb700 -1 *** Caught signal (Aborted) **
 in thread 7f049c9bb700

ceph version 0.48argonaut-247-g668ce00 (commit:668ce00a559b1a7512ed4539e706e4e2dc4c7c28)
1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x7193ca]
2: (()+0xfcb0) [0x7f04a872ecb0]
3: (gsignal()+0x35) [0x7f04a6c1f445]
4: (abort()+0x17b) [0x7f04a6c22bab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f04a756d69d]
6: (()+0xb5846) [0x7f04a756b846]
7: (()+0xb5873) [0x7f04a756b873]
8: (()+0xb596e) [0x7f04a756b96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x282) [0x7d1052]
10: (ReplicatedPG::op_applied(ReplicatedPG::RepGather*)+0xa00) [0x553560]
11: (C_OSD_OpApplied::finish(int)+0x11) [0x5a3bb1]
12: (Finisher::finisher_thread_entry()+0x1d8) [0x72c678]
13: (()+0x7e9a) [0x7f04a8726e9a]
14: (clone()+0x6d) [0x7f04a6cdb4bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Associated revisions

Revision bcfa573f (diff)
Added by Samuel Just over 11 years ago

ReplicatedPG: don't mark repop done until apply completes

Consider the following sequence:
1. issue, apply repop
2. replicas and primary commit
Here, repop->waitfor_(ack|disk) are empty, so we mark
repop->done and remove_repop.
3. interval change, repops still in queue are marked aborted
4. activate, last_update_applied = last_update
5. the repop from one enters apply_repop, is not aborted,
and finds that last_update_applied has passed it by.

Fixes #2749

Signed-off-by: Samuel Just <>
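
The five-step sequence in the commit message above can be replayed with a small toy model. This is only a sketch, not the ReplicatedPG code: `ToyRepop`, `ToyPG`, `replay_sequence`, and the `wait_for_apply` switch are hypothetical names, versions are plain integers, and the "fixed" branch assumes the change amounts to keeping the repop queued until its apply callback has run, so that the interval change can mark it aborted.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for a RepGather; all names here are hypothetical.
struct ToyRepop {
    std::uint64_t v = 0;         // version written by this repop
    bool applied = false;        // local apply callback has run
    bool all_committed = false;  // replicas and primary have committed
    bool done = false;           // removed from the repop queue
    bool aborted = false;        // set on interval change while still queued
};

// Toy stand-in for the PG state touched by the failing assert.
struct ToyPG {
    std::uint64_t last_update = 0;
    std::uint64_t last_update_applied = 0;
    bool wait_for_apply = false;  // true models the assumed fix

    // A repop is finished once committed and, with the fix, also applied.
    void maybe_finish(ToyRepop& r) {
        if (r.all_committed && (!wait_for_apply || r.applied))
            r.done = true;
    }

    // Step 2: all commits arrive.
    void on_all_committed(ToyRepop& r) {
        r.all_committed = true;
        maybe_finish(r);
    }

    // Step 3: only repops still in the queue are marked aborted.
    void on_interval_change(ToyRepop& r) {
        if (!r.done)
            r.aborted = true;
    }

    // Step 4: activation catches last_update_applied up to last_update.
    void activate() { last_update_applied = last_update; }

    // Step 5: returns false where assert(last_update_applied < repop->v)
    // would have failed; aborted repops are skipped.
    bool on_applied(ToyRepop& r) {
        if (r.aborted)
            return true;
        if (!(last_update_applied < r.v))
            return false;
        last_update_applied = r.v;
        assert(last_update_applied <= last_update);
        r.applied = true;
        maybe_finish(r);
        return true;
    }
};

// Replays steps 1-5 and reports whether the modelled assert holds.
bool replay_sequence(bool wait_for_apply) {
    ToyPG pg;
    pg.wait_for_apply = wait_for_apply;
    ToyRepop r;
    r.v = 1;
    pg.last_update = r.v;      // step 1: issued, apply still pending
    pg.on_all_committed(r);    // step 2
    pg.on_interval_change(r);  // step 3
    pg.activate();             // step 4
    return pg.on_applied(r);   // step 5
}
```

In this model `replay_sequence(false)` reaches step 5 with a repop that was already marked done, so it was never aborted and finds `last_update_applied` equal to its own version. With `replay_sequence(true)` the repop is still queued at step 3, gets aborted, and is skipped at step 5.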

History

#1 Updated by Samuel Just over 11 years ago

roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDx0US96hot7gygZ69W4nxJQ9myYnn3I22YtOaSPe+yFWOPJVQUuOST+aw5K6JDcjdO2Gq0aS6s01mgoWpZlO/FVDKss7vZ2KjMp3uPkGMpDZarNbR3QTe5YZYrl7Wfw4pMu4jh92hCWJEzy5nH0H3X2YJhOd5BdOYz0P97qsMSPQGxhlvDBYBhDl9MLgsS3lKm/Js/OPLO+Uf3/SZceCjUqO2m3WsrJSiQJKh8XUWUu3z+6C1Wg6TXSSlA/jdVCiokDg7WYwPN9zMwzzGkGv+GUGHKMZaPGRZb9LQJLTBf/OjwRSgclAVdDc3vnZeYAS5+sDnt2grnJnlBd1rBUj3n
: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC5J4n7rTsH+IMjGAu+EfhukuK5+zScoSaPIfXDOUU8LfvuI/3x8Luiyv9eRVwZgwuLBWZ/zorBbGZ+G2Iaxy3632AG/XE7cRZA9AxzZT+Qvm9D+BW+Uletgf92cttKMk7qwK3DetQwRKKl6AMv0SDpUff+nzqnJH6LMS8zoBPVXDHFM3Lup8h9H6DYEs1F/Zn8LVSw8hNiD279rg1n1hqWdItmnKBPKyC/qkRoPa6h7gDU6FPaBiNhuhBd0016XGrVwL7Y8gqoDBiArP+NDt1lcnbeiK43bFhqW+pYovOdIA2MJC6z+bkZDlOJdxoz9mDP0cJZBdB43v3UdbS1R+WT
: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCm/tvD/SaGxjsNa89YdjfFvYexWI9f+EFk1xGwqOyC1fJr1k7nKolWGnB2/7ao8MrEAyhneO+tSs8yTVjxHBpx72KTfIdCfeQtyKqg2Eglj+qljcK/0CaqxFucwQXPL3Jx2XWBQptrxoH9zHwLPaTa/MRaoLm2aCG9mui1v5ruYiIDI3UWzE5CDvl1qAdFpAN82KSFpJKdvTIMxImWaeaVcHoACB7R5bFxxH6rhW9ZGpWIMeKGBNSGNCgORXIfxp/yQqznqg8A58+gi1rveYUOctw26Mmp8DVEjqSYqSL23UrZjOzmh5eq43JPSu1LTvxxU/7OqONmV4wwfW78B9e3
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down or wrong addr
    - objects unfound and apparently lost
- thrashosds: null
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 50
      read: 100
      snap_create: 50
      snap_remove: 50
      snap_rollback: 50
      write: 100
    ops: 4000

#2 Updated by Sage Weil over 11 years ago

  • Priority changed from Normal to Immediate

#3 Updated by Sage Weil over 11 years ago

ubuntu@teuthology:/a/teuthology-2012-07-09_19:00:03-regression-master-testing-gcov/8328

#4 Updated by Samuel Just over 11 years ago

  • Status changed from New to Resolved

bcfa573f5f615f3403ff71da0212cd1cee7e7d9c

The core dump supports this sequence of events fairly well; marking resolved.
