Bug #5670 (closed): ceph_test_rados; FAILED assert(0)

Added by David Zafman almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph version 0.66-697-g921a4aa (921a4aac8a89850303233fe188998202e0ddfe0d)

2013-07-18T15:06:33.739 INFO:teuthology.task.rados.rados.0.out:Deleting 125 current snap is 0
2013-07-18T15:06:33.740 INFO:teuthology.task.rados.rados.0.err:r is -2 while deleting 125 and present is 1
2013-07-18T15:06:33.741 INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: In function 'virtual void DeleteOp::_begin()' thread 7fd4c6076780 time 2013-07-18 15:07:02.822706
2013-07-18T15:06:33.741 INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: 907: FAILED assert(0)
2013-07-18T15:06:33.741 INFO:teuthology.task.rados.rados.0.err: ceph version 0.66-697-g921a4aa (921a4aac8a89850303233fe188998202e0ddfe0d)
2013-07-18T15:06:33.741 INFO:teuthology.task.rados.rados.0.err: 1: (DeleteOp::_begin()+0x4b9) [0x415c69]
2013-07-18T15:06:33.741 INFO:teuthology.task.rados.rados.0.err: 2: (RadosTestContext::loop(TestOpGenerator*)+0x9a) [0x40cc4a]
2013-07-18T15:06:33.741 INFO:teuthology.task.rados.rados.0.err: 3: (main()+0xaff) [0x40bbff]
2013-07-18T15:06:33.741 INFO:teuthology.task.rados.rados.0.err: 4: (__libc_start_main()+0xed) [0x7fd4c454376d]
2013-07-18T15:06:33.741 INFO:teuthology.task.rados.rados.0.err: 5: ceph_test_rados() [0x40bfc1]
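
The assert fires because the test's in-memory model and the cluster disagree: the model believes object 125 exists ("present is 1"), yet the delete returns -2 (-ENOENT). The following is a minimal, hypothetical sketch of that style of consistency check, not the actual RadosModel.h code; every name in it is invented for illustration.

#include <cassert>
#include <cerrno>
#include <iostream>

// Stand-in for a librados remove(): 0 on success, -ENOENT if the object is gone.
int do_delete(bool object_really_exists) {
  return object_really_exists ? 0 : -ENOENT;
}

int main() {
  bool model_says_present = true;    // the test model expects the object to exist
  bool cluster_has_object = false;   // but the OSDs no longer have it (the bug)

  int r = do_delete(cluster_has_object);
  if (r != 0 && model_says_present) {
    std::cerr << "r is " << r << " while deleting and present is 1" << std::endl;
    assert(0);                       // mirrors the FAILED assert(0) in the log above
  }
  return 0;
}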

orig.config.yaml:
machine_type: plana
overrides:
  admin_socket:
    branch: master
  ceph:
    branch: next
    conf:
      global:
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: osd
        ms inject socket failures: 2500
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        debug filestore: 20
        debug ms: 1
        debug osd: 20
    fs: xfs
    log-whitelist:
    - slow request
  install:
    ceph:
      branch: next
  s3tests:
    branch: master
  workunit:
    branch: next
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 2
    chance_pgpnum_fix: 1
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 10
      read: 45
      write: 45
    ops: 4000

#1

Updated by Samuel Just almost 11 years ago

  • Status changed from New to In Progress
  • Priority changed from Normal to Urgent

1) The primary starts pushing object foo to a replica.
2) The replica creates tmp/foo.
3) A peering change occurs; the replica is out of the acting set.
4) foo is deleted.
5) The replica is now primary and sees the foo deletion event.
6) A write arrives on foo.

As of step 2 there is a tmp/foo; as of step 6 there is also a head/foo. Unfortunately, the fd cacher will cause the write in step 6 to land on tmp/foo instead. There are similar problems with the omap and xattrs. The solution is to delete tmp/foo in merge_log when we also delete head/foo.
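
A minimal sketch of that fix idea, assuming a heavily simplified object store; the names (ObjectStore, on_merge_log_delete) are hypothetical stand-ins for the real merge_log path, not Ceph code.

#include <set>
#include <string>

// Simplified stand-in for the on-disk state: committed objects ("head/foo")
// and half-pushed recovery targets ("tmp/foo").
struct ObjectStore {
  std::set<std::string> head_objects;
  std::set<std::string> temp_objects;

  void remove_head(const std::string& oid) { head_objects.erase(oid); }
  void remove_temp(const std::string& oid) { temp_objects.erase(oid); }
};

// Hypothetical handling of a delete entry during log merging: delete head/foo
// as before, and also drop any leftover tmp/foo so a later write cannot be
// redirected to the stale temp copy through a cached fd.
void on_merge_log_delete(ObjectStore& store, const std::string& oid) {
  store.remove_head(oid);
  store.remove_temp(oid);
}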

#2

Updated by Samuel Just almost 11 years ago

To elaborate, if the object had not been deleted, it would have had to be recovered or backfilled before IO could be served on it. In either case, the tmp directory copy would have been eliminated.

#3

Updated by Samuel Just almost 11 years ago

Nope. Backfill might just skip an object, leaving it treacherously in the temp directory. We'll have to track objects in the temp directory. Easy enough: we can just do the tracking in submit_push_*.
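
A rough sketch of that tracking idea, under the assumption of a per-PG tracker; PGTempTracker and its methods are hypothetical names illustrating the approach, not the actual submit_push_* code.

#include <functional>
#include <set>
#include <string>

// Hypothetical per-PG tracker: remember every object written into the temp
// collection from the push path, and clear the leftovers when the PG's
// peering interval changes.
struct PGTempTracker {
  std::set<std::string> temp_contents;

  // Called when the (hypothetical) push path creates or finishes a temp object.
  void note_temp_object(const std::string& oid)  { temp_contents.insert(oid); }
  void clear_temp_object(const std::string& oid) { temp_contents.erase(oid); }

  // Called on interval change: remove whatever is still sitting in the temp
  // collection, including objects that backfill later decided to skip.
  void on_change(const std::function<void(const std::string&)>& remove_from_temp) {
    for (const auto& oid : temp_contents)
      remove_from_temp(oid);
    temp_contents.clear();
  }
};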

#4

Updated by Samuel Just almost 11 years ago

wip-5670, going to do some testing.

#5

Updated by Samuel Just almost 11 years ago

  • Status changed from In Progress to Resolved