Project

General

Profile

Bug #2823

osd: out of order ACKs

Added by Sage Weil over 11 years ago. Updated over 11 years ago.

Status:
Duplicate
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description


2012-07-22T22:22:59.279 INFO:teuthology.task.rados.rados.0.out:7: Writing initial 7
2012-07-22T22:22:59.279 INFO:teuthology.task.rados.rados.0.out:waiting_on = 4
2012-07-22T22:22:59.279 INFO:teuthology.task.rados.rados.0.out:Writing plana8231399-7 from 0 to 55 tid 1 ranges are [0~55,696621~575049,1903405~770015,3073655~649760]
2012-07-22T22:22:59.285 INFO:teuthology.task.rados.rados.0.out:finishing write tid 1 to plana8231399-7
2012-07-22T22:22:59.293 INFO:teuthology.task.rados.rados.0.out:finishing write tid 3 to plana8231399-6
2012-07-22T22:22:59.316 INFO:teuthology.task.rados.rados.0.out:Writing plana8231399-7 from 696621 to 1271670 tid 2 ranges are [0~55,696621~575049,1903405~770015,3073655~649760]
2012-07-22T22:22:59.319 INFO:teuthology.task.rados.rados.0.out:finishing write tid 1 to plana8231399-5
2012-07-22T22:22:59.319 INFO:teuthology.task.rados.rados.0.err:Error: finished tid 1 when last_acked_tid was 2
2012-07-22T22:22:59.320 INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7f81627fc700 time 2012-07-22 22:22:59.304828
2012-07-22T22:22:59.320 INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: 822: FAILED assert(0)
2012-07-22T22:22:59.320 INFO:teuthology.task.rados.rados.0.err: ceph version 0.48argonaut-496-g6eed767 (commit:6eed7678bfad69cda7ff405697ee3c1436cdb5e6)
2012-07-22T22:22:59.320 INFO:teuthology.task.rados.rados.0.err: 1: (WriteOp::_finish(TestOp::CallbackInfo*)+0x2a1) [0x413451]
2012-07-22T22:22:59.320 INFO:teuthology.task.rados.rados.0.err: 2: (write_callback(void*, void*)+0x21) [0x420f01]
2012-07-22T22:22:59.320 INFO:teuthology.task.rados.rados.0.err: 3: (librados::C_AioComplete::finish(int)+0x1d) [0x7f816c76fb0d]
2012-07-22T22:22:59.320 INFO:teuthology.task.rados.rados.0.err: 4: (Finisher::finisher_thread_entry()+0x218) [0x7f816c7c9a08]
2012-07-22T22:22:59.320 INFO:teuthology.task.rados.rados.0.err: 5: (()+0x7e9a) [0x7f816c433e9a]
2012-07-22T22:22:59.321 INFO:teuthology.task.rados.rados.0.err: 6: (clone()+0x6d) [0x7f816bc4b4bd]
2012-07-22T22:22:59.321 INFO:teuthology.task.rados.rados.0.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2012-07-22T22:22:59.321 INFO:teuthology.task.rados.rados.0.err:terminate called after throwing an instance of 'ceph::FailedAssertion'
2012-07-22T22:22:59.436 ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x1b27b90>
Traceback (most recent call last):
  File "/var/lib/teuthworker/teuthology/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/teuthology/teuthology/task/rados.py", line 97, in task
    run.wait(tests.itervalues())
  File "/var/lib/teuthworker/teuthology/teuthology/orchestra/run.py", line 272, in wait
    proc.exitstatus.get()
  File "/var/lib/teuthworker/teuthology/virtualenv/lib/python2.6/site-packages/gevent/event.py", line 223, in get
    raise self._exception
CommandCrashedError: Command crashed: 'CEPH_CLIENT_ID=0 CEPH_CONF=/tmp/cephtest/ceph.conf LD_LIBRARY_PATH=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/
local/bin/testrados --op read 45 --op write 45 --op delete 10 --op snap_create 0 --op snap_remove 0 --op rollback 0 --op setattr 0 --op rmattr 0 --op watch 0 --max-ops 4000 --objects 500 --max-in-flight 16 --size 4000000 --min-stride-size 400000 --max-stride-size 800000 --max
-seconds 0'

ubuntu@teuthology:/a/sage-2012-07-22_21:26:14-regression-wip-msgr-testing-basic/16249$ cat config.yaml 
kernel: &id001
  kdb: true
  sha1: 77dca1ac33894de22b1740bb9cf6b8ef6429c700
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 200
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 6eed7678bfad69cda7ff405697ee3c1436cdb5e6
  workunit:
    sha1: 6eed7678bfad69cda7ff405697ee3c1436cdb5e6
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana69.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDE4lTHELBl2d+BXuaxMKtJX4r3GN2qSaxHTfewZxutqEC+rSNbD5Otiqwm16GtmCYklbYJL0yr6mizCZ0KzL3lVwNSczb/CxfvLTLegmBv9YzPRvoL1ymxdHggPIw47JTNRg6+UO9EConQpp7LOktT7fYf3AFx/BS9Ux6UeKyxifkb/1GUlzVT2rk1D39frBhI7AcPHovvbd/6uHvamrpUS3wSDWdz6BNRTQDaK307TJS9LPva6Cj6WAgmTZZesjWB4mOpWywrA9sva7GqSrnTMGKnIHPcsI6U57CALwHS1g/zIz73QuolZA2NkbhA7jjrfsUaWYJorE9dynIga9pf
  ubuntu@plana73.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDAZtaJPPhGrqcL1tRUHvgMMykh4hU2IdrX9cC2hYhotYVpzeejuA2TkmRdFhno54E6BM4dRIh3Wqv95YXpsToxOvzGp46lmRl3aNrLeWEEE5KxwgwTWuK72vMpZTuD/8XClj/faeFYuKvdiFFK0jGiX+j74HBOOrl3A3WdCi+/xFJZ/b5vZUJRFZxEkAkpy868iebiMgMHWDzs9A8GVb35MGf8nQGwFtPQlHEtgQNK51Wm7rj3qJM1tP1ZL2ghtrck9mdLBUzo6Fj1YTifewKt/PhteEZ1aP8lMpakOoRrJWawM27XQ7XJZDldZcI4REvF3suJLbiQHzdGArUTG3oz
  ubuntu@plana82.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDKbuUXtRWySmQ24tUH6g2Y3MU0N5+VTNSxHzX5QhVnLUY2djFFUHUhivJGutTfIOdIpw+jQP2fwQoryqC/oqSKyE2fwunP+M1ZEobQILQFthKeUKpDH6rC1gnaJI6CU3k0voVUTQpHUyKLApO5A4knXLfzmPJeE4lbi8Tds9Kx1TtID4+R8Z0UgJgfu8LVsMw8225hvuE0lOG+2n3Ms97uLh9MIxuELm4HAerp2etlVQVeE99udYeST+LMhrltxFRdKD8WUiuycIrzaT9D446ELfN+Vlugh6jSwbJycCzeqfEcPBEKNd4A9oPBThR7T0iHKRNGXHv7QhwpEgPUTT8f
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 10
      read: 45
      write: 45
    ops: 4000

ubuntu@teuthology:/a/sage-2012-07-22_21:26:14-regression-wip-msgr-testing-basic/16251 too

Related issues

Related to Ceph - Bug #2803: filer: probe crash Resolved 07/19/2012
Duplicates Ceph - Bug #2947: osd: out of order reply Resolved 08/14/2012

Associated revisions

Revision 67832c34 (diff)
Added by Sage Weil over 11 years ago

osd: fix ACK ordering on resent ops

The wait_for_ondisk handling fixed COMMIT ordering, but the ACKs need to
go back in the same order too. For example:

- op A is queued
- client disconnects, both ACK and COMMIT replies are lost
- client reconnects
- op A and B are sent
- op A is queued
- op B is applied, ACK is sent
- op A and B COMMITs are sent
-> client's ack callbacks will see B and then A.

Fix this by creating a waiting_for_ack queue as well, and sending ACK
responses as needed. Also handle the case where the ACK should be sent
immediately when the retry event is received.

Fixes: #2823
Signed-off-by: Sage Weil <>
Reviewed-by: Mike Ryan <>

History

#1 Updated by Sage Weil over 11 years ago

  • Subject changed from osd: out of order reply with ms failures to osd: out of order ACKs

specifically, this is out of order ACKs when ops are resent. trivially reproducible with the thrasher + radosmodel tests.

  • send op A
  • osd applies, but reply gets drop by peering, ms fault, whatever
  • client reconnects, sends A and B
  • A is a dup, gets put on waiting_for_ondisk
  • B is applied, ACK sent
  • A applies, COMMIT sent
  • B applies, COMMIT sent

The result is that a client watching ACKs sees B then A (implied by the COMMIT reply).

Basically, if the client asks for ACKs, those need to be sent and ordered as well.

#2 Updated by Sage Weil over 11 years ago

  • Status changed from New to Resolved

#3 Updated by Samuel Just over 11 years ago

  • Status changed from Resolved to New
  • Assignee changed from Sage Weil to Samuel Just

#4 Updated by Samuel Just over 11 years ago

(10:46:42 AM) : 4196: (1138s) collection:rados-thrash clusters:6-osd-3-machine.yaml fs:ext4.yaml thrashers:default.yaml workloads:snaps-few-objects.yaml
(10:47:37 AM) /work: directory?
(10:48:18 AM) : ubuntu@teuthology:/a/teuthology-2012-08-01_19:00:04-regression-master-testing-gcov/4196

#5 Updated by Tamilarasi muthamizhan over 11 years ago

Log location: ubuntu@teuthology:/a/teuthology-2012-08-01_19:00:04-regression-master-testing-gcov/4196

ubuntu@teuthology:/a/teuthology-2012-08-01_19:00:04-regression-master-testing-gcov/4196$ cat config.yaml
kernel: &id001
kdb: true
sha1: a879747d111666eb20c3d011328d011f61e0518f
nuke-on-error: true
overrides:
ceph:
coverage: true
fs: ext4
log-whitelist:
- slow request
sha1: ca2c381380fe83ae56c6ffdde3c775e9adf65a1f
workunit:
sha1: ca2c381380fe83ae56c6ffdde3c775e9adf65a1f
roles:
- - mon.a
- osd.0
- osd.1
- osd.2
- - mds.a
- osd.3
- osd.4
- osd.5
- - client.0
targets:
: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC6ZmsmnHcSY7O9viGtUzt5WebiPbwcXo9tg5qgWsaqn46DeegKbdsQ55ajysSUVVhvQA5hW6J9IYyZ5MjtlY2G/whyHYG85tNpAUiuedaQHmzARtL3URZmy2ZxwXgYyPHW3t1n0cu6KSb4pTv9vBjcaCouV2wgrinHAISzDOVuUeXdIhC8Tr3MB0nD1Gw6Xcak680XsQw6oYP6cM+yGCZ7sF15W8TR9IJGmphMIvtd8aTuBo9yet5rIxUfzpCM9Jiv+XgH2oT9h9WfacuL1uQ2C/dHUWoPynK36Uv2J785bfw/hVVtuSGu9Lb1n4o8p8Z88Ex4i8KaOxMiQAs3zqOx
: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDrxOb9f5/SfItd83HOnLVyJRnfji0fbdvL+3T82akjV6J4s/nyR8Bu+rpXbyUwu2BRDoxK4pT2dBqw86meq1qbU5Q1ypWBSH41MYGd213fy0g8YibFiYVGmXFCSwtY8X2Pet9vtLDoYvtnsgNI8djy5GPkQyZFKSszJHznZvQU10NWfM6RfxxtsBKXC/aot4QXb3GIym2/EmeuTAAef6p98dd15P9l9HQkpwXZLwiDZ53IbU79CTINo5HTD/6+1XHUcjb1OUKzQMx1jU485gW6IlsR0G0jJKSv+YEu4zSxxva7gWt1AYxGo2jhNDffEGLsNurzXFf9yeYshCTAszLf
: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC5sa91O7pjxDijkXWrqS3slUWFtQJZtC3hfN5cKt+oISIVRZoy4CNxFrOWhmXwKOZaIV7fS2PzmhZQe9jfA5iRP7G/opl/5sMImzTIUUz8Qrlqb7epmwYHkoE196uN45y8iW/IDwsD7o69+rOGIj0SgdT0nGBYXB27OhdECdumxHcST/vM8dFfIwwMienvhKKB+ghrS2bYa33DcfYHBQWdIPXrUf2brb/XRYDwVMg8W5uPyW8dx3GIONVxSWequnTO1m0FuMKB7BrVpH19+/H4DFOwj6vqSvvOYUWm3Go3FsQTvparxrg7/Rliee0RBWcBIE3D0gWk6r74HV0C4pnT
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
log-whitelist:
- wrongly marked me down
- objects unfound and apparently lost
- thrashosds:
timeout: 1200
- rados:
clients:
- client.0
objects: 50
op_weights:
delete: 50
read: 100
snap_create: 50
snap_remove: 50
snap_rollback: 50
write: 100
ops: 4000

ubuntu@teuthology:/a/teuthology-2012-08-01_19:00:04-regression-master-testing-gcov/4196$ cat summary.yaml
ceph-sha1: ca2c381380fe83ae56c6ffdde3c775e9adf65a1f
client.0-kernel-sha1: a879747d111666eb20c3d011328d011f61e0518f
description: collection:rados-thrash clusters:6-osd-3-machine.yaml fs:ext4.yaml thrashers:default.yaml
workloads:snaps-few-objects.yaml
duration: 1138.2689270973206
failure_reason: 'Command crashed: ''CEPH_CLIENT_ID=0 CEPH_CONF=/tmp/cephtest/ceph.conf
LD_LIBRARY_PATH=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump
/tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage
/tmp/cephtest/binary/usr/local/bin/testrados --op read 100 --op write 100 --op delete
50 --op snap_create 50 --op snap_remove 50 --op rollback 0 --op setattr 0 --op rmattr
0 --op watch 0 --max-ops 4000 --objects 50 --max-in-flight 16 --size 4000000 --min-stride-size
400000 --max-stride-size 800000 --max-seconds 0'''
flavor: gcov
mds.a-kernel-sha1: a879747d111666eb20c3d011328d011f61e0518f
mon.a-kernel-sha1: a879747d111666eb20c3d011328d011f61e0518f
owner: scheduled_teuthology@teuthology
success: false

#6 Updated by Tamilarasi muthamizhan over 11 years ago

Recent logs: ubuntu@teuthology:/a/teuthology-2012-08-11_00:00:08-regression-next-testing-basic/6401

#7 Updated by Sage Weil over 11 years ago

  • Status changed from New to Duplicate

Also available in: Atom PDF