Project

General

Profile

Actions

Bug #2082

closed

osd: broken queuing during replay

Added by Sage Weil about 12 years ago. Updated about 12 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
OSD
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description


2012-02-19T15:33:45.005 INFO:teuthology.task.rados.rados.0.out:finishing write tid 3 to sepia832175-447
2012-02-19T15:33:45.005 INFO:teuthology.task.rados.rados.0.out:finishing write tid 1 to sepia832175-325
2012-02-19T15:33:45.005 INFO:teuthology.task.rados.rados.0.out:finishing write tid 1 to sepia832175-183
2012-02-19T15:33:45.005 INFO:teuthology.task.rados.rados.0.out:finishing write tid 2 to sepia832175-325
2012-02-19T15:33:45.005 INFO:teuthology.task.rados.rados.0.out:finishing write tid 2 to sepia832175-183
2012-02-19T15:33:45.126 INFO:teuthology.task.rados.rados.0.out:Writing sepia832175-158 from 679187 to 1283761 tid 2 ranges are [0~56,679187~604574,1921743~657143,3080556~590796]
2012-02-19T15:33:46.356 INFO:teuthology.task.thrashosds.thrasher:in_osds:  [0, 5, 2, 1]  out_osds:  [3, 4] dead_osds:  [] live_osds:  [1, 0, 3, 2, 5, 4]
2012-02-19T15:33:46.357 INFO:teuthology.task.thrashosds.thrasher:choose_action: min_in 2 min_out 0 min_live 2 min_dead 0
2012-02-19T15:33:46.357 INFO:teuthology.task.thrashosds.thrasher:Removing osd 1, in_osds are: [0, 5, 2, 1]
2012-02-19T15:33:46.357 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp
/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise osd out 1'
2012-02-19T15:33:48.258 INFO:teuthology.task.rados.rados.0.out:Writing sepia832175-158 from 1921743 to 2578886 tid 3 ranges are [0~56,679187~604574,1921743~657143,3080556~590796]
2012-02-19T15:33:48.536 INFO:teuthology.task.rados.rados.0.out:finishing write tid 1 to sepia832175-324
2012-02-19T15:33:48.675 INFO:teuthology.task.rados.rados.0.out:Writing sepia832175-158 from 3080556 to 3671352 tid 4 ranges are [0~56,679187~604574,1921743~657143,3080556~590796]
2012-02-19T15:33:48.704 INFO:teuthology.task.rados.rados.0.err:Error: finished tid 1 when last_acked_tid was 4
2012-02-19T15:33:49.104 INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7fe4c84fd700 time 2012-02-19 15:33:48.706821
2012-02-19T15:33:49.105 INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: 368: FAILED assert(0)
2012-02-19T15:33:49.116 INFO:teuthology.task.rados.rados.0.err: ceph version 0.41-400-g2c2b388 (commit:2c2b3881bf24b7d3b1aed457545285b5384d1c76)
2012-02-19T15:33:49.116 INFO:teuthology.task.rados.rados.0.err: 1: (WriteOp::_finish(TestOp::CallbackInfo*)+0x252) [0x40a772]
2012-02-19T15:33:49.117 INFO:teuthology.task.rados.rados.0.err: 2: (write_callback(void*, void*)+0x21) [0x414371]
2012-02-19T15:33:49.117 INFO:teuthology.task.rados.rados.0.err: 3: (librados::RadosClient::C_aio_Ack::finish(int)+0xa4) [0x7fe4cada3824]
2012-02-19T15:33:49.117 INFO:teuthology.task.rados.rados.0.err: 4: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf79) [0x7fe4cadb35c9]
2012-02-19T15:33:49.117 INFO:teuthology.task.rados.rados.0.err: 5: (librados::RadosClient::_dispatch(Message*)+0x6c) [0x7fe4cad8ef4c]
2012-02-19T15:33:49.117 INFO:teuthology.task.rados.rados.0.err: 6: (librados::RadosClient::ms_dispatch(Message*)+0x1d3) [0x7fe4cad8f173]
2012-02-19T15:33:49.118 INFO:teuthology.task.rados.rados.0.err: 7: (SimpleMessenger::dispatch_entry()+0x89a) [0x7fe4cae430ca]
2012-02-19T15:33:49.118 INFO:teuthology.task.rados.rados.0.err: 8: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x7fe4cad9ccbc]
2012-02-19T15:33:49.118 INFO:teuthology.task.rados.rados.0.err: 9: (()+0x7971) [0x7fe4caaf2971]
2012-02-19T15:33:49.118 INFO:teuthology.task.rados.rados.0.err: 10: (clone()+0x6d) [0x7fe4ca33292d]
2012-02-19T15:33:49.118 INFO:teuthology.task.rados.rados.0.err: ceph version 0.41-400-g2c2b388 (commit:2c2b3881bf24b7d3b1aed457545285b5384d1c76)
2012-02-19T15:33:49.119 INFO:teuthology.task.rados.rados.0.err: 1: (WriteOp::_finish(TestOp::CallbackInfo*)+0x252) [0x40a772]
2012-02-19T15:33:49.119 INFO:teuthology.task.rados.rados.0.err: 2: (write_callback(void*, void*)+0x21) [0x414371]
2012-02-19T15:33:49.119 INFO:teuthology.task.rados.rados.0.err: 3: (librados::RadosClient::C_aio_Ack::finish(int)+0xa4) [0x7fe4cada3824]
2012-02-19T15:33:49.119 INFO:teuthology.task.rados.rados.0.err: 4: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf79) [0x7fe4cadb35c9]
2012-02-19T15:33:49.119 INFO:teuthology.task.rados.rados.0.err: 5: (librados::RadosClient::_dispatch(Message*)+0x6c) [0x7fe4cad8ef4c]
2012-02-19T15:33:49.120 INFO:teuthology.task.rados.rados.0.err: 6: (librados::RadosClient::ms_dispatch(Message*)+0x1d3) [0x7fe4cad8f173]
2012-02-19T15:33:49.120 INFO:teuthology.task.rados.rados.0.err: 7: (SimpleMessenger::dispatch_entry()+0x89a) [0x7fe4cae430ca]
2012-02-19T15:33:49.120 INFO:teuthology.task.rados.rados.0.err: 8: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x7fe4cad9ccbc]
2012-02-19T15:33:49.120 INFO:teuthology.task.rados.rados.0.err: 9: (()+0x7971) [0x7fe4caaf2971]
2012-02-19T15:33:49.120 INFO:teuthology.task.rados.rados.0.err: 10: (clone()+0x6d) [0x7fe4ca33292d]

on

ubuntu@teuthology:/a/master-2012-02-19_10:53:37/12688$ cat config.yaml 
kernel: &id001
  branch: testing
nuke-on-error: true
overrides:
  ceph:
    branch: master
    btrfs: 1
    coverage: true
    fs: xfs
    log-whitelist:
    - clocks not synchronized
    - old request
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - client.0
  - osd.3
  - osd.4
  - osd.5
targets:
  ubuntu@sepia80.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCts9+jqfNcV/Be7XXh87aCtjWMf2tzSv4FFH1QuQ9QJnpwTQ9Nwk5dBnZKYGOPO0WYlY//3Be/IoCnCubH7+rCjsfluZ0o+wJVdKLazJG3KeiLP1LgJPB8J+emv4cCshTJbuZoFgeNzADdwF8C239ImnPzJqQdh2Av7aomnv8TcZFYwxjyMSBI/F4VQX4Ml8cWq2U1jnk6ft7cs1lZSe+74/glVjWpxBDcgMDFNkGUlLyPdWXxYUMdWHkayTvu8j4/AV8HQaFkbmLks/FTCjpYawCuETflkGl/WbTl00hARGHrPPJpDHmHIf6rEuwMMZ2Eh2Ggui3s9jhmRqlAyGVH
  ubuntu@sepia83.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCp01zJkpSUWuvZBouUOqatTTlrysUBgp7ue58PUuJ2rZJV2jk9TKog5wryJxQli1dqllh/PfsInOkjgLGBhnohzyUZlz1ioT505Sla/FMGeEctyUqSW0jm1m9EPSLhyHS4JrYFLAaZFtmeOddq1x9hgeLotK/pTaMcgVra2LuOwyytK2URVK7nB3oXEbWZqLXSoWJUltm/AsZPV2gc2+6dUToC1tiAx+Uop8KJwInk+8p69YnhrGPvpCqJMkw0ly2azyV7wqwgPFVuFTxmVxC3RmpRLXJvjqfvFTjKiP/3Dg7INExFE9tFfS6TZhxj2QeDq2mpSrREvwTVkLCD3RiZ
tasks:
- internal.lock_machines: 2
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- ceph:
    log-whitelist:
    - wrongly marked me down or wrong addr
    - objects unfound and apparently lost
- thrashosds: null
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 10
      read: 45
      write: 45
    ops: 4000

saw this a couple of days ago, too.

Actions #1

Updated by Sage Weil about 12 years ago

  • Priority changed from High to Normal
Actions #2

Updated by Sage Weil about 12 years ago

  • Target version changed from v0.43 to v0.44
Actions #3

Updated by Sage Weil about 12 years ago

  • Subject changed from testrados got out of order reply? to osd: broken queuing during replay
  • Status changed from New to 7
  • Assignee set to Sage Weil

Hit this again and dug into the logs. It's a bug on the osd... testrados is using the data, which has replay enabled, and you get something like:

- pg is peering (not active)
- rx op 1, put on waiting_for_active
- pg activates, goes into replay (does not requeue waiting_for_active)
- rx op 2, it isn't replayed, so we do it now (BUG)
- replay finishes, we requeue op 1
-> broken ordering

the fix is simple, see wip-2082

Actions #4

Updated by Sage Weil about 12 years ago

  • Status changed from 7 to Resolved
Actions

Also available in: Atom PDF