Bug #2947 (closed): osd: out of order reply

Added by Sage Weil over 11 years ago. Updated about 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Q/A

Description

triggered by thrashing in this job:

nuke-on-error: true
overrides:
  ceph:
    fs: ext4
    log-whitelist:
    - slow request
    conf:
      osd:
        debug ms: 20
        debug osd: 20
        debug filestore: 20
      client:
        debug ms: 20
        debug objecter: 20
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 10
      read: 45
      write: 45
    ops: 4000

we observe (after 10-20 successes) a failure like this:

2012-08-13T17:17:26.349 INFO:teuthology.task.rados.rados.0.out:finishing write tid 2 to plana689588-300
2012-08-13T17:17:26.349 INFO:teuthology.task.rados.rados.0.out:finishing write tid 1 to plana689588-300
2012-08-13T17:17:26.349 INFO:teuthology.task.rados.rados.0.err:Error: finished tid 1 when last_acked_tid was 2
2012-08-13T17:17:26.349 INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7f9bc37fe700 time 2012-08-13 17:17:26.342537
2012-08-13T17:17:26.350 INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: 822: FAILED assert(0)
2012-08-13T17:17:26.350 INFO:teuthology.task.rados.rados.0.err: ceph version 0.50-157-g4e7d49e (commit:4e7d49e0c66ef39593e3bf5e6684779620c4eeda)
2012-08-13T17:17:26.350 INFO:teuthology.task.rados.rados.0.err: 1: (WriteOp::_finish(TestOp::CallbackInfo*)+0x29e) [0x41385e]
2012-08-13T17:17:26.350 INFO:teuthology.task.rados.rados.0.err: 2: (write_callback(void*, void*)+0x23) [0x421a43]
2012-08-13T17:17:26.350 INFO:teuthology.task.rados.rados.0.err: 3: (librados::C_AioComplete::finish(int)+0x1e) [0x7f9bdd3b4dae]
2012-08-13T17:17:26.350 INFO:teuthology.task.rados.rados.0.err: 4: (Finisher::finisher_thread_entry()+0x218) [0x7f9bdd4114e8]
2012-08-13T17:17:26.350 INFO:teuthology.task.rados.rados.0.err: 5: (()+0x7e9a) [0x7f9bdd077e9a]
2012-08-13T17:17:26.350 INFO:teuthology.task.rados.rados.0.err: 6: (clone()+0x6d) [0x7f9bdc88f4bd]
2012-08-13T17:17:26.351 INFO:teuthology.task.rados.rados.0.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
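
For context, the assert comes from the test harness's write-ordering invariant: completion callbacks for writes to a single object must arrive in tid order. A minimal sketch of that check (hypothetical, not the actual RadosModel.h code):

// Sketch of the invariant enforced by WriteOp::_finish: a completion
// whose tid is lower than the highest tid already acked means librados
// delivered replies for the same object out of order.
#include <cassert>
#include <cstdint>
#include <iostream>

struct WriteTracker {
  uint64_t last_acked_tid = 0;

  void on_write_ack(uint64_t tid) {
    if (tid < last_acked_tid) {
      std::cerr << "Error: finished tid " << tid
                << " when last_acked_tid was " << last_acked_tid << "\n";
      assert(0);  // matches the FAILED assert(0) in the log above
    }
    last_acked_tid = tid;
  }
};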

from the osd, we see


ubuntu@teuthology:/a/sage-msgr-rados/14$ zcat remote/ubuntu@plana66.front.sepia.ceph.com/log/osd.2.log.gz | grep client.4117.0:313 | egrep '( ==| -->)'
2012-08-13 17:17:24.040289 7f47062b3700  1 -- 10.214.132.12:6805/6031 <== client.4117 10.214.132.10:0/1009588 444 ==== osd_op(client.4117.0:3133 plana689588-300 [write 0~56] 0.1860ffb4) v4 ==== 154+0+56 (2650586133 0 4000434292) 0x33f5000 con 0x2e13780
2012-08-13 17:17:24.084791 7f47062b3700  1 -- 10.214.132.12:6805/6031 <== client.4117 10.214.132.10:0/1009588 445 ==== osd_op(client.4117.0:3134 plana689588-300 [write 561017~484826] 0.1860ffb4) v4 ==== 154+0+484826 (681737827 0 1415563408) 0x3249d80 con 0x2e13780
2012-08-13 17:17:24.237951 7f47062b3700  1 -- 10.214.132.12:6805/6031 <== client.4117 10.214.132.10:0/1009588 446 ==== osd_op(client.4117.0:3135 plana689588-300 [write 1813902~618476] 0.1860ffb4) v4 ==== 154+0+618476 (1857590084 0 20743113) 0x3249b40 con 0x2e13780
2012-08-13 17:17:24.239559 7f47062b3700  1 -- 10.214.132.12:6805/6031 <== client.4117 10.214.132.10:0/1009588 447 ==== osd_op(client.4117.0:3136 plana689588-300 [write 2845880~601117] 0.1860ffb4) v4 ==== 154+0+601117 (2960727005 0 399445279) 0x33d7480 con 0x2e13780
2012-08-13 17:17:25.389857 7f47031ac700  1 -- 10.214.132.12:6807/6031 --> osd.1 10.214.132.12:6804/6033 -- osd_sub_op(client.4117.0:3134 0.14 1860ffb4/plana689588-300/head//0 [] v 88'109 snapset=0=[]:[] snapc=0=[]) v7 -- ?+485369 0x307f400
2012-08-13 17:17:25.390836 7f47031ac700  1 -- 10.214.132.12:6807/6031 --> osd.1 10.214.132.12:6804/6033 -- osd_sub_op(client.4117.0:3133 0.14 1860ffb4/plana689588-300/head//0 [] v 88'110 snapset=0=[]:[] snapc=0=[]) v7 -- ?+599 0x372a400
2012-08-13 17:17:25.391718 7f47029ab700  1 -- 10.214.132.12:6807/6031 --> osd.1 10.214.132.12:6804/6033 -- osd_sub_op(client.4117.0:3135 0.14 1860ffb4/plana689588-300/head//0 [] v 88'111 snapset=0=[]:[] snapc=0=[]) v7 -- ?+619019 0x34a2400
2012-08-13 17:17:25.392604 7f47029ab700  1 -- 10.214.132.12:6807/6031 --> osd.1 10.214.132.12:6804/6033 -- osd_sub_op(client.4117.0:3136 0.14 1860ffb4/plana689588-300/head//0 [] v 88'112 snapset=0=[]:[] snapc=0=[]) v7 -- ?+601660 0x34a1a00
2012-08-13 17:17:25.489379 7f4705ab2700  1 -- 10.214.132.12:6807/6031 <== osd.1 10.214.132.12:6804/6033 482 ==== osd_sub_op_reply(client.4117.0:3134 0.14 1860ffb4/plana689588-300/head//0 [] ondisk, result = 0) v1 ==== 141+0+0 (3830699219 0 0) 0x344b500 con 0x2cd5140
2012-08-13 17:17:25.489574 7f4705ab2700  1 -- 10.214.132.12:6807/6031 <== osd.1 10.214.132.12:6804/6033 483 ==== osd_sub_op_reply(client.4117.0:3133 0.14 1860ffb4/plana689588-300/head//0 [] ondisk, result = 0) v1 ==== 141+0+0 (1714506867 0 0) 0x344b280 con 0x2cd5140
2012-08-13 17:17:25.490041 7f4705ab2700  1 -- 10.214.132.12:6807/6031 <== osd.1 10.214.132.12:6804/6033 484 ==== osd_sub_op_reply(client.4117.0:3135 0.14 1860ffb4/plana689588-300/head//0 [] ondisk, result = 0) v1 ==== 141+0+0 (1725329692 0 0) 0x344b000 con 0x2cd5140
2012-08-13 17:17:25.490392 7f4705ab2700  1 -- 10.214.132.12:6807/6031 <== osd.1 10.214.132.12:6804/6033 485 ==== osd_sub_op_reply(client.4117.0:3136 0.14 1860ffb4/plana689588-300/head//0 [] ondisk, result = 0) v1 ==== 141+0+0 (4081841846 0 0) 0x2fa2c80 con 0x2cd5140

note that :3134 and :3133 switch order.

the reason appears to be:

- 3133 is put on the waiting_for_missing queue; recovery ensues
- 3134 is put on op_waiters, waiting for a map (the client had a newer map epoch for this request)
- the pg does advance_map
- on_change requeues waiting_for_missing
- the pg does activate_map
- take_waiters requeues op_waiters (waiting for map)

-> 3134 now precedes 3133 in the op_queue; the sketch below reproduces the inversion.
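
The inversion is easy to reproduce in miniature: both wait lists requeue at the front of the op queue, so whichever list is requeued last jumps ahead. A minimal sketch (assumed front-requeue semantics, not the actual OSD code):

// Two ops arrive in order 3133, 3134 and are parked on different wait
// lists; requeueing each list to the front of op_queue inverts them.
#include <deque>
#include <iostream>
#include <string>

int main() {
  std::deque<std::string> op_queue;                        // ops ready to run
  std::deque<std::string> waiting_for_missing = {"3133"};  // recovery wait
  std::deque<std::string> op_waiters          = {"3134"};  // map wait

  // on_change: requeue waiting_for_missing at the front of op_queue
  op_queue.insert(op_queue.begin(),
                  waiting_for_missing.begin(), waiting_for_missing.end());

  // take_waiters: requeue the map waiters, also at the front
  op_queue.insert(op_queue.begin(),
                  op_waiters.begin(), op_waiters.end());

  for (const auto& op : op_queue)
    std::cout << op << " ";  // prints "3134 3133": order inverted
  std::cout << "\n";
}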


Related issues (1 total: 0 open, 1 closed)

Has duplicate: Ceph - Bug #2823: osd: out of order ACKs (Duplicate) - Samuel Just - 07/22/2012

Actions #1

Updated by Sage Weil over 11 years ago

  • Status changed from 12 to 7
Actions #2

Updated by Tamilarasi muthamizhan over 11 years ago

  • Status changed from 7 to In Progress
  • Source changed from Development to Q/A
Actions #3

Updated by Sage Weil over 11 years ago


ubuntu@teuthology:/a/teuthology-2012-08-20_02:00:02-regression-testing-testing-basic/4930$ cat config.yaml 
kernel: &id001
  kdb: true
  sha1: 1fe5e9932156f6122c3b1ff6ba7541c27c86718c
nuke-on-error: true
overrides:
  ceph:
    fs: ext4
    log-whitelist:
    - slow request
    sha1: 77745f9441c499871e7de0782fc7db89fcec0c53
  workunit:
    sha1: 77745f9441c499871e7de0782fc7db89fcec0c53
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana78.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCl9XSQkT+TSrTTaMDs/IFmQ3nIyrLNiqLFEahRBSLwIMbVppF2EG9XVT1AkRbul9DbY6BTSVGd/AO205di6kFRv9qpHFPfMqUt3XUATEyBZ7drpwliNlM8taFxb43hPTVwLstqM+0wy1arlan6onug+S45ORvtWaKfo4w6e6DcQDKuH4N718mNzMk/ePkr+UbmeGLSZnxKOvKSfA4vFWzEpEJEQlpqSbCQzAl06qTxE9rMCAyQkSLPrr6UmEG1FY0ZX3M1vZYzV5MCVR1J2xofHvsatfbwZjgzJI2136aC8xOBLN4TT0URQ5pYWxAixEsoDRej2u3eYLWRFa9EyIvd
  ubuntu@plana80.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCjbUncT43IZIcpSlFXnWDaagYNdnaLTfgsN8TKi3J2QYOB1BlMpaSRBByXr5MW2ZGuWrB77USjJLdRW/feuwtAjDbwc7y5woxcxn1u9eqPh4RyQWqyCmKWvi3GpdM4/4vh+L+7X5rNzJPStTjiPXCxHK+stPGCk9pg+J+KKg/GlJ1Lwx1DeWetdForGqEVJrTSSWj8RM+Nyw/V+c+t2d6gW1SzjY2NvaDdVHfduM+8X9F9aTf3KLitnIvNQWXzfqoEDNPSwrrO2uvTDyqpFkW6HLpZDnWm+io8Y9P7XMrEhWCTlQX2SHeQv/BhaA1btYAw7qfLC9n8RbeXLlQnc5yn
  ubuntu@plana81.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCzwyoEUguCYhu0TNdXrk7aWVUvePF6F6coBynLZ73Y/7eqTKAzxnNCiwUQx3kGjK33kliZDk7g/x4FdsiwzknDuGGCXP1pZyIGVtU5wNJ6ZM29XyH2SZvyU0MNfmoMzygxHR53TGcsK3hzwSbaW1woEpJmoqgIFQJ6BJr3nc2foKl79wBQdCNumqJ7sbh26xYVI29vYsJHUTYAdmyE4QrLaLOkZKU5Q/OvUnbKQbURcs7ArxFooObu9h0ENRPK4MKuxBFfgpAYTr/rMeWfVQxvSWsuOMpOdzLaNLt5UBYVVU+wxIFdDwcHb+2Er0rEV49W9xUD6JGXnaFjrxDDocfj
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 10
      read: 45
      write: 45
    ops: 4000
Actions #4

Updated by Tamilarasi muthamizhan over 11 years ago

More Logs: ubuntu@teuthology:/a/teuthology-2012-08-19_02:00:05-regression-testing-testing-basic/4288

Actions #5

Updated by Sage Weil over 11 years ago

  • Target version set to v0.51
Actions #6

Updated by Sage Weil over 11 years ago

  • Target version changed from v0.51 to 83
Actions #7

Updated by Sage Weil over 11 years ago

  • Status changed from In Progress to Resolved
Actions #8

Updated by Sage Weil over 11 years ago

  • Status changed from Resolved to In Progress

ooof, the saga continues: ubuntu@teuthology:/a/sage-gfoo2/5974

Actions #9

Updated by Sage Weil over 11 years ago

  • Status changed from In Progress to Resolved
Actions #10

Updated by Sage Weil over 11 years ago

  • Target version changed from 83 to v0.52a
Actions #11

Updated by Sage Weil over 11 years ago

  • Status changed from Resolved to In Progress
  • Assignee set to Sage Weil

nooo!


kernel: &id001
  kdb: true
  sha1: 995fc068ddf675260098c60591989bf2ee184338
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
    coverage: true
    fs: xfs
    log-whitelist:
    - slow request
    sha1: aa91cf81af548463ae583872e319316871771e94
  workunit:
    sha1: aa91cf81af548463ae583872e319316871771e94
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana76.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCoNlhzO4H0dC4/dcXgYUv2BA0azBZ8DQhoLeW3PWTI6bkmNKB5dR40BVtf7XcTwoLYsbKYKKf3hwQOo53TQNCOpiPjk2DyAIWUbth6JYraLWcDoIjsdGOjHi3hdjcCkIS3SzG/ZTgK8UwOu8fU2wdcuLNIV02jvU1t5n46Vw2Q1I8h4Am38x7+91UF41/E/ibHHxjFDbZMiqB75IWBaQumQzMmWkoH6z2xHQyWIxLUoHemxXGDCW2Mv4vaaHPsCfWuzbLLX+jhQRbOapWbaalM0/B2V0tieQGrunTlr9Pmve3+1Rga4RwB4cTYpmxYj+MvZqOSktky6Fpd2Dta+dr/
  ubuntu@plana78.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCl9XSQkT+TSrTTaMDs/IFmQ3nIyrLNiqLFEahRBSLwIMbVppF2EG9XVT1AkRbul9DbY6BTSVGd/AO205di6kFRv9qpHFPfMqUt3XUATEyBZ7drpwliNlM8taFxb43hPTVwLstqM+0wy1arlan6onug+S45ORvtWaKfo4w6e6DcQDKuH4N718mNzMk/ePkr+UbmeGLSZnxKOvKSfA4vFWzEpEJEQlpqSbCQzAl06qTxE9rMCAyQkSLPrr6UmEG1FY0ZX3M1vZYzV5MCVR1J2xofHvsatfbwZjgzJI2136aC8xOBLN4TT0URQ5pYWxAixEsoDRej2u3eYLWRFa9EyIvd
  ubuntu@plana79.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCx0nVMVy140vXGRPqjqx63mfytPlqmoN7YoJ3Si0ti1XtvJTftB9EdQGwqj/tsY95DeUNBtAQs5TBsiLr1E/JHlKt7EXwyWsJNB2ntvkPJOMxoounypjkVgfv91EWmERQGFsalDmIYjSuSCG28g5Vaz8il9D7fH/ykKZ38EQChhPXIpB2bieJOr2Xm6llde1q2rUEltV17EmiQvu9eUuxb9y9h057k6GSqpsTViPADlT7CG7W60bqWs8d7TvV4rvPhUy6oyUp1ar8116NMSFUiaTgVTidDiQ3xyZeguwJAbzh86MQdHVhSi89W/vjvoEP1opjZP3RArB4BoNwzz/Dh
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 10
      read: 45
      write: 45
    ops: 4000
ubuntu@teuthology:/a/teuthology-2012-08-25_19:00:04-regression-master-testing-gcov/8842

Actions #12

Updated by Sage Weil over 11 years ago

  • Status changed from In Progress to 7
Actions #13

Updated by Sage Weil over 11 years ago

  • Status changed from 7 to Resolved
Actions #14

Updated by Tamilarasi muthamizhan over 11 years ago

  • Status changed from Resolved to In Progress
  • Target version deleted (v0.52a)

Recent logs: ubuntu@teuthology:/a/teuthology-2012-10-20_19:00:11-regression-master-testing-gcov/4810

2012-10-20T20:31:19.803 INFO:teuthology.task.rados.rados.0.err:Error: finished tid 1 when last_acked_tid was 3
2012-10-20T20:31:19.806 INFO:teuthology.task.rados.rados.0.out:Writing plana3710565-450 from 3100727 to 3596741 tid 4 ranges are [0~57,796741~740629,1992905~604660,3100727~496014]
2012-10-20T20:31:19.815 INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7f0be97fa700 time 2012-10-20 20:31:19.809501
2012-10-20T20:31:19.815 INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: 823: FAILED assert(0)
2012-10-20T20:31:19.816 INFO:teuthology.task.rados.rados.0.err: ceph version 0.53-396-ge6cbcb2 (commit:e6cbcb240073ddb39c7434d6133c8c6722c71e1c)
2012-10-20T20:31:19.816 INFO:teuthology.task.rados.rados.0.err: 1: (WriteOp::_finish(TestOp::CallbackInfo*)+0x42d) [0x418b8d]
2012-10-20T20:31:19.816 INFO:teuthology.task.rados.rados.0.err: 2: (write_callback(void*, void*)+0x29) [0x42d1a9]
2012-10-20T20:31:19.816 INFO:teuthology.task.rados.rados.0.err: 3: (librados::C_AioComplete::finish(int)+0x20) [0x7f0bf3aee7a0]
2012-10-20T20:31:19.816 INFO:teuthology.task.rados.rados.0.err: 4: (Finisher::finisher_thread_entry()+0x340) [0x7f0bf3b71d50]
2012-10-20T20:31:19.816 INFO:teuthology.task.rados.rados.0.err: 5: (Finisher::FinisherThread::entry()+0x15) [0x7f0bf3ad7975]
2012-10-20T20:31:19.816 INFO:teuthology.task.rados.rados.0.err: 6: (Thread::_entry_func(void*)+0x12) [0x7f0bf3cf2fc2]
2012-10-20T20:31:19.816 INFO:teuthology.task.rados.rados.0.err: 7: (()+0x7e9a) [0x7f0bf378fe9a]
2012-10-20T20:31:19.817 INFO:teuthology.task.rados.rados.0.err: 8: (clone()+0x6d) [0x7f0bf2fa74bd]
2012-10-20T20:31:19.827 INFO:teuthology.task.rados.rados.0.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #15

Updated by Sage Weil over 11 years ago

  • Priority changed from Urgent to High
Actions #16

Updated by Sage Weil over 11 years ago

  • Status changed from In Progress to Can't reproduce
Actions #17

Updated by Tamilarasi muthamizhan about 11 years ago

  • Status changed from Can't reproduce to In Progress

recent logs: ubuntu@teuthology:/a/teuthology-2013-02-10_01:00:02-regression-master-testing-gcov/4134

2013-02-10T06:29:43.312 INFO:teuthology.task.rados.rados.0.err:Error: finished tid 1 when last_acked_tid was 2
2013-02-10T06:29:43.314 INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7fa357fff700 time 2013-02-10 06:29:38.392361
2013-02-10T06:29:43.314 INFO:teuthology.task.rados.rados.0.err:./test/osd/RadosModel.h: 823: FAILED assert(0)
2013-02-10T06:29:43.314 INFO:teuthology.task.rados.rados.0.err: ceph version 0.56-707-gabc80ff (abc80ffc5b1aab3915c049701ab85c57fe93d550)
2013-02-10T06:29:43.314 INFO:teuthology.task.rados.rados.0.err: 1: (WriteOp::_finish(TestOp::CallbackInfo*)+0x42b) [0x41843b]
2013-02-10T06:29:43.314 INFO:teuthology.task.rados.rados.0.err: 2: (write_callback(void*, void*)+0x29) [0x42cd69]
2013-02-10T06:29:43.315 INFO:teuthology.task.rados.rados.0.err: 3: (librados::C_AioComplete::finish(int)+0x20) [0x7fa36a19f230]
2013-02-10T06:29:43.315 INFO:teuthology.task.rados.rados.0.err: 4: (Finisher::finisher_thread_entry()+0x298) [0x7fa36a22fd38]
2013-02-10T06:29:43.315 INFO:teuthology.task.rados.rados.0.err: 5: (Finisher::FinisherThread::entry()+0x15) [0x7fa36a188935]
2013-02-10T06:29:43.315 INFO:teuthology.task.rados.rados.0.err: 6: (Thread::_entry_func(void*)+0x12) [0x7fa36a3c0072]
2013-02-10T06:29:43.315 INFO:teuthology.task.rados.rados.0.err: 7: (()+0x7e9a) [0x7fa369e2ae9a]
2013-02-10T06:29:43.315 INFO:teuthology.task.rados.rados.0.err: 8: (clone()+0x6d) [0x7fa3696424bd]
2013-02-10T06:29:43.316 INFO:teuthology.task.rados.rados.0.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-02-10T06:29:43.316 INFO:teuthology.task.rados.rados.0.err:terminate called after throwing an instance of 'ceph::FailedAssertion'
2013-02-10T06:29:43.401 ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x2780650>
Traceback (most recent call last):
  File "/var/lib/teuthworker/teuthology-master/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/teuthology-master/teuthology/task/rados.py", line 99, in task
    run.wait(tests.itervalues())
  File "/var/lib/teuthworker/teuthology-master/teuthology/orchestra/run.py", line 273, in wait
    proc.exitstatus.get()
  File "/var/lib/teuthworker/teuthology-master/virtualenv/local/lib/python2.7/site-packages/gevent/event.py", line 223, in get
    raise self._exception
CommandCrashedError: Command crashed: 'CEPH_CLIENT_ID=0 CEPH_CONF=/tmp/cephtest/ceph.conf LD_LIBRARY_PATH=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/testrados --op read 100 --op write 100 --op delete 50 --op snap_create 50 --op snap_remove 50 --op rollback 50 --op setattr 0 --op rmattr 0 --op watch 0 --max-ops 4000 --objects 500 --max-in-flight 16 --size 4000000 --min-stride-size 400000 --max-stride-size 800000 --max-seconds 0'
2013-02-10T06:29:43.432 DEBUG:teuthology.run_tasks:Unwinding manager <contextlib.GeneratorContextManager object at 0x2773e10>

ubuntu@teuthology:/a/teuthology-2013-02-10_01:00:02-regression-master-testing-gcov/4134$ cat config.yaml 
kernel: &id001
  kdb: true
  sha1: 012d5bda1c0f229494c67098d00edfa24c531ea5
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      osd:
        osd op thread timeout: 60
    coverage: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: abc80ffc5b1aab3915c049701ab85c57fe93d550
  s3tests:
    branch: master
  workunit:
    sha1: abc80ffc5b1aab3915c049701ab85c57fe93d550
roles:
- - mon.a
  - mon.b
  - osd.0
  - osd.1
  - osd.2
- - mon.c
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana63.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDDy2BKPe+fe5jK0ziU8aKM0DzODSTaKWecQRwLjnZLbjDvTyHm8x8xX/JCts3bFfrc2ozFz7ILBIWU96JRZiFF2TtFZjtf1H19kyvR8PWCxiZ/lld+C7B6U8iiPSNiSlgo7mwkpk1JoSpHe4rK/Z7WQRWBMsCC7XJETu6rRX3i0ZYaKh8BoWWhpsBs1quSNxRXNUqJ6OKnDbB5Vuan1TK9b49RXmibx+oapXm8V0sHEVLYa+NTUs+wAEHAnjFgRe75Cik/rmgeE0m2Cff1rp9tFhEEDwZ5PUdnscOTY78BxImMRdkbZ8lJXOGcOOsD3Dj1jOr4pVrgxZqUdtWfJGkj
  ubuntu@plana68.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCjq/dBds5eofgD5l4r9AcpVnq4tnZrPYUlDPcgRgzUSM36rUAOA9IPzjamtlb+kdmo5TkWve/QgqDt/6SI07L0EeULiRhPnYjsDsDvRINoL/7oUd6dqhsC+5OMrnuscOIQUVM40TR/J4F2pwUWSM5YKsc6VBYkmh+2JTvI9ZiwXsA0KY0a6qa5ScDDZtfGapo166cm/BkqmGPPudYqCrXbCqKLaXjSizU3e02RlFLwF4xw0wKCKQ5T/mh8mqUOEWnZZTVNx0Xrkit0q247KKc4sNZYX4sQVRx8MMedOr33VEhmyuOY8TgHLFp/aB0HzbatYYJb0HATxbjwJWFjGBIX
  ubuntu@plana74.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCbl3n0OItF2T7uE/9SNA9YV+pq3nFmmr3EBjf2aMo7Bkyy6e9MC3kaK64vzlZxppbpxEhbrgWo7GsVHxphZqwtMt5Ek3RDbLUpQ8dPckbhOZWBhoZe9S8uCynDIZaF3Yq/b1//OBufSdM0TRYK4Vf4rtQnpyWPEn48HaairNhQvKCxY6XUImZc7Lh4CIaMkT0yyZkBlksJeAkTs4UgxhkSoSc9IAlAxwwPK5JpRL7GOowT+hsqYCVJM5+NVqIOGS66Vam/m9xHqecQtfYIEFoVFZiF/xi+b2kstxJzD39UTNP/XB5J2CtJOujW9J1jsdPKSz1L9XoDIkAFHzQjdhyx
tasks:
- internal.lock_machines:
  - 3
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
ubuntu@teuthology:/a/teuthology-2013-02-10_01:00:02-regression-master-testing-gcov/4134$ cat summary.yaml 
ceph-sha1: abc80ffc5b1aab3915c049701ab85c57fe93d550
client.0-kernel-sha1: 012d5bda1c0f229494c67098d00edfa24c531ea5
description: collection:rados-thrash clusters:6-osd-3-machine.yaml fs:btrfs.yaml msgr-failures:few.yaml
  thrashers:default.yaml workloads:snaps-many-objects.yaml
duration: 1533.4427070617676
failure_reason: 'Command crashed: ''CEPH_CLIENT_ID=0 CEPH_CONF=/tmp/cephtest/ceph.conf
  LD_LIBRARY_PATH=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump
  /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage
  /tmp/cephtest/binary/usr/local/bin/testrados --op read 100 --op write 100 --op delete
  50 --op snap_create 50 --op snap_remove 50 --op rollback 50 --op setattr 0 --op
  rmattr 0 --op watch 0 --max-ops 4000 --objects 500 --max-in-flight 16 --size 4000000
  --min-stride-size 400000 --max-stride-size 800000 --max-seconds 0'''
flavor: gcov
mon.a-kernel-sha1: 012d5bda1c0f229494c67098d00edfa24c531ea5
mon.c-kernel-sha1: 012d5bda1c0f229494c67098d00edfa24c531ea5
owner: scheduled_teuthology@teuthology
success: false

Actions #18

Updated by Tamilarasi muthamizhan about 11 years ago

  • Priority changed from High to Urgent
Actions #19

Updated by Sage Weil about 11 years ago

woot, reproduced with logs: ubuntu@teuthology:/a/sage-s1/1886

Actions #20

Updated by Sage Weil about 11 years ago

a sequence of messages, all tagged with epochs greater than ours, is put on waiting_for_map:

- op1 for pg 0.f
- notify for 0.f (that will trigger its creation)
- op2 for pg 0.f
- op3 for pg 0.f
- op4 for pg 0.f

we get our map, and then:

- op1 is put on waiting_for_pg
- notify creates 0.f, puts op1 at end of finished queue
- op2 is handled
- op3 is handled
- op4 is handled
- op1 is handled (the sketch below reproduces this ordering)
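
A minimal sketch of this second reordering (assumed dispatch semantics, not the actual OSD code): op1 detours through waiting_for_pg, and the finished queue is drained only after the remaining map waiters have been handled directly:

// Draining waiting_for_map in arrival order, but parking op1 on
// waiting_for_pg until the notify creates the pg, pushes op1 to the
// back of the dispatch order.
#include <deque>
#include <iostream>
#include <string>
#include <vector>

int main() {
  std::deque<std::string> waiting_for_map =
      {"op1", "notify 0.f", "op2", "op3", "op4"};

  std::vector<std::string> waiting_for_pg;  // ops for a pg we don't have yet
  std::vector<std::string> finished;        // filled by the notify handler
  std::vector<std::string> dispatched;      // actual execution order
  bool pg_exists = false;

  for (const auto& m : waiting_for_map) {
    if (m == "notify 0.f") {
      pg_exists = true;                     // notify creates pg 0.f ...
      finished.insert(finished.end(),       // ... and moves parked ops
                      waiting_for_pg.begin(), waiting_for_pg.end());
      waiting_for_pg.clear();
    } else if (!pg_exists) {
      waiting_for_pg.push_back(m);          // op1 parks here
    } else {
      dispatched.push_back(m);              // op2..op4 handled immediately
    }
  }
  // the finished queue is drained only after the map waiters
  dispatched.insert(dispatched.end(), finished.begin(), finished.end());

  for (const auto& op : dispatched)
    std::cout << op << " ";  // prints "op2 op3 op4 op1"
  std::cout << "\n";
}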

Actions #21

Updated by Sage Weil about 11 years ago

  • Status changed from In Progress to 7
Actions #22

Updated by Sage Weil about 11 years ago

  • Status changed from 7 to Resolved

56c5a07708d52de1699585c9560cff8b4e993d0a

I've been hammering on this test for a while, and it's turning up some other issues, but this one seems fixed.

