Bug #2956

osd: FAILED assert(waiting_for_ondisk.begin()->first == repop->v)

Added by Tamilarasi muthamizhan over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Urgent
Assignee: Sage Weil
Category: -
Target version: v0.52a
% Done: 0%
Source: Q/A
Regression: No
Severity: 3 - minor

Description

Logs: ubuntu@teuthology:/a/teuthology-2012-08-15_19:00:16-regression-master-testing-gcov/1878

2012-08-15 21:35:20.030298 7ff87c176700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)' thread 7ff87c176700 time 2012-08-15 21:35:19.853043
osd/ReplicatedPG.cc: 3547: FAILED assert(waiting_for_ondisk.begin()->first == repop->v)

 ceph version 0.50-182-g08b8bba (commit:08b8bba433e6471eb76b3ed8dd6b23fbbf796af3)
 1: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x397) [0x56f857]
 2: (ReplicatedPG::repop_ack(ReplicatedPG::RepGather*, int, int, int, eversion_t)+0x21c) [0x57192c]
 3: (ReplicatedPG::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x22a) [0x572a7a]
 4: (ReplicatedPG::do_sub_op_reply(std::tr1::shared_ptr<OpRequest>)+0x84) [0x5ca5a4]
 5: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x404) [0x6f4794]
 6: (OSD::dequeue_op(PG*)+0x304) [0x611f94]
 7: (OSD::OpWQ::_process(PG*)+0x15) [0x678445]
 8: (ThreadPool::WorkQueue<PG>::_void_process(void*)+0x12) [0x66ea32]
 9: (ThreadPool::worker()+0x4db) [0x8f396b]
 10: (ThreadPool::WorkThread::entry()+0x15) [0x66ff65]
 11: (Thread::_entry_func(void*)+0x12) [0x8e6412]
 12: (()+0x7e9a) [0x7ff88c7ebe9a]
 13: (clone()+0x6d) [0x7ff88ada04bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
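
For context, here is a minimal C++ sketch of the invariant the assert enforces. The types (eversion_t, OpRequest, RepGather) and the map shape are simplified stand-ins, not the actual OSD code: duplicate ops wait in a map keyed by the version of the in-progress write, and eval_repop() expects the front of that map to match the repop being completed.

#include <cassert>
#include <list>
#include <map>

// Hypothetical stand-ins for the OSD types involved in the assert.
struct eversion_t {
  unsigned epoch = 0, version = 0;
  bool operator==(const eversion_t &o) const {
    return epoch == o.epoch && version == o.version;
  }
  bool operator<(const eversion_t &o) const {
    return epoch < o.epoch || (epoch == o.epoch && version < o.version);
  }
};
struct OpRequest {};
struct RepGather { eversion_t v; };  // an in-progress replicated write

// Duplicate ops park here, keyed by the version of the write they duplicate.
std::map<eversion_t, std::list<OpRequest*>> waiting_for_ondisk;

// When a repop commits, the lowest-keyed waiters must belong to it; a stale
// entry left behind by a requeue makes the keys disagree, and the assert
// fires, producing the backtrace above.
void eval_repop_waiters(RepGather *repop) {
  if (!waiting_for_ondisk.empty()) {
    assert(waiting_for_ondisk.begin()->first == repop->v);
    // ... send ondisk replies to the waiting dups, then drop the entry ...
    waiting_for_ondisk.erase(waiting_for_ondisk.begin());
  }
}
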
ubuntu@teuthology:/a/teuthology-2012-08-15_19:00:16-regression-master-testing-gcov/1878$ cat config.yaml 
kernel: &id001
  kdb: true
  sha1: 1fe5e9932156f6122c3b1ff6ba7541c27c86718c
nuke-on-error: true
overrides:
  ceph:
    coverage: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 08b8bba433e6471eb76b3ed8dd6b23fbbf796af3
  workunit:
    sha1: 08b8bba433e6471eb76b3ed8dd6b23fbbf796af3
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana25.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDuXajaQgHe9XnbLOzI8WWFYVz6+TnOiTzbkIJPGOZpzQEjnUtJraQIEt5ABSeovMjiEj+V4XvunfyuSmEd0H9giRSyjmCHTPGlpndfTeCdVtCBpNqf5GkUqHaEY1Hp57XPbya2rGlwtFm0NeIDYx6pfkejKnsTOUqwhgUb6950TRhjHQhMjFgyALSyfAm/4y6vGZfjm57+yyih6XgDkqWiiQ6Y/aJVR2n+iCzvqEzV7JSCU+Brn+k8IQLHho1fadYqc5PjYct5BaVlHcP6c+T8nJE/DvqGwZ4gQaVJcuWJiDfLOPPYo1g/0AFicxauLwVNJ6HFR9FjLLGtGU+2DcVN
  ubuntu@plana36.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC9Ru6XkJBGiUQK9AtlFt82TzpaWuKams26i0FcItt3hbniR1yxpWVHM3dQI5Gft3liumnOD+cPZiZJzGYyj2KDBCZ8G9V65YqCbzO+moJmv5wDWKg1pEIIW040aLrlsOPbZlEL7htT14MHTTstyTQCOLkrySCpexwYrA2wQBhsHc7pxL+XLa+WM1zTXSQe6QrS8iYxITGRibEMSjcXlOuLFnst42O6o4WQHd31WS9pbniBmso7KVgTFxmcN5rvEo1YAJJYwVxGfmorWrXan1ULY6CksasatbCuohmVNNZfsnE8KdyYsPYCbKIPp9NnmBL3Pp/oPqqyPsj36Wgj5e4/
  ubuntu@plana37.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDrxOb9f5/SfItd83HOnLVyJRnfji0fbdvL+3T82akjV6J4s/nyR8Bu+rpXbyUwu2BRDoxK4pT2dBqw86meq1qbU5Q1ypWBSH41MYGd213fy0g8YibFiYVGmXFCSwtY8X2Pet9vtLDoYvtnsgNI8djy5GPkQyZFKSszJHznZvQU10NWfM6RfxxtsBKXC/aot4QXb3GIym2/EmeuTAAef6p98dd15P9l9HQkpwXZLwiDZ53IbU79CTINo5HTD/6+1XHUcjb1OUKzQMx1jU485gW6IlsR0G0jJKSv+YEu4zSxxva7gWt1AYxGo2jhNDffEGLsNurzXFf9yeYshCTAszLf
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds: null
- kclient: null
- workunit:
    clients:
      all:
      - suites/ffsb.sh
ubuntu@teuthology:/a/teuthology-2012-08-15_19:00:16-regression-master-testing-gcov/1878$ cat summary.yaml 
ceph-sha1: 08b8bba433e6471eb76b3ed8dd6b23fbbf796af3
client.0-kernel-sha1: 1fe5e9932156f6122c3b1ff6ba7541c27c86718c
description: collection:kernel-thrash clusters:fixed-3.yaml fs:btrfs.yaml thrashers:default.yaml
  workloads:kclient_workunit_suites_ffsb.yaml
duration: 824.2622230052948
failure_reason: 'Command failed with status 1: ''/tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage
  /tmp/cephtest/archive/coverage /tmp/cephtest/daemon-helper term /tmp/cephtest/binary/usr/local/bin/ceph-osd
  -f -i 2 -c /tmp/cephtest/ceph.conf'''
flavor: gcov
mon.a-kernel-sha1: 1fe5e9932156f6122c3b1ff6ba7541c27c86718c
mon.b-kernel-sha1: 1fe5e9932156f6122c3b1ff6ba7541c27c86718c
owner: scheduled_teuthology@teuthology
success: false

Related issues

Duplicates Ceph - Bug #3072: osd/ReplicatedPG.cc: 3548: FAILED assert(waiting_for_ondisk.begin()->first == repop->v) Resolved 09/04/2012

Associated revisions

Revision dd4c1dc9 (diff)
Added by Sage Weil over 8 years ago

osd: fix requeue order of dup ops

The waiting_for_ondisk (and ack) maps get dups of ops that are in progress.
If we have a peering change in which the role does not change, we will
requeue the in-progress ops but leave these in the waiting_for_ondisk
maps, which will then trigger an assert the next time we examine that map
and find it didn't match up with what we expected.

Fix this by requeuing these on any peering reset in on_change(). This
keeps the two queues in sync.

Fixes: #2956
Signed-off-by: Sage Weil <>
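
A minimal sketch of the requeue described above, using the same simplified stand-ins as the sketch under the description plus a hypothetical requeue_ops() helper; the actual change is revision dd4c1dc9:

#include <list>
#include <map>

// Simplified stand-ins, as in the earlier sketch.
struct eversion_t {
  unsigned epoch = 0, version = 0;
  bool operator<(const eversion_t &o) const {
    return epoch < o.epoch || (epoch == o.epoch && version < o.version);
  }
};
struct OpRequest {};

std::map<eversion_t, std::list<OpRequest*>> waiting_for_ack;
std::map<eversion_t, std::list<OpRequest*>> waiting_for_ondisk;
std::list<OpRequest*> op_queue;  // hypothetical PG op queue

// Hypothetical helper: put requeued ops back onto the PG op queue.
void requeue_ops(std::list<OpRequest*> &ops) {
  op_queue.splice(op_queue.begin(), ops);
}

// On any peering reset, drain both dup-op maps back onto the op queue so
// they stay in sync with the requeued in-progress ops; leaving them behind
// is what tripped the assert in eval_repop().
void on_change_requeue_dups() {
  std::list<OpRequest*> rq;
  for (auto &p : waiting_for_ack)
    rq.splice(rq.end(), p.second);  // splice preserves per-version order
  waiting_for_ack.clear();
  for (auto &p : waiting_for_ondisk)
    rq.splice(rq.end(), p.second);
  waiting_for_ondisk.clear();
  requeue_ops(rq);
}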

History

#1 Updated by Sage Weil over 8 years ago

  • Priority changed from Normal to Urgent

Another occurrence:

ubuntu@teuthology:/a/sage-2012-08-20_09:17:16-rados-master-testing-next-basic$ cat 5116/summary.yaml 
ceph-sha1: cfe211af138db2d309a8691d8629c5c12926a6f1
client.0-kernel-sha1: dff193ce4b08151b6d01fc99491b571c61efd44d
description: collection:thrash clusters:6-osd-3-machine.yaml fs:btrfs.yaml msgr-failures:few.yaml
  thrashers:default.yaml workloads:radosbench.yaml
duration: 2956.7149050235748
failure_reason: 'Command failed with status 1: ''/tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage
  /tmp/cephtest/archive/coverage /tmp/cephtest/daemon-helper kill /tmp/cephtest/binary/usr/local/bin/ceph-osd
  -f -i 0 -c /tmp/cephtest/ceph.conf'''
flavor: basic
mds.a-kernel-sha1: dff193ce4b08151b6d01fc99491b571c61efd44d
mon.a-kernel-sha1: dff193ce4b08151b6d01fc99491b571c61efd44d
owner: scheduled_sage@metropolis
success: false
ubuntu@teuthology:/a/sage-2012-08-20_09:17:16-rados-master-testing-next-basic$ cd 5116
ubuntu@teuthology:/a/sage-2012-08-20_09:17:16-rados-master-testing-next-basic/5116$ cat config.yaml 
kernel: &id001
  kdb: true
  sha1: dff193ce4b08151b6d01fc99491b571c61efd44d
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        debug ms: 20
        ms inject socket failures: 5000
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: cfe211af138db2d309a8691d8629c5c12926a6f1
  workunit:
    sha1: cfe211af138db2d309a8691d8629c5c12926a6f1
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana47.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCUMS+/Rfo92n0pY5cDrv+M9lss9i6+Zum4aa4aE54KsOcKkl+6yooZcZL8bllGLVL1W7BkaBOJ59dQwTVIo/UAgiKyA4J5IVwBPjwNNp4/mXzKJtKQPj0UrTCKsQrKasWPC+FVRzqJRK70cgC5D40znuopmfmENoPwCniOJALFCw3q8XLkcq1SH0jzDXJdsrnTVGxwRHYq9cF9J7fr6XZQXuAk7XO3jG1eqlF8xljmkvI0Ftux50TkOsDzpkscD5jHkxiFj/gkO2KR5GNbybdnxllHBAYuv2hoxrsW2oyIxbeforwZFV0DcDhRReRTx8BhXZ0o5erZgPgzS+ZbfWol
  ubuntu@plana75.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC/sBKIbaWlkUEbStD0wYVUj2aEuiP8WB0B4h4oyzOJaWaKSTPAK2hzAxEDVOkG1JhpR2JrfXitDtA7MW48NvP77Ov/EvOnTHBeTE7mvWL0D2d4/YUoqhF+RLojHgFNOE0FsVEc/2rhARYX9/4VL5YQ1kaE4dKeRqLxn/eA6BoW5+NDbdQ1Bt6qWNSTXYC2qs09do6wUXHbB+KE1Obay4QTGf77QA+ueVnAnKmYym5c5kGMqb7DD+I/OZyUcOWTCQ4sDpo2nh0GpHATqAAWXeFMSpJ0sVQmR5ByTpKsoRV3QxmxlNHBJVDrBoGbw7O0z8AisuwOfqzrOO5M3Q+16Gen
  ubuntu@plana79.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCx0nVMVy140vXGRPqjqx63mfytPlqmoN7YoJ3Si0ti1XtvJTftB9EdQGwqj/tsY95DeUNBtAQs5TBsiLr1E/JHlKt7EXwyWsJNB2ntvkPJOMxoounypjkVgfv91EWmERQGFsalDmIYjSuSCG28g5Vaz8il9D7fH/ykKZ38EQChhPXIpB2bieJOr2Xm6llde1q2rUEltV17EmiQvu9eUuxb9y9h057k6GSqpsTViPADlT7CG7W60bqWs8d7TvV4rvPhUy6oyUp1ar8116NMSFUiaTgVTidDiQ3xyZeguwJAbzh86MQdHVhSi89W/vjvoEP1opjZP3RArB4BoNwzz/Dh
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- radosbench:
    clients:
    - client.0
    time: 1800

#2 Updated by Sage Weil over 8 years ago

  • Assignee set to Sage Weil

#3 Updated by Sage Weil over 8 years ago

  • Status changed from New to 7

#4 Updated by Sage Weil over 8 years ago

  • Target version set to v0.51

#5 Updated by Sage Weil over 8 years ago

  • Target version changed from v0.51 to 83

#6 Updated by Sage Weil over 8 years ago

  • Status changed from 7 to Resolved

#7 Updated by Sage Weil over 8 years ago

  • Target version changed from 83 to v0.52a
