Bug #2691

closed

osd/ReplicatedPG.cc: 5888: FAILED assert(latest->is_update())

Added by Sage Weil almost 12 years ago. Updated about 11 years ago.

Status: Won't Fix
Priority: Normal
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport: argonaut
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

    -1> 2012-07-02 14:19:01.529985 7fba53ac2700 -1 osd/ReplicatedPG.cc: In function 'int ReplicatedPG::recover_primary(int)' thread 7fba53ac2700 time 2012-07-02 14:19:01.528553
osd/ReplicatedPG.cc: 5888: FAILED assert(latest->is_update())

 ceph version 0.47.3-547-g2472034 (commit:2472034c4fcd53575734ac8ac8e877687c9ca910)
 1: (ReplicatedPG::recover_primary(int)+0x4d3) [0x553093]
 2: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*)+0x7b) [0x55ef4b]
 3: (OSD::do_recovery(PG*)+0x464) [0x5cb644]
 4: (ThreadPool::worker()+0x605) [0x7a08d5]
 5: (ThreadPool::WorkThread::entry()+0xd) [0x5e09dd]
 6: (()+0x7e9a) [0x7fba65139e9a]
 7: (clone()+0x6d) [0x7fba636ee4bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2012-07-02 14:19:01.530957 7fba54ac4700  0 log [INF] : 2.6 scrub ok

ubuntu@teuthology:/a/sage-2012-07-02_11:21:38-regression-next-testing-basic/4596$ cat config.yaml 
kernel: &id001
  branch: testing
  kdb: true
nuke-on-error: true
overrides:
  ceph:
    branch: next
    conf:
      client:
        rbd cache: true
    fs: btrfs
    log-whitelist:
    - slow request
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana50.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDVJ+lkgUdkr27WFzrmwSQU22m+pFIiqzhfcO4Hinu8A8uyP4FIephrEcq4Rrt4hp14Syb1pxXisV6UKwAZKikDoD1Wl0LSro4TzOs6HuMEhfvzdnISvyzE3f2w0cj1zE61rHFYfPNF14b9fkE3wBf2Vb4i6ReaN2/Yd12J/xO52tJH1lPxgsFoAIRMjdQMbfVwPU6kK9SY4ngt9iLjge6gZ0O9Jwe2vrgD6+LNoMY9qvNjgRvQdCTi85OQwitU0ZMZdGC0cQ/oNbKd+yW92rW9Wu6dcyKSisesRcm7lbtS6X2uUup+u3vWze7coT+Py3TdNW6nGpIg4muyvqHfSinz
  ubuntu@plana67.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDqiZsiT7h3fNR9yzwK2WToaotO4olIxmVdh+aSf3ILwEpHjFYbWXymL0C77hn0MdGRbaOzWOSMMng3MAHKy9xR3/CGNXXqO7iEK1fJOvSfmypkvJDyrMY/RuSvdifcXJyREvFsSK6cdmRpO235ODhfui4FC5BLmgv/VvasH/1Ur4ALfe7UE9L+cU4VeoJdl082oYeo1nn1beERgaypX67MXepG2NKbEY77jG5FXbGVpKWmsgIEWiiX8p6+afTOP+8cGsM3vsAG7nTJeFVKkEHc7A8cPkT4l/iXKjSiwWAtU5NV0QmRC/1ad78+xTOWNzJaTrIxoKuuGpB+DjdvrJgN
  ubuntu@plana69.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDE4lTHELBl2d+BXuaxMKtJX4r3GN2qSaxHTfewZxutqEC+rSNbD5Otiqwm16GtmCYklbYJL0yr6mizCZ0KzL3lVwNSczb/CxfvLTLegmBv9YzPRvoL1ymxdHggPIw47JTNRg6+UO9EConQpp7LOktT7fYf3AFx/BS9Ux6UeKyxifkb/1GUlzVT2rk1D39frBhI7AcPHovvbd/6uHvamrpUS3wSDWdz6BNRTQDaK307TJS9LPva6Cj6WAgmTZZesjWB4mOpWywrA9sva7GqSrnTMGKnIHPcsI6U57CALwHS1g/zIz73QuolZA2NkbhA7jjrfsUaWYJorE9dynIga9pf
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down or wrong addr
    - objects unfound and apparently lost
- thrashosds: null
- rbd_fsx:
    clients:
    - client.0
    ops: 20000



Files

ceph.tar.xz (21.6 MB): OSD log, binary, core. Artem Grinblat, 02/14/2013 04:11 PM
Actions #1

Updated by Sage Weil almost 12 years ago

Took down osd.2 and osd.3 with the same crash. Coredumps are on the hosts.

Actions #2

Updated by Samuel Just almost 12 years ago

  • Assignee set to Samuel Just
Actions #3

Updated by Sage Weil over 11 years ago

  • Target version set to v0.51
Actions #4

Updated by Sage Weil over 11 years ago

  • Target version changed from v0.51 to 83
Actions #5

Updated by Tamilarasi muthamizhan over 11 years ago

  • Status changed from 12 to In Progress

Recent log: ubuntu@teuthology:/a/teuthology-2012-08-20_00:00:04-regression-next-testing-basic/4822


   -10> 2012-08-20 02:05:18.505644 7fe225181700 -1 osd/ReplicatedPG.cc: In function 'int ReplicatedPG::recover_primary(int)' thread 7fe225181700 time 2012-08-20 02:05:18.499420
osd/ReplicatedPG.cc: 6004: FAILED assert(latest->is_update())

 ceph version 0.50-109-gda210be (commit:da210bee091705fc488cc7c839c32d46280c1719)
 1: (ReplicatedPG::recover_primary(int)+0x4f1) [0x56bac1]
 2: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*)+0x7b) [0x58b8db]
 3: (OSD::do_recovery(PG*)+0x361) [0x5e13a1]
 4: (OSD::RecoveryWQ::_process(PG*)+0x15) [0x615835]
 5: (ThreadPool::worker()+0x523) [0x7e3533]
 6: (ThreadPool::WorkThread::entry()+0xd) [0x5f90fd]
 7: (()+0x7e9a) [0x7fe2367f8e9a]
 8: (clone()+0x6d) [0x7fe234dad4bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2012-08-20 02:05:18.727855 7fe225181700 -1 *** Caught signal (Aborted) **
 in thread 7fe225181700

 ceph version 0.50-109-gda210be (commit:da210bee091705fc488cc7c839c32d46280c1719)
 1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x72afa1]
 2: (()+0xfcb0) [0x7fe236800cb0]
 3: (gsignal()+0x35) [0x7fe234cf1445]
 4: (abort()+0x17b) [0x7fe234cf4bab]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe23563f69d]
 6: (()+0xb5846) [0x7fe23563d846]
 7: (()+0xb5873) [0x7fe23563d873]
 8: (()+0xb596e) [0x7fe23563d96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ea) [0x7ed1da]
 10: (ReplicatedPG::recover_primary(int)+0x4f1) [0x56bac1]
 11: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*)+0x7b) [0x58b8db]
 12: (OSD::do_recovery(PG*)+0x361) [0x5e13a1]
 13: (OSD::RecoveryWQ::_process(PG*)+0x15) [0x615835]
 14: (ThreadPool::worker()+0x523) [0x7e3533]
 15: (ThreadPool::WorkThread::entry()+0xd) [0x5f90fd]
 16: (()+0x7e9a) [0x7fe2367f8e9a]
 17: (clone()+0x6d) [0x7fe234dad4bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2012-08-20 02:05:18.727855 7fe225181700 -1 *** Caught signal (Aborted) **
 in thread 7fe225181700

 ceph version 0.50-109-gda210be (commit:da210bee091705fc488cc7c839c32d46280c1719)
 1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x72afa1]
 2: (()+0xfcb0) [0x7fe236800cb0]
 3: (gsignal()+0x35) [0x7fe234cf1445]
 4: (abort()+0x17b) [0x7fe234cf4bab]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe23563f69d]
 6: (()+0xb5846) [0x7fe23563d846]
 7: (()+0xb5873) [0x7fe23563d873]
 8: (()+0xb596e) [0x7fe23563d96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ea) [0x7ed1da]
 10: (ReplicatedPG::recover_primary(int)+0x4f1) [0x56bac1]
 11: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*)+0x7b) [0x58b8db]
 12: (OSD::do_recovery(PG*)+0x361) [0x5e13a1]
 13: (OSD::RecoveryWQ::_process(PG*)+0x15) [0x615835]
 14: (ThreadPool::worker()+0x523) [0x7e3533]
 15: (ThreadPool::WorkThread::entry()+0xd) [0x5f90fd]
 16: (()+0x7e9a) [0x7fe2367f8e9a]
 17: (clone()+0x6d) [0x7fe234dad4bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

ubuntu@teuthology:/a/teuthology-2012-08-20_00:00:04-regression-next-testing-basic/4822$ cat config.yaml 
kernel: &id001
  kdb: true
  sha1: 1fe5e9932156f6122c3b1ff6ba7541c27c86718c
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        rbd cache: false
    fs: ext4
    log-whitelist:
    - slow request
    sha1: da210bee091705fc488cc7c839c32d46280c1719
  workunit:
    sha1: da210bee091705fc488cc7c839c32d46280c1719
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana02.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCtjMpSkaJhFqFtpo5AEe3KHygR+ueaWU+gYrrRzPa8YvmR0TCapw0kz77y1Fjcfh8rkTapnevpaYgQSMrMs0Yc34kF5XtNRuQXkpTwrhS8isZJBeNSc1W5XeKjj4KB/UuzBywJq0h/0KbH1DrMy72cGISOzdiP9CMA5KUvJo0m31wv1+MPcPn/5AhZgoWPStfaZdb4TaJUrNLrws0oRXa0yQbUa6WmUBsYhHsw4K1ukJAcJwVjcgAAv1N+GnyuWLVs+pvknBO3Whv1RhjY6EDGjun1MDPw+OE3wJsJX7BRr8eZv2Avi7pRlseWeWJwgsHMJ/j0yhf+SCy1+oSPrD2b
  ubuntu@plana62.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC+nI5/l38Kdw2W/qbEKrVMcnVdIxJG7hNnD7nnS3+Zx/uPiWrds26ZPrM5IY7D8Mf7sjBzUYbqsX9xGYMLLTQaeDwsZn/7RjjSg8zOS1aMP5F/AJzSQx4Nt37eLUsRHX3yA30/OQcl6sBgDjHyhSPcSuHWSnMmoy4pkDo3xpQMQMtxDG8gWq+to1hZwJbsiK9FdutEgPJg3inWM1WVc5L6NmRN2WQNEGT8HvtlBCWqX6/H/hLujQlbgyJAbeG4BriMV3gCIccJE833f/fN9KIzaMlD7qHTgWcaGk+LY84nUdNlTkNoX+L4m6WRY8/Pt9om2dOocsXyCwYLIS4heIDT
  ubuntu@plana63.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDDy2BKPe+fe5jK0ziU8aKM0DzODSTaKWecQRwLjnZLbjDvTyHm8x8xX/JCts3bFfrc2ozFz7ILBIWU96JRZiFF2TtFZjtf1H19kyvR8PWCxiZ/lld+C7B6U8iiPSNiSlgo7mwkpk1JoSpHe4rK/Z7WQRWBMsCC7XJETu6rRX3i0ZYaKh8BoWWhpsBs1quSNxRXNUqJ6OKnDbB5Vuan1TK9b49RXmibx+oapXm8V0sHEVLYa+NTUs+wAEHAnjFgRe75Cik/rmgeE0m2Cff1rp9tFhEEDwZ5PUdnscOTY78BxImMRdkbZ8lJXOGcOOsD3Dj1jOr4pVrgxZqUdtWfJGkj
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rbd_fsx:
    clients:
    - client.0
    ops: 20000
Actions #6

Updated by Sage Weil over 11 years ago

  • Target version changed from 83 to v0.52a
Actions #7

Updated by Sage Weil over 11 years ago

  • Priority changed from Urgent to High
Actions #8

Updated by Tamilarasi muthamizhan over 11 years ago

Recent logs: ubuntu@teuthology:/a/teuthology-2012-09-11_02:00:03-regression-testing-testing-basic/20743

Actions #9

Updated by Tamilarasi muthamizhan over 11 years ago

ubuntu@teuthology:/a/teuthology-2012-09-12_02:00:04-regression-testing-testing-basic/21369

Actions #10

Updated by Sage Weil over 11 years ago

  • Priority changed from High to Urgent
Actions #11

Updated by Tamilarasi muthamizhan over 11 years ago

Recent logs: ubuntu@teuthology:/a/teuthology-2012-09-13_04:00:05-regression-stable-master-basic/22002

Actions #12

Updated by Samuel Just over 11 years ago

  • Status changed from In Progress to Resolved
Actions #13

Updated by Samuel Just over 11 years ago

  • Status changed from Resolved to 12
  • Priority changed from Urgent to Normal
  • Target version deleted (v0.52a)
  • Backport set to argonaut

This has shown up once in argonaut; probably not worth backporting unless it becomes more of a problem.

Actions #14

Updated by Tamilarasi muthamizhan over 11 years ago

for reference, ubuntu@teuthology:/a/teuthology-2013-01-10_07:00:03-regression-argonaut-master-basic/38145

Actions #15

Updated by Artem Grinblat about 11 years ago

I might have a similar assertion here on bobtail (ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)):

2013-02-15 03:46:57.148126 7f5c25686700  0 log [ERR] : 3.35 osd.0: soid 7da92cf5/LHN1Xt3V8IMebWbOB1ZRapRnDn4nMb3/head//3 size 2119242 != known size 0, digest 1367424093 != known d
2013-02-15 03:46:57.148190 7f5c25686700  0 log [ERR] : repair 3.35 7da92cf5/LHN1Xt3V8IMebWbOB1ZRapRnDn4nMb3/head//3 on disk size (0) does not match object info size (2119242)
2013-02-15 03:49:05.151320 7f5c25686700  0 log [ERR] : 3.35 repair stat mismatch, got 51056/51056 objects, 0/0 clones, 7195419192/7197538434 bytes.
2013-02-15 03:49:05.151417 7f5c25686700  0 log [ERR] : 3.35 repair 0 missing, 1 inconsistent objects
2013-02-15 03:49:05.151450 7f5c25686700  0 log [ERR] : 3.35 repair 3 errors, 2 fixed
2013-02-15 03:49:05.156288 7f5c25e87700  0 log [ERR] : 3.35 missing primary copy of 7da92cf5/LHN1Xt3V8IMebWbOB1ZRapRnDn4nMb3/head//3, unfound
2013-02-15 03:49:05.389117 7f5c25e87700 -1 osd/ReplicatedPG.cc: In function 'int ReplicatedPG::recover_primary(int)' thread 7f5c25e87700 time 2013-02-15 03:49:05.311197
osd/ReplicatedPG.cc: 6537: FAILED assert(latest->is_update())

 ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)
 1: (ReplicatedPG::recover_primary(int)+0x5a1) [0x670571]
 2: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*)+0x112) [0x672e92]
 3: (OSD::do_recovery(PG*)+0x277) [0x6e48b7]
 4: (OSD::RecoveryWQ::_process(PG*)+0xd) [0x721b1d]
 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x992) [0x8e9ce2]
 6: (ThreadPool::WorkThread::entry()+0x10) [0x8eac70]
 7: (()+0x6b50) [0x7f5c38fc6b50]
 8: (clone()+0x6d) [0x7f5c37752a7d]

I had been running "ceph pg repair 3.35" and had removed the inconsistent object with "rados -p ... rm LHN1Xt3V8IMebWbOB1ZRapRnDn4nMb3".
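
Since the assert sits at a different source line in each build quoted in this ticket (5888, 6004, 6537), matching on the expression rather than the line number may help when checking other OSD logs for the same crash. Below is a minimal sketch, assuming the log format matches the "FAILED assert" lines pasted above; the function name `find_failed_asserts` and the regex are illustrative only and are not part of Ceph or teuthology.

```python
import re

# Assumed line shape, taken from the traces in this ticket:
#   osd/ReplicatedPG.cc: 5888: FAILED assert(latest->is_update())
ASSERT_RE = re.compile(
    r"^(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<expr>.*)\)\s*$"
)


def find_failed_asserts(log_text):
    """Return a list of (file, line, expression) for each FAILED assert line."""
    hits = []
    for raw in log_text.splitlines():
        match = ASSERT_RE.match(raw.strip())
        if match:
            hits.append(
                (match.group("file"), int(match.group("line")), match.group("expr"))
            )
    return hits


if __name__ == "__main__":
    sample = (
        "osd/ReplicatedPG.cc: 5888: FAILED assert(latest->is_update())\n"
        " ceph version 0.47.3-547-g2472034\n"
        "osd/ReplicatedPG.cc: 6537: FAILED assert(latest->is_update())\n"
    )
    # Group by expression so the same assert is recognised across builds.
    for path, lineno, expr in find_failed_asserts(sample):
        print(f"{expr} at {path}:{lineno}")
```

Grouping on the expression and file only identifies candidate matches; the backtrace still has to be compared to confirm that the crash came through recover_primary().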

Actions #16

Updated by Sage Weil about 11 years ago

  • Status changed from 12 to Won't Fix

this is pre-argonaut
