Fix #6990

closed

osd crash when running mixed versions of dumpling and master

Added by Tamilarasi muthamizhan over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Urgent
Assignee: David Zafman
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Steps to reproduce:

1. Run a cluster of two nodes with the dumpling version of Ceph.
2. Upgrade only the OSDs on the first node to the master branch.
3. Thrash the OSDs.

This causes the OSD running the master branch to crash.

Logs are copied to :/home/ubuntu/bug

2013-12-12 14:28:26.317723 7fc955f97700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::do_backfill(OpRequestRef)' thread 7fc955f97700 time 2013-12-12 14:28:26.316292
osd/ReplicatedPG.cc: 1439: FAILED assert(is_replica())

 ceph version 0.67.4-37-ga447fb7 (a447fb7d04fbad84f9ecb57726396bb6ca29d8f6)
 1: (ReplicatedPG::do_backfill(std::tr1::shared_ptr<OpRequest>)+0xbd5) [0x5d6125]
 2: (PG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f0) [0x706c80]
 3: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x330) [0x65ae10]
 4: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x4a0) [0x671510]
 5: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x6acb8c]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x8b4f06]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x8b6d10]
 8: (()+0x7e9a) [0x7fc96a09ce9a]
 9: (clone()+0x6d) [0x7fc9681e8ccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Config file to reproduce the issue:

tamil@tamil-VirtualBox:~/tam_final/teuthology$ cat up_master.yaml 
overrides:
  ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch

roles:
- [mon.a, mon.b, osd.0, osd.1, osd.2, mds.a]
- [mon.c, osd.3, osd.4, osd.5, client.0]

targets:
  ubuntu@mira023.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCw8G36ubCLJBcN7Ys9+3erO+GTlJyGJirlP2p1zdkuB4gNpG0scx9lZcM+id8D9ywrA+gQK5DMKaYBuhDHzk8tvbtX9X5TsCdXHpQJtrXmvUCSPKKOK7efnhw/qRB43CYa2p4sM+X1i7QTCXBOjk8syYzM5sxumjsxswsTsVnZ75xRcOIK30W8Cog3wwVsbr4ZaJ8YlMxNObzPqOYlfYCsl+AJ8ELa7hPd+8JTP3EBYjiVvfjntkmYr8CWA+z9kXRxp6Iv9ADr4OAB9uJOkQpOAievN2qF1hCFLoI0Qxlw2px0fVpLl0SFOctVRFnefzWnuYeN+CjNHgnUAVN5HaBj
  ubuntu@mira052.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC3sW7EMc9QRG2qjunPv8uQ3rCKTYjs/P/6/aYnNUJ8CM3IkJHexkNlkGYdTD5fOyVzQBC1c+SoqPpyRYPcJvNSOOiJpoQuUE1eyVNYLdtrFaqGCN9nmQg0turDQMwDlE8nK2Fmk74xB1Bc7lvaGm9/EqZrYYMq0KSTKGlIXUD/lAHzdAbe0uItRuEi7g7FALZ9lVgUBVdW3zE+pBpIW/yqP3NKNzP6cwaDu00tUGYgnQi8tjDo+0zZEMTa4hFb8dbO4HVz+10J7qZZCPATiX0SAZvGpm9YferGLxUdGG0qeuo/SHjc2UCMg1TfFug3oRSLDlUI3BllscyCWuWXZZ2j

tasks:
- chef:
- install:
    branch: dumpling
- ceph:
    fs: xfs
- install.upgrade:
    osd.0:
      branch: master
- ceph.restart:
    daemons: [osd.0, osd.1, osd.2]
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- ceph.restart:
    daemons: [mon.a]
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    clients:
      client.0:
      - rados/test.sh
- ceph.restart:
    daemons: [mon.b]
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    clients:
      client.0:
      - rados/test.sh
- ceph.restart:
    daemons: [mon.c]
    wait-for-healthy: false
    wait-for-osds-up: true
- ceph.wait_for_mon_quorum: [a, b, c]
- workunit:
    clients:
      client.0:
      - rados/test.sh
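
The job above leaves the cluster split by node. As a rough sketch, assuming that `install.upgrade` keyed on `osd.0` upgrades the whole node hosting that role and that the following `ceph.restart` puts osd.0-osd.2 on the new packages, the hypothetical helper `osd_branches` (not part of teuthology) derives which branch each OSD runs from the `roles` layout:

```cpp
#include <map>
#include <string>
#include <vector>

// One teuthology node (remote): the list of roles it hosts, as in `roles:`.
using Node = std::vector<std::string>;

// Hypothetical helper, not part of teuthology. Assumes install.upgrade keyed
// on a role upgrades the whole node hosting that role, and that every OSD on
// that node has been restarted onto the new packages (as ceph.restart does
// for osd.0-osd.2 in up_master.yaml).
std::map<std::string, std::string> osd_branches(const std::vector<Node>& nodes,
                                                const std::string& upgraded_role,
                                                const std::string& base_branch,
                                                const std::string& new_branch) {
  std::map<std::string, std::string> branches;
  for (const Node& node : nodes) {
    // Is the upgraded role hosted on this node?
    bool upgraded = false;
    for (const std::string& role : node) {
      if (role == upgraded_role) upgraded = true;
    }
    // Record the branch for every osd.* role on this node.
    for (const std::string& role : node) {
      if (role.rfind("osd.", 0) == 0) {
        branches[role] = upgraded ? new_branch : base_branch;
      }
    }
  }
  return branches;
}
```

With the roles from up_master.yaml, osd.0-osd.2 map to master and osd.3-osd.5 stay on dumpling, so backfill between the two nodes can cross versions while the thrasher is moving PGs around.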

Actions #1

Updated by Greg Farnum over 10 years ago

Tamil, I think you're backwards about which OSDs crash? That backtrace says 0.67.4, and the master branch OSDs don't contain that assert anywhere anymore. :)
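
Greg's check can be done mechanically from the banner line of the backtrace. A minimal sketch, assuming only that the dumpling release is the 0.67.x series; `CephVersion`, `parse_banner` and `is_dumpling` are hypothetical helpers, not Ceph code:

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical: version numbers parsed from a banner such as
// " ceph version 0.67.4-37-ga447fb7 (...)"; -1 means "not found".
struct CephVersion {
  int major_num = -1;
  int minor_num = -1;
  int patch_num = -1;
};

CephVersion parse_banner(const std::string& line) {
  CephVersion v;
  const std::string key = "ceph version ";
  const std::size_t pos = line.find(key);
  if (pos == std::string::npos) return v;
  std::istringstream in(line.substr(pos + key.size()));
  int a = 0, b = 0, c = 0;
  char dot1 = 0, dot2 = 0;
  // operator>> on an int stops at '.', so this reads 0 '.' 67 '.' 4 and
  // leaves the "-37-g..." suffix unread.
  if (in >> a >> dot1 >> b >> dot2 >> c && dot1 == '.' && dot2 == '.') {
    v.major_num = a;
    v.minor_num = b;
    v.patch_num = c;
  }
  return v;
}

// Assumes only that dumpling is the v0.67.x series.
bool is_dumpling(const CephVersion& v) {
  return v.major_num == 0 && v.minor_num == 67;
}
```

Fed the banner from the description, this reports 0.67.4, i.e. a dumpling OSD, which matches the correction in this comment.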

Actions #2

Updated by Greg Farnum over 10 years ago

  • Assignee set to David Zafman

And I'm pretty sure this is fallout from some of David's no-acting-set-for-backfillers changes.
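
One way to read the failed `assert(is_replica())`, sketched under assumptions not confirmed in this thread: a PG's role on an OSD is its index in the acting set (0 for the primary, -1 if absent), and `is_replica()` means a positive role. If a master primary picks a backfill target outside the acting set, which is what the no-acting-set-for-backfillers changes suggest, a dumpling OSD receiving that backfill computes role -1 and the assert fires. The names mirror Ceph's, but the code is a hypothetical model, not the `ReplicatedPG` implementation:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical model: an OSD's role in a PG is its index in the acting set
// (0 = primary), or -1 if the OSD is not in the acting set at all.
int calc_role(const std::vector<int>& acting, int whoami) {
  const auto it = std::find(acting.begin(), acting.end(), whoami);
  if (it == acting.end()) return -1;
  return static_cast<int>(it - acting.begin());
}

bool is_primary(int role) { return role == 0; }
bool is_replica(int role) { return role > 0; }

// Models the dumpling-side precondition that failed at osd/ReplicatedPG.cc:1439:
// an OSD handling a backfill op must hold a replica role in the acting set.
bool dumpling_backfill_precondition(const std::vector<int>& acting, int whoami) {
  return is_replica(calc_role(acting, whoami));
}
```

With a hypothetical acting set of [0, 4], osd.4 satisfies the check, while osd.5 acting as a backfill target outside the acting set would trip the assertion on a dumpling OSD.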

Actions #3

Updated by Tamilarasi muthamizhan over 10 years ago

Oh yes, Greg, you are right. This happened on the dumpling OSDs and not on the master ones.

Actions #4

Updated by David Zafman over 10 years ago

  • Tracker changed from Bug to Fix

Actions #5

Updated by David Zafman over 10 years ago

  • Target version set to v0.76a

Actions #6

Updated by David Zafman over 10 years ago

  • Target version changed from v0.76a to v0.75

Actions #7

Updated by Sage Weil over 10 years ago

  • Status changed from New to Resolved