Fix #6990: osd crash when running mixed versions of dumpling and master - Ceph - Ceph

Fix #6990

closed

osd crash when running mixed versions of dumpling and master

Added by Tamilarasi muthamizhan over 10 years ago. Updated over 10 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

David Zafman

Target version:

v0.75

Source:

Q/A

Description

Steps to reproduce:

1. Run a two-node cluster on the dumpling version of Ceph.
2. Upgrade only the OSDs on the first node to the master branch.
3. Thrash the OSDs.

This causes the OSDs running the master branch to crash.

Logs are copied to ubuntu@mira052.front.sepia.ceph.com:/home/ubuntu/bug

2013-12-12 14:28:26.317723 7fc955f97700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::do_backfill(OpRequestRef)' thread 7fc955f97700 time 2013-12-12 14:28:26.316292
osd/ReplicatedPG.cc: 1439: FAILED assert(is_replica())

 ceph version 0.67.4-37-ga447fb7 (a447fb7d04fbad84f9ecb57726396bb6ca29d8f6)
 1: (ReplicatedPG::do_backfill(std::tr1::shared_ptr<OpRequest>)+0xbd5) [0x5d6125]
 2: (PG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f0) [0x706c80]
 3: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x330) [0x65ae10]
 4: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x4a0) [0x671510]
 5: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x6acb8c]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x8b4f06]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x8b6d10]
 8: (()+0x7e9a) [0x7fc96a09ce9a]
 9: (clone()+0x6d) [0x7fc9681e8ccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
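
The "ceph version" line in the trace records which build hit the assert, so it is worth checking against the release each OSD is expected to run. A minimal sketch of that check, assuming only that dumpling is the 0.67.x series; the helper names `parse_ceph_version` and `is_dumpling` are made up for illustration and are not part of Ceph or teuthology:

```python
import re

# Version line copied verbatim from the backtrace above.
VERSION_LINE = "ceph version 0.67.4-37-ga447fb7 (a447fb7d04fbad84f9ecb57726396bb6ca29d8f6)"


def parse_ceph_version(line):
    """Return (major, minor, patch) from a 'ceph version X.Y[.Z]...' line."""
    match = re.search(r"ceph version (\d+)\.(\d+)(?:\.(\d+))?", line)
    if match is None:
        raise ValueError("no ceph version found in: %r" % line)
    major, minor, patch = match.groups()
    # A missing patch component is treated as 0.
    return int(major), int(minor), int(patch or 0)


def is_dumpling(version):
    """Dumpling is the 0.67.x stable series (hypothetical helper)."""
    return version[:2] == (0, 67)


version = parse_ceph_version(VERSION_LINE)
label = "dumpling" if is_dumpling(version) else "not dumpling"
print(version, label)
```

Running the same check over each OSD log shows which side of the mixed-version cluster produced a given crash.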

Config file to reproduce the issue:

tamil@tamil-VirtualBox:~/tam_final/teuthology$ cat up_master.yaml 
overrides:
  ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch

roles:
- [mon.a, mon.b, osd.0, osd.1, osd.2, mds.a]
- [mon.c, osd.3, osd.4, osd.5, client.0]

targets:
  ubuntu@mira023.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCw8G36ubCLJBcN7Ys9+3erO+GTlJyGJirlP2p1zdkuB4gNpG0scx9lZcM+id8D9ywrA+gQK5DMKaYBuhDHzk8tvbtX9X5TsCdXHpQJtrXmvUCSPKKOK7efnhw/qRB43CYa2p4sM+X1i7QTCXBOjk8syYzM5sxumjsxswsTsVnZ75xRcOIK30W8Cog3wwVsbr4ZaJ8YlMxNObzPqOYlfYCsl+AJ8ELa7hPd+8JTP3EBYjiVvfjntkmYr8CWA+z9kXRxp6Iv9ADr4OAB9uJOkQpOAievN2qF1hCFLoI0Qxlw2px0fVpLl0SFOctVRFnefzWnuYeN+CjNHgnUAVN5HaBj
  ubuntu@mira052.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC3sW7EMc9QRG2qjunPv8uQ3rCKTYjs/P/6/aYnNUJ8CM3IkJHexkNlkGYdTD5fOyVzQBC1c+SoqPpyRYPcJvNSOOiJpoQuUE1eyVNYLdtrFaqGCN9nmQg0turDQMwDlE8nK2Fmk74xB1Bc7lvaGm9/EqZrYYMq0KSTKGlIXUD/lAHzdAbe0uItRuEi7g7FALZ9lVgUBVdW3zE+pBpIW/yqP3NKNzP6cwaDu00tUGYgnQi8tjDo+0zZEMTa4hFb8dbO4HVz+10J7qZZCPATiX0SAZvGpm9YferGLxUdGG0qeuo/SHjc2UCMg1TfFug3oRSLDlUI3BllscyCWuWXZZ2j

tasks:
- chef:
- install:
    branch: dumpling
- ceph:
    fs: xfs
- install.upgrade:
    osd.0:
      branch: master
- ceph.restart:
    daemons: [osd.0, osd.1, osd.2]
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- ceph.restart:
    daemons: [mon.a]
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    clients:
      client.0:
      - rados/test.sh
- ceph.restart:
    daemons: [mon.b]
    wait-for-healthy: false
    wait-for-osds-up: true
- workunit:
    clients:
      client.0:
      - rados/test.sh
- ceph.restart:
    daemons: [mon.c]
    wait-for-healthy: false
    wait-for-osds-up: true
- ceph.wait_for_mon_quorum: [a, b, c]
- workunit:
    clients:
      client.0:
      - rados/test.sh
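
To see which OSDs end up on which side of the version split, the `roles` lists above can be grouped by node. This is a rough sketch, assuming that `install.upgrade` on `osd.0` upgrades packages for the whole node that hosts it (consistent with the config restarting osd.0, osd.1 and osd.2 together); `colocated_osds` is a hypothetical helper, not a teuthology function:

```python
# Roles copied from the config above: one inner list per target node.
ROLES = [
    ["mon.a", "mon.b", "osd.0", "osd.1", "osd.2", "mds.a"],
    ["mon.c", "osd.3", "osd.4", "osd.5", "client.0"],
]


def colocated_osds(roles, role):
    """Return every OSD role on the node that hosts `role`."""
    for node_roles in roles:
        if role in node_roles:
            return [r for r in node_roles if r.startswith("osd.")]
    raise KeyError(role)


# Assumption: upgrading osd.0 upgrades all daemons on its node.
upgraded = colocated_osds(ROLES, "osd.0")

# Every other OSD in the cluster stays on the originally installed branch.
remaining = [
    r for node_roles in ROLES for r in node_roles
    if r.startswith("osd.") and r not in upgraded
]

print("master:  ", upgraded)
print("dumpling:", remaining)
```

Under that assumption, half of the OSDs run master and half run dumpling while `thrashosds` is active.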

Updated by Greg Farnum over 10 years ago

Tamil, I think you're backwards about which OSDs crash? That backtrace says .67.4, and the master branch OSDs don't contain that assert anywhere any more. :)
