Bug #8381: osd crash when osd use leveldb as filestore - Ceph - Ceph


Bug #8381

closed

osd crash when osd use leveldb as filestore

Added by Xinxin Shu almost 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: Normal
Assignee: Haomai Wang
Category: OSD
Source: Community (dev)
Severity: 3 - minor

Description

When the OSD uses leveldb as its store (by adding the configuration option osd_objectstore = keyvaluestore-dev to ceph.conf), I used qemu with rbd. After I attached an rbd image to a VM, everything seemed OK and I could see the disk label with 'fdisk -l'. However, when I ran 'dd if=/dev/zero of=/dev/vdb bs=1M &', the ceph cluster crashed and I saw a "bad op" error. The attached file is the detailed log.

-8> 2014-05-19 10:47:44.828717 7f55579aa700 10 osd.0 pg_epoch: 245 pg[3.3f( v 245'1 (0'0,245'1] local-les=222 n=1 ec=44 les/c 222/222 221/221/173) [0,44] r=0 lpr=221 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] append_log  adding 1 keys
    -7> 2014-05-19 10:47:44.828753 7f55579aa700 10 write_log with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: 0, writeout_from: 245'1, trimmed:
    -6> 2014-05-19 10:47:44.828821 7f5564dff700 12 KeyValueStore::op_tp worker wq KeyValueStore::OpWQ start processing 0x8b7dc20 (1 active)
    -5> 2014-05-19 10:47:44.828818 7f55579aa700 10 osd.0 pg_epoch: 245 pg[3.3f( v 245'1 (0'0,245'1] local-les=222 n=1 ec=44 les/c 222/222 221/221/173) [0,44] r=0 lpr=221 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] eval_repop repgather(0x76a0840 245'1 rep_tid=1 committed?=0 applied?=0 lock=0 op=osd_op(client.4358.0:104 rbd_data.10426b8b4567.0000000000000004 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] 3.bf28a03f ack+ondisk+write e245) v4) wants=ad
    -4> 2014-05-19 10:47:44.828839 7f55579aa700 10 osd.0 245 dequeue_op 0xa57c780 finish
    -3> 2014-05-19 10:47:44.828843 7f55579aa700 15 OSD::op_tp worker wq OSD::OpWQ done processing 0x1 (0 active)
    -2> 2014-05-19 10:47:44.828846 7f55579aa700 20 OSD::op_tp worker waiting
    -1> 2014-05-19 10:47:44.829013 7f5564dff700 -1 bad op 2307
     0> 2014-05-19 10:47:44.830448 7f5564dff700 -1 os/KeyValueStore.cc: In function 'unsigned int KeyValueStore::_do_transaction(ObjectStore::Transaction&, KeyValueStore::BufferTransaction&, SequencerPosition&, ThreadPool::TPHandle*)' thread 7f5564dff700 time 2014-05-19 10:47:44.829023
os/KeyValueStore.cc: 1457: FAILED assert(0)

ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (KeyValueStore::_do_transaction(ObjectStore::Transaction&, KeyValueStore::BufferTransaction&, SequencerPosition&, ThreadPool::TPHandle*)+0x1d0) [0x9d8720]
 2: (KeyValueStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x8e) [0x9daa0e]
 3: (KeyValueStore::_do_op(KeyValueStore::OpSequencer*, ThreadPool::TPHandle&)+0x97) [0x9dab17]
 4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb5101a]
 5: (ThreadPool::WorkThread::entry()+0x10) [0xb52270]
 6: (()+0x7e9a) [0x7f556d321e9a]
 7: (clone()+0x6d) [0x7f556b8ccccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Files

osd.0.log (1.69 MB)	Xinxin Shu, 05/18/2014 08:05 PM
osd.0.log (2.11 MB)	Xinxin Shu, 05/19/2014 07:20 PM
ceph-rbd.log (21.4 KB)	Xinxin Shu, 05/31/2014 09:54 PM


Updated by Haomai Wang almost 10 years ago

Assignee set to Haomai Wang

I can't reproduce the crash. Running "dd" or "fio" in the VM seemed fine.

The bad op "2307" is confusing. As far as I know, no op code that large should exist.


Updated by Xinxin Shu almost 10 years ago

Hi Haomai, this issue occurs as soon as I use 'dd'. To help you find the root cause, what kind of information should I provide? By the way, what is the ceph configuration on your test setup?
Haomai Wang wrote:

I can't reproduce the crash. Running "dd" or "fio" in the VM seemed fine.

The bad op "2307" is confusing. As far as I know, no op code that large should exist.


Updated by Haomai Wang almost 10 years ago

You can add "debug_keyvaluestore = 20/20" to ceph.conf.


Updated by Xinxin Shu almost 10 years ago

File osd.0.log added

log with 'debug keyvaluestore = 20/20'


Updated by Haomai Wang almost 10 years ago

Thanks, Xinxin!

The bug is caused by the set_alloc_hint op.

case Transaction::OP_SETALLOCHINT:
      // TODO: can kvstore make use of the hint?
      break;

We skip the op without decoding its arguments, so those arguments stay in the transaction stream and the next op is decoded from the wrong position, which gives an incorrect result.
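The effect described above can be reproduced with a small self-contained sketch. It is not Ceph's code: the opcodes, the flat 32-bit encoding, and the helper names (put_u32, get_u32, make_sample_stream, replay) are all hypothetical and chosen only for illustration. It shows that skipping an op without consuming its arguments makes the decoder treat an argument as the next opcode.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical opcodes for this sketch; they are not Ceph's real values.
enum : uint32_t { OP_WRITE = 1, OP_SETALLOCHINT = 2 };

// Append a 32-bit value to the encoded stream (host byte order).
void put_u32(std::vector<uint8_t>& buf, uint32_t v) {
  uint8_t raw[sizeof v];
  std::memcpy(raw, &v, sizeof v);
  buf.insert(buf.end(), raw, raw + sizeof v);
}

// Read a 32-bit value at pos and advance pos past it.
uint32_t get_u32(const std::vector<uint8_t>& buf, std::size_t& pos) {
  assert(pos <= buf.size());
  if (buf.size() - pos < sizeof(uint32_t)) throw std::out_of_range("short stream");
  uint32_t v;
  std::memcpy(&v, buf.data() + pos, sizeof v);
  pos += sizeof v;
  return v;
}

// Encode: a hint op with two arguments, followed by a write op.
std::vector<uint8_t> make_sample_stream() {
  std::vector<uint8_t> buf;
  put_u32(buf, OP_SETALLOCHINT);
  put_u32(buf, 4194304);  // expected object size
  put_u32(buf, 4194304);  // expected write size
  put_u32(buf, OP_WRITE);
  put_u32(buf, 0);        // offset
  put_u32(buf, 4194304);  // length
  return buf;
}

// Walk the stream. Returns the first unrecognized opcode, or 0 if every
// op decoded cleanly. consume_hint_args=false mimics the buggy case that
// breaks out of the switch without decoding the hint's arguments.
uint32_t replay(const std::vector<uint8_t>& buf, bool consume_hint_args) {
  std::size_t pos = 0;
  while (pos < buf.size()) {
    uint32_t op = get_u32(buf, pos);
    switch (op) {
      case OP_WRITE:
        get_u32(buf, pos);  // offset
        get_u32(buf, pos);  // length
        break;
      case OP_SETALLOCHINT:
        if (consume_hint_args) {
          get_u32(buf, pos);  // expected object size (ignored)
          get_u32(buf, pos);  // expected write size (ignored)
        }
        break;
      default:
        return op;  // the "bad op" case
    }
  }
  return 0;
}
```

With this sample stream, replay(make_sample_stream(), true) returns 0, while replay(make_sample_stream(), false) returns 4194304, because the first hint argument is misread as an opcode. The value differs from the 2307 in the log since the real transaction encoding is different.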


Updated by Haomai Wang almost 10 years ago

Category set to OSD
Status changed from New to Fix Under Review


Updated by Xinxin Shu almost 10 years ago

Hi Haomai, after applying your patch, the VM is killed when I run 'virsh attach-device' to attach the rbd image. However, when I change the OSD backend to filestore, everything seems OK. I checked dmesg and found nothing useful. I do not know how to diagnose this issue; can you give me some hints? Thanks. By the way, my ceph version is ceph version 0.80.1-1-g410c990.


Updated by Haomai Wang almost 10 years ago

You can add these lines to /etc/ceph/ceph.conf on the host that runs qemu:
[client]
log_file = /var/log/ceph/ceph-rbd.log
admin_socket = /var/run/ceph/ceph-rbd.asok

Then you should be able to get information from /var/log/ceph/ceph-rbd.log.


Updated by Haomai Wang almost 10 years ago

Hi Xinxin,

I booted the VM with the keyvaluestore backend and attached a new disk. Nothing went wrong.


#10

Updated by Xinxin Shu almost 10 years ago

File ceph-rbd.log added

Hi Haomai, I can reproduce this error from time to time. The attached file is the detailed client log. By the way, can you list your OS, libvirt/qemu and ceph configuration?


#11

Updated by Haomai Wang almost 10 years ago

It seems that no useful information can be obtained from the log.

My ceph cluster is compiled from the master branch.
(qemu-kvm-1.2.0)
I don't use libvirt, just qemu directly. I don't think libvirt is the cause.


#12

Updated by Shaun McDowell almost 10 years ago

We saw this issue as well. Is there a way we can pull this fix with ceph-deploy install --dev <Branch or Commit> to get the fix and see if the problem is resolved? Currently, ceph-deploy doesn't seem to pull anything newer than the ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74). We also would like to avoid cloning and building on all of our nodes.


#13

Updated by Haomai Wang almost 10 years ago

Hi Shaun,
The fix is in https://github.com/ceph/ceph/pull/1840. I'm not familiar with ceph-deploy, but I hope the link helps. If anything is confusing, let me know.


#14

Updated by Xinxin Shu almost 10 years ago

I get the following error from dmesg. How can I identify the cause of this error?

[1117577.605328] virbr0: port 1(vnet0) entered forwarding state
[1117592.937103] kvm[209389]: segfault at 7f6c00000018 ip 00007f6d15f63009 sp 00007f6c7aefb7f0 error 6 in librados.so.2.0.0[7f6d15c16000+66b000]
[1117592.972125] virbr0: port 1(vnet0) entered disabled state
[1117592.972479] virbr0: port 1(vnet0) entered disabled state
[1117592.972601] device vnet0 left promiscuous mode
[1117592.972603] virbr0: port 1(vnet0) entered disabled state


#15