Bug #8381
osd crash when the osd uses leveldb as the object store (Closed)
Description
When the OSD uses leveldb as its object store (by adding the config option osd_objectstore = keyvaluestore-dev to ceph.conf), I used qemu with rbd. After I attached an rbd image to a VM, everything seemed OK and I could read the disk label with 'fdisk -l'. However, when I ran 'dd if=/dev/zero of=/dev/vdb bs=1M &', the ceph cluster crashed with a "bad op" error. The attached file is the detailed log.
-8> 2014-05-19 10:47:44.828717 7f55579aa700 10 osd.0 pg_epoch: 245 pg[3.3f( v 245'1 (0'0,245'1] local-les=222 n=1 ec=44 les/c 222/222 221/221/173) [0,44] r=0 lpr=221 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] append_log adding 1 keys
-7> 2014-05-19 10:47:44.828753 7f55579aa700 10 write_log with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: 0, writeout_from: 245'1, trimmed:
-6> 2014-05-19 10:47:44.828821 7f5564dff700 12 KeyValueStore::op_tp worker wq KeyValueStore::OpWQ start processing 0x8b7dc20 (1 active)
-5> 2014-05-19 10:47:44.828818 7f55579aa700 10 osd.0 pg_epoch: 245 pg[3.3f( v 245'1 (0'0,245'1] local-les=222 n=1 ec=44 les/c 222/222 221/221/173) [0,44] r=0 lpr=221 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] eval_repop repgather(0x76a0840 245'1 rep_tid=1 committed?=0 applied?=0 lock=0 op=osd_op(client.4358.0:104 rbd_data.10426b8b4567.0000000000000004 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] 3.bf28a03f ack+ondisk+write e245) v4) wants=ad
-4> 2014-05-19 10:47:44.828839 7f55579aa700 10 osd.0 245 dequeue_op 0xa57c780 finish
-3> 2014-05-19 10:47:44.828843 7f55579aa700 15 OSD::op_tp worker wq OSD::OpWQ done processing 0x1 (0 active)
-2> 2014-05-19 10:47:44.828846 7f55579aa700 20 OSD::op_tp worker waiting
-1> 2014-05-19 10:47:44.829013 7f5564dff700 -1 bad op 2307
0> 2014-05-19 10:47:44.830448 7f5564dff700 -1 os/KeyValueStore.cc: In function 'unsigned int KeyValueStore::_do_transaction(ObjectStore::Transaction&, KeyValueStore::BufferTransaction&, SequencerPosition&, ThreadPool::TPHandle*)' thread 7f5564dff700 time 2014-05-19 10:47:44.829023
os/KeyValueStore.cc: 1457: FAILED assert(0)
ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
1: (KeyValueStore::_do_transaction(ObjectStore::Transaction&, KeyValueStore::BufferTransaction&, SequencerPosition&, ThreadPool::TPHandle*)+0x1d0) [0x9d8720]
2: (KeyValueStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x8e) [0x9daa0e]
3: (KeyValueStore::_do_op(KeyValueStore::OpSequencer*, ThreadPool::TPHandle&)+0x97) [0x9dab17]
4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb5101a]
5: (ThreadPool::WorkThread::entry()+0x10) [0xb52270]
6: (()+0x7e9a) [0x7f556d321e9a]
7: (clone()+0x6d) [0x7f556b8ccccd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Haomai Wang almost 10 years ago
- Assignee set to Haomai Wang
I can't reproduce the crash. Running "dd" or "fio" in the VM seemed fine.
The bad op "2307" is confusing; as far as I know, no op should have a value that large.
Updated by Xinxin Shu almost 10 years ago
Hi Haomai, this issue occurs every time I use 'dd'. To help you find the root cause, what kind of info should I provide? Btw, what is the ceph configuration on your test setup?
Haomai Wang wrote:
I can't reproduce the crash. Running "dd" or "fio" in the VM seemed fine.
The bad op "2307" is confusing; as far as I know, no op should have a value that large.
Updated by Haomai Wang almost 10 years ago
You can add "debug_keyvaluestore = 20/20" to ceph.conf.
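For reference, a minimal ceph.conf fragment with the option enabled (placing it under the [osd] section is my assumption based on typical ceph.conf layout; an osd-side debug option belongs with the OSD settings):

```ini
[osd]
# first value: in-memory log level, second: log-file level
debug keyvaluestore = 20/20
```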
Updated by Xinxin Shu almost 10 years ago
Attached a log with 'debug keyvaluestore = 20/20' enabled.
Updated by Haomai Wang almost 10 years ago
Thanks, Xinxin!
The bug results from the set_alloc_hint op:
case Transaction::OP_SETALLOCHINT:
// TODO: can kvstore make use of the hint?
break;
We skip the op without decoding its arguments, so the decoder is left misaligned and the next op is read incorrectly.
Updated by Haomai Wang almost 10 years ago
- Category set to OSD
- Status changed from New to Fix Under Review
Updated by Xinxin Shu almost 10 years ago
Hi Haomai, after applying your patch, when I run 'virsh attach-device' to attach the rbd image, the VM is killed. However, when I change the osd backend to filestore, everything seems OK. I checked dmesg but got nothing useful, and I don't know how to diagnose this issue; can you give some hints? Thanks. Btw, my ceph version is 0.80.1-1-g410c990.
Updated by Haomai Wang almost 10 years ago
You can add these lines to /etc/ceph/ceph.conf on the host that runs qemu:
[client]
log_file = /var/log/ceph/ceph-rbd.log
admin_socket = /var/run/ceph/ceph-rbd.asok
Then you should see client-side info in /var/log/ceph/ceph-rbd.log.
Updated by Haomai Wang almost 10 years ago
Hi Xinxin,
I booted a VM with the keyvaluestore backend and attached a new disk; nothing bad happened.
Updated by Xinxin Shu almost 10 years ago
- File ceph-rbd.log added
Hi Haomai, I can reproduce this error from time to time; the attached file is a detailed client log. Btw, can you list your OS, libvirt/qemu, and ceph configuration?
Updated by Haomai Wang almost 10 years ago
It seems the log contains no useful info.
My ceph cluster is compiled from the master branch.
(qemu-kvm-1.2.0)
I don't use libvirt, just qemu directly. I don't think libvirt is the cause.
Updated by Shaun McDowell almost 10 years ago
We saw this issue as well. Is there a way we can pull this fix with ceph-deploy install --dev <Branch or Commit> to get the fix and see if the problem is resolved? Currently, ceph-deploy doesn't seem to pull anything newer than the ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74). We also would like to avoid cloning and building on all of our nodes.
Updated by Haomai Wang almost 10 years ago
Hi Shaun,
The fix is https://github.com/ceph/ceph/pull/1840. I'm not familiar with ceph-deploy; I hope the link helps, and let me know if anything is confusing.
Updated by Xinxin Shu almost 10 years ago
I get the following error in dmesg; how can I diagnose it?
[1117577.605328] virbr0: port 1(vnet0) entered forwarding state
[1117592.937103] kvm209389: segfault at 7f6c00000018 ip 00007f6d15f63009 sp 00007f6c7aefb7f0 error 6 in librados.so.2.0.0[7f6d15c16000+66b000]
[1117592.972125] virbr0: port 1(vnet0) entered disabled state
[1117592.972479] virbr0: port 1(vnet0) entered disabled state
[1117592.972601] device vnet0 left promiscuous mode
[1117592.972603] virbr0: port 1(vnet0) entered disabled state
Updated by Sage Weil almost 10 years ago
- Status changed from Fix Under Review to Resolved
- Source changed from other to Community (dev)
Updated by Xinxin Shu almost 10 years ago
Haomai, I created a new bug report, #8529, for the librados segfault; it seems to be a problem in the buffer class.