Bug #8381
OSD crash when the OSD uses LevelDB as its object store
Description
When the OSD uses LevelDB as its object store (by adding the config option osd_objectstore = keyvaluestore-dev to ceph.conf), I used qemu with RBD. After attaching an RBD image to a VM, everything seemed fine: I could read the disk label with 'fdisk -l'. However, when I ran 'dd if=/dev/zero of=/dev/vdb bs=1M &', the Ceph cluster crashed with a "bad op" error. The attached file is the detailed log.
-8> 2014-05-19 10:47:44.828717 7f55579aa700 10 osd.0 pg_epoch: 245 pg[3.3f( v 245'1 (0'0,245'1] local-les=222 n=1 ec=44 les/c 222/222 221/221/173) [0,44] r=0 lpr=221 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] append_log adding 1 keys
-7> 2014-05-19 10:47:44.828753 7f55579aa700 10 write_log with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: 0, writeout_from: 245'1, trimmed:
-6> 2014-05-19 10:47:44.828821 7f5564dff700 12 KeyValueStore::op_tp worker wq KeyValueStore::OpWQ start processing 0x8b7dc20 (1 active)
-5> 2014-05-19 10:47:44.828818 7f55579aa700 10 osd.0 pg_epoch: 245 pg[3.3f( v 245'1 (0'0,245'1] local-les=222 n=1 ec=44 les/c 222/222 221/221/173) [0,44] r=0 lpr=221 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] eval_repop repgather(0x76a0840 245'1 rep_tid=1 committed?=0 applied?=0 lock=0 op=osd_op(client.4358.0:104 rbd_data.10426b8b4567.0000000000000004 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] 3.bf28a03f ack+ondisk+write e245) v4) wants=ad
-4> 2014-05-19 10:47:44.828839 7f55579aa700 10 osd.0 245 dequeue_op 0xa57c780 finish
-3> 2014-05-19 10:47:44.828843 7f55579aa700 15 OSD::op_tp worker wq OSD::OpWQ done processing 0x1 (0 active)
-2> 2014-05-19 10:47:44.828846 7f55579aa700 20 OSD::op_tp worker waiting
-1> 2014-05-19 10:47:44.829013 7f5564dff700 -1 bad op 2307
0> 2014-05-19 10:47:44.830448 7f5564dff700 -1 os/KeyValueStore.cc: In function 'unsigned int KeyValueStore::_do_transaction(ObjectStore::Transaction&, KeyValueStore::BufferTransaction&, SequencerPosition&, ThreadPool::TPHandle*)' thread 7f5564dff700 time 2014-05-19 10:47:44.829023
os/KeyValueStore.cc: 1457: FAILED assert(0)
ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
1: (KeyValueStore::_do_transaction(ObjectStore::Transaction&, KeyValueStore::BufferTransaction&, SequencerPosition&, ThreadPool::TPHandle*)+0x1d0) [0x9d8720]
2: (KeyValueStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x8e) [0x9daa0e]
3: (KeyValueStore::_do_op(KeyValueStore::OpSequencer*, ThreadPool::TPHandle&)+0x97) [0x9dab17]
4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb5101a]
5: (ThreadPool::WorkThread::entry()+0x10) [0xb52270]
6: (()+0x7e9a) [0x7f556d321e9a]
7: (clone()+0x6d) [0x7f556b8ccccd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Associated revisions
Fix set_alloc_hint op causing KeyValueStore crash
KeyValueStore doesn't support the set_alloc_hint op yet, but the
implementation of _do_transaction still needs to decode its arguments.
Otherwise, the arguments are parsed as the next op.
Fix the same problem for MemStore.
Fix #8381
Reported-by: Xinxin Shu <xinxin.shu5040@gmail.com>
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
(cherry picked from commit c08adbc98ff5f380ecd215f8bd9cf3cab214913c)
History
#1 Updated by Haomai Wang almost 10 years ago
- Assignee set to Haomai Wang
I can't reproduce the crash; running "dd" or "fio" in the VM seemed fine.
The bad op "2307" is confusing. As far as I know, no op code should be that large.
#2 Updated by Xinxin Shu almost 10 years ago
hi haomai, this issue occurs every time I use 'dd'. To help you find the root cause, what kind of info should I provide? Also, what is the Ceph configuration on your test setup?
Haomai Wang wrote:
I can't reproduce the crash; running "dd" or "fio" in the VM seemed fine.
The bad op "2307" is confusing. As far as I know, no op code should be that large.
#3 Updated by Haomai Wang almost 10 years ago
You can add "debug_keyvaluestore = 20/20" to ceph.conf.
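For reference, a minimal sketch of where the option goes (putting it under the [osd] section on the OSD nodes is an assumption here; [global] would also work):

```ini
[osd]
debug_keyvaluestore = 20/20
```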
#4 Updated by Xinxin Shu almost 10 years ago
log with 'debug keyvaluestore = 20/20'
#5 Updated by Haomai Wang almost 10 years ago
Thanks to xinxin!
The bug is caused by the set_alloc_hint op.
case Transaction::OP_SETALLOCHINT:
// TODO: can kvstore make use of the hint?
break;
We skip the op but never decode its arguments, so the next read from the transaction stream picks up the hint's payload and misinterprets it as the next op.
#6 Updated by Haomai Wang almost 10 years ago
- Category set to OSD
- Status changed from New to Fix Under Review
#7 Updated by Xinxin Shu almost 10 years ago
hi haomai, after applying your patch, when I run 'virsh attach-device' to attach the RBD device, the VM is killed. However, when I change the OSD backend to filestore, everything seems fine. I checked dmesg but got nothing useful. I don't know how to debug this issue further; can you give some hints? Thanks. BTW, my ceph version is 0.80.1-1-g410c990.
#8 Updated by Haomai Wang almost 10 years ago
You can add these lines to /etc/ceph/ceph.conf on the host that runs qemu:
[client]
log_file = /var/log/ceph/ceph-rbd.log
admin_socket = /var/run/ceph/ceph-rbd.asok
Then you should get info from /var/log/ceph/ceph-rbd.log.
#9 Updated by Haomai Wang almost 10 years ago
Hi xinxin,
I booted the VM with the keyvaluestore backend and attached a new disk. Nothing unusual happened.
#10 Updated by Xinxin Shu almost 10 years ago
- File ceph-rbd.log View added
hi haomai, I can reproduce this error from time to time; the attached file is the detailed client log. BTW, can you list your OS, libvirt/qemu versions, and Ceph configuration?
#11 Updated by Haomai Wang almost 10 years ago
It seems there is no useful info to be gotten from the log.
My ceph cluster is compiled from the master branch.
(qemu-kvm-1.2.0)
I don't use libvirt, just qemu directly. I don't think libvirt is the cause.
#12 Updated by Shaun McDowell almost 10 years ago
We saw this issue as well. Is there a way we can pull this fix with ceph-deploy install --dev <Branch or Commit> to get the fix and see if the problem is resolved? Currently, ceph-deploy doesn't seem to pull anything newer than the ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74). We also would like to avoid cloning and building on all of our nodes.
#13 Updated by Haomai Wang almost 10 years ago
Hi Shaun,
The fix is https://github.com/ceph/ceph/pull/1840. I'm not familiar with ceph-deploy; I hope the link helps, and if anything is confusing, let me know.
#14 Updated by Xinxin Shu almost 10 years ago
I get the following error from dmesg; how can I debug it?
[1117577.605328] virbr0: port 1(vnet0) entered forwarding state
[1117592.937103] kvm209389: segfault at 7f6c00000018 ip 00007f6d15f63009 sp 00007f6c7aefb7f0 error 6 in librados.so.2.0.0[7f6d15c16000+66b000]
[1117592.972125] virbr0: port 1(vnet0) entered disabled state
[1117592.972479] virbr0: port 1(vnet0) entered disabled state
[1117592.972601] device vnet0 left promiscuous mode
[1117592.972603] virbr0: port 1(vnet0) entered disabled state
#15 Updated by Haomai Wang almost 10 years ago
Thanks xinxin, I will work on it.
#16 Updated by Sage Weil almost 10 years ago
- Status changed from Fix Under Review to Resolved
- Source changed from other to Community (dev)
#17 Updated by Xinxin Shu almost 10 years ago
haomai, I created a new bug report, #8529, for the librados segfault; it seems to be a problem in the buffer class.