Bug #8381

osd crashes when osd uses leveldb as filestore

Added by Xinxin Shu almost 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When the OSD uses leveldb as the filestore backend (with the config option osd_objectstore = keyvaluestore-dev added to ceph.conf), I tested with QEMU RBD. After I attached an RBD image to a VM, everything seemed OK, and I could see the disk label with 'fdisk -l'. However, when I ran 'dd if=/dev/zero of=/dev/vdb bs=1M &', the Ceph cluster crashed and I saw a "bad op" error. The attached file is the detailed log.

-8> 2014-05-19 10:47:44.828717 7f55579aa700 10 osd.0 pg_epoch: 245 pg[3.3f( v 245'1 (0'0,245'1] local-les=222 n=1 ec=44 les/c 222/222 221/221/173) [0,44] r=0 lpr=221 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] append_log  adding 1 keys
-7> 2014-05-19 10:47:44.828753 7f55579aa700 10 write_log with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: 0, writeout_from: 245'1, trimmed:
-6> 2014-05-19 10:47:44.828821 7f5564dff700 12 KeyValueStore::op_tp worker wq KeyValueStore::OpWQ start processing 0x8b7dc20 (1 active)
-5> 2014-05-19 10:47:44.828818 7f55579aa700 10 osd.0 pg_epoch: 245 pg[3.3f( v 245'1 (0'0,245'1] local-les=222 n=1 ec=44 les/c 222/222 221/221/173) [0,44] r=0 lpr=221 luod=0'0 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] eval_repop repgather(0x76a0840 245'1 rep_tid=1 committed?=0 applied?=0 lock=0 op=osd_op(client.4358.0:104 rbd_data.10426b8b4567.0000000000000004 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] 3.bf28a03f ack+ondisk+write e245) v4) wants=ad
-4> 2014-05-19 10:47:44.828839 7f55579aa700 10 osd.0 245 dequeue_op 0xa57c780 finish
-3> 2014-05-19 10:47:44.828843 7f55579aa700 15 OSD::op_tp worker wq OSD::OpWQ done processing 0x1 (0 active)
-2> 2014-05-19 10:47:44.828846 7f55579aa700 20 OSD::op_tp worker waiting
-1> 2014-05-19 10:47:44.829013 7f5564dff700 -1 bad op 2307
0> 2014-05-19 10:47:44.830448 7f5564dff700 -1 os/KeyValueStore.cc: In function 'unsigned int KeyValueStore::_do_transaction(ObjectStore::Transaction&, KeyValueStore::BufferTransaction&, SequencerPosition&, ThreadPool::TPHandle*)' thread 7f5564dff700 time 2014-05-19 10:47:44.829023
os/KeyValueStore.cc: 1457: FAILED assert(0)
ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
1: (KeyValueStore::_do_transaction(ObjectStore::Transaction&, KeyValueStore::BufferTransaction&, SequencerPosition&, ThreadPool::TPHandle*)+0x1d0) [0x9d8720]
2: (KeyValueStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x8e) [0x9daa0e]
3: (KeyValueStore::_do_op(KeyValueStore::OpSequencer*, ThreadPool::TPHandle&)+0x97) [0x9dab17]
4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb5101a]
5: (ThreadPool::WorkThread::entry()+0x10) [0xb52270]
6: (()+0x7e9a) [0x7f556d321e9a]
7: (clone()+0x6d) [0x7f556b8ccccd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

osd.0.log View (1.69 MB) Xinxin Shu, 05/18/2014 08:05 PM

osd.0.log View (2.11 MB) Xinxin Shu, 05/19/2014 07:20 PM

ceph-rbd.log View (21.4 KB) Xinxin Shu, 05/31/2014 09:54 PM

Associated revisions

Revision c08adbc9 (diff)
Added by Haomai Wang almost 10 years ago

Fix set_alloc_hint op causing KeyValueStore crash

KeyValueStore doesn't support the set_alloc_hint op yet, but the implementation of
_do_transaction still needs to decode the op's arguments. Otherwise, the
arguments will be interpreted as the next op.

Fix the same problem for MemStore.

Fix #8381

Reported-by: Xinxin Shu <>
Signed-off-by: Haomai Wang <>

Revision fdbab468 (diff)
Added by Haomai Wang over 9 years ago

Fix set_alloc_hint op causing KeyValueStore crash

KeyValueStore doesn't support the set_alloc_hint op yet, but the implementation of
_do_transaction still needs to decode the op's arguments. Otherwise, the
arguments will be interpreted as the next op.

Fix the same problem for MemStore.

Fix #8381

Reported-by: Xinxin Shu <>
Signed-off-by: Haomai Wang <>
(cherry picked from commit c08adbc98ff5f380ecd215f8bd9cf3cab214913c)

History

#1 Updated by Haomai Wang almost 10 years ago

  • Assignee set to Haomai Wang

I can't reproduce the crash. Running "dd" or "fio" in the VM seems fine.

The bad op "2307" is confusing. As far as I know, no op code should be that large.

#2 Updated by Xinxin Shu almost 10 years ago

Hi Haomai, this issue occurs as soon as I use 'dd'. To help you find the root cause, what kind of info should I provide? By the way, what is the Ceph configuration on your test setup?
Haomai Wang wrote:

I can't reproduce the crash. Running "dd" or "fio" in the VM seems fine.

The bad op "2307" is confusing. As far as I know, no op code should be that large.

#3 Updated by Haomai Wang almost 10 years ago

You can add "debug_keyvaluestore = 20/20" to ceph.conf.

#4 Updated by Xinxin Shu almost 10 years ago

log with 'debug keyvaluestore = 20/20'

#5 Updated by Haomai Wang almost 10 years ago

Thanks, Xinxin!

The bug is caused by the set_alloc_hint op.

case Transaction::OP_SETALLOCHINT:
// TODO: can kvstore make use of the hint?
break;

We just skip the op without decoding its arguments, so the next op is read from the wrong position in the transaction and gets an incorrect result.
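To illustrate the failure mode, here is a minimal sketch of a toy transaction stream. It is not the actual KeyValueStore code: the opcode values, the flat integer encoding, and the names first_bad_op and sample_stream are all assumptions made up for this example. It only shows that leaving an op's arguments undecoded makes the walker read the first argument as the next opcode.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical opcodes for a toy stream (not Ceph's real values).
enum : std::uint32_t { OP_WRITE = 1, OP_SET_ALLOC_HINT = 2 };

// Walks the stream and returns the first unrecognized opcode, or -1 if every
// op was recognized. When decode_hint_args is false, the hint case mimics the
// bug: it breaks out of the switch without consuming its two arguments.
long long first_bad_op(const std::vector<std::uint32_t>& stream,
                       bool decode_hint_args) {
    std::size_t pos = 0;
    while (pos < stream.size()) {
        std::uint32_t op = stream[pos++];
        switch (op) {
        case OP_WRITE:
            pos += 2;  // consume offset and length
            break;
        case OP_SET_ALLOC_HINT:
            if (decode_hint_args) {
                pos += 2;  // consume and ignore object size and write size
            }
            break;
        default:
            return static_cast<long long>(op);  // the "bad op" case
        }
    }
    return -1;
}

// One hint op followed by one write op, loosely modeled on the failing
// request in the log (set-alloc-hint, then write 0~4194304).
std::vector<std::uint32_t> sample_stream() {
    return {OP_SET_ALLOC_HINT, 4194304, 4194304, OP_WRITE, 0, 4194304};
}
```

Calling first_bad_op(sample_stream(), false) reports the hint's first argument as a bad op, while first_bad_op(sample_stream(), true) walks the whole stream and returns -1. The point is that an unsupported op still has to consume its arguments, even if it then ignores them.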

#6 Updated by Haomai Wang almost 10 years ago

  • Category set to OSD
  • Status changed from New to Fix Under Review

#7 Updated by Xinxin Shu almost 10 years ago

Hi Haomai, after applying your patch, the VM is killed when I run 'virsh attach-device' to attach the RBD image. However, when I change the OSD backend to filestore, everything seems OK. I checked dmesg and found nothing useful. I don't know how to diagnose this issue; can you give me some hints? Thanks. By the way, my Ceph version is ceph version 0.80.1-1-g410c990.

#8 Updated by Haomai Wang almost 10 years ago

You can add these lines to /etc/ceph/ceph.conf on the host that runs qemu:
[client]
log_file = /var/log/ceph/ceph-rbd.log
admin_socket = /var/run/ceph/ceph-rbd.asok

Then you should be able to get info from /var/log/ceph/ceph-rbd.log.

#9 Updated by Haomai Wang almost 10 years ago

Hi, Xinxin.

I booted the VM with the keyvaluestore backend and attached a new disk. Nothing bad happened.

#10 Updated by Xinxin Shu almost 10 years ago

Hi Haomai, I can reproduce this error from time to time. The attached file is the detailed client log. By the way, can you list your OS, libvirt/qemu, and Ceph configuration?

#11 Updated by Haomai Wang almost 10 years ago

It seems that no useful info can be obtained from the log.

My Ceph cluster is compiled from the master branch.
(qemu-kvm-1.2.0)
I don't use libvirt, just qemu directly. I don't think libvirt is the culprit.

#12 Updated by Shaun McDowell almost 10 years ago

We saw this issue as well. Is there a way we can pull this fix with ceph-deploy install --dev <Branch or Commit> to get the fix and see if the problem is resolved? Currently, ceph-deploy doesn't seem to pull anything newer than the ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74). We also would like to avoid cloning and building on all of our nodes.

#13 Updated by Haomai Wang almost 10 years ago

Hi Shaun,
The fix commit is https://github.com/ceph/ceph/pull/1840. I'm not familiar with ceph-deploy. I hope the link helps; if anything is confusing, let me know.

#14 Updated by Xinxin Shu almost 10 years ago

I get the following error from dmesg. How can I diagnose it?

[1117577.605328] virbr0: port 1(vnet0) entered forwarding state
[1117592.937103] kvm209389: segfault at 7f6c00000018 ip 00007f6d15f63009 sp 00007f6c7aefb7f0 error 6 in librados.so.2.0.0[7f6d15c16000+66b000]
[1117592.972125] virbr0: port 1(vnet0) entered disabled state
[1117592.972479] virbr0: port 1(vnet0) entered disabled state
[1117592.972601] device vnet0 left promiscuous mode
[1117592.972603] virbr0: port 1(vnet0) entered disabled state

#15 Updated by Haomai Wang almost 10 years ago

Thanks, Xinxin. I will work on it.

#16 Updated by Sage Weil almost 10 years ago

  • Status changed from Fix Under Review to Resolved
  • Source changed from other to Community (dev)

#17 Updated by Xinxin Shu almost 10 years ago

Haomai, I created a new bug report, #8529, for the segmentation fault in librados. It seems to be a problem in the buffer class.
