Bug #2563

leveldb corruption

Added by Samuel Just over 11 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This was also mentioned once on the mailing list.

ceph version 0.47.2 (8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
1: /usr/bin/ceph-osd() [0x6eb32a]
2: (()+0xfcb0) [0x7f160bfa0cb0]
3: (gsignal()+0x35) [0x7f160a491445]
4: (abort()+0x17b) [0x7f160a494bab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f160addf69d]
6: (()+0xb5846) [0x7f160addd846]
7: (()+0xb5873) [0x7f160addd873]
8: (()+0xb596e) [0x7f160addd96e]
9: (std::__throw_length_error(char const*)+0x57) [0x7f160ad8a907]
10: (()+0x9eaa2) [0x7f160adc6aa2]
11: (char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag)+0x35) [0x7f160adc8495]
12: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, unsigned long, std::allocator<char> const&)+0x1d) [0x7f160adc861d]
13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, leveldb::Slice const&) const+0x47) [0x6d1ce7]
14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0x92) [0x6e0712]
15: (leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x482) [0x6cc552]
16: (leveldb::DBImpl::BackgroundCompaction()+0x2b0) [0x6ccd50]
17: (leveldb::DBImpl::BackgroundCall()+0x68) [0x6cd7f8]
18: /usr/bin/ceph-osd() [0x6e679f]
19: (()+0x7e9a) [0x7f160bf98e9a]
20: (clone()+0x6d) [0x7f160a54d4bd]
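
For context (an interpretation, not part of the original report): frames 9-13 suggest std::string was asked to copy an impossibly large number of bytes, which is what happens if a corrupted internal key is shorter than the 8-byte suffix leveldb strips off, so the remaining length underflows. A minimal standalone sketch of that failure mode, independent of ceph and leveldb:

#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <string>

int main() {
  const char data[] = "k";                 // stand-in for a truncated key (< 8 bytes)
  std::size_t key_len = sizeof(data) - 1;  // 1
  std::size_t user_len = key_len - 8;      // unsigned underflow -> enormous length

  try {
    std::string user_key(data, user_len);  // libstdc++ throws std::length_error here
  } catch (const std::length_error& e) {
    std::cerr << "length_error: " << e.what() << std::endl;
  }
  return 0;
}

In the OSD the exception is not caught, so the default terminate handler aborts the process, which matches frames 1-8 above.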

omap.tgz - Omap archive (5.12 MB) Samuel Just, 06/12/2012 02:55 PM

omap-20120917.tgz - OMAP Tarball (9.2 MB) Matt Garner, 09/17/2012 02:04 PM

History

#1 Updated by Samuel Just over 11 years ago

It's triggerable without ceph; I've filed a bug with leveldb (linked below) and I'm continuing to look into it.

http://code.google.com/p/leveldb/issues/detail?id=97

#2 Updated by Samuel Just about 11 years ago

  • Status changed from New to Can't reproduce

It looks like one of the leveldb store files was corrupted, possibly by the filesystem. It may be possible to recover using the instructions in the leveldb tracker link above.
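
As an illustration only (not the procedure from the linked issue): leveldb also exposes a RepairDB() entry point that tries to salvage as much data as possible from a damaged store. A rough sketch of pointing it at an OSD's omap directory, after stopping the OSD and taking a backup copy of that directory; the path below is only a placeholder:

#include <iostream>
#include <string>
#include <leveldb/db.h>

int main() {
  leveldb::Options options;
  options.paranoid_checks = true;  // surface any remaining corruption loudly

  // Placeholder path; substitute the affected OSD's omap directory.
  const std::string omap_path = "/data/osd/current/omap";

  leveldb::Status status = leveldb::RepairDB(omap_path, options);
  std::cout << (status.ok() ? "repair finished" : status.ToString()) << std::endl;
  return status.ok() ? 0 : 1;
}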

#3 Updated by Matt Garner about 11 years ago

Experiencing the same issue on a production ceph cluster.

ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
1: /usr/bin/ceph-osd() [0x6edaba]
2: (()+0xfcb0) [0x7f5a09b47cb0]
3: (gsignal()+0x35) [0x7f5a08723445]
4: (abort()+0x17b) [0x7f5a08726bab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f5a0907169d]
6: (()+0xb5846) [0x7f5a0906f846]
7: (()+0xb5873) [0x7f5a0906f873]
8: (()+0xb596e) [0x7f5a0906f96e]
9: (std::__throw_length_error(char const*)+0x57) [0x7f5a0901c907]
10: (()+0x9eaa2) [0x7f5a09058aa2]
11: (char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag)+0x35) [0x7f5a0905a495]
12: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, unsigned long, std::allocator<char> const&)+0x1d) [0x7f5a0905a61d]
13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, leveldb::Slice const&) const+0x47) [0x6d43d7]
14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0x92) [0x6e2e02]
15: (leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x482) [0x6cec42]
16: (leveldb::DBImpl::BackgroundCompaction()+0x2b0) [0x6cf440]
17: (leveldb::DBImpl::BackgroundCall()+0x68) [0x6cfee8]
18: /usr/bin/ceph-osd() [0x6e8e8f]
19: (()+0x7e9a) [0x7f5a09b3fe9a]
20: (clone()+0x6d) [0x7f5a087df4bd]

osd.7 is one of eight identical PowerEdge 850 units, each running btrfs on an mdadm raid0 across 2x 2TB or 3TB drives.
All machines are running Ubuntu 12.04 and 0.48.1argonaut from deb packages.

This osd had just been added to the existing cluster and was in the process of its initial population of pgs from the other osds in the cluster.

The only unusual thing about this osd was that I had enabled btrfs compression=zlib on the partition housing the osd data.

I ran btrfsck on the volume containing the omap and found no errors.

df -h:
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/md0         19G  3.0G    14G   18%  /
udev            2.0G  4.0K   2.0G    1%  /dev
tmpfs           791M  268K   791M    1%  /run
none            5.0M     0   5.0M    0%  /run/lock
none            2.0G     0   2.0G    0%  /run/shm
/dev/md0         19G  3.0G    14G   18%  /home
/dev/sdc1        93M   31M    57M   36%  /boot
/dev/md1        5.5T  655G   4.8T   12%  /data

ceph.conf:
[osd]
osd data = /data/ceph/osd/ceph-7
keyring = /data/ceph/osd/ceph-7/keyring
osd journal = /data/ceph/osd/ceph-7/journal
osd journal size = 2000
filestore xattr use omap = true
debug optracker = 20
debug journal = 20

Ceph log dump is here:
http://www.mattgarner.com/ceph/ceph-osd.7-20120917.tgz

#4 Updated by Greg Farnum almost 11 years ago

  • Status changed from Can't reproduce to 12

Just got another report of this on the list.
This user has enabled btrfs' lzo compression, and I believe btrfs compression has been a common thread across everybody who's reported this problem.

#5 Updated by Samuel Just over 10 years ago

  • Status changed from 12 to Resolved
