Bug #2563

leveldb corruption

Added by Samuel Just over 11 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This was also mentioned once on the mailing list.

ceph version 0.47.2 (8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
1: /usr/bin/ceph-osd() [0x6eb32a]
2: (()+0xfcb0) [0x7f160bfa0cb0]
3: (gsignal()+0x35) [0x7f160a491445]
4: (abort()+0x17b) [0x7f160a494bab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f160addf69d]
6: (()+0xb5846) [0x7f160addd846]
7: (()+0xb5873) [0x7f160addd873]
8: (()+0xb596e) [0x7f160addd96e]
9: (std::__throw_length_error(char const*)+0x57) [0x7f160ad8a907]
10: (()+0x9eaa2) [0x7f160adc6aa2]
11: (char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag)+0x35) [0x7f160adc8495]
12: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, unsigned long, std::allocator<char> const&)+0x1d) [0x7f160adc861d]
13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, leveldb::Slice const&) const+0x47) [0x6d1ce7]
14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0x92) [0x6e0712]
15: (leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x482) [0x6cc552]
16: (leveldb::DBImpl::BackgroundCompaction()+0x2b0) [0x6ccd50]
17: (leveldb::DBImpl::BackgroundCall()+0x68) [0x6cd7f8]
18: /usr/bin/ceph-osd() [0x6e679f]
19: (()+0x7e9a) [0x7f160bf98e9a]
20: (clone()+0x6d) [0x7f160a54d4bd]
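
For context (an interpretation, not part of the original report): frames 9-13 suggest std::string was asked to copy an impossibly large number of bytes, which is what happens if a corrupted internal key is shorter than the 8-byte suffix leveldb strips off, so the remaining length underflows. A minimal standalone sketch of that failure mode, independent of ceph and leveldb:

#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <string>

int main() {
  const char data[] = "k";                 // stand-in for a truncated key (< 8 bytes)
  std::size_t key_len = sizeof(data) - 1;  // 1
  std::size_t user_len = key_len - 8;      // unsigned underflow -> enormous length

  try {
    std::string user_key(data, user_len);  // libstdc++ throws std::length_error here
  } catch (const std::length_error& e) {
    std::cerr << "length_error: " << e.what() << std::endl;
  }
  return 0;
}

In the OSD the exception is not caught, so the default terminate handler aborts the process, which matches frames 1-8 above.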

omap.tgz - Omap archive (5.12 MB) Samuel Just, 06/12/2012 02:55 PM

omap-20120917.tgz - OMAP Tarball (9.2 MB) Matt Garner, 09/17/2012 02:04 PM

History

#1 Updated by Samuel Just over 11 years ago

It's triggerable without ceph; I've filed a bug with leveldb (linked below) and I'm continuing to look into it.

http://code.google.com/p/leveldb/issues/detail?id=97

#2 Updated by Samuel Just about 11 years ago

  • Status changed from New to Can't reproduce

It looks like one of the leveldb store files was corrupted, possibly by the filesystem. It may be possible to recover using the instructions in the leveldb tracker link above.
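
As an illustration only (not the procedure from the linked issue): leveldb also exposes a RepairDB() entry point that tries to salvage as much data as possible from a damaged store. A rough sketch of pointing it at an OSD's omap directory, after stopping the OSD and taking a backup copy of that directory; the path below is only a placeholder:

#include <iostream>
#include <string>
#include <leveldb/db.h>

int main() {
  leveldb::Options options;
  options.paranoid_checks = true;  // surface any remaining corruption loudly

  // Placeholder path; substitute the affected OSD's omap directory.
  const std::string omap_path = "/data/osd/current/omap";

  leveldb::Status status = leveldb::RepairDB(omap_path, options);
  std::cout << (status.ok() ? "repair finished" : status.ToString()) << std::endl;
  return status.ok() ? 0 : 1;
}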

#3 Updated by Matt Garner about 11 years ago

Experiencing the same issue on a production ceph cluster.

ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
1: /usr/bin/ceph-osd() [0x6edaba]
2: (()+0xfcb0) [0x7f5a09b47cb0]
3: (gsignal()+0x35) [0x7f5a08723445]
4: (abort()+0x17b) [0x7f5a08726bab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f5a0907169d]
6: (()+0xb5846) [0x7f5a0906f846]
7: (()+0xb5873) [0x7f5a0906f873]
8: (()+0xb596e) [0x7f5a0906f96e]
9: (std::__throw_length_error(char const*)+0x57) [0x7f5a0901c907]
10: (()+0x9eaa2) [0x7f5a09058aa2]
11: (char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag)+0x35) [0x7f5a0905a495]
12: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, unsigned long, std::allocator<char> const&)+0x1d) [0x7f5a0905a61d]
13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, leveldb::Slice const&) const+0x47) [0x6d43d7]
14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0x92) [0x6e2e02]
15: (leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x482) [0x6cec42]
16: (leveldb::DBImpl::BackgroundCompaction()+0x2b0) [0x6cf440]
17: (leveldb::DBImpl::BackgroundCall()+0x68) [0x6cfee8]
18: /usr/bin/ceph-osd() [0x6e8e8f]
19: (()+0x7e9a) [0x7f5a09b3fe9a]
20: (clone()+0x6d) [0x7f5a087df4bd]

osd.7 is one of eight identical PowerEdge 850 units, each running btrfs on an mdadm raid0 across 2x 2TB or 3TB drives.
All machines are running Ubuntu 12.04 and 0.48.1argonaut from deb packages.

This osd had just been added to the existing cluster and was in the process of its initial population of pgs from the other osds in the cluster.

The only unusual thing about this osd was that I had enabled btrfs compression=zlib on the partition housing the osd data.

I ran btrfsck on the volume containing the omap and found no errors.

df -h:
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/md0         19G  3.0G    14G   18%  /
udev            2.0G  4.0K   2.0G    1%  /dev
tmpfs           791M  268K   791M    1%  /run
none            5.0M     0   5.0M    0%  /run/lock
none            2.0G     0   2.0G    0%  /run/shm
/dev/md0         19G  3.0G    14G   18%  /home
/dev/sdc1        93M   31M    57M   36%  /boot
/dev/md1        5.5T  655G   4.8T   12%  /data

ceph.conf:
[osd]
osd data = /data/ceph/osd/ceph-7
keyring = /data/ceph/osd/ceph-7/keyring
osd journal = /data/ceph/osd/ceph-7/journal
osd journal size = 2000
filestore xattr use omap = true
debug optracker = 20
debug journal = 20

Ceph log dump is here:
http://www.mattgarner.com/ceph/ceph-osd.7-20120917.tgz

#4 Updated by Greg Farnum almost 11 years ago

  • Status changed from Can't reproduce to 12

Just got another report of this on the list.
This user has enabled btrfs' lzo compression, and I believe btrfs compression has been a common thread across everybody who's reported this problem.

#5 Updated by Samuel Just over 10 years ago

  • Status changed from 12 to Resolved
