Bug #3994

ceph-osd crash under little to no load

Added by Matthew Via about 11 years ago. Updated about 11 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

One of my osds crashed a number of times in a row, and the crash was repeatable enough that I had time to set the debugging levels and get a backtrace within minutes. However, now I'm having trouble reproducing the problem.

https://pastee.org/eqgg6

History

#1 Updated by Matthew Via about 11 years ago

It died again; here is the log output:
https://pastee.org/fbgch

#2 Updated by Matthew Via about 11 years ago

Also potentially of interest: the kernel log shows some btrfs checksum failures:
btrfs csum failed ino 583798 extent 351186296832 csum 1376110139 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186280448 csum 2181979701 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186284544 csum 933723429 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186288640 csum 3626351784 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186292736 csum 3726591744 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186280448 csum 2181979701 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186280448 csum 2181979701 wanted 1645309641 mirror 0
device fsid 40fe0486-ebd3-4109-8558-a6b292fc63c9 devid 1 transid 187133 /dev/sdb3
device fsid 5016c0e4-a14a-4fb5-9229-b92829a580df devid 1 transid 176103 /dev/sda3
btrfs csum failed ino 583798 extent 351186296832 csum 1376110139 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186280448 csum 2181979701 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186284544 csum 933723429 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186288640 csum 3626351784 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186292736 csum 3726591744 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186280448 csum 2181979701 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186280448 csum 2181979701 wanted 1645309641 mirror 0
device fsid 40fe0486-ebd3-4109-8558-a6b292fc63c9 devid 1 transid 187558 /dev/sdb3
device fsid 5016c0e4-a14a-4fb5-9229-b92829a580df devid 1 transid 176148 /dev/sda3
btrfs csum failed ino 583798 extent 351186280448 csum 2181979701 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186284544 csum 933723429 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186288640 csum 3626351784 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186292736 csum 3726591744 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186296832 csum 1376110139 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186280448 csum 2181979701 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186280448 csum 2181979701 wanted 1645309641 mirror 0
device fsid 40fe0486-ebd3-4109-8558-a6b292fc63c9 devid 1 transid 190626 /dev/sdb3
device fsid 5016c0e4-a14a-4fb5-9229-b92829a580df devid 1 transid 176174 /dev/sda3
btrfs csum failed ino 583798 extent 351186280448 csum 2181979701 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186292736 csum 3726591744 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186296832 csum 1376110139 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186284544 csum 933723429 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186288640 csum 3626351784 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186280448 csum 2181979701 wanted 1645309641 mirror 0
btrfs csum failed ino 583798 extent 351186280448 csum 2181979701 wanted 1645309641 mirror 0

#3 Updated by Sage Weil about 11 years ago

  • Status changed from New to Closed
     0> 2013-02-02 17:07:51.597365 7f05237ee700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const hobject_t&, uint64_t, size_t, ceph::bufferlist&)' thread 7f05237ee700 time 2013-02-02 17:07:51.587571
os/FileStore.cc: 2732: FAILED assert(!m_filestore_fail_eio || got != -5)

This means the OSD is getting an EIO from the file system; in this case it sounds like btrfs is returning it because of the checksum errors. Ceph treats EIO as a fatal disk error, since at that point it no longer knows which data to trust.
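
For what it's worth, you can confirm that the filesystem is the source of the EIO with something like the following (the mount point here is an assumption; substitute your osd's data directory):

dmesg | grep -i 'csum failed'                    # btrfs checksum errors; affected reads return EIO
btrfs scrub start -B /var/lib/ceph/osd/ceph-12   # re-verify every checksum on the osd's filesystem
btrfs scrub status /var/lib/ceph/osd/ceph-12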

The simplest thing is to mark this osd 'out', let the cluster recover, and once all is well, blow away this file system and re-add the disk as a fresh osd. If that's not an option, there is some surgery that can be done to preserve other data on the disk, but avoiding it will save you some work.
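
In outline, that procedure looks something like this (illustrative only; it assumes the failing osd is osd.12 with its data on /dev/sdb3, as in the kernel log above, so adjust ids and devices for your cluster):

ceph osd out 12                  # 1. mark it out; its data re-replicates elsewhere
ceph health                      # 2. poll until HEALTH_OK / all PGs active+clean

ceph osd crush remove osd.12     # 3. remove the old osd from the cluster maps
ceph auth del osd.12
ceph osd rm 12

mkfs.btrfs /dev/sdb3             # 4. wipe the bad filesystem, then re-add the
                                 #    disk following the usual new-osd procedure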
