Bug #11527

closed

KV OSD stacktrace on disk failure

Added by Kenneth Waegeman almost 9 years ago. Updated almost 9 years ago.

Status:
Closed
Priority:
Low
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
4 - irritation
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When an OSD failed, this was the stacktrace I got:

    -6> 2015-05-01 15:10:28.472385 7f02820f3700  1 -- 10.141.16.14:6846/1003323 <== osd.17 10.143.16.11:0/5157 110160 ==== osd_ping(ping e3296 stamp 2015-05-01 15:10:28.471824) v2 ==== 47+0+0 (3225483472 0 0) 0x1ed0c400 con 0xe8e61a0
    -5> 2015-05-01 15:10:28.472491 7f02820f3700  1 -- 10.141.16.14:6846/1003323 --> 10.143.16.11:0/5157 -- osd_ping(ping_reply e3296 stamp 2015-05-01 15:10:28.471824) v2 -- ?+0 0x1a48a200 con 0xe8e61a0
    -4> 2015-05-01 15:10:28.474514 7f02808f0700  1 -- 10.143.16.14:6849/1003323 <== osd.89 10.143.16.15:0/3407 110135 ==== osd_ping(ping e3296 stamp 2015-05-01 15:10:28.473849) v2 ==== 47+0+0 (605194218 0 0) 0x2d552c00 con 0xe8e4780
    -3> 2015-05-01 15:10:28.474548 7f02808f0700  1 -- 10.143.16.14:6849/1003323 --> 10.143.16.15:0/3407 -- osd_ping(ping_reply e3296 stamp 2015-05-01 15:10:28.473849) v2 -- ?+0 0x13daf200 con 0xe8e4780
    -2> 2015-05-01 15:10:28.474558 7f02820f3700  1 -- 10.141.16.14:6846/1003323 <== osd.89 10.143.16.15:0/3407 110135 ==== osd_ping(ping e3296 stamp 2015-05-01 15:10:28.473849) v2 ==== 47+0+0 (605194218 0 0) 0x23ab1200 con 0xe8e23c0
    -1> 2015-05-01 15:10:28.474590 7f02820f3700  1 -- 10.141.16.14:6846/1003323 --> 10.143.16.15:0/3407 -- osd_ping(ping_reply e3296 stamp 2015-05-01 15:10:28.473849) v2 -- ?+0 0x1ed0c400 con 0xe8e23c0
     0> 2015-05-01 15:10:28.475037 7f02768dc700 -1 *** Caught signal (Bus error) **
 in thread 7f02768dc700

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-osd() [0xac51f2]
 2: (()+0xf130) [0x7f02938f6130]
 3: (leveldb::ReadBlock(leveldb::RandomAccessFile*, leveldb::ReadOptions const&, leveldb::BlockHandle const&, leveldb::BlockContents*)+0x233) [0x7f0294510733]
 4: (leveldb::Table::BlockReader(void*, leveldb::ReadOptions const&, leveldb::Slice const&)+0x276) [0x7f02945118a6]
 5: (()+0x3acd0) [0x7f0294513cd0]
 6: (()+0x3b071) [0x7f0294514071]
 7: (()+0x38028) [0x7f0294511028]
 8: (()+0x21a45) [0x7f02944faa45]
 9: (LevelDBStore::LevelDBWholeSpaceIteratorImpl::lower_bound(std::string const&, std::string const&)+0x49) [0x96a4d9]
 10: (GenericObjectMap::list_objects(coll_t const&, ghobject_t, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x907) [0xa8d777]
 11: (KeyValueStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x239) [0x930b69]
 12: (KeyValueStore::collection_list_range(coll_t, ghobject_t, ghobject_t, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*)+0x164) [0x954e14]
 13: (PGBackend::objects_list_range(hobject_t const&, hobject_t const&, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, std::vector<ghobject_t, std::allocator<ghobject_t> >*)+0x106) [0x8cb496]
 14: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1df) [0x7dd0df]
 15: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4c2) [0x7dd8e2]
 16: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbe) [0x6da9ce]
 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb5f16]
 18: (ThreadPool::WorkThread::entry()+0x10) [0xbb6fa0]
 19: (()+0x7df5) [0x7f02938eedf5]
 20: (clone()+0x6d) [0x7f02923d11ad]

This is not a big problem in itself, but would it be possible to report a message that the disk failed instead of this stack trace? At first glance it looked like a Ceph issue.

Actions #1

Updated by Haomai Wang almost 9 years ago

From the "Bus error" message, I'm inclined to think it's a hardware I/O error?

Actions #2

Updated by Kenneth Waegeman almost 9 years ago

Yes, I should have been clearer: it was indeed a disk hardware failure, and the disk needed replacement.

I was just wondering whether it would be possible to throw a different error, rather than a stack trace that makes it look like something is wrong with LevelDB :)

Actions #3

Updated by Haomai Wang almost 9 years ago

  • Status changed from New to Closed

Hmm, I think it would need more work from Ceph itself, but I still have no clear sense of how to approach this.
