Bug #7996 (closed)

0.78: OSD is not suspend-friendly (unresponsive cluster on OSD crash)

Added by Dmitry Smirnov about 10 years ago. Updated over 9 years ago.

Status:
Won't Fix
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

One machine running a MON and an OSD got suspended.
Shortly after (within seconds) the whole cluster became unresponsive for over 1000 seconds,
while `ceph osd tree` was still showing the OSD on the suspended machine as 'up'.
Waking up the affected computer did not recover the cluster for minutes.
It remained frozen, i.e. all rbd devices were unresponsive, etc.
Examination of the OSD log revealed the following:

    -1> 2014-04-05 13:52:44.658331 7f5037fb7700  3 osd.9 14396 handle_osd_map epochs [14397,14399], i have 14396, src has [13847,14399]
     0> 2014-04-05 13:52:44.772065 7f50495f5700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f50495f5700 time 2014-04-05 13:52:42.320776
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2a9) [0x7f504de1d1b9]
 2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0x7f504de1da26]
 3: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x7f504de1e037]
 4: (CephContextServiceThread::entry()+0x154) [0x7f504dee6d14]
 5: (()+0x8062) [0x7f504cdad062]
 6: (clone()+0x6d) [0x7f504b4d9a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.9.log
--- end dump of recent events ---
2014-04-05 13:52:44.772305 7f50405d0700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f50405d0700 time 2014-04-05 13:52:42.320793
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2a9) [0x7f504de1d1b9]
 2: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, long, long)+0x72) [0x7f504de1d4c2]
 3: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x24a) [0x7f504dbd0b1a]
 4: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x6c) [0x7f504dbd517c]
 5: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x160) [0x7f504dbd5300]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7f504dec4441]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x7f504dec5330]
 8: (()+0x8062) [0x7f504cdad062]
 9: (clone()+0x6d) [0x7f504b4d9a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2014-04-05 13:52:44.772076 7f5040dd1700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f5040dd1700 time 2014-04-05 13:52:42.320775
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2a9) [0x7f504de1d1b9]
 2: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, long, long)+0x72) [0x7f504de1d4c2]
 3: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x24a) [0x7f504dbd0b1a]
 4: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x6c) [0x7f504dbd517c]
 5: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x160) [0x7f504dbd5300]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7f504dec4441]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x7f504dec5330]
 8: (()+0x8062) [0x7f504cdad062]
 9: (clone()+0x6d) [0x7f504b4d9a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
    -1> 2014-04-05 13:52:44.772305 7f50405d0700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f50405d0700 time 2014-04-05 13:52:42.320793
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2a9) [0x7f504de1d1b9]
 2: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, long, long)+0x72) [0x7f504de1d4c2]
 3: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x24a) [0x7f504dbd0b1a]
 4: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x6c) [0x7f504dbd517c]
 5: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x160) [0x7f504dbd5300]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7f504dec4441]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x7f504dec5330]
 8: (()+0x8062) [0x7f504cdad062]
 9: (clone()+0x6d) [0x7f504b4d9a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2014-04-05 13:52:44.772076 7f5040dd1700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f5040dd1700 time 2014-04-05 13:52:42.320775
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2a9) [0x7f504de1d1b9]
 2: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, long, long)+0x72) [0x7f504de1d4c2]
 3: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x24a) [0x7f504dbd0b1a]
 4: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x6c) [0x7f504dbd517c]
 5: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x160) [0x7f504dbd5300]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0x7f504dec4441]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x7f504dec5330]
 8: (()+0x8062) [0x7f504cdad062]
 9: (clone()+0x6d) [0x7f504b4d9a3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.9.log
--- end dump of recent events ---
--- begin dump of recent events ---
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.9.log
--- end dump of recent events ---
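
In case it helps diagnose or work around this: while the hung OSD is still shown as 'up', its state can be checked and the OSD forced down by hand from another node using the standard commands below, instead of waiting for it to be marked down automatically (osd.9 here only because that is the affected OSD in the log above):

    # how the monitors currently see the OSDs (the suspended one still shows 'up')
    ceph osd tree
    # overall cluster state, including blocked/slow requests
    ceph health detail
    # manually mark the unresponsive OSD down so clients stop waiting on it
    ceph osd down osd.9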

We started the OSD some minutes after waking up the computer and noticed that the cluster
became responsive only once the OSD had initialised.

There are 10 OSDs in total on 5 hosts, with 4 replicas per pool.
The cluster is fully operational when that particular computer/OSD is down (it is the only OSD on the suspended machine).
The OSD does not crash every time on suspend, but when it does crash
it causes significant and unexpected downtime for the entire cluster.
IMHO the cluster should not be affected that badly by an OSD crash, just as loss of connectivity with a machine/OSD does not freeze the cluster at all.

Perhaps something can be improved in the handling of unresponsive OSDs when they were not brought down gracefully...

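For reference, these are the settings I believe govern the timings seen above; the values are the documented defaults as far as I can tell, so treat the exact numbers as my assumption rather than a recommendation:

    # ceph.conf sketch -- options that appear relevant to this report
    [osd]
    # peers report an OSD as failed after it misses heartbeats for this long
    osd heartbeat grace = 20
    # a stuck FileStore op thread hits the "suicide timeout" assert after this long
    filestore op thread suicide timeout = 180
    # same idea for OSD op worker threads
    osd op thread suicide timeout = 150

    [mon]
    # if no failure reports arrive, the monitor only marks a silent OSD down
    # after this long -- roughly matching the ~1000 seconds of unresponsiveness
    mon osd report timeout = 900

If the monitor is only declaring the suspended OSD dead via `mon osd report timeout` rather than via peer failure reports, that might explain why the cluster stayed frozen for so long.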

Files

ceph-osd.9.log.suicide-timeout.xz (139 KB) - Dmitry Smirnov, 05/03/2014 05:13 PM