Bug #8301: Suicide Timeout on Cache Tier OSDs

Added by Mark Nelson almost 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: Urgent
Source: other
Severity: 3 - minor
% Done: 0%

Description

This appears to happen on both firefly and master. During 4MB random write tests on an RBD volume with a Ceph cache tier configuration, the tier pool OSDs all commit suicide due to heartbeat timeouts. Before the suicide, all OSDs in the cluster (in some cases at multiple points in time) claim to be internally unhealthy. All OSDs are on the same system. The issue only appears during long-running tests with many objects; smaller tests with a smaller cache pool and a smaller volume did not reproduce it.
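For reference, the timeouts involved here are all configurable, which can help when trying to isolate the stall. A sketch of the relevant ceph.conf settings (the values shown are my understanding of the firefly-era defaults and worth double-checking; raising the suicide timeouts only masks the stall for diagnosis, it does not fix the underlying cause):

    [osd]
        # grace period before peers report this OSD down to the monitors
        osd heartbeat grace = 20
        # soft timeout: a worker thread past this marks the OSD "unhealthy"
        osd op thread timeout = 15
        # hard timeout: a worker thread past this triggers the suicide assert
        osd op thread suicide timeout = 150
        # corresponding soft/hard timeouts for the filestore op threads
        filestore op thread timeout = 60
        filestore op thread suicide timeout = 180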

On the base pool OSDs:

osd.3.log:2014-05-07 05:44:12.973121 7ffad7f4d700 10 osd.3 152 internal heartbeat not healthy, dropping ping request
osd.3.log:2014-05-07 05:44:44.120810 7ffad6f4b700 10 osd.3 162 internal heartbeat not healthy, dropping ping request

While on the cache tier OSDs:

osd.35.log: -1536> 2014-05-07 05:42:40.533492 7f18a48a9700 10 osd.35 151 internal heartbeat not healthy, dropping ping request
osd.35.log: -1535> 2014-05-07 05:42:41.780083 7f18a38a7700 10 osd.35 151 internal heartbeat not healthy, dropping ping request
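The "internal heartbeat not healthy, dropping ping request" message means the OSD's ping handler consulted its internal HeartbeatMap and found that some worker thread had not checked in within its soft timeout, so the OSD stopped answering peer pings (which in turn leads peers to report it down). A minimal C++ sketch of that logic (a simplified paraphrase, not the actual Ceph implementation):

    #include <ctime>
    #include <vector>

    // Each registered worker thread owns a handle with two deadlines.
    struct heartbeat_handle_d {
      time_t timeout;          // soft deadline: missing it makes the OSD "unhealthy"
      time_t suicide_timeout;  // hard deadline: missing it aborts the process
    };

    // is_healthy(): true only if every worker thread has checked in recently.
    // An unhealthy OSD drops incoming pings, so peers soon report it down.
    bool is_healthy(const std::vector<heartbeat_handle_d*>& handles, time_t now) {
      for (const auto* h : handles)
        if (h->timeout && h->timeout < now)
          return false;
      return true;
    }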

Eventually the cache tier OSDs show:

     0> 2014-05-07 05:45:27.531799 7fe2c9636700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe2c9636700 time 2014-05-07 05:45:14.061454
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.80-rc1-153-g24c5ea8 (24c5ea8df040da0889be7ab1a9985ae03ee68d9e)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2ab) [0x99a08b]
 2: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, long, long)+0x72) [0x99a392]
 3: (ThreadPool::worker(ThreadPool::WorkThread*)+0x5dd) [0xa42afd]
 4: (ThreadPool::WorkThread::entry()+0x10) [0xa43f00]
 5: (()+0x7f6e) [0x7fe2de543f6e]
 6: (clone()+0x6d) [0x7fe2dc8e49cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   0/ 0 mds
   0/ 0 mds_balancer
   0/ 0 mds_locker
   0/ 0 mds_log
   0/ 0 mds_log_expire
   0/ 0 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 0 journaler
   0/ 0 objectcacher
   0/ 0 client
  20/20 osd
   0/ 0 optracker
   0/ 0 objclass
  20/20 filestore
   1/ 3 keyvaluestore
   0/ 0 journal
   1/ 1 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /tmp/cbt/ceph/log/osd.30.log
--- end dump of recent events ---
2014-05-07 05:45:27.531801 7fe2db79d700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe2db79d700 time 2014-05-07 05:44:57.707730
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.80-rc1-153-g24c5ea8 (24c5ea8df040da0889be7ab1a9985ae03ee68d9e)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2ab) [0x99a08b]
 2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0x99a8f6]
 3: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x99af07]
 4: (CephContextServiceThread::entry()+0x154) [0xa65ec4]
 5: (()+0x7f6e) [0x7fe2de543f6e]
 6: (clone()+0x6d) [0x7fe2dc8e49cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2014-05-07 05:45:27.679140 7fe2ca638700  1 -- 192.168.10.1:6950/7353 --> 192.168.10.2:6789/0 -- mon_subscribe({monmap=2+,osd_pg_creates=0}) v2 -- ?+0 0x4354000 con 0x1511f080
2014-05-07 05:45:27.679208 7fe2ca638700  1 -- 192.168.10.1:6950/7353 --> 192.168.10.2:6789/0 -- log(1 entries) v1 -- ?+0 0x313de480 con 0x1511f080
2014-05-07 05:45:28.205448 7fe2bd3f9700  1 -- 192.168.10.1:6950/7353 >> :/0 pipe(0xcab3480 sd=36 :6950 s=0 pgs=0 cs=0 l=0 c=0x1279d340).accept sd=36 192.168.10.2:54864/0
2014-05-07 05:45:28.406463 7fe2d7795700 20 filestore(/tmp/cbt/mnt/osd-device-30-data) sync_entry woke after 5.000071
2014-05-07 05:45:28.406494 7fe2d7795700 20 filestore(/tmp/cbt/mnt/osd-device-30-data) sync_entry waiting for max_interval 5.000000
2014-05-07 05:45:28.760636 7fe2c3e2b700 20 osd.30 152 update_osd_stat osd_stat(69656 MB used, 92432 MB avail, 158 GB total, peers []/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33])
2014-05-07 05:45:28.760658 7fe2c3e2b700  5 osd.30 152 heartbeat: osd_stat(69656 MB used, 92432 MB avail, 158 GB total, peers []/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33])
--- begin dump of recent events ---
    -7> 2014-05-07 05:45:27.531801 7fe2db79d700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe2db79d700 time 2014-05-07 05:44:57.707730
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.80-rc1-153-g24c5ea8 (24c5ea8df040da0889be7ab1a9985ae03ee68d9e)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2ab) [0x99a08b]
 2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0x99a8f6]
 3: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x99af07]
 4: (CephContextServiceThread::entry()+0x154) [0xa65ec4]
 5: (()+0x7f6e) [0x7fe2de543f6e]
 6: (clone()+0x6d) [0x7fe2dc8e49cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    -6> 2014-05-07 05:45:27.679140 7fe2ca638700  1 -- 192.168.10.1:6950/7353 --> 192.168.10.2:6789/0 -- mon_subscribe({monmap=2+,osd_pg_creates=0}) v2 -- ?+0 0x4354000 con 0x1511f080
    -5> 2014-05-07 05:45:27.679208 7fe2ca638700  1 -- 192.168.10.1:6950/7353 --> 192.168.10.2:6789/0 -- log(1 entries) v1 -- ?+0 0x313de480 con 0x1511f080
    -4> 2014-05-07 05:45:28.205448 7fe2bd3f9700  1 -- 192.168.10.1:6950/7353 >> :/0 pipe(0xcab3480 sd=36 :6950 s=0 pgs=0 cs=0 l=0 c=0x1279d340).accept sd=36 192.168.10.2:54864/0
    -3> 2014-05-07 05:45:28.406463 7fe2d7795700 20 filestore(/tmp/cbt/mnt/osd-device-30-data) sync_entry woke after 5.000071
    -2> 2014-05-07 05:45:28.406494 7fe2d7795700 20 filestore(/tmp/cbt/mnt/osd-device-30-data) sync_entry waiting for max_interval 5.000000
    -1> 2014-05-07 05:45:28.760636 7fe2c3e2b700 20 osd.30 152 update_osd_stat osd_stat(69656 MB used, 92432 MB avail, 158 GB total, peers []/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33])
     0> 2014-05-07 05:45:28.760658 7fe2c3e2b700  5 osd.30 152 heartbeat: osd_stat(69656 MB used, 92432 MB avail, 158 GB total, peers []/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33])
--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   0/ 0 mds
   0/ 0 mds_balancer
   0/ 0 mds_locker
   0/ 0 mds_log
   0/ 0 mds_log_expire
   0/ 0 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 0 journaler
   0/ 0 objectcacher
   0/ 0 client
  20/20 osd
   0/ 0 optracker
   0/ 0 objclass
  20/20 filestore
   1/ 3 keyvaluestore
   0/ 0 journal
   1/ 1 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /tmp/cbt/ceph/log/osd.30.log
--- end dump of recent events ---
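For context on the assert at common/HeartbeatMap.cc:79 in the backtraces above: worker threads periodically call reset_timeout() to push their deadlines forward, and a checker walks every handle. Continuing the simplified sketch from earlier (again a paraphrase, under the assumption that the firefly code follows this shape, not the exact source):

    #include <cassert>
    #include <ctime>

    // Reusing heartbeat_handle_d from the sketch above: a missed soft
    // deadline only marks the map unhealthy, while a missed hard deadline
    // crashes the OSD on purpose, producing the "hit suicide timeout"
    // assert seen in the log. The intent is that a restart is better
    // than an OSD wedged indefinitely on a stuck thread.
    bool check_handle(heartbeat_handle_d* h, time_t now) {
      bool healthy = true;
      if (h->timeout && h->timeout < now)
        healthy = false;                      // soft limit missed: drop pings
      if (h->suicide_timeout && h->suicide_timeout < now)
        assert(0 == "hit suicide timeout");   // hard limit missed: abort
      return healthy;
    }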