Bug #8301 (Closed): Suicide Timeout on Cache Tier OSDs
% Done: 0%
Source: other
Severity: 3 - minor
Description
This appears to happen on both firefly and master. During 4MB random-write tests on an RBD volume in a Ceph cache tier configuration, the cache tier pool OSDs all commit suicide due to heartbeat timeouts. Before the suicide, all OSDs in the cluster (in some cases at multiple different points in time) report themselves as internally unhealthy. All OSDs are on the same system. The issue only appears during long-running tests with many objects; smaller tests with a smaller cache pool and a smaller volume did not reproduce it.
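For reference, a setup along these lines should exercise the same path. This is a sketch, not the exact cbt configuration used for these runs: the pool names, sizes, and fio parameters below are illustrative.

```
# Create a base pool and a smaller cache pool, then attach the cache
# pool as a writeback tier in front of the base pool.
ceph osd pool create base 1024
ceph osd pool create cache 512
ceph osd tier add base cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay base cache

# Large RBD volume plus a long 4MB random-write run (fio's rbd engine).
rbd create --size 262144 base/test-volume
fio --ioengine=rbd --pool=base --rbdname=test-volume \
    --rw=randwrite --bs=4M --iodepth=16 \
    --time_based --runtime=7200 --name=cachetier-randwrite
```

The key reproduction conditions per the description are the long runtime and large object count; short runs against a small cache pool did not trigger the timeouts.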
On the base pool OSDs:
osd.3.log:2014-05-07 05:44:12.973121 7ffad7f4d700 10 osd.3 152 internal heartbeat not healthy, dropping ping request
osd.3.log:2014-05-07 05:44:44.120810 7ffad6f4b700 10 osd.3 162 internal heartbeat not healthy, dropping ping request
While on the cache tier OSDs:
osd.35.log: -1536> 2014-05-07 05:42:40.533492 7f18a48a9700 10 osd.35 151 internal heartbeat not healthy, dropping ping request
osd.35.log: -1535> 2014-05-07 05:42:41.780083 7f18a38a7700 10 osd.35 151 internal heartbeat not healthy, dropping ping request
Eventually the cache tier OSDs show:
     0> 2014-05-07 05:45:27.531799 7fe2c9636700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe2c9636700 time 2014-05-07 05:45:14.061454
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
 ceph version 0.80-rc1-153-g24c5ea8 (24c5ea8df040da0889be7ab1a9985ae03ee68d9e)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2ab) [0x99a08b]
 2: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, long, long)+0x72) [0x99a392]
 3: (ThreadPool::worker(ThreadPool::WorkThread*)+0x5dd) [0xa42afd]
 4: (ThreadPool::WorkThread::entry()+0x10) [0xa43f00]
 5: (()+0x7f6e) [0x7fe2de543f6e]
 6: (clone()+0x6d) [0x7fe2dc8e49cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   0/ 0 mds
   0/ 0 mds_balancer
   0/ 0 mds_locker
   0/ 0 mds_log
   0/ 0 mds_log_expire
   0/ 0 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 1 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 0 journaler
   0/ 0 objectcacher
   0/ 0 client
  20/20 osd
   0/ 0 optracker
   0/ 0 objclass
  20/20 filestore
   1/ 3 keyvaluestore
   0/ 0 journal
   1/ 1 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /tmp/cbt/ceph/log/osd.30.log
--- end dump of recent events ---

2014-05-07 05:45:27.531801 7fe2db79d700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe2db79d700 time 2014-05-07 05:44:57.707730
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
 ceph version 0.80-rc1-153-g24c5ea8 (24c5ea8df040da0889be7ab1a9985ae03ee68d9e)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2ab) [0x99a08b]
 2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0x99a8f6]
 3: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x99af07]
 4: (CephContextServiceThread::entry()+0x154) [0xa65ec4]
 5: (()+0x7f6e) [0x7fe2de543f6e]
 6: (clone()+0x6d) [0x7fe2dc8e49cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

2014-05-07 05:45:27.679140 7fe2ca638700  1 -- 192.168.10.1:6950/7353 --> 192.168.10.2:6789/0 -- mon_subscribe({monmap=2+,osd_pg_creates=0}) v2 -- ?+0 0x4354000 con 0x1511f080
2014-05-07 05:45:27.679208 7fe2ca638700  1 -- 192.168.10.1:6950/7353 --> 192.168.10.2:6789/0 -- log(1 entries) v1 -- ?+0 0x313de480 con 0x1511f080
2014-05-07 05:45:28.205448 7fe2bd3f9700  1 -- 192.168.10.1:6950/7353 >> :/0 pipe(0xcab3480 sd=36 :6950 s=0 pgs=0 cs=0 l=0 c=0x1279d340).accept sd=36 192.168.10.2:54864/0
2014-05-07 05:45:28.406463 7fe2d7795700 20 filestore(/tmp/cbt/mnt/osd-device-30-data) sync_entry woke after 5.000071
2014-05-07 05:45:28.406494 7fe2d7795700 20 filestore(/tmp/cbt/mnt/osd-device-30-data) sync_entry waiting for max_interval 5.000000
2014-05-07 05:45:28.760636 7fe2c3e2b700 20 osd.30 152 update_osd_stat osd_stat(69656 MB used, 92432 MB avail, 158 GB total, peers []/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33])
2014-05-07 05:45:28.760658 7fe2c3e2b700  5 osd.30 152 heartbeat: osd_stat(69656 MB used, 92432 MB avail, 158 GB total, peers []/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33])

--- begin dump of recent events ---
    -7> 2014-05-07 05:45:27.531801 7fe2db79d700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe2db79d700 time 2014-05-07 05:44:57.707730
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
 ceph version 0.80-rc1-153-g24c5ea8 (24c5ea8df040da0889be7ab1a9985ae03ee68d9e)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2ab) [0x99a08b]
 2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0x99a8f6]
 3: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x99af07]
 4: (CephContextServiceThread::entry()+0x154) [0xa65ec4]
 5: (()+0x7f6e) [0x7fe2de543f6e]
 6: (clone()+0x6d) [0x7fe2dc8e49cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
    -6> 2014-05-07 05:45:27.679140 7fe2ca638700  1 -- 192.168.10.1:6950/7353 --> 192.168.10.2:6789/0 -- mon_subscribe({monmap=2+,osd_pg_creates=0}) v2 -- ?+0 0x4354000 con 0x1511f080
    -5> 2014-05-07 05:45:27.679208 7fe2ca638700  1 -- 192.168.10.1:6950/7353 --> 192.168.10.2:6789/0 -- log(1 entries) v1 -- ?+0 0x313de480 con 0x1511f080
    -4> 2014-05-07 05:45:28.205448 7fe2bd3f9700  1 -- 192.168.10.1:6950/7353 >> :/0 pipe(0xcab3480 sd=36 :6950 s=0 pgs=0 cs=0 l=0 c=0x1279d340).accept sd=36 192.168.10.2:54864/0
    -3> 2014-05-07 05:45:28.406463 7fe2d7795700 20 filestore(/tmp/cbt/mnt/osd-device-30-data) sync_entry woke after 5.000071
    -2> 2014-05-07 05:45:28.406494 7fe2d7795700 20 filestore(/tmp/cbt/mnt/osd-device-30-data) sync_entry waiting for max_interval 5.000000
    -1> 2014-05-07 05:45:28.760636 7fe2c3e2b700 20 osd.30 152 update_osd_stat osd_stat(69656 MB used, 92432 MB avail, 158 GB total, peers []/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33])
     0> 2014-05-07 05:45:28.760658 7fe2c3e2b700  5 osd.30 152 heartbeat: osd_stat(69656 MB used, 92432 MB avail, 158 GB total, peers []/[] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33])
--- end dump of recent events ---
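The first backtrace dies inside ThreadPool::worker, i.e. an OSD worker thread failed to check in with the internal heartbeat watchdog within the suicide grace period. As a diagnostic aid only (raising the grace masks the stall rather than fixing it), the watchdog thresholds are configurable; a sketch of the relevant ceph.conf options, assuming the option names and defaults from this era of Ceph (op threads defaulted to roughly 150s, filestore op threads to roughly 180s):

```
[osd]
    ; Lengthen the suicide grace to keep OSDs alive long enough to
    ; observe what the stuck threads are doing. Illustrative values.
    osd op thread suicide timeout = 300
    filestore op thread suicide timeout = 300
```

With longer timeouts, `ceph daemon osd.<id> dump_ops_in_flight` during the unhealthy window may show which operations are backing up on the cache tier.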