Bug #12328

Ceph KV OSD crashes sometimes with FAILED assert(0 == "hit suicide timeout") on 0.94.1

Added by Kenneth Waegeman over 8 years ago. Updated almost 7 years ago.

Status:
Won't Fix
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When rsyncing data to CephFS, sometimes one or more OSDs crash.
Log output from one such OSD:

   -12> 2015-07-14 13:29:04.693090 7f6743db0700  0 log_channel(cluster) log [WRN] : slow request 142.985014 seconds old, received at 2015-07-14 13:26:41.707961: osd_repop(osd.1.0:474288 1.20 99cd4c20/10000017a0f.00000000/head//1 v 122'5460) currently no flag points reached
   -11> 2015-07-14 13:29:04.693093 7f6743db0700  0 log_channel(cluster) log [WRN] : slow request 142.833813 seconds old, received at 2015-07-14 13:26:41.859161: osd_repop(osd.33.0:493429 1.63 78aa6c63/1000001ee19.00000000/head//1 v 122'5731) currently started
   -10> 2015-07-14 13:29:04.930965 7f6736bfc700  1 -- 10.143.8.181:6803/19927 <== osd.19 10.143.8.181:0/21497 105431 ==== osd_ping(ping e122 stamp 2015-07-14 13:29:04.930745) v2 ==== 47+0+0 (1630228558 0 0) 0x6c47000 con 0x57eef60
    -9> 2015-07-14 13:29:04.930960 7f67383ff700  1 -- 10.141.8.181:6803/19927 <== osd.19 10.143.8.181:0/21497 105431 ==== osd_ping(ping e122 stamp 2015-07-14 13:29:04.930745) v2 ==== 47+0+0 (1630228558 0 0) 0xdf34400 con 0x57ec200
    -8> 2015-07-14 13:29:04.931016 7f6736bfc700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f672fbee700' had timed out after 15
    -7> 2015-07-14 13:29:04.931020 7f6736bfc700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f67313f1700' had timed out after 15
    -6> 2015-07-14 13:29:04.931033 7f67383ff700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f672fbee700' had timed out after 15
    -5> 2015-07-14 13:29:04.931037 7f67383ff700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f67313f1700' had timed out after 15
    -4> 2015-07-14 13:29:05.205493 7f67383ff700  1 -- 10.141.8.181:6803/19927 <== osd.18 10.143.8.181:0/20704 105272 ==== osd_ping(ping e122 stamp 2015-07-14 13:29:05.205277) v2 ==== 47+0+0 (628769321 0 0) 0x7283400 con 0x57e82c0
    -3> 2015-07-14 13:29:05.205508 7f6736bfc700  1 -- 10.143.8.181:6803/19927 <== osd.18 10.143.8.181:0/20704 105272 ==== osd_ping(ping e122 stamp 2015-07-14 13:29:05.205277) v2 ==== 47+0+0 (628769321 0 0) 0x7ce9a00 con 0x57e8160
    -2> 2015-07-14 13:29:05.205522 7f67383ff700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f672fbee700' had timed out after 15
    -1> 2015-07-14 13:29:05.205526 7f67383ff700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f672fbee700' had suicide timed out after 150
     0> 2015-07-14 13:29:05.209756 7f67383ff700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f67383ff700 time 2015-07-14 13:29:05.205531
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc51f5]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xafaff9]
 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0xafb8ee]
 4: (OSD::handle_osd_ping(MOSDPing*)+0x733) [0x695f13]
 5: (OSD::heartbeat_dispatch(Message*)+0x2fb) [0x69718b]
 6: (DispatchQueue::entry()+0x62a) [0xc7cc4a]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0xba403d]
 8: (()+0x7df5) [0x7f674c587df5]
 9: (clone()+0x6d) [0x7f674b04e1ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

...
2015-07-14 13:29:05.271802 7f67383ff700 -1 *** Caught signal (Aborted) **
 in thread 7f67383ff700

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-osd() [0xac51f2]
 2: (()+0xf130) [0x7f674c58f130]
 3: (gsignal()+0x37) [0x7f674af8d5d7]
 4: (abort()+0x148) [0x7f674af8ecc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f674b8a09b5]
 6: (()+0x5e926) [0x7f674b89e926]
 7: (()+0x5e953) [0x7f674b89e953]
 8: (()+0x5eb73) [0x7f674b89eb73]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbc53ea]
 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xafaff9]
 11: (ceph::HeartbeatMap::is_healthy()+0xde) [0xafb8ee]
 12: (OSD::handle_osd_ping(MOSDPing*)+0x733) [0x695f13]
 13: (OSD::heartbeat_dispatch(Message*)+0x2fb) [0x69718b]
 14: (DispatchQueue::entry()+0x62a) [0xc7cc4a]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0xba403d]
 16: (()+0x7df5) [0x7f674c587df5]
 17: (clone()+0x6d) [0x7f674b04e1ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2015-07-14 13:29:05.271802 7f67383ff700 -1 *** Caught signal (Aborted) **
 in thread 7f67383ff700

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-osd() [0xac51f2]
 2: (()+0xf130) [0x7f674c58f130]
 3: (gsignal()+0x37) [0x7f674af8d5d7]
 4: (abort()+0x148) [0x7f674af8ecc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f674b8a09b5]
 6: (()+0x5e926) [0x7f674b89e926]
 7: (()+0x5e953) [0x7f674b89e953]
 8: (()+0x5eb73) [0x7f674b89eb73]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbc53ea]
 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xafaff9]
 11: (ceph::HeartbeatMap::is_healthy()+0xde) [0xafb8ee]
 12: (OSD::handle_osd_ping(MOSDPing*)+0x733) [0x695f13]
 13: (OSD::heartbeat_dispatch(Message*)+0x2fb) [0x69718b]
 14: (DispatchQueue::entry()+0x62a) [0xc7cc4a]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0xba403d]
 16: (()+0x7df5) [0x7f674c587df5]
 17: (clone()+0x6d) [0x7f674b04e1ad]

The OSD can be restarted normally afterwards.
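
For reference, the "timed out after 15" and "suicide timed out after 150" messages above are the heartbeat grace and suicide grace of the OSD::osd_op_tp thread pool. Raising them does not fix whatever is stalling the threads, but it can keep an otherwise responsive OSD from aborting while the root cause is investigated. A minimal ceph.conf sketch, assuming the Hammer-era option names osd_op_thread_timeout (default 15) and osd_op_thread_suicide_timeout (default 150):

 [osd]
 # grace before heartbeat_map logs 'had timed out' for OSD::osd_op_tp threads
 osd op thread timeout = 60
 # grace before the FAILED assert(0 == "hit suicide timeout") fires
 osd op thread suicide timeout = 300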

History

#1 Updated by Kenneth Waegeman over 8 years ago

We see the same thing in our production cluster, which runs the same Ceph version and configuration, with KeyValueStore on EC OSDs:

   -11> 2015-07-14 17:05:24.299882 7f0824479700  1 -- 10.141.16.11:0/17839 <== osd.233 10.141.16.23:6827/27123 323 ==== osd_ping(you_died e21218 stamp 2015-07-14 17:05:24.289598) v2 ==== 47+0+0 (862640411 0 0) 0x6b61a00 con 0x9313180
   -10> 2015-07-14 17:05:24.299888 7f0824479700  1 -- 10.141.16.11:0/17839 <== osd.173 10.141.16.20:6829/53390 323 ==== osd_ping(you_died e21218 stamp 2015-07-14 17:05:24.289598) v2 ==== 47+0+0 (862640411 0 0) 0x13dcb600 con 0x4572dee0
    -9> 2015-07-14 17:05:24.299894 7f0824479700  1 -- 10.141.16.11:0/17839 <== osd.87 10.143.16.15:6823/25376 325 ==== osd_ping(you_died e21218 stamp 2015-07-14 17:05:24.289598) v2 ==== 47+0+0 (862640411 0 0) 0x1e7dda00 con 0x5f626880
    -8> 2015-07-14 17:05:28.136387 7f08317a9700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f081e46d700' had timed out after 15
    -7> 2015-07-14 17:05:28.139616 7f08317a9700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f081e46d700' had suicide timed out after 150
    -6> 2015-07-14 17:05:28.344302 7f081f46f700 10 monclient: tick
    -5> 2015-07-14 17:05:28.368416 7f081f46f700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2015-07-14 17:04:58.367510)
    -4> 2015-07-14 17:05:28.372101 7f081f46f700 10 monclient: renew subs? (now: 2015-07-14 17:05:28.372100; renew after: 2015-07-14 17:04:33.075361) -- yes
    -3> 2015-07-14 17:05:28.372753 7f081f46f700 10 monclient: renew_subs
    -2> 2015-07-14 17:05:28.375533 7f081f46f700 10 monclient: _send_mon_message to mon.mds03 at 10.141.16.3:6789/0
    -1> 2015-07-14 17:05:28.375569 7f081f46f700  1 -- 10.141.16.11:6810/17839 --> 10.141.16.3:6789/0 -- mon_subscribe({monmap=2+,osd_pg_creates=0,osdmap=21128}) v2 -- ?+0 0x16d24600 con 0x8e95120
     0> 2015-07-14 17:05:28.413966 7f08317a9700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f08317a9700 time 2015-07-14 17:05:28.140873
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc51f5]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xafaff9]
 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0xafb8ee]
 4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0xafc00c]
 5: (CephContextServiceThread::entry()+0x15b) [0xbd542b]
 6: (()+0x7df5) [0x7f083457ddf5]
 7: (clone()+0x6d) [0x7f08330601ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#2 Updated by Haomai Wang over 8 years ago

My intuition is that it is blocking on submitting transactions to the backend. It's possible that leveldb gets stuck in compaction and blocks client IO.

Are you using leveldb?

#3 Updated by Kenneth Waegeman over 8 years ago

Yes, our OSDs are using leveldb.

#4 Updated by Haomai Wang over 8 years ago

I think what we can do is raise the keyvaluestore thread timeout values, and either tune the leveldb config or use rocksdb as the backend.
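
A minimal ceph.conf sketch of that suggestion, assuming the Hammer-era option names keyvaluestore_backend, keyvaluestore_op_thread_timeout and keyvaluestore_op_thread_suicide_timeout, plus the generic leveldb_* tuning options (KeyValueStore was experimental in 0.94, so treat this as a starting point rather than a recommendation):

 [osd]
 # switch the experimental KV object store from leveldb to rocksdb
 keyvaluestore backend = rocksdb
 # give the KV op threads more headroom before the heartbeat warns / suicides
 keyvaluestore op thread timeout = 120
 keyvaluestore op thread suicide timeout = 600
 # or, if staying on leveldb, give it a larger cache (512 MB) and write buffer (32 MB)
 leveldb cache size = 536870912
 leveldb write buffer size = 33554432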

#5 Updated by Kenneth Waegeman over 8 years ago

I guess I should then add `keyvaluestore_backend = rocksdb` to the config?
Can this be done for existing OSDs?
Can they be mixed in the same pool? (Then I could recreate/change OSDs gradually.)

Thank you!
Kenneth

#6 Updated by Kenneth Waegeman over 8 years ago

This time we got the same error, but on a filestore OSD:

  
   -11> 2015-07-17 12:20:53.171071 7f72bc889700  1 -- 10.143.8.180:0/21943 <== osd.5 10.141.8.180:6813/2025958 2490 ==== osd_ping(ping_reply e973 stamp 2015-07-17 12:20:53.036238) v2 ==== 47+0+0 (2350916550 0 0) 0x65a9c00 con 0xc7c1e40
   -10> 2015-07-17 12:20:53.173162 7f72bc889700  1 -- 10.143.8.180:0/21943 <== osd.2 10.141.8.180:6805/23455 2490 ==== osd_ping(ping_reply e973 stamp 2015-07-17 12:20:53.036238) v2 ==== 47+0+0 (2350916550 0 0) 0xb46a800 con 0xc7c7bc0
    -9> 2015-07-17 12:20:53.175126 7f72bc889700  1 -- 10.143.8.180:0/21943 <== osd.2 10.143.8.180:6805/23455 2490 ==== osd_ping(ping_reply e973 stamp 2015-07-17 12:20:53.036238) v2 ==== 47+0+0 (2350916550 0 0) 0x7e85e00 con 0xc7c48e0
    -8> 2015-07-17 12:20:53.175224 7f72bc889700  1 -- 10.143.8.180:0/21943 <== osd.5 10.143.8.180:6835/2025958 2490 ==== osd_ping(ping_reply e973 stamp 2015-07-17 12:20:53.036238) v2 ==== 47+0+0 (2350916550 0 0) 0x1acd4a00 con 0xc7c14a0
    -7> 2015-07-17 12:20:53.189213 7f72bc889700  1 -- 10.143.8.180:0/21943 <== osd.4 10.143.8.180:6809/25053 185101 ==== osd_ping(ping_reply e973 stamp 2015-07-17 12:20:53.036238) v2 ==== 47+0+0 (2350916550 0 0) 0x787a800 con 0x67cb2e0
    -6> 2015-07-17 12:20:53.719974 7f72ba084700  1 -- 10.143.8.180:6801/21943 <== osd.1 10.143.8.180:0/19736 2511 ==== osd_ping(ping e973 stamp 2015-07-17 12:20:53.719120) v2 ==== 47+0+0 (3630202744 0 0) 0x293ec800 con 0x5d056a0
    -5> 2015-07-17 12:20:53.720115 7f72ba084700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f72b4078700' had timed out after 15
    -4> 2015-07-17 12:20:53.720123 7f72ba084700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f72b4078700' had suicide timed out after 150
    -3> 2015-07-17 12:20:53.868534 7f72ccc07700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f72b4078700' had timed out after 15
    -2> 2015-07-17 12:20:53.868573 7f72ccc07700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f72b4078700' had suicide timed out after 150
    -1> 2015-07-17 12:20:53.927786 7f72bb887700  1 -- 10.141.8.180:6801/21943 <== osd.34 10.141.8.182:0/22430 185102 ==== osd_ping(ping e973 stamp 2015-07-17 12:20:53.926670) v2 ==== 47+0+0 (4208267044 0 0) 0x7e87400 con 0x60a3180
     0> 2015-07-17 12:20:53.928257 7f72ccc07700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f72ccc07700 time 2015-07-17 12:20:53.868594
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc51f5]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xafaff9]
 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0xafb8ee]
 4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0xafc00c]
 5: (CephContextServiceThread::entry()+0x15b) [0xbd542b]
 6: (()+0x7df5) [0x7f72cfa0fdf5]
 7: (clone()+0x6d) [0x7f72ce4d61ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2015-07-17 12:20:53.928268 7f72ba084700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f72ba084700 time 2015-07-17 12:20:53.722164
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc51f5]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2d9) [0xafaff9]
 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0xafb8ee]
 4: (OSD::handle_osd_ping(MOSDPing*)+0x733) [0x695f13]
 5: (OSD::heartbeat_dispatch(Message*)+0x2fb) [0x69718b]
 6: (DispatchQueue::entry()+0x62a) [0xc7cc4a]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0xba403d]
 8: (()+0x7df5) [0x7f72cfa0fdf5]
 9: (clone()+0x6d) [0x7f72ce4d61ad]

2015-07-17 13:57:53.425744 7f55ae088880  0 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-osd, pid 55201
2015-07-17 13:57:53.457758 7f55ae088880 -1 WARNING: the following dangerous and experimental features are enabled: keyvaluestore
2015-07-17 13:57:53.501718 7f55ae088880  0 filestore(/var/lib/ceph/osd/ceph-0) backend xfs (magic 0x58465342)
2015-07-17 13:57:53.510131 7f55ae088880  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is supported and appears to work
2015-07-17 13:57:53.510141 7f55ae088880  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-07-17 13:57:53.510662 7f55ae088880  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2015-07-17 13:57:53.510742 7f55ae088880  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: extsize is supported and kernel 3.10.0-229.4.2.el7.centos.ug.x86_64 >= 3.5
2015-07-17 13:57:53.633826 7f55ae088880  0 filestore(/var/lib/ceph/osd/ceph-0) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled

So it is not limited to KV stores.

#7 Updated by huang jun over 8 years ago

Hi Kenneth,
We also hit this suicide error very often when testing 4K random writes.

The osd_op_tp threads service the op_wq work queue; for a write, the heartbeat timeout is reset when the write op is handed off to the FileJournal's work queue.

If the pressure came from submitting transactions to the backend, it should be the filestore op_tp that times out, not osd_op_tp.

I have been using lockdep and lttng to trace this problem, but with no luck so far.
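
To make that distinction easier to check in the logs: each thread pool has its own grace/suicide pair, so the thread-pool name in the heartbeat_map message tells you which knobs are involved. A sketch, assuming the Hammer-era option names (the defaults quoted here are from memory, so verify with `ceph daemon osd.<id> config show`):

 [osd]
 # OSD::osd_op_tp -- the pool named in the crashes in this ticket (defaults 15 / 150)
 osd op thread timeout = 15
 osd op thread suicide timeout = 150
 # FileStore op_tp -- would show up as a 'FileStore::op_tp thread ... had timed out' message (defaults 60 / 180)
 filestore op thread timeout = 60
 filestore op thread suicide timeout = 180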

#8 Updated by huang jun over 8 years ago

huang jun wrote:

Hi Kenneth,
We also hit this suicide error very often when testing 4K random writes.

The osd_op_tp threads service the op_wq work queue; for a write, the heartbeat timeout is reset when the write op is handed off to the FileJournal's work queue.

If the pressure came from submitting transactions to the backend, it should be the filestore op_tp that times out, not osd_op_tp.

I have been using lockdep and lttng to trace this problem, but with no luck so far.

#9 Updated by Dan van der Ster over 8 years ago

Did you guys find a fix for this?

We upgraded from Firefly to Hammer 0.94.5 two days ago, and since then we've had three OSD heartbeat suicides. The most recent one was a hanging 4k object create.

An example slow request:

2015-11-19 16:31:23.001378 osd.655 128.142.23.84:6942/3240023 539 : cluster [WRN] slow request 240.720215 seconds old, received at 2015-11-19 16:27:22.281109: osd_op(client.53707886.0:7400356601 rbd_data.33a409b2d523beb.0000000000007923 [write 2154496~4096] 4.29b54dc1 ack+ondisk+write e295797) currently waiting for subops from 541,767

and osd.767 had one osd thread timing out until it committed suicide:

2015-11-19 16:33:34.936161 7f2dd9a0e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f2db1c26700' had suicide timed out after 150

Here’s the last op from that thread:

2015-11-19 16:31:01.222687 7f2db1c26700  7 osd.767 pg_epoch: 295797 pg[64.31e( v 295797'301109 (292493'298050,295797'301109] local-les=294244 n=6 ec=105353 les/c 294244/294244 294243/294243/294243) [767,18,655] r=0 lpr=294243 crt=295794'301107 lcod 295797'301108 mlcod 295797'301108 active+clean] issue_repop rep_tid 593075 o a2b44b1e/benchmark_data_p01001532021656.cern.ch_1533185_object76/head//64

(The op comes from one of our monitoring probes, which are a very light load of 4k writes every 5 minutes.)

I posted that OSD's log:
ceph-post-file: da8fd559-8417-4718-8979-362c864ba38c

#10 Updated by Greg Farnum over 8 years ago

This is unlikely to be the same issue you're hitting, Dan, unless you're also using KV-backed OSDs.

Probably you're just hitting slowness that you weren't noticing before (I think some of the timeouts got turned down) and will need to adjust, but that's for a separate thread/ticket.
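
For anyone who does want to experiment with those grace values on a running cluster, they can be injected at runtime before being persisted in ceph.conf; a sketch, assuming the option names discussed above (injectargs changes are lost when the OSD restarts):

 # bump the suicide grace on all OSDs
 ceph tell osd.* injectargs '--osd_op_thread_suicide_timeout 300'
 # or only on a single suspect OSD, e.g. the osd.767 from the report above
 ceph tell osd.767 injectargs '--osd_op_thread_timeout 60 --osd_op_thread_suicide_timeout 300'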

#11 Updated by Sage Weil almost 7 years ago

  • Status changed from New to Won't Fix
