Bug #17180

OSD restarts intermittently while running CephFS IO

Added by Rohith Radhakrishnan over 7 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When running vdbench from a client using CephFS, this happens intermittently on a few OSDs (not always the same OSDs).

Client logs:
[ 505.680367] libceph: osd10 10.242.43.100:6816 socket closed (con state CONNECTING)
[ 505.744420] libceph: osd16 10.242.43.100:6824 socket closed (con state CONNECTING)
[ 547.086427] libceph: wrong peer, want 10.242.43.100:6816/149418, got 10.242.43.100:6816/164064
[ 547.086435] libceph: osd10 10.242.43.100:6816 wrong peer at address
[ 547.086532] libceph: wrong peer, want 10.242.43.100:6824/150869, got 10.242.43.100:6824/164064
[ 547.086539] libceph: osd16 10.242.43.100:6824 wrong peer at address
[ 547.523701] libceph: osd10 down
[ 547.523707] libceph: osd16 down
[ 548.534148] libceph: osd10 up
[ 548.534151] libceph: osd16 up
[ 1017.276007] libceph: osd4 10.242.43.100:6804 socket closed (con state OPEN)
[ 1017.281995] libceph: osd4 10.242.43.100:6804 socket closed (con state CONNECTING)
[ 1017.671234] libceph: osd4 10.242.43.100:6804 socket closed (con state CONNECTING)
[ 1042.397481] libceph: wrong peer, want 10.242.43.100:6804/146211, got 10.242.43.100:6804/166493
[ 1042.397489] libceph: osd4 10.242.43.100:6804 wrong peer at address
[ 1042.779505] libceph: osd4 down
[ 1043.789496] libceph: osd4 up

================================================================

Also, ceph -w gives the logs below:

2016-08-31 17:11:17.131070 mon.0 [INF] pgmap v1705: 1600 pgs: 1600 active+clean; 831 GB data, 1702 GB used, 221 TB / 223 TB avail; 65303 kB/s wr, 18 op/s
2016-08-31 17:11:18.135933 mon.0 [INF] pgmap v1706: 1600 pgs: 1600 active+clean; 831 GB data, 1703 GB used, 221 TB / 223 TB avail; 509 B/s wr, 2 op/s
2016-08-31 17:11:13.734663 mds.0 [WRN] 8 slow requests, 5 included below; oldest blocked for > 62.760281 secs
2016-08-31 17:11:13.734665 mds.0 [WRN] slow request 62.760281 seconds old, received at 2016-08-31 17:10:10.974331: client_request(client.15391:58026 create #1000000e36b/vdb_f0197.file 2016-08-31 17:10:13.986093) currently submit entry: journal_and_reply
2016-08-31 17:11:13.734666 mds.0 [WRN] slow request 62.443545 seconds old, received at 2016-08-31 17:10:11.291067: client_request(client.15391:58027 create #1000000e36b/vdb_f0198.file 2016-08-31 17:10:14.302100) currently submit entry: journal_and_reply
2016-08-31 17:11:13.734667 mds.0 [WRN] slow request 61.445985 seconds old, received at 2016-08-31 17:10:12.288627: client_request(client.15391:58028 create #1000000e36b/vdb_f0199.file 2016-08-31 17:10:15.298124) currently submit entry: journal_and_reply
2016-08-31 17:11:13.734669 mds.0 [WRN] slow request 61.022644 seconds old, received at 2016-08-31 17:10:12.711968: client_request(client.15391:58029 create #1000000e36b/vdb_f0200.file 2016-08-31 17:10:15.722134) currently submit entry: journal_and_reply
2016-08-31 17:11:13.734670 mds.0 [WRN] slow request 60.770056 seconds old, received at 2016-08-31 17:10:12.964556: client_request(client.15391:58030 create #1000000e36b/vdb_f0201.file 2016-08-31 17:10:15.974140) currently submit entry: journal_and_reply
2016-08-31 17:11:19.142947 mon.0 [INF] pgmap v1707: 1600 pgs: 1600 active+clean; 831 GB data, 1703 GB used, 221 TB / 223 TB avail; 10187 kB/s wr, 4 op/s
2016-08-31 17:11:20.145094 mon.0 [INF] pgmap v1708: 1600 pgs: 1600 active+clean; 832 GB data, 1704 GB used, 221 TB / 223 TB avail; 119 MB/s wr, 31 op/s
================================================================

On the OSD nodes, dmesg shows the logs below.
[81827.232096] init: ceph-osd (ceph/4) main process (146211) killed by ABRT signal
[81827.232106] init: ceph-osd (ceph/4) main process ended, respawning
[84656.436709] init: ceph-osd (ceph/10) main process (164063) killed by ABRT signal
[84656.436726] init: ceph-osd (ceph/10) main process ended, respawning
[84656.528517] init: ceph-osd (ceph/6) main process (147483) killed by ABRT signal
[84656.528524] init: ceph-osd (ceph/6) main process ended, respawning

================================================================
On the OSD nodes I tried raising the open file limit with sysctl -w fs.file-max=6550696000, but the same problem still happens:
cat /proc/sys/fs/file-max
6550696000
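
(Note that fs.file-max is only the system-wide cap; the per-process descriptor limit of a running ceph-osd is separate and can be checked roughly like this, using one of the PIDs from the dmesg output above purely as an illustration:)

# per-process descriptor limit of one ceph-osd (146211 is the PID dmesg reports for ceph/4)
grep 'open files' /proc/146211/limits
# limit that processes started from this shell inherit
ulimit -n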

Actions #1

Updated by huang jun over 7 years ago

As far as I know, the ceph-run script will respawn an OSD after it aborts.
You should check the OSD log to find out why the OSD aborted.
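
For example, something along these lines on the affected node should show the abort and the events leading up to it (osd.4 is just the instance named in dmesg above; adjust the path to whichever OSD died):

grep -B 5 -A 30 -e 'FAILED assert' -e 'Caught signal' /var/log/ceph/ceph-osd.4.log | less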

Actions #2

Updated by Rohith Radhakrishnan over 7 years ago

OSD logs:

2016-09-16 11:41:38.708334 7f9e52e24700 1 osd.19 5702 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:40:12.495354 front 2016-09-16 11:40:12.495354 (cutoff 2016-09-16 11:41:18.708281)
-298> 2016-09-16 11:41:41.008734 7f9e52e24700 -1 osd.19 5702 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:40:12.495354 front 2016-09-16 11:40:12.495354 (cutoff 2016-09-16 11:41:21.008732)
-297> 2016-09-16 11:41:44.509101 7f9e52e24700 -1 osd.19 5702 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:40:12.495354 front 2016-09-16 11:40:12.495354 (cutoff 2016-09-16 11:41:24.509100)
-296> 2016-09-16 11:41:49.209487 7f9e52e24700 -1 osd.19 5702 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:40:12.495354 front 2016-09-16 11:40:12.495354 (cutoff 2016-09-16 11:41:29.209484)
-295> 2016-09-16 11:41:50.309867 7f9e52e24700 -1 osd.19 5702 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:40:12.495354 front 2016-09-16 11:40:12.495354 (cutoff 2016-09-16 11:41:30.309863)
-294> 2016-09-16 11:41:51.410233 7f9e52e24700 -1 osd.19 5702 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:40:12.495354 front 2016-09-16 11:40:12.495354 (cutoff 2016-09-16 11:41:31.410231)
-293> 2016-09-16 11:42:05.560936 7f9e216ac700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.114:6824/15400084 pipe(0x7f9eb0320000 sd=489 :6807 s=0 pgs=0 cs=0 l=0 c=0x7f9ecebb5900).accept we reset (peer sent cseq 1), sending RESETSESSION
-292> 2016-09-16 11:42:09.148128 7f9e216ac700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.114:6824/15400084 pipe(0x7f9eb0320000 sd=489 :6807 s=2 pgs=4502 cs=1 l=0 c=0x7f9ecebb5900).reader missed message? skipped from seq 0 to 1244008781
-291> 2016-09-16 11:42:09.148639 7f9e216ac700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.114:6824/15400084 pipe(0x7f9ec41e8800 sd=489 :6807 s=0 pgs=0 cs=0 l=0 c=0x7f9ecebb5a80).accept we reset (peer sent cseq 2), sending RESETSESSION
-290> 2016-09-16 11:42:09.155042 7f9e216ac700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.114:6824/15400084 pipe(0x7f9ec41e8800 sd=489 :6807 s=2 pgs=4638 cs=1 l=0 c=0x7f9ecebb5a80).reader missed message? skipped from seq 0 to 203218437
-289> 2016-09-16 11:42:09.155509 7f9e216ac700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.114:6824/15400084 pipe(0x7f9efb7be800 sd=489 :6807 s=0 pgs=0 cs=0 l=0 c=0x7f9ea68d5000).accept we reset (peer sent cseq 2), sending RESETSESSION

2016-09-16 11:45:17.681698 7f9e5ee3c700  0 -- 10.242.43.102:6803/3203977 submit_message osd_op_reply(18415972 10000178daf.0000021b [write 0~4194304] v5700'18885 uv18885 ondisk = 0) v7 remote, 10.242.43.106:0/198074132, failed lossy con, dropping message 0x7f9ec2cc5c80
-138> 2016-09-16 11:45:17.681822 7f9e5ee3c700 0 -- 10.242.43.102:6803/3203977 submit_message osd_op_reply(18416016 10000178da9.0000021b [write 0~4194304] v5700'18886 uv18886 ondisk = 0) v7 remote, 10.242.43.106:0/198074132, failed lossy con, dropping message 0x7f9ec2cc3b80
-137> 2016-09-16 11:45:18.015692 7f9e7065f700 0 -- 10.242.43.102:6803/3203977 submit_message osd_op_reply(18418981 10000178daa.00000380 [write 0~4194304] v5702'19322 uv19322 ondisk = 0) v7 remote, 10.242.43.106:0/198074132, failed lossy con, dropping message 0x7f9f15bc3b80

2016-09-16 11:46:32.056220 7f9e3b7c9700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.116:6825/5154101 pipe(0x7f9eb021d400 sd=263 :58135 s=1 pgs=2950 cs=2 l=0 c=0x7f9eaa92d000).connect got RESETSESSION
-5> 2016-09-16 11:46:32.056265 7f9e4b4dc700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.103:6805/2659717 pipe(0x7f9eafe8e000 sd=233 :37457 s=1 pgs=6785 cs=2 l=0 c=0x7f9eafe57500).connect got RESETSESSION
-4> 2016-09-16 11:46:32.457419 7f9e357d9700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.116:6805/4158656 pipe(0x7f9eb6dc5400 sd=154 :52223 s=2 pgs=3268 cs=1 l=0 c=0x7f9ea9f4fb00).fault, initiating reconnect
-3> 2016-09-16 11:46:32.457777 7f9e3c023700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.116:6805/4158656 pipe(0x7f9eb6dc5400 sd=154 :52236 s=1 pgs=3268 cs=2 l=0 c=0x7f9ea9f4fb00).connect got RESETSESSION
-2> 2016-09-16 11:46:32.470500 7f9e357d9700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.116:6805/4158656 pipe(0x7f9eb6dc5400 sd=154 :52236 s=2 pgs=3269 cs=1 l=0 c=0x7f9ea9f4fb00).fault, initiating reconnect
-1> 2016-09-16 11:46:32.470963 7f9e3c023700 0 -- 10.242.43.102:6807/3203977 >> 10.242.43.116:6805/4158656 pipe(0x7f9eb6dc5400 sd=154 :52237 s=1 pgs=3269 cs=2 l=0 c=0x7f9ea9f4fb00).connect got RESETSESSION
0> 2016-09-16 11:47:53.371813 7f9e95757700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f9e95757700 time 2016-09-16 11:47:52.820218
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

ceph version 10.2.2-CEPH-1.4.0.10 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f9e9b2759cb]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2b1) [0x7f9e9b1b9271]
3: (ceph::HeartbeatMap::is_healthy()+0xc6) [0x7f9e9b1b9a36]
4: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x7f9e9b1ba227]
5: (CephContextServiceThread::entry()+0x14b) [0x7f9e9b28d11b]
6: (()+0x8184) [0x7f9e99317184]
7: (clone()+0x6d) [0x7f9e972f137d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 1 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 0 journaler
0/ 5 objectcacher
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/10 civetweb
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
1/ 5 kinetic
1/ 5 fuse
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.19.log
--- end dump of recent events ---
2016-09-16 11:47:57.841882 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:37.841880)
2016-09-16 11:48:03.742529 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:43.742526)
2016-09-16 11:48:04.243002 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:44.243000)
2016-09-16 11:48:08.943473 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:48.943468)
2016-09-16 11:48:13.043954 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:53.043928)
2016-09-16 11:48:44.025447 7f9e95757700 -1 *** Caught signal (Aborted) **
in thread 7f9e95757700 thread_name:service

ceph version 10.2.2-CEPH-1.4.0.10 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x8fd7f2) [0x7f9e9b17c7f2]
2: (()+0x10330) [0x7f9e9931f330]
3: (gsignal()+0x37) [0x7f9e9722dc37]
4: (abort()+0x148) [0x7f9e97231028]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x7f9e9b275ba5]
6: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2b1) [0x7f9e9b1b9271]
7: (ceph::HeartbeatMap::is_healthy()+0xc6) [0x7f9e9b1b9a36]
8: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x7f9e9b1ba227]
9: (CephContextServiceThread::entry()+0x14b) [0x7f9e9b28d11b]
10: (()+0x8184) [0x7f9e99317184]
11: (clone()+0x6d) [0x7f9e972f137d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-5> 2016-09-16 11:47:57.841882 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:37.841880)
-4> 2016-09-16 11:48:03.742529 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:43.742526)
-3> 2016-09-16 11:48:04.243002 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:44.243000)
-2> 2016-09-16 11:48:08.943473 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:48.943468)
-1> 2016-09-16 11:48:13.043954 7f9e52e24700 -1 osd.19 5713 heartbeat_check: no reply from osd.12 since back 2016-09-16 11:47:36.437667 front 2016-09-16 11:47:36.437667 (cutoff 2016-09-16 11:47:53.043928)
0> 2016-09-16 11:48:44.025447 7f9e95757700 -1 *** Caught signal (Aborted) **
in thread 7f9e95757700 thread_name:service

ceph version 10.2.2-CEPH-1.4.0.10 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x8fd7f2) [0x7f9e9b17c7f2]
2: (()+0x10330) [0x7f9e9931f330]
3: (gsignal()+0x37) [0x7f9e9722dc37]
4: (abort()+0x148) [0x7f9e97231028]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x7f9e9b275ba5]
6: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2b1) [0x7f9e9b1b9271]
7: (ceph::HeartbeatMap::is_healthy()+0xc6) [0x7f9e9b1b9a36]
8: (ceph::HeartbeatMap::check_touch_file()+0x17) [0x7f9e9b1ba227]
9: (CephContextServiceThread::entry()+0x14b) [0x7f9e9b28d11b]
10: (()+0x8184) [0x7f9e99317184]
11: (clone()+0x6d) [0x7f9e972f137d]

Actions #3

Updated by Peter Maloney over 7 years ago

Same here with 10.2.3. It happened during a test: I had a test cluster of 3 VMs, ran fio on an RBD image mapped by QEMU and on a CephFS at the same time, then hit Ctrl+C on one of the VMs, and the OSD on one of the other VMs died this way.
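
For what it's worth, the CephFS side of that load was a plain fio run; a rough sketch of the kind of invocation (all parameters here are illustrative, not the exact job that was used):

fio --name=cephfs-stress --directory=/mnt/cephfs --rw=randwrite \
    --bs=4M --size=4G --numjobs=4 --direct=1 --time_based --runtime=600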

-11> 2016-10-04 16:54:32.262349 7ff1004ca700 -1 osd.0 1822 heartbeat_check: no reply from osd.2 since back 2016-10-04 16:51:59.767043 front 2016-10-04 16:51:59.767043 (cutoff 2016-10-04 16:54:12.262342)
-10> 2016-10-04 16:54:32.262518 7ff1004ca700 0 log_channel(cluster) log [WRN] : 102 slow requests, 5 included below; oldest blocked for > 152.099946 secs
-9> 2016-10-04 16:54:32.262527 7ff1004ca700 0 log_channel(cluster) log [WRN] : slow request 61.776620 seconds old, received at 2016-10-04 16:53:30.485826: osd_op(client.314110.0:141 0.51cf7e68 (undecoded) ondisk+write+known_if_redirected e1825) currently no flag points reached
-8> 2016-10-04 16:54:32.262532 7ff1004ca700 0 log_channel(cluster) log [WRN] : slow request 121.446212 seconds old, received at 2016-10-04 16:52:30.816235: osd_op(client.314107.0:18176 0.8c66552c (undecoded) ack+ondisk+retry+write+known_if_redirected e1824) currently no flag points reached
-7> 2016-10-04 16:54:32.262535 7ff1004ca700 0 log_channel(cluster) log [WRN] : slow request 121.445922 seconds old, received at 2016-10-04 16:52:30.816524: osd_op(client.314107.0:18189 0.43e88885 (undecoded) ack+ondisk+retry+write+known_if_redirected e1824) currently no flag points reached
-6> 2016-10-04 16:54:32.262539 7ff1004ca700 0 log_channel(cluster) log [WRN] : slow request 121.445654 seconds old, received at 2016-10-04 16:52:30.816793: osd_op(client.314107.0:18194 0.6353eb85 (undecoded) ack+ondisk+retry+write+known_if_redirected e1824) currently no flag points reached
-5> 2016-10-04 16:54:32.262545 7ff1004ca700 0 log_channel(cluster) log [WRN] : slow request 121.445520 seconds old, received at 2016-10-04 16:52:30.816926: osd_op(client.314107.0:18198 0.c51da27e (undecoded) ack+ondisk+retry+write+known_if_redirected e1824) currently no flag points reached
-4> 2016-10-04 16:54:32.686557 7ff0f2fc8700 1 -- 10.3.75.1:6801/1339 <== osd.1 10.3.75.2:0/1208 203 ==== osd_ping(ping e1825 stamp 2016-10-04 16:54:32.685693) v2 ==== 47+0+0 (481035695 0 0) 0x7ff121fbd200 con 0x7ff12111a100
-3> 2016-10-04 16:54:32.686599 7ff0f2fc8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7ff0e8eb3700' had timed out after 15
-2> 2016-10-04 16:54:32.686602 7ff0f2fc8700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7ff0e8eb3700' had suicide timed out after 150
-1> 2016-10-04 16:54:32.691641 7ff0f17c5700 1 -- 10.99.75.1:6801/1339 <== osd.1 10.3.75.2:0/1208 203 ==== osd_ping(ping e1825 stamp 2016-10-04 16:54:32.685693) v2 ==== 47+0+0 (481035695 0 0) 0x7ff124b9a800 con 0x7ff12179f480
0> 2016-10-04 16:54:32.691837 7ff0f2fc8700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7ff0f2fc8700 time 2016-10-04 16:54:32.686617
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7ff10bd92e0b]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2b1) [0x7ff10bcd9851]
3: (ceph::HeartbeatMap::is_healthy()+0xc6) [0x7ff10bcda016]
4: (OSD::handle_osd_ping(MOSDPing*)+0x94f) [0x7ff10b73893f]
5: (OSD::heartbeat_dispatch(Message*)+0x3cb) [0x7ff10b739b7b]
6: (DispatchQueue::entry()+0x78b) [0x7ff10be4f4bb]
7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff10bd6f4dd]
8: (()+0x8184) [0x7ff10a25f184]
9: (clone()+0x6d) [0x7ff10838b37d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #4

Updated by Samuel Just over 7 years ago

  • Status changed from New to Closed

This crash indicates that the thread in question hung for longer than the configured timeout, so the OSD assumed the disk/fs had hung and committed suicide. You can avoid it by changing the suicide heartbeat value.
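
For instance, a minimal sketch of that kind of override in ceph.conf, assuming the Jewel option names osd_op_thread_timeout / osd_op_thread_suicide_timeout (the "timed out after 15" and "suicide timed out after 150" lines above match their defaults); the values here are illustrative only:

[osd]
# default is 150 seconds; let op threads stall longer before the OSD asserts and aborts
osd_op_thread_suicide_timeout = 300
# default is 15 seconds; optionally relax the softer warning threshold as well
osd_op_thread_timeout = 60

The same settings should also be injectable into running OSDs with "ceph tell osd.* injectargs", and there are analogous filestore_op_thread_* timeouts if the stall is in the FileStore threads; raising the timeouts only hides the symptom, though, so it is still worth finding out why the disk or filesystem stalled for that long.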

Actions #5

Updated by Yoann Moulin over 7 years ago

Hello,

I see similar behavior on my Ceph cluster running Jewel 10.2.2. The suicide timeout appeared a few minutes after I started pushing 30 TB of
data to an S3 bucket on an EC 8+2 pool. Previously, I had pushed 4 TB to that bucket without any issue.

Here is the ceph-post-file ID: c86638df-a297-4f58-a337-0e570d4b8702

List of files:

cephprod_20161015_nodebug.log
cephprod_20161025_debug.log
cephprod-osd.0_20161025_debug.log
cephprod-osd.107_20161015_nodebug.log
cephprod-osd.131_20161015_nodebug.log
cephprod-osd.136_20161015_nodebug.log
cephprod-osd.24_20161015_nodebug.log
cephprod-osd.27_20161015_nodebug.log
cephprod-osd.37_20161015_nodebug.log
cephprod-osd.46_20161015_nodebug.log
cephprod-osd.64_20161015_nodebug.log
cephprod-osd.86_20161015_nodebug.log
cephprod-osd.90_20161025_debug.log
cephprod-osd.93_20161025_debug.log
cephprod-osd.95_20161015_nodebug.log
report.log

tag 20161015_nodebug: log files from when the behavior started, without debug enabled
tag 20161025_debug: log files with debug enabled, from when I re-enabled scrubbing and deep-scrubbing

My previous mail on the ceph-users list: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg33179.html

thanks for your help

Yoann
