Bug #22995
OSD_DOWN during ffsb on knfs
Status: Closed
Description
This is for v12.2.3 release
Run: http://pulpito.ceph.com/yuriw-2018-02-13_20:44:59-knfs-luminous-testing-basic-smithi/
Jobs: 2186577,2186578
Logs: http://qa-proxy.ceph.com/teuthology/yuriw-2018-02-13_20:44:59-knfs-luminous-testing-basic-smithi/2186577/teuthology.log
description: 'knfs/basic/{ceph/base.yaml clusters/extra-client.yaml mount/v4.yaml tasks/nfs_workunit_suites_ffsb.yaml}', duration: 1241.8487780094147, failure_reason: '"2018-02-13 21:04:09.678426 mon.b mon.0 172.21.15.41:6789/0 117 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)" in cluster log', flavor: basic, mon.a-kernel-sha1: 0b474c9ea79dfba5e2ff22ce96661f1ac21d04f3
Updated by Yuri Weinstein about 6 years ago
- Related to Bug #21425: "cluster [WRN] Health check failed: noup flag(s) set (OSDMAP_FLAGS)" in rest-luminous-distro added
Updated by Yuri Weinstein about 6 years ago
Note that the jobs passed on OVH?!
http://pulpito.ceph.com/yuriw-2018-02-13_18:11:12-knfs-luminous-testing-basic-ovh/
The question is: what makes it different?
Updated by Jeff Layton about 6 years ago
So these tests just download ffsb and iozone and run them -- I guess to just generate some I/O with particular patterns? It's hard to tell exactly from the yaml files, but I'm assuming that this mounts a cephfs filesystem via kcephfs, reexports it via knfsd and then mounts it with NFSv4 to perform the tests?
One of the iozone tests failed with a hung kernel task that may or may not be related, but we couldn't get any actionable info from that.
The rest of the tests fail due to OSD_DOWN warnings in the logs. I think we need to:
a) determine which OSD had the problem, and...
b) why it went down -- maybe have a look at its logs?
One possibility: the kernel client is occasionally sending some packet-o-death to the OSD and killing it. Alternately, maybe it's becoming so busy that it can't respond to health checks?
In any case, let's have a look at one of the logs:
2018-02-10T18:30:43.683 INFO:teuthology.run:Summary data: {client.0-kernel-sha1: 97c85e4f0e4ba581b753e7fda180fc80169e73c3, client.1-kernel-sha1: 97c85e4f0e4ba581b753e7fda180fc80169e73c3, description: 'knfs/basic/{ceph/base.yaml clusters/extra-client.yaml mount/v4.yaml tasks/nfs_workunit_suites_ffsb.yaml}', duration: 1188.9396090507507, failure_reason: '"2018-02-10 18:20:14.639320 mon.b mon.0 172.21.15.2:6789/0 120 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)" in cluster log', flavor: basic, mon.a-kernel-sha1: 97c85e4f0e4ba581b753e7fda180fc80169e73c3, mon.b-kernel-sha1: 97c85e4f0e4ba581b753e7fda180fc80169e73c3, owner: scheduled_yuriw@teuthology, success: false}
...around that time, in the same log:
2018-02-10T18:20:13.641 INFO:tasks.ceph.osd.1.smithi082.stderr:2018-02-10 18:20:13.643214 7f6b7b8dc700 -1 osd.1 18 heartbeat_check: no reply from 172.21.15.82:6807 osd.2 since back 2018-02-10 18:19:53.127583 front 2018-02-10 18:19:53.127583 (cutoff 2018-02-10 18:19:53.643183)
2018-02-10T18:20:13.667 INFO:tasks.ceph.osd.3.smithi002.stderr:2018-02-10 18:20:13.673219 7f7d29751700 -1 osd.3 18 heartbeat_check: no reply from 172.21.15.82:6807 osd.2 since back 2018-02-10 18:19:53.162748 front 2018-02-10 18:19:53.162748 (cutoff 2018-02-10 18:19:53.673177)
2018-02-10T18:20:14.640 INFO:tasks.ceph.osd.1.smithi082.stderr:2018-02-10 18:20:14.643442 7f6b7b8dc700 -1 osd.1 18 heartbeat_check: no reply from 172.21.15.82:6807 osd.2 since back 2018-02-10 18:19:53.127583 front 2018-02-10 18:19:53.127583 (cutoff 2018-02-10 18:19:54.643387)
2018-02-10T18:21:17.665 INFO:tasks.ceph.osd.0.smithi082.stderr:2018-02-10 18:21:17.669793 7fd861832700 -1 osd.0 24 heartbeat_check: no reply from 172.21.15.82:6814 osd.2 since back 2018-02-10 18:20:57.143469 front 2018-02-10 18:20:57.143469 (cutoff 2018-02-10 18:20:57.669783)
2018-02-10T18:21:18.666 INFO:tasks.ceph.osd.0.smithi082.stderr:2018-02-10 18:21:18.670028 7fd861832700 -1 osd.0 24 heartbeat_check: no reply from 172.21.15.82:6814 osd.2 since back 2018-02-10 18:20:57.143469 front 2018-02-10 18:20:57.143469 (cutoff 2018-02-10 18:20:58.669991)
2018-02-10T18:21:19.665 INFO:tasks.ceph.osd.0.smithi082.stderr:2018-02-10 18:21:19.670298 7fd861832700 -1 osd.0 24 heartbeat_check: no reply from 172.21.15.82:6814 osd.2 since back 2018-02-10 18:20:57.143469 front 2018-02-10 18:20:57.143469 (cutoff 2018-02-10 18:20:59.670261)
Those messages seem like they might be related. So, groveling around for the osd logs and looking at that time shows this:
2018-02-10 18:20:14.604683 7f84ad203700 20 osd.2 op_wq(4) _process empty q, waiting
2018-02-10 18:20:14.624586 7f84cb3a9700 10 osd.2 18 tick
2018-02-10 18:20:14.624604 7f84cb3a9700 10 osd.2 18 do_waiters -- start
2018-02-10 18:20:14.624608 7f84cb3a9700 10 osd.2 18 do_waiters -- finish
2018-02-10 18:20:14.642517 7f84ceb99700 1 -- 172.21.15.82:6804/22807 >> 172.21.15.2:6813/2043293334 conn(0x55cb7fdd5000 :6804 s=STATE_OPEN pgs=10 cs=1 l=1).read_bulk peer close file descriptor 62
2018-02-10 18:20:14.642531 7f84ceb99700 1 -- 172.21.15.82:6804/22807 >> 172.21.15.2:6813/2043293334 conn(0x55cb7fdd5000 :6804 s=STATE_OPEN pgs=10 cs=1 l=1).read_until read failed
2018-02-10 18:20:14.642535 7f84ceb99700 1 -- 172.21.15.82:6804/22807 >> 172.21.15.2:6813/2043293334 conn(0x55cb7fdd5000 :6804 s=STATE_OPEN pgs=10 cs=1 l=1).process read tag failed
2018-02-10 18:20:14.642538 7f84ceb99700 1 -- 172.21.15.82:6804/22807 >> 172.21.15.2:6813/2043293334 conn(0x55cb7fdd5000 :6804 s=STATE_OPEN pgs=10 cs=1 l=1).fault on lossy channel, failing
2018-02-10 18:20:14.642639 7f84bca22700 2 osd.2 18 ms_handle_reset con 0x55cb7fdd5000 session 0x55cb80724a00
2018-02-10 18:20:14.644804 7f84ce398700 1 -- 172.21.15.82:6807/22807 >> 172.21.15.2:0/26226 conn(0x55cb7fd96800 :6807 s=STATE_OPEN pgs=3 cs=1 l=1).read_bulk peer close file descriptor 82
2018-02-10 18:20:14.644817 7f84ce398700 1 -- 172.21.15.82:6807/22807 >> 172.21.15.2:0/26226 conn(0x55cb7fd96800 :6807 s=STATE_OPEN pgs=3 cs=1 l=1).read_until read failed
2018-02-10 18:20:14.644821 7f84ce398700 1 -- 172.21.15.82:6807/22807 >> 172.21.15.2:0/26226 conn(0x55cb7fd96800 :6807 s=STATE_OPEN pgs=3 cs=1 l=1).process read tag failed
2018-02-10 18:20:14.644824 7f84ce398700 1 -- 172.21.15.82:6807/22807 >> 172.21.15.2:0/26226 conn(0x55cb7fd96800 :6807 s=STATE_OPEN pgs=3 cs=1 l=1).fault on lossy channel, failing
2018-02-10 18:20:14.645035 7f84cf39a700 1 -- 172.21.15.82:6805/22807 >> 172.21.15.2:6805/26226 conn(0x55cb7fdc1800 :-1 s=STATE_OPEN pgs=6 cs=1 l=0).read_bulk peer close file descriptor 92
2018-02-10 18:20:14.645056 7f84cf39a700 1 -- 172.21.15.82:6805/22807 >> 172.21.15.2:6805/26226 conn(0x55cb7fdc1800 :-1 s=STATE_OPEN pgs=6 cs=1 l=0).read_until read failed
2018-02-10 18:20:14.645063 7f84cf39a700 1 -- 172.21.15.82:6805/22807 >> 172.21.15.2:6805/26226 conn(0x55cb7fdc1800 :-1 s=STATE_OPEN pgs=6 cs=1 l=0).process read tag failed
2018-02-10 18:20:14.645308 7f84cf39a700 1 -- 172.21.15.82:6805/22807 >> 172.21.15.82:6801/22806 conn(0x55cb7fdb2000 :-1 s=STATE_OPEN pgs=8 cs=1 l=0).read_bulk peer close file descriptor 86
2018-02-10 18:20:14.645327 7f84cf39a700 1 -- 172.21.15.82:6805/22807 >> 172.21.15.82:6801/22806 conn(0x55cb7fdb2000 :-1 s=STATE_OPEN pgs=8 cs=1 l=0).read_until read failed
2018-02-10 18:20:14.645334 7f84cf39a700 1 -- 172.21.15.82:6805/22807 >> 172.21.15.82:6801/22806 conn(0x55cb7fdb2000 :-1 s=STATE_OPEN pgs=8 cs=1 l=0).process read tag failed
2018-02-10 18:20:14.645425 7f84cf39a700 1 -- 172.21.15.82:6807/22807 >> 172.21.15.82:0/22806 conn(0x55cb7fd92000 :6807 s=STATE_OPEN pgs=2 cs=1 l=1).read_bulk peer close file descriptor 72
2018-02-10 18:20:14.645433 7f84cf39a700 1 -- 172.21.15.82:6807/22807 >> 172.21.15.82:0/22806 conn(0x55cb7fd92000 :6807 s=STATE_OPEN pgs=2 cs=1 l=1).read_until read failed
2018-02-10 18:20:14.645439 7f84cf39a700 1 -- 172.21.15.82:6807/22807 >> 172.21.15.82:0/22806 conn(0x55cb7fd92000 :6807 s=STATE_OPEN pgs=2 cs=1 l=1).process read tag failed
2018-02-10 18:20:14.645444 7f84cf39a700 1 -- 172.21.15.82:6807/22807 >> 172.21.15.82:0/22806 conn(0x55cb7fd92000 :6807 s=STATE_OPEN pgs=2 cs=1 l=1).fault on lossy channel, failing
2018-02-10 18:20:14.645925 7f84ce398700 1 -- 172.21.15.82:6806/22807 >> 172.21.15.82:0/22806 conn(0x55cb7fd85800 :6806 s=STATE_OPEN pgs=2 cs=1 l=1).read_bulk peer close file descriptor 69
2018-02-10 18:20:14.645933 7f84ce398700 1 -- 172.21.15.82:6806/22807 >> 172.21.15.82:0/22806 conn(0x55cb7fd85800 :6806 s=STATE_OPEN pgs=2 cs=1 l=1).read_until read failed
2018-02-10 18:20:14.645936 7f84ce398700 1 -- 172.21.15.82:6806/22807 >> 172.21.15.82:0/22806 conn(0x55cb7fd85800 :6806 s=STATE_OPEN pgs=2 cs=1 l=1).process read tag failed
2018-02-10 18:20:14.645941 7f84ce398700 1 -- 172.21.15.82:6806/22807 >> 172.21.15.82:0/22806 conn(0x55cb7fd85800 :6806 s=STATE_OPEN pgs=2 cs=1 l=1).fault on lossy channel, failing
2018-02-10 18:20:14.646709 7f84cf39a700 1 -- 172.21.15.82:6806/22807 >> 172.21.15.2:0/26226 conn(0x55cb7fd95000 :6806 s=STATE_OPEN pgs=2 cs=1 l=1).read_bulk peer close file descriptor 81
2018-02-10 18:20:14.646721 7f84cf39a700 1 -- 172.21.15.82:6806/22807 >> 172.21.15.2:0/26226 conn(0x55cb7fd95000 :6806 s=STATE_OPEN pgs=2 cs=1 l=1).read_until read failed
2018-02-10 18:20:14.646727 7f84cf39a700 1 -- 172.21.15.82:6806/22807 >> 172.21.15.2:0/26226 conn(0x55cb7fd95000 :6806 s=STATE_OPEN pgs=2 cs=1 l=1).process read tag failed
2018-02-10 18:20:14.646733 7f84cf39a700 1 -- 172.21.15.82:6806/22807 >> 172.21.15.2:0/26226 conn(0x55cb7fd95000 :6806 s=STATE_OPEN pgs=2 cs=1 l=1).fault on lossy channel, failing
2018-02-10 18:20:14.649065 7f84cf39a700 1 -- 172.21.15.82:6805/22807 >> 172.21.15.82:6809/22805 conn(0x55cb7fdc0000 :-1 s=STATE_OPEN pgs=7 cs=1 l=0).read_bulk peer close file descriptor 87
2018-02-10 18:20:14.649078 7f84cf39a700 1 -- 172.21.15.82:6805/22807 >> 172.21.15.82:6809/22805 conn(0x55cb7fdc0000 :-1 s=STATE_OPEN pgs=7 cs=1 l=0).read_until read failed
2018-02-10 18:20:14.649082 7f84cf39a700 1 -- 172.21.15.82:6805/22807 >> 172.21.15.82:6809/22805 conn(0x55cb7fdc0000 :-1 s=STATE_OPEN pgs=7 cs=1 l=0).process read tag failed
2018-02-10 18:20:14.649142 7f84cf39a700 1 -- 172.21.15.82:6806/22807 >> 172.21.15.82:0/22805 conn(0x55cb7fd87000 :6806 s=STATE_OPEN pgs=3 cs=1 l=1).read_bulk peer close file descriptor 70
2018-02-10 18:20:14.649147 7f84cf39a700 1 -- 172.21.15.82:6806/22807 >> 172.21.15.82:0/22805 conn(0x55cb7fd87000 :6806 s=STATE_OPEN pgs=3 cs=1 l=1).read_until read failed
2018-02-10 18:20:14.649150 7f84cf39a700 1 -- 172.21.15.82:6806/22807 >> 172.21.15.82:0/22805 conn(0x55cb7fd87000 :6806 s=STATE_OPEN pgs=3 cs=1 l=1).process read tag failed
2018-02-10 18:20:14.649153 7f84cf39a700 1 -- 172.21.15.82:6806/22807 >> 172.21.15.82:0/22805 conn(0x55cb7fd87000 :6806 s=STATE_OPEN pgs=3 cs=1 l=1).fault on lossy channel, failing
2018-02-10 18:20:14.649812 7f84ceb99700 1 -- 172.21.15.82:6807/22807 >> 172.21.15.82:0/22805 conn(0x55cb7fd93800 :6807 s=STATE_OPEN pgs=1 cs=1 l=1).read_bulk peer close file descriptor 74
2018-02-10 18:20:14.649824 7f84ceb99700 1 -- 172.21.15.82:6807/22807 >> 172.21.15.82:0/22805 conn(0x55cb7fd93800 :6807 s=STATE_OPEN pgs=1 cs=1 l=1).read_until read failed
2018-02-10 18:20:14.649827 7f84ceb99700 1 -- 172.21.15.82:6807/22807 >> 172.21.15.82:0/22805 conn(0x55cb7fd93800 :6807 s=STATE_OPEN pgs=1 cs=1 l=1).process read tag failed
2018-02-10 18:20:14.649832 7f84ceb99700 1 -- 172.21.15.82:6807/22807 >> 172.21.15.82:0/22805 conn(0x55cb7fd93800 :6807 s=STATE_OPEN pgs=1 cs=1 l=1).fault on lossy channel, failing
2018-02-10 18:20:14.652830 7f84caba8700 10 osd.2 18 tick_without_osd_lock
2018-02-10 18:20:14.652839 7f84caba8700 25 osd.2 18 heartbeat_check osd.0 first_tx 2018-02-10 18:16:36.693008 last_tx 2018-02-10 18:20:12.438136 last_rx_back 2018-02-10 18:20:12.438136 last_rx_front 2018-02-10 18:20:12.438136
2018-02-10 18:20:14.652847 7f84caba8700 25 osd.2 18 heartbeat_check osd.1 first_tx 2018-02-10 18:16:36.693008 last_tx 2018-02-10 18:20:12.438136 last_rx_back 2018-02-10 18:20:12.438136 last_rx_front 2018-02-10 18:20:12.438136
2018-02-10 18:20:14.652853 7f84caba8700 25 osd.2 18 heartbeat_check osd.3 first_tx 2018-02-10 18:16:36.693008 last_tx 2018-02-10 18:20:12.438136 last_rx_back 2018-02-10 18:20:12.438136 last_rx_front 2018-02-10 18:20:12.438136
2018-02-10 18:20:14.652858 7f84caba8700 25 osd.2 18 heartbeat_check osd.4 first_tx 2018-02-10 18:16:36.693008 last_tx 2018-02-10 18:20:12.438136 last_rx_back 2018-02-10 18:20:12.438136 last_rx_front 2018-02-10 18:20:12.438136
2018-02-10 18:20:14.652863 7f84caba8700 25 osd.2 18 heartbeat_check osd.5 first_tx 2018-02-10 18:16:36.693008 last_tx 2018-02-10 18:20:12.438136 last_rx_back 2018-02-10 18:20:12.438136 last_rx_front 2018-02-10 18:20:12.438136
2018-02-10 18:20:14.652901 7f84caba8700 20 osd.2 18 can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
2018-02-10 18:20:14.652905 7f84caba8700 20 osd.2 18 scrub_time_permit should run between 0 - 24 now 18 = yes
2018-02-10 18:20:14.652947 7f84caba8700 20 osd.2 18 scrub_load_below_threshold loadavg 1.43 < max 5 = yes
2018-02-10 18:20:14.652960 7f84caba8700 20 osd.2 18 sched_scrub load_is_low=1
2018-02-10 18:20:14.652963 7f84caba8700 10 osd.2 18 sched_scrub 2.0 scheduled at 2018-02-11 18:24:07.690467 > 2018-02-10 18:20:14.652904
2018-02-10 18:20:14.652967 7f84caba8700 20 osd.2 18 sched_scrub done
2018-02-10 18:20:14.652969 7f84caba8700 10 osd.2 18 promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 bytes; target 25 obj/sec or 5120 k bytes/sec
2018-02-10 18:20:14.652971 7f84caba8700 20 osd.2 18 promote_throttle_recalibrate new_prob 1000
2018-02-10 18:20:14.652972 7f84caba8700 10 osd.2 18 promote_throttle_recalibrate actual 0, actual/prob ratio 1, adjusted new_prob 1000, prob 1000 -> 1000
2018-02-10 18:20:15.466756 7f84ce398700 1 -- 172.21.15.82:6805/22807 <== osd.4 172.21.15.2:6802/26227 248 ==== osd_repop(client.4253.0:1771 3.1 e18/16) v2 ==== 994+0+4195019 (2915700608 0
Looks like a bunch of socket fds got closed around that time? (I haven't dug around in the OSD code to be sure, but that's what the logs seem to suggest).
I think we should get someone better versed in the OSD code to have a look.
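To chase (a) and (b) above, a small filter along these lines (just a sketch for this investigation, not part of the test suite; the patterns are copied from the messages quoted in this ticket) can pull the heartbeat complaints out of a `ceph-osd.*.log`:

```python
import re

# Sketch of a log filter for heartbeat trouble in a ceph-osd log.
# The three patterns come straight from the messages quoted above:
# peers complaining about a silent OSD, the OSD's own internal
# heartbeat failures, and thread-pool timeouts.
HEARTBEAT_RE = re.compile(
    r"heartbeat_check: no reply"
    r"|internal heartbeat not healthy"
    r"|had timed out after"
)

def heartbeat_lines(log_text):
    """Return only the log lines that indicate heartbeat trouble."""
    return [line for line in log_text.splitlines()
            if HEARTBEAT_RE.search(line)]
```

Running it over osd.2's log for the failed job should surface exactly the "no reply" and "dropping ping request" lines quoted in this ticket.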
Updated by Jeff Layton about 6 years ago
Around the time of the other failed heartbeat check, I see this:
2018-02-10 18:21:16.971907 7f84ceb99700 1 -- 172.21.15.82:6813/1022807 <== osd.3 172.21.15.2:0/26226 10 ==== osd_ping(ping e24 stamp 2018-02-10 18:21:16.974447) v4 ==== 2004+0+0 (766020934 0 0) 0x55cb80cc5000 con 0x55cb817f2000
2018-02-10 18:21:16.971924 7f84ceb99700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:16.971927 7f84ceb99700 10 osd.2 24 internal heartbeat not healthy, dropping ping request
2018-02-10 18:21:16.971985 7f84cf39a700 1 -- 172.21.15.82:6814/1022807 <== osd.3 172.21.15.2:0/26226 10 ==== osd_ping(ping e24 stamp 2018-02-10 18:21:16.974447) v4 ==== 2004+0+0 (766020934 0 0) 0x55cb886c1c00 con 0x55cb817f3800
2018-02-10 18:21:16.972003 7f84cf39a700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:16.972004 7f84cf39a700 10 osd.2 24 internal heartbeat not healthy, dropping ping request
2018-02-10 18:21:17.573827 7f84cd3f9700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:17.666126 7f84caba8700 10 osd.2 24 tick_without_osd_lock
2018-02-10 18:21:17.666133 7f84caba8700 25 osd.2 24 heartbeat_check osd.0 first_tx 2018-02-10 18:20:33.658446 last_tx 2018-02-10 18:21:16.668142 last_rx_back 2018-02-10 18:21:16.668142 last_rx_front 2018-02-10 18:21:16.668142
2018-02-10 18:21:17.666147 7f84caba8700 25 osd.2 24 heartbeat_check osd.1 first_tx 2018-02-10 18:20:33.658446 last_tx 2018-02-10 18:21:16.668142 last_rx_back 2018-02-10 18:21:16.668142 last_rx_front 2018-02-10 18:21:16.668142
2018-02-10 18:21:17.666150 7f84caba8700 25 osd.2 24 heartbeat_check osd.3 first_tx 2018-02-10 18:20:33.658446 last_tx 2018-02-10 18:21:16.668142 last_rx_back 2018-02-10 18:21:16.668142 last_rx_front 2018-02-10 18:21:16.668142
2018-02-10 18:21:17.666153 7f84caba8700 25 osd.2 24 heartbeat_check osd.4 first_tx 2018-02-10 18:20:33.658446 last_tx 2018-02-10 18:21:16.668142 last_rx_back 2018-02-10 18:21:16.668142 last_rx_front 2018-02-10 18:21:16.668142
2018-02-10 18:21:17.666155 7f84caba8700 25 osd.2 24 heartbeat_check osd.5 first_tx 2018-02-10 18:20:33.658446 last_tx 2018-02-10 18:21:16.668142 last_rx_back 2018-02-10 18:21:16.668142 last_rx_front 2018-02-10 18:21:16.668142
2018-02-10 18:21:17.666159 7f84caba8700 20 osd.2 24 scrub_random_backoff lost coin flip, randomly backing off
2018-02-10 18:21:17.666160 7f84caba8700 10 osd.2 24 promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 bytes; target 25 obj/sec or 5120 k bytes/sec
2018-02-10 18:21:17.666162 7f84caba8700 20 osd.2 24 promote_throttle_recalibrate new_prob 1000
2018-02-10 18:21:17.666162 7f84caba8700 10 osd.2 24 promote_throttle_recalibrate actual 0, actual/prob ratio 1, adjusted new_prob 1000, prob 1000 -> 1000
2018-02-10 18:21:17.691450 7f84cb3a9700 10 osd.2 24 tick
2018-02-10 18:21:17.691459 7f84cb3a9700 10 osd.2 24 do_waiters -- start
2018-02-10 18:21:17.691460 7f84cb3a9700 10 osd.2 24 do_waiters -- finish
2018-02-10 18:21:17.968136 7f84ceb99700 1 -- 172.21.15.82:6814/1022807 <== osd.4 172.21.15.2:0/26227 15 ==== osd_ping(ping e24 stamp 2018-02-10 18:21:17.970646) v4 ==== 2004+0+0 (241253251 0 0) 0x55cb80cc5000 con 0x55cb7fda4000
2018-02-10 18:21:17.968151 7f84ceb99700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:17.968153 7f84ceb99700 10 osd.2 24 internal heartbeat not healthy, dropping ping request
2018-02-10 18:21:17.968143 7f84cf39a700 1 -- 172.21.15.82:6813/1022807 <== osd.4 172.21.15.2:0/26227 15 ==== osd_ping(ping e24 stamp 2018-02-10 18:21:17.970646) v4 ==== 2004+0+0 (241253251 0 0) 0x55cb886c1c00 con 0x55cb7fd95000
2018-02-10 18:21:17.968248 7f84cf39a700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:17.968267 7f84cf39a700 10 osd.2 24 internal heartbeat not healthy, dropping ping request
2018-02-10 18:21:18.162479 7f84ce398700 1 -- 172.21.15.82:6813/1022807 <== osd.5 172.21.15.2:0/26228 13 ==== osd_ping(ping e24 stamp 2018-02-10 18:21:18.164826) v4 ==== 2004+0+0 (658040782 0 0) 0x55cc143c7c00 con 0x55cb7fdc3000
2018-02-10 18:21:18.162498 7f84ce398700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:18.162500 7f84ce398700 10 osd.2 24 internal heartbeat not healthy, dropping ping request
2018-02-10 18:21:18.162756 7f84cf39a700 1 -- 172.21.15.82:6814/1022807 <== osd.5 172.21.15.2:0/26228 13 ==== osd_ping(ping e24 stamp 2018-02-10 18:21:18.164826) v4 ==== 2004+0+0 (658040782 0 0) 0x55cb886c1c00 con 0x55cb7fdd6800
2018-02-10 18:21:18.162776 7f84cf39a700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:18.162777 7f84cf39a700 10 osd.2 24 internal heartbeat not healthy, dropping ping request
2018-02-10 18:21:18.246832 7f84ce398700 1 -- 172.21.15.82:6813/1022807 <== osd.0 172.21.15.82:0/22805 11 ==== osd_ping(ping e24 stamp 2018-02-10 18:21:18.246652) v4 ==== 2004+0+0 (1777999654 0 0) 0x55cc143c7c00 con 0x55cb817f5000
2018-02-10 18:21:18.246849 7f84ce398700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:18.246850 7f84ce398700 10 osd.2 24 internal heartbeat not healthy, dropping ping request
2018-02-10 18:21:18.246879 7f84ceb99700 1 -- 172.21.15.82:6814/1022807 <== osd.0 172.21.15.82:0/22805 11 ==== osd_ping(ping e24 stamp 2018-02-10 18:21:18.246652) v4 ==== 2004+0+0 (1777999654 0 0) 0x55cb80cc5000 con 0x55cb817f6800
2018-02-10 18:21:18.246887 7f84ceb99700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:18.246887 7f84ceb99700 10 osd.2 24 internal heartbeat not healthy, dropping ping request
Which again leads me to believe that there is something wrong on this OSD.
Updated by Kefu Chai about 6 years ago
In osd.2's log:
2018-02-13 21:03:22.103436 7f4b393dd700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f4b1c250700' had timed out after 15
2018-02-13 21:03:22.103438 7f4b393dd700 10 osd.2 18 internal heartbeat not healthy, dropping ping request
2018-02-13 21:03:22.103462 7f4b38bdc700 1 -- 172.21.15.200:6807/23267 <== osd.0 172.21.15.200:0/23265 49 ==== osd_ping(ping e18 stamp 2018-02-13 21:03:22.103304) v4 ==== 2004+0+0 (1933461075 0 0) 0x5590ec15d200 con 0x5590eb536000
2018-02-13 21:03:22.103558 7f4b38bdc700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f4b1c250700' had timed out after 15
2018-02-13 21:03:22.103572 7f4b38bdc700 10 osd.2 18 internal heartbeat not healthy, dropping ping request
That's why osd.2 was marked down.
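The mechanism behind those lines can be sketched roughly like this (a simplified model, not the actual Ceph OSD code; the 15 s grace and the `OSD::osd_op_tp` name are taken from the log): each worker thread must check in with the heartbeat map periodically, and once any thread has been silent past its grace, the OSD drops incoming `osd_ping` requests, so peers eventually report it down to the monitor.

```python
# Simplified model (not the real Ceph implementation) of the internal
# heartbeat: a thread is overdue when now - last check-in > grace.
# If any thread is overdue, incoming pings are dropped, which is
# exactly what osd.2's log shows above.

GRACE = 15.0  # seconds; matches "had timed out after 15" in the log

def is_healthy(last_checkin, now):
    """True iff every thread checked in within the grace period."""
    return all(now - t <= GRACE for t in last_checkin.values())

def handle_ping(last_checkin, now):
    """Mimic the ping handler: drop the request when unhealthy."""
    if not is_healthy(last_checkin, now):
        return "internal heartbeat not healthy, dropping ping request"
    return "pong"
```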
Updated by Jeff Layton about 6 years ago
Yuri Weinstein wrote:
note that jobs passed on ovh ?!
http://pulpito.ceph.com/yuriw-2018-02-13_18:11:12-knfs-luminous-testing-basic-ovh/
the question is: what makes it different?
OVH is the openstack provider? I assume that these setups were using the same kernel on the osds?
It may simply be the difference in hardware.
This test is evidently driving I/O via knfsd+kcephfs, and it's possible that that is able to generate more load than we typically do via libcephfs-based tests. One would think that we'd see similar problems with krbd tests too though if this were something load related.
Updated by Sage Weil about 6 years ago
The op_wq thread went to sleep for about 26 seconds:
2018-02-13 21:03:11.861298 7f4b1924a700 20 osd.2 op_wq(0) _process 3.0 item PGQueueable(0x5590ec0c6700 prio 127 cost 4195019 e18) queued
...
2018-02-13 21:03:37.714696 7f4b1924a700 20 osd.2 op_wq(0) _process 3.0 item PGQueueable(0x5590ed816bc0 prio 127 cost 4195019 e18) pg 0x5590ebf2e000
on osd.2. The internal heartbeat thus failed and the osd didn't reply to pings.
I don't see any way for the sharded wq code to block without printing something else, which makes me think the OSD was swapping...
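For the record, the stall can be measured directly from the two `_process` timestamps in osd.2's log (a quick sketch):

```python
from datetime import datetime

# Measure the op_wq stall bracketed by the two _process lines above.
fmt = "%Y-%m-%d %H:%M:%S.%f"
queued = datetime.strptime("2018-02-13 21:03:11.861298", fmt)
resumed = datetime.strptime("2018-02-13 21:03:37.714696", fmt)
stall = (resumed - queued).total_seconds()
print(f"op_wq stalled for {stall:.1f} s")  # 25.9 s, well past the 15 s grace
```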
Updated by Sage Weil about 6 years ago
- Subject changed from "[WRN] Health check failed: 1 osds down (OSD_DOWN)" in knfs-luminous-testing-basic-smithi to OSD_DOWN during ffsb on knfs
Hmm, ffsb has caused transient failures in the past (in the kcephfs suite, IIRC), just because it issues tons of small IOs? Not really sure. We should run it with collectl or pcp or similar to see what memory etc. looks like during this time, to understand why it is such a stressful test/workload. I don't think this is a new thing, though!
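In lieu of a full collectl/pcp setup, even a tiny sampler like this (just a sketch; the field names are standard Linux /proc/meminfo keys, and the 5 s interval is arbitrary) would show whether the node was dipping into swap while ffsb runs:

```python
import re
import time

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value_in_kB}."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"(\w+):\s+(\d+)", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields

def sample(interval=5.0):
    """Print MemAvailable/SwapFree every `interval` seconds (Linux only)."""
    while True:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        print(time.strftime("%F %T"),
              "MemAvailable kB:", info.get("MemAvailable"),
              "SwapFree kB:", info.get("SwapFree"))
        time.sleep(interval)
```

Run alongside the workunit, a steadily shrinking SwapFree around 21:03 would support the swapping theory.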
Updated by Josh Durgin about 4 years ago
- Status changed from New to Can't reproduce