Bug #22995 (closed): OSD_DOWN during ffsb on knfs

Added by Yuri Weinstein about 6 years ago. Updated about 4 years ago.

Status: Can't reproduce
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: knfs
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is for the v12.2.3 release.

Run: http://pulpito.ceph.com/yuriw-2018-02-13_20:44:59-knfs-luminous-testing-basic-smithi/
Jobs: 2186577,2186578
Logs: http://qa-proxy.ceph.com/teuthology/yuriw-2018-02-13_20:44:59-knfs-luminous-testing-basic-smithi/2186577/teuthology.log

description: 'knfs/basic/{ceph/base.yaml clusters/extra-client.yaml mount/v4.yaml
    tasks/nfs_workunit_suites_ffsb.yaml}', duration: 1241.8487780094147, failure_reason: '"2018-02-13
    21:04:09.678426 mon.b mon.0 172.21.15.41:6789/0 117 : cluster [WRN] Health check
    failed: 1 osds down (OSD_DOWN)" in cluster log', flavor: basic, mon.a-kernel-sha1: 0b474c9ea79dfba5e2ff22ce96661f1ac21d04f3

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #21425: "cluster [WRN] Health check failed: noup flag(s) set (OSDMAP_FLAGS)" in rest-luminous-distro (Resolved, 09/18/2017)

Actions #1

Updated by Yuri Weinstein about 6 years ago

  • Related to Bug #21425: "cluster [WRN] Health check failed: noup flag(s) set (OSDMAP_FLAGS)" in rest-luminous-distro added
Actions #2

Updated by Yuri Weinstein about 6 years ago

Note that the jobs passed on OVH?!
http://pulpito.ceph.com/yuriw-2018-02-13_18:11:12-knfs-luminous-testing-basic-ovh/

The question is: what makes it different?

Actions #3

Updated by Jeff Layton about 6 years ago

So these tests just download ffsb and iozone and run them -- I guess to just generate some I/O with particular patterns? It's hard to tell exactly from the yaml files, but I'm assuming that this mounts a cephfs filesystem via kcephfs, reexports it via knfsd and then mounts it with NFSv4 to perform the tests?

One of the iozone tests failed with a hung kernel task that may or may not be related, but we couldn't get any actionable info from that.

The rest of the tests fail due to OSD_DOWN warnings in the logs. I think we need to:

a) determine which OSD had the problem, and...
b) why it went down -- maybe have a look at its logs?

One possibility: the kernel client is occasionally sending some packet-o-death to the OSD and killing it. Alternatively, maybe it's becoming so busy that it can't respond to health checks?
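
To narrow down (a) -- which OSD the complaints point at -- a quick tally of the "no reply" heartbeat messages in the teuthology log is enough. A minimal sketch (the regular expression is modeled on the log excerpts below, and the default filename is an assumption, not a teuthology API):

# Tally which OSD the missed-heartbeat complaints are about.
# Matches lines like:
#   ... osd.1 18 heartbeat_check: no reply from 172.21.15.82:6807 osd.2 since back ...
import re
import sys
from collections import Counter

NO_REPLY = re.compile(r'heartbeat_check: no reply from \S+ (osd\.\d+)')

def main(path):
    counts = Counter()
    with open(path) as f:
        for line in f:
            m = NO_REPLY.search(line)
            if m:
                counts[m.group(1)] += 1
    for osd, n in counts.most_common():
        print(f'{osd}: {n} missed-heartbeat complaints')

if __name__ == '__main__':
    main(sys.argv[1] if len(sys.argv) > 1 else 'teuthology.log')

In the excerpts quoted below, all of the complaints point at osd.2.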

In any case, let's have a look at one of the logs:

2018-02-10T18:30:43.683 INFO:teuthology.run:Summary data:                       
{client.0-kernel-sha1: 97c85e4f0e4ba581b753e7fda180fc80169e73c3, client.1-kernel-sha1: 97c85e4f0e4ba581b753e7fda180fc80169e73c3,
  description: 'knfs/basic/{ceph/base.yaml clusters/extra-client.yaml mount/v4.yaml
    tasks/nfs_workunit_suites_ffsb.yaml}', duration: 1188.9396090507507, failure_reason: '"2018-02-10
    18:20:14.639320 mon.b mon.0 172.21.15.2:6789/0 120 : cluster [WRN] Health check
    failed: 1 osds down (OSD_DOWN)" in cluster log', flavor: basic, mon.a-kernel-sha1: 97c85e4f0e4ba581b753e7fda180fc80169e73c3,
  mon.b-kernel-sha1: 97c85e4f0e4ba581b753e7fda180fc80169e73c3, owner: scheduled_yuriw@teuthology,
  success: false}                                                               

...around that time, in the same log:

2018-02-10T18:20:13.641 INFO:tasks.ceph.osd.1.smithi082.stderr:2018-02-10 18:20:13.643214 7f6b7b8dc700 -1 osd.1 18 heartbeat_check: no reply from 172.21.15.82:6807 osd.2 since back 2018-02-10 18:19:53.127583 front 2018-02-10 18:19:53.127583 (cutoff 2018-02-10 18:19:53.643183)
2018-02-10T18:20:13.667 INFO:tasks.ceph.osd.3.smithi002.stderr:2018-02-10 18:20:13.673219 7f7d29751700 -1 osd.3 18 heartbeat_check: no reply from 172.21.15.82:6807 osd.2 since back 2018-02-10 18:19:53.162748 front 2018-02-10 18:19:53.162748 (cutoff 2018-02-10 18:19:53.673177)
2018-02-10T18:20:14.640 INFO:tasks.ceph.osd.1.smithi082.stderr:2018-02-10 18:20:14.643442 7f6b7b8dc700 -1 osd.1 18 heartbeat_check: no reply from 172.21.15.82:6807 osd.2 since back 2018-02-10 18:19:53.127583 front 2018-02-10 18:19:53.127583 (cutoff 2018-02-10 18:19:54.643387)
2018-02-10T18:21:17.665 INFO:tasks.ceph.osd.0.smithi082.stderr:2018-02-10 18:21:17.669793 7fd861832700 -1 osd.0 24 heartbeat_check: no reply from 172.21.15.82:6814 osd.2 since back 2018-02-10 18:20:57.143469 front 2018-02-10 18:20:57.143469 (cutoff 2018-02-10 18:20:57.669783)
2018-02-10T18:21:18.666 INFO:tasks.ceph.osd.0.smithi082.stderr:2018-02-10 18:21:18.670028 7fd861832700 -1 osd.0 24 heartbeat_check: no reply from 172.21.15.82:6814 osd.2 since back 2018-02-10 18:20:57.143469 front 2018-02-10 18:20:57.143469 (cutoff 2018-02-10 18:20:58.669991)
2018-02-10T18:21:19.665 INFO:tasks.ceph.osd.0.smithi082.stderr:2018-02-10 18:21:19.670298 7fd861832700 -1 osd.0 24 heartbeat_check: no reply from 172.21.15.82:6814 osd.2 since back 2018-02-10 18:20:57.143469 front 2018-02-10 18:20:57.143469 (cutoff 2018-02-10 18:20:59.670261)
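
A note on reading those lines: a peer logs "no reply" when the newest heartbeat reply it has from osd.2 (the "back"/"front" timestamps) is older than the cutoff, and the cutoff is simply the current time minus the heartbeat grace (20 seconds by default, which matches the quoted timestamps). A minimal check of the arithmetic, using values copied from the first line above:

# Cutoff arithmetic for the heartbeat_check lines above.
# Assumes the default osd_heartbeat_grace of 20 seconds.
from datetime import datetime

fmt = '%Y-%m-%d %H:%M:%S.%f'
now     = datetime.strptime('2018-02-10 18:20:13.643214', fmt)
cutoff  = datetime.strptime('2018-02-10 18:19:53.643183', fmt)
last_rx = datetime.strptime('2018-02-10 18:19:53.127583', fmt)

print('grace window:', (now - cutoff).total_seconds(), 's')          # ~20.0
print('reply is', (cutoff - last_rx).total_seconds(), 's too old')   # last_rx < cutoff
# osd.1 complains because osd.2's last reply predates the cutoff,
# i.e. there was no reply within the grace period.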

Those messages seem like they might be related. So, groveling around for the osd logs and looking at that time shows this:

2018-02-10 18:20:14.604683 7f84ad203700 20 osd.2 op_wq(4) _process empty q, waiting
2018-02-10 18:20:14.624586 7f84cb3a9700 10 osd.2 18 tick
2018-02-10 18:20:14.624604 7f84cb3a9700 10 osd.2 18 do_waiters -- start
2018-02-10 18:20:14.624608 7f84cb3a9700 10 osd.2 18 do_waiters -- finish
2018-02-10 18:20:14.642517 7f84ceb99700  1 -- 172.21.15.82:6804/22807 >> 172.21.15.2:6813/2043293334 conn(0x55cb7fdd5000 :6804 s=STATE_OPEN pgs=10 cs=1 l=1).read_bulk peer close file descriptor 62
2018-02-10 18:20:14.642531 7f84ceb99700  1 -- 172.21.15.82:6804/22807 >> 172.21.15.2:6813/2043293334 conn(0x55cb7fdd5000 :6804 s=STATE_OPEN pgs=10 cs=1 l=1).read_until read failed
2018-02-10 18:20:14.642535 7f84ceb99700  1 -- 172.21.15.82:6804/22807 >> 172.21.15.2:6813/2043293334 conn(0x55cb7fdd5000 :6804 s=STATE_OPEN pgs=10 cs=1 l=1).process read tag failed
2018-02-10 18:20:14.642538 7f84ceb99700  1 -- 172.21.15.82:6804/22807 >> 172.21.15.2:6813/2043293334 conn(0x55cb7fdd5000 :6804 s=STATE_OPEN pgs=10 cs=1 l=1).fault on lossy channel, failing
2018-02-10 18:20:14.642639 7f84bca22700  2 osd.2 18 ms_handle_reset con 0x55cb7fdd5000 session 0x55cb80724a00
2018-02-10 18:20:14.644804 7f84ce398700  1 -- 172.21.15.82:6807/22807 >> 172.21.15.2:0/26226 conn(0x55cb7fd96800 :6807 s=STATE_OPEN pgs=3 cs=1 l=1).read_bulk peer close file descriptor 82
2018-02-10 18:20:14.644817 7f84ce398700  1 -- 172.21.15.82:6807/22807 >> 172.21.15.2:0/26226 conn(0x55cb7fd96800 :6807 s=STATE_OPEN pgs=3 cs=1 l=1).read_until read failed
2018-02-10 18:20:14.644821 7f84ce398700  1 -- 172.21.15.82:6807/22807 >> 172.21.15.2:0/26226 conn(0x55cb7fd96800 :6807 s=STATE_OPEN pgs=3 cs=1 l=1).process read tag failed
2018-02-10 18:20:14.644824 7f84ce398700  1 -- 172.21.15.82:6807/22807 >> 172.21.15.2:0/26226 conn(0x55cb7fd96800 :6807 s=STATE_OPEN pgs=3 cs=1 l=1).fault on lossy channel, failing
2018-02-10 18:20:14.645035 7f84cf39a700  1 -- 172.21.15.82:6805/22807 >> 172.21.15.2:6805/26226 conn(0x55cb7fdc1800 :-1 s=STATE_OPEN pgs=6 cs=1 l=0).read_bulk peer close file descriptor 92
2018-02-10 18:20:14.645056 7f84cf39a700  1 -- 172.21.15.82:6805/22807 >> 172.21.15.2:6805/26226 conn(0x55cb7fdc1800 :-1 s=STATE_OPEN pgs=6 cs=1 l=0).read_until read failed
2018-02-10 18:20:14.645063 7f84cf39a700  1 -- 172.21.15.82:6805/22807 >> 172.21.15.2:6805/26226 conn(0x55cb7fdc1800 :-1 s=STATE_OPEN pgs=6 cs=1 l=0).process read tag failed
2018-02-10 18:20:14.645308 7f84cf39a700  1 -- 172.21.15.82:6805/22807 >> 172.21.15.82:6801/22806 conn(0x55cb7fdb2000 :-1 s=STATE_OPEN pgs=8 cs=1 l=0).read_bulk peer close file descriptor 86
2018-02-10 18:20:14.645327 7f84cf39a700  1 -- 172.21.15.82:6805/22807 >> 172.21.15.82:6801/22806 conn(0x55cb7fdb2000 :-1 s=STATE_OPEN pgs=8 cs=1 l=0).read_until read failed
2018-02-10 18:20:14.645334 7f84cf39a700  1 -- 172.21.15.82:6805/22807 >> 172.21.15.82:6801/22806 conn(0x55cb7fdb2000 :-1 s=STATE_OPEN pgs=8 cs=1 l=0).process read tag failed
2018-02-10 18:20:14.645425 7f84cf39a700  1 -- 172.21.15.82:6807/22807 >> 172.21.15.82:0/22806 conn(0x55cb7fd92000 :6807 s=STATE_OPEN pgs=2 cs=1 l=1).read_bulk peer close file descriptor 72
2018-02-10 18:20:14.645433 7f84cf39a700  1 -- 172.21.15.82:6807/22807 >> 172.21.15.82:0/22806 conn(0x55cb7fd92000 :6807 s=STATE_OPEN pgs=2 cs=1 l=1).read_until read failed
2018-02-10 18:20:14.645439 7f84cf39a700  1 -- 172.21.15.82:6807/22807 >> 172.21.15.82:0/22806 conn(0x55cb7fd92000 :6807 s=STATE_OPEN pgs=2 cs=1 l=1).process read tag failed
2018-02-10 18:20:14.645444 7f84cf39a700  1 -- 172.21.15.82:6807/22807 >> 172.21.15.82:0/22806 conn(0x55cb7fd92000 :6807 s=STATE_OPEN pgs=2 cs=1 l=1).fault on lossy channel, failing
2018-02-10 18:20:14.645925 7f84ce398700  1 -- 172.21.15.82:6806/22807 >> 172.21.15.82:0/22806 conn(0x55cb7fd85800 :6806 s=STATE_OPEN pgs=2 cs=1 l=1).read_bulk peer close file descriptor 69
2018-02-10 18:20:14.645933 7f84ce398700  1 -- 172.21.15.82:6806/22807 >> 172.21.15.82:0/22806 conn(0x55cb7fd85800 :6806 s=STATE_OPEN pgs=2 cs=1 l=1).read_until read failed
2018-02-10 18:20:14.645936 7f84ce398700  1 -- 172.21.15.82:6806/22807 >> 172.21.15.82:0/22806 conn(0x55cb7fd85800 :6806 s=STATE_OPEN pgs=2 cs=1 l=1).process read tag failed
2018-02-10 18:20:14.645941 7f84ce398700  1 -- 172.21.15.82:6806/22807 >> 172.21.15.82:0/22806 conn(0x55cb7fd85800 :6806 s=STATE_OPEN pgs=2 cs=1 l=1).fault on lossy channel, failing
2018-02-10 18:20:14.646709 7f84cf39a700  1 -- 172.21.15.82:6806/22807 >> 172.21.15.2:0/26226 conn(0x55cb7fd95000 :6806 s=STATE_OPEN pgs=2 cs=1 l=1).read_bulk peer close file descriptor 81
2018-02-10 18:20:14.646721 7f84cf39a700  1 -- 172.21.15.82:6806/22807 >> 172.21.15.2:0/26226 conn(0x55cb7fd95000 :6806 s=STATE_OPEN pgs=2 cs=1 l=1).read_until read failed
2018-02-10 18:20:14.646727 7f84cf39a700  1 -- 172.21.15.82:6806/22807 >> 172.21.15.2:0/26226 conn(0x55cb7fd95000 :6806 s=STATE_OPEN pgs=2 cs=1 l=1).process read tag failed
2018-02-10 18:20:14.646733 7f84cf39a700  1 -- 172.21.15.82:6806/22807 >> 172.21.15.2:0/26226 conn(0x55cb7fd95000 :6806 s=STATE_OPEN pgs=2 cs=1 l=1).fault on lossy channel, failing
2018-02-10 18:20:14.649065 7f84cf39a700  1 -- 172.21.15.82:6805/22807 >> 172.21.15.82:6809/22805 conn(0x55cb7fdc0000 :-1 s=STATE_OPEN pgs=7 cs=1 l=0).read_bulk peer close file descriptor 87
2018-02-10 18:20:14.649078 7f84cf39a700  1 -- 172.21.15.82:6805/22807 >> 172.21.15.82:6809/22805 conn(0x55cb7fdc0000 :-1 s=STATE_OPEN pgs=7 cs=1 l=0).read_until read failed
2018-02-10 18:20:14.649082 7f84cf39a700  1 -- 172.21.15.82:6805/22807 >> 172.21.15.82:6809/22805 conn(0x55cb7fdc0000 :-1 s=STATE_OPEN pgs=7 cs=1 l=0).process read tag failed
2018-02-10 18:20:14.649142 7f84cf39a700  1 -- 172.21.15.82:6806/22807 >> 172.21.15.82:0/22805 conn(0x55cb7fd87000 :6806 s=STATE_OPEN pgs=3 cs=1 l=1).read_bulk peer close file descriptor 70
2018-02-10 18:20:14.649147 7f84cf39a700  1 -- 172.21.15.82:6806/22807 >> 172.21.15.82:0/22805 conn(0x55cb7fd87000 :6806 s=STATE_OPEN pgs=3 cs=1 l=1).read_until read failed
2018-02-10 18:20:14.649150 7f84cf39a700  1 -- 172.21.15.82:6806/22807 >> 172.21.15.82:0/22805 conn(0x55cb7fd87000 :6806 s=STATE_OPEN pgs=3 cs=1 l=1).process read tag failed
2018-02-10 18:20:14.649153 7f84cf39a700  1 -- 172.21.15.82:6806/22807 >> 172.21.15.82:0/22805 conn(0x55cb7fd87000 :6806 s=STATE_OPEN pgs=3 cs=1 l=1).fault on lossy channel, failing
2018-02-10 18:20:14.649812 7f84ceb99700  1 -- 172.21.15.82:6807/22807 >> 172.21.15.82:0/22805 conn(0x55cb7fd93800 :6807 s=STATE_OPEN pgs=1 cs=1 l=1).read_bulk peer close file descriptor 74
2018-02-10 18:20:14.649824 7f84ceb99700  1 -- 172.21.15.82:6807/22807 >> 172.21.15.82:0/22805 conn(0x55cb7fd93800 :6807 s=STATE_OPEN pgs=1 cs=1 l=1).read_until read failed
2018-02-10 18:20:14.649827 7f84ceb99700  1 -- 172.21.15.82:6807/22807 >> 172.21.15.82:0/22805 conn(0x55cb7fd93800 :6807 s=STATE_OPEN pgs=1 cs=1 l=1).process read tag failed
2018-02-10 18:20:14.649832 7f84ceb99700  1 -- 172.21.15.82:6807/22807 >> 172.21.15.82:0/22805 conn(0x55cb7fd93800 :6807 s=STATE_OPEN pgs=1 cs=1 l=1).fault on lossy channel, failing
2018-02-10 18:20:14.652830 7f84caba8700 10 osd.2 18 tick_without_osd_lock
2018-02-10 18:20:14.652839 7f84caba8700 25 osd.2 18 heartbeat_check osd.0 first_tx 2018-02-10 18:16:36.693008 last_tx 2018-02-10 18:20:12.438136 last_rx_back 2018-02-10 18:20:12.438136 last_rx_front 2018-02-10 18:20:12.438136
2018-02-10 18:20:14.652847 7f84caba8700 25 osd.2 18 heartbeat_check osd.1 first_tx 2018-02-10 18:16:36.693008 last_tx 2018-02-10 18:20:12.438136 last_rx_back 2018-02-10 18:20:12.438136 last_rx_front 2018-02-10 18:20:12.438136
2018-02-10 18:20:14.652853 7f84caba8700 25 osd.2 18 heartbeat_check osd.3 first_tx 2018-02-10 18:16:36.693008 last_tx 2018-02-10 18:20:12.438136 last_rx_back 2018-02-10 18:20:12.438136 last_rx_front 2018-02-10 18:20:12.438136
2018-02-10 18:20:14.652858 7f84caba8700 25 osd.2 18 heartbeat_check osd.4 first_tx 2018-02-10 18:16:36.693008 last_tx 2018-02-10 18:20:12.438136 last_rx_back 2018-02-10 18:20:12.438136 last_rx_front 2018-02-10 18:20:12.438136
2018-02-10 18:20:14.652863 7f84caba8700 25 osd.2 18 heartbeat_check osd.5 first_tx 2018-02-10 18:16:36.693008 last_tx 2018-02-10 18:20:12.438136 last_rx_back 2018-02-10 18:20:12.438136 last_rx_front 2018-02-10 18:20:12.438136
2018-02-10 18:20:14.652901 7f84caba8700 20 osd.2 18 can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
2018-02-10 18:20:14.652905 7f84caba8700 20 osd.2 18 scrub_time_permit should run between 0 - 24 now 18 = yes
2018-02-10 18:20:14.652947 7f84caba8700 20 osd.2 18 scrub_load_below_threshold loadavg 1.43 < max 5 = yes
2018-02-10 18:20:14.652960 7f84caba8700 20 osd.2 18 sched_scrub load_is_low=1
2018-02-10 18:20:14.652963 7f84caba8700 10 osd.2 18 sched_scrub 2.0 scheduled at 2018-02-11 18:24:07.690467 > 2018-02-10 18:20:14.652904
2018-02-10 18:20:14.652967 7f84caba8700 20 osd.2 18 sched_scrub done
2018-02-10 18:20:14.652969 7f84caba8700 10 osd.2 18 promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0  bytes; target 25 obj/sec or 5120 k bytes/sec
2018-02-10 18:20:14.652971 7f84caba8700 20 osd.2 18 promote_throttle_recalibrate  new_prob 1000
2018-02-10 18:20:14.652972 7f84caba8700 10 osd.2 18 promote_throttle_recalibrate  actual 0, actual/prob ratio 1, adjusted new_prob 1000, prob 1000 -> 1000
2018-02-10 18:20:15.466756 7f84ce398700  1 -- 172.21.15.82:6805/22807 <== osd.4 172.21.15.2:6802/26227 248 ==== osd_repop(client.4253.0:1771 3.1 e18/16) v2 ==== 994+0+4195019 (2915700608 0 

Looks like a bunch of socket fds got closed around that time? (I haven't dug around in the OSD code to be sure, but that's what the logs seem to suggest).

I think we should get someone better versed in the OSD code to have a look.

Actions #4

Updated by Jeff Layton about 6 years ago

Around the time of the other failed heartbeat check, I see this:

2018-02-10 18:21:16.971907 7f84ceb99700  1 -- 172.21.15.82:6813/1022807 <== osd.3 172.21.15.2:0/26226 10 ==== osd_ping(ping e24 stamp 2018-02-10 18:21:16.974447) v4 ==== 2004+0+0 (766020934 0 0) 0x55cb80cc5000 con 0x55cb817f2000
2018-02-10 18:21:16.971924 7f84ceb99700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:16.971927 7f84ceb99700 10 osd.2 24 internal heartbeat not healthy, dropping ping request
2018-02-10 18:21:16.971985 7f84cf39a700  1 -- 172.21.15.82:6814/1022807 <== osd.3 172.21.15.2:0/26226 10 ==== osd_ping(ping e24 stamp 2018-02-10 18:21:16.974447) v4 ==== 2004+0+0 (766020934 0 0) 0x55cb886c1c00 con 0x55cb817f3800
2018-02-10 18:21:16.972003 7f84cf39a700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:16.972004 7f84cf39a700 10 osd.2 24 internal heartbeat not healthy, dropping ping request
2018-02-10 18:21:17.573827 7f84cd3f9700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:17.666126 7f84caba8700 10 osd.2 24 tick_without_osd_lock
2018-02-10 18:21:17.666133 7f84caba8700 25 osd.2 24 heartbeat_check osd.0 first_tx 2018-02-10 18:20:33.658446 last_tx 2018-02-10 18:21:16.668142 last_rx_back 2018-02-10 18:21:16.668142 last_rx_front 2018-02-10 18:21:16.668142
2018-02-10 18:21:17.666147 7f84caba8700 25 osd.2 24 heartbeat_check osd.1 first_tx 2018-02-10 18:20:33.658446 last_tx 2018-02-10 18:21:16.668142 last_rx_back 2018-02-10 18:21:16.668142 last_rx_front 2018-02-10 18:21:16.668142
2018-02-10 18:21:17.666150 7f84caba8700 25 osd.2 24 heartbeat_check osd.3 first_tx 2018-02-10 18:20:33.658446 last_tx 2018-02-10 18:21:16.668142 last_rx_back 2018-02-10 18:21:16.668142 last_rx_front 2018-02-10 18:21:16.668142
2018-02-10 18:21:17.666153 7f84caba8700 25 osd.2 24 heartbeat_check osd.4 first_tx 2018-02-10 18:20:33.658446 last_tx 2018-02-10 18:21:16.668142 last_rx_back 2018-02-10 18:21:16.668142 last_rx_front 2018-02-10 18:21:16.668142
2018-02-10 18:21:17.666155 7f84caba8700 25 osd.2 24 heartbeat_check osd.5 first_tx 2018-02-10 18:20:33.658446 last_tx 2018-02-10 18:21:16.668142 last_rx_back 2018-02-10 18:21:16.668142 last_rx_front 2018-02-10 18:21:16.668142
2018-02-10 18:21:17.666159 7f84caba8700 20 osd.2 24 scrub_random_backoff lost coin flip, randomly backing off
2018-02-10 18:21:17.666160 7f84caba8700 10 osd.2 24 promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0  bytes; target 25 obj/sec or 5120 k bytes/sec
2018-02-10 18:21:17.666162 7f84caba8700 20 osd.2 24 promote_throttle_recalibrate  new_prob 1000
2018-02-10 18:21:17.666162 7f84caba8700 10 osd.2 24 promote_throttle_recalibrate  actual 0, actual/prob ratio 1, adjusted new_prob 1000, prob 1000 -> 1000
2018-02-10 18:21:17.691450 7f84cb3a9700 10 osd.2 24 tick
2018-02-10 18:21:17.691459 7f84cb3a9700 10 osd.2 24 do_waiters -- start
2018-02-10 18:21:17.691460 7f84cb3a9700 10 osd.2 24 do_waiters -- finish
2018-02-10 18:21:17.968136 7f84ceb99700  1 -- 172.21.15.82:6814/1022807 <== osd.4 172.21.15.2:0/26227 15 ==== osd_ping(ping e24 stamp 2018-02-10 18:21:17.970646) v4 ==== 2004+0+0 (241253251 0 0) 0x55cb80cc5000 con 0x55cb7fda4000
2018-02-10 18:21:17.968151 7f84ceb99700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:17.968153 7f84ceb99700 10 osd.2 24 internal heartbeat not healthy, dropping ping request
2018-02-10 18:21:17.968143 7f84cf39a700  1 -- 172.21.15.82:6813/1022807 <== osd.4 172.21.15.2:0/26227 15 ==== osd_ping(ping e24 stamp 2018-02-10 18:21:17.970646) v4 ==== 2004+0+0 (241253251 0 0) 0x55cb886c1c00 con 0x55cb7fd95000
2018-02-10 18:21:17.968248 7f84cf39a700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:17.968267 7f84cf39a700 10 osd.2 24 internal heartbeat not healthy, dropping ping request
2018-02-10 18:21:18.162479 7f84ce398700  1 -- 172.21.15.82:6813/1022807 <== osd.5 172.21.15.2:0/26228 13 ==== osd_ping(ping e24 stamp 2018-02-10 18:21:18.164826) v4 ==== 2004+0+0 (658040782 0 0) 0x55cc143c7c00 con 0x55cb7fdc3000
2018-02-10 18:21:18.162498 7f84ce398700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:18.162500 7f84ce398700 10 osd.2 24 internal heartbeat not healthy, dropping ping request
2018-02-10 18:21:18.162756 7f84cf39a700  1 -- 172.21.15.82:6814/1022807 <== osd.5 172.21.15.2:0/26228 13 ==== osd_ping(ping e24 stamp 2018-02-10 18:21:18.164826) v4 ==== 2004+0+0 (658040782 0 0) 0x55cb886c1c00 con 0x55cb7fdd6800
2018-02-10 18:21:18.162776 7f84cf39a700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:18.162777 7f84cf39a700 10 osd.2 24 internal heartbeat not healthy, dropping ping request
2018-02-10 18:21:18.246832 7f84ce398700  1 -- 172.21.15.82:6813/1022807 <== osd.0 172.21.15.82:0/22805 11 ==== osd_ping(ping e24 stamp 2018-02-10 18:21:18.246652) v4 ==== 2004+0+0 (1777999654 0 0) 0x55cc143c7c00 con 0x55cb817f5000
2018-02-10 18:21:18.246849 7f84ce398700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:18.246850 7f84ce398700 10 osd.2 24 internal heartbeat not healthy, dropping ping request
2018-02-10 18:21:18.246879 7f84ceb99700  1 -- 172.21.15.82:6814/1022807 <== osd.0 172.21.15.82:0/22805 11 ==== osd_ping(ping e24 stamp 2018-02-10 18:21:18.246652) v4 ==== 2004+0+0 (1777999654 0 0) 0x55cb80cc5000 con 0x55cb817f6800
2018-02-10 18:21:18.246887 7f84ceb99700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f84b320f700' had timed out after 15
2018-02-10 18:21:18.246887 7f84ceb99700 10 osd.2 24 internal heartbeat not healthy, dropping ping request

Which again leads me to believe that there is something wrong on this OSD.

Actions #5

Updated by Kefu Chai about 6 years ago

In osd.2's log:

2018-02-13 21:03:22.103436 7f4b393dd700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f4b1c250700' had timed out after 15
2018-02-13 21:03:22.103438 7f4b393dd700 10 osd.2 18 internal heartbeat not healthy, dropping ping request
2018-02-13 21:03:22.103462 7f4b38bdc700  1 -- 172.21.15.200:6807/23267 <== osd.0 172.21.15.200:0/23265 49 ==== osd_ping(ping e18 stamp 2018-02-13 21:03:22.103304) v4 ==== 2004+0+0 (1933461075 0 0) 0x5590ec15d200 con 0x5590eb536000
2018-02-13 21:03:22.103558 7f4b38bdc700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f4b1c250700' had timed out after 15
2018-02-13 21:03:22.103572 7f4b38bdc700 10 osd.2 18 internal heartbeat not healthy, dropping ping request

That's why osd.2 was considered down.
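
Putting the pieces together: the op worker thread (OSD::osd_op_tp) stopped making progress for longer than its 15-second timeout, so the OSD's internal watchdog reported it unhealthy and the OSD deliberately dropped incoming pings; once the peers' grace period expired, the monitor marked it down. A conceptual sketch of that watchdog pattern (not the actual Ceph HeartbeatMap code):

# Conceptual sketch of the internal-heartbeat ("heartbeat_map") pattern
# visible in the log above -- not the actual Ceph HeartbeatMap implementation.
import threading
import time

class Watchdog:
    def __init__(self, timeout):
        self.timeout = timeout                      # e.g. 15 s for OSD::osd_op_tp
        self.deadline = time.monotonic() + timeout
        self.lock = threading.Lock()

    def reset(self):
        # Called by the worker thread each time it makes progress.
        with self.lock:
            self.deadline = time.monotonic() + self.timeout

    def is_healthy(self):
        # Called by the ping-handling threads before answering a ping.
        with self.lock:
            return time.monotonic() < self.deadline

wd = Watchdog(timeout=15)

def handle_ping(ping):
    if not wd.is_healthy():
        # Mirrors "internal heartbeat not healthy, dropping ping request":
        # the peer gets no reply and eventually reports this OSD down.
        print('dropping ping', ping)
        return
    print('ack', ping)

handle_ping('osd_ping e18')   # answered while healthy; dropped once the worker stalls >15 s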

Actions #6

Updated by Jeff Layton about 6 years ago

Yuri Weinstein wrote:

Note that the jobs passed on OVH?!
http://pulpito.ceph.com/yuriw-2018-02-13_18:11:12-knfs-luminous-testing-basic-ovh/

The question is: what makes it different?

OVH is the openstack provider? I assume that these setups were using the same kernel on the osds?

It may simply be the difference in hardware.

This test is evidently driving I/O via knfsd+kcephfs, and it's possible that that generates more load than we typically do via libcephfs-based tests. One would think we'd see similar problems with krbd tests too, though, if this were something load-related.

Actions #7

Updated by Sage Weil about 6 years ago

The op_wq thread went to sleep for >30 seconds:

2018-02-13 21:03:11.861298 7f4b1924a700 20 osd.2 op_wq(0) _process 3.0 item PGQueueable(0x5590ec0c6700 prio 127 cost 4195019 e18) queued
...
2018-02-13 21:03:37.714696 7f4b1924a700 20 osd.2 op_wq(0) _process 3.0 item PGQueueable(0x5590ed816bc0 prio 127 cost 4195019 e18) pg 0x5590ebf2e000

on osd.2. The internal heartbeat thus failed and the osd didn't reply to pings.

I don't see any way for the sharded wq code to block without printing something else, which makes me think the osd was swapping...
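
One way to test the swapping theory on the OSD host while the workload runs is to watch the swap-in/swap-out counters in /proc/vmstat. A rough sketch (a hypothetical helper, not part of teuthology):

# Sample pswpin/pswpout from /proc/vmstat and print the deltas every 5 s.
import time

def swap_counters():
    counters = {}
    with open('/proc/vmstat') as f:
        for line in f:
            key, value = line.split()
            if key in ('pswpin', 'pswpout'):
                counters[key] = int(value)
    return counters

prev = swap_counters()
while True:
    time.sleep(5)
    cur = swap_counters()
    print('pages swapped in: %d  out: %d' %
          (cur['pswpin'] - prev['pswpin'], cur['pswpout'] - prev['pswpout']))
    prev = cur

Non-zero deltas while the op_wq thread is stalled would back up the swapping theory; all zeroes would point elsewhere.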

Actions #8

Updated by Sage Weil about 6 years ago

  • Subject changed from "[WRN] Health check failed: 1 osds down (OSD_DOWN)" in knfs-luminous-testing-basic-smithi to OSD_DOWN during ffsb on knfs

Hmm, ffsb has caused transient failures in the past (in the kcephfs suite, iirc), just because... it issues tons of small IOs? Not really sure. We should run it with collectl or pcp or similar to see what the memory etc. looks like during this time, to understand why it is such a stressful test/workload. I don't think this is a new thing, though!
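
If collectl or pcp isn't wired into the job, a lightweight stand-in that logs memory and load alongside the ffsb run would already help correlate the stalls. A sketch only, not a teuthology task:

# Periodically log load average and a few /proc/meminfo fields while the test runs.
import time

def meminfo(fields=('MemAvailable', 'SwapFree', 'Dirty')):
    out = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key = line.split(':')[0]
            if key in fields:
                out[key] = line.split()[1] + ' kB'
    return out

while True:
    with open('/proc/loadavg') as f:
        load = f.read().split()[0]
    print(time.strftime('%H:%M:%S'), 'load', load, meminfo())
    time.sleep(10)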

Actions #9

Updated by Josh Durgin about 4 years ago

  • Status changed from New to Can't reproduce