Support #39319 (open)
Every 15 min - Monitor daemon marked osd.x down, but it is still running
Description
1. Installed Ceph (version 13.2.5 mimic (stable)) on 4 nodes (CentOS 7, test environment on VMware ESXi 5.5).
firewalld is stopped and disabled on all nodes (a quick check is sketched after the node list). All nodes run on the same ESXi server in one network port group, and the network itself shows no problems.
- nodeadm (for deploy)
- node1 (osd, mon, mgr)
- node2 (osd, mon, mgr)
- node3 (osd, mon, mgr)
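For reference, the firewall state can be confirmed on each node with the standard systemd/firewalld commands (expected output after the arrow):
#systemctl is-active firewalld -> inactive
#systemctl is-enabled firewalld -> disabled
#firewall-cmd --state -> not running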
2. Every 15 minutes the monitor marks an osd.x down even though it is still running; this happens for OSDs on all nodes.
On node.x:
#ceph -s
  cluster:
    id:     943c0ac1-f168-4129-984b-84cb65846a95
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum greend02-n01ceph02,greend02-n02ceph02,greend02-n03ceph02
    mgr: greend02-n03ceph02(active), standbys: greend02-n01ceph02, greend02-n02ceph02
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   3.2 GiB used, 147 GiB / 150 GiB avail
    pgs:
#ceph -w
....
2019-04-16 12:03:07.731048 osd.2 [WRN] Monitor daemon marked osd.2 down, but it is still running
2019-04-16 12:06:17.673535 mon.greend02-n01ceph02 [INF] osd.0 marked down after no beacon for 300.090443 seconds
2019-04-16 12:06:17.673797 mon.greend02-n01ceph02 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-04-16 12:06:17.673846 mon.greend02-n01ceph02 [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2019-04-16 12:06:18.729747 mon.greend02-n01ceph02 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-04-16 12:06:18.729812 mon.greend02-n01ceph02 [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2019-04-16 12:06:18.729845 mon.greend02-n01ceph02 [INF] Cluster is now healthy
2019-04-16 12:06:18.734399 mon.greend02-n01ceph02 [INF] osd.0 192.168.118.17:6800/5561 boot
2019-04-16 12:06:17.735692 osd.0 [WRN] Monitor daemon marked osd.0 down, but it is still running
2019-04-16 12:11:07.696520 mon.greend02-n01ceph02 [INF] osd.1 marked down after no beacon for 300.198038 seconds
2019-04-16 12:11:07.698011 mon.greend02-n01ceph02 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-04-16 12:11:07.698060 mon.greend02-n01ceph02 [WRN] Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2019-04-16 12:11:11.137107 mon.greend02-n01ceph02 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-04-16 12:11:11.137152 mon.greend02-n01ceph02 [INF] Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2019-04-16 12:11:11.137167 mon.greend02-n01ceph02 [INF] Cluster is now healthy
.....
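The "no beacon for 300 seconds" line means the mon is not receiving OSD beacons: each OSD sends a beacon every osd_beacon_report_interval seconds, and the mon marks it down after mon_osd_report_timeout seconds without one (mimic defaults are 300 s and 900 s respectively; the ~300 s in the log above suggests the timeout may have been lowered here, in which case a single missed beacon is enough to mark the OSD down). The effective values on the running daemons can be checked over the admin socket, e.g. (run on the node hosting the daemon):
#ceph daemon osd.1 config get osd_beacon_report_interval
#ceph daemon mon.greend02-n01ceph02 config get mon_osd_report_timeout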
- cat ceph-osd.1.log | grep says (similar output on all nodes; this is node2):
2019-04-16 11:10:49.626 7fa4cdaf2700 10 osd.1 3912 handle_osd_ping osd.2 192.168.118.19:6803/5271 says i am down in 3913
2019-04-16 11:10:49.626 7fa4cd2f1700 10 osd.1 3912 handle_osd_ping osd.0 192.168.118.17:6803/4005561 says i am down in 3913
2019-04-16 11:10:49.626 7fa4cdaf2700 10 osd.1 3912 handle_osd_ping osd.0 192.168.118.17:6802/4005561 says i am down in 3913
2019-04-16 11:10:49.626 7fa4cd2f1700 10 osd.1 3912 handle_osd_ping osd.2 192.168.118.19:6804/5271 says i am down in 3913
2019-04-16 11:25:53.834 7fa4ccaf0700 10 osd.1 3918 handle_osd_ping osd.0 192.168.118.17:6806/5005561 says i am down in 3919
2019-04-16 11:25:53.834 7fa4cd2f1700 10 osd.1 3918 handle_osd_ping osd.0 192.168.118.17:6805/5005561 says i am down in 3919
2019-04-16 11:25:53.835 7fa4ccaf0700 10 osd.1 3918 handle_osd_ping osd.2 192.168.118.19:6806/1005271 says i am down in 3919
2019-04-16 11:25:53.835 7fa4cdaf2700 10 osd.1 3918 handle_osd_ping osd.2 192.168.118.19:6807/1005271 says i am down in 3919
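These "says i am down" replies come from peer OSDs whose osdmap already records this OSD as down in the given epoch (3913, 3919 above). Whether a daemon has caught up to the cluster's current map can be checked, for example, with:
#ceph daemon osd.1 status -> reports state plus oldest_map/newest_map epochs
#ceph osd dump | head -1 -> epoch of the cluster's current osdmap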
#systemctl status ceph-osd@x -> Active: active (running), with a long uptime, on every node
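That the process is genuinely responsive (not merely reported as running by systemd) can be cross-checked through its admin socket, e.g.:
#ceph daemon osd.1 version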
- Why do node1, node2, and node3 log "... says i am down ..."?
- Why do node1, node2, and node3 log "... Monitor daemon marked osd.x down, but it is still running ..."?
- How can this problem be solved?
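One way to narrow this down (a debugging sketch, not a confirmed fix) is to raise the OSD and messenger debug levels and then check whether beacons actually leave the OSDs and arrive at the mon:
#ceph tell osd.* injectargs '--debug_osd 10 --debug_ms 1'
#grep beacon /var/log/ceph/ceph-osd.1.log
#grep osd_beacon /var/log/ceph/ceph-mon.greend02-n01ceph02.log
If beacons are sent but never show up at the mon, the problem is on the network/messenger path; if they are never sent at all, it points at the OSD's mon session.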