Support #21589


ceph 12.2.0 health check failed osd down

Added by zheng liu over 6 years ago. Updated over 6 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

Thanks, all!
I have a question about Ceph Luminous (rc) 12.2.0. I am confident there is no problem with my Ceph cluster network.
My Ceph cluster has 4 nodes with 8 OSDs in total. Even with no client I/O at all, OSDs keep getting marked down. During that time, each server's CPU, memory, I/O, and IOPS are essentially idle.

I need your help. Thanks again!

The following shows the Ceph version, the cluster log, the OSD tree, and ceph.conf.

Attached are the log file, the config file, and the ping test result files.

[root@storage3 ceph]#
[root@storage3 ceph]# ceph -v
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
[root@storage3 ceph]#
[root@storage3 ceph]# tail -n50 /var/log/ceph/ceph.log
2017-09-28 14:47:59.152723 mon.compute1 mon.0 10.200.246.20:6789/0 1509 : cluster [INF] osd.1 marked down after no beacon for 300.886271 seconds
2017-09-28 14:47:59.154945 mon.compute1 mon.0 10.200.246.20:6789/0 1510 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2017-09-28 14:48:00.638308 mon.compute1 mon.0 10.200.246.20:6789/0 1519 : cluster [WRN] Health check failed: Reduced data availability: 4 pgs inactive, 12 pgs peering (PG_AVAILABILITY)
2017-09-28 14:48:00.638379 mon.compute1 mon.0 10.200.246.20:6789/0 1520 : cluster [WRN] Health check failed: Degraded data redundancy: 12 pgs unclean (PG_DEGRADED)
2017-09-28 14:48:02.611313 mon.compute1 mon.0 10.200.246.20:6789/0 1525 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2017-09-28 14:48:02.621193 mon.compute1 mon.0 10.200.246.20:6789/0 1526 : cluster [WRN] Health check update: Degraded data redundancy: 7/96 objects degraded (7.292%), 85 pgs unclean, 73 pgs degraded (PG_DEGRADED)
2017-09-28 14:48:02.673049 mon.compute1 mon.0 10.200.246.20:6789/0 1527 : cluster [INF] osd.1 10.200.246.20:6804/1975 boot
2017-09-28 14:48:06.148432 mon.compute1 mon.0 10.200.246.20:6789/0 1536 : cluster [WRN] Health check update: Reduced data availability: 4 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
2017-09-28 14:48:06.148522 mon.compute1 mon.0 10.200.246.20:6789/0 1537 : cluster [WRN] Health check update: Degraded data redundancy: 7/96 objects degraded (7.292%), 79 pgs unclean, 73 pgs degraded (PG_DEGRADED)
2017-09-28 14:48:08.150154 mon.compute1 mon.0 10.200.246.20:6789/0 1538 : cluster [WRN] Health check update: Degraded data redundancy: 2 pgs unclean (PG_DEGRADED)
2017-09-28 14:48:08.150216 mon.compute1 mon.0 10.200.246.20:6789/0 1539 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 4 pgs inactive, 4 pgs peering)
2017-09-28 14:48:01.259717 osd.1 osd.1 10.200.246.20:6804/1975 15 : cluster [WRN] Monitor daemon marked osd.1 down, but it is still running
2017-09-28 14:48:11.084090 mon.compute1 mon.0 10.200.246.20:6789/0 1540 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 2 pgs unclean)
2017-09-28 14:48:11.084158 mon.compute1 mon.0 10.200.246.20:6789/0 1541 : cluster [INF] Cluster is now healthy
2017-09-28 14:52:59.340358 mon.compute1 mon.0 10.200.246.20:6789/0 1544 : cluster [INF] osd.0 marked down after no beacon for 300.224962 seconds
2017-09-28 14:52:59.342223 mon.compute1 mon.0 10.200.246.20:6789/0 1545 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2017-09-28 14:53:01.100433 mon.compute1 mon.0 10.200.246.20:6789/0 1554 : cluster [WRN] Health check failed: Reduced data availability: 5 pgs inactive, 19 pgs peering (PG_AVAILABILITY)
2017-09-28 14:53:01.100543 mon.compute1 mon.0 10.200.246.20:6789/0 1555 : cluster [WRN] Health check failed: Degraded data redundancy: 19 pgs unclean (PG_DEGRADED)
2017-09-28 14:53:02.831879 mon.compute1 mon.0 10.200.246.20:6789/0 1560 : cluster [WRN] Health check update: Degraded data redundancy: 7/96 objects degraded (7.292%), 95 pgs unclean, 77 pgs degraded (PG_DEGRADED)
2017-09-28 14:53:02.844945 mon.compute1 mon.0 10.200.246.20:6789/0 1561 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2017-09-28 14:53:02.963916 mon.compute1 mon.0 10.200.246.20:6789/0 1562 : cluster [INF] osd.0 10.200.246.20:6800/1732 boot
2017-09-28 14:53:05.157621 mon.compute1 mon.0 10.200.246.20:6789/0 1571 : cluster [WRN] Health check update: Degraded data redundancy: 7/96 objects degraded (7.292%), 110 pgs unclean, 77 pgs degraded (PG_DEGRADED)
2017-09-28 14:53:06.289097 mon.compute1 mon.0 10.200.246.20:6789/0 1572 : cluster [WRN] Health check update: Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
2017-09-28 14:53:06.289173 mon.compute1 mon.0 10.200.246.20:6789/0 1573 : cluster [WRN] Health check update: Degraded data redundancy: 7/96 objects degraded (7.292%), 96 pgs unclean, 77 pgs degraded (PG_DEGRADED)
2017-09-28 14:53:08.333526 mon.compute1 mon.0 10.200.246.20:6789/0 1574 : cluster [WRN] Health check update: Degraded data redundancy: 15 pgs unclean (PG_DEGRADED)
2017-09-28 14:53:08.333592 mon.compute1 mon.0 10.200.246.20:6789/0 1575 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 5 pgs inactive, 5 pgs peering)
2017-09-28 14:53:01.899641 osd.0 osd.0 10.200.246.20:6800/1732 15 : cluster [WRN] Monitor daemon marked osd.0 down, but it is still running
2017-09-28 14:53:10.866229 mon.compute1 mon.0 10.200.246.20:6789/0 1576 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 15 pgs unclean)
2017-09-28 14:53:10.866318 mon.compute1 mon.0 10.200.246.20:6789/0 1577 : cluster [INF] Cluster is now healthy
2017-09-28 14:54:04.380384 mon.compute1 mon.0 10.200.246.20:6789/0 1579 : cluster [INF] osd.6 marked down after no beacon for 300.819489 seconds
2017-09-28 14:54:04.382641 mon.compute1 mon.0 10.200.246.20:6789/0 1580 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2017-09-28 14:54:05.822527 mon.compute1 mon.0 10.200.246.20:6789/0 1585 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2017-09-28 14:54:05.822622 mon.compute1 mon.0 10.200.246.20:6789/0 1586 : cluster [INF] Cluster is now healthy
2017-09-28 14:54:05.975682 mon.compute1 mon.0 10.200.246.20:6789/0 1587 : cluster [INF] osd.6 10.200.246.63:6800/1410 boot
2017-09-28 14:54:09.083395 mon.compute1 mon.0 10.200.246.20:6789/0 1596 : cluster [WRN] Health check failed: Degraded data redundancy: 93 pgs unclean (PG_DEGRADED)
2017-09-28 14:54:12.372374 mon.compute1 mon.0 10.200.246.20:6789/0 1597 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 93 pgs unclean)
2017-09-28 14:54:12.372477 mon.compute1 mon.0 10.200.246.20:6789/0 1598 : cluster [INF] Cluster is now healthy
2017-09-28 14:54:05.071436 osd.6 osd.6 10.200.246.63:6800/1410 13 : cluster [WRN] Monitor daemon marked osd.6 down, but it is still running
2017-09-28 14:54:39.397852 mon.compute1 mon.0 10.200.246.20:6789/0 1601 : cluster [INF] osd.2 marked down after no beacon for 300.647558 seconds
2017-09-28 14:54:39.399874 mon.compute1 mon.0 10.200.246.20:6789/0 1602 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2017-09-28 14:54:40.700622 mon.compute1 mon.0 10.200.246.20:6789/0 1611 : cluster [WRN] Health check failed: Reduced data availability: 8 pgs peering (PG_AVAILABILITY)
2017-09-28 14:54:40.700715 mon.compute1 mon.0 10.200.246.20:6789/0 1612 : cluster [WRN] Health check failed: Degraded data redundancy: 8 pgs unclean (PG_DEGRADED)
2017-09-28 14:54:41.873960 mon.compute1 mon.0 10.200.246.20:6789/0 1617 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2017-09-28 14:54:41.958227 mon.compute1 mon.0 10.200.246.20:6789/0 1618 : cluster [INF] osd.2 10.200.246.61:6804/1802 boot
2017-09-28 14:54:43.157865 mon.compute1 mon.0 10.200.246.20:6789/0 1627 : cluster [WRN] Health check update: Reduced data availability: 1 pg peering (PG_AVAILABILITY)
2017-09-28 14:54:43.157973 mon.compute1 mon.0 10.200.246.20:6789/0 1628 : cluster [WRN] Health check update: Degraded data redundancy: 1 pg unclean (PG_DEGRADED)
2017-09-28 14:54:48.392928 mon.compute1 mon.0 10.200.246.20:6789/0 1633 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2017-09-28 14:54:48.392996 mon.compute1 mon.0 10.200.246.20:6789/0 1634 : cluster [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 1 pg unclean)
2017-09-28 14:54:48.393040 mon.compute1 mon.0 10.200.246.20:6789/0 1635 : cluster [INF] Cluster is now healthy
2017-09-28 14:54:40.018250 osd.2 osd.2 10.200.246.61:6804/1802 15 : cluster [WRN] Monitor daemon marked osd.2 down, but it is still running
[root@storage3 ceph]#
[root@storage3 ceph]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME         STATUS REWEIGHT PRI-AFF
-1       6.54892 root default
-3       1.09137     host compute1
 0   hdd 0.54568         osd.0         up  1.00000 1.00000
 1   hdd 0.54568         osd.1         up  1.00000 1.00000
-5       1.81918     host storage1
 2   hdd 0.90959         osd.2         up  1.00000 1.00000
 3   hdd 0.90959         osd.3         up  1.00000 1.00000
-7       1.81918     host storage2
 4   hdd 0.90959         osd.4         up  1.00000 1.00000
 5   hdd 0.90959         osd.5         up  1.00000 1.00000
-9       1.81918     host storage3
 6   hdd 0.90959         osd.6         up  1.00000 1.00000
 7   hdd 0.90959         osd.7         up  1.00000 1.00000
[root@storage3 ceph]#
[root@storage3 ceph]# cat /etc/ceph/ceph.conf
[global]
fsid = c55f7e91-d2cc-41b9-bff3-bfa18a69ea2f
mon_initial_members = compute1, storage1, storage2, storage3
mon_host = 10.200.246.20,10.200.246.61,10.200.246.62,10.200.246.63
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = 10.200.0.0/16

max open files = 100000
filestore_xattr_use_omap = true
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 128
osd pool default pgp num = 128

[osd]
osd data = /var/lib/ceph/osd/$cluster-$id

osd_mkfs_options_xfs = -f
osd_mkfs_type = xfs
osd_mount_options_xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"

filestore op threads = 12
filestore min sync interval = 5
filestore max sync interval = 10
filestore queue max ops = 10000
filestore queue max bytes = 10485760
filestore queue committing max ops = 5000
filestore queue committing max bytes = 10485760000

journal aio = true
journal dio = true
journal block align = true
journal max write bytes = 100000000
journal max write entries = 10000
journal queue max ops = 50000
journal queue max bytes = 10485760000
osd op threads = 4
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128

[mon]
mon_clock_drift_allowed = .50
mon osd full ratio = .90
mon osd nearfull ratio = .85
mon_osd_down_out_interval = 300
mon_osd_report_timeout = 150

[mgr]
mgr data = /var/lib/ceph/mgr/$cluster-$id

[client]
rbd cache = true
rbd cache writethrough until flush = true
rbd concurrent management ops = 20
log file = /var/log/qemu/qemu-guest.$pid.log
[root@storage3 ceph]#
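
(For reference: the cluster log above shows OSDs being marked down after roughly 300 seconds without a beacon, while this ceph.conf sets mon_osd_report_timeout = 150, so it is worth confirming which value the running monitor actually loaded. This is only a sketch, using the mon name that appears in the log:)

ceph daemon mon.compute1 config get mon_osd_report_timeout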


Files

246_20.res (303 KB) - ping test - zheng liu, 09/28/2017 07:03 AM
246_61.res (304 KB) - ping test - zheng liu, 09/28/2017 07:03 AM
246_62.res (304 KB) - ping test - zheng liu, 09/28/2017 07:03 AM
246_63.res (305 KB) - ping test - zheng liu, 09/28/2017 07:03 AM
ceph.conf (1.56 KB) - config - zheng liu, 09/28/2017 07:03 AM
ceph.log (83.4 KB) - log - zheng liu, 09/28/2017 07:03 AM
#1

Updated by zheng liu over 6 years ago

2017-09-28 17:15:10.047907 7fabe24fa700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.2 down, but it is still running
2017-09-28 17:15:10.047920 7fabe24fa700 0 log_channel(cluster) log [DBG] : map e8660 wrongly marked me down at e8658
2017-09-28 17:15:10.047923 7fabe24fa700 1 osd.2 8660 start_waiting_for_healthy

2017-09-28 17:15:10.105998 7fabe24fa700 1 osd.2 8660 is_healthy false -- only 0/7 up peers (less than 33%)
2017-09-28 17:15:10.106008 7fabe24fa700 1 osd.2 8660 not healthy; waiting to boot
2017-09-28 17:15:10.222417 7fabec50e700 1 osd.2 8660 start_boot
2017-09-28 17:15:11.164757 7fabe24fa700 1 osd.2 8661 state: booting -> active

The lines above are from the osd.2 log.

#2

Updated by Aleksei Gutikov over 6 years ago

This can happen when mon_osd_report_timeout < osd_mon_report_interval_max.
Check the running config with:
  1. sudo ceph daemon mon.xxxx config show
     (or sudo ceph --admin-daemon /var/run/ceph/mon-xxxxxx.asok config show)
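
For this cluster, a sketch of that comparison (daemon names taken from the osd tree above; run each command on the host carrying that daemon):

sudo ceph daemon mon.compute1 config show | grep mon_osd_report_timeout
sudo ceph daemon osd.2 config show | grep osd_mon_report_interval_max

If the monitor-side timeout turns out to be smaller than the OSD-side maximum report interval, raising mon_osd_report_timeout in the [mon] section (or lowering osd_mon_report_interval_max) and restarting the monitors would be the natural thing to try.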
#3

Updated by Dmitry Smirnov over 6 years ago

I have a similar issue.

My settings are the defaults:
"mon_osd_report_timeout": "900" and "osd_mon_report_interval_max": "600"
