Bug #23049

Ceph status shows only WARN when traffic to the cluster fails

Added by Nokia ceph-users about 6 years ago. Updated over 4 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Administration/Usability
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

While using Kraken, I saw the status change to ERR, but in Luminous the Ceph status does not change from WARN to ERR in failure cases.

Environment used: 3-node cluster
Erasure coding: 2+1

Steps:

1. I stopped the OSDs on all three nodes using systemctl stop ceph-osd.target.

2. In this state, all reads and writes to the cluster failed. I expected the status to change to ERR, since no read or write was possible. But after 15 minutes the status was still WARN, not ERR (ceph status copied below).

  cluster:
    id:     c36fb424-038a-4c38-84a4-1469481ad5c8
    health: HEALTH_WARN
            26 osds down
            2 hosts (24 osds) down
            Reduced data availability: 362 pgs inactive, 1024 pgs down
            Degraded data redundancy: 1024 pgs unclean

  services:
    mon: 3 daemons, quorum pl12-cn1,pl12-cn2,pl12-cn3
    mgr: pl12-cn3(active), standbys: pl12-cn2, pl12-cn1
    osd: 36 osds: 10 up, 36 in

  data:
    pools:   1 pools, 1024 pgs
    objects: 1512 objects, 2607 MB
    usage:   44427 MB used, 196 TB / 196 TB avail
    pgs:     100.000% pgs not active
             869 down
             155 stale+down

One more thing I noticed in this status: even though no OSD processes were running on any node, the output still shows 10 OSDs as up; it never dropped to 0 up. Is this expected?
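
A minimal sketch of one way to watch the status while reproducing this (assumptions: the ceph CLI is on PATH, and Luminous's JSON output puts the overall state at health.status; the poll interval is illustrative, and the 15-minute deadline mirrors the wait described above):

import json
import subprocess
import time

POLL_INTERVAL = 30  # seconds between checks (illustrative)
DEADLINE = 15 * 60  # give up after 15 minutes, as in the report

def health_status():
    """Return the overall health string, e.g. HEALTH_WARN or HEALTH_ERR."""
    out = subprocess.check_output(["ceph", "status", "--format", "json"])
    return json.loads(out)["health"]["status"]

start = time.time()
while time.time() - start < DEADLINE:
    status = health_status()
    print(f"{int(time.time() - start):4d}s: {status}")
    if status == "HEALTH_ERR":
        break
    time.sleep(POLL_INTERVAL)

In the scenario above, this loop prints HEALTH_WARN on every iteration and never exits early.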


Related issues

Related to RADOS - Bug #23565: Inactive PGs don't seem to cause HEALTH_ERR (Fix Under Review)

History

#1 Updated by Nokia ceph-users about 6 years ago

Please let me know of the required logs/info to be added if any.

#2 Updated by Josh Durgin about 6 years ago

  • Project changed from Ceph to RADOS
  • Priority changed from Normal to High

Can reproduce easily - thanks for the report.

Two bugs here:
1) the monitor is still enforcing mon_osd_min_up_ratio even on polite shutdowns
2) HEALTH_WARN doesn't turn into HEALTH_ERR even when all OSDs are down
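
As a rough illustration of bug 1 (a sketch assuming the default mon_osd_min_up_ratio of 0.3, not the actual monitor code): the monitors decline to mark further OSDs down once the up ratio has dropped below the floor, which is why the cluster above bottoms out at 10 up out of 36.

# Rough model of the mon_osd_min_up_ratio floor -- illustrative only,
# not the real OSDMonitor logic.
MIN_UP_RATIO = 0.3   # Ceph default for mon_osd_min_up_ratio
TOTAL_OSDS = 36      # as in the cluster reported above

up = TOTAL_OSDS
while up / TOTAL_OSDS >= MIN_UP_RATIO:
    up -= 1  # ratio still at/above the floor, so one more OSD can be marked down

print(f"marking down stops at {up} up / {TOTAL_OSDS - up} down")
# prints: marking down stops at 10 up / 26 down

That matches the "26 osds down" / "10 up" in the status above, even though all 36 OSD processes were stopped.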

#3 Updated by Josh Durgin about 6 years ago

  • Category set to Administration/Usability

#4 Updated by Nokia ceph-users almost 6 years ago

Hi,
Which release is the fix expected in?

Thanks,

#5 Updated by Josh Durgin almost 6 years ago

  • Priority changed from High to Urgent

#6 Updated by Josh Durgin over 4 years ago

  • Priority changed from Urgent to Normal

#7 Updated by Greg Farnum over 4 years ago

  • Related to Bug #23565: Inactive PGs don't seem to cause HEALTH_ERR added
