Bug #3720 (closed): Ceph Reporting Negative Number of Degraded objects

Added by Mike Dawson over 11 years ago. Updated about 11 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)

Description

I changed the replication of two pools from 2x to 3x. The cluster rebalanced to nearly HEALTH_OK, but got stuck at:

HEALTH_WARN 18 pgs degraded; 18 pgs stuck unclean; recovery 106/90035 degraded (0.118%)
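
For reference, the replication change is done with ceph osd pool set; a minimal sketch, using placeholder pool names rather than the actual pools from this cluster:

root@node1:~# ceph osd pool set <pool1> size 3
root@node1:~# ceph osd pool set <pool2> size 3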

The PGs that were stuck unclean were still mapped to only two OSDs:

root@node1:~# ceph pg dump | grep degraded
3.2dd 1 0 0 0 4194304 154 154 active+degraded 2013-01-03 14:42:54.733326 10'1 330'168 [1,0] [1,0] 0'0 2013-01-02 12:19:33.305832 0'0 2013-01-02 12:19:33.305832
4.2dc 0 0 -9 0 0 79618 79618 active+degraded 2013-01-03 14:42:54.780529 360'517 330'989 [1,0] [1,0] 0'0 2013-01-02 16:07:04.509564 0'0 2013-01-02 16:07:04.509564
4.1a2 0 0 -14 0 0 131516 131516 active+degraded 2013-01-03 14:43:39.431023 360'854 344'1492 [4,5] [4,5] 0'0 2013-01-02 13:19:31.365322 0'0 2013-01-02 13:19:31.365323
3.1a3 2 0 0 0 8388608 307 307 active+degraded 2013-01-03 14:43:39.432830 11'2 344'133 [4,5] [4,5] 0'0 2013-01-02 12:13:27.185501 0'0 2013-01-02 12:13:27.185501
4.148 0 0 -13 0 0 147840 147840 active+degraded 2013-01-03 14:43:15.819473 360'960 338'1806 [3,2] [3,2] 0'0 2013-01-02 14:35:52.227881 0'0 2013-01-02 14:35:52.227882
3.149 3 0 0 0 12582912 461 461 active+degraded 2013-01-03 14:43:15.821554 11'3 338'128 [3,2] [3,2] 0'0 2013-01-02 12:13:06.365653 0'0 2013-01-02 12:13:06.365653
4.100 0 0 -5 0 0 47740 47740 active+degraded 2013-01-03 14:43:15.821759 360'310 338'651 [3,2] [3,2] 0'0 2013-01-02 13:16:04.298871 0'0 2013-01-02 13:16:04.298871
3.101 1 0 0 0 4194304 154 154 active+degraded 2013-01-03 14:43:15.822507 10'1 338'141 [3,2] [3,2] 0'0 2013-01-02 12:09:49.117324 0'0 2013-01-02 12:09:49.117324
4.d0 0 0 -16 0 0 147532 147532 active+degraded 2013-01-03 14:43:15.824769 360'958 338'1754 [3,2] [3,2] 0'0 2013-01-02 13:18:16.616246 0'0 2013-01-02 13:18:16.616246
3.d1 5 0 0 0 20971520 770 770 active+degraded 2013-01-03 14:43:15.826829 10'5 338'134 [3,2] [3,2] 0'0 2013-01-02 12:10:03.300437 0'0 2013-01-02 12:10:03.300437
4.7d9 0 0 -15 0 0 141372 141372 active+degraded 2013-01-03 14:42:54.586303 360'918 330'1626 [1,0] [1,0] 0'0 2013-01-02 16:16:37.668688 0'0 2013-01-02 16:16:37.668688
3.7da 1 0 0 0 4194304 154 154 active+degraded 2013-01-03 14:42:54.590050 10'1 330'168 [1,0] [1,0] 0'0 2013-01-02 12:55:41.983775 0'0 2013-01-02 12:55:41.983775
4.722 0 0 -12 0 0 103180 103180 active+degraded 2013-01-03 14:43:42.423262 360'670 354'1191 [7,6] [7,6] 0'0 2013-01-02 14:47:19.126260 0'0 2013-01-02 14:47:19.126260
3.723 3 0 0 0 12582912 462 462 active+degraded 2013-01-03 14:43:42.424585 10'3 354'99 [7,6] [7,6] 0'0 2013-01-02 12:51:17.900154 0'0 2013-01-02 12:51:17.900154
4.5d1 0 0 -10 0 0 91630 91630 active+degraded 2013-01-03 14:43:19.385726 360'595 333'1144 [2,3] [2,3] 0'0 2013-01-02 14:43:59.613276 0'0 2013-01-02 14:43:59.613276
3.5d2 1 0 0 0 4194304 154 154 active+degraded 2013-01-03 14:43:19.382324 10'1 333'149 [2,3] [2,3] 0'0 2013-01-02 12:44:55.046840 0'0 2013-01-02 12:44:55.046840
4.412 0 0 -12 0 0 99792 99792 active+degraded 2013-01-03 14:43:31.666032 360'648 349'1204 [5,4] [5,4] 0'0 2013-01-02 14:41:52.669729 0'0 2013-01-02 14:41:52.669729
3.413 1 0 0 0 4194304 153 153 active+degraded 2013-01-03 14:43:31.667578 11'1 349'142 [5,4] [5,4] 0'0 2013-01-02 12:34:24.819494 0'0 2013-01-02 12:34:24.819494

Then I deleted an RBD volume from one of the pools with degraded PGs. After that, ceph -s showed:

root@node1:~# ceph -s
health HEALTH_WARN 18 pgs degraded; 18 pgs stuck unclean; recovery -32/34802 degraded (-0.092%)
monmap e1: 1 mons at {a=10.2.1.1:6789/0}, election epoch 1, quorum 0 a
osdmap e360: 8 osds: 8 up, 8 in
pgmap v25271: 4672 pgs: 4654 active+clean, 18 active+degraded; 40115 MB data, 57735 MB used, 22291 GB / 22348 GB avail; -32/34802 degraded (-0.092%)
mdsmap e1: 0/0/1 up
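
The deletion itself was an ordinary rbd rm; as a hedged sketch, with placeholder pool and image names:

root@node1:~# rbd rm <pool>/<image>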


Related issues (0 open, 1 closed)

Related to Ceph - Bug #4254: osd: failure to recover before timeout on rados bench and thrashing; negative stats (Resolved, Guang Yang, 02/23/2013)

Actions #1

Updated by Mike Dawson over 11 years ago

Per Josh D.'s suggestion, I set the CRUSH tunables as shown below, and it resolved the issue.

  1. ceph osd getcrushmap -o /tmp/crush
  2. crushtool --enable-unsafe-tunables -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new
  3. ceph osd setcrushmap -i /tmp/crush.new
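
Before injecting the new map in step 3, the tunables can be verified by decompiling the modified map with crushtool; a minimal sketch, with an arbitrary output path:

root@node1:~# crushtool -d /tmp/crush.new -o /tmp/crush.new.txt
root@node1:~# grep tunable /tmp/crush.new.txt

On later releases the same tunables can typically be applied in a single step with ceph osd crush tunables bobtail (or optimal).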
Actions #2

Updated by Sage Weil over 11 years ago

  • Priority changed from Normal to High
Actions #3

Updated by Ian Colle over 11 years ago

  • Assignee set to Samuel Just
Actions #4

Updated by Gerben Meijer about 11 years ago

I reproduced it by:

1. Creating 160GB of rbd devices with 2x replication
2. Taking one node with 12 OSDs offline (out of 5 nodes)
3. Watching the cluster recover, and meanwhile
4. Deleting all rbd devices in parallel (e.g. for i in a b c; do rbd rm $blah & done; see the sketch below)
5. Observing -13/43630 degraded (-0.030%)
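
Spelled out, the parallel deletion in step 4 might look like the following, where the pool name rbd and the image names vol-a, vol-b, vol-c are hypothetical placeholders:

# remove several rbd images in parallel, then wait for all removals to finish
for img in vol-a vol-b vol-c; do
    rbd rm rbd/$img &
done
wait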

Actions #5

Updated by Ian Colle about 11 years ago

  • Target version deleted (v0.56)
Actions #6

Updated by Sage Weil about 11 years ago

  • Subject changed from Ceph Reporting Negative Number of Degraded Placement Groups to Ceph Reporting Negative Number of Degraded objects
Actions #7

Updated by Sage Weil about 11 years ago

  • Priority changed from High to Normal
Actions #8

Updated by Sage Weil about 11 years ago

  • Status changed from New to Duplicate