Bug #3720 (closed): Ceph Reporting Negative Number of Degraded objects

Added by Mike Dawson over 11 years ago. Updated about 11 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)

Description

I changed the replication of two pools from 2x to 3x. The cluster rebalanced to nearly HEALTH_OK, but got stuck at:

HEALTH_WARN 18 pgs degraded; 18 pgs stuck unclean; recovery 106/90035 degraded (0.118%)
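
For reference, the replication change is done with ceph osd pool set; a minimal sketch, using placeholder pool names rather than the actual pools from this cluster:

root@node1:~# ceph osd pool set <pool1> size 3
root@node1:~# ceph osd pool set <pool2> size 3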

The PGs that were stuck unclean were still mapped to only two OSDs:

root@node1:~# ceph pg dump | grep degraded
3.2dd 1 0 0 0 4194304 154 154 active+degraded 2013-01-03 14:42:54.733326 10'1 330'168 [1,0] [1,0] 0'0 2013-01-02 12:19:33.305832 0'0 2013-01-02 12:19:33.305832
4.2dc 0 0 -9 0 0 79618 79618 active+degraded 2013-01-03 14:42:54.780529 360'517 330'989 [1,0] [1,0] 0'0 2013-01-02 16:07:04.509564 0'0 2013-01-02 16:07:04.509564
4.1a2 0 0 -14 0 0 131516 131516 active+degraded 2013-01-03 14:43:39.431023 360'854 344'1492 [4,5] [4,5] 0'0 2013-01-02 13:19:31.365322 0'0 2013-01-02 13:19:31.365323
3.1a3 2 0 0 0 8388608 307 307 active+degraded 2013-01-03 14:43:39.432830 11'2 344'133 [4,5] [4,5] 0'0 2013-01-02 12:13:27.185501 0'0 2013-01-02 12:13:27.185501
4.148 0 0 -13 0 0 147840 147840 active+degraded 2013-01-03 14:43:15.819473 360'960 338'1806 [3,2] [3,2] 0'0 2013-01-02 14:35:52.227881 0'0 2013-01-02 14:35:52.227882
3.149 3 0 0 0 12582912 461 461 active+degraded 2013-01-03 14:43:15.821554 11'3 338'128 [3,2] [3,2] 0'0 2013-01-02 12:13:06.365653 0'0 2013-01-02 12:13:06.365653
4.100 0 0 -5 0 0 47740 47740 active+degraded 2013-01-03 14:43:15.821759 360'310 338'651 [3,2] [3,2] 0'0 2013-01-02 13:16:04.298871 0'0 2013-01-02 13:16:04.298871
3.101 1 0 0 0 4194304 154 154 active+degraded 2013-01-03 14:43:15.822507 10'1 338'141 [3,2] [3,2] 0'0 2013-01-02 12:09:49.117324 0'0 2013-01-02 12:09:49.117324
4.d0 0 0 -16 0 0 147532 147532 active+degraded 2013-01-03 14:43:15.824769 360'958 338'1754 [3,2] [3,2] 0'0 2013-01-02 13:18:16.616246 0'0 2013-01-02 13:18:16.616246
3.d1 5 0 0 0 20971520 770 770 active+degraded 2013-01-03 14:43:15.826829 10'5 338'134 [3,2] [3,2] 0'0 2013-01-02 12:10:03.300437 0'0 2013-01-02 12:10:03.300437
4.7d9 0 0 -15 0 0 141372 141372 active+degraded 2013-01-03 14:42:54.586303 360'918 330'1626 [1,0] [1,0] 0'0 2013-01-02 16:16:37.668688 0'0 2013-01-02 16:16:37.668688
3.7da 1 0 0 0 4194304 154 154 active+degraded 2013-01-03 14:42:54.590050 10'1 330'168 [1,0] [1,0] 0'0 2013-01-02 12:55:41.983775 0'0 2013-01-02 12:55:41.983775
4.722 0 0 -12 0 0 103180 103180 active+degraded 2013-01-03 14:43:42.423262 360'670 354'1191 [7,6] [7,6] 0'0 2013-01-02 14:47:19.126260 0'0 2013-01-02 14:47:19.126260
3.723 3 0 0 0 12582912 462 462 active+degraded 2013-01-03 14:43:42.424585 10'3 354'99 [7,6] [7,6] 0'0 2013-01-02 12:51:17.900154 0'0 2013-01-02 12:51:17.900154
4.5d1 0 0 -10 0 0 91630 91630 active+degraded 2013-01-03 14:43:19.385726 360'595 333'1144 [2,3] [2,3] 0'0 2013-01-02 14:43:59.613276 0'0 2013-01-02 14:43:59.613276
3.5d2 1 0 0 0 4194304 154 154 active+degraded 2013-01-03 14:43:19.382324 10'1 333'149 [2,3] [2,3] 0'0 2013-01-02 12:44:55.046840 0'0 2013-01-02 12:44:55.046840
4.412 0 0 -12 0 0 99792 99792 active+degraded 2013-01-03 14:43:31.666032 360'648 349'1204 [5,4] [5,4] 0'0 2013-01-02 14:41:52.669729 0'0 2013-01-02 14:41:52.669729
3.413 1 0 0 0 4194304 153 153 active+degraded 2013-01-03 14:43:31.667578 11'1 349'142 [5,4] [5,4] 0'0 2013-01-02 12:34:24.819494 0'0 2013-01-02 12:34:24.819494

Then I deleted an RBD volume from one of the pools with degraded PGs. After that, ceph -s showed:

root@node1:~# ceph -s
health HEALTH_WARN 18 pgs degraded; 18 pgs stuck unclean; recovery -32/34802 degraded (-0.092%)
monmap e1: 1 mons at {a=10.2.1.1:6789/0}, election epoch 1, quorum 0 a
osdmap e360: 8 osds: 8 up, 8 in
pgmap v25271: 4672 pgs: 4654 active+clean, 18 active+degraded; 40115 MB data, 57735 MB used, 22291 GB / 22348 GB avail; -32/34802 degraded (-0.092%)
mdsmap e1: 0/0/1 up
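
The deletion itself was an ordinary rbd rm; as a hedged sketch, with placeholder pool and image names:

root@node1:~# rbd rm <pool>/<image>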


Related issues (0 open, 1 closed)

Related to Ceph - Bug #4254: osd: failure to recover before timeout on rados bench and thrashing; negative stats (Resolved, Guang Yang, 02/23/2013)

Actions #1

Updated by Mike Dawson over 11 years ago

Per Josh D.'s suggestion, I set the CRUSH tunables as shown below, and it resolved the issue.

  1. ceph osd getcrushmap -o /tmp/crush
  2. crushtool --enable-unsafe-tunables -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new
  3. ceph osd setcrushmap -i /tmp/crush.new
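
Before injecting the new map in step 3, the tunables can be verified by decompiling the modified map with crushtool; a minimal sketch, with an arbitrary output path:

root@node1:~# crushtool -d /tmp/crush.new -o /tmp/crush.new.txt
root@node1:~# grep tunable /tmp/crush.new.txt

On later releases the same tunables can typically be applied in a single step with ceph osd crush tunables bobtail (or optimal).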
Actions #2

Updated by Sage Weil over 11 years ago

  • Priority changed from Normal to High
Actions #3

Updated by Ian Colle over 11 years ago

  • Assignee set to Samuel Just
Actions #4

Updated by Gerben Meijer about 11 years ago

I reproduced it by:

1. Creating 160GB of rbd devices with 2x replication
2. Taking one node with 12 OSDs offline (out of 5 nodes)
3. Watching the cluster recover, and meanwhile
4. Deleting all rbd devices in parallel (e.g. for i in a b c; do rbd rm $blah & done; see the sketch below)
5. Observing -13/43630 degraded (-0.030%)
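
Spelled out, the parallel deletion in step 4 might look like the following, where the pool name rbd and the image names vol-a, vol-b, vol-c are hypothetical placeholders:

# remove several rbd images in parallel, then wait for all removals to finish
for img in vol-a vol-b vol-c; do
    rbd rm rbd/$img &
done
wait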

Actions #5

Updated by Ian Colle about 11 years ago

  • Target version deleted (v0.56)
Actions #6

Updated by Sage Weil about 11 years ago

  • Subject changed from Ceph Reporting Negative Number of Degraded Placement Groups to Ceph Reporting Negative Number of Degraded objects
Actions #7

Updated by Sage Weil about 11 years ago

  • Priority changed from High to Normal
Actions #8

Updated by Sage Weil about 11 years ago

  • Status changed from New to Duplicate