Bug #23467
ceph-disk: Destroyed OSDs keeps old CRUSH weight if new device is different size
Status: Closed
Description
Before I get to the bug, let me explain where this is coming from.
In a cluster with 3.84TB Samsung SSDs, an SSD failed. The Ceph operator asked the datacenter to replace the SSD, and they did.
The employee in the datacenter made a mistake: instead of replacing the SSD with a 3.84TB drive, he installed a 960GB SSD in that slot.
The Ceph operator did not notice this mistake and took the following steps:
- Destroy the old OSD with 'ceph osd destroy X'
- Prepare the new SSD as the old OSD ID using 'ceph-disk prepare --osd-id X /dev/sdX'
The OSD was added again, but it kept its CRUSH weight of 3.48700 while the OSD is only 960GB in size.
Backfilling started and Ceph went to HEALTH_OK. The admin thought everything was going just fine and let the cluster continue.
Suddenly the system went into HEALTH_ERR and I/O stopped because this OSD was 95% full. Expected behavior from Ceph, but not expected here.
The admin then found out this OSD was only 960GB in size. The OSD was stopped and the problem was resolved. Datacenter later swapped the SSD again with a properly sized SSD.
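Until this is fixed, the CRUSH weight can be corrected by hand after re-preparing the OSD. A sketch, assuming osd.3 and a 5GB replacement device (the weight argument is the device size expressed in TiB; adjust the OSD id and weight for your situation):

```shell
# After 'ceph osd destroy 3' and 'ceph-disk prepare --osd-id 3 /dev/sdX',
# the CRUSH item still carries the old weight. Manually set it to the
# new device's size in TiB (5 GiB is roughly 0.0049):
ceph osd crush reweight osd.3 0.0049

# Verify the tree now reflects the smaller device:
ceph osd tree
```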
To verify this behavior on Luminous I built a very simple Ceph cluster with a few OSDs:
ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
-1       0.03918 root default
-3       0.01959     host alpha
 0   ssd 0.00980         osd.0       up  1.00000 1.00000
 3   ssd 0.00980         osd.3       up  1.00000 1.00000
-5       0.00980     host bravo
 1   ssd 0.00980         osd.1       up  1.00000 1.00000
-7       0.00980     host charlie
 2   ssd 0.00980         osd.2       up  1.00000 1.00000
In this case all the OSDs are 10GB and run inside Virtual Machines.
I stopped and destroyed osd.3 and re-added it with a 5GB disk.
The CRUSH tree, however, remained the same, while 'ceph osd df' does show the new OSD size:
ID CLASS WEIGHT  REWEIGHT SIZE   USE   AVAIL %USE  VAR  PGS
 0   ssd 0.00980  1.00000 10236M 1065M 9170M 10.41 0.88  68
 3   ssd 0.00980  1.00000  5116M 1054M 4061M 20.61 1.74  60
 1   ssd 0.00980  1.00000 10236M 1061M 9174M 10.37 0.88 128
 2   ssd 0.00980  1.00000 10236M 1061M 9174M 10.37 0.88 128
                    TOTAL 35824M 4242M 31581M 11.84
MIN/MAX VAR: 0.88/1.74  STDDEV: 4.56
Take a close look: the size of osd.3 is now 5116M while its weight is still 0.00980.
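The expected weight is simply the device size expressed in TiB. The helper below (weight_for_size is a hypothetical name, not a Ceph tool) reproduces that calculation for the 10GB and 5GB disks; the results match the initial_weight values that show up in the mon log:

```shell
# Compute an OSD's expected initial CRUSH weight: device size in TiB.
weight_for_size() {
    # $1 = device size in bytes
    awk -v bytes="$1" 'BEGIN { printf "%.4f\n", bytes / (1024 ^ 4) }'
}

weight_for_size $((10 * 1024 * 1024 * 1024))  # 10 GiB disk -> 0.0098
weight_for_size $((5 * 1024 * 1024 * 1024))   # 5 GiB disk  -> 0.0049
```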
Looking at the logs of the MON I see that the OSD did send the correct crush create-or-move with the proper weight:
2018-03-26 16:14:09.622816 7faf99315700  0 mon.alpha@0(leader).osd e28 create-or-move crush item name 'osd.3' initial_weight 0.0098 at location {host=alpha,root=default}
2018-03-26 16:17:11.380765 7f9ee734f700  0 mon.alpha@0(leader).osd e41 create-or-move crush item name 'osd.3' initial_weight 0.0049 at location {host=alpha,root=default}
initial_weight is only applied when the OSD is completely new; it is not taken into account when the OSD has been destroyed and re-created.
Shouldn't we also accept initial_weight when a destroyed OSD boots for the first time?
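The proposal in code form, as a minimal sketch (this is not actual Ceph source; the function name and the state labels are made up): treat a destroyed-and-re-created OSD like a brand-new one for the purpose of initial_weight.

```shell
# Hypothetical sketch of the proposed policy, not Ceph code.
# Current behavior honors initial_weight only for "new" OSDs;
# the proposal is to also honor it on the first boot after destroy.
should_apply_initial_weight() {
    # $1 = OSD state: "new", "destroyed", or "existing"
    case "$1" in
        new)       echo yes ;;  # fresh OSD: weight set from device size
        destroyed) echo yes ;;  # proposed change: re-created OSD gets the new weight
        *)         echo no  ;;  # running OSD: keep the operator-set weight
    esac
}
```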
Updated by Wido den Hollander about 6 years ago
Wido den Hollander wrote:
Backfilling started and Ceph went to HEALTH_OK. The admin thought everything was going just fine and let the cluster continue.
I mean HEALTH_WARN here, but I cannot change the original post.
Updated by Greg Farnum about 6 years ago
- Subject changed from Destroyed OSDs keeps old CRUSH weight if new device is different size to ceph-disk: Destroyed OSDs keeps old CRUSH weight if new device is different size