Project

General

Profile

Actions

Bug #9998

closed

Replaced OSD weight below 0

Added by Pawel Sadowski over 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
firefly,giant
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I've hit a bug when replacing OSDs. Under specific conditions replaced OSD gets weight of -3.052e-05.

How to reproduce

  • build a cluster with few nodes
  • remove OSD node with id lower than maximum (i.e. not the last one)
  • change host weight while not changing OSD weights (i.e. sum(weight(osd)) != weight(host))
  • add new osd (it should receive missing id and host weight will be updated)

Following part of the code looks to be responsible for this bug (src/crush/CrushWrapper.cc):

    if (item_exists(item)) {
      weight = get_item_weightf(item);

The value is -3.052e-05 is returned by get_item_weightf when get_item_weight returns -ENOENT (-2):

  float get_item_weightf(int id) {
    return (float)get_item_weight(id) / (float)0x10000;
  }

The worst thing is that such OSD receives most (almost all) of the data!

This looks like somewhere in the code int is casted to unsigned int and value just below zero become a huge weight in crushmap.

OSD tree:

host406249# ceph osd tree
-2    4.4            host host406249
3    1.29                osd.3    up    1    
4    1.29                osd.4    up    1    
6    1.82                osd.6    up    1    
1    -3.052e-05                osd.1    up    1    
-5    6.22        rack C2
-4    6.22            host host406251
0    1.82                osd.0    up    1    
2    1.29                osd.2    up    1    
5    1.29                osd.5    up    1    
7    1.82                osd.7    up    1    
-7    6.22        rack C3
-6    6.22            host host406254
9    1.82                osd.9    up    1    
10    1.29                osd.10    up    1    
11    1.29                osd.11    up    1    
8    1.82                osd.8    up    1    

Host with "broken" OSD:

host406249# df -h /osd-sd{a..d}
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/osd-sda  1.3T   88M  1.3T   1% /osd-sda
/dev/mapper/osd-sdb  1.3T   55M  1.3T   1% /osd-sdb
/dev/mapper/osd-sdc  1.9T  100M  1.9T   1% /osd-sdc
/dev/mapper/osd-sdd  1.9T   10G  1.9T   1% /osd-sdd             <------ this the osd.1 with negative weight

Normal host:

host406254# df -h /osd-sd{a..d}
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/osd-sda  1.3T  2.4G  1.3T   1% /osd-sda
/dev/mapper/osd-sdb  1.3T  2.0G  1.3T   1% /osd-sdb
/dev/mapper/osd-sdc  1.9T  2.9G  1.9T   1% /osd-sdc
/dev/mapper/osd-sdd  1.9T  3.1G  1.9T   1% /osd-sdd


Files

crushmap.e6984.txt (2.82 KB) crushmap.e6984.txt Before removing OSD.1 Pawel Sadowski, 11/22/2014 10:13 AM
crushmap.e7120.txt (2.79 KB) crushmap.e7120.txt OSD.1 removed Pawel Sadowski, 11/22/2014 10:13 AM
crushmap.e7121.txt (2.79 KB) crushmap.e7121.txt host406247 weight changed by (getcrushmap/setcrushmap) - this is osdmap just prior re-adding OSD Pawel Sadowski, 11/22/2014 10:13 AM
crushmap.e7154.txt (2.82 KB) crushmap.e7154.txt After adding OSD.1 again (new OSD with same id) Pawel Sadowski, 11/22/2014 10:13 AM
Actions

Also available in: Atom PDF