Bug #24224
The cluster does not go into the OK state
Status: Closed
Description
I have a test Ceph cluster in a virtual environment.
  cluster:
    id:     22d6464d-f137-423e-b8aa-bec5e9219755
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cn1,cn2,cn3
    mgr: cn1(active)
    osd: 12 osds: 12 up, 12 in

  data:
    pools:   1 pools, 256 pgs
    objects: 5 objects, 487 kB
    usage:   61914 MB used, 60845 MB / 119 GB avail
    pgs:     256 active+clean
$ ceph versions
{ "mon": { "ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)": 3 }, "mgr": { "ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)": 1 }, "osd": { "ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)": 12 }, "mds": {}, "overall": { "ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)": 16 } }
I created a hierarchy using the following commands:
$ ceph osd crush add-bucket rack1 rack
$ ceph osd crush add-bucket rack2 rack
$ ceph osd crush move cn1 rack=rack1
...
Final configuration:

$ ceph osd tree
ID  CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
 -1       0.11755 root default
-15       0.05878     rack rack1
 -2       0.01959         host cn1
  0   hdd 0.00980             osd.0   up  1.00000 1.00000
  1   hdd 0.00980             osd.1   up  1.00000 1.00000
 -4       0.01959         host cn3
  4   hdd 0.00980             osd.4   up  1.00000 1.00000
  5   hdd 0.00980             osd.5   up  1.00000 1.00000
 -6       0.01959         host cn5
  8   hdd 0.00980             osd.8   up  1.00000 1.00000
  9   hdd 0.00980             osd.9   up  1.00000 1.00000
-16       0.05878     rack rack2
 -3       0.01959         host cn2
  2   hdd 0.00980             osd.2   up  1.00000 1.00000
  3   hdd 0.00980             osd.3   up  1.00000 1.00000
 -5       0.01959         host cn4
  6   hdd 0.00980             osd.6   up  1.00000 1.00000
  7   hdd 0.00980             osd.7   up  1.00000 1.00000
 -7       0.01959         host cn6
 10   hdd 0.00980             osd.10  up  1.00000 1.00000
 11   hdd 0.00980             osd.11  up  1.00000 1.00000
The current rule, "replicated_ruleset", uses host as the failure domain (type: host).
I created a new rule with "rack" as the failure domain:
$ ceph osd crush rule create-replicated RackStar default rack
$ ceph osd pool set rbd crush_rule RackStar
set pool 0 crush_rule to RackStar
$ ceph osd dump | grep rule
pool 0 'rbd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 256 pgp_num 256 last_change 291 flags hashpspool stripe_width 0 application rbd
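For reference, a rule created with "ceph osd crush rule create-replicated RackStar default rack" should decompile to roughly the following (a sketch based on the Luminous defaults; the exact id and min/max_size may differ on a given cluster). It asks CRUSH to pick one leaf OSD under a distinct rack for every replica:

rule RackStar {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default                    # start from the "default" root
    step chooseleaf firstn 0 type rack   # one leaf OSD under a different rack per replica
    step emit
}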
My cluster does not go into the "HEALTH_OK" state. I rebooted all the servers in my cluster, but the problem does not disappear.
  cluster:
    id:     22d6464d-f137-423e-b8aa-bec5e9219755
    health: HEALTH_WARN
            2/6 objects misplaced (33.333%)

  services:
    mon: 3 daemons, quorum cn1,cn2,cn3
    mgr: cn1(active)
    osd: 12 osds: 12 up, 12 in; 256 remapped pgs

  data:
    pools:   1 pools, 256 pgs
    objects: 2 objects, 19 bytes
    usage:   61918 MB used, 60841 MB / 119 GB avail
    pgs:     2/6 objects misplaced (33.333%)
             256 active+clean+remapped
This video shows the problem in more detail: https://youtu.be/UtM7vItjsWY
Updated by John Spray almost 6 years ago
- Status changed from New to Closed
Thanks for the comprehensive information. In this case, you've created a rule that requires each copy to be on a separate rack, and a pool that requires three copies. However, you only have two racks, so Ceph can't satisfy that -- you'd need a third rack to have three copies on separate racks.
If what you really want is two copies on one rack and one copy on another rack, then you can construct a slightly more complicated rule to do that -- ask on ceph-users for advice.
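One common sketch of such a rule is shown below (the rule name, id and size limits are illustrative, not taken from this cluster; verify the mapping with crushtool --test before applying it):

rule replicated_2rack {
    id 2                                  # illustrative id; use an unused one
    type replicated
    min_size 2
    max_size 4
    step take default                     # start from the "default" root
    step choose firstn 2 type rack        # pick two racks
    step chooseleaf firstn 2 type host    # up to two hosts (one OSD each) in each rack
    step emit
}

With pool size 3, CRUSH selects the two racks, then up to two hosts in each, and uses the first three of the resulting OSDs, which places two copies in one rack and one in the other. A rule like this can be added by exporting the CRUSH map with ceph osd getcrushmap, decompiling and editing it with crushtool, then recompiling and injecting it back with ceph osd setcrushmap.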