Bug #24224
The cluster does not go into the OK state
Status: Closed
Description
I have a test Ceph cluster in a virtual environment.
  cluster:
    id:     22d6464d-f137-423e-b8aa-bec5e9219755
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cn1,cn2,cn3
    mgr: cn1(active)
    osd: 12 osds: 12 up, 12 in

  data:
    pools:   1 pools, 256 pgs
    objects: 5 objects, 487 kB
    usage:   61914 MB used, 60845 MB / 119 GB avail
    pgs:     256 active+clean
$ ceph versions
{ "mon": { "ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)": 3 }, "mgr": { "ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)": 1 }, "osd": { "ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)": 12 }, "mds": {}, "overall": { "ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)": 16 } }
I created a hierarchy using the following commands:
$ ceph osd crush add-bucket rack1 rack
$ ceph osd crush add-bucket rack2 rack
$ ceph osd crush move cn1 rack=rack1
...
Final configuration:

$ ceph osd tree
ID  CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
 -1       0.11755 root default
-15       0.05878     rack rack1
 -2       0.01959         host cn1
  0   hdd 0.00980             osd.0   up  1.00000 1.00000
  1   hdd 0.00980             osd.1   up  1.00000 1.00000
 -4       0.01959         host cn3
  4   hdd 0.00980             osd.4   up  1.00000 1.00000
  5   hdd 0.00980             osd.5   up  1.00000 1.00000
 -6       0.01959         host cn5
  8   hdd 0.00980             osd.8   up  1.00000 1.00000
  9   hdd 0.00980             osd.9   up  1.00000 1.00000
-16       0.05878     rack rack2
 -3       0.01959         host cn2
  2   hdd 0.00980             osd.2   up  1.00000 1.00000
  3   hdd 0.00980             osd.3   up  1.00000 1.00000
 -5       0.01959         host cn4
  6   hdd 0.00980             osd.6   up  1.00000 1.00000
  7   hdd 0.00980             osd.7   up  1.00000 1.00000
 -7       0.01959         host cn6
 10   hdd 0.00980             osd.10  up  1.00000 1.00000
 11   hdd 0.00980             osd.11  up  1.00000 1.00000
The current rule, "replicated_ruleset", uses host as the failure domain (type: host).
I created a new rule with "rack" as the failure domain:
$ ceph osd crush rule create-replicated RackStar default rack
$ ceph osd pool set rbd crush_rule RackStar
set pool 0 crush_rule to RackStar
$ ceph osd dump | grep rule
pool 0 'rbd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 256 pgp_num 256 last_change 291 flags hashpspool stripe_width 0 application rbd
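For reference, a rule created with "ceph osd crush rule create-replicated RackStar default rack" should decompile to roughly the following (a sketch based on the Luminous defaults; the exact id and min/max_size may differ on a given cluster). It asks CRUSH to pick one leaf OSD under a distinct rack for every replica:

rule RackStar {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default                    # start from the "default" root
    step chooseleaf firstn 0 type rack   # one leaf OSD under a different rack per replica
    step emit
}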
My cluster does not go into the "HEALTH_OK" state. I rebooted all the servers in my cluster, but the problem does not disappear.
  cluster:
    id:     22d6464d-f137-423e-b8aa-bec5e9219755
    health: HEALTH_WARN
            2/6 objects misplaced (33.333%)

  services:
    mon: 3 daemons, quorum cn1,cn2,cn3
    mgr: cn1(active)
    osd: 12 osds: 12 up, 12 in; 256 remapped pgs

  data:
    pools:   1 pools, 256 pgs
    objects: 2 objects, 19 bytes
    usage:   61918 MB used, 60841 MB / 119 GB avail
    pgs:     2/6 objects misplaced (33.333%)
             256 active+clean+remapped
This video shows the problem in more detail: https://youtu.be/UtM7vItjsWY
Updated by John Spray almost 6 years ago
- Status changed from New to Closed
Thanks for the comprehensive information. In this case, you've created a rule that requires each copy to be on a separate rack, and a pool that requires three copies. However, you only have two racks, so Ceph can't satisfy that -- you'd need a third rack to have three copies on separate racks.
If what you really want is two copies on one rack and one copy on another rack, then you can construct a slightly more complicated rule to do that -- ask on ceph-users for advice.
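One common sketch of such a rule is shown below (the rule name, id and size limits are illustrative, not taken from this cluster; verify the mapping with crushtool --test before applying it):

rule replicated_2rack {
    id 2                                  # illustrative id; use an unused one
    type replicated
    min_size 2
    max_size 4
    step take default                     # start from the "default" root
    step choose firstn 2 type rack        # pick two racks
    step chooseleaf firstn 2 type host    # up to two hosts (one OSD each) in each rack
    step emit
}

With pool size 3, CRUSH selects the two racks, then up to two hosts in each, and uses the first three of the resulting OSDs, which places two copies in one rack and one in the other. A rule like this can be added by exporting the CRUSH map with ceph osd getcrushmap, decompiling and editing it with crushtool, then recompiling and injecting it back with ceph osd setcrushmap.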