Bug #18693 (closed): min_size = k + 1

Added by Muthusamy Muthiah about 7 years ago. Updated almost 7 years ago.

Status: Won't Fix
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 2 - major

Description

Installed version: v11.2.0
Ceph cluster: 5-node cluster with 12 disks (6 TB each) per node, bluestore enabled, with EC 4+1

Issue:

When an OSD is down, peering does not happen and the ceph health status moves to the ERR state after a few minutes. This worked in previous releases (up to jewel, with filestore).

We also noted that when storage utilisation is below 5%, peering happens immediately when the OSD moves out of the cluster (screenshot of ceph status attached); however, once storage utilisation goes beyond 5%, the cluster moves to the ERR state after the OSD is out of the cluster.
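
For reference, the stuck PGs and the reason for the ERR state can be listed with the standard CLI (no site-specific names assumed):

# Show which PGs are stuck inactive and what the health error is
ceph health detail
ceph pg dump_stuck inactive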

Following is our ceph configuration:

mon_osd_down_out_interval = 30
mon_osd_report_timeout = 30
mon_osd_down_out_subtree_limit = host
mon_osd_reporter_subtree_level = host

The recovery parameters are set to their defaults.
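
The effective values can be confirmed on a running monitor via its admin socket (run on the node hosting that mon; the mon name ca-cn1 is just this cluster's example):

# Check the value the monitor is actually using
ceph daemon mon.ca-cn1 config get mon_osd_down_out_interval
ceph daemon mon.ca-cn1 config get mon_osd_down_out_subtree_limit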

[root@ca-cn1 ceph]# ceph osd crush show-tunables

{
"choose_local_tries": 0,
"choose_local_fallback_tries": 0,
"choose_total_tries": 50,
"chooseleaf_descend_once": 1,
"chooseleaf_vary_r": 1,
"chooseleaf_stable": 1,
"straw_calc_version": 1,
"allowed_bucket_algs": 54,
"profile": "jewel",
"optimal_tunables": 1,
"legacy_tunables": 0,
"minimum_required_version": "jewel",
"require_feature_tunables": 1,
"require_feature_tunables2": 1,
"has_v2_rules": 1,
"require_feature_tunables3": 1,
"has_v3_rules": 0,
"has_v4_buckets": 0,
"require_feature_tunables5": 1,
"has_v5_rules": 0
}

ceph status:

health HEALTH_ERR
173 pgs are stuck inactive for more than 300 seconds
173 pgs incomplete
173 pgs stuck inactive
173 pgs stuck unclean
monmap e2: 5 mons at {ca-cn1=10.50.5.117:6789/0,ca-cn2=10.50.5.118:6789/0,ca-cn3=10.50.5.119:6789/0,ca-cn4=10.50.5.120:6789/0,ca-cn5=10.50.5.121:6789/0}
election epoch 106, quorum 0,1,2,3,4 ca-cn1,ca-cn2,ca-cn3,ca-cn4,ca-cn5
mgr active: ca-cn1 standbys: ca-cn2, ca-cn4, ca-cn5, ca-cn3
osdmap e1128: 60 osds: 59 up, 59 in; 173 remapped pgs
flags sortbitwise,require_jewel_osds,require_kraken_osds
pgmap v782747: 2048 pgs, 1 pools, 63133 GB data, 46293 kobjects
85199 GB used, 238 TB / 322 TB avail
1868 active+clean
173 remapped+incomplete
7 active+clean+scrubbing

MON log:

2017-01-20 09:25:54.715684 7f55bcafb700 0 log_channel(cluster) log [INF] : osd.54 out (down for 31.703786)
2017-01-20 09:25:54.725688 7f55bf4d5700 0 mon.ca-cn1@0(leader).osd e1120 crush map has features 288250512065953792, adjusting msgr requires
2017-01-20 09:25:54.729019 7f55bf4d5700 0 log_channel(cluster) log [INF] : osdmap e1120: 60 osds: 59 up, 59 in
2017-01-20 09:25:54.735987 7f55bf4d5700 0 log_channel(cluster) log [INF] : pgmap v781993: 2048 pgs: 1869 active+clean, 173 incomplete, 6 active+clean+scrubbing; 63159 GB data, 85201 GB used, 238 TB / 322 TB avail; 21825 B/s rd, 163 MB/s wr, 2046 op/s
2017-01-20 09:25:55.737749 7f55bf4d5700 0 mon.ca-cn1@0(leader).osd e1121 crush map has features 288250512065953792, adjusting msgr requires
2017-01-20 09:25:55.744338 7f55bf4d5700 0 log_channel(cluster) log [INF] : osdmap e1121: 60 osds: 59 up, 59 in
2017-01-20 09:25:55.749616 7f55bf4d5700 0 log_channel(cluster) log [INF] : pgmap v781994: 2048 pgs: 29 remapped+incomplete, 1869 active+clean, 144 incomplete, 6 active+clean+scrubbing; 63159 GB data, 85201 GB used, 238 TB / 322 TB avail; 44503 B/s rd, 45681 kB/s wr, 518 op/s
2017-01-20 09:25:56.768721 7f55bf4d5700 0 log_channel(cluster) log [INF] : pgmap v781995: 2048 pgs: 47 remapped+incomplete, 1869 active+clean, 126 incomplete, 6 active+clean+scrubbing; 63159 GB data, 85201 GB used, 238 TB / 322 TB avail; 20275 B/s rd, 72742 kB/s wr, 665 op/s

Let us know if any further logs are required.

Thanks,
Muthu


Files

recovery-peering.JPG (84.7 KB) - ceph peering when storage utilisation is low - Muthusamy Muthiah, 01/27/2017 06:47 AM
#1

Updated by Muthusamy Muthiah about 7 years ago

Hi All,

Also tried EC profile 3+1 on a 5-node cluster with bluestore enabled. When an OSD is down, the cluster goes to the ERROR state even though the cluster is n+1. No recovery happens.

health HEALTH_ERR
75 pgs are stuck inactive for more than 300 seconds
75 pgs incomplete
75 pgs stuck inactive
75 pgs stuck unclean
monmap e2: 5 mons at {ca-cn1=10.50.5.117:6789/0,ca-cn2=10.50.5.118:6789/0,ca-cn3=10.50.5.119:6789/0,ca-cn4=10.50.5.120:6789/0,ca-cn5=10.50.5.121:6789/0}
election epoch 10, quorum 0,1,2,3,4 ca-cn1,ca-cn2,ca-cn3,ca-cn4,ca-cn5
mgr active: ca-cn1 standbys: ca-cn4, ca-cn3, ca-cn5, ca-cn2
osdmap e264: 60 osds: 59 up, 59 in; 75 remapped pgs
flags sortbitwise,require_jewel_osds,require_kraken_osds
pgmap v119402: 1024 pgs, 1 pools, 28519 GB data, 21548 kobjects
39976 GB used, 282 TB / 322 TB avail
941 active+clean
75 remapped+incomplete
8 active+clean+scrubbing

This seems to be an issue with bluestore; recovery is not happening properly with EC.
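
To see what a stuck PG is actually blocked on, one of the remapped+incomplete PGs can be queried (the PG id below is a placeholder; pick one from ceph pg dump_stuck):

# The "recovery_state" section of the output shows why peering is blocked
ceph pg 2.1ff query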

Thanks,
Muthu

#2

Updated by Muthusamy Muthiah about 7 years ago

Hi All,

Following are the test outcomes with EC profiles (n = k + m):

1. Kraken filestore and bluestore with m=1: recovery does not start.
2. Jewel filestore and bluestore with m=1: recovery happens.
3. Kraken bluestore, all default configuration, with m=1: no recovery.
4. Kraken bluestore with m=2: recovery happens when one OSD is down and for two OSD failures.

So the issue seems to be in the ceph kraken release.
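
A minimal sketch of how the pool-creation side of this can be reproduced (profile and pool names below are illustrative, not from this cluster):

# Create an m=1 EC profile and a pool using it, then check the assigned min_size
ceph osd erasure-code-profile set test-k4m1 k=4 m=1
ceph osd pool create test_ec 128 128 erasure test-k4m1
ceph osd dump | grep test_ec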

Thanks,
Muthu

#3

Updated by Muthusamy Muthiah about 7 years ago

Hi All,

The problem is in kraken: when a pool is created with an EC profile, min_size equals the erasure size (k + m), so with m = 1 any single OSD failure already drops the PG below min_size.

For the 3+1 profile, following is the pool status:
pool 2 'cdvr_ec' erasure size 4 min_size 4 crush_ruleset 1 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 234 flags hashpspool stripe_width 4128

For the 4+1 profile:
pool 5 'cdvr_ec' erasure size 5 min_size 5 crush_ruleset 1 object_hash rjenkins pg_num 4096 pgp_num 4096

For the 3+2 profile:
pool 3 'cdvr_ec' erasure size 5 min_size 4 crush_ruleset 1 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 412 flags hashpspool stripe_width 4128

Whereas on the jewel release, for EC 4+1:
pool 30 'cdvr_ec' erasure size 5 min_size 4 crush_ruleset 1 object_hash rjenkins pg_num 4096 pgp_num 4096

Trying to modify min_size and verify the status.
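
For reference, the value can be read directly from the pool (pool name cdvr_ec is taken from the dumps above):

# Compare the min_size assigned at pool creation on kraken vs jewel
ceph osd pool get cdvr_ec min_size
ceph osd pool get cdvr_ec size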

Is there any reason behind this change in ceph kraken, or is it a bug?

Thanks,
Muthu

#4

Updated by Sage Weil almost 7 years ago

  • Subject changed from "Bluestore: v11.2.0 peering not happening when OSD is down" to "min_size = k + 1"
#5

Updated by Sage Weil almost 7 years ago

  • Status changed from New to Won't Fix

This is intentional. The aim is to mimic min_size of 2 for replication pools; that is, we won't write if a single additional failure would lose the update. You can adjust min_size in your environment if your preference is different.
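
A sketch of the adjustment referred to above, assuming the 4+1 pool from this report: lowering min_size to k restores I/O with one OSD down, at the cost that a write acknowledged with only k shards has no spare redundancy until recovery completes.

# Allow the EC 4+1 pool to serve I/O with only k (= 4) shards available
ceph osd pool set cdvr_ec min_size 4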
