Bug #55726
Drained OSDs are still ACTING_PRIMARY - causing high IO latency on clients
Description
Hi
I have observed high latencies and hanging mount points while draining an OSD ever since the Octopus release, and the issue is still present on the latest Pacific.
Cluster setup:
Ceph Pacific 16.2.7
Cephfs with EC data pool
EC profile setup:
crush-device-class= crush-failure-domain=host crush-root=default jerasure-per-chunk-alignment=false k=10 m=2 plugin=jerasure technique=reed_sol_van w=8
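For reference, the profile dump above can be reproduced with the erasure-code-profile commands; a minimal sketch, assuming the profile name ec_10_2 that appears in the pool listing further down in this thread:
ceph osd erasure-code-profile ls             # list the profiles defined in the cluster
ceph osd erasure-code-profile get ec_10_2    # dump the k/m/plugin/failure-domain settings shown above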
Description:
If we have a broken drive, we remove it from the Ceph cluster by draining it first, i.e. setting its CRUSH weight to 0:
ceph osd crush reweight osd.1 0
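A typical drain-and-remove sequence looks roughly like this (a sketch using osd.1 from the example above; exact steps vary by site procedure):
ceph osd crush reweight osd.1 0               # start draining: PGs are remapped away from the OSD
ceph -s                                       # wait until backfill/recovery of the remapped PGs finishes
ceph osd out osd.1                            # then mark the OSD out
ceph osd purge osd.1 --yes-i-really-mean-it   # and remove it from the CRUSH map, OSD map and auth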
On Nautilus this normally did not affect clients. But since the upgrade to Octopus (and on every release since, up to the current Pacific) I observe very high IO latencies on clients (10 seconds and higher) while an OSD is being drained.
While debugging I found that the drained OSD is still listed as the ACTING_PRIMARY, and that this happens only on EC pools and only since Octopus. To be sure, I tested again on Nautilus, where the behavior is correct and the drained OSD is no longer listed in the UP or ACTING sets of its PGs.
Even setting the primary-affinity of the given OSD to 0 has no effect on the EC pool.
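For reference, this is roughly how the acting primary can be checked and the affinity workaround attempted (illustrative; osd.70 and PG 16.1fff are taken from the dumps below):
ceph pg map 16.1fff                  # show the up/acting sets and their primaries for one PG
ceph pg dump pgs | grep ^16.1fff     # the same fields from the full PG dump
ceph osd primary-affinity osd.70 0   # lower the chance osd.70 is chosen as primary (no effect here on the EC pool)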
Below are my debug outputs:
Buggy behavior on Octopus and Pacific:
- Before draining osd.70:
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
16.1fff 2269 0 0 0 0 8955297727 0 0 2449 2449 active+clean 2022-05-19T08:41:55.241734+0200 19403690'275685 19407588:19607199 [70,206,216,375,307,57] 70 [70,206,216,375,307,57] 70 19384365'275621 2022-05-19T08:41:55.241493+0200 19384365'275621 2022-05-19T08:41:55.241493+0200 0
dumped pgs
- After setting osd.70 crush weight to 0 (osd.70 is still the acting primary):
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
16.1fff 2269 0 0 2269 0 8955297727 0 0 2449 2449 active+remapped+backfill_wait 2022-05-20T08:51:54.249071+0200 19403690'275685 19407668:19607289 [71,206,216,375,307,57] 71 [70,206,216,375,307,57] 70 19384365'275621 2022-05-19T08:41:55.241493+0200 19384365'275621 2022-05-19T08:41:55.241493+0200 0
dumped pgs
Correct behavior on Nautilus:
- Before draining osd.10:
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
2.4e 2 0 0 0 0 8388608 0 0 2 2 active+clean 2022-05-20 02:13:47.432104 61'2 75:40 [10,0,7] 10 [10,0,7] 10 0'0 2022-05-20 01:44:36.217286 0'0 2022-05-20 01:44:36.217286 0
- After setting osd.10 crush weight to 0 (behavior is correct, osd.10 is no longer listed or used):
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
2.4e 14 0 0 0 0 58720256 0 0 18 18 active+clean 2022-05-20 02:18:59.414812 75'18 80:43 [22,0,7] 22 [22,0,7] 22 0'0 2022-05-20 01:44:36.217286 0'0 2022-05-20 01:44:36.217286 0
Updated by Ilya Dryomov almost 2 years ago
- Project changed from rbd to RADOS
- Category set to Performance/Resource Usage
Updated by Radoslaw Zarzynski almost 2 years ago
- Status changed from New to Need More Info
It would be really helpful to compare logs around choose_acting
from Nautilus vs Octopus.
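For context, such peering logs are usually gathered by raising the OSD debug level and searching the OSD log; a rough sketch, assuming osd.70 from the report above and the default log path:
ceph tell osd.70 config set debug_osd 20           # raise the debug level on the running OSD
# drain the OSD again (crush reweight to 0) so its PGs repeer, then:
grep choose_acting /var/log/ceph/ceph-osd.70.log   # acting-set selection is logged by choose_acting at debug >= 10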
Updated by Denis Polom almost 2 years ago
Hi,
I set debug mode on the OSDs and MONs but didn't find the string 'choose_acting'.
I also noticed that our EC profile is
crush-device-class= crush-failure-domain=host crush-root=default jerasure-per-chunk-alignment=false k=10 m=2 plugin=jerasure technique=reed_sol_van w=8
but the pool min_size is 10.
In the docs I found:
the number of data chunks, that is the number of chunks into which the original object is divided. For example, if k = 2 a 10 kB object will be divided into k objects of 5 kB each. The default min_size on erasure coded pools is k + 1. However, we recommend min_size to be k + 2 or more to prevent loss of writes and data.
Can it be the cause of this issue?
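For reference, the current value can be checked, and raised to the documented k+1, roughly as follows (a sketch; cephfs_data is the EC data pool named in the listing below, and changing min_size on a production pool should be considered carefully):
ceph osd pool get cephfs_data min_size      # currently 10, i.e. equal to k
ceph osd pool set cephfs_data min_size 11   # k+1, the documented default for erasure-coded pools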
Updated by Radoslaw Zarzynski almost 2 years ago
Could you please provide the output from ceph osd lspools
as well?
Updated by Denis Polom almost 2 years ago
pool 1 'cephfs_data' erasure profile ec_10_2 size 12 min_size 10 crush_rule 1 object_hash rjenkins pg_num 8192 pgp_num 8192 autoscale_mode warn last_change 13991755 lfor 0/77222/150462 flags hashpspool,ec_overwrites,nearfull stripe_width 40960 deep_scrub_interval 4.592e+06 scrub_max_interval 4.2096e+06 scrub_min_interval 604800 application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 12989651 lfor 0/0/12318966 flags hashpspool stripe_width 0 deep_scrub_interval 4.592e+06 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 scrub_max_interval 4.2096e+06 scrub_min_interval 604800 application cephfs
pool 3 'device_health_metrics' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode warn last_change 13991755 flags hashpspool,nearfull stripe_width 0 deep_scrub_interval 4.592e+06 pg_num_min 1 scrub_max_interval 4.2096e+06 scrub_min_interval 604800 application mgr_devicehealth