Bug #55726


Drained OSDs are still ACTING_PRIMARY - causing high IO latency on clients

Added by Denis Polom almost 2 years ago. Updated almost 2 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: Performance/Resource Usage
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi

I have observed high latencies and hanging mount points while draining an OSD ever since the Octopus release, and the problem is still present on the latest Pacific.

Cluster setup:

Ceph Pacific 16.2.7

CephFS with an EC data pool

EC profile setup:

crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=10
m=2
plugin=jerasure
technique=reed_sol_van
w=8
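
For reference, the profile above can be dumped from a running cluster like this; the profile name ec_10_2 is taken from the pool listing later in this ticket, so adjust it if it differs:

ceph osd erasure-code-profile ls          # list the defined EC profiles
ceph osd erasure-code-profile get ec_10_2 # show k, m, plugin, etc. for the data pool's profile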

Description:

If we have a broken drive, we remove it from the Ceph cluster by draining it first, which means changing its CRUSH weight to 0:

ceph osd crush reweight osd.1 0
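
For context, the whole procedure we follow looks roughly like this (osd.1 is just an example id; the stop/purge steps assume a non-containerized deployment):

ceph osd crush reweight osd.1 0               # drain: stop mapping new data onto the OSD
ceph -s                                       # wait here until backfill/recovery has finished
ceph osd out osd.1                            # mark the OSD out
systemctl stop ceph-osd@1                     # stop the daemon on its host
ceph osd purge osd.1 --yes-i-really-mean-it   # remove it from the CRUSH and OSD maps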

Normally, on Nautilus this did not affect clients. But after the upgrade to
Octopus (and from Octopus up to the current Pacific release) I can observe
very high IO latencies on clients while an OSD is being drained (10 seconds
and higher).

By debugging I found out that the drained OSD is still listed as
ACTING_PRIMARY, and that this happens only on EC pools and only since Octopus.
To be sure, I tested again on Nautilus, where the behavior is correct and the
drained OSD is no longer listed in the UP and ACTING sets of the PGs.

Even setting the primary-affinity of the given OSD to 0 has no effect on the
EC pool.
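
For anyone reproducing this, the checks behind the dumps below boil down to something like the following (osd.70 and pg 16.1fff are examples taken from my output):

ceph pg dump pgs | grep ^16.1fff     # full row, including UP_PRIMARY and ACTING_PRIMARY
ceph pg map 16.1fff                  # up and acting sets for the single PG
ceph osd primary-affinity osd.70 0   # the primary-affinity change that has no effect here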

Below are my debug outputs:

Buggy behavior on Octopus and Pacific:

  • Before draining osd.70:
    PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  DISK_LOG  STATE  STATE_STAMP  VERSION  REPORTED  UP  UP_PRIMARY  ACTING  ACTING_PRIMARY  LAST_SCRUB  SCRUB_STAMP  LAST_DEEP_SCRUB  DEEP_SCRUB_STAMP  SNAPTRIMQ_LEN
    16.1fff  2269  0  0  0  0  8955297727  0  0  2449  2449  active+clean  2022-05-19T08:41:55.241734+0200  19403690'275685  19407588:19607199  [70,206,216,375,307,57]  70  [70,206,216,375,307,57]  70  19384365'275621  2022-05-19T08:41:55.241493+0200  19384365'275621  2022-05-19T08:41:55.241493+0200  0
    dumped pgs
    
  • after setting osd.70 crush weight to 0 (osd.70 is still acting primary):
    PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  DISK_LOG  STATE  STATE_STAMP  VERSION  REPORTED  UP  UP_PRIMARY  ACTING  ACTING_PRIMARY  LAST_SCRUB  SCRUB_STAMP  LAST_DEEP_SCRUB  DEEP_SCRUB_STAMP  SNAPTRIMQ_LEN
    16.1fff  2269  0  0  2269  0  8955297727  0  0  2449  2449  active+remapped+backfill_wait  2022-05-20T08:51:54.249071+0200  19403690'275685  19407668:19607289  [71,206,216,375,307,57]  71  [70,206,216,375,307,57]  70  19384365'275621  2022-05-19T08:41:55.241493+0200  19384365'275621  2022-05-19T08:41:55.241493+0200  0
    dumped pgs
    

Correct behavior on Nautilus:

  • Before draining osd.10:
    PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  DISK_LOG  STATE  STATE_STAMP  VERSION  REPORTED  UP  UP_PRIMARY  ACTING  ACTING_PRIMARY  LAST_SCRUB  SCRUB_STAMP  LAST_DEEP_SCRUB  DEEP_SCRUB_STAMP  SNAPTRIMQ_LEN
    2.4e  2  0  0  0  0  8388608  0  0  2  2  active+clean  2022-05-20 02:13:47.432104  61'2  75:40  [10,0,7]  10  [10,0,7]  10  0'0  2022-05-20 01:44:36.217286  0'0  2022-05-20 01:44:36.217286  0
    
  • after setting osd.10 crush weight to 0 (behavior is correct, osd.10 is not listed, not used):
    PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  DISK_LOG  STATE  STATE_STAMP  VERSION  REPORTED  UP  UP_PRIMARY  ACTING  ACTING_PRIMARY  LAST_SCRUB  SCRUB_STAMP  LAST_DEEP_SCRUB  DEEP_SCRUB_STAMP  SNAPTRIMQ_LEN
    2.4e  14  0  0  0  0  58720256  0  0  18  18  active+clean  2022-05-20 02:18:59.414812  75'18  80:43  [22,0,7]  22  [22,0,7]  22  0'0  2022-05-20 01:44:36.217286  0'0  2022-05-20 01:44:36.217286  0
    
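The same comparison can be reproduced more quickly per OSD instead of per PG, for example (osd ids as in the dumps above):

ceph pg ls-by-primary osd.70   # stays non-empty on Octopus/Pacific while osd.70 is being drained
ceph pg ls-by-osd osd.10       # on Nautilus this empties out once backfill has completed
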
Actions #1

Updated by Ilya Dryomov almost 2 years ago

  • Project changed from rbd to RADOS
  • Category set to Performance/Resource Usage
Actions #3

Updated by Radoslaw Zarzynski almost 2 years ago

  • Status changed from New to Need More Info

It would be really helpful to compare logs around choose_acting from Nautilus vs Octopus.
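
For reference, one way to capture those messages (the choose_acting lines need debug_osd at 10 or higher; osd.70 and the log path are examples assuming a default setup):

ceph tell osd.70 config set debug_osd 20/20        # raise the OSD debug level at runtime
ceph osd crush reweight osd.70 0                   # trigger the peering change
grep choose_acting /var/log/ceph/ceph-osd.70.log   # peering decisions, including acting-set selection
ceph tell osd.70 config set debug_osd 1/5          # restore the default afterwards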

Actions #4

Updated by Denis Polom almost 2 years ago

Hi,

I set debug mode on the OSDs and MONs but didn't find the string 'choose_acting'.

Also, I found that our EC profile is

crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=10
m=2
plugin=jerasure
technique=reed_sol_van
w=8

but the pool min_size is 10 (equal to k, while the documented default for EC pools is k + 1 = 11).

In the docs I found:
the number of data chunks, that is the number of chunks into which the original object is divided. For example, if k = 2 a 10 kB object will be divided into k objects of 5 kB each. The default min_size on erasure coded pools is k + 1. However, we recommend min_size to be k + 2 or more to prevent loss of writes and data.

Can it be the cause of this issue?
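
For reference, min_size can be checked and adjusted per pool roughly like this (pool name taken from the listing in #6 below; raising it to the documented default of k + 1 = 11 is only an example, not a confirmed fix for this issue):

ceph osd pool get cephfs_data min_size    # currently 10, i.e. equal to k
ceph osd pool set cephfs_data min_size 11 # documented default for k=10, m=2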

Actions #5

Updated by Radoslaw Zarzynski almost 2 years ago

Could you please provide the output from ceph osd lspools as well?

Actions #6

Updated by Denis Polom almost 2 years ago

pool 1 'cephfs_data' erasure profile ec_10_2 size 12 min_size 10 crush_rule 1 object_hash rjenkins pg_num 8192 pgp_num 8192 autoscale_mode warn last_change 13991755 lfor 0/77222/150462 flags hashpspool,ec_overwrites,nearfull stripe_width 40960 deep_scrub_interval 4.592e+06 scrub_max_interval 4.2096e+06 scrub_min_interval 604800 application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 12989651 lfor 0/0/12318966 flags hashpspool stripe_width 0 deep_scrub_interval 4.592e+06 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 scrub_max_interval 4.2096e+06 scrub_min_interval 604800 application cephfs
pool 3 'device_health_metrics' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode warn last_change 13991755 flags hashpspool,nearfull stripe_width 0 deep_scrub_interval 4.592e+06 pg_num_min 1 scrub_max_interval 4.2096e+06 scrub_min_interval 604800 application mgr_devicehealth