Bug #64333

PG autoscaler tuning => catastrophic ceph cluster crash

Added by Loïc Dachary 3 months ago. Updated about 2 months ago.

Status:
Pending Backport
Priority:
High
Category:
EC Pools
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
backport_processed
Backport:
quincy,reef,squid
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Posting this report on behalf of a Ceph user. They will follow up if there are any questions.


After deploying some monitoring on the Ceph cluster nodes, we finally started the benchmark suite on the afternoon of Friday 2024-01-26. While doing so, we did a quick review of the Ceph pool settings for the shards/shards-data rbd pool, on which we had started to ingest images with winery.

During the review we noticed that the shards-data pool had very few PGs (64), which kept most OSDs idle, or at least very unevenly loaded. As the autoscaler was enabled, we decided to just go ahead and set the "bulk" flag on the `shards-data` pool to let the autoscaler scale up the number of PGs.
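For context, a minimal sketch of the change that triggered this, assuming the pool is named `shards-data` (the exact invocation is not recorded in this report):

    # Inspect the autoscaler's current targets
    ceph osd pool autoscale-status

    # Mark the pool as bulk so the autoscaler sizes it for its expected final capacity
    ceph osd pool set shards-data bulk true

    # Check the PG count / target picked by the autoscaler
    ceph osd pool get shards-data pg_num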

The autoscaler immediately moved the pool to 4096 PGs and started the data movement process.

As soon as the reallocation started, 10-15% of the OSDs crashed hard. The crash appears to be persistent (the OSDs crash again as soon as systemd restarts them), so we consider the data lost and the cluster unavailable.
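A sketch of how such a crash loop can be inspected on an affected node; the OSD id and crash id below are placeholders, and the systemd unit name assumes a non-containerized deployment:

    # Watch systemd repeatedly restart (and fail) a crashed OSD
    journalctl -u ceph-osd@42.service -f

    # List recorded crashes and inspect one of them
    ceph crash ls
    ceph crash info <crash_id>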

Remedial steps attempted (some of them happened multiple times, so the order isn't guaranteed; approximate commands are sketched after this list):
- manual restart of the OSDs that were disabled by systemd after consecutive crashes
  - no difference, apparently the crash is persistent
- review of similar upstream tickets:
  - https://tracker.ceph.com/issues/53584
  - https://tracker.ceph.com/issues/55662
- attempt to set osd_read_ec_check_for_errors = true on all OSDs
  - no mitigation of the crash
- revert of the bulk flag on the pool
  - autoscaler target config moved back to 64 PGs
  - no impact on data availability after restarting the crashed OSDs
- ceph osd set noout
  - stabilized the number of crashed OSDs (as no new reallocations are happening)
  - no revival of dead OSDs after restarting them
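For reference, a minimal sketch of the commands behind these steps, assuming the pool name `shards-data`, an illustrative OSD id 42, and a non-containerized deployment (exact invocations and ordering varied):

    # Clear systemd's failure state and restart a crashed OSD (id 42 is illustrative)
    systemctl reset-failed ceph-osd@42.service
    systemctl restart ceph-osd@42.service

    # Set osd_read_ec_check_for_errors on all OSDs
    ceph config set osd osd_read_ec_check_for_errors true

    # Revert the bulk flag, moving the autoscaler target back to 64 PGs
    ceph osd pool set shards-data bulk false

    # Stop further automatic rebalancing while OSDs are down
    ceph osd set noout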

All the current diagnostic information is dumped below:

ceph status: https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-status-2024-01-29-143117.txt

  cluster:
    id:     e0a98ad0-fd1f-4079-894f-ed4554ce40c6
    health: HEALTH_ERR
            noout flag(s) set
            25 osds down
            7055371 scrub errors
            Reduced data availability: 138 pgs inactive, 103 pgs down
            Possible data damage: 30 pgs inconsistent
            Degraded data redundancy: 1797720/26981188 objects degraded (6.663%), 47 pgs degraded, 130 pgs undersized
            49 daemons have recently crashed

  services:
    mon: 3 daemons, quorum dwalin001,dwalin003,dwalin002 (age 2d)
    mgr: dwalin003(active, since 2d), standbys: dwalin001, dwalin002
    osd: 240 osds: 190 up (since 7h), 215 in (since 2d); 73 remapped pgs
         flags noout

  data:
    pools:   6 pools, 389 pgs
    objects: 3.85M objects, 15 TiB
    usage:   18 TiB used, 2.0 PiB / 2.0 PiB avail
    pgs:     35.476% pgs not active
             1797720/26981188 objects degraded (6.663%)
             134 active+clean
             73  down+remapped
             62  active+undersized
             29  down
             29  active+undersized+degraded
             22  active+clean+inconsistent
             21  undersized+peered
             11  undersized+degraded+peered
             4   active+undersized+degraded+inconsistent
             3   undersized+degraded+inconsistent+peered
             1   down+inconsistent

ceph report: https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-report-2024-01-29-152825.txt
ceph health detail: https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-health-detail-2024-01-29-143133.txt
ceph crash ls: https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-crash-ls-2024-01-29-143402.txt
full logs (1.1 GB compressed, 31 GB uncompressed): https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-crash-2024-01-26.tar.zst
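The dumps linked above can be reproduced with the standard reporting commands, roughly:

    ceph status
    ceph health detail
    ceph report
    ceph crash ls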


Related issues 3 (3 open, 0 closed)

Copied to RADOS - Backport #65119: quincy: PG autoscaler tuning => catastrophic ceph cluster crash (New, assigned to Radoslaw Zarzynski)
Copied to RADOS - Backport #65120: squid: PG autoscaler tuning => catastrophic ceph cluster crash (New, assigned to Radoslaw Zarzynski)
Copied to RADOS - Backport #65121: reef: PG autoscaler tuning => catastrophic ceph cluster crash (New, assigned to Radoslaw Zarzynski)