Bug #64333

PG autoscaler tuning => catastrophic ceph cluster crash

Added by Loïc Dachary 3 months ago. Updated about 2 months ago.

Status:
Pending Backport
Priority:
High
Category:
EC Pools
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
backport_processed
Backport:
quincy,reef,squid
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Posting this report on behalf of a Ceph user. They will follow up if there are any questions.


After deploying some monitoring on the Ceph cluster nodes, we finally started the benchmark suite on the afternoon of Friday 2024-01-26. While doing so, we did a quick review of the Ceph pool settings for the shards/shards-data rbd pool, on which we had started to ingest images with winery.

During the review we noticed that the shards-data pool had very few PGs (64), which kept most OSDs idle, or at least very unevenly loaded. As the autoscaler was enabled, we decided to just go ahead and set the "bulk" flag on the `shards-data` pool to let the autoscaler scale up the number of PGs.
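For context, a minimal sketch of the change that triggered this, assuming the pool is named `shards-data` (the exact invocation is not recorded in this report):

    # Inspect the autoscaler's current targets
    ceph osd pool autoscale-status

    # Mark the pool as bulk so the autoscaler sizes it for its expected final capacity
    ceph osd pool set shards-data bulk true

    # Check the PG count / target picked by the autoscaler
    ceph osd pool get shards-data pg_num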

The autoscaler immediately moved the pool to 4096 PGs and started the data movement process.

As soon as the reallocation started, 10-15% of the OSDs crashed hard. The crash appears to be persistent (the OSDs crash again as soon as systemd restarts them), so we consider the data lost and the cluster unavailable.
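A sketch of how such a crash loop can be inspected on an affected node; the OSD id and crash id below are placeholders, and the systemd unit name assumes a non-containerized deployment:

    # Watch systemd repeatedly restart (and fail) a crashed OSD
    journalctl -u ceph-osd@42.service -f

    # List recorded crashes and inspect one of them
    ceph crash ls
    ceph crash info <crash_id>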

Remedial steps attempted (some of them happened multiple times, so the order isn't guaranteed; approximate commands are sketched after this list):
- manual restart of the OSDs that were disabled by systemd after consecutive crashes
  - no difference, apparently the crash is persistent
- review of similar upstream tickets:
  - https://tracker.ceph.com/issues/53584
  - https://tracker.ceph.com/issues/55662
- attempt to set osd_read_ec_check_for_errors = true on all OSDs
  - no mitigation of the crash
- revert of the bulk flag on the pool
  - autoscaler target config moved back to 64 PGs
  - no impact on data availability after restarting the crashed OSDs
- ceph osd set noout
  - stabilized the number of crashed OSDs (as no new reallocations are happening)
  - no revival of dead OSDs after restarting them
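For reference, a minimal sketch of the commands behind these steps, assuming the pool name `shards-data`, an illustrative OSD id 42, and a non-containerized deployment (exact invocations and ordering varied):

    # Clear systemd's failure state and restart a crashed OSD (id 42 is illustrative)
    systemctl reset-failed ceph-osd@42.service
    systemctl restart ceph-osd@42.service

    # Set osd_read_ec_check_for_errors on all OSDs
    ceph config set osd osd_read_ec_check_for_errors true

    # Revert the bulk flag, moving the autoscaler target back to 64 PGs
    ceph osd pool set shards-data bulk false

    # Stop further automatic rebalancing while OSDs are down
    ceph osd set noout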

All the current diagnostic information is dumped below:

ceph status: https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-status-2024-01-29-143117.txt

  cluster:
    id:     e0a98ad0-fd1f-4079-894f-ed4554ce40c6
    health: HEALTH_ERR
            noout flag(s) set
            25 osds down
            7055371 scrub errors
            Reduced data availability: 138 pgs inactive, 103 pgs down
            Possible data damage: 30 pgs inconsistent
            Degraded data redundancy: 1797720/26981188 objects degraded (6.663%), 47 pgs degraded, 130 pgs undersized
            49 daemons have recently crashed

  services:
    mon: 3 daemons, quorum dwalin001,dwalin003,dwalin002 (age 2d)
    mgr: dwalin003(active, since 2d), standbys: dwalin001, dwalin002
    osd: 240 osds: 190 up (since 7h), 215 in (since 2d); 73 remapped pgs
         flags noout

  data:
    pools:   6 pools, 389 pgs
    objects: 3.85M objects, 15 TiB
    usage:   18 TiB used, 2.0 PiB / 2.0 PiB avail
    pgs:     35.476% pgs not active
             1797720/26981188 objects degraded (6.663%)
             134 active+clean
             73  down+remapped
             62  active+undersized
             29  down
             29  active+undersized+degraded
             22  active+clean+inconsistent
             21  undersized+peered
             11  undersized+degraded+peered
             4   active+undersized+degraded+inconsistent
             3   undersized+degraded+inconsistent+peered
             1   down+inconsistent

ceph report: https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-report-2024-01-29-152825.txt
ceph health detail: https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-health-detail-2024-01-29-143133.txt
ceph crash ls: https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-crash-ls-2024-01-29-143402.txt
full logs (1.1 GB compressed, 31 GB uncompressed): https://krkr.eu/tmp/2024-01-29-O7KIXM08Qls/ceph-crash-2024-01-26.tar.zst
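The dumps linked above can be reproduced with the standard reporting commands, roughly:

    ceph status
    ceph health detail
    ceph report
    ceph crash ls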


Related issues 3 (3 open, 0 closed)

Copied to RADOS - Backport #65119: quincy: PG autoscaler tuning => catastrophic ceph cluster crash (New, assigned to Radoslaw Zarzynski)
Copied to RADOS - Backport #65120: squid: PG autoscaler tuning => catastrophic ceph cluster crash (New, assigned to Radoslaw Zarzynski)
Copied to RADOS - Backport #65121: reef: PG autoscaler tuning => catastrophic ceph cluster crash (New, assigned to Radoslaw Zarzynski)