Bug #39055: OSD's crash when specific PG is trying to backfill - RADOS - Ceph

Actions

Copy link

Bug #39055

open

OSD's crash when specific PG is trying to backfill

Added by Alex Tijhuis about 5 years ago. Updated about 5 years ago.

Status:

New

Priority:

Normal

Assignee:

Category:

Target version:

Ceph - v12.2.12

% Done:

Source:

other

Tags:

osd flapping pg backfill

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

Ceph - v12.2.11

ceph-qa-suite:

Component(RADOS):

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Hi,

I've got a peculiar issue whereby a specific PG is trying to backfill it's objects to the other peers, but the moment it tries to do so on some specific objects within the placement group, the other OSD's it's trying to backfill it to, crash (simultaneously).

The PG in question:

ceph pg 1.3e4 query | head -n 30
{
    "state": "undersized+degraded+remapped+backfill_wait+peered",
    "snap_trimq": "[130~1,13d~1]",
    "snap_trimq_len": 2,
    "epoch": 41343,
    "up": [
        8,
        1,
        25
    ],
    "acting": [
        16
    ],
    "backfill_targets": [
        "1",
        "8",
        "25" 
    ],
    "actingbackfill": [
        "1",
        "8",
        "16",
        "25" 
    ],

In this case above, it's crashing targets 1, 8 and 25 simultaneously the moment it tries to backfill.
As soon as they're crashed, Ceph goes in recovery mode, the OSD's come back online again after about 20 seconds and as soon as Ceph tries to recover/backfill the same PG again, it's all starting over again like clockwork.

Initially thought has HDD issues, so have removed the original target drives, but no change. Have taken the original owner of the PG out, still no change. (hence why in the example above, the "acting" is not one of the targets)

A snippet of the main Ceph log file, showing osd.8 and osd.23 flapping, and PG 1.3e4 starting the backfill process after which they go down again:

2019-03-31 06:26:02.406944 mon.prox1 mon.0 192.168.1.81:6789/0 744951 : cluster [INF] osd.8 failed (root=default,host=prox7) (connection refused reported by osd.0)
2019-03-31 06:26:02.409424 mon.prox1 mon.0 192.168.1.81:6789/0 744969 : cluster [INF] osd.23 failed (root=default,host=prox6) (connection refused reported by osd.6)
2019-03-31 06:26:03.324917 mon.prox1 mon.0 192.168.1.81:6789/0 745198 : cluster [WRN] Health check failed: 2 osds down (OSD_DOWN)
2019-03-31 06:26:03.325094 mon.prox1 mon.0 192.168.1.81:6789/0 745199 : cluster [WRN] Health check update: 489/5747322 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:26:04.550642 mon.prox1 mon.0 192.168.1.81:6789/0 745201 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
2019-03-31 06:26:06.186950 mon.prox1 mon.0 192.168.1.81:6789/0 745203 : cluster [WRN] Health check update: Degraded data redundancy: 586/5747346 objects degraded (0.010%), 15 pgs degraded, 1 pg undersized (PG_DEGRADED)
2019-03-31 06:26:08.543045 mon.prox1 mon.0 192.168.1.81:6789/0 745204 : cluster [WRN] Health check update: 3668/5747490 objects misplaced (0.064%) (OBJECT_MISPLACED)
2019-03-31 06:26:10.873192 mon.prox1 mon.0 192.168.1.81:6789/0 745206 : cluster [WRN] Health check update: Reduced data availability: 2 pgs inactive (PG_AVAILABILITY)
2019-03-31 06:26:11.187406 mon.prox1 mon.0 192.168.1.81:6789/0 745207 : cluster [WRN] Health check update: Degraded data redundancy: 453462/5747499 objects degraded (7.890%), 297 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:16.187967 mon.prox1 mon.0 192.168.1.81:6789/0 745208 : cluster [WRN] Health check update: 3668/5747499 objects misplaced (0.064%) (OBJECT_MISPLACED)
2019-03-31 06:26:34.987288 mon.prox1 mon.0 192.168.1.81:6789/0 745210 : cluster [WRN] Health check update: 3668/5747502 objects misplaced (0.064%) (OBJECT_MISPLACED)
2019-03-31 06:26:34.987356 mon.prox1 mon.0 192.168.1.81:6789/0 745211 : cluster [WRN] Health check update: Degraded data redundancy: 453462/5747502 objects degraded (7.890%), 297 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:40.280592 mon.prox1 mon.0 192.168.1.81:6789/0 745214 : cluster [WRN] Health check update: 1 osds down (OSD_DOWN)
2019-03-31 06:26:40.430710 mon.prox1 mon.0 192.168.1.81:6789/0 745215 : cluster [INF] osd.23 192.168.1.86:6817/3115652 boot
2019-03-31 06:26:41.190103 mon.prox1 mon.0 192.168.1.81:6789/0 745217 : cluster [WRN] Health check update: 3668/5747508 objects misplaced (0.064%) (OBJECT_MISPLACED)
2019-03-31 06:26:41.190179 mon.prox1 mon.0 192.168.1.81:6789/0 745218 : cluster [WRN] Health check update: Degraded data redundancy: 453462/5747508 objects degraded (7.890%), 297 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:43.794538 mon.prox1 mon.0 192.168.1.81:6789/0 745220 : cluster [INF] Health check cleared: OBJECT_MISPLACED (was: 3668/5747508 objects misplaced (0.064%))
2019-03-31 06:26:45.034919 mon.prox1 mon.0 192.168.1.81:6789/0 745221 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 2 pgs inactive)
2019-03-31 06:26:46.190647 mon.prox1 mon.0 192.168.1.81:6789/0 745222 : cluster [WRN] Health check update: Degraded data redundancy: 302190/5747508 objects degraded (5.258%), 247 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:49.549834 mon.prox1 mon.0 192.168.1.81:6789/0 745225 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-03-31 06:26:49.598780 mon.prox1 mon.0 192.168.1.81:6789/0 745226 : cluster [WRN] Health check failed: 1834/5747508 objects misplaced (0.032%) (OBJECT_MISPLACED)
2019-03-31 06:26:49.690099 mon.prox1 mon.0 192.168.1.81:6789/0 745227 : cluster [INF] osd.8 192.168.1.87:6800/815810 boot
2019-03-31 06:26:51.191126 mon.prox1 mon.0 192.168.1.81:6789/0 745230 : cluster [WRN] Health check update: Degraded data redundancy: 288856/5747508 objects degraded (5.026%), 236 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:52.142607 osd.14 osd.14 192.168.1.81:6812/3099 8370 : cluster [INF] 1.3e4 continuing backfill to osd.23 from (27492'9481638,39119'9526978] 1:27efe7b3:::rbd_data.f21da26b8b4567.00000000000010a4:head to 39119'9526978
2019-03-31 06:26:55.961547 mon.prox1 mon.0 192.168.1.81:6789/0 745231 : cluster [WRN] Health check update: 489/5747523 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:26:56.191516 mon.prox1 mon.0 192.168.1.81:6789/0 745232 : cluster [WRN] Health check update: Degraded data redundancy: 733/5747523 objects degraded (0.013%), 111 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:01.191946 mon.prox1 mon.0 192.168.1.81:6789/0 745233 : cluster [WRN] Health check update: 489/5747541 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:27:01.192047 mon.prox1 mon.0 192.168.1.81:6789/0 745234 : cluster [WRN] Health check update: Degraded data redundancy: 728/5747541 objects degraded (0.013%), 103 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:06.192432 mon.prox1 mon.0 192.168.1.81:6789/0 745237 : cluster [WRN] Health check update: Degraded data redundancy: 677/5747541 objects degraded (0.012%), 76 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:10.147576 mon.prox1 mon.0 192.168.1.81:6789/0 745239 : cluster [WRN] Health check update: 489/5747544 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:27:11.192871 mon.prox1 mon.0 192.168.1.81:6789/0 745240 : cluster [WRN] Health check update: Degraded data redundancy: 662/5747544 objects degraded (0.012%), 66 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:15.339742 mon.prox1 mon.0 192.168.1.81:6789/0 745241 : cluster [WRN] Health check update: 489/5747613 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:27:16.240006 mon.prox1 mon.0 192.168.1.81:6789/0 745244 : cluster [WRN] Health check update: Degraded data redundancy: 631/5747613 objects degraded (0.011%), 49 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:21.240485 mon.prox1 mon.0 192.168.1.81:6789/0 745247 : cluster [WRN] Health check update: Degraded data redundancy: 623/5747613 objects degraded (0.011%), 44 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:26.240859 mon.prox1 mon.0 192.168.1.81:6789/0 745248 : cluster [WRN] Health check update: Degraded data redundancy: 598/5747613 objects degraded (0.010%), 26 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:29.354948 mon.prox1 mon.0 192.168.1.81:6789/0 745252 : cluster [INF] osd.8 failed (root=default,host=prox7) (connection refused reported by osd.19)
2019-03-31 06:27:29.390502 mon.prox1 mon.0 192.168.1.81:6789/0 745287 : cluster [INF] osd.23 failed (root=default,host=prox6) (connection refused reported by osd.0)

Had a really good look around for current bugs or fixes, but couldn't find anything that came close. Any help much appreciated! - Apologies if this is the wrong place for it.
Let me know what other information you'd need. I've got the individual OSD logs as well, but as they're getting about half a gig each, I didn't think it's smart to attach them straight away ;)

Running Ceph 12.2.11-pve1, on ProxMox 5.3-12

Thanks in advance,

Alex

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » RADOS

Custom queries

Bug #39055

OSD's crash when specific PG is trying to backfill

Updated by Patrick Donnelly about 5 years ago

Updated by Greg Farnum about 5 years ago

Updated by Alex Tijhuis about 5 years ago