Bug #39055

open

OSDs crash when a specific PG is trying to backfill

Added by Alex Tijhuis about 5 years ago. Updated about 5 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
other
Tags:
osd flapping pg backfill
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

I've got a peculiar issue whereby a specific PG is trying to backfill its objects to the other peers, but the moment it tries to do so for some specific objects within the placement group, the OSDs it's backfilling to crash (simultaneously).

The PG in question:

ceph pg 1.3e4 query | head -n 30
{
    "state": "undersized+degraded+remapped+backfill_wait+peered",
    "snap_trimq": "[130~1,13d~1]",
    "snap_trimq_len": 2,
    "epoch": 41343,
    "up": [
        8,
        1,
        25
    ],
    "acting": [
        16
    ],
    "backfill_targets": [
        "1",
        "8",
        "25" 
    ],
    "actingbackfill": [
        "1",
        "8",
        "16",
        "25" 
    ],

In the case above, it's crashing targets 1, 8 and 25 simultaneously the moment it tries to backfill.
As soon as they've crashed, Ceph goes into recovery mode, the OSDs come back online after about 20 seconds, and as soon as Ceph tries to recover/backfill the same PG again, the whole cycle starts over like clockwork.

Initially I thought it was HDD issues, so I removed the original target drives, but no change. I've also taken the original owner of the PG out, still no change (hence why, in the example above, the "acting" OSD is not one of the targets).
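For reference, taking an OSD out of the acting set was done with the usual commands, roughly along these lines (a hedged sketch using the standard Ceph CLI and systemd tooling; the osd id is a placeholder):

# Mark the OSD out so CRUSH remaps its PGs onto other OSDs
ceph osd out <id>
# Once the data has drained, the daemon can be stopped
systemctl stop ceph-osd@<id>
# Confirm the PG's new up/acting sets
ceph pg 1.3e4 query | head -n 30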

A snippet of the main Ceph log file, showing osd.8 and osd.23 flapping, with PG 1.3e4 starting the backfill process, after which they go down again:

2019-03-31 06:26:02.406944 mon.prox1 mon.0 192.168.1.81:6789/0 744951 : cluster [INF] osd.8 failed (root=default,host=prox7) (connection refused reported by osd.0)
2019-03-31 06:26:02.409424 mon.prox1 mon.0 192.168.1.81:6789/0 744969 : cluster [INF] osd.23 failed (root=default,host=prox6) (connection refused reported by osd.6)
2019-03-31 06:26:03.324917 mon.prox1 mon.0 192.168.1.81:6789/0 745198 : cluster [WRN] Health check failed: 2 osds down (OSD_DOWN)
2019-03-31 06:26:03.325094 mon.prox1 mon.0 192.168.1.81:6789/0 745199 : cluster [WRN] Health check update: 489/5747322 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:26:04.550642 mon.prox1 mon.0 192.168.1.81:6789/0 745201 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
2019-03-31 06:26:06.186950 mon.prox1 mon.0 192.168.1.81:6789/0 745203 : cluster [WRN] Health check update: Degraded data redundancy: 586/5747346 objects degraded (0.010%), 15 pgs degraded, 1 pg undersized (PG_DEGRADED)
2019-03-31 06:26:08.543045 mon.prox1 mon.0 192.168.1.81:6789/0 745204 : cluster [WRN] Health check update: 3668/5747490 objects misplaced (0.064%) (OBJECT_MISPLACED)
2019-03-31 06:26:10.873192 mon.prox1 mon.0 192.168.1.81:6789/0 745206 : cluster [WRN] Health check update: Reduced data availability: 2 pgs inactive (PG_AVAILABILITY)
2019-03-31 06:26:11.187406 mon.prox1 mon.0 192.168.1.81:6789/0 745207 : cluster [WRN] Health check update: Degraded data redundancy: 453462/5747499 objects degraded (7.890%), 297 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:16.187967 mon.prox1 mon.0 192.168.1.81:6789/0 745208 : cluster [WRN] Health check update: 3668/5747499 objects misplaced (0.064%) (OBJECT_MISPLACED)
2019-03-31 06:26:34.987288 mon.prox1 mon.0 192.168.1.81:6789/0 745210 : cluster [WRN] Health check update: 3668/5747502 objects misplaced (0.064%) (OBJECT_MISPLACED)
2019-03-31 06:26:34.987356 mon.prox1 mon.0 192.168.1.81:6789/0 745211 : cluster [WRN] Health check update: Degraded data redundancy: 453462/5747502 objects degraded (7.890%), 297 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:40.280592 mon.prox1 mon.0 192.168.1.81:6789/0 745214 : cluster [WRN] Health check update: 1 osds down (OSD_DOWN)
2019-03-31 06:26:40.430710 mon.prox1 mon.0 192.168.1.81:6789/0 745215 : cluster [INF] osd.23 192.168.1.86:6817/3115652 boot
2019-03-31 06:26:41.190103 mon.prox1 mon.0 192.168.1.81:6789/0 745217 : cluster [WRN] Health check update: 3668/5747508 objects misplaced (0.064%) (OBJECT_MISPLACED)
2019-03-31 06:26:41.190179 mon.prox1 mon.0 192.168.1.81:6789/0 745218 : cluster [WRN] Health check update: Degraded data redundancy: 453462/5747508 objects degraded (7.890%), 297 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:43.794538 mon.prox1 mon.0 192.168.1.81:6789/0 745220 : cluster [INF] Health check cleared: OBJECT_MISPLACED (was: 3668/5747508 objects misplaced (0.064%))
2019-03-31 06:26:45.034919 mon.prox1 mon.0 192.168.1.81:6789/0 745221 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 2 pgs inactive)
2019-03-31 06:26:46.190647 mon.prox1 mon.0 192.168.1.81:6789/0 745222 : cluster [WRN] Health check update: Degraded data redundancy: 302190/5747508 objects degraded (5.258%), 247 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:49.549834 mon.prox1 mon.0 192.168.1.81:6789/0 745225 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-03-31 06:26:49.598780 mon.prox1 mon.0 192.168.1.81:6789/0 745226 : cluster [WRN] Health check failed: 1834/5747508 objects misplaced (0.032%) (OBJECT_MISPLACED)
2019-03-31 06:26:49.690099 mon.prox1 mon.0 192.168.1.81:6789/0 745227 : cluster [INF] osd.8 192.168.1.87:6800/815810 boot
2019-03-31 06:26:51.191126 mon.prox1 mon.0 192.168.1.81:6789/0 745230 : cluster [WRN] Health check update: Degraded data redundancy: 288856/5747508 objects degraded (5.026%), 236 pgs degraded (PG_DEGRADED)
2019-03-31 06:26:52.142607 osd.14 osd.14 192.168.1.81:6812/3099 8370 : cluster [INF] 1.3e4 continuing backfill to osd.23 from (27492'9481638,39119'9526978] 1:27efe7b3:::rbd_data.f21da26b8b4567.00000000000010a4:head to 39119'9526978
2019-03-31 06:26:55.961547 mon.prox1 mon.0 192.168.1.81:6789/0 745231 : cluster [WRN] Health check update: 489/5747523 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:26:56.191516 mon.prox1 mon.0 192.168.1.81:6789/0 745232 : cluster [WRN] Health check update: Degraded data redundancy: 733/5747523 objects degraded (0.013%), 111 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:01.191946 mon.prox1 mon.0 192.168.1.81:6789/0 745233 : cluster [WRN] Health check update: 489/5747541 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:27:01.192047 mon.prox1 mon.0 192.168.1.81:6789/0 745234 : cluster [WRN] Health check update: Degraded data redundancy: 728/5747541 objects degraded (0.013%), 103 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:06.192432 mon.prox1 mon.0 192.168.1.81:6789/0 745237 : cluster [WRN] Health check update: Degraded data redundancy: 677/5747541 objects degraded (0.012%), 76 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:10.147576 mon.prox1 mon.0 192.168.1.81:6789/0 745239 : cluster [WRN] Health check update: 489/5747544 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:27:11.192871 mon.prox1 mon.0 192.168.1.81:6789/0 745240 : cluster [WRN] Health check update: Degraded data redundancy: 662/5747544 objects degraded (0.012%), 66 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:15.339742 mon.prox1 mon.0 192.168.1.81:6789/0 745241 : cluster [WRN] Health check update: 489/5747613 objects misplaced (0.009%) (OBJECT_MISPLACED)
2019-03-31 06:27:16.240006 mon.prox1 mon.0 192.168.1.81:6789/0 745244 : cluster [WRN] Health check update: Degraded data redundancy: 631/5747613 objects degraded (0.011%), 49 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:21.240485 mon.prox1 mon.0 192.168.1.81:6789/0 745247 : cluster [WRN] Health check update: Degraded data redundancy: 623/5747613 objects degraded (0.011%), 44 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:26.240859 mon.prox1 mon.0 192.168.1.81:6789/0 745248 : cluster [WRN] Health check update: Degraded data redundancy: 598/5747613 objects degraded (0.010%), 26 pgs degraded (PG_DEGRADED)
2019-03-31 06:27:29.354948 mon.prox1 mon.0 192.168.1.81:6789/0 745252 : cluster [INF] osd.8 failed (root=default,host=prox7) (connection refused reported by osd.19)
2019-03-31 06:27:29.390502 mon.prox1 mon.0 192.168.1.81:6789/0 745287 : cluster [INF] osd.23 failed (root=default,host=prox6) (connection refused reported by osd.0)

I've had a really good look around for existing bugs or fixes, but couldn't find anything that came close. Any help much appreciated! Apologies if this is the wrong place for it.
Let me know what other information you'd need. I've got the individual OSD logs as well, but as they're about half a gig each, I didn't think it was smart to attach them straight away ;)
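If it helps, the crash backtraces alone can probably be pulled out of those logs with something like this (a rough sketch; the paths assume the default /var/log/ceph layout and osd.8 is just one of the affected daemons):

# Assertion failure plus the surrounding backtrace
grep -n -A 40 'FAILED assert' /var/log/ceph/ceph-osd.8.log
# Abort/segfault backtrace, if the daemon died on a signal instead
grep -n -B 5 -A 60 'Caught signal' /var/log/ceph/ceph-osd.8.log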

Running Ceph 12.2.11-pve1 on Proxmox 5.3-12.

Thanks in advance,

Alex

Actions #1

Updated by Patrick Donnelly about 5 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
Actions #2

Updated by Greg Farnum about 5 years ago

You'll need to gather full debug logs of the crash and as much as possible about the object(s) which the PG is working to backfill when it crashes.
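In practice that means something along these lines (a rough sketch; the pool name is an assumption since pool 1 isn't named in the report, the object name is taken from the "continuing backfill" line in the log above, and osd.16 is the current acting OSD):

# Raise debug logging on the OSDs before reproducing the crash
ceph tell osd.* injectargs '--debug_osd 20 --debug_ms 1'
# Map the implicated object to its PG and OSDs
ceph osd map <pool> rbd_data.f21da26b8b4567.00000000000010a4
# With osd.16 stopped, list the objects the PG holds on disk
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 --pgid 1.3e4 --op list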

Actions #3

Updated by Alex Tijhuis about 5 years ago

Hi Greg,

Thanks for getting back.
After a while I resorted to creating a new pool and migrated all the data off the pool and PG in question. However, right up until the moment I deleted the pool, the problem remained: every time the cluster tried to backfill/recover the PG in question, all the OSDs it was recovering to would crash.

I've increased logging to 20/20 on all the OSDs in question and have copied all the OSD, MDS, MGR and MON logs from a time when this happened. You can find these logs here:
https://www.dropbox.com/s/iykxek4hwqj3sj2/ceph-logs-extended.zip?dl=0

Order of events, as found in the log files (the corresponding commands are sketched below):
Around 00:15, cleared out most logs so they only show the exact events.
Then set the log levels to 20/20 on the OSDs.
Around 00:16, unset the norecover flag.
Almost immediately followed by OSDs 17 and 23 crashing (presumably because there is now so little data in this pool, it got around to syncing data for PG 1.3e4 a lot quicker).
Then follow about 3 minutes of OSDs 17 and 23 flapping (17 more than 23, but both in pretty bad shape), with the cluster trying to recover in the meantime.
Around 00:20, I manually start OSD 23 again, as it doesn't seem to come back up on its own; that works, both OSDs are up for a few seconds and then both go down again.
Around 00:21, I set the norecover flag again, and within about 10-20 seconds both OSD 17 and 23 come back online and everything is stable again.
Set the log levels back to 1/5 and captured the log files.
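The commands for those steps are roughly the following (a hedged sketch, assuming the standard Ceph CLI and systemd unit names; osd ids 17 and 23 as above):

# Raise logging on the affected OSDs
ceph tell osd.17 injectargs '--debug_osd 20 --debug_ms 1'
ceph tell osd.23 injectargs '--debug_osd 20 --debug_ms 1'
# Let recovery/backfill start; the crashes follow almost immediately
ceph osd unset norecover
# Manually restart the OSD that stays down
systemctl start ceph-osd@23
# Stop the crash loop again
ceph osd set norecover
# Drop log levels back down afterwards
ceph tell osd.17 injectargs '--debug_osd 1/5'
ceph tell osd.23 injectargs '--debug_osd 1/5'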

Note: during the process there was still about 15% of objects in the wrong locations, so that may add some background noise - sorry for that - I couldn't get the cluster to stabilize before capturing these logs.

The full discussion and further troubleshooting steps that we've followed can be found here:
https://forum.proxmox.com/threads/ceph-osds-crashing-when-trying-to-backfill-a-specific-pg.52957/
(A shout-out to Alwin from Proxmox - he's been a great help!)

As mentioned, the problem no longer exists in my cluster, as I've moved all the data off the pool in question, but since I was able to crash the OSDs almost on command, I still expect there's a bug in there somewhere. Hope these logs help.

Alex
