Bug #10250
Status: Closed
PG stuck incomplete after interrupted backfill
Description
Ceph version: 0.87
OS: Ubuntu 14.04
Cluster: 3x osd nodes with ~24 osds each
Issue: I had a pool whose size was accidentally set to 2. The SATA-DOM on one of my OSD nodes started to fail, but I could still read off of it. I sacrificed one of my OSD disks (osd.0), thinking I would be fine since my pools were at size 3, and moved the system onto it. Once I got the system back online and the cluster finished backfilling, I was left with several incomplete PGs. All but one of them are in a pool with test data that I will probably just remove once I get this sorted. There is one in my rbd image pool, however, that is causing heartache (pg 19.6e). Any requests for its data cause blocked requests on its primary OSD (osd.70). The other OSD (osd.4) it was on when it went down currently has about 4GB of data for this PG (when exported with ceph_objectstore_tool). I tried using ceph_objectstore_tool to move the data from osd.4 to osd.11, and now the PG is down+incomplete.
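(For reference, an export/import of the kind described above would look roughly like the following. This is a sketch, not the reporter's exact commands: the data/journal paths assume the default Ubuntu layout, and flags vary slightly across Ceph versions.)

```shell
# The tool operates on an offline object store, so stop the source OSD first.
stop ceph-osd id=4

# Export the PG's data from osd.4.
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-4 \
    --journal-path /var/lib/ceph/osd/ceph-4/journal \
    --pgid 19.6e --op export --file /tmp/pg19.6e.export

# Import it into osd.11, again with that OSD stopped.
stop ceph-osd id=11
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-11 \
    --journal-path /var/lib/ceph/osd/ceph-11/journal \
    --op import --file /tmp/pg19.6e.export
```

These commands require a live cluster and stopped OSD daemons, so they are shown for orientation only.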
The pg query shows log_tail lagging behind last_complete, so I think that may be the problem, but I don't know enough about the internals to be sure.
Also note, I've been flailing around quite a bit on this, so that's why the query has so many intervals.
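(The fields being compared above can be pulled out of the pg query JSON; the jq invocation and the exact field paths under "info" are assumptions and may differ between Ceph releases.)

```shell
# Dump the PG's peering state to a file for inspection.
ceph pg 19.6e query > pg19.6e.query.json

# Extract the two log boundaries being compared (assumes jq is installed).
jq '.info.log_tail, .info.last_complete' pg19.6e.query.json
```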
Updated by Aaron Bassett over 9 years ago
Oh I forgot to mention, min_size is 1 on this pool.
Updated by Aaron Bassett over 9 years ago
- File osdlog_kicked.txt osdlog_kicked.txt added
I found more info in the log relating to this PG. It looks like the OSD is kicking it, but requests still hang. At this point I'd be happy to lose the data in the PG if that stops the blocked requests, so I can get the rest of the data out of the pool.
Updated by Sage Weil over 9 years ago
- Source changed from other to Community (user)
Can you try 'ceph osd lost 11'? (I take it osd.11 is the one that you wiped and removed?)
If you can capture the osd log for that PG during peering (with debug osd = 20), that would help by letting us see the decision process.
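(Both steps can be done from a monitor host without restarting anything; a sketch using the OSD numbers discussed above.)

```shell
# Tell the cluster osd.11 is permanently gone so peering can proceed past it.
ceph osd lost 11 --yes-i-really-mean-it

# Raise the OSD debug level on the PG's primary on the fly.
ceph tell osd.70 injectargs '--debug-osd 20'

# Reproduce the peering attempt, then collect the relevant log lines.
grep '19\.6e' /var/log/ceph/ceph-osd.70.log > osd70.pg19.6e.log
```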
Updated by Aaron Bassett over 9 years ago
- File osd.grepped.log osd.grepped.log added
I actually had already marked osd.11 as lost a few days ago. Just this morning I re-activated the disk and it came up as another (now new) osd.11. Attached is the log from osd.70 on restart. The log grew to 130M in just a couple minutes after restart, so I ran grep -A20 19\\.6e on it to get it to a reasonable size. If that's not useful, please let me know how to get it to you.
Updated by Sage Weil over 9 years ago
2014-12-08 18:14:01.108163 7f28fcb46700 5 osd.70 pg_epoch: 71592 pg[19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591) [70,16,13] r=0 lpr=71591 pi=50489-71590/150 crt=51071'48910301 lcod 0'0 mlcod 0'0 peering] enter Started/Primary/Peering/GetLog
2014-12-08 18:14:01.108222 7f28fcb46700 10 osd.70 pg_epoch: 71592 pg[19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591) [70,16,13] r=0 lpr=71591 pi=50489-71590/150 crt=51071'48910301 lcod 0'0 mlcod 0'0 peering] calc_acting osd.0 19.6e( DNE empty local-les=0 n=0 ec=0 les/c 0/0 0/0/0)
2014-12-08 18:14:01.108236 7f28fcb46700 10 osd.70 pg_epoch: 71592 pg[19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591) [70,16,13] r=0 lpr=71591 pi=50489-71590/150 crt=51071'48910301 lcod 0'0 mlcod 0'0 peering] calc_acting osd.4 19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591)
2014-12-08 18:14:01.108251 7f28fcb46700 10 osd.70 pg_epoch: 71592 pg[19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591) [70,16,13] r=0 lpr=71591 pi=50489-71590/150 crt=51071'48910301 lcod 0'0 mlcod 0'0 peering] calc_acting osd.11 19.6e( DNE empty local-les=0 n=0 ec=0 les/c 0/0 0/0/0)
2014-12-08 18:14:01.108266 7f28fcb46700 10 osd.70 pg_epoch: 71592 pg[19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591) [70,16,13] r=0 lpr=71591 pi=50489-71590/150 crt=51071'48910301 lcod 0'0 mlcod 0'0 peering] calc_acting osd.13 19.6e( DNE empty local-les=0 n=0 ec=0 les/c 0/0 0/0/0)
2014-12-08 18:14:01.108276 7f28fcb46700 10 osd.70 pg_epoch: 71592 pg[19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591) [70,16,13] r=0 lpr=71591 pi=50489-71590/150 crt=51071'48910301 lcod 0'0 mlcod 0'0 peering] calc_acting osd.16 19.6e( empty local-les=0 n=0 ec=21298 les/c 51101/50495 71591/71591/71591)
2014-12-08 18:14:01.108287 7f28fcb46700 10 osd.70 pg_epoch: 71592 pg[19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591) [70,16,13] r=0 lpr=71591 pi=50489-71590/150 crt=51071'48910301 lcod 0'0 mlcod 0'0 peering] calc_acting osd.20 19.6e( DNE empty local-les=0 n=0 ec=0 les/c 0/0 0/0/0)
2014-12-08 18:14:01.108296 7f28fcb46700 10 osd.70 pg_epoch: 71592 pg[19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591) [70,16,13] r=0 lpr=71591 pi=50489-71590/150 crt=51071'48910301 lcod 0'0 mlcod 0'0 peering] calc_acting osd.30 19.6e( empty local-les=0 n=0 ec=21298 les/c 51101/50495 71591/71591/71591)
2014-12-08 18:14:01.108313 7f28fcb46700 10 osd.70 pg_epoch: 71592 pg[19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591) [70,16,13] r=0 lpr=71591 pi=50489-71590/150 crt=51071'48910301 lcod 0'0 mlcod 0'0 peering] calc_acting osd.31 19.6e( DNE empty local-les=0 n=0 ec=0 les/c 0/0 0/0/0)
2014-12-08 18:14:01.108325 7f28fcb46700 10 osd.70 pg_epoch: 71592 pg[19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591) [70,16,13] r=0 lpr=71591 pi=50489-71590/150 crt=51071'48910301 lcod 0'0 mlcod 0'0 peering] calc_acting osd.56 19.6e( DNE empty local-les=0 n=0 ec=0 les/c 0/0 0/0/0)
2014-12-08 18:14:01.108339 7f28fcb46700 10 osd.70 pg_epoch: 71592 pg[19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591) [70,16,13] r=0 lpr=71591 pi=50489-71590/150 crt=51071'48910301 lcod 0'0 mlcod 0'0 peering] calc_acting osd.58 19.6e( empty local-les=0 n=0 ec=21298 les/c 51101/50495 71591/71591/71591)
2014-12-08 18:14:01.108356 7f28fcb46700 10 osd.70 pg_epoch: 71592 pg[19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591) [70,16,13] r=0 lpr=71591 pi=50489-71590/150 crt=51071'48910301 lcod 0'0 mlcod 0'0 peering] calc_acting osd.59 19.6e( DNE empty local-les=0 n=0 ec=0 les/c 0/0 0/0/0)
2014-12-08 18:14:01.108374 7f28fcb46700 10 osd.70 pg_epoch: 71592 pg[19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591) [70,16,13] r=0 lpr=71591 pi=50489-71590/150 crt=51071'48910301 lcod 0'0 mlcod 0'0 peering] calc_acting osd.63 19.6e( DNE empty local-les=0 n=0 ec=0 les/c 0/0 0/0/0)
2014-12-08 18:14:01.108391 7f28fcb46700 10 osd.70 pg_epoch: 71592 pg[19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591) [70,16,13] r=0 lpr=71591 pi=50489-71590/150 crt=51071'48910301 lcod 0'0 mlcod 0'0 peering] calc_acting osd.70 19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591)
2014-12-08 18:14:01.108414 7f28fcb46700 10 osd.70 pg_epoch: 71592 pg[19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591) [70,16,13] r=0 lpr=71591 pi=50489-71590/150 crt=51071'48910301 lcod 0'0 mlcod 0'0 peering] calc_acting osd.76 19.6e( empty local-les=0 n=0 ec=21298 les/c 51101/50495 71591/71591/71591)
2014-12-08 18:14:01.108431 7f28fcb46700 10 osd.70 pg_epoch: 71592 pg[19.6e( v 51072'48910307 (50495'48906592,51072'48910307] local-les=51066 n=855 ec=21298 les/c 51101/50495 71591/71591/71591) [70,16,13] r=0 lpr=71591 pi=50489-71590/150 crt=51071'48910301 lcod 0'0 mlcod 0'0 peering] choose_acting failed
Updated by Sage Weil over 9 years ago
- Status changed from New to 12
- Assignee set to Sage Weil
- Priority changed from Normal to High
Install wip-last_epoch_started, set osd_find_best_info_ignore_history_les = true in ceph.conf, and restart the primary OSD.
(You can find packages at gitbuilder.ceph.com.)
Let us know if that resolves it?
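(For anyone following along, the workaround translates to a ceph.conf change plus an OSD restart. This is a sketch: the restart command assumes Ubuntu 14.04's upstart packaging, and the conf-append approach assumes your ceph.conf does not already have a conflicting [osd] entry.)

```shell
# Add the workaround to ceph.conf on the host of the PG's primary (osd.70).
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
    osd find best info ignore history les = true
EOF

# Restart the primary OSD so it repeers with the option in effect.
restart ceph-osd id=70
```

Once the PG recovers, the option should be removed and the OSD restarted again; leaving it set permanently can mask real peering problems.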
Updated by Aaron Bassett over 9 years ago
Awesome, looks like it worked: it started backfilling right away, and now my VMs are unfrozen. Thanks a lot!