Bug #847

OSD: Recovery does not complete, PGs stay active

Added by Wido den Hollander about 13 years ago. Updated about 13 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: OSD

Description

After I expanded my cluster from 4 to 8 OSDs, the recovery would not complete.

The state right now:

pg v112366: 8240 pgs: 48 active, 8192 active+clean; 260 GB data, 523 GB used, 14364 GB / 14904 GB avail; 3126/134690 degraded (2.321%)

At first 64 PGs stayed stuck in "active"; after some OSD restarts that got down to 48.
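
A quick way to list just the PGs stuck in plain "active" (a sketch: ceph pg dump is a real command, but its output columns differ between versions, so the grep is only approximate):

ceph pg dump | grep -w active | grep -v clean

That prints lines such as the one below: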

3.7c9   46      46      46      92      188416  192937984       4704    4704    active  316'242 1381'458        [4,6]   [4,6]   316'242 2011-02-22 07:29:12.919625

That PG, for example, is one that stays "active".

Browsing through the logs I found:

Feb 25 22:27:23 atom osd.6[18705]: 7f81df605700 osd6 1258 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1186 1258/1258/1248) [4,6] r=1 crashed+stray m=46] clear_recovery_state
Feb 25 22:27:23 atom osd.6[18705]: 7f81df605700 osd6 1258 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1186 1258/1258/1248) [4,6] r=1 crashed+stray m=46] up [4] -> [4,6], acting [4] -> [4,6], role -1 -> 1
Feb 25 22:27:23 atom osd.6[18705]: 7f81df605700 osd6 1258 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1186 1258/1258/1248) [4,6] r=1 crashed+stray m=46] on_role_change
Feb 25 22:27:23 atom osd.6[18705]: 7f81df605700 osd6 1258 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1186 1258/1258/1248) [4,6] r=1 crashed+stray m=46] cancel_generate_backlog
Feb 25 22:27:30 atom osd.4[18838]: 7f3a0d7b3700 osd4 1258 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1185 1258/1258/1248) [4,6] r=0 mlcod 0'0 crashed+replay+degraded m=46 u=46] cancel_recovery

This process keeps repeating for this PG; does the recovery keep getting cancelled?
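
To see how often this happens for 3.7c9, grepping the collected syslog for that PG and for the recovery-related messages is enough (the path below is the remote-syslog location mentioned at the end of this report; adjust it if the logs live elsewhere):

grep '3\.7c9' /srv/ceph/remote-syslog/ceph-osd.log | grep -E 'cancel_recovery|clear_recovery_state|cancel_generate_backlog'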

Over about 20 minutes a lot of these lines are printed, and it ends with:

Feb 25 22:47:44 atom osd.4[21301]: 7faa5870f700 osd4 1379 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1186 1377/1377/1377) [4,6] r=0 mlcod 0'0 crashed+replay m=46 u=46] update_stats 1377'437
Feb 25 22:48:29 noisy osd.1[27292]: 7f31fb7a9700 osd1 1380 pg[3.7c9( v 316'242 (316'238,316'242]+backlog n=46 ec=2 les=1185 1380/1380/1380) [6] r=-1 lcod 0'0 stray] up [4,6] -> [6], acting [4,6] -> [6], role -1 -> -1
Feb 25 22:48:29 noisy osd.1[27292]: 7f31fb7a9700 osd1 1380 pg[3.7c9( v 316'242 (316'238,316'242]+backlog n=46 ec=2 les=1185 1380/1380/1380) [6] r=-1 lcod 0'0 stray] on_change
Feb 25 22:48:29 noisy osd.1[27292]: 7f31fb7a9700 osd1 1380 pg[3.7c9( v 316'242 (316'238,316'242]+backlog n=46 ec=2 les=1185 1380/1380/1380) [6] r=-1 lcod 0'0 stray] [4,6] -> [6], acting primary 4 -> 6

After that, it starts printing lines like the following, after which the PG seems "idle":

Feb 26 00:24:11 noisy osd.1[27292]: 7f3201fb6700 osd1 1389  on 2011-02-22 07:29:12.919625 3.7c9
Feb 26 00:24:12 noisy osd.1[27292]: 7f3201fb6700 osd1 1389  on 2011-02-22 07:29:12.919625 3.7c9
Feb 26 00:24:14 noisy osd.1[27292]: 7f3201fb6700 osd1 1389  on 2011-02-22 07:29:12.919625 3.7c9

What is osd1 doing with this PG? (This keeps repeating for thousands of log lines.)

I do see this PG being scrubbed multiple times, but other than that I only see the "osd1" lines repeating.
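
To get a feel for the volume, zgrep can count those lines in the rotated logs (adjust the file names if the rotation scheme differs; the timestamp string is taken from the lines above):

zgrep -c 'on 2011-02-22 07:29:12.919625 3\.7c9' /srv/ceph/remote-syslog/ceph-osd.log.2.gz /srv/ceph/remote-syslog/ceph-osd.log.3.gz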

I just restarted osd4 again and saw:

Feb 28 20:56:39 atom osd.4[24320]: 7f434613b720 filestore(/var/lib/ceph/osd.4) getattr /var/lib/ceph/osd.4/current/3.7c9_head/rb.0.1.000000002dd0_head '_'
Feb 28 20:56:39 atom osd.4[24320]: 7f434613b720 osd4 1415 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1389 1381/1381/1381) [] r=0 mlcod 0'0 inactive m=40 u=40] read_log  missing 316'78 (316'77) b rb.0.1.000000002dd0/head by client4304.0:64776 2011-02-21 12:31:58.473405
Feb 28 20:56:39 atom osd.4[24320]: 7f434613b720 filestore(/var/lib/ceph/osd.4) getattr /var/lib/ceph/osd.4/current/3.7c9_head/rb.0.1.00000000117f_head '_' = -2
Feb 28 20:56:39 atom osd.4[24320]: 7f434613b720 filestore(/var/lib/ceph/osd.4) getattr /var/lib/ceph/osd.4/current/3.7c9_head/rb.0.1.0000000010be_head '_' = -2
Feb 28 20:56:39 atom osd.4[24320]: 7f434613b720 osd4 1415 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1389 1381/1381/1381) [] r=0 mlcod 0'0 inactive m=44 u=44] read_log  missing 316'59 (316'58) b rb.0.1.0000000010be/head by client4304.0:32191 2011-02-21 11:18:13.436974
Feb 28 20:56:39 atom osd.4[24320]: 7f434613b720 filestore(/var/lib/ceph/osd.4) getattr /var/lib/ceph/osd.4/current/3.7c9_head/rb.0.1.00000000a638_head '_'
Feb 28 20:56:39 atom osd.4[24320]: 7f434613b720 osd4 1415 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1389 1381/1381/1381) [] r=0 mlcod 0'0 inactive m=45 u=45] read_log  missing 316'55 (316'54) b rb.0.1.00000000a638/head by client4304.0:21256 2011-02-21 11:00:31.259864
Feb 28 20:56:39 atom osd.4[24320]: 7f434613b720 osd4 1415 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1389 1381/1381/1381) [] r=0 mlcod 0'0 inactive m=46 u=46] read_log done
Feb 28 20:56:42 atom osd.4[24320]: 7f433b388700 osd4 1416 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1389 1416/1416/1416) [6] r=-1 inactive m=46] cancel_recovery

The recovery still seems to get cancelled. Is that due to the missing rb.* objects (the getattr calls that return -2, i.e. ENOENT)?
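
As a quick check (a sketch: the osd.4 path comes from the log above, while the matching /var/lib/ceph/osd.6 path and the exact on-disk file names are assumptions), one can look for one of the objects that read_log reports as missing on both replicas:

ls -l /var/lib/ceph/osd.4/current/3.7c9_head/ | grep rb.0.1.0000000010be
ls -l /var/lib/ceph/osd.6/current/3.7c9_head/ | grep rb.0.1.0000000010be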

The recovery stuff is still a bit new to me, so I'm not sure what to look at.

The logs are available on logger.ceph.widodh.nl:/srv/ceph/remote-syslog, where you will find "ceph-osd.log"; 2.gz and 3.gz are the files this data came from.

On this cluster I'm running commit c78b29a47d7211a4b8b1585112ac22b8435a82c7.
