Bug #39115 (closed): ceph pg repair doesn't fix itself if osd is bluestore

Added by Iain Buclaw about 5 years ago. Updated over 4 years ago.

Status: Duplicate
Priority: Normal
Assignee: David Zafman
Category: -
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When running ceph pg repair on an inconsistent PG with missing objects, I usually notice that the OSD is marked down and then up again before the cluster returns to healthy.
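
For reference, the repair was triggered with the standard commands (a minimal sketch; the PG ID 51.10d is taken from the log below, and the list-inconsistent-obj step is extra context rather than part of the original report):

# find the PG flagged inconsistent by scrub
ceph health detail

# optionally list the objects/shards the scrub complained about
rados list-inconsistent-obj 51.10d --format=json-pretty

# ask the PG's primary OSD to repair it
ceph pg repair 51.10d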

e.g.:

2019-04-04 11:32:47.226989 osd.3 [ERR] 51.10d shard 3 51:b0edece3::17308400658964302403:169:head : missing
2019-04-04 11:32:58.018797 osd.3 [ERR] 51.10d shard 9 51:b0fd8af8::2389926346209815023:169:head : missing
2019-04-04 11:32:58.018799 osd.3 [ERR] 51.10d shard 9 51:b0fd8af8::2389926346209815023:21:head : missing
2019-04-04 11:32:58.018813 osd.3 [ERR] 51.10d shard 9 51:b0fd8af8::2389926346209815023:22:head : missing
2019-04-04 11:32:58.018814 osd.3 [ERR] 51.10d shard 9 51:b0fd8af8::2389926346209815023:23:head : missing
2019-04-04 11:33:03.522849 mon.120 [INF] osd.3 failed (root=default,host=121) (connection refused reported by osd.6)
2019-04-04 11:33:04.122713 mon.120 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-04-04 11:33:06.449766 mon.120 [WRN] Health check failed: Reduced data availability: 21 pgs peering (PG_AVAILABILITY)
2019-04-04 11:33:06.449797 mon.120 [INF] Health check cleared: OSD_SCRUB_ERRORS (was: 17 scrub errors)
2019-04-04 11:33:06.449812 mon.120 [INF] Health check cleared: PG_DAMAGED (was: Possible data damage: 1 pg inconsistent)
2019-04-04 11:33:08.785071 mon.120 [WRN] Health check failed: Degraded data redundancy: 1903317/24115186 objects degraded (7.893%), 121 pgs degraded (PG_DEGRADED)
2019-04-04 11:33:11.944066 mon.120 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 21 pgs peering)
2019-04-04 11:33:15.813380 mon.120 [WRN] Health check update: Degraded data redundancy: 2432784/24115186 objects degraded (10.088%), 155 pgs degraded (PG_DEGRADED)
2019-04-04 11:33:31.774861 mon.120 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-04-04 11:33:31.996274 mon.120 [INF] osd.3 172.28.19.6:6804/3630259 boot
2019-04-04 11:33:32.959333 mon.120 [WRN] Health check update: Degraded data redundancy: 2177355/24115186 objects degraded (9.029%), 140 pgs degraded (PG_DEGRADED)
2019-04-04 11:33:38.913328 mon.120 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 1849337/24115186 objects degraded (7.669%), 116 pgs degraded)
2019-04-04 11:33:38.913364 mon.120 [INF] Cluster is now healthy

If the OSD is bluestore, however, this does not happen; the same objects are just reported as missing again and again.

2019-04-04 14:45:33.743300 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:169:head : missing
2019-04-04 14:45:33.743301 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:21:head : missing
2019-04-04 14:45:33.743302 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:22:head : missing
2019-04-04 14:45:33.743303 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:23:head : missing
2019-04-04 14:45:33.743304 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:24:head : missing
2019-04-04 14:45:33.743304 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:25:head : missing
2019-04-04 14:45:33.743305 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:26:head : missing
2019-04-04 14:45:33.743306 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:27:head : missing
2019-04-04 14:45:33.743307 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:28:head : missing
2019-04-04 14:45:33.743308 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:33:head : missing
2019-04-04 14:45:33.743309 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:34:head : missing
2019-04-04 14:45:33.743310 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:36:head : missing
2019-04-04 14:45:39.026046 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:169:head : missing
2019-04-04 14:45:39.026048 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:21:head : missing
2019-04-04 14:45:39.026049 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:22:head : missing
2019-04-04 14:45:39.026050 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:23:head : missing
2019-04-04 14:45:39.026051 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:24:head : missing
2019-04-04 14:45:39.026051 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:25:head : missing
2019-04-04 14:45:39.026052 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:26:head : missing
2019-04-04 14:45:39.026053 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:27:head : missing
2019-04-04 14:45:39.026054 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:28:head : missing
2019-04-04 14:45:39.026055 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:33:head : missing
2019-04-04 14:45:39.026056 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:34:head : missing
2019-04-04 14:45:39.026063 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:36:head : missing
2019-04-04 14:46:05.500587 osd.9 [ERR] 51.1e5 shard 9 51:a7a6a3fe::18256662680708088428:169:head : missing
2019-04-04 14:46:10.795413 osd.9 [ERR] 51.1e5 shard 5 51:a7a6a3fe::18256662680708088428:169:head : missing
2019-04-04 14:47:46.115763 osd.9 [ERR] 51.1e5 repair 13 missing, 0 inconsistent objects
2019-04-04 14:47:46.115809 osd.9 [ERR] 51.1e5 repair 26 errors, 13 fixed
2019-04-04 14:47:51.417526 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:169:head : missing
2019-04-04 14:47:51.417527 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:21:head : missing
2019-04-04 14:47:51.417528 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:22:head : missing
2019-04-04 14:47:51.417529 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:23:head : missing
2019-04-04 14:47:51.417530 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:24:head : missing
2019-04-04 14:47:51.417538 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:25:head : missing
2019-04-04 14:47:51.417539 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:26:head : missing
2019-04-04 14:47:51.417540 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:27:head : missing
2019-04-04 14:47:51.417543 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:28:head : missing
2019-04-04 14:47:51.417544 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:33:head : missing
2019-04-04 14:47:51.417545 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:34:head : missing
2019-04-04 14:47:51.417545 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:36:head : missing
2019-04-04 14:47:56.711487 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:169:head : missing
2019-04-04 14:47:56.711488 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:21:head : missing
2019-04-04 14:47:56.711489 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:22:head : missing
2019-04-04 14:47:56.711490 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:23:head : missing
2019-04-04 14:47:56.711491 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:24:head : missing
2019-04-04 14:47:56.711492 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:25:head : missing
2019-04-04 14:47:56.711492 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:26:head : missing
2019-04-04 14:47:56.711493 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:27:head : missing
2019-04-04 14:47:56.711494 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:28:head : missing
2019-04-04 14:47:56.711495 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:33:head : missing
2019-04-04 14:47:56.711496 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:34:head : missing
2019-04-04 14:47:56.711496 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:36:head : missing
2019-04-04 14:48:23.182563 osd.9 [ERR] 51.1e5 shard 9 51:a7a6a3fe::18256662680708088428:169:head : missing
2019-04-04 14:48:28.472711 osd.9 [ERR] 51.1e5 shard 5 51:a7a6a3fe::18256662680708088428:169:head : missing
2019-04-04 14:50:03.649604 osd.9 [ERR] 51.1e5 deep-scrub 13 missing, 0 inconsistent objects
2019-04-04 14:50:03.649608 osd.9 [ERR] 51.1e5 deep-scrub 26 errors

As it happens, the OSD where the objects are reported as missing (osd.5) is a filestore OSD, and I verified by inspecting the disk that all of the objects exist.
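
A check along these lines confirms the objects are present on the filestore OSD (a sketch only; it assumes osd.5's filestore data lives under the default /var/lib/ceph/osd/ceph-5 path, and the object-name fragment is taken from the log above):

# filestore keeps each PG's objects as files under current/<pgid>_head
ls /var/lib/ceph/osd/ceph-5/current/51.1e5_head/ | head

# search the PG directory for one of the objects reported as missing
find /var/lib/ceph/osd/ceph-5/current/51.1e5_head/ -name '*3966486014759199568*'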

If I restart the OSD that is using bluestore (osd.9), the cluster returns to healthy.
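
On a systemd-based deployment that workaround amounts to something like the following (a sketch; the ceph-osd@9 unit name assumes the standard packaging):

# restart the bluestore OSD and let the PG re-peer
systemctl restart ceph-osd@9

# confirm the cluster reports healthy again
ceph -s
ceph health detail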

This is reproducible 100% of the time.


Related issues (1 open, 0 closed)

Is duplicate of RADOS - Bug #39116: Draining filestore osd, removing, and adding new bluestore osd causes OSDs to crash (New, 04/04/2019)
