Bug #39115
ceph pg repair doesn't fix itself if osd is bluestore
Status: Closed
Description
When running ceph pg repair on an inconsistent PG with missing data, I usually notice that the OSD is marked down and then up again before the cluster returns to healthy.
For example:
2019-04-04 11:32:47.226989 osd.3 [ERR] 51.10d shard 3 51:b0edece3::17308400658964302403:169:head : missing
2019-04-04 11:32:58.018797 osd.3 [ERR] 51.10d shard 9 51:b0fd8af8::2389926346209815023:169:head : missing
2019-04-04 11:32:58.018799 osd.3 [ERR] 51.10d shard 9 51:b0fd8af8::2389926346209815023:21:head : missing
2019-04-04 11:32:58.018813 osd.3 [ERR] 51.10d shard 9 51:b0fd8af8::2389926346209815023:22:head : missing
2019-04-04 11:32:58.018814 osd.3 [ERR] 51.10d shard 9 51:b0fd8af8::2389926346209815023:23:head : missing
2019-04-04 11:33:03.522849 mon.120 [INF] osd.3 failed (root=default,host=121) (connection refused reported by osd.6)
2019-04-04 11:33:04.122713 mon.120 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-04-04 11:33:06.449766 mon.120 [WRN] Health check failed: Reduced data availability: 21 pgs peering (PG_AVAILABILITY)
2019-04-04 11:33:06.449797 mon.120 [INF] Health check cleared: OSD_SCRUB_ERRORS (was: 17 scrub errors)
2019-04-04 11:33:06.449812 mon.120 [INF] Health check cleared: PG_DAMAGED (was: Possible data damage: 1 pg inconsistent)
2019-04-04 11:33:08.785071 mon.120 [WRN] Health check failed: Degraded data redundancy: 1903317/24115186 objects degraded (7.893%), 121 pgs degraded (PG_DEGRADED)
2019-04-04 11:33:11.944066 mon.120 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 21 pgs peering)
2019-04-04 11:33:15.813380 mon.120 [WRN] Health check update: Degraded data redundancy: 2432784/24115186 objects degraded (10.088%), 155 pgs degraded (PG_DEGRADED)
2019-04-04 11:33:31.774861 mon.120 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-04-04 11:33:31.996274 mon.120 [INF] osd.3 172.28.19.6:6804/3630259 boot
2019-04-04 11:33:32.959333 mon.120 [WRN] Health check update: Degraded data redundancy: 2177355/24115186 objects degraded (9.029%), 140 pgs degraded (PG_DEGRADED)
2019-04-04 11:33:38.913328 mon.120 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 1849337/24115186 objects degraded (7.669%), 116 pgs degraded)
2019-04-04 11:33:38.913364 mon.120 [INF] Cluster is now healthy
If the OSD is bluestore, however, this does not happen; the same objects are just reported as missing again and again.
2019-04-04 14:45:33.743300 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:169:head : missing
2019-04-04 14:45:33.743301 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:21:head : missing
2019-04-04 14:45:33.743302 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:22:head : missing
2019-04-04 14:45:33.743303 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:23:head : missing
2019-04-04 14:45:33.743304 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:24:head : missing
2019-04-04 14:45:33.743304 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:25:head : missing
2019-04-04 14:45:33.743305 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:26:head : missing
2019-04-04 14:45:33.743306 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:27:head : missing
2019-04-04 14:45:33.743307 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:28:head : missing
2019-04-04 14:45:33.743308 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:33:head : missing
2019-04-04 14:45:33.743309 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:34:head : missing
2019-04-04 14:45:33.743310 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:36:head : missing
2019-04-04 14:45:39.026046 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:169:head : missing
2019-04-04 14:45:39.026048 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:21:head : missing
2019-04-04 14:45:39.026049 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:22:head : missing
2019-04-04 14:45:39.026050 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:23:head : missing
2019-04-04 14:45:39.026051 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:24:head : missing
2019-04-04 14:45:39.026051 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:25:head : missing
2019-04-04 14:45:39.026052 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:26:head : missing
2019-04-04 14:45:39.026053 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:27:head : missing
2019-04-04 14:45:39.026054 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:28:head : missing
2019-04-04 14:45:39.026055 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:33:head : missing
2019-04-04 14:45:39.026056 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:34:head : missing
2019-04-04 14:45:39.026063 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:36:head : missing
2019-04-04 14:46:05.500587 osd.9 [ERR] 51.1e5 shard 9 51:a7a6a3fe::18256662680708088428:169:head : missing
2019-04-04 14:46:10.795413 osd.9 [ERR] 51.1e5 shard 5 51:a7a6a3fe::18256662680708088428:169:head : missing
2019-04-04 14:47:46.115763 osd.9 [ERR] 51.1e5 repair 13 missing, 0 inconsistent objects
2019-04-04 14:47:46.115809 osd.9 [ERR] 51.1e5 repair 26 errors, 13 fixed
2019-04-04 14:47:51.417526 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:169:head : missing
2019-04-04 14:47:51.417527 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:21:head : missing
2019-04-04 14:47:51.417528 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:22:head : missing
2019-04-04 14:47:51.417529 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:23:head : missing
2019-04-04 14:47:51.417530 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:24:head : missing
2019-04-04 14:47:51.417538 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:25:head : missing
2019-04-04 14:47:51.417539 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:26:head : missing
2019-04-04 14:47:51.417540 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:27:head : missing
2019-04-04 14:47:51.417543 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:28:head : missing
2019-04-04 14:47:51.417544 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:33:head : missing
2019-04-04 14:47:51.417545 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:34:head : missing
2019-04-04 14:47:51.417545 osd.9 [ERR] 51.1e5 shard 9 51:a785a6aa::3966486014759199568:36:head : missing
2019-04-04 14:47:56.711487 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:169:head : missing
2019-04-04 14:47:56.711488 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:21:head : missing
2019-04-04 14:47:56.711489 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:22:head : missing
2019-04-04 14:47:56.711490 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:23:head : missing
2019-04-04 14:47:56.711491 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:24:head : missing
2019-04-04 14:47:56.711492 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:25:head : missing
2019-04-04 14:47:56.711492 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:26:head : missing
2019-04-04 14:47:56.711493 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:27:head : missing
2019-04-04 14:47:56.711494 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:28:head : missing
2019-04-04 14:47:56.711495 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:33:head : missing
2019-04-04 14:47:56.711496 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:34:head : missing
2019-04-04 14:47:56.711496 osd.9 [ERR] 51.1e5 shard 5 51:a785a6aa::3966486014759199568:36:head : missing
2019-04-04 14:48:23.182563 osd.9 [ERR] 51.1e5 shard 9 51:a7a6a3fe::18256662680708088428:169:head : missing
2019-04-04 14:48:28.472711 osd.9 [ERR] 51.1e5 shard 5 51:a7a6a3fe::18256662680708088428:169:head : missing
2019-04-04 14:50:03.649604 osd.9 [ERR] 51.1e5 deep-scrub 13 missing, 0 inconsistent objects
2019-04-04 14:50:03.649608 osd.9 [ERR] 51.1e5 deep-scrub 26 errors
As it happens, the OSD on which the objects are reported missing (osd.5) uses filestore, and I verified by inspecting the disk that all of the objects exist there.
If I restart the OSD that is using bluestore (osd.9), the cluster returns to healthy.
This is reproducible 100% of the time.
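For reference, the loop I go through to reproduce and observe this looks roughly like the following (the PG id is just an example, and the commands assume a systemd deployment; the OSD restart at the end is the only thing that gets the cluster healthy again when the bluestore OSD is involved):

# ceph pg deep-scrub 51.10d                        # wait for the deep-scrub to finish
# rados list-inconsistent-obj 51.10d --format=json-pretty
# ceph pg repair 51.10d
# ceph -w                                          # the same objects are reported missing again
# systemctl restart ceph-osd@9                     # workaround: restart the bluestore OSD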
Updated by David Zafman about 5 years ago
It would be helpful to see a ceph pg deep-scrub (wait for it to finish) followed by the output of rados list-inconsistent-obj.
Updated by Iain Buclaw about 5 years ago
David Zafman wrote:
It would be helpful to see a ceph pg deep-scrub (wait for it to finish) followed by the output of rados list-inconsistent-obj.
Actually, this is more related to #39116 than I first thought.
The following behaviour:
2019-04-04 11:33:03.522849 mon.120 [INF] osd.3 failed (root=default,host=121) (connection refused reported by osd.6)
Happens as a result of the assert in #39116; it just so happens that the cluster returns to healthy after the filestore OSD boots back up.
So perhaps the bluestore OSD does not assert / crash, but the cluster does not return to healthy either.
Updated by Iain Buclaw about 5 years ago
# ceph pg deep-scrub 51.177
instructing pg 51.177 on osd.9 to deep-scrub
ceph -w logs:
2019-04-05 07:12:01.200548 osd.9 [ERR] 51.177 shard 0 51:eedfc6f8::4653376157940334334:47:head : missing
2019-04-05 07:12:06.580296 osd.9 [ERR] 51.177 shard 9 51:eedfc6f8::4653376157940334334:47:head : missing
2019-04-05 07:12:38.708078 osd.9 [ERR] 51.177 deep-scrub : stat mismatch, got 19714/19713 objects, 0/0 clones, 19714/19713 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 246025587/246016008 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2019-04-05 07:12:38.708093 osd.9 [ERR] 51.177 deep-scrub 1 missing, 0 inconsistent objects
2019-04-05 07:12:38.708096 osd.9 [ERR] 51.177 deep-scrub 3 errors
On osd.0, I confirmed that the file is not missing:
# find 51.177_head/ -name "*47_4653376157940334334*"
51.177_head/DIR_7/DIR_7/DIR_B/DIR_F/47_4653376157940334334_head_1F63FB77__33
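osd.9 is bluestore, so there is no file to find on its disk; for completeness, a sketch of how the object's presence could be checked in its objectstore (assuming the default data path and a systemd deployment; the OSD has to be stopped while ceph-objectstore-tool runs):

# systemctl stop ceph-osd@9
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 \
      --pgid 51.177 --op list | grep 4653376157940334334
# systemctl start ceph-osd@9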
rados list-inconsistent-obj
{ "epoch": 53141, "inconsistents": [ { "object": { "name": "47", "nspace": "", "locator": "4653376157940334334", "snap": "head", "version": 4526675 }, "errors": [], "union_shard_errors": [ "missing" ], "selected_object_info": { "oid": { "oid": "47", "key": "4653376157940334334", "snapid": -2, "hash": 526646135, "max": 0, "pool": 51, "namespace": "" }, "version": "52284'454181", "prior_version": "52249'377382", "last_reqid": "client.36576007.0:15677702", "user_version": 4526675, "size": 9579, "mtime": "2019-02-07 10:58:34.211435", "local_mtime": "2019-02-07 10:58:34.218297", "lost": 0, "flags": [ "dirty", "data_digest", "omap_digest" ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0x39906511", "omap_digest": "0xffffffff", "expected_object_size": 0, "expected_write_size": 0, "alloc_hint_flags": 0, "manifest": { "type": 0 }, "watchers": {} }, "shards": [ { "osd": 0, "primary": false, "errors": [], "size": 9579, "omap_digest": "0xffffffff", "data_digest": "0x39906511" }, { "osd": 9, "primary": true, "errors": [ "missing" ] } ] } ] }
Repairing:
# ceph pg repair 51.177
instructing pg 51.177 on osd.9 to repair
ceph -w logs:
2019-04-05 07:31:34.248658 osd.9 [ERR] 51.177 shard 9 51:eee4d2c1::171336252089860175:169:head : missing
2019-04-05 07:31:39.532834 osd.9 [ERR] 51.177 shard 0 51:eee4d2c1::171336252089860175:169:head : missing
2019-04-05 07:31:39.532835 osd.9 [ERR] 51.177 shard 0 51:eeea33e8::383290891110620472:41:head : missing
2019-04-05 07:31:44.802083 osd.9 [ERR] 51.177 shard 9 51:eeea33e8::383290891110620472:41:head : missing
2019-04-05 07:32:06.433237 mon.ap-120 [INF] osd.9 failed (root=default,host=ap-124) (connection refused reported by osd.8)
2019-04-05 07:32:06.485653 mon.ap-120 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-04-05 07:32:09.979661 mon.ap-120 [WRN] Health check failed: Reduced data availability: 3 pgs inactive, 41 pgs peering (PG_AVAILABILITY)
2019-04-05 07:32:09.979712 mon.ap-120 [WRN] Health check failed: Degraded data redundancy: 312778/23936452 objects degraded (1.307%), 18 pgs degraded (PG_DEGRADED)
2019-04-05 07:32:11.240626 mon.ap-120 [INF] Health check cleared: OSD_SCRUB_ERRORS (was: 3 scrub errors)
2019-04-05 07:32:11.240652 mon.ap-120 [INF] Health check cleared: PG_DAMAGED (was: Possible data damage: 1 pg inconsistent)
2019-04-05 07:32:13.483748 mon.ap-120 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 3 pgs inactive, 41 pgs peering)
2019-04-05 07:32:19.295771 mon.ap-120 [WRN] Health check update: Degraded data redundancy: 2429737/23936452 objects degraded (10.151%), 159 pgs degraded (PG_DEGRADED)
2019-04-05 07:32:25.339488 mon.ap-120 [WRN] Health check update: Degraded data redundancy: 2429737/23936488 objects degraded (10.151%), 159 pgs degraded (PG_DEGRADED)
2019-04-05 07:32:35.207942 mon.ap-120 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-04-05 07:32:35.580655 mon.ap-120 [INF] osd.9 172.28.19.9:6800/379046 boot
2019-04-05 07:32:37.834878 mon.ap-120 [WRN] Health check update: Degraded data redundancy: 2228323/23936488 objects degraded (9.309%), 147 pgs degraded (PG_DEGRADED)
2019-04-05 07:32:41.458732 mon.ap-120 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 1335902/23936488 objects degraded (5.581%), 85 pgs degraded)
2019-04-05 07:32:41.458757 mon.ap-120 [INF] Cluster is now healthy
Running deep scrub again:
2019-04-05 07:43:42.636565 osd.9 [ERR] 51.177 shard 9 51:eee4d2c1::171336252089860175:169:head : missing
2019-04-05 07:43:47.926850 osd.9 [ERR] 51.177 shard 0 51:eee4d2c1::171336252089860175:169:head : missing
2019-04-05 07:43:47.926851 osd.9 [ERR] 51.177 shard 0 51:eeea33e8::383290891110620472:41:head : missing
2019-04-05 07:43:53.195249 osd.9 [ERR] 51.177 shard 9 51:eeea33e8::383290891110620472:41:head : missing
2019-04-05 07:44:14.234075 osd.9 [ERR] 51.177 deep-scrub : stat mismatch, got 19733/19731 objects, 0/0 clones, 19733/19731 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 247108485/247093139 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2019-04-05 07:44:14.234086 osd.9 [ERR] 51.177 deep-scrub 2 missing, 0 inconsistent objects
2019-04-05 07:44:14.234092 osd.9 [ERR] 51.177 deep-scrub 5 errors
2019-04-05 07:44:16.666888 mon.ap-120 [ERR] Health check update: 24 scrub errors (OSD_SCRUB_ERRORS)
2019-04-05 07:44:16.666934 mon.ap-120 [ERR] Health check update: Possible data damage: 2 pgs inconsistent (PG_DAMAGED)
rados list-inconsistent-obj again:
{ "epoch": 53148, "inconsistents": [ { "object": { "name": "169", "nspace": "", "locator": "171336252089860175", "snap": "head", "version": 4440812 }, "errors": [], "union_shard_errors": [ "missing" ], "selected_object_info": { "oid": { "oid": "169", "key": "171336252089860175", "snapid": -2, "hash": 2202740599, "max": 0, "pool": 51, "namespace": "" }, "version": "50939'367208", "prior_version": "50939'336508", "last_reqid": "client.36575077.0:14561030", "user_version": 4440812, "size": 2005, "mtime": "2019-01-20 20:54:12.394827", "local_mtime": "2019-01-20 20:54:12.392736", "lost": 0, "flags": [ "dirty", "data_digest", "omap_digest" ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0x54b33d8e", "omap_digest": "0xffffffff", "expected_object_size": 0, "expected_write_size": 0, "alloc_hint_flags": 0, "manifest": { "type": 0 }, "watchers": {} }, "shards": [ { "osd": 0, "primary": false, "errors": [ "missing" ] }, { "osd": 9, "primary": true, "errors": [], "size": 2005, "omap_digest": "0xffffffff", "data_digest": "0x54b33d8e" } ] }, { "object": { "name": "41", "nspace": "", "locator": "383290891110620472", "snap": "head", "version": 4373192 }, "errors": [], "union_shard_errors": [ "missing" ], "selected_object_info": { "oid": { "oid": "41", "key": "383290891110620472", "snapid": -2, "hash": 399267703, "max": 0, "pool": 51, "namespace": "" }, "version": "53145'527997", "prior_version": "50939'286779", "last_reqid": "osd.9.0:707605", "user_version": 4373192, "size": 13341, "mtime": "2018-12-12 06:37:42.094992", "local_mtime": "2018-12-12 06:37:42.096127", "lost": 0, "flags": [ "dirty", "data_digest", "omap_digest" ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0xc89910ed", "omap_digest": "0xffffffff", "expected_object_size": 0, "expected_write_size": 0, "alloc_hint_flags": 0, "manifest": { "type": 0 }, "watchers": {} }, "shards": [ { "osd": 0, "primary": false, "errors": [], "size": 13341, "omap_digest": "0xffffffff", "data_digest": "0xc89910ed" }, { "osd": 9, "primary": true, "errors": [ "missing" ] } ] } ] }
ceph pg repair again:
2019-04-05 07:49:09.475060 osd.9 [ERR] 51.177 shard 9 51:eee4d2c1::171336252089860175:169:head : missing
2019-04-05 07:49:14.755573 osd.9 [ERR] 51.177 shard 0 51:eee4d2c1::171336252089860175:169:head : missing
2019-04-05 07:49:14.755575 osd.9 [ERR] 51.177 shard 0 51:eeea33e8::383290891110620472:41:head : missing
2019-04-05 07:49:20.021187 osd.9 [ERR] 51.177 shard 9 51:eeea33e8::383290891110620472:41:head : missing
2019-04-05 07:49:41.227801 mon.ap-120 [INF] osd.9 failed (root=default,host=ap-124) (connection refused reported by osd.7)
2019-04-05 07:49:41.787921 mon.ap-120 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-04-05 07:49:44.132558 mon.ap-120 [WRN] Health check failed: Reduced data availability: 3 pgs inactive, 69 pgs peering (PG_AVAILABILITY)
2019-04-05 07:49:46.327214 mon.ap-120 [WRN] Health check failed: Degraded data redundancy: 1052659/23936848 objects degraded (4.398%), 74 pgs degraded (PG_DEGRADED)
2019-04-05 07:49:46.327246 mon.ap-120 [INF] Health check cleared: OSD_SCRUB_ERRORS (was: 24 scrub errors)
2019-04-05 07:49:46.327259 mon.ap-120 [INF] Health check cleared: PG_DAMAGED (was: Possible data damage: 2 pgs inconsistent)
2019-04-05 07:49:49.745143 mon.ap-120 [WRN] Health check update: Reduced data availability: 1 pg inactive, 39 pgs peering (PG_AVAILABILITY)
2019-04-05 07:49:49.858132 mon.ap-120 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 39 pgs peering)
2019-04-05 07:49:54.745487 mon.ap-120 [WRN] Health check update: Degraded data redundancy: 2429737/23936848 objects degraded (10.151%), 159 pgs degraded (PG_DEGRADED)
2019-04-05 07:50:11.077110 mon.ap-120 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-04-05 07:50:11.339766 mon.ap-120 [INF] osd.9 172.28.19.9:6800/386390 boot
2019-04-05 07:50:14.761104 mon.ap-120 [WRN] Health check update: Degraded data redundancy: 1757404/23936848 objects degraded (7.342%), 112 pgs degraded (PG_DEGRADED)
2019-04-05 07:50:16.859843 mon.ap-120 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 1757404/23936848 objects degraded (7.342%), 112 pgs degraded)
2019-04-05 07:50:16.859884 mon.ap-120 [INF] Cluster is now healthy
Restarted both osd.0 and osd.9, then ran a deep scrub for the third time:
2019-04-05 08:00:44.170028 osd.9 [ERR] 51.177 shard 9 51:eee4d2c1::171336252089860175:169:head : missing
2019-04-05 08:00:49.461396 osd.9 [ERR] 51.177 shard 0 51:eee4d2c1::171336252089860175:169:head : missing
2019-04-05 08:00:49.461398 osd.9 [ERR] 51.177 shard 0 51:eeea33e8::383290891110620472:41:head : missing
2019-04-05 08:00:54.735733 osd.9 [ERR] 51.177 shard 9 51:eeea33e8::383290891110620472:41:head : missing
2019-04-05 08:01:20.882362 mon.ap-120 [ERR] Health check update: 24 scrub errors (OSD_SCRUB_ERRORS)
2019-04-05 08:01:20.882390 mon.ap-120 [ERR] Health check update: Possible data damage: 2 pgs inconsistent (PG_DAMAGED)
2019-04-05 08:01:15.761672 osd.9 [ERR] 51.177 deep-scrub : stat mismatch, got 19733/19731 objects, 0/0 clones, 19733/19731 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 247108485/247093139 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2019-04-05 08:01:15.761683 osd.9 [ERR] 51.177 deep-scrub 2 missing, 0 inconsistent objects
2019-04-05 08:01:15.761685 osd.9 [ERR] 51.177 deep-scrub 5 errors
So my initial assumption was wrong; pg repair isn't working at all.
Updated by Iain Buclaw about 5 years ago
Another PG, this time where the missing objects are reported on osd.0 (filestore) rather than on osd.9 (bluestore) as in the previous example.
# ceph pg deep-scrub 51.1fa
instructing pg 51.1fa on osd.0 to deep-scrub
Logs:
2019-04-05 08:11:42.606348 osd.0 [ERR] 51.1fa shard 9 51:5fd90f47::193061343227287063:169:head : missing
2019-04-05 08:11:47.944111 osd.0 [ERR] 51.1fa shard 0 51:5fd90f47::193061343227287063:169:head : missing
2019-04-05 08:11:58.658938 osd.0 [ERR] 51.1fa shard 9 51:5fe9e70d::238938732312934742:169:head : missing
2019-04-05 08:11:58.658940 osd.0 [ERR] 51.1fa shard 9 51:5fe9e70d::238938732312934742:21:head : missing
2019-04-05 08:11:58.658940 osd.0 [ERR] 51.1fa shard 9 51:5fe9e70d::238938732312934742:22:head : missing
2019-04-05 08:11:58.658941 osd.0 [ERR] 51.1fa shard 9 51:5fe9e70d::238938732312934742:23:head : missing
2019-04-05 08:12:04.019376 osd.0 [ERR] 51.1fa shard 0 51:5fe9e70d::238938732312934742:169:head : missing
2019-04-05 08:12:04.019379 osd.0 [ERR] 51.1fa shard 0 51:5fe9e70d::238938732312934742:21:head : missing
2019-04-05 08:12:04.019380 osd.0 [ERR] 51.1fa shard 0 51:5fe9e70d::238938732312934742:22:head : missing
2019-04-05 08:12:04.019381 osd.0 [ERR] 51.1fa shard 0 51:5fe9e70d::238938732312934742:23:head : missing
2019-04-05 08:12:25.412986 osd.0 [ERR] 51.1fa deep-scrub : stat mismatch, got 19524/19519 objects, 0/0 clones, 19524/19519 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 227386438/227371540 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2019-04-05 08:12:25.412999 osd.0 [ERR] 51.1fa deep-scrub 5 missing, 0 inconsistent objects
2019-04-05 08:12:25.413002 osd.0 [ERR] 51.1fa deep-scrub 11 errors
rados list-inconsistent-obj
{ "epoch": 53170, "inconsistents": [ { "object": { "name": "169", "nspace": "", "locator": "193061343227287063", "snap": "head", "version": 4309859 }, "errors": [], "union_shard_errors": [ "missing" ], "selected_object_info": { "oid": { "oid": "169", "key": "193061343227287063", "snapid": -2, "hash": 3807419386, "max": 0, "pool": 51, "namespace": "" }, "version": "50939'299113", "prior_version": "0'0", "last_reqid": "client.36576007.0:7932947", "user_version": 4309859, "size": 2160, "mtime": "2018-12-19 10:15:30.411967", "local_mtime": "2018-12-19 10:15:30.412661", "lost": 0, "flags": [ "dirty", "data_digest" ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0xd7ca497f", "omap_digest": "0xffffffff", "expected_object_size": 0, "expected_write_size": 0, "alloc_hint_flags": 0, "manifest": { "type": 0 }, "watchers": {} }, "shards": [ { "osd": 0, "primary": true, "errors": [ "missing" ] }, { "osd": 9, "primary": false, "errors": [], "size": 2160, "omap_digest": "0xffffffff", "data_digest": "0xd7ca497f" } ] }, { "object": { "name": "169", "nspace": "", "locator": "238938732312934742", "snap": "head", "version": 4285966 }, "errors": [], "union_shard_errors": [ "missing" ], "selected_object_info": { "oid": { "oid": "169", "key": "238938732312934742", "snapid": -2, "hash": 2967967738, "max": 0, "pool": 51, "namespace": "" }, "version": "50939'271117", "prior_version": "50905'206245", "last_reqid": "client.36574111.0:5414590", "user_version": 4285966, "size": 4794, "mtime": "2018-12-06 01:14:21.215731", "local_mtime": "2018-12-06 01:14:21.216145", "lost": 0, "flags": [ "dirty", "data_digest" ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0x5edc7438", "omap_digest": "0xffffffff", "expected_object_size": 0, "expected_write_size": 0, "alloc_hint_flags": 0, "manifest": { "type": 0 }, "watchers": {} }, "shards": [ { "osd": 0, "primary": true, "errors": [ "missing" ] }, { "osd": 9, "primary": false, "errors": [], "size": 4794, "omap_digest": "0xffffffff", "data_digest": "0x5edc7438" } ] }, { "object": { "name": "21", "nspace": "", "locator": "238938732312934742", "snap": "head", "version": 4285956 }, "errors": [], "union_shard_errors": [ "missing" ], "selected_object_info": { "oid": { "oid": "21", "key": "238938732312934742", "snapid": -2, "hash": 2967967738, "max": 0, "pool": 51, "namespace": "" }, "version": "50939'271107", "prior_version": "50905'206235", "last_reqid": "client.36574111.0:5414580", "user_version": 4285956, "size": 1014, "mtime": "2018-12-06 01:14:21.144941", "local_mtime": "2018-12-06 01:14:21.145108", "lost": 0, "flags": [ "dirty", "data_digest" ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0x224edafa", "omap_digest": "0xffffffff", "expected_object_size": 0, "expected_write_size": 0, "alloc_hint_flags": 0, "manifest": { "type": 0 }, "watchers": {} }, "shards": [ { "osd": 0, "primary": true, "errors": [ "missing" ] }, { "osd": 9, "primary": false, "errors": [], "size": 1014, "omap_digest": "0xffffffff", "data_digest": "0x224edafa" } ] }, { "object": { "name": "22", "nspace": "", "locator": "238938732312934742", "snap": "head", "version": 4285957 }, "errors": [], "union_shard_errors": [ "missing" ], "selected_object_info": { "oid": { "oid": "22", "key": "238938732312934742", "snapid": -2, "hash": 2967967738, "max": 0, "pool": 51, "namespace": "" }, "version": "50939'271108", "prior_version": "50905'206236", "last_reqid": "client.36574111.0:5414581", "user_version": 4285957, "size": 2136, "mtime": "2018-12-06 01:14:21.151310", "local_mtime": "2018-12-06 
01:14:21.151584", "lost": 0, "flags": [ "dirty", "data_digest" ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0x5f174e41", "omap_digest": "0xffffffff", "expected_object_size": 0, "expected_write_size": 0, "alloc_hint_flags": 0, "manifest": { "type": 0 }, "watchers": {} }, "shards": [ { "osd": 0, "primary": true, "errors": [ "missing" ] }, { "osd": 9, "primary": false, "errors": [], "size": 2136, "omap_digest": "0xffffffff", "data_digest": "0x5f174e41" } ] }, { "object": { "name": "23", "nspace": "", "locator": "238938732312934742", "snap": "head", "version": 4285958 }, "errors": [], "union_shard_errors": [ "missing" ], "selected_object_info": { "oid": { "oid": "23", "key": "238938732312934742", "snapid": -2, "hash": 2967967738, "max": 0, "pool": 51, "namespace": "" }, "version": "50939'271109", "prior_version": "50905'206237", "last_reqid": "client.36574111.0:5414582", "user_version": 4285958, "size": 4794, "mtime": "2018-12-06 01:14:21.157871", "local_mtime": "2018-12-06 01:14:21.158320", "lost": 0, "flags": [ "dirty", "data_digest" ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0x5edc7438", "omap_digest": "0xffffffff", "expected_object_size": 0, "expected_write_size": 0, "alloc_hint_flags": 0, "manifest": { "type": 0 }, "watchers": {} }, "shards": [ { "osd": 0, "primary": true, "errors": [ "missing" ] }, { "osd": 9, "primary": false, "errors": [], "size": 4794, "omap_digest": "0xffffffff", "data_digest": "0x5edc7438" } ] } ] }
I checked one of the reportedly missing files on osd.0, and the file is present on the physical disk.
# find 51.1fa_head -name "*169_193061343227287063*"
51.1fa_head/DIR_A/DIR_F/DIR_B/DIR_9/DIR_0/169_193061343227287063_head_E2F09BFA__33
ceph pg repair 51.1fa logs
2019-04-05 08:16:56.694007 osd.0 [ERR] 51.1fa shard 9 51:5fd90f47::193061343227287063:169:head : missing
2019-04-05 08:17:01.955389 osd.0 [ERR] 51.1fa shard 0 51:5fd90f47::193061343227287063:169:head : missing
2019-04-05 08:17:12.493701 osd.0 [ERR] 51.1fa shard 9 51:5fe9e70d::238938732312934742:169:head : missing
2019-04-05 08:17:12.493703 osd.0 [ERR] 51.1fa shard 9 51:5fe9e70d::238938732312934742:21:head : missing
2019-04-05 08:17:12.493704 osd.0 [ERR] 51.1fa shard 9 51:5fe9e70d::238938732312934742:22:head : missing
2019-04-05 08:17:12.493704 osd.0 [ERR] 51.1fa shard 9 51:5fe9e70d::238938732312934742:23:head : missing
2019-04-05 08:17:17.773788 osd.0 [ERR] 51.1fa shard 0 51:5fe9e70d::238938732312934742:169:head : missing
2019-04-05 08:17:17.773790 osd.0 [ERR] 51.1fa shard 0 51:5fe9e70d::238938732312934742:21:head : missing
2019-04-05 08:17:17.773791 osd.0 [ERR] 51.1fa shard 0 51:5fe9e70d::238938732312934742:22:head : missing
2019-04-05 08:17:17.773792 osd.0 [ERR] 51.1fa shard 0 51:5fe9e70d::238938732312934742:23:head : missing
2019-04-05 08:17:38.850212 mon.ap-120 [INF] osd.0 failed (root=default,host=ap-120) (connection refused reported by osd.4)
2019-04-05 08:17:39.202716 mon.ap-120 [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-04-05 08:17:42.869226 mon.ap-120 [ERR] Health check update: 19 scrub errors (OSD_SCRUB_ERRORS)
2019-04-05 08:17:42.869254 mon.ap-120 [ERR] Health check update: Possible data damage: 1 pg inconsistent (PG_DAMAGED)
2019-04-05 08:17:42.869269 mon.ap-120 [WRN] Health check failed: Degraded data redundancy: 199697/23937388 objects degraded (0.834%), 15 pgs degraded (PG_DEGRADED)
2019-04-05 08:17:50.515536 mon.ap-120 [WRN] Health check update: Degraded data redundancy: 2394605/23937388 objects degraded (10.004%), 158 pgs degraded (PG_DEGRADED)
2019-04-05 08:18:04.646456 mon.ap-120 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-04-05 08:18:04.817317 mon.ap-120 [INF] osd.0 172.28.19.5:6804/977646 boot
2019-04-05 08:18:09.017805 mon.ap-120 [WRN] Health check update: Degraded data redundancy: 1450925/23937388 objects degraded (6.061%), 84 pgs degraded (PG_DEGRADED)
2019-04-05 08:18:11.384578 mon.ap-120 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 1450925/23937388 objects degraded (6.061%), 84 pgs degraded)
osd.0 crashed with the same assert stack trace.
Running deep-scrub again...
2019-04-05 08:21:15.039834 osd.0 [ERR] 51.1fa shard 9 51:5fd90f47::193061343227287063:169:head : missing
2019-04-05 08:21:20.318156 osd.0 [ERR] 51.1fa shard 0 51:5fd90f47::193061343227287063:169:head : missing
2019-04-05 08:21:30.853106 osd.0 [ERR] 51.1fa shard 9 51:5fe9e70d::238938732312934742:169:head : missing
2019-04-05 08:21:30.853108 osd.0 [ERR] 51.1fa shard 9 51:5fe9e70d::238938732312934742:21:head : missing
2019-04-05 08:21:30.853109 osd.0 [ERR] 51.1fa shard 9 51:5fe9e70d::238938732312934742:22:head : missing
2019-04-05 08:21:30.853111 osd.0 [ERR] 51.1fa shard 9 51:5fe9e70d::238938732312934742:23:head : missing
2019-04-05 08:21:36.134245 osd.0 [ERR] 51.1fa shard 0 51:5fe9e70d::238938732312934742:169:head : missing
2019-04-05 08:21:36.134246 osd.0 [ERR] 51.1fa shard 0 51:5fe9e70d::238938732312934742:21:head : missing
2019-04-05 08:21:36.134247 osd.0 [ERR] 51.1fa shard 0 51:5fe9e70d::238938732312934742:22:head : missing
2019-04-05 08:21:36.134248 osd.0 [ERR] 51.1fa shard 0 51:5fe9e70d::238938732312934742:23:head : missing
2019-04-05 08:22:00.949764 mon.ap-120 [ERR] Health check update: 30 scrub errors (OSD_SCRUB_ERRORS)
2019-04-05 08:22:00.949791 mon.ap-120 [ERR] Health check update: Possible data damage: 2 pgs inconsistent (PG_DAMAGED)
2019-04-05 08:21:57.076095 osd.0 [ERR] 51.1fa deep-scrub : stat mismatch, got 19524/19519 objects, 0/0 clones, 19524/19519 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 227386438/227371540 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2019-04-05 08:21:57.076105 osd.0 [ERR] 51.1fa deep-scrub 5 missing, 0 inconsistent objects
2019-04-05 08:21:57.076107 osd.0 [ERR] 51.1fa deep-scrub 11 errors
And rados list-inconsistent-obj again.
{ "epoch": 53177, "inconsistents": [ { "object": { "name": "169", "nspace": "", "locator": "193061343227287063", "snap": "head", "version": 4309859 }, "errors": [], "union_shard_errors": [ "missing" ], "selected_object_info": { "oid": { "oid": "169", "key": "193061343227287063", "snapid": -2, "hash": 3807419386, "max": 0, "pool": 51, "namespace": "" }, "version": "50939'299113", "prior_version": "0'0", "last_reqid": "client.36576007.0:7932947", "user_version": 4309859, "size": 2160, "mtime": "2018-12-19 10:15:30.411967", "local_mtime": "2018-12-19 10:15:30.412661", "lost": 0, "flags": [ "dirty", "data_digest" ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0xd7ca497f", "omap_digest": "0xffffffff", "expected_object_size": 0, "expected_write_size": 0, "alloc_hint_flags": 0, "manifest": { "type": 0 }, "watchers": {} }, "shards": [ { "osd": 0, "primary": true, "errors": [ "missing" ] }, { "osd": 9, "primary": false, "errors": [], "size": 2160, "omap_digest": "0xffffffff", "data_digest": "0xd7ca497f" } ] }, { "object": { "name": "169", "nspace": "", "locator": "238938732312934742", "snap": "head", "version": 4285966 }, "errors": [], "union_shard_errors": [ "missing" ], "selected_object_info": { "oid": { "oid": "169", "key": "238938732312934742", "snapid": -2, "hash": 2967967738, "max": 0, "pool": 51, "namespace": "" }, "version": "50939'271117", "prior_version": "50905'206245", "last_reqid": "client.36574111.0:5414590", "user_version": 4285966, "size": 4794, "mtime": "2018-12-06 01:14:21.215731", "local_mtime": "2018-12-06 01:14:21.216145", "lost": 0, "flags": [ "dirty", "data_digest" ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0x5edc7438", "omap_digest": "0xffffffff", "expected_object_size": 0, "expected_write_size": 0, "alloc_hint_flags": 0, "manifest": { "type": 0 }, "watchers": {} }, "shards": [ { "osd": 0, "primary": true, "errors": [ "missing" ] }, { "osd": 9, "primary": false, "errors": [], "size": 4794, "omap_digest": "0xffffffff", "data_digest": "0x5edc7438" } ] }, { "object": { "name": "21", "nspace": "", "locator": "238938732312934742", "snap": "head", "version": 4285956 }, "errors": [], "union_shard_errors": [ "missing" ], "selected_object_info": { "oid": { "oid": "21", "key": "238938732312934742", "snapid": -2, "hash": 2967967738, "max": 0, "pool": 51, "namespace": "" }, "version": "50939'271107", "prior_version": "50905'206235", "last_reqid": "client.36574111.0:5414580", "user_version": 4285956, "size": 1014, "mtime": "2018-12-06 01:14:21.144941", "local_mtime": "2018-12-06 01:14:21.145108", "lost": 0, "flags": [ "dirty", "data_digest" ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0x224edafa", "omap_digest": "0xffffffff", "expected_object_size": 0, "expected_write_size": 0, "alloc_hint_flags": 0, "manifest": { "type": 0 }, "watchers": {} }, "shards": [ { "osd": 0, "primary": true, "errors": [ "missing" ] }, { "osd": 9, "primary": false, "errors": [], "size": 1014, "omap_digest": "0xffffffff", "data_digest": "0x224edafa" } ] }, { "object": { "name": "22", "nspace": "", "locator": "238938732312934742", "snap": "head", "version": 4285957 }, "errors": [], "union_shard_errors": [ "missing" ], "selected_object_info": { "oid": { "oid": "22", "key": "238938732312934742", "snapid": -2, "hash": 2967967738, "max": 0, "pool": 51, "namespace": "" }, "version": "50939'271108", "prior_version": "50905'206236", "last_reqid": "client.36574111.0:5414581", "user_version": 4285957, "size": 2136, "mtime": "2018-12-06 01:14:21.151310", "local_mtime": "2018-12-06 
01:14:21.151584", "lost": 0, "flags": [ "dirty", "data_digest" ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0x5f174e41", "omap_digest": "0xffffffff", "expected_object_size": 0, "expected_write_size": 0, "alloc_hint_flags": 0, "manifest": { "type": 0 }, "watchers": {} }, "shards": [ { "osd": 0, "primary": true, "errors": [ "missing" ] }, { "osd": 9, "primary": false, "errors": [], "size": 2136, "omap_digest": "0xffffffff", "data_digest": "0x5f174e41" } ] }, { "object": { "name": "23", "nspace": "", "locator": "238938732312934742", "snap": "head", "version": 4285958 }, "errors": [], "union_shard_errors": [ "missing" ], "selected_object_info": { "oid": { "oid": "23", "key": "238938732312934742", "snapid": -2, "hash": 2967967738, "max": 0, "pool": 51, "namespace": "" }, "version": "50939'271109", "prior_version": "50905'206237", "last_reqid": "client.36574111.0:5414582", "user_version": 4285958, "size": 4794, "mtime": "2018-12-06 01:14:21.157871", "local_mtime": "2018-12-06 01:14:21.158320", "lost": 0, "flags": [ "dirty", "data_digest" ], "truncate_seq": 0, "truncate_size": 0, "data_digest": "0x5edc7438", "omap_digest": "0xffffffff", "expected_object_size": 0, "expected_write_size": 0, "alloc_hint_flags": 0, "manifest": { "type": 0 }, "watchers": {} }, "shards": [ { "osd": 0, "primary": true, "errors": [ "missing" ] }, { "osd": 9, "primary": false, "errors": [], "size": 4794, "omap_digest": "0xffffffff", "data_digest": "0x5edc7438" } ] } ] }
So the same behaviour occurs on both the filestore and the bluestore OSD.
Updated by David Zafman almost 5 years ago
Since OSDs are crashing, we should get stack traces out of the logs (e.g. osd.9). Per http://tracker.ceph.com/issues/39115#note-5, osd.9 crashes right when the missing copy would be getting pulled (copied) from osd.0 to osd.9.
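If it is the same assertion failure as in #39116, the backtrace can usually be pulled straight out of the OSD log, along these lines (assuming the default log location):

# grep -A 40 'FAILED assert' /var/log/ceph/ceph-osd.9.log | tail -50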
Updated by Iain Buclaw almost 5 years ago
See #39116 for the stack trace.
I initially thought that this and the other issue were two separate problems. However, they could be related, in which case one of them is a duplicate.
Updated by David Zafman over 4 years ago
- Is duplicate of Bug #39116: Draining filestore osd, removing, and adding new bluestore osd causes OSDs to crash added
Updated by David Zafman over 4 years ago
- Status changed from New to Duplicate
OSD crashes are the underlying issue here, and we can't say anything about repair until there are no more crashes.