Bug #5671
pgs stuck in recovering, not progressing at all
Status: Closed
Description
All OSDs were running 0.66; 4 of them were upgraded to ea1c623406ac0761c6d63041dcc5ca849d1fa932, at which point I hit #5655. After upgrading those 4 to 3612802bdc53fe6ecdd63c77cbba4077bf0c0402 (wip-5655), I had this situation:
2013-07-18 23:57:57.533795 mon.0 [INF] pgmap v10134893: 16760 pgs: 16268 active+clean, 448 active+recovery_wait, 4 active+clean+scrubbing+deep, 40 active+recovering; 46827 GB data, 144 TB used, 116 TB / 261 TB avail; 0B/s rd, 0B/s wr, 0op/s; 1389/860715494 degraded (0.000%)
This went on for quite a while (more than half an hour), and pg queries on those 40 recovering PGs showed no progress at all.
A pg dump | grep recovery showed that in all of those 40, one of the four upgraded OSDs is the primary.
Updated by Faidon Liambotis over 10 years ago
cephdrop has 5671/ceph-osd.1.log.gz, captured with --debug-ms=1 --debug-osd=20 --debug-filestore=20 from osd.1, one of the offending osds.
Updated by Faidon Liambotis over 10 years ago
OK, it gets even weirder.
1) Stop all 4 offending osds.
2013-07-19 00:46:02.148965 mon.0 [INF] pgmap v10135983: 16760 pgs: 15341 active+clean, 1409 active+degraded, 10 active+clean+scrubbing+deep; 46827 GB data, 144 TB used, 116 TB / 261 TB avail; 0B/s rd, 0op/s; 24241697/860716265 degraded (2.816%)
2) Start osd.0
2013-07-19 00:46:34.173860 mon.0 [INF] pgmap v10135997: 16760 pgs: 15584 active+clean, 109 active+recovery_wait, 1056 active+degraded, 11 active+clean+scrubbing+deep; 46827 GB data, 144 TB used, 116 TB / 261 TB avail; 0B/s rd, 0op/s; 18113429/860716271 degraded (2.104%)
(a bunch recovery_wait, none recovering!)
3) Start osd.1
2013-07-19 00:47:30.926428 mon.0 [INF] pgmap v10136013: 16760 pgs: 15836 active+clean, 212 active+recovery_wait, 693 active+degraded, 9 active+clean+scrubbing+deep, 10 active+recovering; 46827 GB data, 144 TB used, 116 TB / 261 TB avail; 0B/s rd, 0op/s; 11939206/860716271 degraded (1.387%)
PGs in recovering state, all with osd.0 as primary:
3.3b4c 17429 4 4 0 2959102177 700 700 active+recovering 2013-07-19 00:47:04.089227 195557'21727 195610'136881 [0,125,47] [0,125,47] 195324'21665 2013-07-13 15:08:15.303277 195324'21665 2013-07-13 15:08:15.303277
3.352f 17400 1 1 0 2983271180 852 852 active+recovering 2013-07-19 00:47:04.089676 195557'23263 195610'95029 [0,73,98] [0,73,98] 195324'23212 2013-07-13 11:01:28.610117 195324'23212 2013-07-13 11:01:28.610117
3.339d 17352 4 4 0 3063528005 3006 3006 active+recovering 2013-07-19 00:47:04.089965 195557'1771010 195610'1462586 [0,112,143] [0,112,143] 195324'1769678 2013-07-13 09:40:24.129726 195324'1769678 2013-07-13 09:40:24.129726
3.3086 17474 5 5 0 3140224695 2610 2610 active+recovering 2013-07-19 00:47:04.089660 195593'195688 195610'149253 [0,43,66] [0,43,66] 195324'194950 2013-07-13 06:37:11.328844 195324'194950 2013-07-13 06:37:11.328844
3.2f22 17420 4 4 0 3095628235 3019 3019 active+recovering 2013-07-19 00:47:04.089388 195557'1718296 195610'2449674 [0,31,55] [0,31,55] 195324'1717034 2013-07-13 06:13:11.778770 195324'1717034 2013-07-13 06:13:11.778770
3.2a80 17639 2 2 0 3173414707 688 688 active+recovering 2013-07-19 00:47:04.089859 195557'21730 195610'87013 [0,134,114] [0,134,114] 195324'21656 2013-07-13 05:20:09.511508 195324'21656 2013-07-13 05:20:09.511508
3.2157 17524 5 5 0 3095001629 1106 1106 active+recovering 2013-07-19 00:47:04.089572 195557'29561 195610'138750 [0,125,60] [0,125,60] 195324'29291 2013-07-13 02:54:05.376660 195324'29291 2013-07-13 02:54:05.376660
3.923 17496 3 3 0 3137158093 826 826 active+recovering 2013-07-19 00:47:04.089508 195557'21602 195610'95853 [0,131,56] [0,131,56] 195324'21593 2013-07-17 23:39:57.077571 195324'21593 2013-07-17 23:39:57.077571
3.795 17621 2 2 0 3102108121 791 791 active+recovering 2013-07-19 00:47:04.089711 195557'21576 195610'77691 [0,47,59] [0,47,59] 195324'21566 2013-07-17 02:23:36.546292 195324'21566 2013-07-17 02:23:36.546292
3.184 17209 1 1 0 3020602998 3005 3005 active+recovering 2013-07-19 00:47:04.089473 195557'1721230 195610'1615620 [0,137,74] [0,137,74] 195324'1719921 2013-07-15 11:43:49.727598 195324'1719921 2013-07-15 11:43:49.727598
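Each row's acting set (the second bracketed list) names the primary first, so the pattern is easy to confirm mechanically. A minimal sketch over two abbreviated sample rows from the dump above (field positions are assumed from this dump's layout, not a stable interface):

```shell
# Print pgid and primary OSD for recovering PGs; the primary is the
# first element of the acting set, which is the last field here.
printf '%s\n' \
  "3.3b4c 17429 4 4 0 2959102177 700 700 active+recovering [0,125,47] [0,125,47]" \
  "3.352f 17400 1 1 0 2983271180 852 852 active+recovering [0,73,98] [0,73,98]" |
awk '/recovering/ {
  split($NF, osds, /[][,]/)   # "[0,125,47]" -> osds[2] == "0"
  print $1, "primary=osd." osds[2]
}'
```

Both sample rows report primary=osd.0, matching the observation above.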
pg query for the first one:
# ceph pg 3.3b4c query
{ "state": "active+recovering",
  "epoch": 195610,
  "up": [ 0, 125, 47],
  "acting": [ 0, 125, 47],
  "info": { "pgid": "3.3b4c",
    "last_update": "195557'21727",
    "last_complete": "195541'21720",
    "log_tail": "188654'21027",
    "last_backfill": "MAX",
    "purged_snaps": "[]",
    "history": { "epoch_created": 14, "last_epoch_started": 195608, "last_epoch_clean": 195604, "last_epoch_split": 0, "same_up_since": 195607, "same_interval_since": 195607, "same_primary_since": 195607, "last_scrub": "195324'21665", "last_scrub_stamp": "2013-07-13 15:08:15.303277", "last_deep_scrub": "195324'21665", "last_deep_scrub_stamp": "2013-07-13 15:08:15.303277", "last_clean_scrub_stamp": "2013-07-13 15:08:15.303277"},
    "stats": { "version": "195557'21727", "reported_seq": "136881", "reported_epoch": "195610", "state": "active+recovering", "last_fresh": "2013-07-19 00:47:05.687414", "last_change": "2013-07-19 00:47:04.089227", "last_active": "2013-07-19 00:47:05.687414", "last_clean": "2013-07-17 20:15:36.502113", "last_became_active": "2013-06-06 01:28:32.755032", "last_unstale": "2013-07-19 00:47:05.687414", "mapping_epoch": 195603, "log_start": "188654'21027", "ondisk_log_start": "188654'21027", "created": 14, "last_epoch_clean": 195604, "parent": "0.0", "parent_split_bits": 0, "last_scrub": "195324'21665", "last_scrub_stamp": "2013-07-13 15:08:15.303277", "last_deep_scrub": "195324'21665", "last_deep_scrub_stamp": "2013-07-13 15:08:15.303277", "last_clean_scrub_stamp": "2013-07-13 15:08:15.303277", "log_size": 700, "ondisk_log_size": 700, "stats_invalid": "0",
      "stat_sum": { "num_bytes": 2959102177, "num_objects": 17429, "num_object_clones": 0, "num_object_copies": 0, "num_objects_missing_on_primary": 0, "num_objects_degraded": 0, "num_objects_unfound": 0, "num_read": 1053, "num_read_kb": 6587, "num_write": 105396, "num_write_kb": 3048112, "num_scrub_errors": 0, "num_shallow_scrub_errors": 0, "num_deep_scrub_errors": 0, "num_objects_recovered": 50555, "num_bytes_recovered": 8517898807, "num_keys_recovered": 790},
      "stat_cat_sum": {}, "up": [ 0, 125, 47], "acting": [ 0, 125, 47]},
    "empty": 0, "dne": 0, "incomplete": 0, "last_epoch_started": 195608},
  "recovery_state": [
    { "name": "Started\/Primary\/Active",
      "enter_time": "2013-07-19 00:46:19.830713",
      "might_have_unfound": [ { "osd": 47, "status": "already probed"}, { "osd": 125, "status": "already probed"}],
      "recovery_progress": { "backfill_target": -1, "waiting_on_backfill": 0, "backfill_pos": "0\/\/0\/\/-1", "backfill_info": { "begin": "0\/\/0\/\/-1", "end": "0\/\/0\/\/-1", "objects": []}, "peer_backfill_info": { "begin": "0\/\/0\/\/-1", "end": "0\/\/0\/\/-1", "objects": []}, "backfills_in_flight": [],
        "pull_from_peer": [ { "pull_from": 47, "pulls": [
          { "recovery_progress": { "first?": 1, "data_complete?": 0, "data_recovered_to": 0, "omap_complete?": 0, "omap_recovered_to": ""}, "recovery_info": { "object": "7d373b4c\/4320.42_4\/4c\/Barcelona_MNAC_DSCF7773-100.JPG\/head\/\/3", "at_version": "195557'21727", "size": "18446744073709551615", "object_info": { "oid": { "oid": "", "key": "", "snapid": 0, "hash": 0, "max": 0, "pool": -1, "namespace": ""}, "category": "", "version": "0'0", "prior_version": "0'0", "last_reqid": "unknown.0.0:0", "size": 0, "mtime": "0.000000", "lost": 0, "wrlock_by": "unknown.0.0:0", "snaps": [], "truncate_seq": 0, "truncate_size": 0, "watchers": {}}, "snapset": { "snap_context": { "seq": 0, "snaps": []}, "head_exists": 0, "clones": []}, "copy_subset": "[0~18446744073709551615]", "clone_subset": "{}"}},
          { "recovery_progress": { "first?": 1, "data_complete?": 0, "data_recovered_to": 0, "omap_complete?": 0, "omap_recovered_to": ""}, "recovery_info": { "object": "a6027b4c\/4465.31_5\/53\/Saint-Vaast-lès-Mello_(60),_église_St-Vaast,_nef,_2e_grande_arcade_du_nord_1.jpg\/head\/\/3", "at_version": "195557'21722", "size": "18446744073709551615", "object_info": { "oid": { "oid": "", "key": "", "snapid": 0, "hash": 0, "max": 0, "pool": -1, "namespace": ""}, "category": "", "version": "0'0", "prior_version": "0'0", "last_reqid": "unknown.0.0:0", "size": 0, "mtime": "0.000000", "lost": 0, "wrlock_by": "unknown.0.0:0", "snaps": [], "truncate_seq": 0, "truncate_size": 0, "watchers": {}}, "snapset": { "snap_context": { "seq": 0, "snaps": []}, "head_exists": 0, "clones": []}, "copy_subset": "[0~18446744073709551615]", "clone_subset": "{}"}},
          { "recovery_progress": { "first?": 1, "data_complete?": 0, "data_recovered_to": 0, "omap_complete?": 0, "omap_recovered_to": ""}, "recovery_info": { "object": "6f257b4c\/4465.62__shadow__Hy8nYXY_pRRUl-det5IWs2Lpss-iRss_2\/head\/\/3", "at_version": "195557'21725", "size": "18446744073709551615", "object_info": { "oid": { "oid": "", "key": "", "snapid": 0, "hash": 0, "max": 0, "pool": -1, "namespace": ""}, "category": "", "version": "0'0", "prior_version": "0'0", "last_reqid": "unknown.0.0:0", "size": 0, "mtime": "0.000000", "lost": 0, "wrlock_by": "unknown.0.0:0", "snaps": [], "truncate_seq": 0, "truncate_size": 0, "watchers": {}}, "snapset": { "snap_context": { "seq": 0, "snaps": []}, "head_exists": 0, "clones": []}, "copy_subset": "[0~18446744073709551615]", "clone_subset": "{}"}},
          { "recovery_progress": { "first?": 1, "data_complete?": 0, "data_recovered_to": 0, "omap_complete?": 0, "omap_recovered_to": ""}, "recovery_info": { "object": "d78fb4c\/4465.43__shadow__7b3xJ4gpIqIEOBTEv-lTyfBnwcYjse4_1\/head\/\/3", "at_version": "195541'21721", "size": "18446744073709551615", "object_info": { "oid": { "oid": "", "key": "", "snapid": 0, "hash": 0, "max": 0, "pool": -1, "namespace": ""}, "category": "", "version": "0'0", "prior_version": "0'0", "last_reqid": "unknown.0.0:0", "size": 0, "mtime": "0.000000", "lost": 0, "wrlock_by": "unknown.0.0:0", "snaps": [], "truncate_seq": 0, "truncate_size": 0, "watchers": {}}, "snapset": { "snap_context": { "seq": 0, "snaps": []}, "head_exists": 0, "clones": []}, "copy_subset": "[0~18446744073709551615]", "clone_subset": "{}"}}]}],
        "pushing": []},
      "scrub": { "scrubber.epoch_start": "0", "scrubber.active": 0, "scrubber.block_writes": 0, "scrubber.finalizing": 0, "scrubber.waiting_on": 0, "scrubber.waiting_on_whom": []}},
    { "name": "Started", "enter_time": "2013-07-19 00:46:18.133301"}]}
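Buried in that blob, the interesting part is recovery_state -> recovery_progress -> pull_from_peer: the primary is stuck with pulls in flight from osd.47. A small sketch of how to surface just that from the query output, using python3 over a trimmed sample with the same shape as the JSON above (jq would work just as well):

```shell
# Summarize in-flight pulls from a `ceph pg <pgid> query` blob.
# /tmp/pgquery.json holds a trimmed sample shaped like the output above.
cat > /tmp/pgquery.json <<'EOF'
{ "state": "active+recovering",
  "recovery_state": [
    { "name": "Started/Primary/Active",
      "recovery_progress": {
        "pull_from_peer": [
          { "pull_from": 47,
            "pulls": [ { "recovery_info": {
              "object": "7d373b4c/4320.42_4/4c/Barcelona_MNAC_DSCF7773-100.JPG/head//3",
              "at_version": "195557'21727" } } ] } ] } } ] }
EOF
python3 - <<'PY'
import json
q = json.load(open('/tmp/pgquery.json'))
for state in q['recovery_state']:
    for peer in state.get('recovery_progress', {}).get('pull_from_peer', []):
        # e.g. "pulling from osd.47: 1 pull(s)"
        print('pulling from osd.%d: %d pull(s)' % (peer['pull_from'], len(peer['pulls'])))
PY
```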
Updated by Samuel Just over 10 years ago
- Status changed from New to Duplicate
The osd.1 log indicates that it is running 9fe7806d4f1fe67fa10906df20cbed017321effe, which is actually 2 commits behind wip-5655 (which is apparently c0e77c91b6c39998ef4e19a726db87b66850cf2c at the moment rather than 3612802bdc53fe6ecdd63c77cbba4077bf0c0402; maybe the branch got switched). Thus osd.1 at least should still be suffering from the same bug as #5655, of which this is probably a duplicate.
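Whether one sha1 is an ancestor of (i.e. behind) another is something git can answer directly; on a real ceph checkout that would be roughly "git merge-base --is-ancestor 9fe7806d4f1fe67fa10906df20cbed017321effe wip-5655". A self-contained sketch on a throwaway repo, since no ceph tree is assumed present:

```shell
# Demonstrate checking whether one commit is behind another with
# git merge-base --is-ancestor, using a scratch repo with two commits.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=qa@example.com -c user.name=qa commit -q --allow-empty -m first
old=$(git rev-parse HEAD)
git -c user.email=qa@example.com -c user.name=qa commit -q --allow-empty -m second
if git merge-base --is-ancestor "$old" HEAD; then
  echo "behind by $(git rev-list --count "$old"..HEAD) commit(s)"
fi
```

merge-base --is-ancestor exits 0 when the first commit is reachable from the second, and rev-list --count gives the exact distance.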
Updated by Faidon Liambotis over 10 years ago
Ugh. This (9fe7806d4f1fe67fa10906df20cbed017321effe) isn't actually 2 commits behind wip-5655, it's just an earlier incarnation of wip-5655 (your very first attempt). I didn't see the "unknown message type" in the other OSD logs so I assumed it had worked. I installed wip-5655 just as it appeared on gitbuilder, not accounting for its backlog, and foolishly didn't check the sha1. Sorry about that...
I installed c0e77c91b6c39998ef4e19a726db87b66850cf2c and confirmed that it did the trick now. Thanks!
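For what it's worth, this kind of mixup is easy to catch by comparing the sha1 the binary reports against the intended commit. A sketch over a sample version banner (the banner string here is illustrative; the real check would feed it the output of ceph-osd --version):

```shell
# Extract the 40-hex-char sha1 from a ceph version banner and compare it
# to the commit we meant to install. Sample banner in the usual
# "ceph version X (sha1)" shape; substitute real ceph-osd --version output.
banner="ceph version 0.66 (9fe7806d4f1fe67fa10906df20cbed017321effe)"
wanted=3612802bdc53fe6ecdd63c77cbba4077bf0c0402
got=$(printf '%s\n' "$banner" | grep -oE '[0-9a-f]{40}')
if [ "$got" != "$wanted" ]; then
  echo "running $got, wanted $wanted"
fi
```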