Project

General

Profile

Actions

Bug #5671

closed

pgs stuck in recovering, not progressing at all

Added by Faidon Liambotis over 10 years ago. Updated over 10 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

All OSDs run 0.66; 4 of them were upgraded to ea1c623406ac0761c6d63041dcc5ca849d1fa932, then I hit #5655. After upgrading those 4 to 3612802bdc53fe6ecdd63c77cbba4077bf0c0402 (wip-5655), I had this situation:

2013-07-18 23:57:57.533795 mon.0 [INF] pgmap v10134893: 16760 pgs: 16268 active+clean, 448 active+recovery_wait, 4 active+clean+scrubbing+deep, 40 active+recovering; 46827 GB data, 144 TB used, 116 TB / 261 TB avail; 0B/s rd, 0B/s wr, 0op/s; 1389/860715494 degraded (0.000%)

with this going on for quite a while (more than half an hour) and pg queries on those 40 recovering showed no progress at all

A pg dump | grep recovery showed that in all of those 40, one of the four osds is the primary OSD.


Related issues 1 (0 open1 closed)

Related to Ceph - Bug #5655: Slow requests for 1h30 "currently waiting for missing objects"Resolved07/17/2013

Actions
Actions #1

Updated by Faidon Liambotis over 10 years ago

cephdrop has 5671/ceph-osd.1.log.gz, --debug-ms=1 --debug-osd=20 --debug-filestore=20 from osd.1, one of the offending osds.

Actions #2

Updated by Faidon Liambotis over 10 years ago

OK, it gets even weirder.

1) Stop all 4 offending osds.

2013-07-19 00:46:02.148965 mon.0 [INF] pgmap v10135983: 16760 pgs: 15341 active+clean, 1409 active+degraded, 10 active+clean+scrubbing+deep; 46827 GB data, 144 TB used, 116 TB / 261 TB avail; 0B/s rd, 0op/s; 24241697/860716265 degraded (2.816%)

2) Start osd.0

2013-07-19 00:46:34.173860 mon.0 [INF] pgmap v10135997: 16760 pgs: 15584 active+clean, 109 active+recovery_wait, 1056 active+degraded, 11 active+clean+scrubbing+deep; 46827 GB data, 144 TB used, 116 TB / 261 TB avail; 0B/s rd, 0op/s; 18113429/860716271 degraded (2.104%)

(a bunch recovery_wait, none recovering!)

3) Start osd.1

2013-07-19 00:47:30.926428 mon.0 [INF] pgmap v10136013: 16760 pgs: 15836 active+clean, 212 active+recovery_wait, 693 active+degraded, 9 active+clean+scrubbing+deep, 10 active+recovering; 46827 GB data, 144 TB used, 116 TB / 261 TB avail; 0B/s rd, 0op/s; 11939206/860716271 degraded (1.387%)

PGs in recovering state, all with osd.0 as primary:

3.3b4c    17429    4    4    0    2959102177    700    700    active+recovering    2013-07-19 00:47:04.089227    195557'21727    195610'136881    [0,125,47]    [0,125,47]    195324'21665    2013-07-13 15:08:15.303277    195324'21665    2013-07-13 15:08:15.303277
3.352f    17400    1    1    0    2983271180    852    852    active+recovering    2013-07-19 00:47:04.089676    195557'23263    195610'95029    [0,73,98]    [0,73,98]    195324'23212    2013-07-13 11:01:28.610117    195324'23212    2013-07-13 11:01:28.610117
3.339d    17352    4    4    0    3063528005    3006    3006    active+recovering    2013-07-19 00:47:04.089965    195557'1771010    195610'1462586    [0,112,143]    [0,112,143]    195324'1769678    2013-07-13 09:40:24.129726    195324'1769678    2013-07-13 09:40:24.129726
3.3086    17474    5    5    0    3140224695    2610    2610    active+recovering    2013-07-19 00:47:04.089660    195593'195688    195610'149253    [0,43,66]    [0,43,66]    195324'194950    2013-07-13 06:37:11.328844    195324'194950    2013-07-13 06:37:11.328844
3.2f22    17420    4    4    0    3095628235    3019    3019    active+recovering    2013-07-19 00:47:04.089388    195557'1718296    195610'2449674    [0,31,55]    [0,31,55]    195324'1717034    2013-07-13 06:13:11.778770    195324'1717034    2013-07-13 06:13:11.778770
3.2a80    17639    2    2    0    3173414707    688    688    active+recovering    2013-07-19 00:47:04.089859    195557'21730    195610'87013    [0,134,114]    [0,134,114]    195324'21656    2013-07-13 05:20:09.511508    195324'21656    2013-07-13 05:20:09.511508
3.2157    17524    5    5    0    3095001629    1106    1106    active+recovering    2013-07-19 00:47:04.089572    195557'29561    195610'138750    [0,125,60]    [0,125,60]    195324'29291    2013-07-13 02:54:05.376660    195324'29291    2013-07-13 02:54:05.376660
3.923    17496    3    3    0    3137158093    826    826    active+recovering    2013-07-19 00:47:04.089508    195557'21602    195610'95853    [0,131,56]    [0,131,56]    195324'21593    2013-07-17 23:39:57.077571    195324'21593    2013-07-17 23:39:57.077571
3.795    17621    2    2    0    3102108121    791    791    active+recovering    2013-07-19 00:47:04.089711    195557'21576    195610'77691    [0,47,59]    [0,47,59]    195324'21566    2013-07-17 02:23:36.546292    195324'21566    2013-07-17 02:23:36.546292
3.184    17209    1    1    0    3020602998    3005    3005    active+recovering    2013-07-19 00:47:04.089473    195557'1721230    195610'1615620    [0,137,74]    [0,137,74]    195324'1719921    2013-07-15 11:43:49.727598    195324'1719921    2013-07-15 11:43:49.727598

pg query for the first one:

# ceph pg 3.3b4c query
{ "state": "active+recovering",
  "epoch": 195610,
  "up": [
        0,
        125,
        47],
  "acting": [
        0,
        125,
        47],
  "info": { "pgid": "3.3b4c",
      "last_update": "195557'21727",
      "last_complete": "195541'21720",
      "log_tail": "188654'21027",
      "last_backfill": "MAX",
      "purged_snaps": "[]",
      "history": { "epoch_created": 14,
          "last_epoch_started": 195608,
          "last_epoch_clean": 195604,
          "last_epoch_split": 0,
          "same_up_since": 195607,
          "same_interval_since": 195607,
          "same_primary_since": 195607,
          "last_scrub": "195324'21665",
          "last_scrub_stamp": "2013-07-13 15:08:15.303277",
          "last_deep_scrub": "195324'21665",
          "last_deep_scrub_stamp": "2013-07-13 15:08:15.303277",
          "last_clean_scrub_stamp": "2013-07-13 15:08:15.303277"},
      "stats": { "version": "195557'21727",
          "reported_seq": "136881",
          "reported_epoch": "195610",
          "state": "active+recovering",
          "last_fresh": "2013-07-19 00:47:05.687414",
          "last_change": "2013-07-19 00:47:04.089227",
          "last_active": "2013-07-19 00:47:05.687414",
          "last_clean": "2013-07-17 20:15:36.502113",
          "last_became_active": "2013-06-06 01:28:32.755032",
          "last_unstale": "2013-07-19 00:47:05.687414",
          "mapping_epoch": 195603,
          "log_start": "188654'21027",
          "ondisk_log_start": "188654'21027",
          "created": 14,
          "last_epoch_clean": 195604,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "195324'21665",
          "last_scrub_stamp": "2013-07-13 15:08:15.303277",
          "last_deep_scrub": "195324'21665",
          "last_deep_scrub_stamp": "2013-07-13 15:08:15.303277",
          "last_clean_scrub_stamp": "2013-07-13 15:08:15.303277",
          "log_size": 700,
          "ondisk_log_size": 700,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 2959102177,
              "num_objects": 17429,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_unfound": 0,
              "num_read": 1053,
              "num_read_kb": 6587,
              "num_write": 105396,
              "num_write_kb": 3048112,
              "num_scrub_errors": 0,
              "num_shallow_scrub_errors": 0,
              "num_deep_scrub_errors": 0,
              "num_objects_recovered": 50555,
              "num_bytes_recovered": 8517898807,
              "num_keys_recovered": 790},
          "stat_cat_sum": {},
          "up": [
                0,
                125,
                47],
          "acting": [
                0,
                125,
                47]},
      "empty": 0,
      "dne": 0,
      "incomplete": 0,
      "last_epoch_started": 195608},
  "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2013-07-19 00:46:19.830713",
          "might_have_unfound": [
                { "osd": 47,
                  "status": "already probed"},
                { "osd": 125,
                  "status": "already probed"}],
          "recovery_progress": { "backfill_target": -1,
              "waiting_on_backfill": 0,
              "backfill_pos": "0\/\/0\/\/-1",
              "backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "backfills_in_flight": [],
              "pull_from_peer": [
                    { "pull_from": 47,
                      "pulls": [
                            { "recovery_progress": { "first?": 1,
                                  "data_complete?": 0,
                                  "data_recovered_to": 0,
                                  "omap_complete?": 0,
                                  "omap_recovered_to": ""},
                              "recovery_info": { "object": "7d373b4c\/4320.42_4\/4c\/Barcelona_MNAC_DSCF7773-100.JPG\/head\/\/3",
                                  "at_version": "195557'21727",
                                  "size": "18446744073709551615",
                                  "object_info": { "oid": { "oid": "",
                                          "key": "",
                                          "snapid": 0,
                                          "hash": 0,
                                          "max": 0,
                                          "pool": -1,
                                          "namespace": ""},
                                      "category": "",
                                      "version": "0'0",
                                      "prior_version": "0'0",
                                      "last_reqid": "unknown.0.0:0",
                                      "size": 0,
                                      "mtime": "0.000000",
                                      "lost": 0,
                                      "wrlock_by": "unknown.0.0:0",
                                      "snaps": [],
                                      "truncate_seq": 0,
                                      "truncate_size": 0,
                                      "watchers": {}},
                                  "snapset": { "snap_context": { "seq": 0,
                                          "snaps": []},
                                      "head_exists": 0,
                                      "clones": []},
                                  "copy_subset": "[0~18446744073709551615]",
                                  "clone_subset": "{}"}},
                            { "recovery_progress": { "first?": 1,
                                  "data_complete?": 0,
                                  "data_recovered_to": 0,
                                  "omap_complete?": 0,
                                  "omap_recovered_to": ""},
                              "recovery_info": { "object": "a6027b4c\/4465.31_5\/53\/Saint-Vaast-lès-Mello_(60),_église_St-Vaast,_nef,_2e_grande_arcade_du_nord_1.jpg\/head\/\/3",
                                  "at_version": "195557'21722",
                                  "size": "18446744073709551615",
                                  "object_info": { "oid": { "oid": "",
                                          "key": "",
                                          "snapid": 0,
                                          "hash": 0,
                                          "max": 0,
                                          "pool": -1,
                                          "namespace": ""},
                                      "category": "",
                                      "version": "0'0",
                                      "prior_version": "0'0",
                                      "last_reqid": "unknown.0.0:0",
                                      "size": 0,
                                      "mtime": "0.000000",
                                      "lost": 0,
                                      "wrlock_by": "unknown.0.0:0",
                                      "snaps": [],
                                      "truncate_seq": 0,
                                      "truncate_size": 0,
                                      "watchers": {}},
                                  "snapset": { "snap_context": { "seq": 0,
                                          "snaps": []},
                                      "head_exists": 0,
                                      "clones": []},
                                  "copy_subset": "[0~18446744073709551615]",
                                  "clone_subset": "{}"}},
                            { "recovery_progress": { "first?": 1,
                                  "data_complete?": 0,
                                  "data_recovered_to": 0,
                                  "omap_complete?": 0,
                                  "omap_recovered_to": ""},
                              "recovery_info": { "object": "6f257b4c\/4465.62__shadow__Hy8nYXY_pRRUl-det5IWs2Lpss-iRss_2\/head\/\/3",
                                  "at_version": "195557'21725",
                                  "size": "18446744073709551615",
                                  "object_info": { "oid": { "oid": "",
                                          "key": "",
                                          "snapid": 0,
                                          "hash": 0,
                                          "max": 0,
                                          "pool": -1,
                                          "namespace": ""},
                                      "category": "",
                                      "version": "0'0",
                                      "prior_version": "0'0",
                                      "last_reqid": "unknown.0.0:0",
                                      "size": 0,
                                      "mtime": "0.000000",
                                      "lost": 0,
                                      "wrlock_by": "unknown.0.0:0",
                                      "snaps": [],
                                      "truncate_seq": 0,
                                      "truncate_size": 0,
                                      "watchers": {}},
                                  "snapset": { "snap_context": { "seq": 0,
                                          "snaps": []},
                                      "head_exists": 0,
                                      "clones": []},
                                  "copy_subset": "[0~18446744073709551615]",
                                  "clone_subset": "{}"}},
                            { "recovery_progress": { "first?": 1,
                                  "data_complete?": 0,
                                  "data_recovered_to": 0,
                                  "omap_complete?": 0,
                                  "omap_recovered_to": ""},
                              "recovery_info": { "object": "d78fb4c\/4465.43__shadow__7b3xJ4gpIqIEOBTEv-lTyfBnwcYjse4_1\/head\/\/3",
                                  "at_version": "195541'21721",
                                  "size": "18446744073709551615",
                                  "object_info": { "oid": { "oid": "",
                                          "key": "",
                                          "snapid": 0,
                                          "hash": 0,
                                          "max": 0,
                                          "pool": -1,
                                          "namespace": ""},
                                      "category": "",
                                      "version": "0'0",
                                      "prior_version": "0'0",
                                      "last_reqid": "unknown.0.0:0",
                                      "size": 0,
                                      "mtime": "0.000000",
                                      "lost": 0,
                                      "wrlock_by": "unknown.0.0:0",
                                      "snaps": [],
                                      "truncate_seq": 0,
                                      "truncate_size": 0,
                                      "watchers": {}},
                                  "snapset": { "snap_context": { "seq": 0,
                                          "snaps": []},
                                      "head_exists": 0,
                                      "clones": []},
                                  "copy_subset": "[0~18446744073709551615]",
                                  "clone_subset": "{}"}}]}],
              "pushing": []},
          "scrub": { "scrubber.epoch_start": "0",
              "scrubber.active": 0,
              "scrubber.block_writes": 0,
              "scrubber.finalizing": 0,
              "scrubber.waiting_on": 0,
              "scrubber.waiting_on_whom": []}},
        { "name": "Started",
          "enter_time": "2013-07-19 00:46:18.133301"}]}

Actions #3

Updated by Samuel Just over 10 years ago

  • Status changed from New to Duplicate

The osd 1 log indicates that it is running 9fe7806d4f1fe67fa10906df20cbed017321effe which is actually 2 commits behind wip-5655 (which is apparently c0e77c91b6c39998ef4e19a726db87b66850cf2c at the moment rather than 3612802bdc53fe6ecdd63c77cbba4077bf0c0402, maybe the branch got switched). Thus, osd 1 at least should still be suffering from the same bug as 5655, which this probably is.

Actions #4

Updated by Faidon Liambotis over 10 years ago

Ugh. This (9fe7806d4f1fe67fa10906df20cbed017321effe) isn't actually 2 commits behind wip-5655, it's just an earlier incarnation of wip-5655 (your very first attempt). I didn't see the "unknown message type" in the other OSD logs so I assumed it had worked. I installed wip-5655 just as it appeared on gitbuilder, not accounting for its backlog, and foolishly didn't check the sha1. Sorry about that...

I installed c0e77c91b6c39998ef4e19a726db87b66850cf2c and confirmed that it did the trick now. Thanks!

Actions

Also available in: Atom PDF