Bug #5655 (closed)

Slow requests for 1h30 "currently waiting for missing objects"

Added by Faidon Liambotis almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Severity: 3 - minor

Description

This happened today after I upgraded & restarted osd.0 to the next branch (d1501938f5d07c067d908501fc5cfe3c857d7281). I remember seeing something similar back when I upgraded from cuttlefish to 0.65/0.66, so I don't think it's a post-0.66 bug.

2013-07-17 20:59:58.347025 mon.0 [INF] pgmap v10102734: 16760 pgs: 16737 active+clean, 11 active+recovery_wait, 11 active+clean+scrubbing+deep, 1 active+recovering; 46788 GB data, 144 TB used, 117 TB / 261 TB avail; 3428B/s rd, 5op/s; 13/860603483 degraded (0.000%)
2013-07-17 20:59:54.004236 osd.0 [WRN] 6 slow requests, 6 included below; oldest blocked for > 4969.895277 secs
2013-07-17 20:59:54.004254 osd.0 [WRN] slow request 4969.895277 seconds old, received at 2013-07-17 19:37:04.108843: osd_op(client.30972.0:520820 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195366) v4 currently waiting for missing object
2013-07-17 20:59:54.004262 osd.0 [WRN] slow request 4925.822950 seconds old, received at 2013-07-17 19:37:48.181170: osd_op(client.30969.0:525359 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195366) v4 currently waiting for missing object
2013-07-17 20:59:54.004268 osd.0 [WRN] slow request 4630.721023 seconds old, received at 2013-07-17 19:42:43.283097: osd_op(client.30978.0:522188 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195372) v4 currently waiting for missing object
2013-07-17 20:59:54.004274 osd.0 [WRN] slow request 4630.146189 seconds old, received at 2013-07-17 19:42:43.857931: osd_op(client.30982.0:2825112 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195372) v4 currently waiting for missing object
2013-07-17 20:59:54.004291 osd.0 [WRN] slow request 4586.352550 seconds old, received at 2013-07-17 19:43:27.651570: osd_op(client.30978.0:522369 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195372) v4 currently waiting for missing object
2013-07-17 20:59:55.004567 osd.0 [WRN] 6 slow requests, 6 included below; oldest blocked for > 4970.895611 secs
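
For reference, ops blocked like this can also be inspected directly over the OSD admin socket (the default socket path is assumed here):

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight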

The logs are flooding with:

2013-07-17 21:00:48.724203 7ff8ad8ff700  0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46340 s=2 pgs=3324487 cs=6523629 l=0 c=0xe913c60).fault, initiating reconnect
2013-07-17 21:00:48.725846 7ff8e7279700  0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46346 s=2 pgs=3324488 cs=6523631 l=0 c=0xe913c60).fault, initiating reconnect
2013-07-17 21:00:48.728097 7ff8ad8ff700  0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46353 s=2 pgs=3324489 cs=6523633 l=0 c=0xe913c60).fault, initiating reconnect
2013-07-17 21:00:48.729876 7ff8ad8ff700  0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46360 s=2 pgs=3324490 cs=6523635 l=0 c=0xe913c60).fault, initiating reconnect
2013-07-17 21:00:48.731618 7ff8ad8ff700  0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46367 s=2 pgs=3324491 cs=6523637 l=0 c=0xe913c60).fault, initiating reconnect
2013-07-17 21:00:48.731662 7ff8e7279700  0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46367 s=1 pgs=3324491 cs=6523638 l=0 c=0xe913c60).fault
2013-07-17 21:00:48.733189 7ff8ad8ff700  0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46373 s=2 pgs=3324492 cs=6523639 l=0 c=0xe913c60).fault, initiating reconnect
2013-07-17 21:00:48.734658 7ff8e7279700  0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46380 s=2 pgs=3324493 cs=6523641 l=0 c=0xe913c60).fault, initiating reconnect
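
Reading these messenger lines, cs= (the connect sequence) climbs by two on nearly every line within a couple of milliseconds, which suggests the connection to the peer OSD is faulting and reconnecting in a tight loop. A rough way to gauge the churn (default OSD log path assumed):

grep -c 'initiating reconnect' /var/log/ceph/ceph-osd.0.log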

10.64.0.175:6861/6617 is osd.25, an OSD in the acting set of pg 3.3b97, the only pg in a recovering state:

3.3b97    17545    1    1    0    3145618437    3005    3005    active+recovering    2013-07-17 18:58:19.629601    195334'200991    195519'126485    [0,25,120]    [0,25,120]    195324'200265    2013-07-13 15:49:31.189074    195324'200265    2013-07-13 15:49:31.189074
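
The row above is presumably from ceph pg dump, and the JSON below from querying the pg, i.e. something like:

ceph pg dump | grep recovering
ceph pg 3.3b97 query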

{ "state": "active+recovering",
  "epoch": 195519,
  "up": [
        0,
        25,
        120],
  "acting": [
        0,
        25,
        120],
  "info": { "pgid": "3.3b97",
      "last_update": "195334'200991",
      "last_complete": "195334'200990",
      "log_tail": "190854'197986",
      "last_backfill": "MAX",
      "purged_snaps": "[]",
      "history": { "epoch_created": 14,
          "last_epoch_started": 195341,
          "last_epoch_clean": 195334,
          "last_epoch_split": 0,
          "same_up_since": 195338,
          "same_interval_since": 195338,
          "same_primary_since": 195338,
          "last_scrub": "195324'200265",
          "last_scrub_stamp": "2013-07-13 15:49:31.189074",
          "last_deep_scrub": "195324'200265",
          "last_deep_scrub_stamp": "2013-07-13 15:49:31.189074",
          "last_clean_scrub_stamp": "2013-07-13 15:49:31.189074"},
      "stats": { "version": "195334'200991",
          "reported_seq": "126485",
          "reported_epoch": "195519",
          "state": "active+recovering",
          "last_fresh": "2013-07-17 20:38:08.334414",
          "last_change": "2013-07-17 18:58:19.629601",
          "last_active": "2013-07-17 20:38:08.334414",
          "last_clean": "2013-07-17 17:42:45.918108",
          "last_became_active": "2013-06-06 01:28:36.818839",
          "last_unstale": "2013-07-17 20:38:08.334414",
          "mapping_epoch": 195333,
          "log_start": "190854'197986",
          "ondisk_log_start": "190854'197986",
          "created": 14,
          "last_epoch_clean": 195334,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "195324'200265",
          "last_scrub_stamp": "2013-07-13 15:49:31.189074",
          "last_deep_scrub": "195324'200265",
          "last_deep_scrub_stamp": "2013-07-13 15:49:31.189074",
          "last_clean_scrub_stamp": "2013-07-13 15:49:31.189074",
          "log_size": 3005,
          "ondisk_log_size": 3005,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 3145618437,
              "num_objects": 17545,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_unfound": 0,
              "num_read": 10592,
              "num_read_kb": 115768,
              "num_write": 115122,
              "num_write_kb": 3244088,
              "num_scrub_errors": 0,
              "num_shallow_scrub_errors": 0,
              "num_deep_scrub_errors": 0,
              "num_objects_recovered": 16,
              "num_bytes_recovered": 23155835,
              "num_keys_recovered": 75586},
          "stat_cat_sum": {},
          "up": [
                0,
                25,
                120],
          "acting": [
                0,
                25,
                120]},
      "empty": 0,
      "dne": 0,
      "incomplete": 0,
      "last_epoch_started": 195341},
  "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2013-07-17 18:55:57.345926",
          "might_have_unfound": [
                { "osd": 25,
                  "status": "already probed"},
                { "osd": 120,
                  "status": "already probed"}],
          "recovery_progress": { "backfill_target": -1,
              "waiting_on_backfill": 0,
              "backfill_pos": "0\/\/0\/\/-1",
              "backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "backfills_in_flight": [],
              "pull_from_peer": [
                    { "pull_from": 25,
                      "pulls": [
                            { "recovery_progress": { "first?": 1,
                                  "data_complete?": 0,
                                  "data_recovered_to": 0,
                                  "omap_complete?": 0,
                                  "omap_recovered_to": ""},
                              "recovery_info": { "object": "caf23b97\/.dir.4465.50\/head\/\/3",
                                  "at_version": "195334'200991",
                                  "size": "18446744073709551615",
                                  "object_info": { "oid": { "oid": "",
                                          "key": "",
                                          "snapid": 0,
                                          "hash": 0,
                                          "max": 0,
                                          "pool": -1,
                                          "namespace": ""},
                                      "category": "",
                                      "version": "0'0",
                                      "prior_version": "0'0",
                                      "last_reqid": "unknown.0.0:0",
                                      "size": 0,
                                      "mtime": "0.000000",
                                      "lost": 0,
                                      "wrlock_by": "unknown.0.0:0",
                                      "snaps": [],
                                      "truncate_seq": 0,
                                      "truncate_size": 0,
                                      "watchers": {}},
                                  "snapset": { "snap_context": { "seq": 0,
                                          "snaps": []},
                                      "head_exists": 0,
                                      "clones": []},
                                  "copy_subset": "[0~18446744073709551615]",
                                  "clone_subset": "{}"}}]}],
              "pushing": []},
          "scrub": { "scrubber.epoch_start": "0",
              "scrubber.active": 0,
              "scrubber.block_writes": 0,
              "scrubber.finalizing": 0,
              "scrubber.waiting_on": 0,
              "scrubber.waiting_on_whom": []}},
        { "name": "Started",
          "enter_time": "2013-07-17 18:53:41.140936"}]}

I got debug-ms/osd/filestore 20 logs by telling the OSD, but I'm not sure whether they'd be of help. Let me know what other debugging information I can get now, or the next time this happens.
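
The invocation to raise those debug levels was presumably something like:

ceph tell osd.0 injectargs '--debug-ms 20 --debug-osd 20 --debug-filestore 20'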


Related issues (2: 0 open, 2 closed)

Related to Ceph - Bug #5517: osd: stuck peering on cuttlefish (Resolved, Sage Weil, 07/08/2013)

Related to Ceph - Bug #5671: pgs stuck in recovering, not progressing at all (Duplicate, 07/18/2013)
