Bug #5655


Slow requests for 1h30 "currently waiting for missing objects"

Added by Faidon Liambotis almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This happened today after I upgraded & restarted osd.0 to next (d1501938f5d07c067d908501fc5cfe3c857d7281). I remember seeing something similar from back when I upgraded from cuttlefish to 0.65/0.66, so I don't think it's a post-0.66 bug.

2013-07-17 20:59:58.347025 mon.0 [INF] pgmap v10102734: 16760 pgs: 16737 active+clean, 11 active+recovery_wait, 11 active+clean+scrubbing+deep, 1 active+recovering; 46788 GB data, 144 TB used, 117 TB / 261 TB avail; 3428B/s rd, 5op/s; 13/860603483 degraded (0.000%)
2013-07-17 20:59:54.004236 osd.0 [WRN] 6 slow requests, 6 included below; oldest blocked for > 4969.895277 secs
2013-07-17 20:59:54.004254 osd.0 [WRN] slow request 4969.895277 seconds old, received at 2013-07-17 19:37:04.108843: osd_op(client.30972.0:520820 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195366) v4 currently waiting for missing object
2013-07-17 20:59:54.004262 osd.0 [WRN] slow request 4925.822950 seconds old, received at 2013-07-17 19:37:48.181170: osd_op(client.30969.0:525359 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195366) v4 currently waiting for missing object
2013-07-17 20:59:54.004268 osd.0 [WRN] slow request 4630.721023 seconds old, received at 2013-07-17 19:42:43.283097: osd_op(client.30978.0:522188 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195372) v4 currently waiting for missing object
2013-07-17 20:59:54.004274 osd.0 [WRN] slow request 4630.146189 seconds old, received at 2013-07-17 19:42:43.857931: osd_op(client.30982.0:2825112 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195372) v4 currently waiting for missing object
2013-07-17 20:59:54.004291 osd.0 [WRN] slow request 4586.352550 seconds old, received at 2013-07-17 19:43:27.651570: osd_op(client.30978.0:522369 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195372) v4 currently waiting for missing object
2013-07-17 20:59:55.004567 osd.0 [WRN] 6 slow requests, 6 included below; oldest blocked for > 4970.895611 secs
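
A minimal sketch (based only on the log format quoted above, so the pattern is an assumption) of parsing these slow-request warnings to list how long each op has been blocked and why, which is handy for checking whether the oldest age keeps growing:

import re
import sys

# Matches the cluster-log slow-request lines quoted above; derived from this
# excerpt only and may not cover other log formats.
SLOW_RE = re.compile(
    r"slow request (?P<age>[\d.]+) seconds old, "
    r"received at (?P<received>[\d-]+ [\d:.]+): "
    r"(?P<op>.*?) currently (?P<reason>.+)"
)

def parse_slow_requests(lines):
    """Yield (age_seconds, received_at, op, reason) for each matching line."""
    for line in lines:
        m = SLOW_RE.search(line)
        if m:
            yield float(m.group("age")), m.group("received"), m.group("op"), m.group("reason")

if __name__ == "__main__":
    for age, received, op, reason in sorted(parse_slow_requests(sys.stdin), reverse=True):
        print("%10.1fs  since %s  %s: %s" % (age, received, reason, op))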

Logs are flooding with:

2013-07-17 21:00:48.724203 7ff8ad8ff700  0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46340 s=2 pgs=3324487 cs=6523629 l=0 c=0xe913c60).fault, initiating reconnect
2013-07-17 21:00:48.725846 7ff8e7279700  0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46346 s=2 pgs=3324488 cs=6523631 l=0 c=0xe913c60).fault, initiating reconnect
2013-07-17 21:00:48.728097 7ff8ad8ff700  0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46353 s=2 pgs=3324489 cs=6523633 l=0 c=0xe913c60).fault, initiating reconnect
2013-07-17 21:00:48.729876 7ff8ad8ff700  0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46360 s=2 pgs=3324490 cs=6523635 l=0 c=0xe913c60).fault, initiating reconnect
2013-07-17 21:00:48.731618 7ff8ad8ff700  0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46367 s=2 pgs=3324491 cs=6523637 l=0 c=0xe913c60).fault, initiating reconnect
2013-07-17 21:00:48.731662 7ff8e7279700  0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46367 s=1 pgs=3324491 cs=6523638 l=0 c=0xe913c60).fault
2013-07-17 21:00:48.733189 7ff8ad8ff700  0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46373 s=2 pgs=3324492 cs=6523639 l=0 c=0xe913c60).fault, initiating reconnect
2013-07-17 21:00:48.734658 7ff8e7279700  0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46380 s=2 pgs=3324493 cs=6523641 l=0 c=0xe913c60).fault, initiating reconnect

10.64.0.175:6861/6617 is osd.25, an OSD in the set of pg 3.3b97, which is the only pg in a recovering state:

3.3b97    17545    1    1    0    3145618437    3005    3005    active+recovering    2013-07-17 18:58:19.629601    195334'200991    195519'126485    [0,25,120]    [0,25,120]    195324'200265    2013-07-13 15:49:31.189074    195324'200265    2013-07-13 15:49:31.189074

{ "state": "active+recovering",
  "epoch": 195519,
  "up": [
        0,
        25,
        120],
  "acting": [
        0,
        25,
        120],
  "info": { "pgid": "3.3b97",
      "last_update": "195334'200991",
      "last_complete": "195334'200990",
      "log_tail": "190854'197986",
      "last_backfill": "MAX",
      "purged_snaps": "[]",
      "history": { "epoch_created": 14,
          "last_epoch_started": 195341,
          "last_epoch_clean": 195334,
          "last_epoch_split": 0,
          "same_up_since": 195338,
          "same_interval_since": 195338,
          "same_primary_since": 195338,
          "last_scrub": "195324'200265",
          "last_scrub_stamp": "2013-07-13 15:49:31.189074",
          "last_deep_scrub": "195324'200265",
          "last_deep_scrub_stamp": "2013-07-13 15:49:31.189074",
          "last_clean_scrub_stamp": "2013-07-13 15:49:31.189074"},
      "stats": { "version": "195334'200991",
          "reported_seq": "126485",
          "reported_epoch": "195519",
          "state": "active+recovering",
          "last_fresh": "2013-07-17 20:38:08.334414",
          "last_change": "2013-07-17 18:58:19.629601",
          "last_active": "2013-07-17 20:38:08.334414",
          "last_clean": "2013-07-17 17:42:45.918108",
          "last_became_active": "2013-06-06 01:28:36.818839",
          "last_unstale": "2013-07-17 20:38:08.334414",
          "mapping_epoch": 195333,
          "log_start": "190854'197986",
          "ondisk_log_start": "190854'197986",
          "created": 14,
          "last_epoch_clean": 195334,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "195324'200265",
          "last_scrub_stamp": "2013-07-13 15:49:31.189074",
          "last_deep_scrub": "195324'200265",
          "last_deep_scrub_stamp": "2013-07-13 15:49:31.189074",
          "last_clean_scrub_stamp": "2013-07-13 15:49:31.189074",
          "log_size": 3005,
          "ondisk_log_size": 3005,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 3145618437,
              "num_objects": 17545,
              "num_object_clones": 0,
              "num_object_copies": 0,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 0,
              "num_objects_unfound": 0,
              "num_read": 10592,
              "num_read_kb": 115768,
              "num_write": 115122,
              "num_write_kb": 3244088,
              "num_scrub_errors": 0,
              "num_shallow_scrub_errors": 0,
              "num_deep_scrub_errors": 0,
              "num_objects_recovered": 16,
              "num_bytes_recovered": 23155835,
              "num_keys_recovered": 75586},
          "stat_cat_sum": {},
          "up": [
                0,
                25,
                120],
          "acting": [
                0,
                25,
                120]},
      "empty": 0,
      "dne": 0,
      "incomplete": 0,
      "last_epoch_started": 195341},
  "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2013-07-17 18:55:57.345926",
          "might_have_unfound": [
                { "osd": 25,
                  "status": "already probed"},
                { "osd": 120,
                  "status": "already probed"}],
          "recovery_progress": { "backfill_target": -1,
              "waiting_on_backfill": 0,
              "backfill_pos": "0\/\/0\/\/-1",
              "backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "backfills_in_flight": [],
              "pull_from_peer": [
                    { "pull_from": 25,
                      "pulls": [
                            { "recovery_progress": { "first?": 1,
                                  "data_complete?": 0,
                                  "data_recovered_to": 0,
                                  "omap_complete?": 0,
                                  "omap_recovered_to": ""},
                              "recovery_info": { "object": "caf23b97\/.dir.4465.50\/head\/\/3",
                                  "at_version": "195334'200991",
                                  "size": "18446744073709551615",
                                  "object_info": { "oid": { "oid": "",
                                          "key": "",
                                          "snapid": 0,
                                          "hash": 0,
                                          "max": 0,
                                          "pool": -1,
                                          "namespace": ""},
                                      "category": "",
                                      "version": "0'0",
                                      "prior_version": "0'0",
                                      "last_reqid": "unknown.0.0:0",
                                      "size": 0,
                                      "mtime": "0.000000",
                                      "lost": 0,
                                      "wrlock_by": "unknown.0.0:0",
                                      "snaps": [],
                                      "truncate_seq": 0,
                                      "truncate_size": 0,
                                      "watchers": {}},
                                  "snapset": { "snap_context": { "seq": 0,
                                          "snaps": []},
                                      "head_exists": 0,
                                      "clones": []},
                                  "copy_subset": "[0~18446744073709551615]",
                                  "clone_subset": "{}"}}]}],
              "pushing": []},
          "scrub": { "scrubber.epoch_start": "0",
              "scrubber.active": 0,
              "scrubber.block_writes": 0,
              "scrubber.finalizing": 0,
              "scrubber.waiting_on": 0,
              "scrubber.waiting_on_whom": []}},
        { "name": "Started",
          "enter_time": "2013-07-17 18:53:41.140936"}]}

I got debug-ms/osd/filestore 20 logs by telling the OSD, but I'm not sure if they'd be of help. Let me know what other debugging information I can get now, or the next time this happens.
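
For context, a minimal sketch of raising debug-ms/osd/filestore at runtime the way described above, assuming the "ceph tell osd.N injectargs" interface (illustrative only, not the exact commands used here):

import subprocess

def raise_osd_debug(osd_id, ms=20, osd=20, filestore=20):
    """Bump messenger/OSD/filestore debug levels on a running OSD via injectargs."""
    args = "--debug-ms %d --debug-osd %d --debug-filestore %d" % (ms, osd, filestore)
    subprocess.check_call(["ceph", "tell", "osd.%d" % osd_id, "injectargs", args])

# Example: raise_osd_debug(0) before reproducing, then inject lower values
# again once the logs have been captured.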


Related issues (2: 0 open, 2 closed)

Related to Ceph - Bug #5517: osd: stuck peering on cuttlefish (Resolved, Sage Weil, 07/08/2013)

Related to Ceph - Bug #5671: pgs stuck in recovering, not progressing at all (Duplicate, 07/18/2013)

#1

Updated by Faidon Liambotis almost 11 years ago

Cephdrop now has, under /5655/:
1) osd.0 logs with debug-ms/osd/filestore 20 (set via tell) from earlier on
2) ceph --asok $asok log dump output for osd.0 and osd.25

Also worth noting: osd.25's log is full of messages like this:

2013-07-17 21:18:22.292996 7f6f902ef700  0 can't decode unknown message type 106 MSG_AUTH=17
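
A quick sketch for counting these "can't decode unknown message type" lines per OSD log to see which daemons are affected; the /var/log/ceph/ceph-osd.*.log path is an assumption, not something from this ticket:

import glob

PATTERN = "can't decode unknown message type"

counts = {}
for path in glob.glob("/var/log/ceph/ceph-osd.*.log"):
    with open(path, errors="replace") as logfile:
        counts[path] = sum(PATTERN in line for line in logfile)

for path, hits in sorted(counts.items(), key=lambda item: -item[1]):
    if hits:
        print("%10d  %s" % (hits, path))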

#2

Updated by Samuel Just almost 11 years ago

  • Priority changed from Normal to Urgent
#3

Updated by Samuel Just almost 11 years ago

I suspect a dropped message; this may be related to the stuck peering bug. Unfortunately, the relevant messages don't show up in the log dumps.

#4

Updated by Faidon Liambotis almost 11 years ago

This is now:

2013-07-18 05:03:35.299938 osd.0 [WRN] slow request 33607.648206 seconds old, received at 2013-07-17 19:43:27.651570: osd_op(client.30978.0:522369 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195372) v4 currently waiting for missing object

So I'm going to kill osd.0 and try to recover.

#5

Updated by Faidon Liambotis almost 11 years ago

I've had /var/log fill up on multiple hosts across the cluster, from random OSDs that peer with osd.0 & osd.1 spewing "can't decode unknown message type" (tens of gigabytes of logs). All of the OSDs except osd.0 & osd.1 run 0.66.

I've now upgraded to 07dfb6f4afac6b2156ad9d9e3489e3c601661c35 and restarted osd.0/osd.1 (as part of the peering bug troubleshooting), and I still get those messages. Judging from the 10 hours that have passed, it's not really recovering either, so I have now stopped those two completely.

#6

Updated by Sage Weil almost 11 years ago

Sounds like they are sending the new message types due to incorrect features... maybe because of the msgr stuff we just fixed? That series just went into next in ea1c623406ac0761c6d63041dcc5ca849d1fa932.

#7

Updated by Faidon Liambotis almost 11 years ago

The revision I upgraded to in the last hour, mentioned in my previous update (07dfb6f4afac6b2156ad9d9e3489e3c601661c35), contains ea1c623406ac0761c6d63041dcc5ca849d1fa932, and I'm still seeing the effects. Is this too late? How can I recover those two OSDs now?

#8

Updated by Faidon Liambotis almost 11 years ago

I also restarted osd.3 & osd.4, which were upgraded directly from 0.66 to 07dfb6f4afac6b2156ad9d9e3489e3c601661c35, and I got unknown message types on their peers, so I can confirm that ea1c623406ac0761c6d63041dcc5ca849d1fa932 does not fix this issue.

#9

Updated by Samuel Just almost 11 years ago

  • Status changed from New to Resolved

This was a bug in 0.65 and in cuttlefish before 0.61.5: the feature set was getting computed as ~0, causing the new OSDs to send the wrong message types. You can avoid it by upgrading to 0.61.5 for cuttlefish, or to current next for 0.65.
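
To illustrate the failure mode described above (a toy sketch, not Ceph's actual code): if the peer feature set comes out as ~0, every feature bit looks supported, so the sender picks message encodings the old peer cannot decode.

# Toy illustration only; the feature bit below is made up.
FEATURE_NEW_AUTH_MSGS = 1 << 17  # hypothetical feature bit for new auth messages

def peer_supports(peer_features, feature_bit):
    return bool(peer_features & feature_bit)

old_peer_features = 0x0000000000000fff  # what an old peer actually supports
buggy_features    = 0xffffffffffffffff  # the "~0" mask the bug computed instead

assert not peer_supports(old_peer_features, FEATURE_NEW_AUTH_MSGS)
assert peer_supports(buggy_features, FEATURE_NEW_AUTH_MSGS)
# With the buggy mask the sender uses the new message types, and the old peer
# logs "can't decode unknown message type ..." as seen in this ticket.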

#10

Updated by Faidon Liambotis almost 11 years ago

All the other OSDs run 0.66, but you seem to imply that it should be fixed there?

#11

Updated by Greg Farnum almost 11 years ago

The bug is in v0.66 as well. You want 4255b5c2fb54ae40c53284b3ab700fdfc7e61748 on all of them (the problem here is that the old OSDs are claiming to have more capabilities than they actually do).

