Bug #5655 (closed): Slow requests for 1h30 "currently waiting for missing objects"
Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Severity: 3 - minor
Description
This happened today after I upgraded osd.0 to next (d1501938f5d07c067d908501fc5cfe3c857d7281) and restarted it. I remember seeing something similar back when I upgraded from cuttlefish to 0.65/0.66, so I don't think it's a post-0.66 bug.
2013-07-17 20:59:58.347025 mon.0 [INF] pgmap v10102734: 16760 pgs: 16737 active+clean, 11 active+recovery_wait, 11 active+clean+scrubbing+deep, 1 active+recovering; 46788 GB data, 144 TB used, 117 TB / 261 TB avail; 3428B/s rd, 5op/s; 13/860603483 degraded (0.000%)
2013-07-17 20:59:54.004236 osd.0 [WRN] 6 slow requests, 6 included below; oldest blocked for > 4969.895277 secs
2013-07-17 20:59:54.004254 osd.0 [WRN] slow request 4969.895277 seconds old, received at 2013-07-17 19:37:04.108843: osd_op(client.30972.0:520820 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195366) v4 currently waiting for missing object
2013-07-17 20:59:54.004262 osd.0 [WRN] slow request 4925.822950 seconds old, received at 2013-07-17 19:37:48.181170: osd_op(client.30969.0:525359 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195366) v4 currently waiting for missing object
2013-07-17 20:59:54.004268 osd.0 [WRN] slow request 4630.721023 seconds old, received at 2013-07-17 19:42:43.283097: osd_op(client.30978.0:522188 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195372) v4 currently waiting for missing object
2013-07-17 20:59:54.004274 osd.0 [WRN] slow request 4630.146189 seconds old, received at 2013-07-17 19:42:43.857931: osd_op(client.30982.0:2825112 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195372) v4 currently waiting for missing object
2013-07-17 20:59:54.004291 osd.0 [WRN] slow request 4586.352550 seconds old, received at 2013-07-17 19:43:27.651570: osd_op(client.30978.0:522369 .dir.4465.50 [call rgw.bucket_prepare_op] 3.caf23b97 e195372) v4 currently waiting for missing object
2013-07-17 20:59:55.004567 osd.0 [WRN] 6 slow requests, 6 included below; oldest blocked for > 4970.895611 secs
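For what it's worth, the blocked ops can also be dumped directly on the primary through its admin socket while this is happening (default socket path for osd.0 assumed here):

    # list in-flight/blocked ops on osd.0, including how long each has waited
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight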
The logs are flooding with:
2013-07-17 21:00:48.724203 7ff8ad8ff700 0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46340 s=2 pgs=3324487 cs=6523629 l=0 c=0xe913c60).fault, initiating reconnect 2013-07-17 21:00:48.725846 7ff8e7279700 0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46346 s=2 pgs=3324488 cs=6523631 l=0 c=0xe913c60).fault, initiating reconnect 2013-07-17 21:00:48.728097 7ff8ad8ff700 0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46353 s=2 pgs=3324489 cs=6523633 l=0 c=0xe913c60).fault, initiating reconnect 2013-07-17 21:00:48.729876 7ff8ad8ff700 0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46360 s=2 pgs=3324490 cs=6523635 l=0 c=0xe913c60).fault, initiating reconnect 2013-07-17 21:00:48.731618 7ff8ad8ff700 0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46367 s=2 pgs=3324491 cs=6523637 l=0 c=0xe913c60).fault, initiating reconnect 2013-07-17 21:00:48.731662 7ff8e7279700 0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46367 s=1 pgs=3324491 cs=6523638 l=0 c=0xe913c60).fault 2013-07-17 21:00:48.733189 7ff8ad8ff700 0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46373 s=2 pgs=3324492 cs=6523639 l=0 c=0xe913c60).fault, initiating reconnect 2013-07-17 21:00:48.734658 7ff8e7279700 0 -- 10.64.0.173:6821/4292 >> 10.64.0.175:6861/6617 pipe(0xf386780 sd=30 :46380 s=2 pgs=3324493 cs=6523641 l=0 c=0xe913c60).fault, initiating reconnect
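The reconnect loop is spinning several times per second (note the cs= connect-seq counter climbing). A rough way to count attempts per second from the log (default log path assumed):

    # group 'initiating reconnect' lines by second and count them
    grep 'initiating reconnect' /var/log/ceph/ceph-osd.0.log \
      | awk '{ print $1, substr($2, 1, 8) }' | uniq -c | tail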
10.64.0.175:6861/6617 is osd.25, an OSD in the acting set of pg 3.3b97, the only pg in a recovering state:
3.3b97 17545 1 1 0 3145618437 3005 3005 active+recovering 2013-07-17 18:58:19.629601 195334'200991 195519'126485 [0,25,120] [0,25,120] 195324'200265 2013-07-13 15:49:31.189074 195324'200265 2013-07-13 15:49:31.189074
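For reference, roughly how the two lookups above were done (standard commands; exact output columns may differ on this version):

    # confirm which OSD owns the noisy peer address
    ceph osd dump | grep '10.64.0.175:6861'
    # find any pgs still stuck in recovery
    ceph pg dump | grep 'active+recovering'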
{ "state": "active+recovering", "epoch": 195519, "up": [ 0, 25, 120], "acting": [ 0, 25, 120], "info": { "pgid": "3.3b97", "last_update": "195334'200991", "last_complete": "195334'200990", "log_tail": "190854'197986", "last_backfill": "MAX", "purged_snaps": "[]", "history": { "epoch_created": 14, "last_epoch_started": 195341, "last_epoch_clean": 195334, "last_epoch_split": 0, "same_up_since": 195338, "same_interval_since": 195338, "same_primary_since": 195338, "last_scrub": "195324'200265", "last_scrub_stamp": "2013-07-13 15:49:31.189074", "last_deep_scrub": "195324'200265", "last_deep_scrub_stamp": "2013-07-13 15:49:31.189074", "last_clean_scrub_stamp": "2013-07-13 15:49:31.189074"}, "stats": { "version": "195334'200991", "reported_seq": "126485", "reported_epoch": "195519", "state": "active+recovering", "last_fresh": "2013-07-17 20:38:08.334414", "last_change": "2013-07-17 18:58:19.629601", "last_active": "2013-07-17 20:38:08.334414", "last_clean": "2013-07-17 17:42:45.918108", "last_became_active": "2013-06-06 01:28:36.818839", "last_unstale": "2013-07-17 20:38:08.334414", "mapping_epoch": 195333, "log_start": "190854'197986", "ondisk_log_start": "190854'197986", "created": 14, "last_epoch_clean": 195334, "parent": "0.0", "parent_split_bits": 0, "last_scrub": "195324'200265", "last_scrub_stamp": "2013-07-13 15:49:31.189074", "last_deep_scrub": "195324'200265", "last_deep_scrub_stamp": "2013-07-13 15:49:31.189074", "last_clean_scrub_stamp": "2013-07-13 15:49:31.189074", "log_size": 3005, "ondisk_log_size": 3005, "stats_invalid": "0", "stat_sum": { "num_bytes": 3145618437, "num_objects": 17545, "num_object_clones": 0, "num_object_copies": 0, "num_objects_missing_on_primary": 0, "num_objects_degraded": 0, "num_objects_unfound": 0, "num_read": 10592, "num_read_kb": 115768, "num_write": 115122, "num_write_kb": 3244088, "num_scrub_errors": 0, "num_shallow_scrub_errors": 0, "num_deep_scrub_errors": 0, "num_objects_recovered": 16, "num_bytes_recovered": 23155835, "num_keys_recovered": 75586}, "stat_cat_sum": {}, "up": [ 0, 25, 120], "acting": [ 0, 25, 120]}, "empty": 0, "dne": 0, "incomplete": 0, "last_epoch_started": 195341}, "recovery_state": [ { "name": "Started\/Primary\/Active", "enter_time": "2013-07-17 18:55:57.345926", "might_have_unfound": [ { "osd": 25, "status": "already probed"}, { "osd": 120, "status": "already probed"}], "recovery_progress": { "backfill_target": -1, "waiting_on_backfill": 0, "backfill_pos": "0\/\/0\/\/-1", "backfill_info": { "begin": "0\/\/0\/\/-1", "end": "0\/\/0\/\/-1", "objects": []}, "peer_backfill_info": { "begin": "0\/\/0\/\/-1", "end": "0\/\/0\/\/-1", "objects": []}, "backfills_in_flight": [], "pull_from_peer": [ { "pull_from": 25, "pulls": [ { "recovery_progress": { "first?": 1, "data_complete?": 0, "data_recovered_to": 0, "omap_complete?": 0, "omap_recovered_to": ""}, "recovery_info": { "object": "caf23b97\/.dir.4465.50\/head\/\/3", "at_version": "195334'200991", "size": "18446744073709551615", "object_info": { "oid": { "oid": "", "key": "", "snapid": 0, "hash": 0, "max": 0, "pool": -1, "namespace": ""}, "category": "", "version": "0'0", "prior_version": "0'0", "last_reqid": "unknown.0.0:0", "size": 0, "mtime": "0.000000", "lost": 0, "wrlock_by": "unknown.0.0:0", "snaps": [], "truncate_seq": 0, "truncate_size": 0, "watchers": {}}, "snapset": { "snap_context": { "seq": 0, "snaps": []}, "head_exists": 0, "clones": []}, "copy_subset": "[0~18446744073709551615]", "clone_subset": "{}"}}]}], "pushing": []}, "scrub": { "scrubber.epoch_start": "0", 
"scrubber.active": 0, "scrubber.block_writes": 0, "scrubber.finalizing": 0, "scrubber.waiting_on": 0, "scrubber.waiting_on_whom": []}}, { "name": "Started", "enter_time": "2013-07-17 18:53:41.140936"}]}
I got debug ms/osd/filestore = 20 logs by telling the OSD, but I'm not sure whether they'd be of help. Let me know what other debugging information I can get now, or the next time this happens.
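For completeness, the debug levels were injected into the running OSD roughly like this (no restart needed):

    # bump messenger/osd/filestore debugging on the live daemon
    ceph tell osd.0 injectargs '--debug-ms 20 --debug-osd 20 --debug-filestore 20'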