Bug #10847

closed

stuck recovering, MOSDPGPush took 25 minutes from send to receive

Added by Samuel Just about 9 years ago. Updated over 7 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2015-02-11 09:49:17.972106 7f050f0c8700 20 osd.1 pg_epoch: 41 pg[1.12( v 29'12 (0'0,29'12] local-les=34 n=9 ec=8 les/c 34/22 32/32/32) [4,1,3] r=1 lpr=32 pi=21-31/4 luod=0'0 crt=22'8 lcod 29'11 active] send_pushes: sending push PushOp(921295f2/benchmark_data_burnupi27_14622_object153/head//1, version: 22'8, data_included: [0~1048576], data_size: 1048576, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(921295f2/benchmark_data_burnupi27_14622_object153/head//1@22'8, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:1048576, data_complete:false, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false)) to osd.4
2015-02-11 09:49:17.972148 7f050f0c8700 1 -- 10.214.135.36:6805/13635 --> 10.214.137.128:6809/25506 -- MOSDPGPush(1.12 41 [PushOp(921295f2/benchmark_data_burnupi27_14622_object153/head//1, version: 22'8, data_included: [0~1048576], data_size: 1048576, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(921295f2/benchmark_data_burnupi27_14622_object153/head//1@22'8, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:1048576, data_complete:false, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 -- ?+0 0x4356000 con 0x441cdc0
2015-02-11 09:49:17.972174 7f050f0c8700 10 osd.1 41 dequeue_op 0x4779000 finish
...
2015-02-11 10:16:55.501102 7f69bed51700 1 -- 10.214.137.128:6809/25506 <== osd.1 10.214.135.36:6805/13635 1861 ==== MOSDPGPush(1.12 41 [PushOp(921295f2/benchmark_data_burnupi27_14622_object153/head//1, version: 22'8, data_included: [0~1048576], data_size: 1048576, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(921295f2/benchmark_data_burnupi27_14622_object153/head//1@22'8, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:1048576, data_complete:false, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 ==== 1049498+0+0 (4092764375 0 0) 0x6389a00 con 0x5a018c0
2015-02-11 10:16:55.501210 7f69bed51700 10 osd.4 41 handle_replica_op MOSDPGPush(1.12 41 [PushOp(921295f2/benchmark_data_burnupi27_14622_object153/head//1, version: 22'8, data_included: [0~1048576], data_size: 1048576, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectRecoveryInfo(921295f2/benchmark_data_burnupi27_14622_object153/head//1@22'8, copy_subset: [0~4194304], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:1048576, data_complete:false, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 epoch 41

ubuntu@teuthology:/a/samuelj-2015-02-10_21:50:39-rados-wip-sam-testing-wip-testing-vanilla-fixes-basic-multi/749893/remote

This was on wip-sam-testing, but I don't think the branch is related.
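The size of the gap can be read straight off the two log lines that bracket it. This short Python sketch subtracts the send timestamp (osd.1, the "-->" line) from the receive timestamp (osd.4, the "<==" line). The two OSDs log on different hosts, so the result also includes any clock skew between them and should be treated as approximate.

```python
from datetime import datetime

# Timestamps copied from the "-->" send line on osd.1 and the "<==" receive
# line on osd.4. They come from different hosts, so clock skew is included.
FMT = "%Y-%m-%d %H:%M:%S.%f"
sent = datetime.strptime("2015-02-11 09:49:17.972148", FMT)
received = datetime.strptime("2015-02-11 10:16:55.501102", FMT)

delay = received - sent
minutes, seconds = divmod(delay.total_seconds(), 60)
print(f"send->receive delay: {int(minutes)} min {seconds:.3f} s")
```

With these raw timestamps it prints `send->receive delay: 27 min 37.529 s`.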

Actions #1

Updated by Samuel Just about 9 years ago

ea5d1b370e534520ad686d3764bbe269c08cec8a

Saved as wip-sam-testing-10847

Actions #2

Updated by Sage Weil about 9 years ago

  • Status changed from New to In Progress
  • Assignee set to Sage Weil
Actions #3

Updated by Sage Weil about 9 years ago

The recovery message is queued with a lower priority; it looks like it was starved.
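The starvation described above can be illustrated with a minimal Python sketch, assuming a strict priority queue that always serves the highest non-empty priority first and is FIFO within one priority. This is a simplified model, not Ceph's actual OSD or messenger queue; the priority values, the per-tick send model, and the `simulate` helper are all hypothetical.

```python
import heapq
from itertools import count


class StrictPriorityQueue:
    """Always dequeues the highest-priority item; FIFO within one priority.

    Hypothetical illustration only, not Ceph's queue implementation.
    """

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker that keeps FIFO order within a priority

    def enqueue(self, priority, item):
        # heapq is a min-heap, so store the negated priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), item))

    def dequeue(self):
        _, _, item = heapq.heappop(self._heap)
        return item

    def __len__(self):
        return len(self._heap)


CLIENT_PRIO = 63    # hypothetical priority for client traffic
RECOVERY_PRIO = 10  # hypothetical, lower priority for the recovery push


def simulate(ticks, client_ops_per_tick, sends_per_tick, initial_backlog=0):
    """Return the tick in which the push is sent, or None if it never is."""
    q = StrictPriorityQueue()
    q.enqueue(RECOVERY_PRIO, "MOSDPGPush")
    for i in range(initial_backlog):
        q.enqueue(CLIENT_PRIO, f"backlog-{i}")
    for tick in range(ticks):
        # New higher-priority traffic arrives every tick...
        for i in range(client_ops_per_tick):
            q.enqueue(CLIENT_PRIO, f"client-{tick}-{i}")
        # ...and at most sends_per_tick messages leave the queue.
        for _ in range(min(sends_per_tick, len(q))):
            if q.dequeue() == "MOSDPGPush":
                return tick
    return None


# Client load equals send capacity: the push never leaves the queue.
print(simulate(ticks=1000, client_ops_per_tick=5, sends_per_tick=5))  # None
# One spare slot per tick drains a 20-op backlog, then the push goes out.
print(simulate(ticks=1000, client_ops_per_tick=4, sends_per_tick=5, initial_backlog=20))  # 20
```

Under strict priority, the low-priority push only moves when higher-priority traffic leaves spare capacity; if that traffic keeps pace with the send rate, the push waits indefinitely. Queues that reserve a share of capacity for lower priorities (for example, weighted round robin) avoid this kind of unbounded wait.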

Actions #4

Updated by Sage Weil about 9 years ago

  • Status changed from In Progress to 12
  • Assignee deleted (Sage Weil)
  • Source changed from other to Q/A
Actions #5

Updated by Samuel Just almost 9 years ago

  • Priority changed from Urgent to High
  • Regression set to No
Actions #6

Updated by Samuel Just over 7 years ago

  • Status changed from 12 to Can't reproduce