Bug #7999 (closed)

osd: pgs share info that hasn't been persisted

Added by Sage Weil about 10 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Firefly
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ubuntu@teuthology:/a/gregf-2014-04-04_22:05:49-rados-wip-7994-testing-basic-plana/170880


Related issues (1 closed)

Has duplicate: Ceph - Bug #8397: radosbench test reported error in cluster log (Duplicate, 2014-05-19)

Actions #1

Updated by Sage Weil about 10 years ago

2014-04-05 10:38:25.094534 7fadbfb76700  1 -- 10.214.131.23:6800/6068 <== client.4134 10.214.132.36:0/1006261 370 ==== osd_op(client.4134.0:3714 benchmark_data_plana42_6261_object3713 [write 0~4194304] 3.98cdc907 RETRY=1 ack+ondisk+retry+write e52) v4 ==== 189+0+4194304 (2940031824 0 2584664079) 0x4ff5000 con 0x2e74580
2014-04-05 10:38:25.094573 7fadbfb76700 20 osd.3 52 _dispatch 0x4ff5000 osd_op(client.4134.0:3714 benchmark_data_plana42_6261_object3713 [write 0~4194304] 3.98cdc907 RETRY=1 ack+ondisk+retry+write e52) v4
2014-04-05 10:38:25.094667 7fadbfb76700 15 osd.3 52 enqueue_op 0x5037d20 prio 63 cost 4194304 latency 0.035841 osd_op(client.4134.0:3714 benchmark_data_plana42_6261_object3713 [write 0~4194304] 3.98cdc907 RETRY=1 ack+ondisk+retry+write e52) v4
2014-04-05 10:38:25.094708 7fadb8b68700 10 osd.3 52 dequeue_op 0x5037d20 prio 63 cost 4194304 latency 0.035882 osd_op(client.4134.0:3714 benchmark_data_plana42_6261_object3713 [write 0~4194304] 3.98cdc907 RETRY=1 ack+ondisk+retry+write e52) v4 pg pg[3.7( empty local-les=0 n=0 ec=6 les/c 44/44 52/52/52) [3,4] r=0 lpr=52 pi=32-51/2 crt=0'0 mlcod 0'0 peering]
2014-04-05 10:38:27.957984 7fadb8b68700 20   ? 51'157 (0'0) modify   98cdc907/benchmark_data_plana42_6261_object3713/head//3 by client.4134.0:3714 2014-04-05 10:38:24.638724
2014-04-05 10:38:27.959474 7fadb8b68700 20 merge_log 51'157 (0'0) modify   98cdc907/benchmark_data_plana42_6261_object3713/head//3 by client.4134.0:3714 2014-04-05 10:38:24.638724
2014-04-05 10:38:29.404929 7fadb9369700 10 osd.3 56 dequeue_op 0x5037d20 prio 63 cost 4194304 latency 4.346104 osd_op(client.4134.0:3714 benchmark_data_plana42_6261_object3713 [write 0~4194304] 3.98cdc907 RETRY=1 ack+ondisk+retry+write e52) v4 pg pg[3.7( v 51'157 lc 0'0 (0'0,51'157] local-les=54 n=157 ec=6 les/c 54/44 52/52/52) [3,4] r=0 lpr=52 pi=32-51/2 crt=51'157 mlcod 0'0 active+recovery_wait m=119]
2014-04-05 10:38:29.405500 7fadb9369700 10 osd.3 pg_epoch: 56 pg[3.7( v 51'157 lc 0'0 (0'0,51'157] local-les=54 n=157 ec=6 les/c 54/44 52/52/52) [3,4] r=0 lpr=52 pi=32-51/2 crt=51'157 mlcod 0'0 active+recovery_wait m=119] do_op osd_op(client.4134.0:3714 benchmark_data_plana42_6261_object3713 [write 0~4194304] 3.98cdc907 RETRY=1 ack+ondisk+retry+write e52) v4 may_write -> write-ordered flags ack+ondisk+retry+write
2014-04-05 10:38:59.054096 7fadc94b7700  0 log [WRN] : slow request 33.995173 seconds old, received at 2014-04-05 10:38:25.058825: osd_op(client.4134.0:3714 benchmark_data_plana42_6261_object3713 [write 0~4194304] 3.98cdc907 RETRY=1 ack+ondisk+retry+write e52) v4 currently waiting for missing object
2014-04-05 10:39:25.376618 7fadc94b7700  0 log [WRN] : slow request 60.317758 seconds old, received at 2014-04-05 10:38:25.058825: osd_op(client.4134.0:3714 benchmark_data_plana42_6261_object3713 [write 0~4194304] 3.98cdc907 RETRY=1 ack+ondisk+retry+write e52) v4 currently waiting for missing object
2014-04-05 10:40:25.388955 7fadc94b7700  0 log [WRN] : slow request 120.330093 seconds old, received at 2014-04-05 10:38:25.058825: osd_op(client.4134.0:3714 benchmark_data_plana42_6261_object3713 [write 0~4194304] 3.98cdc907 RETRY=1 ack+ondisk+retry+write e52) v4 currently waiting for missing object
2014-04-05 10:42:25.411515 7fadc94b7700  0 log [WRN] : slow request 240.352655 seconds old, received at 2014-04-05 10:38:25.058825: osd_op(client.4134.0:3714 benchmark_data_plana42_6261_object3713 [write 0~4194304] 3.98cdc907 RETRY=1 ack+ondisk+retry+write e52) v4 currently waiting for missing object
2014-04-05 10:46:25.457632 7fadc94b7700  0 log [WRN] : slow request 480.398772 seconds old, received at 2014-04-05 10:38:25.058825: osd_op(client.4134.0:3714 benchmark_data_plana42_6261_object3713 [write 0~4194304] 3.98cdc907 RETRY=1 ack+ondisk+retry+write e52) v4 currently waiting for missing object
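The warn ages in the log above (~30s, 60s, 120s, 240s, 480s) double each time, consistent with a slow-request complaint timer that re-logs a request whenever its age crosses the complaint threshold times a power of two. A toy sketch of that schedule (an assumption about the logging policy, not Ceph source; the 30s threshold matches the default `osd_op_complaint_time`):

```python
# Toy model of the doubling slow-request warn schedule seen in the log:
# a request older than complaint_time is warned about at complaint_time * 2^n.

def warn_ages(complaint_time, max_age):
    """Return the ages (seconds) at which a still-pending request is re-warned."""
    ages, age = [], complaint_time
    while age <= max_age:
        ages.append(age)
        age *= 2
    return ages

print(warn_ages(30, 480))   # [30, 60, 120, 240, 480]
```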

Actions #2

Updated by Sage Weil about 10 years ago

  • Description updated (diff)
Actions #3

Updated by Sage Weil about 10 years ago

  • Subject changed from timeout during recovery to osd: pgs share info that hasn't been persisted

osd.0 starts a repop:

2014-04-05 10:38:26.529279 7f8256d35700 10 osd.0 pg_epoch: 51 pg[3.7( v 49'156 (0'0,49'156] local-les=44 n=157 ec=6 les/c 44/44 43/43/43) [0,4] r=0 lpr=43 crt=31'72 lcod 49'155 mlcod 49'155 active+clean] new_repop rep_tid 276 on osd_op(client.4134.0:3714 benchmark_data_plana42_6261_object3713 [write 0~4194304] 3.98cdc907 ack+ondisk+write e51) v4

then loses its primary role and shares its pg info (including the not-yet-persisted write) with the new primary:

2014-04-05 10:38:26.533747 7f8256534700  1 osd.0 pg_epoch: 52 pg[3.7( v 51'157 (0'0,51'157] local-les=44 n=157 ec=6 les/c 44/44 52/52/52) [3,4] r=-1 lpr=52 pi=43-51/1 crt=31'72 lcod 49'155 inactive NOTIFY] state<Start>: transitioning to Stray
...
2014-04-05 10:38:26.533995 7f8256534700  1 -- 10.214.132.36:6804/5346 --> 10.214.131.23:6801/6068 -- pg_notify(0.1c(2),0.20(1),3.2(2),3.1(6),3.7(1) epoch 52) v5 -- ?+0 0x3d781c0 con 0x33ce160

and then shortly after that restarts, without committing the write, and comes back at an older last_update:

2014-04-05 10:38:38.690321 7f29adb85780 10 osd.0 pg_epoch: 49 pg[3.7( v 49'156 (0'0,49'156] local-les=44 n=156 ec=6 les/c 44/44 43/43/43) [0,4] r=0 lpr=0 crt=31'72 lcod 0'0 mlcod 0'0 inactive] handle_loaded

meanwhile, on the new primary, we get the notify:

2014-04-05 10:38:26.535685 7fadb8b68700 10 osd.3 pg_epoch: 53 pg[3.7( empty local-les=0 n=0 ec=6 les/c 44/44 52/52/52) [3,4] r=0 lpr=52 pi=32-51/2 crt=0'0 mlcod 0'0 peering] handle_peering_event: epoch_sent: 52 epoch_requested: 52 MNotifyRec from (0,255) notify: (query_epoch:52, epoch_sent:52, info:3.7( v 51'157 (0'0,51'157] local-les=44 n=157 ec=6 les/c 44/44 52/52/52))
2014-04-05 10:38:26.535713 7fadb8b68700 10 osd.3 pg_epoch: 53 pg[3.7( empty local-les=0 n=0 ec=6 les/c 44/44 52/52/52) [3,4] r=0 lpr=52 pi=32-51/2 crt=0'0 mlcod 0'0 peering]  got osd.(0,255) 3.7( v 51'157 (0'0,51'157] local-les=44 n=157 ec=6 les/c 44/44 52/52/52)
2014-04-05 10:38:26.535731 7fadb8b68700 10 osd.3 pg_epoch: 53 pg[3.7( empty local-les=0 n=0 ec=6 les/c 44/44 52/52/52) [3,4] r=0 lpr=52 pi=32-51/2 crt=0'0 mlcod 0'0 peering]  osd.(0,255) has stray content: 3.7( v 51'157 (0'0,51'157] local-les=44 n=157 ec=6 les/c 44/44 52/52/52)

and decide that the missing object is on osd.0:

2014-04-05 10:38:28.016958 7fadb9369700 10 osd.3 pg_epoch: 54 pg[3.7( v 51'157 lc 0'0 (0'0,51'157] local-les=54 n=157 ec=6 les/c 44/44 52/52/52) [3,4] r=0 lpr=52 pi=32-51/2 crt=51'157 mlcod 0'0 inactive m=119 u=1] search_for_missing 98cdc907/benchmark_data_plana42_6261_object3713/head//3 51'157 is on osd.(0,255)

but then osd.0 restarts and we have no recovery source:

2014-04-05 10:38:39.968384 7fadb8b68700 10 osd.3 pg_epoch: 58 pg[3.7( v 51'157 lc 0'0 (0'0,51'157] local-les=54 n=157 ec=6 les/c 54/44 52/52/52) [3,4] r=0 lpr=52 pi=32-51/2 rops=2 crt=51'157 mlcod 0'0 active+recovering m=119] check_recovery_sources sources osds (0,255) now down, remaining sources are (4,255)
2014-04-05 10:38:39.968706 7fadb8b68700 10 osd.3 pg_epoch: 58 pg[3.7( v 51'157 lc 0'0 (0'0,51'157] local-les=54 n=157 ec=6 les/c 54/44 52/52/52) [3,4] r=0 lpr=52 pi=32-51/2 rops=2 crt=51'157 mlcod 0'0 active+recovering m=119 u=1] check_recovery_sources resetting pulls from osd.(0,255), osdmap has it marked down
2014-04-05 10:38:39.969015 7fadb8b68700 10 osd.3 pg_epoch: 58 pg[3.7( v 51'157 lc 0'0 (0'0,51'157] local-les=54 n=157 ec=6 les/c 54/44 52/52/52) [3,4] r=0 lpr=52 pi=32-51/2 rops=2 crt=51'157 mlcod 0'0 active+recovering m=119 u=1] finish_recovery_op 98cdc907/benchmark_data_plana42_6261_object3713/head//3
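The sequence above reduces to a simple invariant violation: an OSD advertised a last_update it had not yet committed, so after its unclean restart the new primary was left pulling from a source that no longer had the object. A toy model (hypothetical names, not Ceph code) of the race and of the fix implied by the subject line, i.e. only persisted info may be shared:

```python
# Toy model of the bug: sharing in-memory pg info before it is persisted
# creates a phantom recovery source after an unclean restart.

class ToyOSD:
    def __init__(self, name):
        self.name = name
        self.persisted = 0      # last_update actually committed to disk
        self.in_memory = 0      # last_update including in-flight writes

    def start_write(self, version):
        # write applied in memory; journal commit still pending
        self.in_memory = version

    def commit(self):
        self.persisted = self.in_memory

    def advertise(self, persisted_only):
        # pg_notify: the last_update this OSD tells the new primary it has
        return self.persisted if persisted_only else self.in_memory

    def crash_and_restart(self):
        # uncommitted in-memory state is lost on an unclean restart
        self.in_memory = self.persisted


def recovery_possible(advertised, source_after_restart):
    # the new primary pulls from the source based on the advertised version;
    # recovery is stuck if the source comes back with an older last_update
    return source_after_restart >= advertised


osd0 = ToyOSD("osd.0")
osd0.start_write(156)
osd0.commit()                    # baseline 49'156 persisted
osd0.start_write(157)            # repop for 51'157 in flight

# Buggy behavior: share in-memory info before the commit lands.
advertised = osd0.advertise(persisted_only=False)   # claims 157
osd0.crash_and_restart()                            # comes back at 156
assert not recovery_possible(advertised, osd0.in_memory)  # stuck: no source

# Fixed behavior: only persisted state is shared, so no phantom source.
advertised = osd0.advertise(persisted_only=True)
assert recovery_possible(advertised, osd0.in_memory)
```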

Actions #4

Updated by Sage Weil about 10 years ago

  • Status changed from In Progress to 12
Actions #5

Updated by Sage Weil about 10 years ago

  • Assignee deleted (Sage Weil)
Actions #6

Updated by Samuel Just almost 10 years ago

ubuntu@teuthology:/a/samuelj-2014-05-06_14:42:22-rados-wip-sam-testing-testing-basic-plana/240025/remote

Actions #7

Updated by Samuel Just almost 10 years ago

  • Assignee set to Samuel Just
Actions #8

Updated by Samuel Just almost 10 years ago

  • Status changed from 12 to In Progress
Actions #9

Updated by Samuel Just almost 10 years ago

  • Status changed from In Progress to 7
Actions #10

Updated by Samuel Just almost 10 years ago

  • Status changed from 7 to Fix Under Review
Actions #11

Updated by Samuel Just almost 10 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #12

Updated by Samuel Just almost 10 years ago

  • Backport set to Firefly
Actions #13

Updated by Samuel Just almost 10 years ago

ubuntu@teuthology:/a/teuthology-2014-06-17_02:30:05-rados-firefly-distro-basic-plana/313395

Actions #14

Updated by Sage Weil over 9 years ago

  • Status changed from Pending Backport to Resolved