Bug #9220 (closed): objecter doesn't reconnect watch on interval change w/ same primary

Added by Sage Weil over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Immediate
Assignee: Samuel Just
Category: OSD
Target version: -
% Done: 0%
Source: Q/A
Backport: firefly

Description

ubuntu@teuthology:/a/teuthology-2014-08-24_02:30:02-rados-next-testing-basic-multi/446803

2014-08-24 14:13:30.039746 7faf01235700 20 osd.3 pg_epoch: 948 pg[113.2s0( v 944'2 (0'0,944'2] local-les=946 n=1 ec=941 les/c 946/944 945/945/941) [3,4,1] r=0 lpr=945 pi=941-944/1 crt=0'0 lcod 944'1 mlcod 0'0 active+degraded] op_has_sufficient_caps pool=113 (test-rados-api-plana40-12123-5 test-rados-api-plana40-12123-6) owner=0 need_read_cap=0 need_write_cap=1 need_class_read_cap=0 need_class_write_cap=0 -> yes
2014-08-24 14:13:30.039776 7faf01235700 10 osd.3 pg_epoch: 948 pg[113.2s0( v 944'2 (0'0,944'2] local-les=946 n=1 ec=941 les/c 946/944 945/945/941) [3,4,1] r=0 lpr=945 pi=941-944/1 crt=0'0 lcod 944'1 mlcod 0'0 active+degraded] handle_message: osd_op(client.4886.0:2 test-rados-api-plana40-12123-6/foo [watch add cookie 1 ver 0] 113.e3ef388a ondisk+read+write+known_if_redirected e941) v4
2014-08-24 14:13:30.039808 7faf01235700 10 osd.3 pg_epoch: 948 pg[113.2s0( v 944'2 (0'0,944'2] local-les=946 n=1 ec=941 les/c 946/944 945/945/941) [3,4,1] r=0 lpr=945 pi=941-944/1 crt=0'0 lcod 944'1 mlcod 0'0 active+degraded] do_op osd_op(client.4886.0:2 test-rados-api-plana40-12123-6/foo [watch add cookie 1 ver 0] 113.e3ef388a ondisk+read+write+known_if_redirected e941) v4 may_write -> write-ordered flags ondisk+read+write+known_if_redirected
2014-08-24 14:13:30.039844 7faf01235700 10 osd.3 pg_epoch: 948 pg[113.2s0( v 944'2 (0'0,944'2] local-les=946 n=1 ec=941 les/c 946/944 945/945/941) [3,4,1] r=0 lpr=945 pi=941-944/1 crt=0'0 lcod 944'1 mlcod 0'0 active+degraded] readable_with_acting: locs:1(2),3(0)
2014-08-24 14:13:30.039879 7faf01235700  7 osd.3 pg_epoch: 948 pg[113.2s0( v 944'2 (0'0,944'2] local-les=946 n=1 ec=941 les/c 946/944 945/945/941) [3,4,1] r=0 lpr=945 pi=941-944/1 crt=0'0 lcod 944'1 mlcod 0'0 active+degraded] degraded e3ef388a/foo/head/test-rados-api-plana40-12123-6/113, recovering
2014-08-24 14:13:30.039906 7faf01235700 10 osd.3 pg_epoch: 948 pg[113.2s0( v 944'2 (0'0,944'2] local-les=946 n=1 ec=941 les/c 946/944 945/945/941) [3,4,1] r=0 lpr=945 pi=941-944/1 crt=0'0 lcod 944'1 mlcod 0'0 active+degraded] prep_object_replica_pushes: on e3ef388a/foo/head/test-rados-api-plana40-12123-6/113
2014-08-24 14:13:30.039933 7faf01235700 15 filestore(/var/lib/ceph/osd/ceph-3) getattr 113.2s0_head/e3ef388a/foo/head/test-rados-api-plana40-12123-6/113/18446744073709551615/0 '_'
2014-08-24 14:13:30.040033 7faf01235700 10 filestore(/var/lib/ceph/osd/ceph-3) getattr 113.2s0_head/e3ef388a/foo/head/test-rados-api-plana40-12123-6/113/18446744073709551615/0 '_' = 624
2014-08-24 14:13:30.040064 7faf01235700 15 filestore(/var/lib/ceph/osd/ceph-3) getattr 113.2s0_head/e3ef388a/foo/head/test-rados-api-plana40-12123-6/113/18446744073709551615/0 'snapset'
2014-08-24 14:13:30.040084 7faf01235700 10 filestore(/var/lib/ceph/osd/ceph-3) getattr 113.2s0_head/e3ef388a/foo/head/test-rados-api-plana40-12123-6/113/18446744073709551615/0 'snapset' = 31
2014-08-24 14:13:30.040097 7faf01235700 10 osd.3 pg_epoch: 948 pg[113.2s0( v 944'2 (0'0,944'2] local-les=946 n=1 ec=941 les/c 946/944 945/945/941) [3,4,1] r=0 lpr=945 pi=941-944/1 crt=0'0 lcod 944'1 mlcod 0'0 active+degraded] populate_obc_watchers e3ef388a/foo/head/test-rados-api-plana40-12123-6/113
2014-08-24 14:13:30.040127 7faf01235700 10 osd.3 pg_epoch: 948 pg[113.2s0( v 944'2 (0'0,944'2] local-les=946 n=1 ec=941 les/c 946/944 945/945/941) [3,4,1] r=0 lpr=945 pi=941-944/1 crt=0'0 lcod 944'1 mlcod 0'0 active+degraded]   unconnected watcher 1,client.4886 will expire 30.000000
2014-08-24 14:13:30.040157 7faf01235700 10 osd.3 pg_epoch: 948 pg[113.2s0( v 944'2 (0'0,944'2] local-les=946 n=1 ec=941 les/c 946/944 945/945/941) [3,4,1] r=0 lpr=945 pi=941-944/1 crt=0'0 lcod 944'1 mlcod 0'0 active+degraded]  -- Watch(1,client.4886) Watch()
2014-08-24 14:13:30.040174 7faf01235700 10 osd.3 pg_epoch: 948 pg[113.2s0( v 944'2 (0'0,944'2] local-les=946 n=1 ec=941 les/c 946/944 945/945/941) [3,4,1] r=0 lpr=945 pi=941-944/1 crt=0'0 lcod 944'1 mlcod 0'0 active+degraded]  -- Watch(1,client.4886) disconnect
2014-08-24 14:13:30.040188 7faf01235700 15 osd.3 pg_epoch: 948 pg[113.2s0( v 944'2 (0'0,944'2] local-les=946 n=1 ec=941 les/c 946/944 945/945/941) [3,4,1] r=0 lpr=945 pi=941-944/1 crt=0'0 lcod 944'1 mlcod 0'0 active+degraded]  -- Watch(1,client.4886) registering callback, timeout: 30
2014-08-24 14:13:30.040212 7faf01235700 20 osd.3 pg_epoch: 948 pg[113.2s0( v 944'2 (0'0,944'2] local-les=946 n=1 ec=941 les/c 946/944 945/945/941) [3,4,1] r=0 lpr=945 pi=941-944/1 crt=0'0 lcod 944'1 mlcod 0'0 active+degraded] ReplicatedPG::check_blacklisted_obc_watchers for obc e3ef388a/foo/head/test-rados-api-plan
...
and when the op comes around again, it's seen as a dup ...
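As a rough guide to the "unconnected watcher 1,client.4886 will expire 30.000000" and "registering callback, timeout: 30" lines above, here is a minimal sketch; the type and function names (Watch, ObjectContext, rebuild_watcher, connect_watcher) are hypothetical, not the actual OSD classes. After the interval change the primary rebuilds watcher state from the persisted object info, but with no live client session attached it can only arm a disconnect timer and wait for the client to re-send the watch op.

#include <cstdint>
#include <map>
#include <utility>

struct Watch {
  bool connected = false;     // no client session attached yet
  double expire_after = 0.0;  // seconds until the watch is dropped
};

struct ObjectContext {
  // keyed by (cookie, client id), e.g. (1, 4886) for "Watch(1,client.4886)"
  std::map<std::pair<uint64_t, uint64_t>, Watch> watchers;

  // Mirrors the populate_obc_watchers / "registering callback, timeout: 30"
  // sequence in the log: recreate the watcher entry and start its expiry
  // timer; a later watch op from the client is expected to reconnect it.
  void rebuild_watcher(uint64_t cookie, uint64_t client, double timeout_s) {
    Watch w;
    w.expire_after = timeout_s;  // 30 seconds in the log above
    watchers[{cookie, client}] = w;
  }

  // What a resent watch op would do: attach the session so the expiry
  // timer no longer matters.  In this bug, that op never has any effect.
  void connect_watcher(uint64_t cookie, uint64_t client) {
    auto it = watchers.find({cookie, client});
    if (it != watchers.end())
      it->second.connected = true;
  }
};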

#1

Updated by Sage Weil over 9 years ago

ubuntu@teuthology:/a/teuthology-2014-08-24_02:30:02-rados-next-testing-basic-multi/446755 too?

#2

Updated by Sage Weil over 9 years ago

  • Status changed from New to 12
#3

Updated by Sage Weil over 9 years ago

I think this is what's happening:

- op arrives, we register the watcher, fire off repops
- pg repeers, but keeps same primary; requeues op
- op is dequeued, sees the object is degraded, waits for that
- object recovers, op is requeued
- op is dequeued, sees the tid has already happened, returns 0

I think the degraded part (steps 3-4) isn't necessary for this to be buggy. I suspect the larger problem is that we are doing the tid dup checks on the watch ops, and they are really sort of special. Normally the client resends with a unique tid so it isn't usually a problem, but given the osd can requeue for a range of reasons, that isn't sufficient...
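As a rough illustration of the dup-tid short-circuit described above (ClientOp and OpLog are hypothetical names, not the actual Ceph classes): a requeued watch op that is re-executed with the same tid gets its saved result replayed, but the side effect that matters for watch, re-registering the client's watcher on the primary, is never redone.

#include <cstdint>
#include <map>
#include <utility>

struct ClientOp {
  uint64_t client_id;  // e.g. client.4886 in the log above
  uint64_t tid;        // client-assigned transaction id, unchanged on requeue
};

struct OpLog {
  // (client, tid) -> result of the already-applied request
  std::map<std::pair<uint64_t, uint64_t>, int> completed;

  // Returns true if the op was answered from the completed-request log.
  bool replay_if_dup(const ClientOp& op, int* result_out) {
    auto it = completed.find({op.client_id, op.tid});
    if (it == completed.end())
      return false;
    // Fine for plain writes: the mutation already happened, so resend the
    // saved reply.  Not fine for watch: the in-memory watcher registration
    // still needs to be re-established, otherwise it stays "unconnected"
    // and expires (see the 30s timeout in the description).
    *result_out = it->second;
    return true;
  }
};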

#4

Updated by Sage Weil over 9 years ago

  • Subject changed from LibRadosWatchNotifyECPP.WatchNotifyTestPP hang to osd: watch request becomes no-op if requeued and dup
#5

Updated by Sage Weil over 9 years ago

  • Tracker changed from Bug to Fix
  • Target version set to 0.86
#6

Updated by Sage Weil over 9 years ago

  • Tracker changed from Fix to Bug
  • Subject changed from osd: watch request becomes no-op if requeued and dup to objecter doesn't reconnect watch on interval change w/ same primary
  • Assignee set to Sage Weil
  • Backport set to firefly,dumpling
#7

Updated by Sage Weil over 9 years ago

  • Target version deleted (0.86)

the problem is that the linger op isn't resent on an interval change where the primary is the same.
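A hypothetical sketch of the client-side decision this comment describes; PGMapping, LingerOp, and the function names are illustrative, not the real Objecter code. The point is only the condition: resending the linger op solely when the primary changes misses interval changes that keep the same primary.

#include <cstdint>

struct PGMapping {
  int primary;              // acting primary OSD id for the object's PG
  uint32_t interval_epoch;  // epoch at which the current interval began
};

struct LingerOp {
  PGMapping last_sent;      // mapping when the watch was last (re)sent
};

// Behaviour this ticket describes: only resend the linger (watch) op when
// the primary changes.  An interval change that keeps the same primary
// therefore never reconnects the watch, and the OSD-side watcher times out.
bool should_resend_buggy(const LingerOp& op, const PGMapping& now) {
  return op.last_sent.primary != now.primary;
}

// Intended behaviour: any new interval invalidates the watcher session on
// the OSD, so the client must re-send the watch even if the primary is
// unchanged.
bool should_resend_fixed(const LingerOp& op, const PGMapping& now) {
  return op.last_sent.primary != now.primary ||
         op.last_sent.interval_epoch != now.interval_epoch;
}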

#8

Updated by Sage Weil over 9 years ago

  • Priority changed from Urgent to Immediate
#9

Updated by Samuel Just over 9 years ago

have test

#10

Updated by Samuel Just over 9 years ago

ceph-qa-suite:wip-notify

#11

Updated by Sage Weil over 9 years ago

  • Assignee changed from Sage Weil to Samuel Just
#12

Updated by Samuel Just over 9 years ago

  • Status changed from 12 to 7
#13

Updated by Samuel Just over 9 years ago

  • Status changed from 7 to Fix Under Review
#14

Updated by Samuel Just over 9 years ago

  • Status changed from Fix Under Review to Pending Backport
#15

Updated by Sage Weil over 9 years ago

  • Status changed from Pending Backport to Resolved
#16

Updated by Josh Durgin over 9 years ago

  • Status changed from Resolved to Pending Backport
#17

Updated by Josh Durgin over 9 years ago

  • Status changed from Pending Backport to Resolved
  • Backport changed from firefly,dumpling to firefly

This did not need backporting to dumpling after all: the regression was introduced after dumpling by 860d72770cdf092c027d50f4ee03bed76c975599. Dumpling handles this correctly, but it is still affected by #9806.
