Bug #9419

closed

dumpling->firefly upgrade, sending setallochint?

Added by Samuel Just over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:

100%

Source:
Support
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Dumpling OSDs crash with bad op 39 (setallochint) when the first OSD is upgraded to firefly.

https://github.com/ceph/ceph/pull/2543


Subtasks 1 (0 open, 1 closed)

Feature #9568: Add test case to test #9419 (ceph wip-9419), Resolved, Yuri Weinstein, 09/22/2014

Actions #1

Updated by Samuel Just over 9 years ago

client rbd (firefly) --with setallochint--> primary (firefly) --with setallochint--> replica (dumpling) crash

Actions #2

Updated by Samuel Just over 9 years ago

The problem here appears to be that the user upgraded the clients before the OSDs were fully upgraded. librbd sends setallochint unconditionally, and old OSDs respond with EOPNOTSUPP. The bug is that the primary supported the op and the replicas didn't. The primary probably should have returned EOPNOTSUPP.

Actions #3

Updated by Samuel Just over 9 years ago

  • Assignee deleted (Samuel Just)
Actions #4

Updated by Samuel Just over 9 years ago

  • Assignee set to David Zafman

Two steps:
1) During GetInfo, for actingbackfill peers, build up a feature set that is the intersection of the feature sets of all of the peers, updating it as we receive messages from them.
2) Use this feature set in the setallochint handler in do_osd_ops to return EOPNOTSUPP if any peer does not understand the op.
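
The two steps above can be sketched roughly as follows. This is a minimal Python illustration, not the Ceph code (which is C++). The bit value and the names `FEATURE_ALLOC_HINT`, `intersect_peer_features`, and `handle_setallochint` are hypothetical, chosen only for the example, and the error code follows the EOPNOTSUPP used later in this thread.

```python
import errno
from functools import reduce

# Hypothetical feature bit; the real bit name and value are not given here.
FEATURE_ALLOC_HINT = 1 << 0
ALL_FEATURES = (1 << 64) - 1


def intersect_peer_features(peer_features):
    """Step 1: AND together the feature masks reported by each peer."""
    return reduce(lambda acc, mask: acc & mask, peer_features, ALL_FEATURES)


def handle_setallochint(common_features):
    """Step 2: refuse the op unless every peer understands it."""
    if not common_features & FEATURE_ALLOC_HINT:
        return -errno.EOPNOTSUPP
    return 0  # safe to apply locally and forward to replicas
```

With one firefly-like peer and one dumpling-like peer, the intersection drops the hint bit and the handler returns an error instead of forwarding an op the replica cannot decode.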

Actions #5

Updated by David Zafman over 9 years ago

  • Status changed from New to 7
Actions #6

Updated by Loïc Dachary over 9 years ago

What happens if:

  • all OSDs in a PG support setallochint,
  • one secondary OSD goes down, and
  • the secondary is replaced by an OSD that does not support setallochint?
Actions #7

Updated by David Zafman over 9 years ago

  • Status changed from 7 to Fix Under Review
Actions #8

Updated by Loïc Dachary over 9 years ago

  • Description updated (diff)
Actions #9

Updated by David Zafman over 9 years ago

Peering happens on any change of PG configuration, so a new set of feature bits is collected from the peers each time. If not all peers support the feature, EOPNOTSUPP is returned to the client and no messages are sent to any secondaries.
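
The scenario from comment #6 can be walked through with a toy model. This is only a sketch under stated assumptions: `ToyPG`, its methods, and the bit value are hypothetical and do not correspond to Ceph classes.

```python
import errno

FEATURE_ALLOC_HINT = 1 << 0  # hypothetical bit value


class ToyPG:
    """Toy placement group that recomputes common features on each peering."""

    def __init__(self):
        self.peers = {}
        self.common_features = 0
        self.sent_to_replicas = []

    def peer(self, features_by_osd):
        # Runs on every change of the acting set, so stale bits never survive.
        self.peers = dict(features_by_osd)
        common = ~0
        for features in self.peers.values():
            common &= features
        self.common_features = common

    def setallochint(self):
        if not self.common_features & FEATURE_ALLOC_HINT:
            # Rejected before anything is sent to the secondaries.
            return -errno.EOPNOTSUPP
        self.sent_to_replicas.extend(self.peers)
        return 0
```

Peering first with two hint-capable OSDs and then again after one is replaced by an OSD without the bit makes the second `setallochint()` call return `-errno.EOPNOTSUPP`, with nothing new recorded in `sent_to_replicas`.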

Actions #10

Updated by Loïc Dachary over 9 years ago

Thanks for explaining. Since the alloc hint is optional, it does not matter if it is activated and then deactivated later.

Actions #11

Updated by David Zafman over 9 years ago

Notes on using feature bits already present. The problem is that CEPH_FEATURE_MSGR_KEEPALIVE2 was backported, so we'd have to check CEPH_FEATURE_OSD_POOLRESEND, but that is over two months' worth of changes later. For maintainability I'd rather have a feature bit dedicated to the feature being checked for.

f825624f (Sage Weil 2014-01-29 19:47:21 -0800 125) CEPH_FEATURE_OSD_PRIMARY_AFFINITY
64568023 (Ilya Dryomov 2014-02-21 16:34:13 +0200) Added HINT CODE (v0.78)
d747d79f (Sage Weil 2014-03-27 21:09:13 -0700 126) CEPH_FEATURE_MSGR_KEEPALIVE2 (v0.79)
45e79a17 (Sage Weil 2014-05-08 10:50:51 -0700 54) CEPH_FEATURE_OSD_POOLRESEND (v0.81)
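
To illustrate why a backported bit makes a poor proxy, here is a small sketch. The bit positions, the toy peer table, and the `misjudged` helper are hypothetical. Only the fact that KEEPALIVE2 was backported comes from the note above.

```python
# Hypothetical bit positions, for illustration only.
MSGR_KEEPALIVE2 = 1 << 0
ALLOC_HINT = 1 << 1  # stands in for a dedicated feature bit

# Toy peers: name -> (advertised features, really understands setallochint)
PEERS = {
    "dumpling with backported KEEPALIVE2": (MSGR_KEEPALIVE2, False),
    "firefly with dedicated bit": (MSGR_KEEPALIVE2 | ALLOC_HINT, True),
}


def misjudged(check_bit):
    """Return the peers for which testing check_bit gives the wrong answer."""
    return [
        name
        for name, (features, supports_hint) in PEERS.items()
        if bool(features & check_bit) != supports_hint
    ]
```

In this toy table, checking `MSGR_KEEPALIVE2` wrongly classifies the dumpling peer as hint-capable, while checking the dedicated bit classifies both peers correctly.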

Actions #12

Updated by David Zafman over 9 years ago

  • Source changed from Support to Development
Actions #13

Updated by David Zafman over 9 years ago

  • Source changed from Development to Support
Actions #14

Updated by Samuel Just over 9 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #15

Updated by Samuel Just over 9 years ago

The next step is to add a test for this to the upgrade suites.

Actions #16

Updated by Samuel Just over 9 years ago

  • Assignee changed from David Zafman to Yuri Weinstein
Actions #17

Updated by Samuel Just over 9 years ago

  • Status changed from Pending Backport to 12
Actions #18

Updated by Yuri Weinstein over 9 years ago

  • Status changed from 12 to 7

This is done and a new test case was added: PR https://github.com/ceph/ceph-qa-suite/pull/198

Actions #19

Updated by Samuel Just over 9 years ago

  • Status changed from 7 to Resolved