Bug #9806

Objecter: resend linger ops on split

Added by Samuel Just about 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
10/17/2014
Due date:
% Done:

0%

Source:
other
Tags:
conflict
Backport:
firefly
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Otherwise, we can lose notifies.

cb9262abd7fd5f0a9f583bd34e4c425a049e56ce


Related issues

Copied to Ceph - Backport #11699: Objecter: resend linger ops on split Resolved 10/17/2014

Associated revisions

Revision cb9262ab (diff)
Added by Josh Durgin about 4 years ago

Objecter: resend linger ops on any interval change

Watch/notify ops need to be resent after a pg split occurs, as well as
a few other circumstances that the existing objecter checks did not
catch.

Refactor the check the OSD uses for this to add a version taking the
more basic types instead of the whole OSD map, and stash the needed
info when an op is sent.

Fixes: #9806
Backport: giant, firefly, dumpling
Signed-off-by: Josh Durgin <>
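
The commit message above describes the shape of the fix rather than the code itself. As a minimal sketch of that approach, with hypothetical names (not the actual Objecter/osd_types code): stash the basic interval inputs when a linger op is sent, then compare them against each new map with a check that takes those basic types instead of a whole OSD map.

  // Illustrative sketch only -- all struct and function names are made up.
  #include <cstdint>
  #include <vector>

  struct LingerOpState {
    // Interval inputs stashed when the op was last sent.
    std::vector<int> acting;   // acting set at send time
    int primary = -1;          // primary OSD at send time
    uint32_t pg_num = 0;       // pool pg_num at send time (changes on split)
    bool pool_existed = true;
  };

  // Check built from basic types only, so the caller does not need to keep
  // the previous full OSD map around.
  bool interval_changed(const LingerOpState& old_state,
                        const std::vector<int>& new_acting,
                        int new_primary,
                        uint32_t new_pg_num,
                        bool pool_exists) {
    return old_state.acting != new_acting ||
           old_state.primary != new_primary ||
           old_state.pg_num != new_pg_num ||      // pg split (or merge)
           old_state.pool_existed != pool_exists;
  }

  // On each new map the Objecter would walk its linger (watch/notify) ops and
  // resend any op for which interval_changed() returns true; otherwise the
  // watch stays registered against the old interval and notifies are lost.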

Revision d296120c (diff)
Added by Josh Durgin over 3 years ago

Objecter: resend linger ops on any interval change

Watch/notify ops need to be resent after a pg split occurs, as well as
a few other circumstances that the existing objecter checks did not
catch.

Refactor the check the OSD uses for this to add a version taking the
more basic types instead of the whole OSD map, and stash the needed
info when an op is sent.

Fixes: #9806
Backport: giant, firefly, dumpling
Signed-off-by: Josh Durgin <>
(cherry picked from commit cb9262abd7fd5f0a9f583bd34e4c425a049e56ce)

Conflicts:
src/osd/osd_types.cc
src/osdc/Objecter.cc
Minor differences.

History

#1 Updated by Samuel Just about 4 years ago

  • Description updated (diff)

#2 Updated by Josh Durgin about 4 years ago

  • Backport set to giant, firefly, dumpling

#3 Updated by Josh Durgin about 4 years ago

  • Status changed from New to Testing
  • Assignee set to Josh Durgin

#4 Updated by Sage Weil almost 4 years ago

  • Status changed from Testing to Resolved

#5 Updated by Sage Weil almost 4 years ago

  • Status changed from Resolved to Pending Backport

#6 Updated by Loic Dachary almost 4 years ago

  • Description updated (diff)

Commit cb9262abd7fd5f0a9f583bd34e4c425a049e56ce does not apply cleanly on dumpling, which suggests more should be backported for it to make sense. Should this be backported for v0.67.12, or can it wait?

#7 Updated by Loic Dachary over 3 years ago

It won't be in dumpling v0.67.12 but ... it could be in v0.80.10 ;-) It looks like an important fix.

#8 Updated by Loic Dachary over 3 years ago

  • Backport changed from giant, firefly, dumpling to firefly, dumpling

already in giant

#9 Updated by Loic Dachary over 3 years ago

  • Backport changed from firefly, dumpling to firefly

dumpling is end of life

#10 Updated by Loic Dachary over 3 years ago

  • Tags set to conflict
  • Regression set to No

#11 Updated by Nathan Cutler over 3 years ago

  • Status changed from Pending Backport to Resolved

#12 Updated by Christian Theune over 3 years ago

As far as I understand, this hurts snapshots. I'm on Firefly and getting bitten by this. Is there a workaround to get back to a usable snapshot state once this has kicked in?

#13 Updated by Josh Durgin over 3 years ago

A workaround is to detach and reattach your images. This reopens them and reestablishes the watch.
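
For context, here is a rough librbd sketch of what "detach and reattach" amounts to from a client's point of view: opening an image registers a watch on its header object, so closing and reopening re-establishes it. Names are placeholders and error handling is reduced to return codes; real clients such as qemu do this implicitly when the image is reopened.

  // Sketch: reopening an image re-registers the watch on its header object.
  #include <rados/librados.hpp>
  #include <rbd/librbd.hpp>

  int reopen_image(librados::IoCtx& io_ctx, const char* image_name) {
    librbd::RBD rbd;
    librbd::Image image;

    int r = rbd.open(io_ctx, image, image_name);  // open establishes a fresh watch
    if (r < 0)
      return r;

    // ... use the image; the watch is now valid for the current mapping ...

    return image.close();  // close tears the watch down again
  }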

#14 Updated by Christian Theune over 3 years ago

Ah. So in that case restarting Qemu would be the specific action for that, right?

I'm currently trying this out. I'm a bit unclear on the specifics of the trigger. We have some automation code that causes pg_num and pgp_num to be automatically (slowly) adjusted for growing pools.

However, this was running for a while without the cluster exhibiting the issue in a way that we would notice. The specific point when we noticed was when we updated our tunables to the recommended settings for Firefly (and caused a large CRUSH rearrangement).

Do you think that, for running operations, stopping our automatic pg_num/pgp_num adjustment would be sufficient to avoid this bug?

For further clarification: does this bug apply on a per-pool basis, per-image basis or cluster-wide? My guess would be this applies on a per-pool basis.

Thanks for the hint!

#15 Updated by Christian Theune over 3 years ago

OK, so I restarted one of the VMs by exiting Qemu and starting it afresh. I took a snapshot immediately afterwards, and the mapped rbd device has given a consistent hash multiple times since then.

#16 Updated by Josh Durgin over 3 years ago

Yes, restarting qemu will fix it. The trigger for the issue is pg split, so it would only affect pools where you had increased pg_num and pgp_num. If you avoid splitting, you avoid this bug. Other crush changes like straw2 or new tunables should not cause this issue.
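
To illustrate why a split in particular moves watches, here is a toy example modeled on Ceph's stable-mod placement (treat the numbers as an approximation, not real cluster output): when pg_num grows, some raw placement hashes start mapping to the new PGs, so a watch registered before the split can be left behind on the old PG.

  // Toy illustration of how a pg_num increase moves objects to new PGs.
  // Modeled on Ceph's ceph_stable_mod(); bmask is the next power of two at
  // or above b, minus 1.
  #include <cstdio>

  static int stable_mod(int x, int b, int bmask) {
    return ((x & bmask) < b) ? (x & bmask) : (x & (bmask >> 1));
  }

  int main() {
    int hash = 13;  // raw placement hash of some object
    printf("pg_num=8 : pg %d\n", stable_mod(hash, 8, 7));    // -> pg 5
    printf("pg_num=16: pg %d\n", stable_mod(hash, 16, 15));  // -> pg 13 (split)
    return 0;
  }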

#17 Updated by Christian Theune over 3 years ago

That's a relief! Thanks for the explanation, I hope other people stumbling over this bug will find this helpful, too. :)
