Bug #5615

lock ops are not re-sent when cluster gets marked un-full

Added by Greg Farnum about 6 years ago. Updated about 6 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Target version: -
Start date: 07/12/2013
Due date:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

I'm not certain what the correct behavior should be in this case, so
maybe it is not a bug, but here is what is happening:

When an OSD becomes full, a process fails, and we unmount the rbd and
attempt to remove the lock associated with the rbd for the process.
The unmount works fine, but removing the lock is failing right now
because the list_lockers() function call never returns.

Here is a code snippet I tried with a fake rbd lock on a test cluster:

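import rados
import rbd

# Connect using the local cluster config, open the 'rbd' pool, then the image.
with rados.Rados(conffile='/etc/ceph/ceph.conf') as cluster:
    with cluster.open_ioctx('rbd') as ioctx:
        with rbd.Image(ioctx, 'msd1') as image:
            # This is the call that never returns in the scenario above.
            image.list_lockers()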

The process never returns, even after the ceph cluster returns to a
healthy state.  The only indication of the problem is this message in
the /var/log/messages file:

Jul 11 23:25:05 node-172-16-0-13 python: 2013-07-11 23:25:05.826793 7ffc66d72700  0 client.6911.objecter  FULL, paused modify 0x7ffc687c6050 tid 2
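Until the underlying resend problem is fixed, a caller can at least avoid hanging its own main thread. The following is a minimal sketch, not a fix: it runs the blocking call in a daemon thread and gives up after a timeout. The helper names `call_with_timeout` and `list_lockers_guarded` and the 10-second default are hypothetical, and the sketch assumes the Python binding releases the GIL while the librbd call is blocked. The abandoned thread stays blocked; only the caller is freed.

```python
import threading


def call_with_timeout(func, timeout_s):
    """Run func() in a daemon thread. Returns (finished, result).

    Hypothetical helper: if func() has not returned within timeout_s
    seconds, the thread is abandoned (it cannot be killed) and
    (False, None) is returned.
    """
    box = {}

    def runner():
        try:
            box["result"] = func()
        except Exception as exc:  # hand the error back to the caller
            box["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        return False, None
    if "error" in box:
        raise box["error"]
    return True, box.get("result")


def list_lockers_guarded(image_name="msd1", pool="rbd", timeout_s=10.0):
    """Hypothetical wrapper around the snippet from the report."""
    # Imported lazily so call_with_timeout works without Ceph installed.
    import rados
    import rbd

    def probe():
        with rados.Rados(conffile="/etc/ceph/ceph.conf") as cluster:
            with cluster.open_ioctx(pool) as ioctx:
                with rbd.Image(ioctx, image_name) as image:
                    return image.list_lockers()

    return call_with_timeout(probe, timeout_s)
```

If `list_lockers_guarded()` returns `(False, None)`, the call is still stuck and the lock state is unknown, so the caller should report that rather than assume the image is unlocked.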

Any help would be greatly appreciated.

ceph version:

ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)

This may turn out to be a librados issue, but it showed up via rbd locking.

History

#1 Updated by Sage Weil about 6 years ago

I bet this affects all class calls/execs, because the objecter doesn't know whether one is a read or a write. We may need to resend all of them (and/or pessimistically flag them all as writes) so that they get resent.

#2 Updated by Greg Farnum about 6 years ago

If it's stopping the op because the cluster is marked FULL then it ought to be able to know the op needs to be sent out again later. ;)

#3 Updated by Sage Weil about 6 years ago

Oh, right. In that case the op was probably sent and then dropped by the OSD. The objecter only pauses ops marked as writes, and IIRC class ops aren't.
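The failure mode hypothesized in comments #1 to #3 can be illustrated with a toy model. This is not the Ceph Objecter code: `Op`, `ToyOsd`, `ToyObjecter`, and `run` are all made-up names, and the OSD's `full` attribute stands in for the cluster-wide full flag. The sketch only shows why an op that modifies data but is not flagged as a write can be lost: it is sent, dropped by the full OSD, and never resent unless in-flight ops are also resent on un-full.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    tid: int
    is_write: bool   # flag the client checks when deciding whether to pause
    modifies: bool   # what the op really does on the OSD (e.g. a class call)


@dataclass
class ToyOsd:
    full: bool = False

    def handle(self, op):
        # A full OSD drops modifying ops without replying.
        return not (self.full and op.modifies)


@dataclass
class ToyObjecter:
    osd: ToyOsd
    paused: list = field(default_factory=list)     # held back, never sent
    in_flight: list = field(default_factory=list)  # sent, no reply yet
    completed: list = field(default_factory=list)  # tids that got a reply

    def submit(self, op):
        if self.osd.full and op.is_write:
            self.paused.append(op)   # only ops flagged as writes are paused
        else:
            self._send(op)

    def _send(self, op):
        if self.osd.handle(op):
            self.completed.append(op.tid)
        else:
            self.in_flight.append(op)

    def on_unfull(self, resend_in_flight=False):
        to_send, self.paused = self.paused, []
        if resend_in_flight:         # the "resend all of them" idea from #1
            to_send += self.in_flight
            self.in_flight = []
        for op in to_send:
            self._send(op)


def run(resend_in_flight):
    osd = ToyOsd(full=True)
    objecter = ToyObjecter(osd)
    objecter.submit(Op(tid=1, is_write=True, modifies=True))   # paused
    objecter.submit(Op(tid=2, is_write=False, modifies=True))  # sent, dropped
    osd.full = False
    objecter.on_unfull(resend_in_flight)
    return sorted(objecter.completed), [op.tid for op in objecter.in_flight]
```

In this model, `run(False)` returns `([1], [2])`: the flagged write completes after un-full, while the unflagged class call stays in flight forever. `run(True)` returns `([1, 2], [])`.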

#4 Updated by Sage Weil about 6 years ago

  • Status changed from New to Duplicate

This is the linger resend on unfull bug, #6070.
