Bug #39710

[exclusive lock] An image got more than one lock owner at the same time

Added by yu feng almost 5 years ago. Updated almost 5 years ago.

Status: New
Priority: Low
Assignee: -
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

With Ceph Mimic 13.2.5, I wrote a simple program to test the librbd exclusive-lock-related APIs (i.e. lock_acquire, lock_break, ...) and found that, in a specific case, more than one client can consider itself the lock owner at the same time.

The "rbd_blacklist_on_break_lock" is setted as false before the test, and the information of the tested image is:

[root@node0012 ~]# rbd info p1/lun7
rbd image 'lun7':
size 1 GiB in 256 objects
order 22 (4 MiB objects)
id: f6736b8b4567
block_name_prefix: rbd_data.f6736b8b4567
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
op_features:
flags:
create_timestamp: Sat May 11 14:29:38 2019
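
For reference, rbd_blacklist_on_break_lock is a client-side librbd option; one way to disable it is via ceph.conf on the client hosts (shown here as an illustrative snippet, not copied verbatim from my setup):

[client]
rbd_blacklist_on_break_lock = false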

The test procedure is (a rough sketch of the client-side calls follows this list):
1. Clients on host node0010 and host node0011 (IP addresses 10.0.10.10 and 10.0.10.11, respectively) opened the same image p1/lun7 described above.
2. The client on node0010 called lock_acquire.
3. The client on node0011 called lock_break and then lock_acquire.
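
The reproduction program is roughly equivalent to the following librbd C++ API sketch (illustrative only; error handling is omitted and the actual test program may differ):

// Illustrative sketch of the reproduction, not the exact test program.
// Step 2 runs on node0010; step 3 runs on node0011. Return codes are
// ignored for brevity.
#include <rados/librados.hpp>
#include <rbd/librbd.hpp>
#include <list>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init(nullptr);            // default client id
  cluster.conf_read_file(nullptr);  // /etc/ceph/ceph.conf
  cluster.connect();

  librados::IoCtx ioctx;
  cluster.ioctx_create("p1", ioctx);

  librbd::RBD rbd;
  librbd::Image image;
  rbd.open(ioctx, image, "lun7");

  // Step 2 (node0010): proactively acquire the exclusive lock.
  image.lock_acquire(RBD_LOCK_MODE_EXCLUSIVE);

  // Step 3 (node0011): break the existing lock, then acquire it.
  rbd_lock_mode_t lock_mode;
  std::list<std::string> lock_owners;
  image.lock_get_owners(&lock_mode, &lock_owners);
  if (!lock_owners.empty()) {
    image.lock_break(RBD_LOCK_MODE_EXCLUSIVE, lock_owners.front());
  }
  image.lock_acquire(RBD_LOCK_MODE_EXCLUSIVE);

  return 0;  // Image/IoCtx/Rados destructors clean up
}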

After step 3, the commands below show that the lock owner is on node0011:

[root@node0012 ~]# rados -p p1 lock info rbd_header.f6736b8b4567 rbd_lock
{"name":"rbd_lock","type":"exclusive","tag":"internal","lockers":[{"name":"client.63133","cookie":"auto 140200885894320","description":"","expiration":"0.000000","addr":"10.0.10.11:0/2140833570"}]}

[root@node0012 ~]# rbd lock ls p1/lun7
There is 1 exclusive lock on this image.
Locker ID Address
client.63133 auto 140200885894320 10.0.10.11:0/2140833570

but the client on node0010 still thinks it is the lock owner:

2019-05-11 14:34:24.585 7f24d9ffb700 10 librbd::Watcher::C_NotifyAck 0x7f24d400b880 C_NotifyAck: id=4621384811568, handle=139796018117808
2019-05-11 14:34:24.585 7f24d9ffb700 10 librbd::ImageWatcher: 0x7f24d4004dd0 image exclusively locked announcement
2019-05-11 14:34:24.585 7f24d9ffb700 10 librbd::ImageWatcher: 0x7f24d4004dd0 current lock owner: [*63133,140200885894320*]
2019-05-11 14:34:24.585 7f24d9ffb700 20 librbd::ManagedLock: 0x7f24d400e860 is_lock_owner: =1
2019-05-11 14:34:24.585 7f24d9ffb700 10 librbd::Watcher::C_NotifyAck 0x7f24d400b880 finish: r=0
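
From the application's side on node0010, the same stale state is visible through is_exclusive_lock_owner; a sketch, assuming the image handle from the program above:

  // On node0010, after node0011 has broken and re-acquired the lock,
  // librbd still reports this client as the exclusive lock owner,
  // matching the "is_lock_owner: =1" line in the log above.
  bool is_owner = false;
  int r = image.is_exclusive_lock_owner(&is_owner);
  // r == 0 and is_owner == true here, even though `rados lock info`
  // shows client.63133 on node0011 as the only locker.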

Then the client on node0010 continued to write data and got the following log:

2019-05-11 14:34:46.465 7f25042cb480 20 librbd::io::ImageRequestWQ: 0x1e042d0 aio_write: ictx=0x1e02f40, completion=0x1e05550, off=0, len=4096, flags=0
2019-05-11 14:34:46.465 7f25042cb480 20 librbd::io::ImageRequestWQ: 0x1e042d0 queue: ictx=0x1e02f40, req=0x1dec9d0
2019-05-11 14:34:46.465 7f24d97fa700 20 librbd::io::AsyncOperation: 0x1e05688 start_op
2019-05-11 14:34:46.465 7f24d97fa700 20 librbd::io::ImageRequestWQ: 0x1e042d0 process: ictx=0x1e02f40, req=0x1dec9d0
2019-05-11 14:34:46.465 7f24d97fa700 20 librbd::io::ImageRequest: 0x7f24d97f6e90 send: aio_write: ictx=0x1e02f40, completion=0x1e05550

2019-05-11 14:34:46.465 7f24d97fa700 20 librbd::ObjectMap: 0x7f24cc008db0 aio_update: start=0, end=1, ->1
2019-05-11 14:34:46.465 7f24d97fa700 20 librbd::object_map::UpdateRequest: 0x7f24cc0068e0 update_object_map: ictx=0x1e02f40, oid=rbd_object_map.f6736b8b4567, [0,1) = ->1
2019-05-11 14:34:46.465 7f24d97fa700 10 client.63127.objecter _op_submit oid rbd_object_map.f6736b8b4567 '@1' '@1' [call lock.assert_locked,call rbd.object_map_update] tid 29 osd.4
2019-05-11 14:34:46.472 7f24f2ad0700 10 client.63127.objecter op 0 rval -16 len 0
2019-05-11 14:34:46.472 7f24f2ad0700 10 client.63127.objecter op 1 rval 0 len 0
2019-05-11 14:34:46.472 7f24d9ffb700 20 librbd::object_map::UpdateRequest: 0x7f24cc0068e0 handle_update_object_map: r=-16
2019-05-11 14:34:46.472 7f24d9ffb700 20 librbd::object_map::UpdateRequest: 0x7f24cc0068e0 update_in_memory_object_map:
2019-05-11 14:34:46.472 7f24d9ffb700 20 librbd::object_map::Request: 0x7f24cc0068e0 should_complete: r=-16
2019-05-11 14:34:46.472 7f24d9ffb700 -1 librbd::object_map::Request: failed to update object map: (16) Device or resource busy
2019-05-11 14:34:46.472 7f24d9ffb700 -1 librbd::object_map::InvalidateRequest: 0x7f24d4011d30 invalidating object map in-memory
2019-05-11 14:34:46.472 7f24d9ffb700 -1 librbd::object_map::InvalidateRequest: 0x7f24d4011d30 invalidating object map on-disk
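
For reference, the write in the log above corresponds roughly to the following sketch (offset 0, length 4096; the actual program may use a different payload or completion callback):

  // Asynchronous 4 KiB write at offset 0 issued on node0010.
  librados::bufferlist bl;
  bl.append(std::string(4096, 'a'));  // arbitrary payload
  librbd::RBD::AioCompletion *comp =
      new librbd::RBD::AioCompletion(nullptr, nullptr);
  image.aio_write(0, bl.length(), bl, comp);
  comp->wait_for_complete();
  // The pre-write object-map update fails with -EBUSY (see the log),
  // and the object map is invalidated.
  comp->release();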

The object_map::UpdateRequest got error -16 because the "call lock.assert_locked" op failed. We can also see that the object rbd_object_map.f6736b8b4567 is locked by the client on node0011:

[root@node0012 ~]# rados -p p1 lock info rbd_object_map.f6736b8b4567 rbd_lock
{"name":"rbd_lock","type":"exclusive","tag":"","lockers":[{"name":"client.63133","cookie":"","description":"","expiration":"0.000000","addr":"10.0.10.11:0/2140833570"}]}

In the end, the image ended up with the "object map invalid" and "fast diff invalid" flags:

[root@node0012 ~]# rbd info p1/lun7
rbd image 'lun7':
size 1 GiB in 256 objects
order 22 (4 MiB objects)
id: f6736b8b4567
block_name_prefix: rbd_data.f6736b8b4567
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
op_features:
flags: object map invalid, fast diff invalid
create_timestamp: Sat May 11 14:29:38 2019

Here are my questions:
1. Why did the client on node0010 receive the "AcquiredLockPayload" notification from the true lock owner but not change its lock state? It seems the following code does not take effect in this case.

RWLock::RLocker owner_locker(m_image_ctx.owner_lock);
if (m_image_ctx.exclusive_lock != nullptr) {
  // potentially wake up the exclusive lock state machine now that
  // a lock owner has advertised itself
  m_image_ctx.exclusive_lock->handle_peer_notification(0);
}

2. Should the exclusive lock feature always be used together with "rbd_blacklist_on_break_lock = true"?
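
For context on question 2, the option can also be enabled per client at runtime before connecting; an illustrative sketch building on the C++ code above:

  // Enable blacklisting of the previous lock owner when breaking the
  // lock, without editing ceph.conf (client-side override).
  librados::Rados cluster;
  cluster.init(nullptr);
  cluster.conf_read_file(nullptr);
  cluster.conf_set("rbd_blacklist_on_break_lock", "true");
  cluster.connect();
  // ... open the image and call lock_break/lock_acquire as before ...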


Files

node0010.log.tar.gz (54.8 KB) - yu feng, 05/11/2019 07:45 AM
node0011.log.tar.gz (55 KB) - yu feng, 05/11/2019 07:47 AM