Bug #18667

[cache tiering] omap data time-traveled to stale version

Added by Jason Dillaman over 7 years ago. Updated over 4 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
Tiering
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Noticed an oddity while examining the logs of an upgrade test failure [1] against a Jewel (v10.2.5+) cluster. An image (test-librbd-smithi163-16572-1/image24 -- header oid 'rbd_header.101426e4642c') was opened and locked:

librbd log:

2017-01-24 23:10:01.426278 7f1311b091c0 10 librbd::ExclusiveLock: 0x7f12ac03f640 send_acquire_lock
...
2017-01-24 23:10:01.427324 7f12be7fc700 10 librbd::exclusive_lock::AcquireRequest: 0x7f131c945b00 send_lock: cookie=auto 139718172082336
...
2017-01-24 23:10:01.431555 7f12827fc700 10 librbd::ExclusiveLock: 0x7f12ac03f640 handle_acquire_lock: r=0

osd log:

2017-01-24 23:10:01.400444 7fc007afe700 10 osd.1 pg_epoch: 36 pg[1.7( v 36'5 (0'0,36'5] local-les=6 n=4 ec=5 les/c/f 6/6/0 5/5/5) [1,2] r=0 lpr=5 crt=8'3 lcod 36'4 mlcod 36'4 active+clean] do_op osd_op(client.4116.0:338 1.4155e1bf rbd_header.101426e4642c [call lock.lock] snapc 0=[] ondisk+write+known_if_redirected e36) v7 may_write may_read -> write-ordered flags ondisk+write+known_if_redirected
...
2017-01-24 23:10:01.400703 7fc007afe700  0 <cls> cls/lock/cls_lock.cc:203: locker id: locker=client.4116, cookie=auto 139718172082336
2017-01-24 23:10:01.400706 7fc007afe700 20 <cls> cls/lock/cls_lock.cc:206: existing_lock_type=none
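
For illustration only (not librbd's actual code path): the "call lock.lock" above is the cls_lock class method taking an advisory exclusive lock on the header object under a client-chosen cookie. Below is a minimal python-rados sketch of the same operation; the pool, object, lock name, and cookie values are placeholders modeled on the log lines, and "rbd_lock" is assumed as the lock name.

import rados

POOL = "test-librbd-smithi163-16572-1"   # base pool from the test run
HEADER_OID = "rbd_header.101426e4642c"
LOCK_NAME = "rbd_lock"                   # assumed well-known librbd lock name
COOKIE = "auto 139718172082336"          # librbd derives its cookie from the watch handle

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(POOL)
try:
    # Advisory exclusive lock; cls_lock records the locker entity and cookie
    # as metadata on the object -- the data that later time-travels.
    ioctx.lock_exclusive(HEADER_OID, LOCK_NAME, COOKIE, desc="", duration=None, flags=0)
finally:
    ioctx.close()
    cluster.shutdown()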

Initially the ops were being forwarded to the base pool (pg 1.7), but at some point the object was promoted to the cache tier pool (test-librbd-smithi163-16572-9 / pg 9.7). This promotion resulted in the loss of the watch handle, which is OK, and librbd responded by re-establishing the watch and attempting to unlock / relock the image header:

librbd log:

2017-01-24 23:10:25.362145 7f12be7fc700 10 librbd::ExclusiveLock: 0x7f12ac03f640 send_release_lock
...
2017-01-24 23:10:25.370027 7f12be7fc700 10 librbd::exclusive_lock::ReleaseRequest: 0x7f12ac00b090 send_unlock: cookie=auto 139718172082336
...
2017-01-24 23:10:25.372626 7f12827fc700 10 librbd::ExclusiveLock: 0x7f12ac03f640 handle_release_lock: r=0
...
2017-01-24 23:10:25.372741 7f12827fc700 10 librbd::ExclusiveLock: 0x7f12ac03f640 send_acquire_lock
...
2017-01-24 23:10:25.376711 7f12be7fc700 10 librbd::exclusive_lock::AcquireRequest: 0x7f126000bc90 send_lock: cookie=auto 139718172150048
...
2017-01-24 23:10:25.383956 7f12827fc700 10 librbd::ExclusiveLock: 0x7f12ac03f640 handle_acquire_lock: r=0

osd log:

2017-01-24 23:10:25.344323 7fb63abe3700 10 osd.2 pg_epoch: 56 pg[9.7( v 56'3 (0'0,56'3] local-les=54 n=2 ec=53 les/c/f 54/54/0 53/53/53) [2,1] r=0 lpr=53 crt=56'1 lcod 56'2 mlcod 56'2 active+clean] do_op osd_op(client.4116.0:1341 9.4155e1bf rbd_header.101426e4642c [call lock.unlock] snapc 0=[] ondisk+write+known_if_redirected e56) v7 may_write may_read -> write-ordered flags ondisk+write+known_if_redirected
...
2017-01-24 23:10:25.344550 7fb63abe3700  0 <cls> cls/lock/cls_lock.cc:299: src locker id: locker=client.4116, cookie=auto 139718172082336
2017-01-24 23:10:25.344555 7fb63abe3700  0 <cls> cls/lock/cls_lock.cc:301: dst locker id: locker=client.4116, cookie=auto 139718172082336
...
2017-01-24 23:10:25.350955 7fb63d3e8700 10 osd.2 pg_epoch: 56 pg[9.7( v 56'6 (0'0,56'6] local-les=54 n=3 ec=53 les/c/f 54/54/0 53/53/53) [2,1] r=0 lpr=53 crt=56'4 lcod 56'5 mlcod 56'5 active+clean] do_op osd_op(client.4116.0:1347 9.4155e1bf rbd_header.101426e4642c [call lock.lock] snapc 0=[] ondisk+write+known_if_redirected e56) v7 may_write may_read -> write-ordered flags ondisk+write+known_if_redirected
...
2017-01-24 23:10:25.351067 7fb63d3e8700  0 <cls> cls/lock/cls_lock.cc:203: locker id: locker=client.4116, cookie=auto 139718172150048
2017-01-24 23:10:25.351068 7fb63d3e8700 20 <cls> cls/lock/cls_lock.cc:206: existing_lock_type=exclusive
...
2017-01-24 23:10:25.351990 7fb63abe3700  1 -- 172.21.15.59:6800/16837 --> 172.21.15.163:0/3774197113 -- osd_op_reply(1347 rbd_header.101426e4642c [call lock.lock] v56'7 uv9 ondisk = 0) v7 -- ?+0 0x7fb6692c58c0 con 0x7fb668d5a600

In the logs above, the lock was successfully released and re-acquired against the header object within the cache tier pool. Note that the lock cookies for the release request show that the incoming cookie (src) and the on-disk cookie (dst) match, so the lock is successfully released. After the lock is re-acquired, the new cookie is 139718172150048 (instead of 139718172082336).
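
The cycle in these logs -- release under the old cookie, re-acquire under a cookie derived from the re-established watch handle -- corresponds roughly to the following python-rados sketch. This is a hedged illustration, not librbd's implementation, and reuses the placeholder names from the sketch above.

OLD_COOKIE = "auto 139718172082336"   # cookie tied to the original watch handle
NEW_COOKIE = "auto 139718172150048"   # cookie tied to the re-established watch handle

ioctx.unlock(HEADER_OID, LOCK_NAME, OLD_COOKIE)          # cls lock.unlock (succeeds: cookies match)
ioctx.lock_exclusive(HEADER_OID, LOCK_NAME, NEW_COOKIE)  # cls lock.lock under the new cookie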

Shortly thereafter, the watch disconnects again (demotion?), and librbd once again re-establishes the watch (now against the base pool) and attempts to release and re-acquire the lock. However, the attempt to release the lock fails with -ENOENT because the cookies didn't align:

librbd log:

2017-01-24 23:10:32.276402 7f12be7fc700 10 librbd::ExclusiveLock: 0x7f12ac03f640 send_release_lock
...
2017-01-24 23:10:32.278443 7f12be7fc700 10 librbd::exclusive_lock::ReleaseRequest: 0x7f12ac03bcc0 send_unlock: cookie=auto 139718172150048
...
2017-01-24 23:10:32.279406 7f12be7fc700 10 librbd::exclusive_lock::ReleaseRequest: 0x7f12ac03bcc0 handle_unlock: r=-2
...
2017-01-24 23:10:32.279442 7f12827fc700 10 librbd::ExclusiveLock: 0x7f12ac03f640 handle_release_lock: r=0

osd log:

2017-01-24 23:10:32.253136 7fc007afe700 10 osd.1 pg_epoch: 62 pg[1.7( v 62'17 (0'0,62'17] local-les=6 n=6 ec=5 les/c/f 6/6/0 5/5/5) [1,2] r=0 lpr=5 crt=62'15 lcod 62'16 mlcod 62'16 active+clean] do_op osd_op(client.4116.0:1855 1.4155e1bf rbd_header.101426e4642c [call lock.unlock] snapc 0=[] ondisk+write+known_if_redirected e62) v7 may_write may_read -> write-ordered flags ondisk+write+known_if_redirected
...
2017-01-24 23:10:32.253347 7fc007afe700  0 <cls> cls/lock/cls_lock.cc:299: src locker id: locker=client.4116, cookie=auto 139718172150048
2017-01-24 23:10:32.253350 7fc007afe700  0 <cls> cls/lock/cls_lock.cc:301: dst locker id: locker=client.4116, cookie=auto 139718172082336
...
2017-01-24 23:10:32.253374 7fc007afe700  1 -- 172.21.15.59:6804/16838 --> 172.21.15.163:0/3774197113 -- osd_op_reply(1855 rbd_header.101426e4642c [call lock.unlock] v0'0 uv15 ondisk = -2 ((2) No such file or directory)) v7 -- ?+0 0x7fc0346fe3c0 con 0x7fc03404d180

Note how the base tier has the stale version of the omap lock cookie data (139718172082336) instead of the correct cookie (139718172150048).
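
From the client side, the failing release looks roughly like the hedged python-rados sketch below (again using the placeholder names from the earlier sketches): once the base tier has reverted to the stale cookie, unlocking under the current cookie raises ObjectNotFound, matching the "ondisk = -2" reply in the osd log.

try:
    # Release under the post-promotion cookie against the (stale) base tier copy.
    ioctx.unlock(HEADER_OID, LOCK_NAME, NEW_COOKIE)
except rados.ObjectNotFound:
    # cls_lock returns -ENOENT when no lock is held under that locker/cookie,
    # i.e. the object's lock metadata regressed to the pre-promotion state.
    print("unlock failed with ENOENT: base tier holds stale lock cookie data")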

[1] http://qa-proxy.ceph.com/teuthology/jdillaman-2017-01-24_18:00:50-upgrade-wip-18617-kraken-distro-basic-smithi/744304

Actions #1

Updated by Jason Dillaman over 7 years ago

  • Subject changed from [cache tiering] omap data time-traveled to older version to [cache tiering] omap data time-traveled to stale version
Actions #2

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category set to Tiering
  • Component(RADOS) OSD added
Actions #3

Updated by Greg Farnum almost 7 years ago

Jason says this "seems to pop up randomly every few weeks or so", so it's definitely a live, going concern. :(

Actions #4

Updated by Josh Durgin over 4 years ago

  • Status changed from New to Can't reproduce