Bug #13372
librbd: object map invalid but not flagged as such
Status: Closed
Description
http://pulpito.ceph.com/teuthology-2015-10-03_23:00:08-rbd-infernalis-distro-basic-multi/1087438/
Reproduced by running the same job, paused so that the images could be examined.
During the rbd copy from testimg1@snap1 to testimg2, there are some warnings about the object map:
2015-10-05T16:53:12.606 INFO:tasks.workunit.client.0.burnupi10.stderr:+ rbd copy testimg1 --snap=snap1 testimg2
2015-10-05T16:53:12.898 INFO:tasks.workunit.client.0.burnupi10.stderr:Image copy: 1% complete... [progress lines elided] ...Image copy: 100% complete...
2015-10-05 16:53:12.890085 7fcd0f7fe700 -1 librbd::ObjectMap: object map lock not owned by client
2015-10-05T16:53:12.901 INFO:tasks.workunit.client.0.burnupi10.stderr:2015-10-05 16:53:12.893055 7fcd0f7fe700 -1 librbd::ObjectMap: object map lock not owned by client
2015-10-05T16:53:12.902 INFO:tasks.workunit.client.0.burnupi10.stderr:2015-10-05 16:53:12.894065 7fcd0f7fe700 -1 librbd::ObjectMap: object map lock not owned by client
2015-10-05T16:53:13.037 INFO:tasks.workunit.client.0.burnupi10.stderr:Image copy: 100% complete...done.
The rados objects 0 and 2 for testimg1 and testimg2 match, but the exported versions differ: testimg2 exports as all zeroes, suggesting an object map that is invalid but was never flagged as such. The cluster is still up on burnupi10 for further investigation.
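The symptom above (matching rados objects but an all-zero export) is consistent with reads being short-circuited by a stale object map: if the map records an object as nonexistent, the read path never touches the rados object and returns zeroes. A minimal sketch of that behavior, not librbd code; the state names and values here are assumptions for illustration:

```python
# Hypothetical model of object-map-assisted reads (not librbd code).
# State values are assumed for illustration only.
OBJECT_NONEXISTENT = 0
OBJECT_EXISTS = 1

def export_image(data_objects, object_map):
    """Read each object, but skip the rados read wherever the map
    claims the object does not exist, yielding zeroes instead."""
    out = {}
    for idx, payload in data_objects.items():
        if object_map.get(idx, OBJECT_NONEXISTENT) == OBJECT_EXISTS:
            out[idx] = payload               # map says object exists: read it
        else:
            out[idx] = b"\0" * len(payload)  # map says no object: zeroes

    return out

# The data objects are present and correct on the OSDs...
src = {0: b"data0", 2: b"data2"}
# ...but the copy's object map was never updated (and never flagged
# invalid), so every object is recorded as nonexistent.
stale_map = {0: OBJECT_NONEXISTENT, 2: OBJECT_NONEXISTENT}
print(export_image(src, stale_map))  # every object reads back as zeroes
```

This matches the observation that the rados objects themselves are identical while the export of testimg2 is all zeroes.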
Updated by Jason Dillaman over 8 years ago
- Status changed from New to In Progress
- Assignee set to Jason Dillaman
Updated by Jason Dillaman over 8 years ago
The lock is held by the OSD!
# rados -p rbd lock info rbd_object_map.12d323e2e3c0 rbd_lock
{"name":"rbd_lock","type":"exclusive","tag":"","lockers":[{"name":"osd.0","cookie":"","description":"","expiration":"0.000000","addr":"10.214.134.32:0\/12881"}]}
The client requests the lock, the assert_locked check fails because the locker is incorrectly recorded as the OSD, and the subsequent unlock fails for the same reason.
ceph-osd.0.log:2015-10-05 16:53:12.764864 7fcfeaeff700 1 -- 10.214.134.32:6800/12881 --> 10.214.134.32:0/3469838287 -- osd_op_reply(35 rbd_object_map.12d323e2e3c0 [call lock.lock] v0'0 uv1022 ondisk = 0) v6 -- ?+0 0x7fd015207b80 con 0x7fd014cca840
ceph-osd.0.log:2015-10-05 16:53:12.890261 7fcfeef07700 1 -- 10.214.134.32:6800/12881 --> 10.214.134.32:0/3469838287 -- osd_op_reply(41 rbd_object_map.12d323e2e3c0 [call lock.assert_locked,call rbd.object_map_update] v0'0 uv0 ondisk = -16 ((16) Device or resource busy)) v6 -- ?+0 0x7fd013c76680 con 0x7fd014cca840
ceph-osd.0.log:2015-10-05 16:53:13.005884 7fcfeef07700 1 -- 10.214.134.32:6800/12881 --> 10.214.134.32:0/3469838287 -- osd_op_reply(47 rbd_object_map.12d323e2e3c0 [call lock.unlock] v0'0 uv1743 ondisk = -2 ((2) No such file or directory)) v6 -- ?+0 0x7fd014cdcc00 con 0x7fd014cca840
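The failure mode in these replies can be modeled abstractly: the proxied lock call records the OSD rather than the client as the lock owner, so the client's later assert_locked (and the unlock after it) cannot match. A toy model of an exclusive lock with ownership checks; this is hypothetical and assumes nothing about the real cls_lock implementation beyond exclusive ownership, and the entity names are made up:

```python
# Toy model of an exclusive object lock (hypothetical; not Ceph's cls_lock).
EBUSY = 16

class ExclusiveLock:
    def __init__(self):
        self.owner = None

    def lock(self, entity):
        """Grant the lock if free and record who owns it."""
        if self.owner is None:
            self.owner = entity
            return 0
        return -EBUSY

    def assert_locked(self, entity):
        """Succeed only if the caller is the recorded owner."""
        return 0 if self.owner == entity else -EBUSY

# Bug scenario: the cache-tier OSD proxies the client's lock request but
# records itself ("osd.0") as the owner instead of the client.
lock = ExclusiveLock()
lock.lock("osd.0")
# The client's guarded object-map update then fails with EBUSY,
# matching the "(16) Device or resource busy" reply in the OSD log.
print(lock.assert_locked("client.foo"))  # → -16
```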
This was a cache tier test case, so this is most likely another proxied-write issue being discovered.
Updated by Jason Dillaman over 8 years ago
- Status changed from In Progress to Fix Under Review
- Backport set to hammer
infernalis PR: https://github.com/ceph/ceph/pull/6184
Updated by Jason Dillaman over 8 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Jason Dillaman over 8 years ago
- Status changed from Pending Backport to Resolved