Bug #13372
librbd: object map invalid but not flagged as such
Status: Closed
Description
http://pulpito.ceph.com/teuthology-2015-10-03_23:00:08-rbd-infernalis-distro-basic-multi/1087438/
Reproduced by running the same job, paused so that the images could be examined.
During the rbd copy from testimg1@snap1 to testimg2, there are some warnings about the object map:
2015-10-05T16:53:12.606 INFO:tasks.workunit.client.0.burnupi10.stderr:+ rbd copy testimg1 --snap=snap1 testimg2
2015-10-05T16:53:12.898 INFO:tasks.workunit.client.0.burnupi10.stderr:Image copy: 1% complete... [progress lines elided] ...Image copy: 100% complete...
2015-10-05 16:53:12.890085 7fcd0f7fe700 -1 librbd::ObjectMap: object map lock not owned by client
2015-10-05T16:53:12.901 INFO:tasks.workunit.client.0.burnupi10.stderr:2015-10-05 16:53:12.893055 7fcd0f7fe700 -1 librbd::ObjectMap: object map lock not owned by client
2015-10-05T16:53:12.902 INFO:tasks.workunit.client.0.burnupi10.stderr:2015-10-05 16:53:12.894065 7fcd0f7fe700 -1 librbd::ObjectMap: object map lock not owned by client
2015-10-05T16:53:13.037 INFO:tasks.workunit.client.0.burnupi10.stderr:Image copy: 100% complete...done.
The rados objects 0 and 2 for testimg1 and testimg2 match, but the exported versions differ: testimg2 exports as all zeroes, suggesting an object map that is invalid but was never flagged as such. The cluster is still up on burnupi10 for further investigation.
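The symptom above (matching rados objects but an all-zero export) is consistent with reads being short-circuited by a stale object map: if the map records an object as nonexistent, the read path never touches the rados object and returns zeroes. A minimal sketch of that behavior, not librbd code; the state names and values here are assumptions for illustration:

```python
# Hypothetical model of object-map-assisted reads (not librbd code).
# State values are assumed for illustration only.
OBJECT_NONEXISTENT = 0
OBJECT_EXISTS = 1

def export_image(data_objects, object_map):
    """Read each object, but skip the rados read wherever the map
    claims the object does not exist, yielding zeroes instead."""
    out = {}
    for idx, payload in data_objects.items():
        if object_map.get(idx, OBJECT_NONEXISTENT) == OBJECT_EXISTS:
            out[idx] = payload               # map says object exists: read it
        else:
            out[idx] = b"\0" * len(payload)  # map says no object: zeroes

    return out

# The data objects are present and correct on the OSDs...
src = {0: b"data0", 2: b"data2"}
# ...but the copy's object map was never updated (and never flagged
# invalid), so every object is recorded as nonexistent.
stale_map = {0: OBJECT_NONEXISTENT, 2: OBJECT_NONEXISTENT}
print(export_image(src, stale_map))  # every object reads back as zeroes
```

This matches the observation that the rados objects themselves are identical while the export of testimg2 is all zeroes.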
Updated by Jason Dillaman over 8 years ago
- Status changed from New to In Progress
- Assignee set to Jason Dillaman
Updated by Jason Dillaman over 8 years ago
The lock is held by the OSD!
# rados -p rbd lock info rbd_object_map.12d323e2e3c0 rbd_lock
{"name":"rbd_lock","type":"exclusive","tag":"","lockers":[{"name":"osd.0","cookie":"","description":"","expiration":"0.000000","addr":"10.214.134.32:0\/12881"}]}
The client requests the lock, the assert_locked check fails because the locker is incorrectly recorded as the OSD, and the subsequent unlock fails for the same reason.
ceph-osd.0.log:2015-10-05 16:53:12.764864 7fcfeaeff700 1 -- 10.214.134.32:6800/12881 --> 10.214.134.32:0/3469838287 -- osd_op_reply(35 rbd_object_map.12d323e2e3c0 [call lock.lock] v0'0 uv1022 ondisk = 0) v6 -- ?+0 0x7fd015207b80 con 0x7fd014cca840
ceph-osd.0.log:2015-10-05 16:53:12.890261 7fcfeef07700 1 -- 10.214.134.32:6800/12881 --> 10.214.134.32:0/3469838287 -- osd_op_reply(41 rbd_object_map.12d323e2e3c0 [call lock.assert_locked,call rbd.object_map_update] v0'0 uv0 ondisk = -16 ((16) Device or resource busy)) v6 -- ?+0 0x7fd013c76680 con 0x7fd014cca840
ceph-osd.0.log:2015-10-05 16:53:13.005884 7fcfeef07700 1 -- 10.214.134.32:6800/12881 --> 10.214.134.32:0/3469838287 -- osd_op_reply(47 rbd_object_map.12d323e2e3c0 [call lock.unlock] v0'0 uv1743 ondisk = -2 ((2) No such file or directory)) v6 -- ?+0 0x7fd014cdcc00 con 0x7fd014cca840
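The failure mode in these replies can be modeled abstractly: the proxied lock call records the OSD rather than the client as the lock owner, so the client's later assert_locked (and the unlock after it) cannot match. A toy model of an exclusive lock with ownership checks; this is hypothetical and assumes nothing about the real cls_lock implementation beyond exclusive ownership, and the entity names are made up:

```python
# Toy model of an exclusive object lock (hypothetical; not Ceph's cls_lock).
EBUSY = 16

class ExclusiveLock:
    def __init__(self):
        self.owner = None

    def lock(self, entity):
        """Grant the lock if free and record who owns it."""
        if self.owner is None:
            self.owner = entity
            return 0
        return -EBUSY

    def assert_locked(self, entity):
        """Succeed only if the caller is the recorded owner."""
        return 0 if self.owner == entity else -EBUSY

# Bug scenario: the cache-tier OSD proxies the client's lock request but
# records itself ("osd.0") as the owner instead of the client.
lock = ExclusiveLock()
lock.lock("osd.0")
# The client's guarded object-map update then fails with EBUSY,
# matching the "(16) Device or resource busy" reply in the OSD log.
print(lock.assert_locked("client.foo"))  # → -16
```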
This was a cache tier test case, so this is most likely another proxied-write issue being discovered.
Updated by Jason Dillaman over 8 years ago
- Status changed from In Progress to Fix Under Review
- Backport set to hammer
infernalis PR: https://github.com/ceph/ceph/pull/6184
Updated by Jason Dillaman over 8 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Jason Dillaman over 8 years ago
- Status changed from Pending Backport to Resolved