Project

General

Profile

Bug #11502

"[ FAILED ] TestLibRBD.LockingPP" in rbd-next-distro-basic-mult run

Added by Yuri Weinstein over 3 years ago. Updated 10 months ago.

Status:
Resolved
Priority:
Normal
Target version:
-
Start date:
04/29/2015
Due date:
% Done:

0%

Source:
Q/A
Tags:
Backport:
luminous,jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Run: http://pulpito.ceph.com/teuthology-2015-04-27_23:00:01-rbd-next-distro-basic-multi/
Jobs: ['866113', '866154']
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-04-27_23:00:01-rbd-next-distro-basic-multi/866113/

2015-04-28T11:51:14.717 INFO:tasks.workunit.client.0.plana46.stdout:[ RUN      ] TestLibRBD.LockingPP
2015-04-28T11:51:14.718 INFO:tasks.workunit.client.0.plana46.stdout:using new format!
2015-04-28T11:51:14.833 INFO:tasks.workunit.client.0.plana46.stdout:test/librbd/test_librbd.cc:1831: Failure
2015-04-28T11:51:14.834 INFO:tasks.workunit.client.0.plana46.stdout:Value of: rbd.open(ioctx, image, name.c_str(), __null)
2015-04-28T11:51:14.834 INFO:tasks.workunit.client.0.plana46.stdout:  Actual: -2
2015-04-28T11:51:14.834 INFO:tasks.workunit.client.0.plana46.stdout:Expected: 0
2015-04-28T11:51:14.834 INFO:tasks.workunit.client.0.plana46.stdout:[  FAILED  ] TestLibRBD.LockingPP (130 ms)

Related issues

Copied to rbd - Backport #23064: luminous: "[ FAILED ] TestLibRBD.LockingPP" in rbd-next-distro-basic-mult run Resolved
Copied to rbd - Backport #23065: jewel: "[ FAILED ] TestLibRBD.LockingPP" in rbd-next-distro-basic-mult run Resolved

History

#1 Updated by Sage Weil over 3 years ago

  • Project changed from Ceph to rbd
  • Regression set to No

#2 Updated by Josh Durgin over 3 years ago

  • Status changed from New to Can't reproduce

If this happens again we can investigate. No trace of what went wrong from these logs.

#3 Updated by Jason Dillaman about 2 years ago

  • Status changed from Can't reproduce to In Progress
  • Assignee set to Jason Dillaman

#4 Updated by Jason Dillaman about 2 years ago

  • Status changed from In Progress to Verified
  • Assignee deleted (Jason Dillaman)
  • Priority changed from Normal to High

OSD logs of interest: http://qa-proxy.ceph.com/teuthology/teuthology-2016-12-11_02:01:21-rbd-master-distro-basic-smithi/624047/remote/smithi089/log/ceph-osd.1.log.gz

The error is that 'image50' is created and the test fails when the image open fails with -ENOENT (even though the open request is right after the successful create request).

Assumptions: pg11.5 is the cache tier pg for object rbd_id.image50 and pg2.5 is the backing pg for object rbd_id.image50. This file is the mapping between the v2 image name and its unique id.

Step 1) librbd attempts to create image50. first step is to stat the image id object (rbd_id.image50) to ensure the image doesn't already exist:

...
2016-12-11 19:10:19.703177 7ff5bb2b0700 20 osd.1 pg_epoch: 70 pg[11.5( empty local-les=64 n=0 ec=63 les/c/f 64/64/0 63/63/63) [1,2] r=0 lpr=63 crt=0'0 mlcod 0'0 active+clean] do_op: op osd_op(client.4119.0:545 11.2f844175 rbd_id.image50 [stat] snapc 0=[] ack+read+known_if_redirected e70) v7
...
2016-12-11 19:10:19.703417 7ff5bb2b0700 10 osd.1 pg_epoch: 70 pg[11.5( empty local-les=64 n=0 ec=63 les/c/f 64/64/0 63/63/63) [1,2] r=0 lpr=63 crt=0'0 mlcod 0'0 active+clean] get_object_context: no obc for soid 11:ae8221f4:::rbd_id.image50:head but can_create
...
2016-12-11 19:10:19.703520 7ff5bb2b0700 10 osd.1 pg_epoch: 70 pg[11.5( empty local-les=64 n=0 ec=63 les/c/f 64/64/0 63/63/63) [1,2] r=0 lpr=63 crt=0'0 mlcod 0'0 active+clean] start_copy 11:ae8221f4:::rbd_id.image50:head from 11:ae8221f4:::rbd_id.image50:head @2 v0 flags 30 mirror_snapset
2016-12-11 19:10:19.703533 7ff5bb2b0700 10 osd.1 pg_epoch: 70 pg[11.5( empty local-les=64 n=0 ec=63 les/c/f 64/64/0 63/63/63) [1,2] r=0 lpr=63 crt=0'0 mlcod 0'0 active+clean] _copy_some 0x560246ec3c00 0x560246600410
...
2016-12-11 19:10:19.703961 7ff5bdab5700 20 osd.1 pg_epoch: 70 pg[2.5( v 52'29 (0'0,52'29] local-les=13 n=11 ec=12 les/c/f 13/13/0 12/12/12) [1,4] r=0 lpr=12 crt=52'29 lcod 45'28 mlcod 45'28 active+clean] do_op: op osd_op(osd.1.5:3 2.2f844175 rbd_id.image50 [list-snaps] snapc 0=[] ack+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e70) v7
...
2016-12-11 19:10:19.704234 7ff5bdab5700  1 -- 172.21.15.89:6804/6151 --> 172.21.15.89:0/6151 -- osd_op_reply(3 rbd_id.image50 [list-snaps] v0'0 uv0 ack = -2 ((2) No such file or directory)) v7 -- 0x560246e35a00 con 0
...
2016-12-11 19:10:19.704361 7ff5bb2b0700 20 osd.1 pg_epoch: 70 pg[2.5( v 52'29 (0'0,52'29] local-les=13 n=11 ec=12 les/c/f 13/13/0 12/12/12) [1,4] r=0 lpr=12 crt=52'29 lcod 45'28 mlcod 45'28 active+clean] do_op: op osd_op(osd.1.5:4 2.2f844175 rbd_id.image50 [copy-get max 8388608] snapc 0=[] ack+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e70) v7
...
2016-12-11 19:10:19.704563 7ff5bb2b0700  1 -- 172.21.15.89:6804/6151 --> 172.21.15.89:0/6151 -- osd_op_reply(4 rbd_id.image50 [copy-get max 8388608] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v7 -- 0x5602466fed80 con 0
...
2016-12-11 19:10:19.704771 7ff5b6aa7700 10 osd.1 pg_epoch: 70 pg[11.5( empty local-les=64 n=0 ec=63 les/c/f 64/64/0 63/63/63) [1,2] r=0 lpr=63 crt=0'0 mlcod 0'0 active+clean] process_copy_chunk 11:ae8221f4:::rbd_id.image50:head tid 4 (2) No such file or directory
...
2016-12-11 19:10:19.706745 7ff5bdab5700 25 osd.1 pg_epoch: 70 pg[11.5( v 70'1 (0'0,70'1] local-les=64 n=1 ec=63 les/c/f 64/64/0 63/63/63) [1,2] r=0 lpr=63 crt=70'1 lcod 0'0 mlcod 0'0 active+clean] do_op oi 11:ae8221f4:::rbd_id.image50:head(70'1 osd.1.0:213 whiteout s 0 uv 0 alloc_hint [0 0 0])
...
2016-12-11 19:10:19.706778 7ff5bdab5700  1 -- 172.21.15.89:6804/6151 --> 172.21.15.85:0/2199069473 -- osd_op_reply(545 rbd_id.image50 [stat] v0'0 uv0 ack = -2 ((2) No such file or directory)) v7 -- 0x560246afa080 con 0
...

Step 1 result) librbd receives correct -2 error for non-existent object stat call.
Step 2) librbd proceeds to create rbd_id.image50 object:

...
2016-12-11 19:10:19.708777 7ff5bb2b0700 20 osd.1 pg_epoch: 70 pg[11.5( v 70'1 (0'0,70'1] local-les=64 n=1 ec=63 les/c/f 64/64/0 63/63/63) [1,2] r=0 lpr=63 crt=70'1 lcod 0'0 mlcod 0'0 active+clean] do_op: op osd_op(client.4119.0:547 11.2f844175 rbd_id.image50 [create 0~0 [excl],call rbd.set_id] snapc 0=[] ack+ondisk+write+known_if_redirected e70) v7
...
2016-12-11 19:10:19.709070 7ff5bb2b0700 10 osd.1 pg_epoch: 70 pg[11.5( v 70'1 (0'0,70'1] local-les=64 n=1 ec=63 les/c/f 64/64/0 63/63/63) [1,2] r=0 lpr=63 crt=70'1 lcod 0'0 mlcod 0'0 active+clean] maybe_create_new_object clearing whiteout on 11:ae8221f4:::rbd_id.image50:head
...
2016-12-11 19:10:19.710626 7ff5bdab5700  1 -- 172.21.15.89:6804/6151 --> 172.21.15.85:0/2199069473 -- osd_op_reply(547 rbd_id.image50 [create 0~0 [excl],call rbd.set_id] v70'2 uv2 ack = 0) v7 -- 0x560246af8d00 con 0
...

Step 2 result) librbd receives ACK that the object create + rbd.set_id cls method succeeded.

Note: rbd_id.image50 object appears to be in the cache tier right before the next step

...
2016-12-11 19:10:19.730055 7ff5beab7700 10 osd.1 pg_epoch: 71 pg[11.5( v 70'2 (0'0,70'2] local-les=64 n=1 ec=63 les/c/f 64/64/0 63/63/63) [1,2] r=0 lpr=63 crt=70'2 lcod 70'1 mlcod 70'1 active+clean] agent_choose_mode flush_mode: idle evict_mode: idle num_objects: 1 num_bytes: 16 num_objects_dirty: 1 num_objects_omap: 0 num_dirty: 1 num_user_objects: 1 num_user_bytes: 2064 num_overhead_bytes: 2048 pool.info.target_max_bytes: 0 pool.info.target_max_objects: 0
...

Step 3) librbd next tries to open the newly created image by first stat-ing the rbd_id.image50 object to see if this is a v2 rbd image. However, it appears that the request is (incorrectly) send to the base tier (pg2.5) instead of the cache tier (pg 11.5).

...
2016-12-11 19:10:19.768580 7ff5bb2b0700 20 osd.1 pg_epoch: 72 pg[2.5( v 52'29 (0'0,52'29] local-les=13 n=11 ec=12 les/c/f 13/13/0 12/12/12) [1,4] r=0 lpr=12 crt=52'29 lcod 45'28 mlcod 45'28 active+clean] do_op: op osd_op(client.4119.0:551 2.2f844175 rbd_id.image50 [stat] snapc 0=[] ack+read+known_if_redirected e72) v7
...
2016-12-11 19:10:19.768633 7ff5bb2b0700 10 osd.1 pg_epoch: 72 pg[2.5( v 52'29 (0'0,52'29] local-les=13 n=11 ec=12 les/c/f 13/13/0 12/12/12) [1,4] r=0 lpr=12 crt=52'29 lcod 45'28 mlcod 45'28 active+clean] get_object_context: obc NOT found in cache: 2:ae8221f4:::rbd_id.image50:head
2016-12-11 19:10:19.768646 7ff5bb2b0700 15 filestore(/var/lib/ceph/osd/ceph-1) getattr 2.5_head/#2:ae8221f4:::rbd_id.image50:head# '_'
2016-12-11 19:10:19.768679 7ff5bb2b0700 10 filestore(/var/lib/ceph/osd/ceph-1) error opening file /var/lib/ceph/osd/ceph-1/current/2.5_head/rbd\uid.image50__head_2F844175__2 with flags=2: (2) No such file or directory
2016-12-11 19:10:19.768689 7ff5bb2b0700 10 filestore(/var/lib/ceph/osd/ceph-1) getattr 2.5_head/#2:ae8221f4:::rbd_id.image50:head# '_' = -2
2016-12-11 19:10:19.768711 7ff5bb2b0700  1 -- 172.21.15.89:6804/6151 --> 172.21.15.85:0/2199069473 -- osd_op_reply(551 rbd_id.image50 [stat] v0'0 uv0 ack = -2 ((2) No such file or directory)) v7 -- 0x5602466ff0c0 con 0
...

Step 3 result) The stat call fails w/ -2 and librbd will fail to open the image since it cannot find the image name -> image id mapping file.

#5 Updated by Jason Dillaman about 2 years ago

  • Project changed from rbd to Ceph

#6 Updated by Sage Weil almost 2 years ago

  • Status changed from Verified to Can't reproduce

#7 Updated by Sage Weil almost 2 years ago

please reopen if this happens again

#8 Updated by Jason Dillaman about 1 year ago

  • Status changed from Can't reproduce to Verified

Recent repeat failure: http://qa-proxy.ceph.com/teuthology/abhi-2017-10-21_15:51:20-rbd-wip-luminous-backports-distro-basic-smithi/1758354/teuthology.log

2017-10-21 19:07:29.832119 7f8759ced700  1 -- 172.21.15.117:6800/9349 --> 172.21.15.13:0/412289887 -- osd_op_reply(1184 rbd_id.image54 [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x55780e346380 con 0
2017-10-21 19:07:29.837748 7f875dcf5700  1 -- 172.21.15.117:6800/9349 --> 172.21.15.13:0/412289887 -- osd_op_reply(1187 rbd_id.image54 [create,call rbd.set_id] v89'8 uv36 ondisk = 0) v8 -- 0x55780e346700 con 0
2017-10-21 19:07:42.271610 7fbdb7fc2700  1 -- 172.21.15.18:6809/5996 --> 172.21.15.13:0/412289887 -- osd_op_reply(1190 rbd_id.image54 [stat] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 -- 0x5600d018d800 con 0

#9 Updated by Jason Dillaman 11 months ago

  • Status changed from Verified to In Progress
  • Assignee set to Jason Dillaman
  • Priority changed from High to Normal
  • Backport set to luminous,jewel

This appears to be a race condition between the tear down of the previous test case and this test case. The previous test case creates a cache tier and destroys it. The race occurs when the deletion of the cache tier slowly propagates through the OSDs. Going to use a unique pool for the one cache tier test case to avoid the ripple race condition effect.

#10 Updated by Jason Dillaman 11 months ago

  • Project changed from Ceph to rbd
  • ceph-qa-suite deleted (rbd)

#11 Updated by Jason Dillaman 11 months ago

  • Status changed from In Progress to Need Review

#12 Updated by Mykola Golub 11 months ago

  • Status changed from Need Review to Pending Backport

#13 Updated by Nathan Cutler 11 months ago

  • Copied to Backport #23064: luminous: "[ FAILED ] TestLibRBD.LockingPP" in rbd-next-distro-basic-mult run added

#14 Updated by Nathan Cutler 11 months ago

  • Copied to Backport #23065: jewel: "[ FAILED ] TestLibRBD.LockingPP" in rbd-next-distro-basic-mult run added

#15 Updated by Nathan Cutler 10 months ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF