Bug #13903

closed

Failure in TestStrays.test_ops_throttle

Added by John Spray over 8 years ago. Updated about 8 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Actions #1

Updated by John Spray over 8 years ago

Passes when run locally :-/

Actions #3

Updated by John Spray over 8 years ago

Greg: that linked result is from TestDamage, was there something in the logs that indicated it had a common cause with this issue?

Actions #5

Updated by John Spray over 8 years ago

  • Status changed from New to In Progress
  • Assignee set to John Spray

Actions #6

Updated by John Spray over 8 years ago

So in all three cases we're seeing just a single inode that fails to get purged, probably the directory.

http://pulpito.ceph.com/teuthology-2015-11-23_23:04:04-fs-master---basic-multi/1157802/

2015-11-25T19:41:45.235 INFO:tasks.cephfs.test_strays:Waiting for purge to complete 0/1, 1600/1601

http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2015-12-21_23:04:02-fs-master---basic-openstack/48408/

2015-12-22T15:30:41.215 INFO:tasks.cephfs.test_strays:Waiting for purge to complete 0/1, 3200/3201

http://pulpito.ovh.sepia.ceph.com:8081/gregf-2015-12-23_05:34:31-fs-master---basic-openstack/50203/

2015-12-23T07:38:05.973 INFO:tasks.cephfs.test_strays:Waiting for purge to complete 0/1, 3200/3201
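The "single inode" reading of the three log lines above can be checked with a small sketch. It assumes the second pair of numbers in each message is purged/expected inode counts; that interpretation, and the names `PURGE_RE` and `outstanding_inodes`, are illustrative and not taken from test_strays itself.

```python
import re

# Log lines copied verbatim from the three runs quoted above.
LOG_LINES = [
    "2015-11-25T19:41:45.235 INFO:tasks.cephfs.test_strays:Waiting for purge to complete 0/1, 1600/1601",
    "2015-12-22T15:30:41.215 INFO:tasks.cephfs.test_strays:Waiting for purge to complete 0/1, 3200/3201",
    "2015-12-23T07:38:05.973 INFO:tasks.cephfs.test_strays:Waiting for purge to complete 0/1, 3200/3201",
]

# Assumption: the second "a/b" pair is (inodes purged so far)/(inodes expected).
PURGE_RE = re.compile(r"Waiting for purge to complete (\d+)/(\d+), (\d+)/(\d+)")


def outstanding_inodes(line):
    """Return expected minus purged for a progress line, or None if it does not match."""
    match = PURGE_RE.search(line)
    if match is None:
        return None
    purged, expected = int(match.group(3)), int(match.group(4))
    return expected - purged


OUTSTANDING = [outstanding_inodes(line) for line in LOG_LINES]
print(OUTSTANDING)
```

Under that assumption, each run is stuck with exactly one inode left to purge.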

Actions #7

Updated by John Spray over 8 years ago

http://pulpito.ceph.com/teuthology-2015-11-23_23:04:04-fs-master---basic-multi/1157802/

In this case I can see the stray #100/stray2/10000000000 finally getting purged just after the client session ends, so some client->server protocol behaviour is keeping it stuck while the session is alive.

Actions #8

Updated by John Spray over 8 years ago

The client is receiving a client_caps message for the dir just after it's done the unlink. I think that's preventing it from sending the client_cap_release that it would usually send.

2015-11-25 19:27:58.059696 7f50137fe700 20 client.4467 encode_inode_release enter(in:10000000000.head(faked_ino=0 ref=7 ll_ref=1605 cap_refs={} open={} mode=40755 size=0/0 mtime=2015-11-25 19:27:58.051061 caps=pAsLsXsFsx(0=pAsLsXsFsx) parents=0x7f5004005e90 0x7f501c01a5a0), req:0x7f5004077fd0 mds:0, drop:256, unless:512, have:, force:1)
2015-11-25 19:27:58.071510 7f502bfff700  5 client.4467 handle_cap_grant on in 10000000000 mds.0 seq 7846 caps now pAsLsXsFs was pAsLsXsFsx
2015-11-25 19:27:58.085591 7f502bfff700  5 client.4467 handle_cap_grant on in 10000000000 mds.0 seq 7847 caps now pAsXsFs was pAsLsXsFs
2015-11-25 19:27:58.222784 7f5031ffb700  1 -- 10.214.134.136:0/2601946693 --> 10.214.132.10:6806/20760 -- client_cap_release(73) v2 -- ?+0 0x7f501c10f230 con 0x7f501c014a60
2015-11-25 19:27:58.451742 7f502bfff700  5 client.4467 handle_cap_grant on in 10000000000 mds.0 seq 7848 caps now pAsXs was pAsXsFs
2015-11-25 19:27:58.457949 7f502bfff700  5 client.4467 handle_cap_grant on in 10000000000 mds.0 seq 7849 caps now pAsLsXs was pAsXs
2015-11-25 19:41:46.808582 7f502bfff700  1 -- 10.214.134.136:0/2601946693 --> 10.214.132.10:6806/20760 -- client_cap_release(2) v2 -- ?+0 0x7f503ee09570 con 0x7f501c014a60

(The big time gap between the last two lines is the gap between when we wanted the release to happen and when it eventually happens, after umount.)
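To make the sequence easier to read, here is a rough sketch that pulls the cap transitions and the gap between the last two lines out of the log excerpt above. The regexes and helper names are only illustrative, written against the lines quoted here (the long encode_inode_release line is left out for brevity); they are not a general parser for Ceph client logs.

```python
import re
from datetime import datetime

# Lines copied verbatim from the client log excerpt above.
LOG_LINES = [
    "2015-11-25 19:27:58.071510 7f502bfff700  5 client.4467 handle_cap_grant on in 10000000000 mds.0 seq 7846 caps now pAsLsXsFs was pAsLsXsFsx",
    "2015-11-25 19:27:58.085591 7f502bfff700  5 client.4467 handle_cap_grant on in 10000000000 mds.0 seq 7847 caps now pAsXsFs was pAsLsXsFs",
    "2015-11-25 19:27:58.222784 7f5031ffb700  1 -- 10.214.134.136:0/2601946693 --> 10.214.132.10:6806/20760 -- client_cap_release(73) v2 -- ?+0 0x7f501c10f230 con 0x7f501c014a60",
    "2015-11-25 19:27:58.451742 7f502bfff700  5 client.4467 handle_cap_grant on in 10000000000 mds.0 seq 7848 caps now pAsXs was pAsXsFs",
    "2015-11-25 19:27:58.457949 7f502bfff700  5 client.4467 handle_cap_grant on in 10000000000 mds.0 seq 7849 caps now pAsLsXs was pAsXs",
    "2015-11-25 19:41:46.808582 7f502bfff700  1 -- 10.214.134.136:0/2601946693 --> 10.214.132.10:6806/20760 -- client_cap_release(2) v2 -- ?+0 0x7f503ee09570 con 0x7f501c014a60",
]

TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})")
GRANT_RE = re.compile(
    r"handle_cap_grant on in (\S+) mds\.(\d+) seq (\d+) caps now (\S+) was (\S+)"
)


def timestamp(line):
    """Parse the leading timestamp of a log line."""
    return datetime.strptime(TS_RE.match(line).group(1), "%Y-%m-%d %H:%M:%S.%f")


def cap_grants(lines):
    """Return (seq, caps_before, caps_after) for every handle_cap_grant line."""
    grants = []
    for line in lines:
        match = GRANT_RE.search(line)
        if match:
            grants.append((int(match.group(3)), match.group(5), match.group(4)))
    return grants


def last_gap_seconds(lines):
    """Seconds elapsed between the last two log lines."""
    return (timestamp(lines[-1]) - timestamp(lines[-2])).total_seconds()


GRANTS = cap_grants(LOG_LINES)
GAP = last_gap_seconds(LOG_LINES)

for seq, before, after in GRANTS:
    print("seq %d: %s -> %s" % (seq, before, after))
print("gap before final client_cap_release: %.6f s" % GAP)
```

The transitions show the last grant (seq 7849) handing Ls back to the client after it had been dropped, and the gap works out to a little under 14 minutes.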

Actions #9

Updated by John Spray over 8 years ago

This is reproducible with a simpler "delete lots of files and then their directory" test https://github.com/ceph/ceph-qa-suite/pull/787
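For anyone wanting to try the same workload by hand, the shape of it is roughly as follows. This is only a sketch using plain filesystem calls, not the test from the pull request; `create_and_delete`, the `delete_me` directory name and the file count are made up for illustration, and `parent` would need to point at a directory on a CephFS mount to exercise the MDS at all.

```python
import os
import tempfile


def create_and_delete(parent, n_files=100):
    """Create n_files empty files in a new directory, unlink them, then remove the directory."""
    work_dir = os.path.join(parent, "delete_me")  # hypothetical directory name
    os.mkdir(work_dir)
    for i in range(n_files):
        # Opening in write mode and closing immediately leaves an empty file.
        with open(os.path.join(work_dir, "file_%d" % i), "w"):
            pass
    for name in os.listdir(work_dir):
        os.unlink(os.path.join(work_dir, name))
    # The directory is removed last, right after its final child is unlinked.
    os.rmdir(work_dir)
    return work_dir


if __name__ == "__main__":
    # A temporary local directory stands in for a CephFS mount point here.
    with tempfile.TemporaryDirectory() as parent:
        removed = create_and_delete(parent, n_files=100)
        print("removed", removed, "still exists:", os.path.exists(removed))
```

On a real mount one would then watch the MDS stray and purge counters to see whether the directory inode is purged while the client session is still open.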

Actions #10

Updated by Greg Farnum over 8 years ago

I think you talked about this in standup but I'm forgetting — do you need somebody else to look over the caps stuff here?

Actions #11

Updated by John Spray over 8 years ago

If anyone has time, yes -- given enough time I can figure it out but it might be more obvious to someone more familiar. It's not obvious to me whether the sequence of cap ops is fine and we just need another special case for unlinking where we bounce grants that occur after the unlink, or if the way we're getting granted caps after we no longer even want/need them is wrong.

Actions #13

Updated by Greg Farnum over 8 years ago

  • Assignee changed from John Spray to Zheng Yan

Zheng, please take a look.

Actions #15

Updated by Zheng Yan over 8 years ago

  • Status changed from In Progress to Fix Under Review

Actions #16

Updated by Greg Farnum about 8 years ago

  • Status changed from Fix Under Review to Resolved

Whoops, merged this last week.
