Bug #9800 (closed): client-limits test is not passing

Added by Greg Farnum over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: High
Category: -
Target version: -
% Done: 0%
Source: Q/A
Severity: 3 - minor

Description

/a/teuthology-2014-10-13_23:04:01-fs-giant-distro-basic-multi/547170

The client isn't dropping its caps:

ubuntu@plana03:~$ sudo ceph -s
    cluster 6de65115-1809-449f-bb3e-29f304d73025
     health HEALTH_WARN mds0: Client plana03:1 failing to respond to capability release; mds0: Client plana03:0 failing to respond to cache pressure

I'm a little confused because while the MDS shows plana03:0 has 181 caps,

ubuntu@plana39:~$ sudo ceph daemon /var/run/ceph/ceph-mds.a.asok session ls
[
    { "id": 4176,
      "num_leases": 0,
      "num_caps": 0,
      "state": "closed",
      "replay_requests": 0,
      "reconnecting": false,
      "inst": "client.4176 10.214.131.37:0\/7083",
      "client_metadata": { "entity_id": "1",
          "hostname": "plana03",
          "mount_point": "\/home\/ubuntu\/cephtest\/mnt.1"}},
    { "id": 4177,
      "num_leases": 0,
      "num_caps": 181,
      "state": "open",
      "replay_requests": 0,
      "reconnecting": false,
      "inst": "client.4177 10.214.131.37:0\/7123",
      "client_metadata": { "entity_id": "0",
          "hostname": "plana03",
          "mount_point": "\/home\/ubuntu\/cephtest\/mnt.0"}},
    { "id": 4178,
      "num_leases": 0,
      "num_caps": 2,
      "state": "open",
      "replay_requests": 0,
      "reconnecting": false,
      "inst": "client.4178 10.214.131.37:0\/7166",
      "client_metadata": { "entity_id": "1",
          "hostname": "plana03",
          "mount_point": "\/home\/ubuntu\/cephtest\/mnt.1"}}]

...the client itself claims to have 503 caps!

ubuntu@plana03:~$ sudo ceph daemon /var/run/ceph/ceph-client.0.7123.asok mds_sessions
{ "id": 4177,
  "sessions": [
        { "mds": 0,
          "addr": "10.214.132.39:6805\/8408",
          "seq": 1,
          "cap_gen": 0,
          "cap_ttl": "2014-10-16 13:42:56.919437",
          "last_cap_renew_request": "2014-10-16 13:41:56.919437",
          "cap_renew_seq": 6448,
          "num_caps": 503,
          "state": "open"}],
  "mdsmap_epoch": 38}

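(For reference, this comparison can be scripted against the same admin sockets; the following is a minimal sketch, not part of the test. The socket paths and the sudo/ceph invocations are the ones from this particular run.)

#!/usr/bin/env python3
# Minimal sketch: compare the MDS's view of each session's caps with the
# client's own view, using the same admin-socket commands shown above.
# The socket paths below are specific to this run.
import json
import subprocess

def admin_socket(asok, *cmd):
    # Equivalent to: sudo ceph daemon <asok> <command...>
    out = subprocess.check_output(["sudo", "ceph", "daemon", asok] + list(cmd))
    return json.loads(out)

# MDS side: one entry per client session, each with its num_caps.
for s in admin_socket("/var/run/ceph/ceph-mds.a.asok", "session", "ls"):
    meta = s.get("client_metadata", {})
    print("mds view: client.%s (entity %s, %s) num_caps=%s"
          % (s["id"], meta.get("entity_id"), s["state"], s["num_caps"]))

# Client side: the client's own per-MDS-session cap count.
client = admin_socket("/var/run/ceph/ceph-client.0.7123.asok", "mds_sessions")
for s in client["sessions"]:
    print("client.0 view: mds.%s num_caps=%s" % (s["mds"], s["num_caps"]))
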
The MDS log didn't contain much of use, but it does show a stuck request from client.1 in addition to client.0 not dropping caps:

2014-10-15 01:54:12.079232 7fb64891a700  0 log_channel(default) log [WRN] : client.4178 isn't responding to mclientcaps(revoke), ino 100000001f5 pending pAsxLsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.165651 seconds ago
2014-10-15 01:54:12.079268 7fb64891a700  0 log_channel(default) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 63.165923 secs
2014-10-15 01:54:12.079272 7fb64891a700  0 log_channel(default) log [WRN] : slow request 63.165923 seconds old, received at 2014-10-15 01:53:08.913324: client_request(client.4178:12 rmxattr #100000001f5 security.ima 2014-10-15 01:53:08.913046) currently failed to xlock, waiting

ubuntu@plana39:~$ sudo ceph daemon /var/run/ceph/ceph-mds.a.asok dump_ops_in_flight
{ "ops": [
        { "description": "client_request(client.4178:12 rmxattr #100000001f5 security.ima 2014-10-15 01:53:08.913046)",
          "initiated_at": "2014-10-15 01:53:08.913324",
          "age": "129435.891259",
          "duration": "0.000279",
          "type_data": [
                "failed to xlock, waiting",
                "client.4178:12",
                "client_request",
                { "client": "client.4178",
                  "tid": 12},
                [   
                    { "time": "2014-10-15 01:53:08.913324",
                      "event": "initiated"},
                    { "time": "2014-10-15 01:53:08.913603",
                      "event": "failed to xlock, waiting"}]]}],
  "num_ops": 1}

ubuntu@plana03:~$ sudo ceph daemon /var/run/ceph/ceph-client.0.7123.asok mds_requests
{}
ubuntu@plana03:~$ sudo ceph daemon /var/run/ceph/ceph-client.1.7166.asok mds_requests
{ "request": { "tid": 12,
      "op": "rmxattr",
      "path": "#100000001f5",
      "path2": "security.ima",
      "ino": "100000001f5",
      "hint_ino": "0",
      "sent_stamp": "2014-10-15 01:53:08.913049",
      "mds": 0,
      "resend_mds": -1,
      "send_to_auth": 0,
      "sent_on_mseq": 0,
      "retry_attempt": 0,
      "got_unsafe": 0,
      "uid": 1000,
      "gid": 1000,
      "oldest_client_tid": 12,
      "mdsmap_epoch": 0,
      "flags": 0,
      "num_retry": 0,
      "num_fwd": 0,
      "num_releases": 0}}

It was blocking machines, so I killed it, and logging is turned way down, so there was nothing useful at all in the client logs. I'm hoping the cause is apparent upon inspection; otherwise it'll pop up again, and we should perhaps add debugging output to this test.
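
Something along those lines is sketched below: on failure, dump the MDS's in-flight ops and each client's pending MDS requests over the admin sockets. The socket paths are the ones from this run and would have to come from the test's own configuration in practice.

#!/usr/bin/env python3
# Sketch of extra debugging output the test could emit on failure: dump the
# MDS's in-flight ops and each client's pending MDS requests via the admin
# sockets. Paths are from this run; a real test would discover them itself.
import json
import subprocess

def daemon_cmd(asok, *cmd):
    out = subprocess.check_output(["sudo", "ceph", "daemon", asok] + list(cmd))
    return json.loads(out)

PROBES = [
    ("mds.a",    "/var/run/ceph/ceph-mds.a.asok",         ["dump_ops_in_flight"]),
    ("client.0", "/var/run/ceph/ceph-client.0.7123.asok", ["mds_requests"]),
    ("client.1", "/var/run/ceph/ceph-client.1.7166.asok", ["mds_requests"]),
]

for name, asok, cmd in PROBES:
    print("=== %s: %s ===" % (name, " ".join(cmd)))
    try:
        print(json.dumps(daemon_cmd(asok, *cmd), indent=2))
    except (subprocess.CalledProcessError, OSError, ValueError) as err:
        print("query failed: %s" % err)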

#1 - Updated by John Spray over 9 years ago

Same failure:

http://pulpito.front.sepia.ceph.com/teuthology-2014-10-17_23:04:02-fs-giant-distro-basic-multi/555242/

Nothing much useful here either. (Note to self: attach gdb to the MDS before the client; I lost the client session on the MDS side when I attached gdb to the client and it timed out.)

{ "request": { "tid": 12,
      "op": "rmxattr",
      "path": "#100000001f5",
      "path2": "security.ima",
      "ino": "100000001f5",
      "hint_ino": "0",
      "sent_stamp": "2014-10-18 18:26:32.740100",
      "mds": 0,
      "resend_mds": -1,
      "send_to_auth": 0,
      "sent_on_mseq": 0,
      "retry_attempt": 0,
      "got_unsafe": 0,
      "uid": 1000,
      "gid": 1000,
      "oldest_client_tid": 12,
      "mdsmap_epoch": 0,
      "flags": 0,
      "num_retry": 0,
      "num_fwd": 0,
      "num_releases": 0}}

Thread 1 (Thread 0x7fc44d471780 (LWP 6401)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x000000000055d8c8 in Wait (mutex=..., this=0x7ffff73804b0) at ./common/Cond.h:55
#2  Client::make_request (this=0x2fb5e20, request=0x2fc2160, uid=1000, gid=1000, ptarget=0x0, pcreated=0x0, use_mds=-1, pdirbl=0x0)
    at client/Client.cc:1445
#3  0x0000000000574842 in Client::_removexattr (this=0x2fb5e20, in=0x7fc430007e60, name=<optimized out>, uid=1000, gid=1000)
    at client/Client.cc:7685
#4  0x0000000000574afa in Client::ll_removexattr (this=0x2fb5e20, in=0x7fc430007e60, name=0x7fc44d347038 "security.ima", uid=1000, gid=1000)
    at client/Client.cc:7704
#5  0x0000000000531a9d in fuse_ll_removexattr (req=0x2fb6a60, ino=<optimized out>, name=0x7fc44d347038 "security.ima") at client/fuse_ll.cc:233
#6  0x00007fc44d040d65 in fuse_session_loop () from /lib/libfuse.so.2
#7  0x000000000052d91a in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ceph_fuse.cc:173

The order of tests was test_client_pin, test_client_pin_root, then the failure in test_client_release_bug. It's correct that the client is failing to release caps; that is exactly the failure the test injects.

The client's rmxattr seems to be blocking on a cap release from the same client. This rmxattr doesn't happen on a Fedora 20 or Precise host (quick manual check), so presumably we're seeing it on the plana/burnupi nodes because the security.ima rmxattr is a Trusty-ism.

Probably the solution is for me to rework the way we simulate this failure so that it isn't scuppered by a rogue rmxattr (whether or not we want the security extensions enabled on test nodes, having the tests be tolerant would be desirable). The good news is that this seems to be a limitation of the failure injection rather than a bug in ceph or the test itself.
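
Purely as an illustration of the "make the test tolerant" idea (the fix that was actually merged is on the client side, see the merge below), the test could strip any security.* xattrs from its files before injecting the cap-release failure, so a stray rmxattr can't end up stuck behind withheld caps. A minimal sketch; the mount point and the assumption that the xattrs are removable at all are hypothetical here.

#!/usr/bin/env python3
# Illustration only: remove security.* xattrs from files under the test mount
# before injecting the failure. Not the merged fix, just the "tolerant test"
# idea from the comment above. MOUNT_POINT is assumed from this run.
import os

MOUNT_POINT = "/home/ubuntu/cephtest/mnt.1"

for dirpath, dirnames, filenames in os.walk(MOUNT_POINT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            xattrs = os.listxattr(path)
        except OSError:
            continue
        for attr in xattrs:
            if attr.startswith("security."):
                print("stripping %s from %s" % (attr, path))
                try:
                    os.removexattr(path, attr)
                except OSError as err:
                    print("could not remove %s: %s" % (attr, err))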

#3 - Updated by John Spray over 9 years ago

  • Status changed from 7 to Resolved

commit 54abbc61fd0d2c28e641aa6fa0f9ef7986cfd207
Merge: aa64084 5691c68
Author: John Spray <jcspray@gmail.com>
Date:   Tue Oct 28 12:29:29 2014 +0000

    Merge pull request #2809 from ceph/wip-9800

    client: allow xattr caps in inject_release_failure

    Reviewed-by: Greg Farnum <greg@inktank.com>

#4 - Updated by Greg Farnum over 9 years ago

  • Status changed from Resolved to Pending Backport

I don't know that we need or want to try to push this in before release (although, since it's all guarded inside a code branch only used for debugging, it's fairly safe), but we should backport it, or our giant runs will be noisy for eternity.

#5 - Updated by Greg Farnum over 9 years ago

  • Status changed from Pending Backport to Resolved

Backported in commit 387efc5fe1fb148ec135a6d8585a3b8f8d97dbf8.
