Project

General

Profile

Actions

Bug #10912

closed

Stuck request during clientreplay

Added by John Spray about 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Looks like a client bug, MDS has a stuck unlink which can't get the wrlock it wants -- it has sent a revoking MClientCaps to the client, but the client never saw the caps that the MDS is trying to revoke, so it thinks it's a no-op and fails to send an acknowledgement to the MDS.

Seen in teuthology-2015-02-16_23:04:02-fs-master-testing-basic-multi/762834/, have taken a copy of logs as teuthology might eat them when it kills the job

Actions #1

Updated by John Spray about 9 years ago

Logs at teuthology:~/jcsp/10912

Actions #2

Updated by John Spray about 9 years ago

here's the stuck guy:

{
    "ops": [
        {
            "description": "client_request(client.4127:4258 unlink #10000000374\/fstest_9519155c87a6e37015dde5364313a795 2015-02-17 09:41:27.816970 RETRY=1 REPLAY)",
            "initiated_at": "2015-02-17 09:41:45.331510",
            "age": 87983.658023,
            "duration": 2.311106,
            "type_data": [
                "failed to wrlock, waiting",
                "client.4127:4258",
                "client_request",
                {
                    "client": "client.4127",
                    "tid": 4258
                },
                [
                    {
                        "time": "2015-02-17 09:41:45.331510",
                        "event": "initiated" 
                    },
                    {
                        "time": "2015-02-17 09:41:47.641223",
                        "event": "failed to wrlock, waiting" 
                    },
                    {
                        "time": "2015-02-17 09:41:47.642616",
                        "event": "failed to wrlock, waiting" 
                    }
                ]
            ]
        }
    ],
    "num_ops": 1
}

and here's his last attempt to run:

2015-02-17 09:41:47.641985 7ffcb406d700  7 mds.0.server handle_client_unlink on [dentry #1/client.0/tmp/tmp/fstest_68426ebeac309bce2c6c13590af6800e/fstest_3a325443b9d904f0b4f3c258d1824c26/fstest_9519155c87a6e37015dde5364313a795 [2,head] auth (dn xlock x=1 by 0x3e35700) (dversion lock w=1 last_client=4127) v=6 ap=2+1 inode=0x49b4278 state=new | request=1 lock=2 inodepin=1 dirty=1 waiter=0 authpin=1 0x3f0f0e0]
2015-02-17 09:41:47.642026 7ffcb406d700 20 mds.0.locker  must wrlock (dversion lock w=1 last_client=4127) [dentry #100/stray0/10000000376 [2,head] auth NULL (dn xlock x=1 by 0x3e35700) (dversion lock w=1 last_client=4127) pv=0 v=2000 ap=2+0 inode=0 state=new | request=1 lock=2 authpin=1 0x3ef2580]
2015-02-17 09:41:47.642034 7ffcb406d700 20 mds.0.locker  must wrlock (dversion lock w=1 last_client=4127) [dentry #1/client.0/tmp/tmp/fstest_68426ebeac309bce2c6c13590af6800e/fstest_3a325443b9d904f0b4f3c258d1824c26/fstest_9519155c87a6e37015dde5364313a795 [2,head] auth (dn xlock x=1 by 0x3e35700) (dversion lock w=1 last_client=4127) v=6 ap=2+1 inode=0x49b4278 state=new | request=1 lock=2 inodepin=1 dirty=1 waiter=0 authpin=1 0x3f0f0e0]
2015-02-17 09:41:47.642040 7ffcb406d700 20 mds.0.locker  must wrlock (ifile sync) [inode 600 [...2,head] ~mds0/stray0/ auth v1896 ap=1+1 dirtyparent f(v4 m2015-02-17 09:41:27.788733 38=1+37) n(v6 rc2015-02-17 09:41:27.788733 39=1+38) (inest lock dirty) (iversion lock) | dirtyscattered=1 lock=0 dirfrag=1 stickydirs=1 stray=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x46797c8]
2015-02-17 09:41:47.642049 7ffcb406d700 20 mds.0.locker  must wrlock (inest lock dirty) [inode 600 [...2,head] ~mds0/stray0/ auth v1896 ap=1+1 dirtyparent f(v4 m2015-02-17 09:41:27.788733 38=1+37) n(v6 rc2015-02-17 09:41:27.788733 39=1+38) (inest lock dirty) (iversion lock) | dirtyscattered=1 lock=0 dirfrag=1 stickydirs=1 stray=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x46797c8]
2015-02-17 09:41:47.642058 7ffcb406d700 20 mds.0.locker  must wrlock (iversion lock) [inode 10000000376 [2,head] /client.0/tmp/tmp/fstest_68426ebeac309bce2c6c13590af6800e/fstest_3a325443b9d904f0b4f3c258d1824c26/fstest_9519155c87a6e37015dde5364313a795 auth v6 ap=1+0 dirtyparent s=0 n(v0 1=1+0) (iauth excl) (ixattr excl) (iversion lock) caps={4127=pAsXsFs/-@0} | request=0 caps=1 dirtyparent=1 dirty=1 authpin=1 0x49b4278]
2015-02-17 09:41:47.642065 7ffcb406d700 20 mds.0.locker  must wrlock (ifile sync) [inode 10000000374 [...2,head] /client.0/tmp/tmp/fstest_68426ebeac309bce2c6c13590af6800e/fstest_3a325443b9d904f0b4f3c258d1824c26/ auth v10 ap=1+1 dirtyparent f(v0 m2015-02-17 09:41:27.803248 1=1+0) n(v0 rc2015-02-17 09:41:27.803248 2=1+1) (inest lock) (iversion lock) caps={4127=pAsLsXsFs/-@0},l=4127 | lock=0 dirfrag=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x49b5eb0]
2015-02-17 09:41:47.642075 7ffcb406d700 20 mds.0.locker  must wrlock (inest lock) [inode 10000000374 [...2,head] /client.0/tmp/tmp/fstest_68426ebeac309bce2c6c13590af6800e/fstest_3a325443b9d904f0b4f3c258d1824c26/ auth v10 ap=1+1 dirtyparent f(v0 m2015-02-17 09:41:27.803248 1=1+0) n(v0 rc2015-02-17 09:41:27.803248 2=1+1) (inest lock) (iversion lock) caps={4127=pAsLsXsFs/-@0},l=4127 | lock=0 dirfrag=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x49b5eb0]

in final eval of locks, after wrlock_start we send:
11581 2015-02-17 09:41:47.642515 7ffcb406d700  7 mds.0.locker issue_caps loner client.4127 allowed=pAsLsXsFc, xlocker allowed=pAsLsXsFc, others allowed=pAsLsXsFc on [inode 10000000374 [...2,head] /client.0/tmp/tmp/fstest_68426ebeac      309bce2c6c13590af6800e/fstest_3a325443b9d904f0b4f3c258d1824c26/ auth v10 ap=1+1 dirtyparent f(v0 m2015-02-17 09:41:27.803248 1=1+0) n(v0 rc2015-02-17 09:41:27.803248 2=1+1) (isnap sync r=1) (inest lock) (ifile sync->lock) (iv      ersion lock) caps={4127=pAsLsXsFs/-@0},l=4127 | lock=1 dirfrag=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x49b5eb0]
11582 2015-02-17 09:41:47.642528 7ffcb406d700 20 mds.0.locker  client.4127 pending pAsLsXsFs allowed pAsLsXsFc wanted -
11583 2015-02-17 09:41:47.642531 7ffcb406d700  7 mds.0.locker    sending MClientCaps to client.4127 seq 1 new pending pAsLsXs was pAsLsXsFs
Actions #3

Updated by Greg Farnum about 9 years ago

Could this be a result of the client dumping all caps it doesn't care about, and the MDS somehow thinking the client still has this one?

Actions #4

Updated by John Spray about 9 years ago

Hmm, so it's not the trim_cache_for_reconnect that's doing it, as that happens before we compose the reconnect message, and we can see the Fs cap is still there at that stage:

2015-02-17 09:41:45.329854 7ff8df7fe700 10 client.4127  caps on 10000000374.head pAsLsXsFs wants -
2015-02-17 09:41:45.329859 7ff8df7fe700 10 client.4127     path #1/client.0/tmp/tmp/fstest_68426ebeac309bce2c6c13590af6800e/fstest_3a325443b9d904f0b4f3c258d1824c26

But it's gone by the time we see the response to the replayed creation (4255) of a dentry in #10000000374:

2015-02-17 09:41:47.551115 7ff8df7fe700  1 -- 10.214.136.20:0/22898 <== mds.0 10.214.136.20:6800/37047 2 ==== client_reply(???:4255 = 0 (0) Success safe) v1 ==== 35+0+0 (565419744 0 0) 0x7ff86c0027e0 con 0x7ff8cc03df00
2015-02-17 09:41:47.551148 7ff8df7fe700 10 client.4127 insert_trace from 2015-02-17 09:41:27.712184 mds.0 is_target=0 is_dentry=0
2015-02-17 09:41:47.551164 7ff8df7fe700 10 client.4127 insert_trace -- already got unsafe; ignoring
2015-02-17 09:41:47.551177 7ff8df7fe700 10 client.4127 put_inode on 10000000374.head(ref=19 ll_ref=22 cap_refs={} open={} mode=41777 size=0/0 mtime=2015-02-17 09:41:27.999787 caps=pAsLsXs(0=pAsLsXs) COMPLETE parents=0x7ff864007780 0x7ff8cc0cbb80)

Based on these logs, I have no idea what's wiping out the Fs part of the cap in the background

Actions #5

Updated by Greg Farnum about 9 years ago

Can you upload these logs to the appropriate teuthology directory or something?

Also, we drop some caps automatically when sending requests; look at the family of stuff around MetaRequest::inode_drop and Client::encode_inode_release(). Maybe some of that is getting invoked on replayed requests in an improper fashion?

Actions #6

Updated by Zheng Yan about 9 years ago

  • Status changed from New to Fix Under Review
Actions #7

Updated by Greg Farnum about 9 years ago

  • Status changed from Fix Under Review to Resolved

Merged to master by commit:363be2df46f63ddbe0e355af6835aab06b9866c2 and to hammer by commit:a54864a38842def1b4f139e13c71dfb3e6555ea0.

Actions

Also available in: Atom PDF