Bug #19635

closed

Deadlock on two ceph-fuse clients accessing the same file

Added by John Spray about 7 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Tags:
Backport:
jewel, kraken
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

See Dan's reproducer script and the thread "[ceph-users] fsping, why you no work no mo?"
https://raw.githubusercontent.com/dvanders/fsping/

When I started a vstart cluster, mounted two fuse clients, and ran the script, I got two blocked requests like this:

(virtualenv) jspray@senta04:~/ceph/build$ bin/ceph daemon mds.a ops
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
{
    "ops": [
        {
            "description": "client_request(client.4110:27 lookup #1/senta04.ack 2017-04-16 17:39:09.476736 caller_uid=1121, caller_gid=1121{})",
            "initiated_at": "2017-04-16 17:39:09.476974",
            "age": 486.457417,
            "duration": 486.457469,
            "type_data": [
                "failed to rdlock, waiting",
                "client.4110:27",
                "client_request",
                {
                    "client": "client.4110",
                    "tid": 27
                },
                [
                    {
                        "time": "2017-04-16 17:39:09.476974",
                        "event": "initiated" 
                    },
                    {
                        "time": "2017-04-16 17:39:09.486978",
                        "event": "failed to rdlock, waiting" 
                    }
                ]
            ]
        },
        {
            "description": "client_request(client.4111:10 getattr pAsLsXsFs #100000003e9 2017-04-16 17:39:09.488176 caller_uid=1121, caller_gid=1121{})",
            "initiated_at": "2017-04-16 17:39:09.488318",
            "age": 486.446072,
            "duration": 486.446188,
            "type_data": [
                "failed to rdlock, waiting",
                "client.4111:10",
                "client_request",
                {
                    "client": "client.4111",
                    "tid": 10
                },
                [
                    {
                        "time": "2017-04-16 17:39:09.488318",
                        "event": "initiated" 
                    },
                    {
                        "time": "2017-04-16 17:39:09.489099",
                        "event": "failed to rdlock, waiting" 
                    }
                ]
            ]
        }
    ],
    "num_ops": 2
}

This is apparently something that worked in 10.2.5 and is now failing on more recent versions.


Related issues 2 (0 open, 2 closed)

Copied to CephFS - Backport #20027: jewel: Deadlock on two ceph-fuse clients accessing the same file (Resolved, Wei-Chung Cheng)
Copied to CephFS - Backport #20028: kraken: Deadlock on two ceph-fuse clients accessing the same file (Resolved, Nathan Cutler)
#1

Updated by John Spray about 7 years ago

I was wondering if d463107473 ("mds: finish lock waiters in the same order that they were added.") could have been the cause, but the issue still happens if I revert that.

#2

Updated by John Spray about 7 years ago

Those requests are getting hung up on the iauth and ixattr locks on the inode for the ".syn" file the test script creates -- those locks are in the excl->sync transition at the time.

When we go into that transition we're not sending any revokes to the clients, but the server seems to be sitting there waiting for a caps message anyway. Hmm.
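
For illustration, here is a minimal, self-contained C++ sketch of the kind of stall described above. The ToyLock type and its methods are hypothetical, not Ceph's actual SimpleLock/Locker code: if the excl->sync transition is started without actually sending revokes, the acknowledgement that would complete the transition never arrives and the queued rdlock waiters stay blocked indefinitely.

    // Illustrative sketch only -- not Ceph's Locker/SimpleLock API.
    #include <functional>
    #include <iostream>
    #include <set>
    #include <vector>

    struct ToyLock {
      enum State { SYNC, EXCL, EXCL_TO_SYNC } state = EXCL;
      std::set<int> pending_revokes;               // clients we still expect an ack from
      std::vector<std::function<void()>> waiters;  // e.g. blocked rdlock requests

      void start_excl_to_sync(const std::set<int>& cap_holders, bool send_revokes) {
        state = EXCL_TO_SYNC;
        pending_revokes = cap_holders;
        if (send_revokes) {
          for (int c : cap_holders)
            std::cout << "revoke caps from client." << c << "\n";  // would be a caps message
        }
        // If send_revokes is false (the situation observed here), no client ever
        // calls handle_cap_ack(), so eval() never completes the transition.
      }

      void handle_cap_ack(int client) {
        pending_revokes.erase(client);
        eval();
      }

      void eval() {
        if (state == EXCL_TO_SYNC && pending_revokes.empty()) {
          state = SYNC;
          for (auto& w : waiters) w();  // wake the "failed to rdlock, waiting" requests
          waiters.clear();
        }
      }
    };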

#3

Updated by Zheng Yan about 7 years ago

  • Assignee set to Zheng Yan
#4

Updated by Zheng Yan about 7 years ago

  • Status changed from New to In Progress

This bug happens in the following sequence of events:

- Request1 (from client1) creates file1 (the MDS issues caps Asx to client1; early reply is not allowed)
- Request2 (from client2) looks up file1 (the dentry lock is xlocked, so it waits)
- The log event of request1 gets journaled (Server::reply_client_request() calls MDCache::request_drop_non_rdlocks(); request2 gets dispatched while the xlock on the dentry lock is being dropped)
- Request2 revokes caps Ax from client1. (The caps haven't been sent to the client yet, so Locker::issue_caps() just updates client1's caps. The caps get updated, but Locker::eval_gather() is never called, so request2 waits forever.)
- The reply for request1 is sent to client1 (with the updated caps)

I think we should avoid finishing contexts directly when dropping locks (queue the contexts to a finisher instead), as sketched below.
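
Here is a minimal C++ sketch of that idea, assuming a generic Context/Finisher pattern; ToyFinisher and drop_locks() below are hypothetical names for illustration, not the actual Ceph patch. The point is that dropping locks only queues the waiters, and they run later from a clean call stack, after the caps update carried by request1's reply has been fully processed.

    // Sketch: defer lock waiters to a finisher instead of running them inline
    // while locks are being dropped (which can re-enter caps/lock code at an
    // unsafe point, as in the sequence above).
    #include <deque>
    #include <utility>

    struct Context {
      virtual ~Context() = default;
      virtual void finish(int r) = 0;
    };

    class ToyFinisher {
      std::deque<std::pair<Context*, int>> queue_;
    public:
      void queue(Context* c, int r = 0) { queue_.push_back({c, r}); }

      // Called from the main dispatch loop, outside of drop_locks().
      void drain() {
        while (!queue_.empty()) {
          auto [c, r] = queue_.front();
          queue_.pop_front();
          c->finish(r);
          delete c;
        }
      }
    };

    // drop_locks() only collects waiters; it never calls finish() itself, so a
    // waiter like request2 is dispatched only after request1's reply path has
    // fully completed.
    void drop_locks(std::deque<Context*>& lock_waiters, ToyFinisher& fin) {
      for (Context* c : lock_waiters)
        fin.queue(c);
      lock_waiters.clear();
    }

The design trade-off is the usual one for deferred completion: a small amount of extra latency for the woken waiters in exchange for not re-entering the lock and caps state machines from deep inside the reply path.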

#5

Updated by Zheng Yan about 7 years ago

  • Status changed from In Progress to Fix Under Review
#6

Updated by John Spray almost 7 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to jewel, kraken
#7

Updated by Nathan Cutler almost 7 years ago

  • Copied to Backport #20027: jewel: Deadlock on two ceph-fuse clients accessing the same file added
#8

Updated by Nathan Cutler almost 7 years ago

  • Copied to Backport #20028: kraken: Deadlock on two ceph-fuse clients accessing the same file added
#9

Updated by Nathan Cutler almost 7 years ago

  • Status changed from Pending Backport to Resolved