Project

General

Profile

Bug #10944

Deadlock, MDS logs "slow request", getattr pAsLsXsFs failed to rdlock

Added by Ilja Slepnev over 4 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
Start date:
02/24/2015
Due date:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:

Description

CephFS path got stuck in directory listing process. No OSD/network activity. Only MDS logs. Client reboot does not help.

  1. Created big files on path /ceph/a
  2. cat /ceph/a/* >/dev/null
  3. ls /ceph/a (deadlock here)
  4. /ceph/a is locked for all clients

Log messages on MDS (repeating):
2015-02-24 16:02:41.564071 7fdb0055c700 0 log_channel(default) log [WRN] : 9 slow requests, 1 included below; oldest blocked for > 14463.448519 secs
2015-02-24 16:02:41.564077 7fdb0055c700 0 log_channel(default) log [WRN] : slow request 1922.318256 seconds old, received at 2015-02-24 15:30:39.245786: client_request(client.66401597:2440 getattr pAsLsXsFs #10000002d68) currently failed to rdlock, waiting

Configuration of MDS and CephFS client:
CentOS 7.0.1406, Linux 3.10.0-123.20.1.el7.centos.plus.x86_64
ceph-0.87
dmesg: libceph: loaded (mon/osd proto 15/24)
dmesg: ceph: loaded (mds proto 32)

HEALTH_OK. OSD commit/apply latency is not changed, <100 ms.

How to find out what is going wrong and why getattr causes a deadlock?

Is there a workaround?

History

#1 Updated by Greg Farnum over 4 years ago

  • Status changed from New to Need More Info

As it says, the MDS is failing to get a read lock on probably that inode, or maybe on of its ancestors. If you had high-level logging enabled on the MDS you can examine those to find out exactly where it got stuck, and if you dump the cache you should be able to see what's preventing the read lock from being acquired.

If getting us this debugging info sounds bothersome, you should also be okay just restarting the MDS and things will probably resume and start working.

#2 Updated by Zheng Yan over 4 years ago

3.10 kernel is too old for using cephfs.

#3 Updated by Ilja Slepnev over 4 years ago

More experiment.
  1. create big testfile (30GB)
  2. pv testfile >testfile_2 (client host 1)
  3. pv testfile_2 >/dev/null (client host 2)

Attempted to read file that is being written and got the same lock. Both clients are in uninterruptible sleep.

No problem when running commands on same host.

ceph --admin-daemon /var/run/ceph/ceph-mds.<hostname>.asok dump_ops_in_flight

{ "ops": [
        { "description": "client_request(client.66801040:12 getattr pAsLsXsFs #100000008e6)",
          "initiated_at": "2015-02-25 14:19:19.750860",
          "age": "319.984677",
          "duration": "3.884925",
          "type_data": [
                "failed to rdlock, waiting",
                "client.66801040:12",
                "client_request",
                { "client": "client.66801040",
                  "tid": 12},
                [
                    { "time": "2015-02-25 14:19:19.750860",
                      "event": "initiated"},
                    { "time": "2015-02-25 14:19:19.751011",
                      "event": "failed to rdlock, waiting"},
                    { "time": "2015-02-25 14:19:23.635785",
                      "event": "failed to rdlock, waiting"}]]}],
  "num_ops": 1}
2015-02-25 14:21:22.414651 7f76c4f75700  0 log_channel(default) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 122.663766 secs
2015-02-25 14:21:22.414657 7f76c4f75700  0 log_channel(default) log [WRN] : slow request 122.663766 seconds old, received at 2015-02-25 14:19:19.750860: client_request(client.66801040:12 getattr pAsLsXsFs #100000008e6) currently failed to rdlock, waiting
2015-02-25 14:20:27.415284 7f76c4f75700  0 log_channel(default) log [WRN] : client.66802792 isn't responding to mclientcaps(revoke), ino 100000008e6 pending pAsLsXsFr issued pAsLsXsFrw, sent 63.779507 seconds ago                                                                            

Restarting MDS does not free the lock.
Restarting both locked clients helps.

#4 Updated by Greg Farnum over 4 years ago

  • Status changed from Need More Info to Resolved

Zheng Yan wrote:

3.10 kernel is too old for using cephfs.

Oh, I didn't notice the kernel version, d'oh.

The blocking you're seeing with a 30GB file is not surprising. Ceph does not have file range locking, so when your second node goes to read it has to acquire read access to the file. That requires that the node which created the file flush everything out to the OSDs, which for a 30GB file is going to take a while. Until that's done the reading node will block on the read.

#5 Updated by Zhi Zhang about 4 years ago

  • Regression set to No

Hi Greg and Yan,

We are using ceph-dokan on windows and hitting the same problem few times. And this deadlock happened at mds side, should not be related to 3.10 kernel, right? Although we are using centos with 3.10 kernel to setup ceph cluster.

Any suggestion on our case?

Thanks.

#6 Updated by Zheng Yan about 4 years ago

please dump mds when this hapeens (ceph mds tell \* dumpcache, cache dump file at /cachedump.* on the machine that runs MDS). Besides, please send the exact warning message to us.

#7 Updated by Ilja Slepnev about 4 years ago

Hi David,

After changing CentOS 7 client OS to Fedora 21 (kernel 3.18) the deadlock problem disappeared.

I did not change Ceph cluster OS.
OSDs, MONs, MDS are running on CentOS 7.1.1503 with stock kernel 3.10.0-229.4.2.el7.x86_64, ceph version 0.94.2.

#8 Updated by Zhi Zhang about 4 years ago

Thanks, guys. I will keep digging to see if we can find something. Will keep you updated and pls stay tuned.

#9 Updated by Zhi Zhang about 4 years ago

We met this issue because we enabled timeout mechanism in tick on cephfs client for all the mds requests. Once timeout happened, this request would be signaled and its tid would be removed from mds_request, but mds didn't know about it and would reply to client with new pending caps info in trace_bl.

Client ignored the new pending caps because of no such request's tid. This inode's old caps didn't change. If mds received next requests about getting attr of this inode from the same client, mds might send revoking caps request to this client for a new caps. Then client found this new caps equaling to old caps, so it wouldn't send caps back to mds. Hence this getattr request became a slow request.

#10 Updated by Greg Farnum about 4 years ago

Do I guess correctly that by the timeout mechanism you're referring to the patch in http://tracker.ceph.com/issues/12209, or something else?

#11 Updated by Zhi Zhang about 4 years ago

Hi Greg,

Not exactly. We have enabled timeout in 2 ways, one is the patch you saw, the other is in Client::tick() (see this email: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg21221.html). The patch we used here is the second one. But I think either one will result in this issue.

#12 Updated by Zheng Yan about 4 years ago

please try the attached patches. (the second one is modified version of your timeout patch)

#13 Updated by Zheng Yan about 4 years ago

please use the new timeout patch

#14 Updated by Zhi Zhang about 4 years ago

Hi Yan,

We have tested both patches for few days and no above issue happened again. We will perform some other tests as well.

Thanks.

#15 Updated by Zhi Zhang about 4 years ago

Hi Yan,

Those patches look good after performing test for a long time. Will you commit the patches to both ceph and ceph-dokan?

Thanks.

#16 Updated by Zheng Yan about 4 years ago

will do

#17 Updated by pengcheng lin about 4 years ago

Zheng Yan wrote:

will do

HI,i use your patch in ceph dokan client. and seems cause to client assert crash.

Description:
1?use ceph dokan client.
2, disable the network adapter.
3, wait 6 miniute
4, and then, client crash by assert.

three assert fail?

1?xlist.h ?line 31
~item() {
assert(!is_on_list());
//remove_myself();
}

2?xlish.h, line 73~75
~xlist() {
assert(_size 0);
assert(_front 0);
assert(_back == 0);
}

3,client.cc ,line 2456~2457
}
assert(session->requests.empty());
assert(session->unsafe_requests.empty());
}

#18 Updated by Zheng Yan about 4 years ago

please provide backtrace of the crash

#19 Updated by Richard Hesse over 3 years ago

We're also running into this with the latest versions of ceph and CentOS 7 kernel. Message on the mds is:

2016-02-28 18:05:02.524808 mds.0 [WRN] 2 slow requests, 1 included below; oldest blocked for > 399.752411 secs
2016-02-28 18:05:02.524816 mds.0 [WRN] slow request 242.451526 seconds old, received at 2016-02-28 18:01:00.073235: client_request(client.34639:6398 getattr pAsLsXsFs #10000003191 2016-02-28 18:01:00.070365) currently failed to rdlock, waiting

Our trigger is similar to the initial report. We're rsync'ing 400-500GB files through ceph FS. Stop the rsync, restart it, then the mds/client will hang. I've run "ceph mds tell \* dumpcache, cache dump file at /cachedump.*" on the mds, but I'm not sure where that data is stored. Let me know, and I can forward it along.

#20 Updated by Greg Farnum over 3 years ago

The file by default goes to a file called "cachedump.*", like it says. I think it tends to go in /, but I don't remember for certain — it's whatever the working directory for the ceph-mds process is.

#21 Updated by Oliver Dzombc over 3 years ago

Hi,

same here:

with clients:

3.10.0-327.10.1.el7.x86_64 = centos 7
3.19.0-25-generic #26~14.04.1-Ubuntu SMP = ubuntu 14.04 LTS

mdsmap e426: 1/1/1 up {0=ceph4=up:active}, 2 up:standby

@ceph4:
  1. ceph mds tell \* dumpcache, cache dump file at /cachedump.*
    ok

no cachedump file is existing/created.

  1. ceph mds tell \* dumpcache
    ok
  1. ceph --admin-daemon /var/run/ceph/ceph-mds.ceph4.asok dump_ops_in_flight
{
    "ops": [
        {
            "description": "client_request(client.28323163:42 getattr pAsLsXsFs #100000080b3 2016-04-01 19:01:12.000000)",
            "initiated_at": "2016-04-01 19:01:12.515412",
            "age": 207.979446,
            "duration": 0.000151,
            "type_data": [
                "failed to rdlock, waiting",
                "client.28323163:42",
                "client_request",
                {
                    "client": "client.28323163",
                    "tid": 42
                },
                [
                    {
                        "time": "2016-04-01 19:01:12.515412",
                        "event": "initiated" 
                    },
                    {
                        "time": "2016-04-01 19:01:12.515563",
                        "event": "failed to rdlock, waiting" 
                    }
                ]
            ]
        },
        {
            "description": "client_request(client.32425004:66762 getattr pAsLsXsFs #100000080ac 2016-04-01 18:33:16.943661)",
            "initiated_at": "2016-04-01 18:33:16.944950",
            "age": 1883.549907,
            "duration": 0.000211,
            "type_data": [
                "failed to rdlock, waiting",
                "client.32425004:66762",
                "client_request",
                {
                    "client": "client.32425004",
                    "tid": 66762
                },
                [
                    {
                        "time": "2016-04-01 18:33:16.944950",
                        "event": "initiated" 
                    },
                    {
                        "time": "2016-04-01 18:33:16.945162",
                        "event": "failed to rdlock, waiting" 
                    }
                ]
            ]
        },
        {
            "description": "client_request(client.32425004:66763 getattr pAsLsXsFs #100000080aa 2016-04-01 18:55:05.799300)",
            "initiated_at": "2016-04-01 18:55:05.800360",
            "age": 574.694497,
            "duration": 0.000172,
            "type_data": [
                "failed to rdlock, waiting",
                "client.32425004:66763",
                "client_request",
                {
                    "client": "client.32425004",
                    "tid": 66763
                },
                [
                    {
                        "time": "2016-04-01 18:55:05.800360",
                        "event": "initiated" 
                    },
                    {
                        "time": "2016-04-01 18:55:05.800532",
                        "event": "failed to rdlock, waiting" 
                    }
                ]
            ]
        },
        {
            "description": "client_request(client.28323163:4 getattr pAsLsXsFs #100000080af 2016-04-01 18:55:22.000000)",
            "initiated_at": "2016-04-01 18:55:22.189176",
            "age": 558.305681,
            "duration": 0.000124,
            "type_data": [
                "failed to rdlock, waiting",
                "client.28323163:4",
                "client_request",
                {
                    "client": "client.28323163",
                    "tid": 4
                },
                [
                    {
                        "time": "2016-04-01 18:55:22.189176",
                        "event": "initiated" 
                    },
                    {
                        "time": "2016-04-01 18:55:22.189300",
                        "event": "failed to rdlock, waiting" 
                    }
                ]
            ]
        },
        {
            "description": "client_request(client.28323163:37 getattr pAsLsXsFs #100000080aa 2016-04-01 18:58:22.000000)",
            "initiated_at": "2016-04-01 18:58:22.027982",
            "age": 378.466875,
            "duration": 0.000164,
            "type_data": [
                "failed to rdlock, waiting",
                "client.28323163:37",
                "client_request",
                {
                    "client": "client.28323163",
                    "tid": 37
                },
                [
                    {
                        "time": "2016-04-01 18:58:22.027982",
                        "event": "initiated" 
                    },
                    {
                        "time": "2016-04-01 18:58:22.028146",
                        "event": "failed to rdlock, waiting" 
                    }
                ]
            ]
        },
        {
            "description": "client_request(client.28323159:6 getattr pAsLsXsFs #100000080ac 2016-04-01 18:48:07.000000)",
            "initiated_at": "2016-04-01 18:48:07.541073",
            "age": 992.953784,
            "duration": 547.905453,
            "type_data": [
                "cleaned up request",
                "client.28323159:6",
                "client_request",
                {
                    "client": "client.28323159",
                    "tid": 6
                },
                [
                    {
                        "time": "2016-04-01 18:48:07.541073",
                        "event": "initiated" 
                    },
                    {
                        "time": "2016-04-01 18:48:07.541234",
                        "event": "failed to rdlock, waiting" 
                    },
                    {
                        "time": "2016-04-01 18:57:15.446500",
                        "event": "killing request" 
                    },
                    {
                        "time": "2016-04-01 18:57:15.446526",
                        "event": "cleaned up request" 
                    }
                ]
            ]
        },
        {
            "description": "client_request(client.28323163:38 getattr pAsLsXsFs #100000080ad 2016-04-01 19:01:11.000000)",
            "initiated_at": "2016-04-01 19:01:11.364297",
            "age": 209.130561,
            "duration": 0.000121,
            "type_data": [
                "failed to rdlock, waiting",
                "client.28323163:38",
                "client_request",
                {
                    "client": "client.28323163",
                    "tid": 38
                },
                [
                    {
                        "time": "2016-04-01 19:01:11.364297",
                        "event": "initiated" 
                    },
                    {
                        "time": "2016-04-01 19:01:11.364417",
                        "event": "failed to rdlock, waiting" 
                    }
                ]
            ]
        },
        {
            "description": "client_request(client.28323163:39 getattr pAsLsXsFs #100000080ac 2016-04-01 19:01:11.000000)",
            "initiated_at": "2016-04-01 19:01:11.618192",
            "age": 208.876665,
            "duration": 0.000157,
            "type_data": [
                "failed to rdlock, waiting",
                "client.28323163:39",
                "client_request",
                {
                    "client": "client.28323163",
                    "tid": 39
                },
                [
                    {
                        "time": "2016-04-01 19:01:11.618192",
                        "event": "initiated" 
                    },
                    {
                        "time": "2016-04-01 19:01:11.618349",
                        "event": "failed to rdlock, waiting" 
                    }
                ]
            ]
        },
        {
            "description": "client_request(client.28323163:40 getattr pAsLsXsFs #100000080aa 2016-04-01 19:01:11.000000)",
            "initiated_at": "2016-04-01 19:01:11.744488",
            "age": 208.750370,
            "duration": 0.000171,
            "type_data": [
                "failed to rdlock, waiting",
                "client.28323163:40",
                "client_request",
                {
                    "client": "client.28323163",
                    "tid": 40
                },
                [
                    {
                        "time": "2016-04-01 19:01:11.744488",
                        "event": "initiated" 
                    },
                    {
                        "time": "2016-04-01 19:01:11.744659",
                        "event": "failed to rdlock, waiting" 
                    }
                ]
            ]
        },
        {
            "description": "client_request(client.28323163:41 getattr pAsLsXsFs #100000080ab 2016-04-01 19:01:12.000000)",
            "initiated_at": "2016-04-01 19:01:12.392500",
            "age": 208.102357,
            "duration": 0.000217,
            "type_data": [
                "failed to rdlock, waiting",
                "client.28323163:41",
                "client_request",
                {
                    "client": "client.28323163",
                    "tid": 41
                },
                [
                    {
                        "time": "2016-04-01 19:01:12.392500",
                        "event": "initiated" 
                    },
                    {
                        "time": "2016-04-01 19:01:12.392718",
                        "event": "failed to rdlock, waiting" 
                    }
                ]
            ]
        }
    ],
    "num_ops": 10
}

On the node, if taking a look into the mount:

root@cn201:/ceph-storage# ls
kvm test test123 vz
root@cn201:/ceph-storage# rm test
root@cn201:/ceph-storage# ls

kvm test123 vz
root@cn201:/ceph-storage#
root@cn201:/ceph-storage# rm test123
root@cn201:/ceph-storage# cd vz
ls
root@cn201:/ceph-storage/vz# ls
root@cn201:/ceph-storage/vz# cd ..
root@cn201:/ceph-storage# ls
kvm vz
root@cn201:/ceph-storage# ls -la
total 4
drwxr-xr-x 1 root root 6699234394112 Apr 1 18:57 .
drwxr-xr-x 25 root root 4096 Mar 18 22:53 ..
drwxr-xr-x 1 root root 6658274394112 Apr 1 17:56 kvm
drwxr-xr-x 1 root root 0 Feb 5 00:57 vz


On Client Node A:

root@cn201:/ceph-storage/vz# dd if=/dev/zero bs=256M count=8 of=/ceph-storage/vz/test
8+0 records in
8+0 records out
2147483648 bytes (2.1 GB) copied, 2.50078 s, 859 MB/s
root@cn201:/ceph-storage/vz# ls
test
root@cn201:/ceph-storage/vz# ls la
total 2097152
drwxr-xr-x 1 root root 0 Apr 1 19:09 .
drwxr-xr-x 1 root root 6699234394112 Apr 1 18:57 ..
-rw-r--r-
1 root root 2147483648 Apr 1 19:09 test
root@cn201:/ceph-storage/vz# dd if=/dev/zero bs=256M count=16 of=/ceph-storage/vz/test
16+0 records in
16+0 records out
4294967296 bytes (4.3 GB) copied, 4.86012 s, 884 MB/s

while Client Node B can read it:

[root@cn202 ceph-storage]# ls
kvm vz
[root@cn202 ceph-storage]# cd vz
[root@cn202 vz]# ls
test
[root@cn202 vz]# dd if=test of=/dev/null
8388608+0 records in
8388608+0 records out
4294967296 bytes (4.3 GB) copied, 42.0679 s, 102 MB/s
[root@cn202 vz]#

but going again into the big one:

[root@cn202 vz]# cd ..
[root@cn202 ceph-storage]# cd kvm
[root@cn202 kvm]# ls

<dead end>


So is there simply a max directory / file size/count which exist/applies ?:

root@cn201:/ceph-storage# ls -la
total 4
drwxr-xr-x 1 root root 6699234394112 Apr 1 18:57 .
drwxr-xr-x 25 root root 4096 Mar 18 22:53 ..
drwxr-xr-x 1 root root 6658274394112 Apr 1 17:56 kvm
drwxr-xr-x 1 root root 0 Apr 1 19:09 vz

#22 Updated by Greg Farnum over 3 years ago

That's probably a result of using older kernel clients in your case, Oliver. There's no max file size/count which will do anything like that. You could try just restarting the MDS service; that might reset things enough to make progress.

And your cache dumps, if not in /, might also be in /var/log/ceph/, /var/run/ceph/ or one of your other Ceph folders. If you don't give it a path to write the file to, it uses the current working directory.

#23 Updated by Oliver Dzombc over 3 years ago

Hi Greg,

i can confirm, that restarting the mds will solve the issue. Thank you very much for that advice !

If it happens again, i will try to get the logs.

#24 Updated by Greg Farnum about 3 years ago

  • Component(FS) MDS added

#25 Updated by Oliver Dzombc about 3 years ago

Hi,

client kernel: 4.5.5
MDS Server kernel: 4.5.5

Only ONE client is accessing.
Only a specific directory within cephfs can not be accessed resulting in:

{
"description": "client_request(client.204159:14 getattr pAsLsXsFs #10000001432 2016-07-26 04:50:53.687551 RETRY=2)",
"initiated_at": "2016-07-26 04:07:41.280054",
"age": 37.230305,
"duration": 37.230574,
"type_data": [
"failed to rdlock, waiting",
"client.204159:14",
"client_request", {
"client": "client.204159",
"tid": 14
},
[ {
"time": "2016-07-26 04:07:41.280054",
"event": "initiated"
}, {
"time": "2016-07-26 04:07:43.337325",
"event": "failed to rdlock, waiting"
}
]
]
}
],

Restarting of the entire cluster did not help.
Restarting of the client did not help.

Any idea ?

#26 Updated by John Spray about 3 years ago

Oliver: on the mailing list it seemed like this was probably not a cephfs issue (there were very busy cache tiers).

#27 Updated by Oliver Dzombc almost 3 years ago

Hi,

i fully understand that this things might not be trivial.

But we have again that issue. Due to a network problem, ceph got in trouble. Thats ok of course.

Currently we see:

health HEALTH_WARN
'ssd_cache' at/near target max
monmap e1: 3 mons at {cephmon1=10.0.0.11:6789/0,cephmon2=10.0.0.12:6789/0,cephmon3=10.0.0.13:6789/0}
election epoch 146, quorum 0,1,2 cephmon1,cephmon2,cephmon3
fsmap e132: 1/1/1 up {0=cephmon1=up:active}, 2 up:standby
osdmap e2356: 24 osds: 24 up, 24 in
flags sortbitwise
pgmap v8783558: 2240 pgs, 4 pools, 17530 GB data, 4920 kobjects
35075 GB used, 18851 GB / 53926 GB avail
2240 active+clean
client io 486 kB/s rd, 5526 kB/s wr, 174 op/s rd, 793 op/s wr
cache io 414 MB/s evict, 125 op/s promote, 3 PG(s) evicting

And again, on specific directories within cephfs, we have no access.

I understand, that if things are busy, you can not expect speed.

But we are waiting here already multiple houres that the readlock is released.

I mean, ok, we are talking here about a directory with maybe 5-10 TB of Data in ~ 50-100 Files.

But is there really nothing we can do to speed up things, but to wait until all is cooled down again ?

The other directories ( they are not smaller by the way -- 8.7TB with 150 files ) are accessable and working perfectly.

So is there really no other way but to except that things are like this and have to wait until the readlock is released ( even if the client who is the only one accessing it, has been already restarted ) ?

Thank you !

Greetings
Oliver

#28 Updated by John Spray almost 3 years ago

Oliver: use "ceph daemon mds.<id> objecter_requests" to see if the MDS is stuck waiting for operations from the OSDs.

Also available in: Atom PDF