Bug #23289

mds: xfstest generic/089 hangs in rename syscall in luminous

Added by Luis Henriques over 3 years ago. Updated over 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Running this test on a multimds Luminous cluster results in a stalled (very slow?) client. I'm using a kernel client, but I was able to reproduce the issue with fuse as well. Thus, I don't think this is a client issue. Also, I wasn't able to reproduce it with a mimic (master branch) cluster.

On the client, here is the kernel stack of the test process:
cat /proc/1065/stack
[<0>] ceph_mdsc_do_request+0xef/0x2e0
[<0>] ceph_rename+0x125/0x1e0
[<0>] vfs_rename+0x335/0x810
[<0>] SyS_renameat2+0x47a/0x530
[<0>] do_syscall_64+0x60/0x110
[<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<0>] 0xffffffffffffffff
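
To repeat this check on other processes, something like the sketch below may help. It only tests whether a kernel stack dump contains the ceph_mdsc_do_request frame shown above, which as far as I understand means the task has submitted an MDS request and is waiting for it to complete. The function name stack_waits_on_mds is made up for illustration, and reading /proc/<pid>/stack normally requires root.

```shell
#!/bin/sh
# Sketch only: succeed if a kernel stack dump (for example /proc/<pid>/stack,
# or a saved copy of it) contains the given frame.
# stack_waits_on_mds is a hypothetical helper name, not part of any Ceph tool.
stack_waits_on_mds() {
    stack_file=$1
    # Default to the frame seen in this report; override with a second argument.
    frame=${2:-ceph_mdsc_do_request}
    # -q: no output, -F: match the frame name as a fixed string.
    # Exit status 0 means the frame is present in the stack.
    grep -qF "$frame" "$stack_file"
}

# Example, using the PID from this report:
#   stack_waits_on_mds /proc/1065/stack && echo "waiting on an MDS request"
```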

I'm able to reproduce this very easily with a vstart cluster:

MON=1 OSD=2 MDS=3 ../src/vstart.sh -x -n -i 192.168.155.1 --multimds 2 -b

If, while the test is stalled, I reduce the number of MDSs, the test will proceed and finish. I'm still looking at the MDS logs (attached), but I'm not very proficient at reading those, so it may take a while for me to see something that may be obvious to more trained eyes.
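
For anyone trying to reproduce the "reduce the number of MDSs" step: the exact commands used are not recorded here, so the following is only a sketch of one way to do it on Luminous. It assumes the Luminous behaviour where lowering max_mds does not stop an active rank by itself and the extra rank has to be deactivated explicitly. The filesystem name is a placeholder you must supply, and in a vstart build the ceph binary is normally run from the build directory. The helper only prints the commands so they can be reviewed before running.

```shell
#!/bin/sh
# Sketch only: print the commands that should shrink a Luminous filesystem
# to a single active MDS. single_mds_commands is a hypothetical helper name.
#   $1: filesystem name (placeholder, check "ceph fs ls" for the real one)
#   $2: rank to deactivate (defaults to 1, the second active rank)
single_mds_commands() {
    fs_name=$1
    rank=${2:-1}
    # Lower the target number of active MDS daemons.
    echo "ceph fs set $fs_name max_mds 1"
    # Assumption: on Luminous the surplus rank must be stopped explicitly.
    echo "ceph mds deactivate $fs_name:$rank"
}

# Example: review the output first, then pipe it to a shell if it looks right.
#   single_mds_commands <fs_name>
#   single_mds_commands <fs_name> | sh -x
```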

mds-logs.tgz - MDS logs (trimmed) (91.6 KB) Luis Henriques, 03/09/2018 03:37 PM

History

#1 Updated by Patrick Donnelly over 3 years ago

  • Subject changed from xfstest generic/089 hangs in rename syscall in luminous to mds: xfstest generic/089 hangs in rename syscall in luminous
  • Target version deleted (v12.2.4)
  • Source set to Community (user)
  • Backport set to luminous
  • Affected Versions v12.2.4 added

#2 Updated by Zheng Yan over 3 years ago

  • Assignee set to Zheng Yan

#3 Updated by Zheng Yan over 3 years ago

Could you check whether the test is hung or just slow (check /sys/kernel/debug/ceph/xxx/mdsc)? My local test shows that it is slow because there are lots of cross-MDS renames.

#4 Updated by Luis Henriques over 3 years ago

I can confirm that the test still progresses (I see changes in the /sys/kernel/debug/ceph/xxx/mdsc file). But this makes the test take way too long compared to what happens in mimic (just a few minutes in a vstart cluster).
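
The hung-versus-slow check above can be sketched as a comparison of two snapshots of the mdsc file taken some time apart. This is a rough heuristic, not an official tool: it treats the file as opaque text, and the function name and the three labels are my own. Identical non-empty snapshots suggest the same requests are still in flight, while differing snapshots suggest requests are completing and new ones are being issued.

```shell
#!/bin/sh
# Sketch only: classify two saved snapshots of the client's mdsc debugfs file.
# classify_mdsc_snapshots is a hypothetical helper name.
# Prints one of: idle, stalled, progressing.
classify_mdsc_snapshots() {
    before=$1
    after=$2
    if [ ! -s "$before" ] && [ ! -s "$after" ]; then
        # Both snapshots are empty (or missing): no requests were in flight.
        echo idle
    elif cmp -s "$before" "$after"; then
        # Byte-for-byte identical: the same requests are still pending.
        echo stalled
    else
        # The set of pending requests changed between the snapshots.
        echo progressing
    fi
}

# Example (debugfs is usually readable by root only; "xxx" is the client
# directory, left elided as in the comments above):
#   cat /sys/kernel/debug/ceph/xxx/mdsc > /tmp/mdsc.1
#   sleep 30
#   cat /sys/kernel/debug/ceph/xxx/mdsc > /tmp/mdsc.2
#   classify_mdsc_snapshots /tmp/mdsc.1 /tmp/mdsc.2
```

A single "stalled" result over a short interval is not conclusive, since one slow cross-MDS rename could stay pending for the whole interval; repeating the comparison a few times gives a better picture.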

#5 Updated by Luis Henriques over 3 years ago

By the way, have you actually seen the test run to completion? I've waited a few hours and the test kept running.

#6 Updated by Patrick Donnelly over 2 years ago

  • Assignee deleted (Zheng Yan)

#7 Updated by Patrick Donnelly over 2 years ago

  • Category deleted (90)
  • Labels (FS) multimds added
