Bug #3663

closed

ceph kernel client is getting stuck on xstat* operations

Added by Roman Hlynovskiy over 11 years ago. Updated almost 8 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

There are 2 kernel clients happily working with ceph. As soon as I try mounting ceph from a third client, it gets stuck on stat* operations (observed via strace).
The 2 working clients are also affected by this broken client until it is completely killed; sometimes those 2 working clients have to be completely rebooted.
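
For reference, one minimal way to see the hang from the client side; the mount point below is a placeholder, not the reporter's actual path:

strace -f -e trace=file ls -l /mnt/ceph
# on the stuck client the trace stops at a stat()/lstat() call on the ceph mount and never returns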

The mds log was collected with the following debug config:

[mds]
debug mds = 20
debug ms = 1
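
The same levels can also be raised at runtime without a restart; a sketch, assuming the daemon is named mds.a (as the attached log filename suggests) and that this ceph version accepts the 'ceph tell' form:

ceph tell mds.a injectargs '--debug-mds 20 --debug-ms 1'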

The client kernel is 3.2.0-0.bpo.3-686-pae.
The ceph release is 0.55 from the debian-testing repo.

I might be wrong, but according to the log, just before hitting the bug the mds complains that some laggy sessions exist for the client that is about to get stuck. So maybe it is trying to attach the client's new session to the old laggy ones from the same client, which creates a kind of race condition?

Log attached. Ready to collect more logs and evidence.


Files

mds-a.debug.log.bz2 (525 KB) mds-a.debug.log.bz2 Roman Hlynovskiy, 12/20/2012 08:48 PM
ceph_20121221_01_logs.tar.gz (27.2 KB) ceph_20121221_01_logs.tar.gz Roman Hlynovskiy, 12/20/2012 10:19 PM
Actions #1

Updated by Sage Weil over 11 years ago

  • Status changed from New to Need More Info

Hmm. It's actually just saying it's the oldest client; it's not actually too old (yet). The looping connect attempts are strange, though. What is in the kernel log on that client? Is the kernel version different from the others? 3.2 is a bit old, by the way; you are better off with the 3.4 or 3.6 series kernels.

In any case, if you can reproduce the stat hang with full mds logging (debug ms = 1, debug mds = 20) and attach the full mds log for that period, that should have all of the information I need (especially if you provide the inode number for the file in question, so it's easy to find in the log).
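
One way to get that inode number on the client, assuming the file sits under a hypothetical /mnt/ceph mount:

ls -i /mnt/ceph/path/to/file        # prints "<inode> <filename>"
stat -c %i /mnt/ceph/path/to/file   # prints just the inode number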

Given the weirdness in the log fragment you attached, actually, let's make that debug ms = 20 and debug mds = 20 (so we can see whether the reconnect is due to the mds or something on the client).
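
In ceph.conf terms, that would look roughly like:

[mds]
debug mds = 20
debug ms = 20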

Thanks!

Actions #2

Updated by Roman Hlynovskiy over 11 years ago

Hello Sage,

I've added 4 logs:

1) Screen output from the console of the laggy client. It ends at 'jroger@pr02:~/data$ cp vo'; this is where it actually got completely stuck.
2) syslog output from the laggy client.
3) mds log with 20/20 for mds/ms debug, started before the mount from the stuck client and stopped 1-2 minutes after the client got stuck.
4) mds log with 20/20 for mds/ms debug, started after an mds restart while the working client was stuck.

Observations:

1) As soon as I restarted the mds (to disable the debug level), the stuck client became responsive, but the working client got completely stuck (the second mds log is from while it was stuck).
2) As soon as I completely rebooted the laggy client system, the working client transitioned from the stuck state back to working.

Actions #3

Updated by Sage Weil over 11 years ago

Hi Roman-

The logging levels are right, but in both mds logs neither mds was ever active; both were in the up:standby state the entire time. This is either because the active ceph-mds process wasn't restarted, or because we didn't wait long enough for the monitor to have one of the restarted daemons take over (by default it takes 15-20 seconds). Can you repeat the experiment, but also watch 'ceph -w' output and wait for the restarted ceph-mds process to go from up:standby all the way to up:active, and then once it's active reproduce the hang?
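
A sketch of that sequence, with a placeholder daemon name (mds.a); both 'ceph -w' and 'ceph mds stat' show the state transitions:

/etc/init.d/ceph restart mds.a   # or however ceph-mds is restarted on this node
ceph -w                          # watch until the restarted mds reaches up:active
ceph mds stat                    # e.g. "e42: 1/1/1 up {0=a=up:active}"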

Ping me (sagewk) in #ceph on irc.oftc.net if you need help.

Thanks!

Actions #4

Updated by Roman Hlynovskiy over 11 years ago

Hi Sage,

I am very sorry for taking your time with this issue; I feel like an idiot :(
The buggy client runs in a virtualized environment and uses a separate network address space for ceph communication.
I forgot to change the ceph-related IP address for this node after cloning the system, which was the cause of this problem.
Now it's obvious why the first client was getting stuck. I am pretty sure the problem with the mons going down is related to this misconfiguration.
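
For anyone hitting something similar after cloning a VM, a quick sanity check on the client (the exact config entries depend on the setup):

ip addr show                                   # does the clone have the intended address on the ceph network?
grep -E 'mon host|mon addr|public network' /etc/ceph/ceph.conf
dmesg | grep -i ceph                           # the kernel client reports connection trouble here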

cheers!

Actions #5

Updated by Sage Weil over 11 years ago

  • Status changed from Need More Info to Rejected

No worries. Let us know if you do come across behavior that looks like a bug!

Actions #6

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added