Bug #3663

closed

ceph kernel client is getting stuck on xstat* operations

Added by Roman Hlynovskiy over 11 years ago. Updated almost 8 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

There are 2 kernel clients happily working with ceph. As soon as I try mounting ceph from a third client, it gets stuck on stat* operations (observed via strace).
The 2 working clients are also affected by this broken client until it is completely killed; sometimes those 2 working clients have to be completely rebooted.
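
For reference, one minimal way to see the hang from the client side; the mount point below is a placeholder, not the reporter's actual path:

strace -f -e trace=file ls -l /mnt/ceph
# on the stuck client the trace stops at a stat()/lstat() call on the ceph mount and never returns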

The mds log was collected with the following debug config:

[mds]
debug mds = 20
debug ms = 1
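
The same levels can also be raised at runtime without a restart; a sketch, assuming the daemon is named mds.a (as the attached log filename suggests) and that this ceph version accepts the 'ceph tell' form:

ceph tell mds.a injectargs '--debug-mds 20 --debug-ms 1'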

The client kernel is 3.2.0-0.bpo.3-686-pae.
The ceph release is 0.55 from the debian-testing repo.

I might be wrong, but according to the log, just before hitting the bug the mds complains that some laggy sessions exist for the client that is about to get stuck. So maybe it is trying to attach the client's new session to the old laggy ones from the same client, which creates a kind of race condition?

Log attached. Ready to collect more logs and evidence.


Files

mds-a.debug.log.bz2 (525 KB) mds-a.debug.log.bz2 Roman Hlynovskiy, 12/20/2012 08:48 PM
ceph_20121221_01_logs.tar.gz (27.2 KB) ceph_20121221_01_logs.tar.gz Roman Hlynovskiy, 12/20/2012 10:19 PM
Actions #1

Updated by Sage Weil over 11 years ago

  • Status changed from New to Need More Info

Hmm. It's actually just saying it's the oldest client; it's not actually too old (yet). The looping connect attempts are strange, though. What is in the kernel log on that client? Is the kernel version different from the others? 3.2 is a bit old, by the way; you are better off with the 3.4 or 3.6 series kernels.

In any case, if you can reproduce the stat hang with full mds logging (debug ms = 1, debug mds = 20) and attach the full mds log for that period, that should have all of the information I need (especially if you provide the inode number for the file in question, so it's easy to find in the log).
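
One way to get that inode number on the client, assuming the file sits under a hypothetical /mnt/ceph mount:

ls -i /mnt/ceph/path/to/file        # prints "<inode> <filename>"
stat -c %i /mnt/ceph/path/to/file   # prints just the inode number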

Given the weirdness in the log fragment you attached, actually, let's make that debug ms = 20 and debug mds = 20 (so we can see whether the reconnect is due to the mds or something on the client).
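
In ceph.conf terms, that would look roughly like:

[mds]
debug mds = 20
debug ms = 20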

Thanks!

Actions #2

Updated by Roman Hlynovskiy over 11 years ago

Hello Sage,

I've added 4 logs:

1) Screen output from the console of the laggy client. It ends at 'jroger@pr02:~/data$ cp vo'; this is where it actually got completely stuck.
2) syslog output from the laggy client.
3) mds log with 20/20 for mds/ms debug, started before the mount from the stuck client and stopped 1-2 minutes after the client got stuck.
4) mds log with 20/20 for mds/ms debug, started after an mds restart while the working client was stuck.

Observations:

1) As soon as I restarted the mds (to disable the debug level), the stuck client became responsive, but the working client got completely stuck (the second mds log is from while it was stuck).
2) As soon as I completely rebooted the laggy client system, the working client transitioned from the stuck state back to working.

Actions #3

Updated by Sage Weil over 11 years ago

Hi Roman-

The logging levels are right, but in both mds logs neither mds was ever active; both were in the up:standby state the entire time. This is either because the active ceph-mds process wasn't restarted, or because we didn't wait long enough for the monitor to have one of the restarted daemons take over (by default it takes 15-20 seconds). Can you repeat the experiment, but also watch 'ceph -w' output and wait for the restarted ceph-mds process to go from up:standby all the way to up:active, and then once it's active reproduce the hang?
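
A sketch of that sequence, with a placeholder daemon name (mds.a); both 'ceph -w' and 'ceph mds stat' show the state transitions:

/etc/init.d/ceph restart mds.a   # or however ceph-mds is restarted on this node
ceph -w                          # watch until the restarted mds reaches up:active
ceph mds stat                    # e.g. "e42: 1/1/1 up {0=a=up:active}"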

Ping me (sagewk) in #ceph on irc.oftc.net if you need help.

Thanks!

Actions #4

Updated by Roman Hlynovskiy over 11 years ago

Hi Sage,

I am very sorry for taking your time with this issue; I feel like an idiot :(
The buggy client runs in a virtualized environment and uses a separate network address space for ceph communication.
I forgot to change the ceph-related IP address for this node after cloning the system, which was the cause of this problem.
Now it's obvious why the first client was getting stuck. I am pretty sure the problem with the mons going down is related to this misconfiguration.
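
For anyone hitting something similar after cloning a VM, a quick sanity check on the client (the exact config entries depend on the setup):

ip addr show                                   # does the clone have the intended address on the ceph network?
grep -E 'mon host|mon addr|public network' /etc/ceph/ceph.conf
dmesg | grep -i ceph                           # the kernel client reports connection trouble here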

cheers!

Actions #5

Updated by Sage Weil over 11 years ago

  • Status changed from Need More Info to Rejected

No worries. Let us know if you do come across behavior that looks like a bug!

Actions #6

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added