Bug #777 (closed): mount hung, tid timed out messages in log

Added by John Leach about 13 years ago. Updated about 13 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 100%

Description

I have a ceph cluster with 3 mons, 1 mds and 4 osds. I mounted the ceph filesystem on another machine using the default 2.6.37 kernel client and wrote a script to create a million files, each between 1k and 4k in size. It ran for some unknown amount of time (at a guess, at least 30 to 60 mins) putting load on the cluster, but at some point my script process got stuck in D state and the mount became inaccessible (I could see the free space, but any kind of read or write operation on the mount stuck the process in D state) and all load ceased.

In the kernel log on the client I see messages like "tid 405965 timed out on osd1, will reset osd" (see client.log). You'll also see where I tried restarting the whole ceph cluster (though not the client) in an attempt to get the mount working again (with no luck).

Logs for 3 of the 4 osds attached (osd0 never seemed to have any timed out messages).

I started the test at around 12:30 on 4th Feb. I turned osd debugging on at about 23:36 on 4th Feb and shut the cluster down to go to bed at around 00:45 on 5th Feb.

I've also included a tarball of the /sys/kernel/debug/ceph directory on the client (though this was made today, many hours after the cluster had been shut down).


Files

cephlogs.tar.bz2 (1.78 MB), John Leach, 02/05/2011 10:45 AM
mds-commit-fail.log (12.9 KB), John Leach, 02/09/2011 03:38 PM

Subtasks 2 (0 open, 2 closed)

Tasks #796: Let CDir::_commit_full write in pieces (Resolved, Greg Farnum, 02/10/2011)
CephFS - Tasks #797: Don't _commit_full just because dir is_complete() (Resolved, Greg Farnum, 02/10/2011)

Actions #1

Updated by Greg Farnum about 13 years ago

I haven't fully determined everything going on here, but were you using anything besides the kernel client with this cluster?
Each OSD log you provided has at least one operation that it says is too large to fit into the journal. We've actually been working on better reporting of errors like that back to the client, but it's the most problematic thing I've noticed so far that looks like it could be causing the problem.

Also the OSDs aren't managing to peer their PGs, but that's going to be a symptom of something else (possibly this journal-too-small problem, haven't traced it all the way out yet).

Actions #2

Updated by Greg Farnum about 13 years ago

  • Status changed from New to Resolved

In the absence of other evidence I'm going to go ahead and mark this as solved.

Actions #3

Updated by John Leach about 13 years ago

Only the kernel client was in use, so I'm assuming it was the mds that made a write bigger than the journal size (120meg). I've wiped out the cluster and am trying to reproduce.

Actions #4

Updated by Greg Farnum about 13 years ago

  • Status changed from Resolved to In Progress

I don't think the MDS can do that (and if it can, we need to put blocks in place to prevent that). I guess I'll need to dig back into the logs. Reopening for further investigation!

Actions #5

Updated by John Leach about 13 years ago

I can reproduce this:

  • format new cluster (in my case: 3 mons, 2 mds, 4 osds)
  • mount filesystem using kernel client (2.6.37 vanilla)
  • mkdir files
  • cd files
  • ruby -e '1000000.times { |i| s = i.to_s(24) ; open(s, "w") { |f| puts s ; f.write(s * 1024) } }'

that last line attempts to create a million little files. Along the way, the active mds seems to restart (I'm assuming a crash) and the other mds takes over, but at this time a large write is made to two of the osds which fills the journal (my journal size is 300meg).
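
Spelled out with comments, that one-liner is equivalent to something like this:

    # Create a million small files in the current directory. Each file is
    # named by the counter in base 24 and contains that name repeated 1024
    # times, so each file ends up a few KB in size.
    1_000_000.times do |i|
      s = i.to_s(24)
      open(s, "w") do |f|
        puts s            # progress output, one name per line
        f.write(s * 1024) # file contents: the name repeated 1024 times
      end
    end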

the cmds process quickly grows to 2 gigs of ram, btw.

Will attach some debug logs if I can get them.

Actions #6

Updated by Greg Farnum about 13 years ago

Do you have any core dumps from the MDS? Or is this running in a memory-constrained system so it's getting OOM-killed?

Actions #7

Updated by John Leach about 13 years ago

got a debug log from the mds now. Looks like the mds decides to commit its data to the osds and something goes wrong. I snipped the log at what looked like an appropriate place, hope I didn't miss anything important.

Actions #8

Updated by John Leach about 13 years ago

to be clear, this is not resulting in the "tid timed out" messages now, just the journal hangs, so it might be a separate issue. The client kernel blocks until the mds fails over and work continues (I suppose it might get to the timed-out messages once all the osds are stuck?)

Actions #9

Updated by John Leach about 13 years ago

no ooms (the boxes have 6 gigs of ram), and I'm now relatively sure the mds is not crashing; it seems to be an orderly respawn (looks like it might be getting blacklisted for being laggy?)

Actions #10

Updated by Greg Farnum about 13 years ago

.....oh. So you're trying to create a million files in a single directory?

Ceph stores inodes in the directory they belong to. I don't know exactly how much on-disk space each inode takes, but in-memory they take ~2KB. Assuming the numbers are similar, a 300MB journal can handle a directory of ~153k inodes.
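
Back-of-the-envelope, for anyone checking the arithmetic (Ruby, but it's just division):

    journal_bytes = 300 * 1024 * 1024   # 300MB journal
    inode_bytes   = 2 * 1024            # ~2KB per inode (in-memory figure)
    puts journal_bytes / inode_bytes    # => 153600, i.e. ~153k inodes
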
So, I think what's happening is this:
1) Client creates lots and lots of files in directory.
2) MDS keeps up for a while, but eventually tries to flush out a directory that won't fit into the OSD journal
3) OSD journal fails somehow. (In recent code it should return an error code, but in older code I think it fails silently)
4) MDS either is running slowly, or gets stuck waiting for ack from OSD and eventually gets killed by mon. (This could be due to a blocking behavior that shouldn't exist, actually! But a separate issue and Sage says not likely.)
5) MDS gets marked down and respawns itself.
6) Loop as kernel client replays with MDS, MDS tries to flush out to OSDs, etc etc etc

Something like this anyway! Sage says that this won't be an issue once we get directory fragmentation stable (large directories will fragment into multiple pieces that are committed independently, etc), and this will be done by 1.0, but at the moment there's not a lot to be done about it, except limit the size of your directories to something that fits within your journal. :(
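
In the meantime, here's a rough sketch of that workaround (the bucket names are made up, otherwise it's the same write pattern as your test) that keeps each directory well under the ~153k estimate:

    require 'fileutils'

    FILES_PER_DIR = 100_000   # comfortably below the ~153k per-directory estimate

    1_000_000.times do |i|
      dir = format("bucket%02d", i / FILES_PER_DIR)   # bucket00 .. bucket09
      FileUtils.mkdir_p(dir)                          # no-op if it already exists
      name = i.to_s(24)
      File.open(File.join(dir, name), "w") { |f| f.write(name * 1024) }
    end

Ten buckets of 100k files each instead of one directory with a million.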

Actions #11

Updated by Greg Farnum about 13 years ago

  • Status changed from In Progress to Duplicate

It's not exactly a duplicate, but #594 is the latest open bug on fragment split/merge. When that's done and fragmentation is enabled by default this test case should work!

(Or, if you're feeling brave, you can turn on fragmentation now -- it's in the code but you run the risk of things going horribly wrong if an MDS crashes while directories are fragmented.)
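
If I'm remembering the option name right, turning it on is something like this in ceph.conf (from memory -- double-check against your build before relying on it):

    [mds]
        mds bal frag = true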

Actions #12

Updated by Sage Weil about 13 years ago

  • Status changed from Duplicate to In Progress

Greg Farnum wrote:

It's not exactly a duplicate, but #594 is the latest open bug on fragment split/merge. When that's done and fragmentation is enabled by default this test case should work!

(Or, if you're feeling brave, you can turn on fragmentation now -- it's in the code but you run the risk of things going horribly wrong if an MDS crashes while directories are fragmented.)

Just to clarify: the problem is when fragmentation is enabled AND you have multiple MDSs AND one of them fails/restarts. If you have a single MDS you can safely enable frags (aside from general concerns about this code being less well tested :).

And after thinking about it a bit, Greg, _commit_full should be pretty easily fixed up to behave even without fragmentation just by doing multiple writes with some max size in series. I'm pretty sure it doesn't have to be a single atomic commit because of the way the dentry/inode versioning works. Let's look at it tomorrow!
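
Roughly the shape I mean, sketched in Ruby just to illustrate the idea (the real change is in CDir::_commit_full, in C++; write_in_pieces, max_write_size and the block are stand-ins, not real names):

    # Split one big encoded dirfrag into bounded writes issued in series,
    # instead of a single giant commit.
    def write_in_pieces(data, max_write_size)
      offset = 0
      while offset < data.bytesize
        chunk = data.byteslice(offset, max_write_size)
        yield chunk, offset        # the caller turns each piece into an OSD write
        offset += chunk.bytesize
      end
    end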

Actions #13

Updated by Greg Farnum about 13 years ago

Sage Weil wrote:

And after thinking about it a bit, Greg, _commit_full should be pretty easily fixed up to behave even without fragmentation just by doing multiple writes with some max size in series. I'm pretty sure it doesn't have to be a single atomic commit because of the way the dentry/inode versioning works. Let's look at it tomorrow!

All right, this should be handled in 3129581e89cb38f9ad2ff85673e50a033ec77d40 which is in the master branch. Let me know how it goes. :)

Actions #14

Updated by John Leach about 13 years ago

Tested with a build of master and it looks good. I'm writing files to a dir atm and am up to 430,000 entries and still going.

Actions #15

Updated by Greg Farnum about 13 years ago

  • Status changed from In Progress to Resolved

All right, looks like this is good!
