Bug #1702

closed

Ceph MDS crash + client mount problem

Added by Gokul Krishnan over 12 years ago. Updated over 7 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%


Description

Hello,
I have configured Ceph using the configuration shown here: [[http://pastebin.com/sQb8WZbx]].

The Ceph server starts fine, but after some time, about 30 minutes, it crashes; only the MDS crashes, not the OSD or MON daemons.
I have pasted the MDS log here: [[http://pastebin.com/4bMmG5R6]].

Also, the problem occurs more frequently, and sooner, when clients are mounted.

Mounting clients is also an issue; a dmesg call on a client shows messages like "[727228.500629] libceph: bad fsid, had 8d660ac8-a98a-6b0c-4b30-e09ed57e6d62 got 952ae1bf-8c99-8354-693b-16c5e0e8b945".
Here 8d660ac8-a9... is the fsid of the server instance started prior to the current one, and 952ae1bf-8c... is the new fsid.
I have tried to supply the fsid to the clients via mount options (using the command "mount -t ceph <IP_CEPH_MON> <Mount_Point> -o fsid=<Latest_FSID>"), but this does not always work!
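As an aside, the two UUIDs in that dmesg line can be pulled apart with a small shell snippet (purely illustrative, not part of any Ceph tooling), which makes it easy to compare the fsid the kernel client has cached against the one the cluster now reports:

```shell
# Illustrative only: extract the "had" (client-cached) and "got" (current
# cluster) fsids from a libceph "bad fsid" dmesg line.
line='[727228.500629] libceph: bad fsid, had 8d660ac8-a98a-6b0c-4b30-e09ed57e6d62 got 952ae1bf-8c99-8354-693b-16c5e0e8b945'
had=$(printf '%s\n' "$line" | sed -n 's/.*had \([0-9a-f-]*\) got.*/\1/p')
got=$(printf '%s\n' "$line" | sed -n 's/.*got \([0-9a-f-]*\).*/\1/p')
echo "client cached fsid: $had"
echo "cluster fsid now:   $got"
```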

Actions #1

Updated by Sage Weil over 12 years ago

  • Category set to 1
  • Status changed from New to Need More Info
  • Assignee set to Sage Weil
  • Target version set to v0.39

Are you able to reproduce this with 'debug mds = 20' and 'debug ms = 20' in your ceph.conf [mds section]?
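For reference, the change being requested would look something like this in ceph.conf (a minimal fragment; all other settings elided):

```ini
[mds]
        ; verbose MDS and messenger logging, as requested
        debug mds = 20
        debug ms = 20
```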

Not sure about the fsid part yet, but any bad behavior there shouldn't crash the MDS, so I want to fix that first.

Thanks!

Actions #2

Updated by Gokul Krishnan over 12 years ago

Hello,
thank you for the reply.

No, unfortunately I am not able to reproduce the error using debug ms = 20 (for the MDS)... here is the MDS log: [[http://pastebin.com/neuJ3kfB]]

Luckily the MDS is not crashing anymore... at least so far :)

But the clients are still not able to mount...

Thanks for the info...

Actions #3

Updated by Sage Weil over 12 years ago

OK, so generally speaking, the only time you should see fsid mismatches like that is if you have daemons from multiple clusters (or multiple versions of the cluster) running and the client is getting confused. Make sure you kill off all your daemons and restart. (The fsid is randomly generated during mkcephfs, so you probably ran mkcephfs, started things up, ran it again without stopping all the daemons, and then started some new daemons.)

Actions #4

Updated by Gokul Krishnan over 12 years ago

Thank you for replying so quickly.

Well, in my scenario I just have one Ceph server running. And yes, every time we do a mkcephfs, a new fsid is generated, but the clients are somehow storing the fsid values and are not able to update them for the next session. I just have a single cluster, so I avoid the problem of multiple fsids being active at any instant. Is there a way to flush the client cache?

Actually, what did you mean by "without stopping all daemons"? Can you please explain this a bit more clearly? I failed to understand it. Did you mean the mon and osd daemons?

Moreover, I noticed that when the Ceph server is started, it is quite stable on its own. But when a client tries to mount, whether successfully or in vain, the probability of the MDS crashing becomes very high.

Actions #5

Updated by Gokul Krishnan over 12 years ago

By the way,
you have assigned a target version of v0.39... but on the site I can only find the source for v0.37...
Even if v0.39 is under development, where is v0.38?
Thank you.

Actions #6

Updated by Sage Weil over 12 years ago

Gokul Krishnan wrote:

By the way,
you have assigned a target version of v0.39... but on the site I can only find the source for v0.37...
Even if v0.39 is under development, where is v0.38?
Thank you.

Oops, with all the travel it slipped my mind. Building 0.38 now.

Actions #7

Updated by Sage Weil over 12 years ago

Gokul Krishnan wrote:

Thank you for replying so quickly.

Well, in my scenario I just have one Ceph server running. And yes, every time we do a mkcephfs, a new fsid is generated, but the clients are somehow storing the fsid values and are not able to update them for the next session. I just have a single cluster, so I avoid the problem of multiple fsids being active at any instant. Is there a way to flush the client cache?

Oh... are you stopping/restarting the clients? If you have a client mounting an old fs and you run mkcephfs to replace the fs, the client will be thoroughly confused. You need to umount before mkcephfs, or, if you didn't, restart the clients.

Actually, what did you mean by "without stopping all daemons"? Can you please explain this a bit more clearly? I failed to understand it. Did you mean the mon and osd daemons?

Moreover, I noticed that when the Ceph server is started, it is quite stable on its own. But when a client tries to mount, whether successfully or in vain, the probability of the MDS crashing becomes very high.
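One hedged sanity check for the mkcephfs workflow discussed above: before re-running mkcephfs, verify that no daemons from the previous cluster instance are still alive, since a leftover daemon keeps serving the old fsid. A sketch (the daemon names cmon, cosd, and cmds are assumed for this 0.3x-era Ceph):

```shell
# Hypothetical helper: warn about daemons left over from a previous
# cluster instance before re-running mkcephfs. Daemon names (cmon,
# cosd, cmds) are an assumption for the Ceph version discussed here.
leftover=0
for d in cmon cosd cmds; do
    if pgrep -x "$d" >/dev/null 2>&1; then
        echo "warning: $d still running; stop it before mkcephfs"
        leftover=1
    fi
done
if [ "$leftover" -eq 0 ]; then
    echo "no leftover ceph daemons"
fi
```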

Actions #8

Updated by Gokul Krishnan over 12 years ago

Yes, I am stopping the clients and remounting... and if I'm doing a mkcephfs, I make sure to umount all the clients before stopping the Ceph daemons.
But yes, I don't restart the clients... I assume the Ceph server has been designed to handle this case (all clients unmounted, server stopped, mkcephfs run again (generating a new fs), Ceph daemons started, and clients mounted again with the new fsid), hasn't it?

Actions #9

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.39 to v0.40
Actions #10

Updated by Sage Weil over 12 years ago

  • Status changed from Need More Info to Can't reproduce
Actions #11

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.40)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
