Project

General

Profile

Bug #1820

deprecate "ceph stop"

Added by Anonymous over 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

A good daemon supervision system would try to restart any daemons that just exited. For "ceph stop" to work in the world of upstart and systemd, it'd need to communicate with init to tell it that it wants to stay down. That's just not worth the effort; admins managing large clusters have tools to run commands on all machines anyway, let's not reimplement the wheel in Ceph, let's just remove this feature.

Also remove from doc/control.rst

Associated revisions

Revision 49588e94 (diff)
Added by Sage Weil about 8 years ago

osd: drop "stop" command

Send SIGINT.

Fixes: #1820
Signed-off-by: Sage Weil <>

History

#1 Updated by Sage Weil over 8 years ago

Iirc the real purpose is to make the daemon shut down cleanly. This is important for gprof, valgrind memcheck, etc. We could use a signal instead, or leave it undocumented ...

#2 Updated by Anonymous over 8 years ago

gcov is already using SIGTERM.

#3 Updated by Sage Weil over 8 years ago


void handle_signal(int signal)
{
  switch (signal) {
  case SIGTERM:
  case SIGINT:
#ifdef ENABLE_COVERAGE
    exit(0);
#else
    got_sigterm = true;
#endif
    break;
  }
}

Is exit(0) on SIGTERM acceptable behavior all of the time? or exit(1)?

It doesn't help for leak checking, where we should try to tear things down in an orderly fashion. Could do that, or use another signal.

In any case, the old shutdown stuff looks like its broken anyway.

#4 Updated by Anonymous over 8 years ago

exit(0) on SIGTERM is perfectly valid.

If we do need more than SIGUSR1 & SIGUSR2, the communication mechanism should be local, e.g. via a unix domain socket, not part of the "Ceph protocol".

Leak checking sounds like it could be done via valgrind, not needing anything in the Ceph daemons. http://valgrind.org/docs/manual/manual-core-adv.html#manual-core-adv.gdbserver-commandhandling

#5 Updated by Greg Farnum over 8 years ago

I don't see anything in teuthology sending stop commands to the OSDs; I believe the valgrind stuff just uses SIGTERM.

#6 Updated by Sage Weil over 8 years ago

none of this is tested anywhere.. it's for when you manually want to check for leaks, and need the osd to try to shut down, clean up, etc., and then see what is left on the heap. not that that's been done recently... :)

memcheck has two modes.. one just looks for referenced objects, and exit(0) is probably ok for that. the other dumps anything on the heap, but an orderly cleanup is needed for that.

i think for now we can just exit(0) (1?) for now, since the orderly shutdown isn't being triggered anyway (at least for the osd). that also means we can rip out the #ifdef

#7 Updated by Greg Farnum over 8 years ago

It's not being run because getting the parsing and isolating leaks is a pain, but there are teuthology tasks to run valgrind and look through the logs for leaks. They use SIGTERM, which ends up with CEPH_MSG_SHUTDOWN being passed through the messenger (that's what got_sigterm does; it's read in tick() and heartbeat()), and that triggers a clean shutdown via the shutdown() function...unless that's gotten broken in the last 3 months?

#8 Updated by Sage Weil about 8 years ago

  • Status changed from New to 4
  • Target version set to v0.43

wip-stop and wip-2090

#9 Updated by Anonymous about 8 years ago

Yeah. I can't speak for the threading & locking changes, but the command removal is trivial.

That still leaves

$ git grep -A 6 '"stop"' src/mon/MDSMonitor.cc
src/mon/MDSMonitor.cc:    if (m->cmd[1] == "stop" && m->cmd.size() > 2) {
src/mon/MDSMonitor.cc-      int who = atoi(m->cmd[2].c_str());
src/mon/MDSMonitor.cc-      if (!mdsmap.is_active(who)) {
src/mon/MDSMonitor.cc-  r = -EEXIST;
src/mon/MDSMonitor.cc-  ss << "mds." << who << " not active (" 
src/mon/MDSMonitor.cc-     << ceph_mds_state_name(mdsmap.get_state(who)) << ")";
src/mon/MDSMonitor.cc-      } else if ((mdsmap.get_root() == who ||

which seems to be a little bit more special.

#10 Updated by Sage Weil about 8 years ago

That one is a bit different.. it's instructing ceph-mds to export all of it's metadata to another node and leave the cluster. That's distinct from just shutting down so another ceph-mds can take over. We could rename it, I suppose, although now that "stop" is gone everywhere else it doesn't conflict with anything.

#11 Updated by Anonymous about 8 years ago

It sounds like that does two things: move the MDS from active to standby, and terminate it. And we're removing the "remote termination" feature all over the place. Perhaps that should be just "go_standby" or something like that?

I realize there might be a race condition there, you tell it to go standby, something sees you don't have enough active MDSes, and re-activating it. But that'll happen anyway if your daemon supervisor just restarts the ceph-mds process..

#12 Updated by Anonymous about 8 years ago

Which makes me think, is the concept of "go standby" of any value, if there's something that'll automatically say the opposite? If you really want your human decision to stick ("absolutely do not go active"), you might as well just terminate the process.

#13 Updated by Greg Farnum about 8 years ago

No, the important part is the hierarchy authority export. Then it shuts down; it's not a "go standby". I guess you could argue that it should go standby instead of shutting down, but to me that violates the principle of least surprise.
We don't have a separate concept anywhere of going from active to standby, and I don't think we should.

#14 Updated by Anonymous about 8 years ago

Greg, how is "ceph mds stop 0" different from that ceph-mds receiving a local request to terminate (e.g. SIGTERM)?

#15 Updated by Anonymous about 8 years ago

Tommi Virtanen wrote:

Greg, how is "ceph mds stop 0" different from that ceph-mds receiving a local request to terminate (e.g. SIGTERM)?

Make that "ceph mds stop $NUM". Not the root mds.

#16 Updated by Greg Farnum about 8 years ago

On termination the process exits. On receipt of a stop command it exports authority over the filesystem hierarchy to the rest of the MDS cluster, and then exits. Without that export of authority, another standby node needs to take over for the same logical MDS; with that export of authority it is in fact destroying the logical MDS.

#17 Updated by Sage Weil about 8 years ago

It can easily go back into standby (via the respawn() -> execve() path) instead of shutting down. Then it's really "leave" as in leave cluster, or "ostracize", or some similar verb... not "stop". Changing the actual command means updating the docs. For now we can support both commands?

#18 Updated by Anonymous about 8 years ago

Changing docs is easy, and the branches already rip out "documented" commands. Let's just make it make sense.

I would like to see some larger vision on how ceph-mds starting, kill -TERM, "ceph mds leave NUM" and the max_mds setting should interact with each other. You don't want to "ceph mds leave" a node just to have it hand off, exit, get restarted, and go back active. Perhaps set_max_mds and leave should co-operate better, somehow. Restating that a bit differently:

- "leave" is not needed if all you want is "move this active mds to some other node, this physical machine needs repair"
- "leave" is most valuable when you decreased max_mds by one just before it
- unless admin changes max_mds, "leave" followed by daemon restarting brings the mds back to active

So perhaps "leave" should manipulate max_mds?

I'm also slightly worried about "max mds" in ceph.conf vs "ceph mds set_max_mds". I think runtime editability for config options that are in fixed config files is a source of trouble. In general on a long-living system like a Ceph cluster, when an option is settable at runtime, I favor removing the config option completely and using the runtime controllability for setting it. Unless there is a huge performance reason against it. Duplication of information leads to confusion about authority.

#19 Updated by Anonymous about 8 years ago

Oh, I've talked of this before. It might be nice to have a "start ceph-mds only to process a leftover journal and hand it off" mode. That is:

1. start with 3 active mdses
2. stop ceph-mds on node2
3. take node2 out of the rack for repairs
4. realize you meant to make it leave before you turned it off

#20 Updated by Sage Weil about 8 years ago

yeah. i think the simplest is to make 'leave' refuse if it's is < max_mds.

and we could drop max mds from the ceph.conf, since it's only used during mkfs?

#21 Updated by Sage Weil about 8 years ago

ok, tested all this in wip-1820. 'deactivate' already moves the ceph-mds to standby (not exit), all good there.

no explicit 'activate' command yet, adding feature to tracker for that for later.

#22 Updated by Sage Weil about 8 years ago

  • Status changed from 4 to Resolved

Also available in: Atom PDF