Bug #14340
closed
client command stuck forever
Added by shun song over 8 years ago.
Updated over 8 years ago.
Description
1. Sometimes when I run "ceph -s", it hangs forever (in one case about three hours, until I killed it), yet the same command run in another terminal works normally.
2. Before the hang there were three mons and thirty OSDs; after the hang there were only two mons and thirty OSDs, all of which ran well.
3. Checking the journal of the dead mon shows it was killed by signal -9 (SIGKILL).
In my opinion it is odd that this simple command cannot complete while the cluster itself runs well. Worse, applications using Ceph cannot tell whether the cluster is merely slow to respond or has lost the command entirely, because the client is simply stuck.
- Priority changed from Urgent to Normal
This sounds odd; clients time out on monitors which aren't communicating. Is there a backtrace in the monitor log, or where are you seeing that it was killed by -9? What version are you running on the client and the servers?
The client does time out, but it just keeps waiting for the mon to reply and never resends the request. For communication with OSDs, by contrast, it times out and resends.
It is easy to reproduce: take a monitor down right after the client has successfully sent it a request (post-authentication). Since the client can never get a reply, it hangs.
As for signal -9: my system records it in the system log. If the pidfile for the corresponding mon is lost, that mon daemon cannot be stopped by "service ceph stop", so my test teammate had to kill it with -9.
I found that Ceph could handle this if the mon advertised the CEPH_FEATURE_MSGR_KEEPALIVE2 feature, but I don't know whether that is the proper fix.
The supported features are not needed here: the Policy adds in all the CEPH_FEATURE_SUPPORTED_DEFAULT features, which is everything known at compile time. (We should just drop the supported argument for Policy...)
- Status changed from New to Need More Info
Can you reproduce with "ceph -s --debug-ms 20 --debug-monc 20"?
- Status changed from Need More Info to Can't reproduce
Feel free to reopen if you have more information.
I think I have reproduced this situation, just as I was about to give up.
My reproduction is as follows:
1. Start a cluster with 3 mons and 1 OSD using vstart.sh.
2. Insert "return true;" into OSDMonitor::preprocess_query as below, then recompile:
bool OSDMonitor::preprocess_query(PaxosServiceMessage *m)
{
  dout(10) << "preprocess_query " << m << " from " << m->get_orig_source_inst() << dendl;
  return true;  // injected: claim the message is handled without ever replying
  switch (m->get_type()) {
    // READs
  case MSG_MON_COMMAND:
  ...
}
3. Kill the mon.c daemon and start mon.c again with the newly compiled ceph-mon. As a result, the mon.a and mon.b daemons can still handle commands about OSDs successfully, but mon.c (port 6791) cannot.
4. When an update-crush-map command is sent to mon.c, the client hangs with a flood of keep_alive ticks, but the monclient never tries to resend the command or to send it to another mon.
- detailed log attached