Bug #14340 (closed): client command stuck forever

Added by shun song over 8 years ago. Updated about 8 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

1. Sometimes when I use "ceph -s", it gets stuck forever (about three hours, until I kill it), but when I run the same command in another terminal it works normally.
2. Before running the command there were three mons and thirty OSDs; after the hang there are only two mons and thirty OSDs, which run well.
3. Checking the journal of the dead mon shows it was killed with signal -9.

In my opinion it is odd that this simple command cannot complete while the cluster is running well. Thinking further, applications using Ceph may be unable to tell whether the cluster is just slow to respond or whether the command has been lost because the Ceph client is stuck.
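For illustration only (not from the original report): an application that shells out to "ceph -s" can only tell a slow cluster apart from a stuck client by enforcing its own deadline. The sketch below is a hypothetical POSIX guard; the helper name run_with_deadline and the 30-second limit are made-up example values, not anything in Ceph.

#include <chrono>
#include <iostream>
#include <thread>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not Ceph code): run a command but kill it after `deadline`,
// so a stuck client cannot block the calling application indefinitely.
int run_with_deadline(char *const argv[], std::chrono::seconds deadline) {
  pid_t pid = fork();
  if (pid < 0)
    return -1;                          // fork failed
  if (pid == 0) {                       // child: exec the command
    execvp(argv[0], argv);
    _exit(127);                         // exec failed
  }
  auto start = std::chrono::steady_clock::now();
  int status = 0;
  for (;;) {
    pid_t r = waitpid(pid, &status, WNOHANG);
    if (r == pid)
      return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    if (r < 0)
      return -1;                        // waitpid error
    if (std::chrono::steady_clock::now() - start > deadline) {
      kill(pid, SIGKILL);               // give up: treat the client as stuck
      waitpid(pid, &status, 0);
      return -1;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
}

int main() {
  char *const argv[] = {(char *)"ceph", (char *)"-s", nullptr};
  int rc = run_with_deadline(argv, std::chrono::seconds(30));  // arbitrary deadline
  std::cout << (rc == 0 ? "cluster answered\n" : "ceph -s failed or got stuck\n");
  return rc;
}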

Actions #1

Updated by Greg Farnum over 8 years ago

  • Priority changed from Urgent to Normal

This sounds odd; clients time out on monitors which aren't communicating. Is there a backtrace in the monitor log, or where are you seeing that it was killed by -9? What version are you running on the client and the servers?

Actions #2

Updated by shun song over 8 years ago

The clients do time out, but they just wait for the mon to reply and never resend their requests. For communication with OSDs, the client times out and resends requests.
It is easy to reproduce: take a monitor down after the client has successfully sent a request to that mon following authentication. Since the client can never get a reply, it gets stuck.
As for signal -9, my system records it in the system log. If the pidfile for the corresponding mon is lost, that mon daemon cannot be stopped by "service ceph stop", so my test teammate had to kill it with -9.
I found Ceph could handle this by having the mon support the CEPH_FEATURE_MSGR_KEEPALIVE2 feature, but I don't know whether that is the proper fix.
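For illustration only (none of these types exist in Ceph; they only sketch the behaviour described above): the difference reported here is between a client that, after a reply timeout, re-targets another monitor and resends the pending request, and one that sends once and waits forever on the same monitor.

#include <chrono>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

struct Mon {
  std::string name;
  bool alive;   // a dead or unresponsive monitor never replies
};

// Fake transport: replies immediately if the monitor is alive, otherwise
// "times out" (returns nothing). Stands in for a real RPC with a deadline.
std::optional<std::string> send_and_wait(const Mon &mon, const std::string &req,
                                         std::chrono::seconds /*timeout*/) {
  if (mon.alive)
    return mon.name + " handled: " + req;
  return std::nullopt;
}

// Expected behaviour: on timeout, resend the pending request to the next mon.
std::optional<std::string> send_with_failover(const std::vector<Mon> &mons,
                                              const std::string &req) {
  for (const auto &mon : mons)
    if (auto reply = send_and_wait(mon, req, std::chrono::seconds(10)))
      return reply;
  return std::nullopt;  // every monitor timed out
}

int main() {
  std::vector<Mon> mons = {{"mon.c", false}, {"mon.a", true}, {"mon.b", true}};
  // Observed behaviour: the command is sent once to mon.c and the client waits
  // forever. Expected behaviour: hunt to the next monitor, as below.
  if (auto reply = send_with_failover(mons, "status"))
    std::cout << *reply << "\n";
  else
    std::cout << "no monitor replied\n";
  return 0;
}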

Actions #3

Updated by shun song over 8 years ago

A pull request has been opened at https://github.com/shun-s/ceph-1/pull/1, please take a look.

Actions #4

Updated by Sage Weil over 8 years ago

The supported features are not needed here; the Policy adds in all the CEPH_FEATURE_SUPPORTED_DEFAULT features, which is everything known at compile time. (We should just drop the supported argument for Policy...)
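A hedged sketch of the point being made here: if the policy constructor ORs a compile-time default feature mask into the supported set, then passing extra supported bits at the call site adds nothing. The struct and constants below are simplified stand-ins, not the actual Ceph Messenger::Policy definition.

#include <cstdint>

// Stand-in for the compile-time mask of every feature the build knows about.
constexpr uint64_t FEATURES_SUPPORTED_DEFAULT = 0xffffffffffffffffULL;
constexpr uint64_t FEATURE_MSGR_KEEPALIVE2    = 1ULL << 42;  // arbitrary example bit

struct Policy {
  uint64_t features_supported;
  uint64_t features_required;

  Policy(uint64_t supported, uint64_t required)
      // Everything known at compile time is always advertised as supported,
      // so the explicit `supported` argument adds nothing new.
      : features_supported(supported | FEATURES_SUPPORTED_DEFAULT),
        features_required(required) {}
};

int main() {
  Policy with_extra(FEATURE_MSGR_KEEPALIVE2, 0);
  Policy without_extra(0, 0);
  // Both policies end up advertising the same supported feature set.
  return with_extra.features_supported == without_extra.features_supported ? 0 : 1;
}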

Actions #5

Updated by Sage Weil over 8 years ago

  • Status changed from New to Need More Info

Can you reproduce with "ceph -s --debug-ms 20 --debug-monc 20"?

Actions #6

Updated by Samuel Just over 8 years ago

  • Status changed from Need More Info to Can't reproduce

Feel free to reopen if you have more information.

Actions #7

Updated by shun song about 8 years ago

I think I have reproduced this situation just as I was about to give up.
My reproduction is as follows:
1. Start a cluster with 3 mons and 1 osd using vstart.sh.
2. Insert "return true;" into OSDMonitor::preprocess_query as shown below, and then recompile:
bool OSDMonitor::preprocess_query(PaxosServiceMessage *m) {
  dout(10) << "preprocess_query " << m << " from " << m->get_orig_source_inst() << dendl;
  return true;   // injected early return: this mon silently drops every query

  switch (m->get_type()) {
    // READs
  case MSG_MON_COMMAND:
  ...
  }
3. Kill the mon.c daemon and restart mon.c with the newly compiled ceph-mon. As a result, mon.a and mon.b still handle commands about OSDs successfully, but mon.c (port 6791) cannot.
4. When an "update crush map" command is sent to mon.c, the client gets stuck with endless keep_alive ticks, but the monclient never tries to resend the command or to send it to another mon (a hypothetical sketch of that missing behaviour follows below).

A detailed log is attached.
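For illustration only (these names are hypothetical, not the Ceph MonClient API): the behaviour missing in step 4 would amount to a periodic tick that does more than send keepalives; if a command has been pending past some threshold, it re-targets another monitor and resends instead of waiting forever. A minimal sketch of that idea:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

struct PendingCommand {
  std::string payload;
  Clock::time_point sent_at;
};

struct MiniMonClient {
  std::vector<std::string> mons{"mon.a", "mon.b", "mon.c"};
  std::size_t current = 2;                // currently connected to mon.c
  std::vector<PendingCommand> pending;
  std::chrono::seconds resend_after{10};  // arbitrary example threshold

  void send_keepalive() { /* keepalives alone do not detect an unresponsive mon */ }

  // Called periodically. This is the behaviour the report says is missing:
  // stale commands are resent to a different monitor rather than left hanging.
  void tick() {
    send_keepalive();
    auto now = Clock::now();
    for (auto &cmd : pending) {
      if (now - cmd.sent_at > resend_after) {
        current = (current + 1) % mons.size();   // hunt to another monitor
        std::cout << "resending '" << cmd.payload << "' to " << mons[current] << "\n";
        cmd.sent_at = now;                       // pretend we resent it just now
      }
    }
  }
};

int main() {
  MiniMonClient mc;
  mc.pending.push_back({"update crush map", Clock::now() - std::chrono::seconds(30)});
  mc.tick();   // the stale command gets re-targeted instead of waiting forever
  return 0;
}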