Bug #14910

cephtool/test.sh: test_mon_tell test unreliable

Added by Sage Weil about 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph tell mon.foo may get redirected to another mon

/a/sage-2016-02-26_07:11:10-rados-wip-sage-testing---basic-smithi/28922

Associated revisions

Revision 6eff39af
Added by Kefu Chai about 8 years ago

qa/workunits/cephtool/test.sh: wait longer in ceph_watch_start()

"ceph --watch-debug" and "ceph tell mon.foo version" could connect
to different monitors, and there is chance that "ceph --watch-debug"
is not connected yet when "ceph tell" completes, and hence the former
fails to collect the cluster log including the "ceph tell" related
message. this renders test_mon_tell() unreliable. so, in
ceph_watch_start(), we should wait until the "ceph" cli connects to the
monitor and receives messages from it.

Fixes: #14910
Signed-off-by: Kefu Chai <>
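
A minimal sketch of the idea behind that change, assuming ceph_watch_start() starts the watcher in the background and redirects its output to a temporary file; the variable names and the 30-second timeout below are illustrative, not the literal patch:

    # hypothetical ceph_watch_start() with the extra wait added
    ceph_watch_start() {
        CEPH_WATCH_FILE=$(mktemp)
        ceph --watch-debug > "$CEPH_WATCH_FILE" &
        CEPH_WATCH_PID=$!
        # wait until the "ceph" CLI has connected to a monitor and
        # received at least one cluster log message, so later greps
        # on $CEPH_WATCH_FILE cannot race with the connection
        for i in $(seq 30); do
            [ -s "$CEPH_WATCH_FILE" ] && break
            sleep 1
        done
    }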

History

#1 Updated by Kefu Chai about 8 years ago

In test_mon_tell(), we try to verify that "ceph tell mon.foo version" is always handled by mon.foo, and the ceph CLI always sends a "ceph status" to the cluster before it starts checking the cluster debug log with "ceph --watch-debug". In this run the check succeeded for mon.a but failed for mon.b. The reason is not that "ceph tell mon.foo" got redirected to another mon; it is that "ceph --watch-debug" connected too late to collect the debug cluster log.

remote/smithi039/log/ceph-mon.a.log.gz:2016-02-26 21:44:34.495863 7fe854319700 10 mon.a@0(leader).log v535 logging 2016-02-26 21:44:34.495502 mon.0 172.21.15.39:6789/0 2201 : audit [DBG] from='client.? 172.21.15.39:0/2692516324' entity='client.admin' cmd=[{"prefix": "status"}]: dispatch

remote/smithi039/log/ceph-mon.a.log.gz:2016-02-26 21:44:34.548926 7fe854319700 10 mon.a@0(leader).log v535 logging 2016-02-26 21:44:34.548625 mon.0 172.21.15.39:6789/0 2202 : audit [DBG] from='client.? 172.21.15.39:0/331547002' entity='client.admin' cmd=[{"prefix": "version"}]: dispatch

remote/smithi039/log/ceph-mon.a.log.gz:2016-02-26 21:44:35.811061 7fe854319700 10 mon.a@0(leader).log v536 logging 2016-02-26 21:44:35.807152 mon.1 172.21.15.39:6790/0 438 : audit [DBG] from='client.? 172.21.15.39:0/3322545253' entity='client.admin' cmd=[{"prefix": "version"}]: dispatch

remote/smithi039/log/ceph-mon.a.log.gz:2016-02-26 21:44:38.673913 7fe854319700 10 mon.a@0(leader).log v538 logging 2016-02-26 21:44:38.673158 mon.2 172.21.15.39:6791/0 474 : audit [DBG] from='client.? 172.21.15.39:0/3876206733' entity='client.admin' cmd=[{"prefix": "status"}]: dispatch

As the log above shows, the second "ceph status" was handled by mon.c at 21:44:38.673913, and the monclient only connected at 2016-02-26 21:44:38.666701,

while the cluster had already populated the audit log with the "ceph version" entry from mon.1 at 21:44:35.811061.

And on mon.c:

2016-02-26 21:44:36.162692 7fc6c81bb700 7 mon.c@2(peon).log v537 update_from_paxos applying incremental log 537 2016-02-26 21:44:35.844161 mon.1 172.21.15.39:6790/0 439 : audit [DBG] from='client.? 172.21.15.39:0/3322545253' entity='client.admin' cmd=[{"prefix": "version"}]: dispatch

2016-02-26 21:44:38.673022 7fc6c81bb700 1 -- 172.21.15.39:6791/0 <== client.4840 172.21.15.39:0/3876206733 6 ==== mon_command({"prefix": "status"} v 0) v1 ==== 62+0+0 (2889395799 0 0) 0x7fc6da707000 con 0x7fc6daa09500

also, the "ceph status" was way too late to collect the "ceph version" log.

#2 Updated by Kefu Chai about 8 years ago

As a fix, we might need to wait until "ceph --watch-debug" prints something (maybe the word "cluster") before calling "ceph tell mon.foo version" and expecting the audit entry to show up in the cluster log's "debug" channel.

#3 Updated by Kefu Chai about 8 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Kefu Chai

#4 Updated by Sage Weil about 8 years ago

  • Status changed from Fix Under Review to Resolved
