Bug #18467
closed
Added by Sage Weil over 7 years ago.
Updated almost 7 years ago.
Category: Correctness/Safety
Component(RADOS): MonClient
Description
2017-01-01T05:21:39.480 INFO:tasks.workunit.client.0.smithi104.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/test.sh:1835: test_mon_ping: ceph ping 'mon.*'
2017-01-01T05:21:39.482 INFO:tasks.workunit.client.0.smithi104.stderr:2017-01-01 05:21:38.855920 7fbb055a7700 -1 WARNING: all dangerous and experimental features are enabled.
2017-01-01T05:21:39.484 INFO:tasks.workunit.client.0.smithi104.stderr:2017-01-01 05:21:38.863141 7fbb055a7700 -1 WARNING: all dangerous and experimental features are enabled.
2017-01-01T05:21:41.868 INFO:tasks.workunit.client.0.smithi104.stderr:2017-01-01 05:21:41.867129 7fbafcd76700 0 monclient: hunting for new mon
2017-01-01T05:21:41.874 INFO:tasks.workunit.client.0.smithi104.stderr:2017-01-01 05:21:41.872998 7fbafed7a700 0 -- 172.21.15.104:0/4251581445 >> 172.21.15.104:6789/0 conn(0x7fbb0013a3b0 :-1 s=STATE_OPEN pgs=1056 cs=1 l=1).read_until injecting socket failure
2017-01-01T05:21:41.877 INFO:tasks.workunit.client.0.smithi104.stderr:Error connecting to cluster: TypeError
2017-01-01T05:21:41.889 INFO:tasks.workunit.client.0.smithi104.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/test.sh:1: test_mon_ping: rm -fr /tmp/cephtool.92D
/a/sage-2017-01-01_04:15:28-rados-wip-sage-testing---basic-smithi/679829
Presumably the socket failure injection is what triggered the issue, which would also explain why the test usually passes.
The offending code in qa/workunits/cephtool/test.sh is:
function test_mon_ping()
{
  ceph ping mon.a
  ceph ping mon.b
  expect_false ceph ping mon.foo
  ceph ping mon.\*
}
I guess it should run the "ceph ping" commands, say, five times and sleep 5 seconds between each try?
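A minimal sketch of that retry idea, for illustration only. The `retry` helper and the `test_mon_ping_with_retry` name are hypothetical and not part of test.sh; `expect_false` is the helper the test already uses. The counts (five attempts, 5 seconds apart) are simply the values floated above.

```shell
# Hypothetical helper (not part of test.sh): run a command up to $1 times,
# sleeping $2 seconds after each failed attempt except the last one.
retry() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Sketch of the test with retries on the pings that are expected to succeed.
# The negative case is left unchanged: retrying an expected failure would
# only slow the test down.
test_mon_ping_with_retry() {
    retry 5 5 ceph ping mon.a
    retry 5 5 ceph ping mon.b
    expect_false ceph ping mon.foo
    retry 5 5 ceph ping 'mon.*'
}
```

Note that this only papers over the failure in the test; it would not change how the client itself reacts to an injected socket error.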
Answer:
(11:28:12 PM) sage: smithfarm: hmm, i think the ping socket error should trigger a reconnect attempt and be masked at that level
(11:28:49 PM) sage: probably need to reproduce it with a client log debug objecter = 20 and debug ms = 20 log with the socket error injection turned up
That's with "ms inject socket failures: 500" which is unchanged. What's a reasonable higher value to try - 1000? 5000?
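One possible way to go about the reproduction is to loop over "ceph ping" and count how often it fails at a given injection setting. This is a sketch under assumptions: `count_ping_failures`, `CEPH_BIN`, and `PING_LOG` are hypothetical names, and the commented usage line assumes the config options can be overridden on the ceph command line. The numbers in it are illustrative, not recommendations.

```shell
# Hypothetical helper: run "ceph ping <mon>" $1 times and print the number of
# failed runs. CEPH_BIN (assumed override) selects the binary, e.g. ./bin/ceph
# in a vstart cluster; PING_LOG (assumed override) collects the client output.
# Any extra arguments are passed through to the ceph command unchanged.
count_ping_failures() {
    local runs="$1" mon="$2" failures=0 i=1
    shift 2
    while [ "$i" -le "$runs" ]; do
        if ! "${CEPH_BIN:-ceph}" ping "$mon" "$@" >>"${PING_LOG:-/dev/null}" 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Example (assumes these config overrides are accepted as command-line flags):
# CEPH_BIN=./bin/ceph PING_LOG=/tmp/ping.log count_ping_failures 50 mon.a \
#     --ms-inject-socket-failures 50 --debug-ms 20 --debug-objecter 20
```

A non-zero count together with the collected log would give the debug output asked for in the chat above.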
@Nathan Weinberg
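The value should be lower, not higher, so that the fault is easier to reproduce (the option injects a failure roughly once in every N socket operations). With the current code, a socket error during ping now triggers an automatic reconnect, as this run shows: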
./bin/ceph ping mon.a
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2017-04-05 17:44:41.316100 7f1ee1f1b700 0 -- - >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=0)._try_send injecting socket failure
2017-04-05 17:44:41.517114 7f1ee1f1b700 0 -- - >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0 l=0)._try_send injecting socket failure
2017-04-05 17:44:41.918417 7f1ee1f1b700 0 -- 127.0.0.1:0/2584501559 >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0).read_until injecting socket failure
2017-04-05 17:44:42.720221 7f1ee1f1b700 0 -- 127.0.0.1:0/2584501559 >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=0).read_until injecting socket failure
2017-04-05 17:44:44.322370 7f1ee1f1b700 0 -- 127.0.0.1:0/2584501559 >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0)._try_send injecting socket failure
2017-04-05 17:44:47.525169 7f1ee1f1b700 0 -- 127.0.0.1:0/2584501559 >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=0)._try_send injecting socket failure
2017-04-05 17:44:53.934825 7f1ee1f1b700 0 -- 127.0.0.1:0/2584501559 >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_OPEN pgs=43 cs=1 l=1).read_until injecting socket failure
{
    "health": {
        "health": {
            "health_services": [
                {
                    "mons": [
                        {
                            "name": "a",
                            "kb_total": 144366248,
                            "kb_used": 42897416,
                            "kb_avail": 94112416,
                            "avail_percent": 65,
                            "last_updated": "2017-04-05 17:44:15.424801",
                            "store_stats": {
                                "bytes_total": 5375478,
                                "bytes_sst": 0,
                                "bytes_log": 2458965,
                                "bytes_misc": 2916513,
                                "last_updated": "0.000000"
                            },
                            "health": "HEALTH_OK"
                        },
- Project changed from Ceph to RADOS
- Category set to Correctness/Safety
- Component(RADOS) MonClient added
- Status changed from New to Resolved