Bug #18467
closed
ceph ping mon.* can fail
Description
2017-01-01T05:21:39.480 INFO:tasks.workunit.client.0.smithi104.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/test.sh:1835: test_mon_ping: ceph ping 'mon.*'
2017-01-01T05:21:39.482 INFO:tasks.workunit.client.0.smithi104.stderr:2017-01-01 05:21:38.855920 7fbb055a7700 -1 WARNING: all dangerous and experimental features are enabled.
2017-01-01T05:21:39.484 INFO:tasks.workunit.client.0.smithi104.stderr:2017-01-01 05:21:38.863141 7fbb055a7700 -1 WARNING: all dangerous and experimental features are enabled.
2017-01-01T05:21:41.868 INFO:tasks.workunit.client.0.smithi104.stderr:2017-01-01 05:21:41.867129 7fbafcd76700 0 monclient: hunting for new mon
2017-01-01T05:21:41.874 INFO:tasks.workunit.client.0.smithi104.stderr:2017-01-01 05:21:41.872998 7fbafed7a700 0 -- 172.21.15.104:0/4251581445 >> 172.21.15.104:6789/0 conn(0x7fbb0013a3b0 :-1 s=STATE_OPEN pgs=1056 cs=1 l=1).read_until injecting socket failure
2017-01-01T05:21:41.877 INFO:tasks.workunit.client.0.smithi104.stderr:Error connecting to cluster: TypeError
2017-01-01T05:21:41.889 INFO:tasks.workunit.client.0.smithi104.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/test.sh:1: test_mon_ping: rm -fr /tmp/cephtool.92D
/a/sage-2017-01-01_04:15:28-rados-wip-sage-testing---basic-smithi/679829
Presumably the socket failure injection is what triggered the issue, which would also explain why the test usually works.
Updated by Nathan Cutler over 7 years ago
The offending code in qa/workunits/cephtool/test.sh is:
function test_mon_ping()
{
  ceph ping mon.a
  ceph ping mon.b
  expect_false ceph ping mon.foo
  ceph ping mon.\*
}
I guess it should run the "ceph ping" commands, say, five times and sleep 5 seconds between each try?
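For illustration, the retry idea could look roughly like the sketch below. The retry helper is hypothetical and not part of test.sh; the counts (five tries, 5 seconds) are just the numbers floated above. Note that this would only paper over the failure in the test rather than fix the client.

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT part of test.sh): run a command up to
# <tries> times, sleeping <delay> seconds between failed attempts.
# Usage: retry <tries> <delay> <command> [args...]
retry() {
    local tries=$1
    local delay=$2
    shift 2
    local attempt=1
    while [ "$attempt" -le "$tries" ]; do
        # Succeed as soon as one attempt succeeds.
        if "$@"; then
            return 0
        fi
        # Only sleep if another attempt will follow.
        if [ "$attempt" -lt "$tries" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Possible use inside test_mon_ping (assumption, untested against a cluster):
#   retry 5 5 ceph ping mon.a
#   retry 5 5 ceph ping mon.\*
```

The expect_false case (mon.foo) would stay unwrapped, since retrying a command that is supposed to fail only slows the test down.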
Answer:
(11:28:12 PM) sage: smithfarm: hmm, i think the ping socket error should trigger a reconnect attempt and be masked at that level
(11:28:49 PM) sage: probably need to reproduce it with a client log at debug objecter = 20 and debug ms = 20, with the socket error injection turned up
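A rough sketch of how such a reproduction loop might be scripted: run the ping repeatedly and keep the output of the first failing run. The run_until_failure helper is hypothetical. The commented invocation assumes the ceph CLI accepts these config options as command-line overrides and that the debug output ends up on stdout/stderr; both are assumptions, and the injection value shown is only a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to <max> times, overwriting <log>
# with the combined stdout/stderr of each run. Stops at the first failing
# run so that <log> holds the output of the failure.
# Usage: run_until_failure <max> <log> <command> [args...]
run_until_failure() {
    local max=$1
    local log=$2
    shift 2
    local i=1
    while [ "$i" -le "$max" ]; do
        if ! "$@" >"$log" 2>&1; then
            echo "attempt $i failed; output kept in $log"
            return 0
        fi
        i=$((i + 1))
    done
    echo "no failure in $max attempts"
    return 1
}

# Example invocation (assumed option spelling; 50 is an arbitrary value):
#   run_until_failure 100 /tmp/ping-fail.log \
#       ceph --debug-ms 20 --debug-objecter 20 \
#            --ms-inject-socket-failures 50 ping mon.\*
```

The helper returns 0 when a failure was captured and 1 when every run succeeded, so it can be chained with a step that archives the log.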
Updated by Sage Weil over 7 years ago
Updated by Nathan Cutler over 7 years ago
That's with "ms inject socket failures: 500" which is unchanged. What's a reasonable higher value to try - 1000? 5000?
Updated by Chang Liu about 7 years ago
The value should be lower, not higher: a lower value injects socket failures more often, so the fault is easier to reproduce. With the current code, a ping that hits a socket error now reconnects automatically:
./bin/ceph ping mon.a
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2017-04-05 17:44:41.316100 7f1ee1f1b700 0 -- - >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=0)._try_send injecting socket failure
2017-04-05 17:44:41.517114 7f1ee1f1b700 0 -- - >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0 l=0)._try_send injecting socket failure
2017-04-05 17:44:41.918417 7f1ee1f1b700 0 -- 127.0.0.1:0/2584501559 >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0).read_until injecting socket failure
2017-04-05 17:44:42.720221 7f1ee1f1b700 0 -- 127.0.0.1:0/2584501559 >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=0).read_until injecting socket failure
2017-04-05 17:44:44.322370 7f1ee1f1b700 0 -- 127.0.0.1:0/2584501559 >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0)._try_send injecting socket failure
2017-04-05 17:44:47.525169 7f1ee1f1b700 0 -- 127.0.0.1:0/2584501559 >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=0)._try_send injecting socket failure
2017-04-05 17:44:53.934825 7f1ee1f1b700 0 -- 127.0.0.1:0/2584501559 >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_OPEN pgs=43 cs=1 l=1).read_until injecting socket failure
{
  "health": {
    "health": {
      "health_services": [
        {
          "mons": [
            {
              "name": "a",
              "kb_total": 144366248,
              "kb_used": 42897416,
              "kb_avail": 94112416,
              "avail_percent": 65,
              "last_updated": "2017-04-05 17:44:15.424801",
              "store_stats": {
                "bytes_total": 5375478,
                "bytes_sst": 0,
                "bytes_log": 2458965,
                "bytes_misc": 2916513,
                "last_updated": "0.000000"
              },
              "health": "HEALTH_OK"
            },
Updated by Greg Farnum almost 7 years ago
- Project changed from Ceph to RADOS
- Category set to Correctness/Safety
- Component(RADOS) MonClient added