Bug #18467

closed

ceph ping mon.* can fail

Added by Sage Weil over 7 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: Correctness/Safety
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): MonClient
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2017-01-01T05:21:39.480 INFO:tasks.workunit.client.0.smithi104.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/test.sh:1835: test_mon_ping:  ceph ping 'mon.*'
2017-01-01T05:21:39.482 INFO:tasks.workunit.client.0.smithi104.stderr:2017-01-01 05:21:38.855920 7fbb055a7700 -1 WARNING: all dangerous and experimental features are enabled.
2017-01-01T05:21:39.484 INFO:tasks.workunit.client.0.smithi104.stderr:2017-01-01 05:21:38.863141 7fbb055a7700 -1 WARNING: all dangerous and experimental features are enabled.
2017-01-01T05:21:41.868 INFO:tasks.workunit.client.0.smithi104.stderr:2017-01-01 05:21:41.867129 7fbafcd76700  0 monclient: hunting for new mon
2017-01-01T05:21:41.874 INFO:tasks.workunit.client.0.smithi104.stderr:2017-01-01 05:21:41.872998 7fbafed7a700  0 -- 172.21.15.104:0/4251581445 >> 172.21.15.104:6789/0 conn(0x7fbb0013a3b0 :-1 s=STATE_OPEN pgs=1056 cs=1 l=1).read_until injecting socket failure
2017-01-01T05:21:41.877 INFO:tasks.workunit.client.0.smithi104.stderr:Error connecting to cluster: TypeError
2017-01-01T05:21:41.889 INFO:tasks.workunit.client.0.smithi104.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/cephtool/test.sh:1: test_mon_ping:  rm -fr /tmp/cephtool.92D

/a/sage-2017-01-01_04:15:28-rados-wip-sage-testing---basic-smithi/679829

Presumably the socket failure injection is what triggered the issue, which would also explain why the test usually passes.

Actions #1

Updated by Nathan Cutler over 7 years ago

The offending code in qa/workunits/cephtool/test.sh is:

function test_mon_ping()
{
  ceph ping mon.a
  ceph ping mon.b
  expect_false ceph ping mon.foo

  ceph ping mon.\*
}

I guess it should run the "ceph ping" commands, say, five times and sleep 5 seconds between each try?
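
A minimal sketch of that retry idea, for illustration only. The `retry` helper is hypothetical and is not part of qa/workunits/cephtool/test.sh; the counts (five tries, five seconds) are the ones suggested above. It would only mask the failure at the test level and does not address the client behavior itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in test.sh): run a command up to $1 times,
# sleeping $2 seconds between failed attempts.
retry() {
  local tries=$1 delay=$2
  shift 2
  local attempt
  for ((attempt = 1; attempt <= tries; attempt++)); do
    # Succeed as soon as one attempt returns 0.
    "$@" && return 0
    # Do not sleep after the final failed attempt.
    if ((attempt < tries)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

Inside test_mon_ping, a call might then look like `retry 5 5 ceph ping mon.a`.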

Answer:

(11:28:12 PM) sage: smithfarm: hmm, i think the ping socket error should trigger a reconnect attempt and be masked at that level
(11:28:49 PM) sage: probably need to reproduce it with a client log (debug objecter = 20 and debug ms = 20) with the socket error injection turned up
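
A sketch of what such a reproduction run could look like. It assumes the `ceph` CLI accepts config options as command-line overrides (as in `--debug-ms 20`); the `repro_cmd` helper, the log path, and the injection value are placeholders, not taken from this report. The helper only prints the command so it can be inspected before running.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print a ceph ping invocation with client-side
# debug logging raised and a caller-chosen socket-failure injection value.
# Arguments: monitor name, "ms inject socket failures" value, log file path.
repro_cmd() {
  local mon=$1 inject=$2 logfile=$3
  echo "ceph --debug-objecter 20 --debug-ms 20" \
       "--ms-inject-socket-failures $inject" \
       "--log-file $logfile ping $mon"
}

# Example: print the command for mon.a (values are placeholders).
repro_cmd mon.a 50 /tmp/ceph-ping-client.log
```

To actually run the printed command rather than inspect it, its output could be passed to `bash -c` on a test cluster.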

Actions #3

Updated by Nathan Cutler over 7 years ago

That's with "ms inject socket failures: 500" which is unchanged. What's a reasonable higher value to try - 1000? 5000?

Actions #4

Updated by Chang Liu about 7 years ago

@Nathan Weinberg

The value should be lower, not higher, so that the fault is easier to reproduce. With the current code, a socket error during ping now triggers an automatic reconnect, as this run shows:

 ./bin/ceph ping mon.a
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2017-04-05 17:44:41.316100 7f1ee1f1b700  0 -- - >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=0)._try_send injecting socket failure
2017-04-05 17:44:41.517114 7f1ee1f1b700  0 -- - >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_RE pgs=0 cs=0 l=0)._try_send injecting socket failure
2017-04-05 17:44:41.918417 7f1ee1f1b700  0 -- 127.0.0.1:0/2584501559 >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0).read_until injecting socket failure
2017-04-05 17:44:42.720221 7f1ee1f1b700  0 -- 127.0.0.1:0/2584501559 >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=0).read_until injecting socket failure
2017-04-05 17:44:44.322370 7f1ee1f1b700  0 -- 127.0.0.1:0/2584501559 >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=0 cs=0 l=0)._try_send injecting socket failure 
2017-04-05 17:44:47.525169 7f1ee1f1b700  0 -- 127.0.0.1:0/2584501559 >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=0)._try_send injecting socket failure
2017-04-05 17:44:53.934825 7f1ee1f1b700  0 -- 127.0.0.1:0/2584501559 >> 127.0.0.1:40000/0 conn(0x7f1edc0ca3b0 :-1 s=STATE_OPEN pgs=43 cs=1 l=1).read_until injecting socket failure
{
    "health": {
        "health": {
            "health_services": [
                {
                    "mons": [
                        {
                            "name": "a",
                            "kb_total": 144366248,
                            "kb_used": 42897416,
                            "kb_avail": 94112416,
                            "avail_percent": 65,
                            "last_updated": "2017-04-05 17:44:15.424801",
                            "store_stats": {
                                "bytes_total": 5375478,
                                "bytes_sst": 0,
                                "bytes_log": 2458965,
                                "bytes_misc": 2916513,
                                "last_updated": "0.000000" 
                            },  
                            "health": "HEALTH_OK" 
                        },  

Actions #5

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category set to Correctness/Safety
  • Component(RADOS) MonClient added

Actions #6

Updated by Sage Weil almost 7 years ago

  • Status changed from New to Resolved