Bug #10178


mon rejects peer during election based on OSD_SET_ALLOC_HINT feature?

Added by Yuri Weinstein over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Urgent
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
giant
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-11-22_17:00:03-upgrade:firefly:newer-firefly-distro-basic-vps/615062/

2014-11-22T17:50:34.402 DEBUG:teuthology.misc:Ceph health: HEALTH_WARN 1 mons down, quorum 0,1 a,b
2014-11-22T17:50:35.402 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 82, in __exit__
    for result in self:
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/task/parallel.py", line 50, in _run_spawned
    mgr = run_tasks.run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 41, in run_one_task
    return fn(**kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/task/sequential.py", line 48, in task
    mgr.__enter__()
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_firefly/tasks/ceph.py", line 1086, in restart
    healthy(ctx=ctx, config=None)
  File "/var/lib/teuthworker/src/ceph-qa-suite_firefly/tasks/ceph.py", line 994, in healthy
    remote=mon0_remote,
  File "/home/teuthworker/src/teuthology_master/teuthology/misc.py", line 828, in wait_until_healthy
    while proceed():
  File "/home/teuthworker/src/teuthology_master/teuthology/contextutil.py", line 127, in __call__
    raise MaxWhileTries(error_msg)
MaxWhileTries: 'wait_until_healthy'reached maximum tries (150) after waiting for 900 seconds
2014-11-22T17:50:35.467 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 53, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 41, in run_one_task
    return fn(**kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/task/parallel.py", line 43, in task
    p.spawn(_run_spawned, ctx, confg, taskname)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 82, in __exit__
    for result in self:
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/task/parallel.py", line 50, in _run_spawned
    mgr = run_tasks.run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 41, in run_one_task
    return fn(**kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/task/sequential.py", line 48, in task
    mgr.__enter__()
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_firefly/tasks/ceph.py", line 1086, in restart
    healthy(ctx=ctx, config=None)
  File "/var/lib/teuthworker/src/ceph-qa-suite_firefly/tasks/ceph.py", line 994, in healthy
    remote=mon0_remote,
  File "/home/teuthworker/src/teuthology_master/teuthology/misc.py", line 828, in wait_until_healthy
    while proceed():
  File "/home/teuthworker/src/teuthology_master/teuthology/contextutil.py", line 127, in __call__
    raise MaxWhileTries(error_msg)
MaxWhileTries: 'wait_until_healthy'reached maximum tries (150) after waiting for 900 seconds

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #9835: osd: bug in misdirected op checks (firefly) (Resolved, 10/20/2014)

Actions #1

Updated by Sage Weil over 9 years ago

  • Category set to Monitor
  • Status changed from New to 12
  • Priority changed from Normal to Urgent
2014-11-23 01:30:31.145242 7f2f449ce700 20  allow all
2014-11-23 01:30:31.145263 7f2f449ce700  1 mon.c@2(electing).elector(13) handle_nak from mon.1 quorum_features 52776558133247
2014-11-23 01:30:31.145273 7f2f449ce700 -1 mon.c@2(electing).elector(13) Shutting down because I do not support required monitor features: { compat={},rocompat={},incompat={} }

from here

2014-11-23 01:30:31.144955 7f4c74e74700  5 mon.b@1(peon).elector(14) handle_propose from mon.2
2014-11-23 01:30:31.144958 7f4c74e74700  5 mon.b@1(peon).elector(14)  ignoring propose from mon2 without required features
2014-11-23 01:30:31.144959 7f4c74e74700 10 mon.b@1(peon).elector(14) sending nak to peer mon.2 that only supports 17592186044415 of the required 52776558133247


that feature is #define CEPH_FEATURE_OSD_SET_ALLOC_HINT (1ULL<<45)
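As a sanity check against the numbers in the log (not part of the ticket itself), the two quorum_features masks can be decoded arithmetically to confirm that bit 45 is exactly what mon.2 is missing:

```python
# Decode the two feature masks quoted in the election log above.
# mon.2 advertised 17592186044415, i.e. bits 0..43 all set; the quorum
# required 52776558133247, which additionally has bit 45 set.
CEPH_FEATURE_OSD_SET_ALLOC_HINT = 1 << 45  # matches the #define above

supported = 17592186044415   # == (1 << 44) - 1
required = 52776558133247    # == ((1 << 44) - 1) | (1 << 45)

missing = required & ~supported
assert missing == CEPH_FEATURE_OSD_SET_ALLOC_HINT
print(hex(missing))  # 0x200000000000 -> bit 45
```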
Actions #2

Updated by Sage Weil over 9 years ago

  • Subject changed from "MaxWhileTries: 'wait_until_healthy'reached maximum tries" in upgrade:firefly:newer-firefly-distro-basic-vps run to mon rejects peer during election based on OSD_SET_ALLOC_HINT feature?
Actions #3

Updated by Sage Weil over 9 years ago

diff --git a/src/mon/Monitor.h b/src/mon/Monitor.h
index da1fd0a..7fed638 100644
--- a/src/mon/Monitor.h
+++ b/src/mon/Monitor.h
@@ -550,7 +550,12 @@ public:
     return quorum_features;
   }
   uint64_t get_required_features() const {
-    return quorum_features;
+    // be conservative: exclude features known to have no impact
+    // on the mons.  start with just the recent ones.
+    return quorum_features & ~(CEPH_FEATURE_OSD_SET_ALLOC_HINT |
+                              CEPH_FEATURE_ERASURE_CODE_PLUGINS_V2 |
+                              CEPH_FEATURE_OSD_POOLRESEND |
+                              CEPH_FEATURE_MSGR_KEEPALIVE2);
   }
   void apply_quorum_to_compatset_features();
   void apply_compatset_features_to_quorum_requirements();

?
Actions #4

Updated by Sage Weil over 9 years ago

  • Status changed from 12 to Fix Under Review
  • Backport set to giant
Actions #6

Updated by Sage Weil over 9 years ago

OK, new plan: instead of changing mon behavior, make the tests more resilient.

If we upgrade all mons and then restart a, b, c with a delay in between, we may fail to get back to health during one of those restarts. The fix is to set wait-for-healthy: false on any mon restart during an upgrade sequence.

Actions #7

Updated by Sage Weil over 9 years ago

  • Assignee set to Yuri Weinstein
Actions #8

Updated by Yuri Weinstein over 9 years ago

RE: https://github.com/ceph/ceph-qa-suite/pull/251

   - ceph.restart:
       daemons: [mon.b]
       wait-for-healthy: false
Actions #10

Updated by Sage Weil over 9 years ago

Yuri, new plan: let's just add 'mon lease = 15' to vps.yaml and see if this comes up again.

Actions #11

Updated by Yuri Weinstein over 9 years ago

Sage, OK

I changed vps.yaml on teuthology to:

overrides:
  ceph:
    conf:
      global:
        osd heartbeat grace: 100
        mon lease = 15
  rgw:
    default_idle_timeout: 1200
  s3tests:
    idle_timeout: 1200

Do we need to revert https://github.com/ceph/ceph-qa-suite/pull/251 ?

Actions #12

Updated by Yuri Weinstein over 9 years ago

Committed vps.yaml on master, giant and next

Fixed syntax for
mon lease = 15
to
mon lease: 15

Actions #13

Updated by Yuri Weinstein over 9 years ago

  • Assignee changed from Yuri Weinstein to Sage Weil

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-11-26_09:31:02-upgrade:giant-x-next-distro-basic-vps/623077/

I think that after this addition, suites on vps fail with:

2014-11-26T11:52:14.312 INFO:teuthology.orchestra.run.vpm055.stderr:2014-11-26 14:52:14.310788 7ff65a2c0700  0 librados: client.admin authentication error (110) Connection timed out
2014-11-26T11:42:14.067 INFO:tasks.ceph.mon.b.vpm043.stderr:2014-11-26 14:42:14.066083 7f16e69b57a0 -1 mon.b@-1(probing) e0 option sanitization failed!
2014-11-26T11:42:14.067 INFO:tasks.ceph.mon.b.vpm043.stderr:2014-11-26 14:42:14.066087 7f16e69b57a0 -1 failed to initialize
2014-11-26T11:42:14.080 INFO:tasks.ceph.mon.a.vpm055.stderr:2014-11-26 14:42:14.079662 7fc03c8f27a0 -1 log_channel(cluster) log [ERR] : mon_lease_ack_timeout (10) must be greater than mon_lease (15)

Actions #14

Updated by Sage Weil over 9 years ago

mon_lease_ack_timeout: 25

2014-11-26 14:42:14.079662 7fc03c8f27a0 -1 log_channel(cluster) log [ERR] : mon_lease_ack_timeout (10) must be greater than mon_lease (15)
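The constraint in that error message can be sketched as a small standalone check (a hypothetical illustration, not the actual monitor sanitization code): the ack timeout must strictly exceed the lease, so bumping mon lease to 15 requires raising the ack timeout past its default of 10.

```python
# Sketch of the option-sanitization rule implied by the log above:
# mon_lease_ack_timeout must be greater than mon_lease.
def check_lease_options(mon_lease, mon_lease_ack_timeout):
    if mon_lease_ack_timeout <= mon_lease:
        raise ValueError(
            "mon_lease_ack_timeout (%d) must be greater than mon_lease (%d)"
            % (mon_lease_ack_timeout, mon_lease))

check_lease_options(15, 25)    # ok: the proposed override
# check_lease_options(15, 10)  # fails: the default ack timeout, as in the log
```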

Actions #15

Updated by Sage Weil over 9 years ago

  • Assignee changed from Sage Weil to Yuri Weinstein
Actions #16

Updated by Yuri Weinstein over 9 years ago

Modified vps.yaml:

overrides:
  ceph:
    conf:
      global:
        osd heartbeat grace: 100
        # these lines address issue #10178
        mon lease: 15
        mon lease ack timeout: 25
  rgw:
    default_idle_timeout: 1200
  s3tests:
    idle_timeout: 1200
Actions #17

Updated by Sage Weil over 9 years ago

  • Status changed from Fix Under Review to Resolved