Bug #10178
Status: Closed
mon rejects peer during election based on OSD_SET_ALLOC_HINT feature?
Description
2014-11-22T17:50:34.402 DEBUG:teuthology.misc:Ceph health: HEALTH_WARN 1 mons down, quorum 0,1 a,b
2014-11-22T17:50:35.402 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 82, in __exit__
    for result in self:
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/task/parallel.py", line 50, in _run_spawned
    mgr = run_tasks.run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 41, in run_one_task
    return fn(**kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/task/sequential.py", line 48, in task
    mgr.__enter__()
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_firefly/tasks/ceph.py", line 1086, in restart
    healthy(ctx=ctx, config=None)
  File "/var/lib/teuthworker/src/ceph-qa-suite_firefly/tasks/ceph.py", line 994, in healthy
    remote=mon0_remote,
  File "/home/teuthworker/src/teuthology_master/teuthology/misc.py", line 828, in wait_until_healthy
    while proceed():
  File "/home/teuthworker/src/teuthology_master/teuthology/contextutil.py", line 127, in __call__
    raise MaxWhileTries(error_msg)
MaxWhileTries: 'wait_until_healthy' reached maximum tries (150) after waiting for 900 seconds
2014-11-22T17:50:35.467 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 53, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 41, in run_one_task
    return fn(**kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/task/parallel.py", line 43, in task
    p.spawn(_run_spawned, ctx, confg, taskname)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 82, in __exit__
    for result in self:
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/task/parallel.py", line 50, in _run_spawned
    mgr = run_tasks.run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 41, in run_one_task
    return fn(**kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/task/sequential.py", line 48, in task
    mgr.__enter__()
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_firefly/tasks/ceph.py", line 1086, in restart
    healthy(ctx=ctx, config=None)
  File "/var/lib/teuthworker/src/ceph-qa-suite_firefly/tasks/ceph.py", line 994, in healthy
    remote=mon0_remote,
  File "/home/teuthworker/src/teuthology_master/teuthology/misc.py", line 828, in wait_until_healthy
    while proceed():
  File "/home/teuthworker/src/teuthology_master/teuthology/contextutil.py", line 127, in __call__
    raise MaxWhileTries(error_msg)
MaxWhileTries: 'wait_until_healthy' reached maximum tries (150) after waiting for 900 seconds
Updated by Sage Weil over 9 years ago
- Category set to Monitor
- Status changed from New to 12
- Priority changed from Normal to Urgent
2014-11-23 01:30:31.145242 7f2f449ce700 20 allow all
2014-11-23 01:30:31.145263 7f2f449ce700  1 mon.c@2(electing).elector(13) handle_nak from mon.1 quorum_features 52776558133247
2014-11-23 01:30:31.145273 7f2f449ce700 -1 mon.c@2(electing).elector(13) Shutting down because I do not support required monitor features: { compat={},rocompat={},incompat={} }
The nak was sent from here:
2014-11-23 01:30:31.144955 7f4c74e74700  5 mon.b@1(peon).elector(14) handle_propose from mon.2
2014-11-23 01:30:31.144958 7f4c74e74700  5 mon.b@1(peon).elector(14) ignoring propose from mon2 without required features
2014-11-23 01:30:31.144959 7f4c74e74700 10 mon.b@1(peon).elector(14) sending nak to peer mon.2 that only supports 17592186044415 of the required 52776558133247
that feature is #define CEPH_FEATURE_OSD_SET_ALLOC_HINT (1ULL<<45)
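The masks in the nak line above can be decoded with a few lines of Python (a sketch; only the OSD_SET_ALLOC_HINT bit position is taken from this ticket):

```python
# Decode the feature masks quoted in the mon.b nak log line above.
CEPH_FEATURE_OSD_SET_ALLOC_HINT = 1 << 45  # value stated in the ticket

required = 52776558133247    # quorum_features the electing mons demand
supported = 17592186044415   # what the rejected peer (mon.2) offers

missing = required & ~supported
print(hex(missing))                                        # 0x200000000000
print(missing == CEPH_FEATURE_OSD_SET_ALLOC_HINT)          # True
```

So the only bit the peer is missing is bit 45, OSD_SET_ALLOC_HINT, which is the OSD-side feature named in the subject.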
Updated by Sage Weil over 9 years ago
- Subject changed from "MaxWhileTries: 'wait_until_healthy'reached maximum tries" in upgrade:firefly:newer-firefly-distro-basic-vps run to mon rejects peer during election based on OSD_SET_ALLOC_HINT feature?
Updated by Sage Weil over 9 years ago
diff --git a/src/mon/Monitor.h b/src/mon/Monitor.h
index da1fd0a..7fed638 100644
--- a/src/mon/Monitor.h
+++ b/src/mon/Monitor.h
@@ -550,7 +550,12 @@ public:
     return quorum_features;
   }
   uint64_t get_required_features() const {
-    return quorum_features;
+    // be conservative: exclude features known to have no impact
+    // on the mons. start with just the recent ones.
+    return quorum_features & ~(CEPH_FEATURE_OSD_SET_ALLOC_HINT |
+                               CEPH_FEATURE_ERASURE_CODE_PLUGINS_V2 |
+                               CEPH_FEATURE_OSD_POOLRESEND |
+                               CEPH_FEATURE_MSGR_KEEPALIVE2);
   }
   void apply_quorum_to_compatset_features();
   void apply_compatset_features_to_quorum_requirements();
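In Python terms the proposed change amounts to masking mon-irrelevant bits out of the required set before comparing against a peer. A sketch, assuming placeholder bit positions for everything except OSD_SET_ALLOC_HINT (only bit 45 is confirmed in this ticket; the other three values are hypothetical stand-ins for ceph_features.h):

```python
# Only bit 45 is confirmed by the ticket; the rest are placeholders.
CEPH_FEATURE_OSD_SET_ALLOC_HINT = 1 << 45
CEPH_FEATURE_ERASURE_CODE_PLUGINS_V2 = 1 << 44   # assumed value
CEPH_FEATURE_OSD_POOLRESEND = 1 << 43            # assumed value
CEPH_FEATURE_MSGR_KEEPALIVE2 = 1 << 42           # assumed value

MON_IRRELEVANT = (CEPH_FEATURE_OSD_SET_ALLOC_HINT |
                  CEPH_FEATURE_ERASURE_CODE_PLUGINS_V2 |
                  CEPH_FEATURE_OSD_POOLRESEND |
                  CEPH_FEATURE_MSGR_KEEPALIVE2)

def get_required_features(quorum_features):
    # Drop features known to have no impact on the mons, so an election
    # peer missing only those bits is no longer nak'd.
    return quorum_features & ~MON_IRRELEVANT

# With the masks from the logs above, mon.2 would now pass the check:
required = get_required_features(52776558133247)
assert 17592186044415 & required == required
```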
Updated by Sage Weil over 9 years ago
- Status changed from 12 to Fix Under Review
- Backport set to giant
Updated by Sage Weil over 9 years ago
ok, new plan. instead of changing mon behavior, make the tests more resilient.
If we upgrade all mons and then restart a, b, c with a delay in between, we may fail to get back to health during one of those restarts. The fix is to set wait-for-healthy: false on any mon restart during an upgrade sequence.
Updated by Yuri Weinstein over 9 years ago
RE: https://github.com/ceph/ceph-qa-suite/pull/251
- ceph.restart:
    daemons: [mon.b]
    wait-for-healthy: false
Updated by Yuri Weinstein over 9 years ago
on giant fix - https://github.com/ceph/ceph-qa-suite/pull/252
Updated by Sage Weil over 9 years ago
Yuri, new plan: let's just add 'mon lease = 15' to vps.yaml and see if this comes up again.
Updated by Yuri Weinstein over 9 years ago
Sage, OK
I changed vps.yaml on teuthology to:
overrides:
  ceph:
    conf:
      global:
        osd heartbeat grace: 100
        mon lease = 15
  rgw:
    default_idle_timeout: 1200
  s3tests:
    idle_timeout: 1200
Do we need to revert https://github.com/ceph/ceph-qa-suite/pull/251 ?
Updated by Yuri Weinstein over 9 years ago
Committed vps.yaml on master, giant and next.
Fixed the syntax from mon lease = 15 to mon lease: 15.
Updated by Yuri Weinstein over 9 years ago
- Assignee changed from Yuri Weinstein to Sage Weil
I think after this addition, suites on vps fail with:
2014-11-26T11:52:14.312 INFO:teuthology.orchestra.run.vpm055.stderr:2014-11-26 14:52:14.310788 7ff65a2c0700 0 librados: client.admin authentication error (110) Connection timed out
2014-11-26T11:42:14.067 INFO:tasks.ceph.mon.b.vpm043.stderr:2014-11-26 14:42:14.066083 7f16e69b57a0 -1 mon.b@-1(probing) e0 option sanitization failed!
2014-11-26T11:42:14.067 INFO:tasks.ceph.mon.b.vpm043.stderr:2014-11-26 14:42:14.066087 7f16e69b57a0 -1 failed to initialize
2014-11-26T11:42:14.080 INFO:tasks.ceph.mon.a.vpm055.stderr:2014-11-26 14:42:14.079662 7fc03c8f27a0 -1 log_channel(cluster) log [ERR] : mon_lease_ack_timeout (10) must be greater than mon_lease (15)
Updated by Sage Weil over 9 years ago
Also set mon_lease_ack_timeout: 25, since:
2014-11-26 14:42:14.079662 7fc03c8f27a0 -1 log_channel(cluster) log [ERR] : mon_lease_ack_timeout (10) must be greater than mon_lease (15)
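The constraint behind that error is simple: the ack timeout must exceed the lease. A minimal sketch of the sanitization rule (hypothetical Python; the real check lives in the monitor's C++ option sanitization):

```python
def sanitize_mon_lease(mon_lease, mon_lease_ack_timeout):
    # Mirrors the error above: peons need more than one lease interval
    # to ack, or the mon refuses to start.
    if mon_lease_ack_timeout <= mon_lease:
        raise ValueError(
            "mon_lease_ack_timeout (%g) must be greater than mon_lease (%g)"
            % (mon_lease_ack_timeout, mon_lease))

sanitize_mon_lease(15, 25)      # the values chosen below: OK
# sanitize_mon_lease(15, 10)    # the failing combination from the log
```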
Updated by Sage Weil over 9 years ago
- Assignee changed from Sage Weil to Yuri Weinstein
Updated by Yuri Weinstein over 9 years ago
Modified vps.yaml:
overrides:
  ceph:
    conf:
      global:
        osd heartbeat grace: 100
        # these lines to address issue #10178
        mon lease: 15
        mon lease ack timeout: 25
  rgw:
    default_idle_timeout: 1200
  s3tests:
    idle_timeout: 1200
Updated by Sage Weil over 9 years ago
- Status changed from Fix Under Review to Resolved
Updated by Yuri Weinstein over 9 years ago
The initial run for this report passed - http://pulpito.front.sepia.ceph.com/teuthology-2014-12-02_17:00:03-upgrade:firefly:newer-firefly-distro-basic-vps/