Bug #4256: mon/Paxos.cc: 534: FAILED assert(begin->last_committed == last_committed) - Ceph - Ceph

Actions

Copy link

Bug #4256

closed

mon/Paxos.cc: 534: FAILED assert(begin->last_committed == last_committed)

Added by Sage Weil about 11 years ago. Updated about 11 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

Joao Eduardo Luis

Category:

Monitor

Target version:

% Done:

Source:

Q/A

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

2013-02-23T11:07:03.330 INFO:teuthology.task.ceph.mon.c.err:mon/Paxos.cc: In function 'void Paxos::handle_begin(MMonPaxos*)' thread 7f973f2db700 time 2013-02-23 11:06:58.978785
2013-02-23T11:07:03.330 INFO:teuthology.task.ceph.mon.c.err:mon/Paxos.cc: 534: FAILED assert(begin->last_committed == last_committed)
2013-02-23T11:07:03.330 INFO:teuthology.task.ceph.mon.c.err: ceph version 0.57-493-g704db85 (704db850131643b26bafe6594946cacce483c171)
2013-02-23T11:07:03.330 INFO:teuthology.task.ceph.mon.c.err: 1: (Paxos::handle_begin(MMonPaxos*)+0xaf7) [0x4dc647]
2013-02-23T11:07:03.330 INFO:teuthology.task.ceph.mon.c.err: 2: (Paxos::dispatch(PaxosServiceMessage*)+0x25b) [0x4de64b]
2013-02-23T11:07:03.330 INFO:teuthology.task.ceph.mon.c.err: 3: (Monitor::_ms_dispatch(Message*)+0x145f) [0x4b72ef]
2013-02-23T11:07:03.331 INFO:teuthology.task.ceph.mon.c.err: 4: (Monitor::ms_dispatch(Message*)+0x32) [0x4cd962]
2013-02-23T11:07:03.331 INFO:teuthology.task.ceph.mon.c.err: 5: (DispatchQueue::entry()+0x341) [0x6b0e11]
2013-02-23T11:07:03.331 INFO:teuthology.task.ceph.mon.c.err: 6: (DispatchQueue::DispatchThread::entry()+0xd) [0x64002d]
2013-02-23T11:07:03.331 INFO:teuthology.task.ceph.mon.c.err: 7: (()+0x7e9a) [0x7f9743e34e9a]
2013-02-23T11:07:03.331 INFO:teuthology.task.ceph.mon.c.err: 8: (clone()+0x6d) [0x7f97425ed4bd]
2013-02-23T11:07:03.331 INFO:teuthology.task.ceph.mon.c.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

job was

ubuntu@teuthology:/a/sage-2013-02-23_08:44:35-regression-master-testing-basic/10339$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 92a49fb0f79f3300e6e50ddf56238e70678e4202
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 704db850131643b26bafe6594946cacce483c171
  s3tests:
    branch: master
  workunit:
    sha1: 704db850131643b26bafe6594946cacce483c171
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- install: null
- ceph:
    conf:
      client:
        rbd cache: true
        rbd cache max dirty: 0
- qemu:
    all:
      test: https://raw.github.com/ceph/ceph/master/qa/workunits/suites/tiobench.sh

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Joao Eduardo Luis about 11 years ago

Status changed from 12 to In Progress

This bug is caused by what I most feared that could happen with the single-paxos approach, but it always remained a theoretical problem as I had never been able to trigger it.

Here's the deal:

mon.X probes, finds all the monitors with paxos versions [1, 8]
given that the version interval is lower than the paxos join tolerance, mon.X will call an election instead of synchronizing (paxos recovery will take care of it)
given the amount of monitors and the workload on the cluster, the election message is not handled right away
by the time the election is triggered, the paxos version (quorum-wide) is [5, 18] (a trim happened in the mean time)
mon.X joins the quorum
paxos recovery is kicked-off
mon.X receives versions [5, 18]
mon.X discards the versions as 5 > last_committed+1 (last_committed = 0)
mon.X continues business as usual, assuming the previous message was some stray that should not pose any issues
mon.X receives a begin message with m->last_committed = 10 > last_committed = 0
mon.X asserts

Before the single-paxos, this would have worked because, during the recovery, each paxos machine would have shared the latest stashed version, updated the 'last_committed' version, and then the message would not be discarded as the first incremental would be 'last_committed+1'.

However, with the single-paxos, recovery is no longer done by the paxos services (which are no longer paxos machines, as each and every single one of them share the same paxos machine), but instead by the single Paxos instance. This means we no longer keep stashed versions on the Paxos machine, as there is no point in them. This also means that once we trim the paxos, it's no longer possible to resort to the Paxos recovery unless there are no gaps between the receiving end's versions and the ones obtained during recovery -- otherwise, we may end up in an inconsistent state.

As I see it, we have the following solutions for this problem:

Always use the store sync to join the cluster.
Force the leader to stop trimming once a probe is received.
Force the one joining the cluster to bootstrap if, during recovery, he notices the version gap.
Increase the paxos version trim threshold.

All of them have their pros and cons, but (4) might not only be the easiest approach, but the one that might have the lesser cons:

(1)
Cons: If there are no gaps between mon.X and the quorum, might be an overkill to sync the whole store for large enough stores.
Pros: We would rely on the already established sync mechanism to do the dirty work, and given that we set out to trim only 30 seconds after the sync finishes, we give enough time for the monitor to join the cluster in the mean time.

(2)
Cons: A timeout would have to be established, in case the monitor failed. We can cancel it once an election is triggered, but that is assuming that the monitor doesn't fail.
Cons: There's a somewhat significant code change involved.
Pros: We keep the paxos recovery in place.

(3)
Cons: Monitors might end up taking a significant amount of time to join the cluster. Specially if there are dozens of them.
Pros: A store sync is triggered only when there's an actual need for it.

(4)
Cons: versions are kept longer in the store
Cons: the problem itself is not addressed; the problem might pose itself again if we keep on scaling the number of monitors.
Pros: Paxos recovery will be able to take care of the job.

There might be yet another option: During the probing phase, instead of calling an election and waiting for recovery to happen, trigger a paxos-only store sync that leverages the store sync mechanism and the disabled trim. That should give enough time for mon.X to sync its paxos state, apply it, and then join the quorum before the leader trims. This associated with (4) might be even better.

Actions

Copy link

Updated by Ian Colle about 11 years ago

Which of these options are we going to proceed with, so we can get this issue closed?

Actions

Copy link

Updated by Sage Weil about 11 years ago

Status changed from In Progress to Resolved

b271b656a889b14d052f0f354ade4e5e68c6740f

Actions

Copy link

Updated by Joao Eduardo Luis about 11 years ago

Status changed from Resolved to In Progress

Sage triggered a new iteration of this bug last Friday.

The cause is analogous to what we had previously fixed, but instead of happening on a peon it happened on the leader -- the leader ignores the state peons share with him during collect because he had drifted considerably from the previous quorum since his probing phase -- thus leading the peon to assert on the same spot upon newer proposal.

Actions

Copy link

Updated by Joao Eduardo Luis about 11 years ago

Status changed from In Progress to 4

wip-4256 commit commit:f75d0e0a2ae76211f3522c4acd21bfb1d123da5a

also added a patch to increase the trim tolerance on commit:549a56ff3e3fcda476f882400f3da15b1dd4b66f with the following motivation:

This increase only means that we'll keep more versions around before we
trim. It doesn't change the number of versions we'll keep around after
trimming (that's still as much as 'paxos_max_join_drift', i.e. 10), nor
does it change the criteria used to consider a monitor as having drifted
(same rule applies, 'paxos_max_join_drift').

This change however will enable the leader to put off trimming for a longer
period of time, giving a better chance for a monitor to join the cluster.
See, after going through the probing phase, at which point a monitor may
only be, say, 5 versions off, the same monitor may end up getting into the
quorum only to find that in-between probing and finally triggering an
election some 6 versions might have come to existence. Before this patch,
by then the state had been trimmed and the monitor would have to bootstrap
to perform a full store sync. With this patch in place, the monitor would
be able to sync the remaining 11 versions.