Bug #5069 (closed): monitor crashed during mon thrash in nightlies

Added by Tamilarasi muthamizhan almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: High
Assignee: Joao Eduardo Luis
Category: Monitor
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

logs: ubuntu@teuthology:/a/teuthology-2013-05-14_01:00:05-rados-next-testing-basic/12938

2013-05-14T01:28:35.267 INFO:teuthology.task.workunit.client.0.err:2013-05-14 01:28:56.156297 7f64397fa700  0 monclient: hunting for new mon
2013-05-14T01:29:02.589 INFO:teuthology.task.ceph.mon.a.err:mon/MDSMonitor.cc: In function 'virtual void MDSMonitor::update_from_paxos()' thread 7fde038cb700 time 2013-05-14 01:29:23.476858
2013-05-14T01:29:02.589 INFO:teuthology.task.ceph.mon.a.err:mon/MDSMonitor.cc: 88: FAILED assert(version >= mdsmap.epoch)
2013-05-14T01:29:02.589 INFO:teuthology.task.ceph.mon.a.err: ceph version 0.61-277-gc5deb5d (c5deb5dbe2a1e7ada1cab778da41ddb30d9c880b)
2013-05-14T01:29:02.589 INFO:teuthology.task.ceph.mon.a.err: 1: (MDSMonitor::update_from_paxos()+0x80d) [0x52d84d]
2013-05-14T01:29:02.590 INFO:teuthology.task.ceph.mon.a.err: 2: (PaxosService::_active()+0x28c) [0x4f40bc]
2013-05-14T01:29:02.590 INFO:teuthology.task.ceph.mon.a.err: 3: (Context::complete(int)+0xa) [0x4cafda]
2013-05-14T01:29:02.590 INFO:teuthology.task.ceph.mon.a.err: 4: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x95) [0x4cd7f5]
2013-05-14T01:29:02.590 INFO:teuthology.task.ceph.mon.a.err: 5: (Paxos::handle_last(MMonPaxos*)+0xc72) [0x4eda22]
2013-05-14T01:29:02.590 INFO:teuthology.task.ceph.mon.a.err: 6: (Paxos::dispatch(PaxosServiceMessage*)+0x26b) [0x4ede7b]
2013-05-14T01:29:02.591 INFO:teuthology.task.ceph.mon.a.err: 7: (Monitor::_ms_dispatch(Message*)+0x10ac) [0x4c30ac]
2013-05-14T01:29:02.591 INFO:teuthology.task.ceph.mon.a.err: 8: (Monitor::ms_dispatch(Message*)+0x32) [0x4dcc62]
2013-05-14T01:29:02.591 INFO:teuthology.task.ceph.mon.a.err: 9: (DispatchQueue::entry()+0x3f1) [0x6b4ca1]
2013-05-14T01:29:02.591 INFO:teuthology.task.ceph.mon.a.err: 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x64263d]
2013-05-14T01:29:02.591 INFO:teuthology.task.ceph.mon.a.err: 11: (()+0x7e9a) [0x7fde0862ee9a]
2013-05-14T01:29:02.591 INFO:teuthology.task.ceph.mon.a.err: 12: (clone()+0x6d) [0x7fde06bdeccd]
2013-05-14T01:29:02.592 INFO:teuthology.task.ceph.mon.a.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

ubuntu@teuthology:/a/teuthology-2013-05-14_01:00:05-rados-next-testing-basic/12938$ cat config.yaml 
kernel: &id001
  kdb: true
  sha1: 4ebfb52255fd25a987154e0e8847a4155532f760
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 72bf5f4813c273210b5ced7f7793bc1bf813690c
  s3tests:
    branch: next
  workunit:
    sha1: 72bf5f4813c273210b5ced7f7793bc1bf813690c
roles:
- - mon.a
  - mon.b
  - mon.c
  - osd.0
  - osd.1
  - mds.0
  - client.0
targets:
  ubuntu@plana88.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCn53/HM+LXJ+Lp7U6+zxGv4uzsKWyTsUR33DqdlN90Lh5mRIqohFq/oitlEvTuSywmu0Cfo6wOeMF7mRrsGCfeOHE7qPCDlyv99P+K96nS0EvRE/+hGhcQZjrJveGTk9IYgu83UXRjYXjVkPR1lJpBc8Kl+has7uL0vnxnNiZmkKFT+2X9tfLxrvV622eN2keXg4qLc7kRWbPw7Tevk2OV5/Lw591aIlIDQo7DhsVaM002YhFbIw591B4FogwqGtY3X43WcRD5DlyONKFxXSy7Pp1F9W6p0bNEMC2JeP1JRxQeKUgajGlsR2bxLHgFTGRJNygRlgW2gIMceFPrUNHv
tasks:
- internal.lock_machines:
  - 1
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install: null
- ceph: null
- mon_thrash:
    revive_delay: 20
    thrash_delay: 1
- ceph-fuse: null
- workunit:
    clients:
      all:
      - mon/workloadgen.sh
    env:
      DURATION: '600'
      LOADGEN_NUM_OSDS: '5'
      VERBOSE: '1'


Files

mdsmonitor_assert_mon.1.log (29.8 KB) mdsmonitor_assert_mon.1.log Florian Wiessner, 05/16/2013 03:25 AM
bug_5069_all_mon_logs.txt (226 KB) bug_5069_all_mon_logs.txt Florian Wiessner, 05/16/2013 03:25 AM
Actions #1

Updated by Joao Eduardo Luis almost 11 years ago

Logs are gone. I'll try to reproduce. In the future, it would be nice to grab the logs and data dirs before nuking/aborting the run and either dump them on teuthology or point to their location in the ticket.

Actions #2

Updated by Florian Wiessner almost 11 years ago

I have some more logs -- unfortunately, the mon is unable to start up and rejoin the cluster after this assert.

Actions #3

Updated by Sage Weil almost 11 years ago

  • Priority changed from High to Urgent
Actions #4

Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from New to 4

Something is definitely wrong in the store: 'version' holds the last committed version in the store, while mdsmap.epoch holds the version of the current in-memory mdsmap.

The only two scenarios in which this could happen that make some sort of sense are as follows:

  • the monitor went through a sync and it managed not to sync the mds keys -- or, at least, mdsmon:last_committed -- which would be a bug!
  • the monitor synchronized from a monitor with an older state of the store -- which should be impossible, given we only synchronize from someone with a paxos version higher than the one we have.

The reason these are the only two scenarios that have made sense so far is that, in between bootstraps, resets and the like, the paxos services keep their in-memory state to some extent. For instance, the OSDMonitor keeps its in-memory 'osdmap' although it discards its pending value; the same goes for the MDSMonitor and its mdsmap. This is not a problem, and it even acts as a checkpoint that comes in handy to guarantee the correctness of a sync: we are guaranteed to assert in update_from_paxos() whenever the version on the store is lower than the one we have in memory, because an in-memory version that is greater means we are currently holding a past state of the store.
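
For reference, a minimal standalone sketch of the invariant that failed (this is a mock, not the actual Ceph code; the store and map are simplified stand-ins, and the real check lives at mon/MDSMonitor.cc:88 in this build):

// Mock illustration of the MDSMonitor::update_from_paxos() invariant.
// Not the real Ceph code: types and names are simplified.
#include <cassert>
#include <cstdint>

struct MockStore  { uint64_t last_committed; };  // what the on-disk store claims
struct MockMDSMap { uint64_t epoch; };           // what we already hold in memory

void update_from_paxos(const MockStore& store, MockMDSMap& map)
{
  uint64_t version = store.last_committed;
  if (version == map.epoch)
    return;                      // nothing new to apply
  assert(version >= map.epoch);  // the store must never be behind memory
  map.epoch = version;           // ...decode and apply the newer map here...
}

int main()
{
  MockStore store = {5};
  MockMDSMap map  = {3};
  update_from_paxos(store, map);  // fine: the store is ahead of memory
  store.last_committed = 2;       // simulate a store that "went backwards"
  update_from_paxos(store, map);  // aborts on the assert, as in the crash above
}

Running this aborts on the second call, which is the same failure mode the backtrace above shows.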

I cannot ascertain what might have happened here without a store dump from a majority of the monitors in the cluster, including the one that asserted out, as the logs don't really provide the necessary insight. Trying to reproduce it has been fruitless so far, too.

However, one easy conclusion we can draw from both scenarios is that we have no good way to assess whether a full-store sync actually synchronized every key. It should, given that we take a snapshot of the store and iterate over it, but that alone is no longer enough for me to trust it blindly. We should maybe consider making the store sync somewhat more service-aware, instead of doing a full-store sync regardless of what we are synchronizing.
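
To make the "service-aware" idea concrete, here is a rough, hypothetical sketch of the kind of post-sync check this suggests (nothing like it exists in the code today; the key naming follows the 'mdsmon:last_committed' convention mentioned above, and the other service names are illustrative):

// Hypothetical post-sync sanity check: after a full-store sync, verify that
// every paxos service at least has its last_committed key in the new store.
#include <iostream>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::string> MockStore;  // key -> value

bool post_sync_check(const MockStore& store,
                     const std::vector<std::string>& services)
{
  bool ok = true;
  for (const std::string& svc : services) {
    std::string key = svc + ":last_committed";
    if (store.find(key) == store.end()) {
      std::cerr << "missing " << key << " after sync" << std::endl;
      ok = false;   // e.g. an mds map that never made it across
    }
  }
  return ok;
}

int main()
{
  MockStore synced = {{"paxos:last_committed", "2062"},
                      {"pgmon:last_committed", "410"}};   // no mdsmon keys at all
  std::vector<std::string> services = {"paxos", "osdmon", "mdsmon", "pgmon"};
  if (!post_sync_check(synced, services))
    std::cerr << "store sync looks incomplete; do not join quorum" << std::endl;
}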

Feedback appreciated before setting this to Need More Info.

Actions #5

Updated by Sage Weil almost 11 years ago

  • Status changed from 4 to Need More Info

I think the blind sync is still the way forward: the more we make it aware of what is on top, the harder it is to do different things (like breaking up the PGMap).

Is there a more complete log that would shed more light?

Actions #6

Updated by Sage Weil almost 11 years ago

  • Status changed from Need More Info to In Progress

der, we have a complete log from the teuthology failure.

Actions #7

Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from In Progress to Need More Info
  • Priority changed from Urgent to High

Sage thought there were logs on teuthology, in the directory from Tamil's initial report. That run must have been killed, and in fact there are no logs, so I'm marking this Need More Info.

Also, as we haven't seen this since, Sage suggests marking this High instead of Urgent.

Actions #8

Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from Need More Info to 4

I've been able to reproduce this on some locked nodes that were hammering the monitors pretty hard for the past week.

The job itself was the following:

nuke-on-error: false
interactive-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
    fs: xfs
    log-whitelist:
    - slow request
    branch: master
  install:
    ceph:
      branch: master
  workunit:
    branch: master
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
- - mon.d
  - mon.e
  - osd.6
  - osd.7
- - osd.8
  - osd.9
  - osd.10
  - mon.f
- - osd.11
  - osd.12
  - osd.13
  - mon.g
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- mon_thrash:
    revive_delay: 20
    thrash_delay: 1
    store-thrash: true
    thrash-many: true
    store-thrash-probability: 100
- rados:
    clients:
    - client.0
    objects: 50
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000

I grabbed the logs and the stores to my desktop, and here are the notes I've been taking so far:

The monitor (mon.d) doesn't have much of anything in its store besides some paxos and pgmap keys -- no mdsmap at all, so it's no wonder it all goes to crap.

mon.d appears to have gone through a couple of syncs. The first appears to have succeeded, bringing it up to version 2062 like most of the monitors in the cluster; however, while attempting to form quorum, a sync was triggered due to mon.a having version 2074.

mon.d started the synchronization, but mon.a instructed it to retry; mon.d never retried the sync, possibly because an election got in the way. In any case, mon.d eventually bootstrapped and joined the cluster as it was. It's hard to assess whether the store got corrupted during the first sync or whether the sync that was never retried is the ultimate culprit.

Update: even though the synchronization had to be retried, we clear the whole store prior to sending the sync_start message to the leader. This means we only become aware that we should retry after having cleared the whole store, and if we are killed in between we may end up with an empty store. Now, why we would end up with some deleted values anyway is beyond me. Maybe one of those leveldb bugs that bring deleted keys back to life after a compaction?!
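
To spell out the ordering hazard, here is a simplified, hypothetical sketch of the requester side (the names are made up; it only illustrates the window described above, not the actual Monitor code):

// Simplified illustration of the sync-requester ordering described above.
// The dangerous window is between clearing the local store and the sync
// actually completing (or being retried).
#include <iostream>

struct SyncRequester {
  bool sync_flag_set = false;
  bool store_cleared = false;

  void start_sync() {
    sync_flag_set = true;   // persist a "sync in progress" marker
    clear_local_store();    // <-- from here on, the old contents are gone
    send_sync_start();      // the leader may answer "retry" instead of data
    // If the monitor is killed here, or the retry never happens (say, an
    // election gets in the way), it comes back with an essentially empty
    // store that still carries the sync flag -- the state seen on mon.d.
  }

  void clear_local_store() { store_cleared = true; }
  void send_sync_start()   { std::cout << "sync_start sent to leader\n"; }
};

int main() {
  SyncRequester r;
  r.start_sync();
  std::cout << "store_cleared=" << r.store_cleared
            << " sync_flag_set=" << r.sync_flag_set << "\n";
}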

Actions #9

Updated by Joao Eduardo Luis almost 11 years ago

I forgot to mention that the sync flag is set in the store. Sage pointed out that the real issue here is that we're allowing the monitor to join the quorum with a store marked as having an ongoing sync (although I recall a patch going in that avoids just that sort of thing, or was meant to; I'll check when I get back).
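
Roughly, the kind of startup guard this implies could look like the following (a hypothetical sketch, not the patch referred to above):

// Hypothetical startup guard: if the store still carries the "sync in
// progress" marker, restart the sync instead of joining the quorum.
#include <iostream>

struct MockStore { bool sync_in_progress; };

void monitor_startup(MockStore& store)
{
  if (store.sync_in_progress) {
    std::cout << "previous sync never finished; re-syncing before joining quorum\n";
    // ...trigger a fresh full sync here instead of bootstrapping as-is...
    return;
  }
  std::cout << "store is clean; proceeding to join the quorum\n";
}

int main() {
  MockStore store;
  store.sync_in_progress = true;   // the leftover flag from the scenario above
  monitor_startup(store);
}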

Actions #10

Updated by Sage Weil almost 11 years ago

  • Status changed from 4 to Resolved

This should have been fixed with the recent mon sync and other changes.
