Bug #5069
monitor crashed during mon thrash in nightlies
Status: Closed
Description
logs: ubuntu@teuthology:/a/teuthology-2013-05-14_01:00:05-rados-next-testing-basic/12938
2013-05-14T01:28:35.267 INFO:teuthology.task.workunit.client.0.err:2013-05-14 01:28:56.156297 7f64397fa700 0 monclient: hunting for new mon
2013-05-14T01:29:02.589 INFO:teuthology.task.ceph.mon.a.err:mon/MDSMonitor.cc: In function 'virtual void MDSMonitor::update_from_paxos()' thread 7fde038cb700 time 2013-05-14 01:29:23.476858
2013-05-14T01:29:02.589 INFO:teuthology.task.ceph.mon.a.err:mon/MDSMonitor.cc: 88: FAILED assert(version >= mdsmap.epoch)
2013-05-14T01:29:02.589 INFO:teuthology.task.ceph.mon.a.err: ceph version 0.61-277-gc5deb5d (c5deb5dbe2a1e7ada1cab778da41ddb30d9c880b)
2013-05-14T01:29:02.589 INFO:teuthology.task.ceph.mon.a.err: 1: (MDSMonitor::update_from_paxos()+0x80d) [0x52d84d]
2013-05-14T01:29:02.590 INFO:teuthology.task.ceph.mon.a.err: 2: (PaxosService::_active()+0x28c) [0x4f40bc]
2013-05-14T01:29:02.590 INFO:teuthology.task.ceph.mon.a.err: 3: (Context::complete(int)+0xa) [0x4cafda]
2013-05-14T01:29:02.590 INFO:teuthology.task.ceph.mon.a.err: 4: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x95) [0x4cd7f5]
2013-05-14T01:29:02.590 INFO:teuthology.task.ceph.mon.a.err: 5: (Paxos::handle_last(MMonPaxos*)+0xc72) [0x4eda22]
2013-05-14T01:29:02.590 INFO:teuthology.task.ceph.mon.a.err: 6: (Paxos::dispatch(PaxosServiceMessage*)+0x26b) [0x4ede7b]
2013-05-14T01:29:02.591 INFO:teuthology.task.ceph.mon.a.err: 7: (Monitor::_ms_dispatch(Message*)+0x10ac) [0x4c30ac]
2013-05-14T01:29:02.591 INFO:teuthology.task.ceph.mon.a.err: 8: (Monitor::ms_dispatch(Message*)+0x32) [0x4dcc62]
2013-05-14T01:29:02.591 INFO:teuthology.task.ceph.mon.a.err: 9: (DispatchQueue::entry()+0x3f1) [0x6b4ca1]
2013-05-14T01:29:02.591 INFO:teuthology.task.ceph.mon.a.err: 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x64263d]
2013-05-14T01:29:02.591 INFO:teuthology.task.ceph.mon.a.err: 11: (()+0x7e9a) [0x7fde0862ee9a]
2013-05-14T01:29:02.591 INFO:teuthology.task.ceph.mon.a.err: 12: (clone()+0x6d) [0x7fde06bdeccd]
2013-05-14T01:29:02.592 INFO:teuthology.task.ceph.mon.a.err: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

ubuntu@teuthology:/a/teuthology-2013-05-14_01:00:05-rados-next-testing-basic/12938$ cat config.yaml
kernel: &id001
  kdb: true
  sha1: 4ebfb52255fd25a987154e0e8847a4155532f760
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
      osd:
        osd op thread timeout: 60
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 72bf5f4813c273210b5ced7f7793bc1bf813690c
  s3tests:
    branch: next
  workunit:
    sha1: 72bf5f4813c273210b5ced7f7793bc1bf813690c
roles:
- - mon.a
  - mon.b
  - mon.c
  - osd.0
  - osd.1
  - mds.0
  - client.0
targets:
  ubuntu@plana88.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCn53/HM+LXJ+Lp7U6+zxGv4uzsKWyTsUR33DqdlN90Lh5mRIqohFq/oitlEvTuSywmu0Cfo6wOeMF7mRrsGCfeOHE7qPCDlyv99P+K96nS0EvRE/+hGhcQZjrJveGTk9IYgu83UXRjYXjVkPR1lJpBc8Kl+has7uL0vnxnNiZmkKFT+2X9tfLxrvV622eN2keXg4qLc7kRWbPw7Tevk2OV5/Lw591aIlIDQo7DhsVaM002YhFbIw591B4FogwqGtY3X43WcRD5DlyONKFxXSy7Pp1F9W6p0bNEMC2JeP1JRxQeKUgajGlsR2bxLHgFTGRJNygRlgW2gIMceFPrUNHv
tasks:
- internal.lock_machines:
  - 1
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install: null
- ceph: null
- mon_thrash:
    revive_delay: 20
    thrash_delay: 1
- ceph-fuse: null
- workunit:
    clients:
      all:
      - mon/workloadgen.sh
    env:
      DURATION: '600'
      LOADGEN_NUM_OSDS: '5'
      VERBOSE: '1'
Updated by Joao Eduardo Luis almost 11 years ago
Logs are gone. I'll try to reproduce. In the future, it would be nice to grab the logs and data dirs before nuking/aborting the run and just dump them on teuthology, or point to their location on the ticket.
Updated by Florian Wiessner almost 11 years ago
- File mdsmonitor_assert_mon.1.log mdsmonitor_assert_mon.1.log added
- File bug_5069_all_mon_logs.txt bug_5069_all_mon_logs.txt added
I have some more logs - unfortunately, the mon is unable to start up and rejoin the cluster after this assert.
Updated by Joao Eduardo Luis almost 11 years ago
- Status changed from New to 4
This is definitely something wrong in the store: 'version' contains the last committed version on the store, while mdsmap.epoch contains the version of the current in-memory mdsmap.
The only two scenarios in which this could happen that would make some sort of sense are as follows:
- the monitor went through a sync and it managed not to sync the mds keys -- or, at least, mdsmon:last_committed -- which would be a bug!
- the monitor synchronized from a monitor with an older state of the store -- which should be impossible, given we only synchronize from someone with a paxos version higher than the one we have.
The reason why these are the only two that have made sense so far is that, in between bootstraps, resets and the like, the paxos services keep their in-memory state to some extent. For instance, the OSDMonitor keeps its in-memory 'osdmap' although it discards its pending value; the same goes for the MDSMonitor, which keeps its mdsmap. This is not a problem, and it even acts as a checkpoint that comes in handy to guarantee the correctness of a sync: we are thus guaranteed to assert in update_from_paxos() if whatever version we have on the store is lower than whatever we have in memory, given that if the in-memory version is greater, then we must currently be looking at a past state of the store.
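The checkpoint invariant described above can be sketched as follows. This is a hedged, simplified illustration: `InMemoryMDSMap` and `store_is_consistent` are hypothetical names I'm introducing, not Ceph's actual code, which performs this check as the `assert(version >= mdsmap.epoch)` in MDSMonitor::update_from_paxos().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the in-memory mdsmap kept across
// bootstraps/resets.
struct InMemoryMDSMap {
  uint64_t epoch;
};

// The on-disk last committed version must never be older than the
// in-memory map's epoch; if it is, we are looking at a past (e.g.
// badly synced) state of the store. Ceph asserts on this instead of
// returning a bool.
bool store_is_consistent(uint64_t store_last_committed,
                         const InMemoryMDSMap& mdsmap) {
  return store_last_committed >= mdsmap.epoch;
}
```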
I cannot ascertain what might have happened here without a store dump from a majority of monitors in the cluster, including the one that asserted out, as the logs don't really provide the necessary insight. Trying to reproduce it has been an unfruitful task too.
However, an easy conclusion we can take from both scenarios is that we have no good way to assess whether a full-store sync guarantees that all keys have been synchronized. They should, given we take a snapshot of the store and iterate over it, but that alone is no longer enough for me to blindly believe in it. We should maybe consider making the store sync somewhat more service-aware instead of doing a full-store sync regardless of what we are synchronizing.
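The "blind" full-store sync being discussed can be sketched like this. Everything here is illustrative (the `Store` alias and `full_store_sync` are my own names, assuming a simple key/value model, not Ceph's leveldb-backed interface): the requester copies every key from a point-in-time snapshot of the provider's store, with no awareness of which service (mdsmon, osdmap, pgmap, ...) a key belongs to.

```cpp
#include <map>
#include <string>

// Toy key/value store standing in for the monitor's backing store.
using Store = std::map<std::string, std::string>;

// Service-agnostic ("blind") sync: snapshot the provider and copy
// every key. Nothing here verifies that, say, "mdsmon:last_committed"
// actually made it across -- which is exactly the concern raised above.
Store full_store_sync(const Store& provider) {
  Store snapshot = provider;  // point-in-time snapshot
  Store synced;
  for (const auto& [key, value] : snapshot)
    synced[key] = value;
  return synced;
}
```

A service-aware variant would instead iterate per service prefix and check each service's last_committed key after the copy; the trade-off Sage raises below is that such awareness couples the sync to the services layered on top.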
Feedback appreciated before setting this to Need More Info.
Updated by Sage Weil almost 11 years ago
- Status changed from 4 to Need More Info
I think the blind sync is still the way forward.. the more we make it aware of what is on top, the harder it is to do different things (like break up PGMap).
Is there a more complete log that would shed more light?
Updated by Sage Weil almost 11 years ago
- Status changed from Need More Info to In Progress
der, we have a complete log from the teuthology failure.
Updated by Joao Eduardo Luis almost 11 years ago
- Status changed from In Progress to Need More Info
- Priority changed from Urgent to High
Sage thought there were logs on teuthology, in the directory from Tamil's initial report. That run must have been killed, and in fact there are no logs, so I'm marking this Need More Info.
Also, as we haven't seen this since, Sage suggests marking this as High instead of Urgent.
Updated by Joao Eduardo Luis almost 11 years ago
- Status changed from Need More Info to 4
I've been able to reproduce this on some locked nodes that were hammering the monitors pretty hard for the past week.
The job itself was the following:
nuke-on-error: false
interactive-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
    fs: xfs
    log-whitelist:
    - slow request
    branch: master
install:
  ceph:
    branch: master
workunit:
  branch: master
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
- - mon.d
  - mon.e
  - osd.6
  - osd.7
- - osd.8
  - osd.9
  - osd.10
  - mon.f
- - osd.11
  - osd.12
  - osd.13
  - mon.g
tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- mon_thrash:
    revive_delay: 20
    thrash_delay: 1
    store-thrash: true
    thrash-many: true
    store-thrash-probability: 100
- rados:
    clients:
    - client.0
    objects: 50
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
I grabbed the logs and the stores to my desktop, and here are the notes I've been taking so far:
The monitor (mon.d) doesn't have much of anything in its store besides a couple of paxos and pgmap stuff -- no mdsmap at all, so it's no wonder it all goes to crap.
mon.d appears to have gone through a couple of syncs. The first appears to have succeeded, bringing it up to version 2062 like most of the monitors in the cluster; however, while attempting to form quorum, a sync was triggered due to mon.a having version 2074.
mon.d's synchronization was started, but mon.a instructed mon.d to retry the sync; mon.d never retried it, possibly because an election got in the way. In any case, mon.d eventually bootstrapped and joined the cluster as it was. It's hard to assess whether the store got corrupted during the first sync, or whether the sync that was never retried was the ultimate culprit.
Update: even though the synchronization had to be retried, we cleared the whole store prior to sending the sync_start message to the leader. This means we only become aware that we should retry after having cleared the whole store, and if we are killed in between, we may end up with an empty store. Why we would end up with some deleted values anyway is beyond me. Maybe one of those leveldb bugs that bring deleted keys back to life after a compaction?!
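The ordering hazard described in the update can be sketched as follows. This is a toy model with invented names (`SyncRequester`, `start_sync`), not Ceph's actual sync code: the point is only that the store is wiped and the sync flag set before anything from the provider has arrived, so a retry request or a kill in that window leaves an empty, flag-marked store.

```cpp
#include <map>
#include <string>

// Toy model of the sync requester's local state.
struct SyncRequester {
  std::map<std::string, std::string> store;
  bool sync_flag = false;

  // Mirrors the problematic ordering: clear everything up front,
  // mark the on-going sync in the store, then ask the leader to
  // start. A "retry" reply -- or a crash -- after this point finds
  // the store already empty.
  void start_sync() {
    store.clear();
    sync_flag = true;
    // ... would send sync_start to the leader here ...
  }
};
```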
Updated by Joao Eduardo Luis almost 11 years ago
Forgot to mention that the sync flag is set on the store. Sage pointed out that the real issue here is that we're allowing the monitor to join the quorum with a store marked as having an on-going sync (although I recall a patch going in that avoids, or was meant to avoid, just that sort of thing; I'll check when I get back).
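The guard Sage is pointing at amounts to a startup check along these lines (a hedged sketch with an invented function name, not the actual patch): a store still carrying the on-going-sync marker is potentially incomplete, so the monitor should restart or complete the sync rather than join the quorum.

```cpp
// Illustrative guard: refuse quorum membership while the store is
// still marked as mid-sync. In the real monitor this would gate the
// bootstrap/join path, not be a free function.
bool may_join_quorum(bool sync_flag_set_in_store) {
  return !sync_flag_set_in_store;
}
```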
Updated by Sage Weil almost 11 years ago
- Status changed from 4 to Resolved
This should have been fixed with the recent mon sync and other changes.