Bug #12429
OSD crash creating/deleting pools (closed)
Description
While working on cephfs tests that aggressively create/delete pools, I saw this crash on master:
-269> 2015-07-22 12:21:39.232780 7f4275ffd700 -1 ./osd/OSDMap.h: In function 'const string& OSDMap::get_pool_name(int64_t) const' thread 7f4275ffd700 time 2015-07-22 12:21:39.215478
./osd/OSDMap.h: 724: FAILED assert(i != pool_name.end())

 ceph version 9.0.2-755-g44464f9 (44464f9c689bcbf961521ff6053d0a431be06b75)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x1aef453]
 2: (OSDMap::get_pool_name(long) const+0x7a) [0x14342d0]
 3: (PGPool::update(std::tr1::shared_ptr<OSDMap const>)+0xbf) [0x163c8b5]
 4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x297) [0x16674db]
 5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x418) [0x140e8b6]
 6: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x19a) [0x141f916]
 7: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x30) [0x143ca00]
 8: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x33) [0x14e188d]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x70b) [0x1adf45d]
 10: (ThreadPool::WorkThread::entry()+0x23) [0x1ae3413]
 11: (Thread::entry_wrapper()+0xa8) [0x1ad78b4]
 12: (Thread::_entry_func(void*)+0x18) [0x1ad7802]
 13: (()+0x7555) [0x7f4292706555]
 14: (clone()+0x6d) [0x7f4290f21f3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
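The assert at OSDMap.h:724 fires when a pool id is looked up in the map's pool_name container and no entry is found. A minimal sketch of the failing pattern, simplified from the assert text above (not the actual OSDMap code):

    #include <cassert>
    #include <cstdint>
    #include <map>
    #include <string>

    // Simplified sketch: get_pool_name() assumes the id is always present
    // in pool_name, so a pool that exists elsewhere in the map but has no
    // pool_name entry trips assert(i != pool_name.end()).
    struct OSDMapSketch {
      std::map<int64_t, std::string> pool_name;

      const std::string& get_pool_name(int64_t p) const {
        auto i = pool_name.find(p);
        assert(i != pool_name.end());  // the FAILED assert in the backtrace
        return i->second;
      }
    };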
Updated by John Spray almost 9 years ago
- File cluster.mon.a.log cluster.mon.a.log added
- Category set to OSD
- Priority changed from Normal to High
Updated by John Spray almost 9 years ago
- File osd.0.log.short.gz osd.0.log.short.gz added
Updated by huang jun almost 9 years ago
I just tested creating and deleting 1000 pools, and it works fine.
What is your test load, or how can I reproduce this on my test cluster?
Thanks.
Updated by John Spray almost 9 years ago
It has only happened once over several days of fairly regular creation/destruction.
It was the cephfs tests (tasks/cephfs in ceph-qa-suite) running against a vstart cluster -- this is a new thing, but I'll send some notes to ceph-devel about how to run it when it's ready.
Probably worth noting that this wasn't just a create/delete thrash: it was real tests with IO running between create/delete cycles every 30 seconds or so.
Updated by John Spray almost 9 years ago
- File mon.a.log.log.short.gz mon.a.log.log.short.gz added
- File osd.0.log.log.short.gz osd.0.log.log.short.gz added
Another instance. Crashed in get_pool_name on the mon as well.
Updated by John Spray almost 9 years ago
- File osdmap.bin osdmap.bin added
The OSDMap itself is missing the pool_name entry for a pool that exists -- looks like a mon bug.
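A quick way to confirm that kind of inconsistency in a decoded map is to cross-check pool ids against pool_name. A hypothetical check, using simplified containers like the sketch in the description (not the real OSDMap API; the pools value type is a placeholder):

    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <string>

    // Hypothetical cross-check: report any pool id that exists in the pools
    // container but has no matching pool_name entry -- the state seen in the
    // attached osdmap.bin.
    void check_pool_names(const std::map<int64_t, int>& pools,
                          const std::map<int64_t, std::string>& pool_name) {
      for (const auto& [id, ignored] : pools) {
        (void)ignored;
        if (pool_name.find(id) == pool_name.end())
          std::printf("pool %lld has no pool_name entry\n", (long long)id);
      }
    }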
Updated by John Spray almost 9 years ago
I think the problem may be caused by the update to the osdmap that we do during "fs new", setting crash_replay_interval.
When we do this, the pool goes onto OSDMap.new_pools, and in apply_incremental, new_pools is handled after old_pools, so a deleted pool comes back from the dead (this is just a theory, but the only one I have so far). We can either make OSDMap tolerant of incremental updates that both update and delete the same pool, or make MDSMonitor wait for its OSDMonitor updates to commit before it proceeds.
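To make the theory concrete, here is a hypothetical sketch of the ordering described above (member names are borrowed from the real OSDMap::Incremental, but this is not the actual apply_incremental code): if deletions in old_pools are applied before updates in new_pools, an incremental carrying both for the same pool re-inserts it into pools without restoring its pool_name entry, which would produce exactly the inconsistent map seen in the crash.

    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>

    struct PoolInfoSketch { uint32_t crash_replay_interval = 0; };

    // Simplified incremental: deleted pool ids plus created/updated pools.
    struct IncrementalSketch {
      std::set<int64_t> old_pools;
      std::map<int64_t, PoolInfoSketch> new_pools;
      std::map<int64_t, std::string> new_pool_names;
    };

    struct MapSketch {
      std::map<int64_t, PoolInfoSketch> pools;
      std::map<int64_t, std::string> pool_name;

      void apply_incremental(const IncrementalSketch& inc) {
        // Deletions first: the pool and its name are both removed...
        for (int64_t p : inc.old_pools) {
          pools.erase(p);
          pool_name.erase(p);
        }
        // ...then updates: if the same pool id also appears in new_pools
        // (e.g. the "fs new" crash_replay_interval update), it is re-inserted
        // into pools, but pool_name is only touched for new_pool_names
        // entries -- leaving a nameless pool, as in this crash.
        for (const auto& [p, info] : inc.new_pools)
          pools[p] = info;
        for (const auto& [p, name] : inc.new_pool_names)
          pool_name[p] = name;
      }
    };

Under this reading, the first suggested fix corresponds to skipping new_pools entries whose id is also in old_pools, and the second to never letting the update and the deletion land in the same incremental.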
Updated by Greg Farnum almost 9 years ago
I'm fairly confused about how our tests could have created this scenario. Once we request_proposal(), any subsequent actions will be ordered in the next epoch. And even if they weren't, how could we compress an fs create and a delete into a single paxos stage as part of an entire test run?
Updated by Kefu Chai over 8 years ago
- Status changed from New to Fix Under Review
Updated by Kefu Chai over 8 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to hammer,firefly
Updated by Loïc Dachary over 8 years ago
- Status changed from Pending Backport to Resolved