Bug #12429

OSD crash creating/deleting pools

Added by John Spray over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: other
Tags:
Backport: hammer,firefly
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While working on cephfs tests that aggressively create/delete pools, I saw this crash on master:

  -269> 2015-07-22 12:21:39.232780 7f4275ffd700 -1 ./osd/OSDMap.h: In function 'const string& OSDMap::get_pool_name(int64_t) const' thread 7f4275ffd700 time 2015-07-22 12:21:39.215478
./osd/OSDMap.h: 724: FAILED assert(i != pool_name.end())

 ceph version 9.0.2-755-g44464f9 (44464f9c689bcbf961521ff6053d0a431be06b75)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x1aef453]
 2: (OSDMap::get_pool_name(long) const+0x7a) [0x14342d0]
 3: (PGPool::update(std::tr1::shared_ptr<OSDMap const>)+0xbf) [0x163c8b5]
 4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x297) [0x16674db]
 5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x418) [0x140e8b6]
 6: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x19a) [0x141f916]
 7: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x30) [0x143ca00]
 8: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x33) [0x14e188d]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0x70b) [0x1adf45d]
 10: (ThreadPool::WorkThread::entry()+0x23) [0x1ae3413]
 11: (Thread::entry_wrapper()+0xa8) [0x1ad78b4]
 12: (Thread::_entry_func(void*)+0x18) [0x1ad7802]
 13: (()+0x7555) [0x7f4292706555]
 14: (clone()+0x6d) [0x7f4290f21f3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
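
For reference, the failing assert is the name lookup in OSDMap::get_pool_name(). A rough sketch of that accessor (simplified from OSDMap.h around line 724, not the exact source at this SHA):

    const string& get_pool_name(int64_t p) const {
      // pool_name maps pool id -> name and should have an entry for
      // every pool present in the pools map.
      map<int64_t, string>::const_iterator i = pool_name.find(p);
      assert(i != pool_name.end());   // the FAILED assert in the backtrace
      return i->second;
    }

So the crash means the OSD was handed a map in which a pool id it still references has no corresponding pool_name entry.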


Files

cluster.mon.a.log (147 KB) - John Spray, 07/22/2015 11:26 AM
osd.0.log.short.gz (188 KB) - John Spray, 07/22/2015 11:27 AM
mon.a.log.log.short.gz (296 KB) - John Spray, 07/29/2015 01:03 PM
osd.0.log.log.short.gz (356 KB) - John Spray, 07/29/2015 01:03 PM
osdmap.bin (3.67 KB) - John Spray, 07/29/2015 01:27 PM

Related issues 2 (0 open, 2 closed)

Copied to Ceph - Backport #12584: OSD crash creating/deleting pools (Resolved, Kefu Chai)
Copied to Ceph - Backport #12585: OSD crash creating/deleting pools (Resolved, Nathan Cutler, 07/22/2015)
#1 - Updated by John Spray over 8 years ago

#3 - Updated by huang jun over 8 years ago

I just tested creating and deleting 1000 pools, and it works fine.
What's your test load, or how can I reproduce it on my test cluster?
Thanks.

#4 - Updated by John Spray over 8 years ago

It has only happened once over several days of fairly regular creation/destruction.

It was the cephfs tests (tasks/cephfs in ceph-qa-suite) running against a vstart cluster -- this is a new thing, but I'll send some notes to ceph-devel about how to run it when it's ready.

Probably worth noting that this wasn't just a create/delete thrash: these were real tests, with IO running between create/delete cycles every 30 seconds or so.

#5 - Updated by John Spray over 8 years ago

Another instance. Crashed in get_pool_name on the mon as well.

#6 - Updated by John Spray over 8 years ago

The OSDMap itself is missing the pool_name entry for a pool that exists -- looks like a mon bug.

#7 - Updated by John Spray over 8 years ago

I think the problem may be caused by the update to the OSDMap that we do during "fs new", setting crash_replay_interval.

When we do this, the pool goes onto the incremental's new_pools, and in apply_incremental new_pools is handled after old_pools, so a deleted pool comes back from the dead (this is just a theory, but it's the only one I have so far). We can either make OSDMap tolerant of incremental updates that both update and delete the same pool, or make MDSMonitor wait for its OSDMonitor updates to commit before it proceeds.
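
To make the theory concrete, here is a minimal self-contained sketch of that ordering hazard. It uses stand-in types (Map, Pool, Incremental) rather than the real OSDMap and pg_pool_t; only the member names pools, pool_name, new_pools and old_pools mirror the actual ones, and the pool id/name are made up for the example:

    // If one incremental both deletes a pool (old_pools) and carries an
    // update for it (new_pools), applying new_pools after old_pools
    // resurrects the pool entry without restoring its name.
    #include <cassert>
    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>

    struct Pool {};  // stand-in for pg_pool_t

    struct Map {
      std::map<int64_t, Pool> pools;
      std::map<int64_t, std::string> pool_name;

      struct Incremental {
        std::set<int64_t> old_pools;        // pools to delete
        std::map<int64_t, Pool> new_pools;  // pools to create/update
      };

      void apply(const Incremental& inc) {
        for (std::set<int64_t>::const_iterator p = inc.old_pools.begin();
             p != inc.old_pools.end(); ++p) {   // deletions handled first...
          pools.erase(*p);
          pool_name.erase(*p);
        }
        for (std::map<int64_t, Pool>::const_iterator i = inc.new_pools.begin();
             i != inc.new_pools.end(); ++i)
          pools[i->first] = i->second;          // ...then updates: the pool is
                                                // back, but its name is not
      }
    };

    int main() {
      Map m;
      m.pools[7] = Pool{};                  // a pool that exists
      m.pool_name[7] = "cephfs_data";       // (id/name invented for the example)

      Map::Incremental inc;
      inc.old_pools.insert(7);              // delete pool 7
      inc.new_pools[7] = Pool{};            // and update it in the same epoch
      m.apply(inc);

      // pools has pool 7 again but pool_name does not, so a subsequent
      // get_pool_name(7) on the OSD would trip the assert in the backtrace.
      assert(m.pools.count(7) == 1 && m.pool_name.count(7) == 0);
      return 0;
    }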

#8 - Updated by Greg Farnum over 8 years ago

I'm fairly confused about how our tests could have created this scenario. Once we request_proposal(), any subsequent actions will be ordered in the next epoch. And even if they weren't, how could we compress an fs create and a delete into a single paxos stage as part of an entire test run?

#9 - Updated by Kefu Chai over 8 years ago

  • Status changed from New to Fix Under Review

#10 - Updated by Kefu Chai over 8 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to hammer,firefly

#11 - Updated by Loïc Dachary over 8 years ago

  • Status changed from Pending Backport to Resolved