Bug #5986

mon: FAILED assert(snaps.count(s)) when removing pool snap on 0.61.7

Added by Joao Eduardo Luis over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Urgent
Category: Monitor
Target version: -
% Done: 0%
Source: Community (dev)
Backport: cuttlefish
Severity: 3 - minor

Description

While attempting to reproduce #5959, I managed to trigger this crash. It doesn't trigger on next, but I'm able to trigger it quite reliably on 0.61.7.

    -9> 2013-08-15 15:10:06.166078 7f32cdf51700  1 -- 127.0.0.1:6789/0 <== client.? 127.0.0.1:0/30484 5 ==== mon_command(osd pool rmsnap data data1 v 0) v1 ==== 80+0+0 (2596197869 0 0) 0x2b88f00 con 0x2bab8c0
    -8> 2013-08-15 15:10:06.166091 7f32cdf51700 20 mon.a@0(leader) e1 have connection
    -7> 2013-08-15 15:10:06.166095 7f32cdf51700 20 mon.a@0(leader) e1 ms_dispatch existing session MonSession: client.? 127.0.0.1:0/30484 is openallow * for client.? 127.0.0.1:0/30484
    -6> 2013-08-15 15:10:06.166101 7f32cdf51700 20 mon.a@0(leader) e1  caps allow *
    -5> 2013-08-15 15:10:06.166105 7f32cdf51700  0 mon.a@0(leader) e1 handle_command mon_command(osd pool rmsnap data data1 v 0) v1
    -4> 2013-08-15 15:10:06.166109 7f32cdf51700 10 mon.a@0(leader).paxosservice(osdmap 1..154) dispatch mon_command(osd pool rmsnap data data1 v 0) v1 from client.? 127.0.0.1:0/30484
    -3> 2013-08-15 15:10:06.166115 7f32cdf51700  1 mon.a@0(leader).paxos(paxos active c 1..519) is_readable now=2013-08-15 15:10:06.166116 lease_expire=2013-08-15 15:10:08.752149 has v0 lc 519
    -2> 2013-08-15 15:10:06.166122 7f32cdf51700 10 mon.a@0(leader).osd e154 preprocess_query mon_command(osd pool rmsnap data data1 v 0) v1 from client.? 127.0.0.1:0/30484
    -1> 2013-08-15 15:10:06.166134 7f32cdf51700  7 mon.a@0(leader).osd e154 prepare_update mon_command(osd pool rmsnap data data1 v 0) v1 from client.? 127.0.0.1:0/30484
     0> 2013-08-15 15:10:06.167328 7f32cdf51700 -1 osd/osd_types.cc: In function 'void pg_pool_t::remove_snap(snapid_t)' thread 7f32cdf51700 time 2013-08-15 15:10:06.166154
osd/osd_types.cc: 648: FAILED assert(snaps.count(s))

 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
 1: ./ceph-mon() [0x696e6d]
 2: (OSDMonitor::prepare_command(MMonCommand*)+0x5c1d) [0x52e24d]
 3: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x23b) [0x52fe9b]
 4: (PaxosService::dispatch(PaxosServiceMessage*)+0x702) [0x50efa2]
 5: (Monitor::handle_command(MMonCommand*)+0x24f) [0x4dbaaf]
 6: (Monitor::_ms_dispatch(Message*)+0x8bb) [0x4ddceb]
 7: (Monitor::ms_dispatch(Message*)+0x32) [0x4fac72]
 8: (DispatchQueue::entry()+0x3c3) [0x6ba0c3]
 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x64e43d]
 10: (()+0x7f8e) [0x7f32d2f27f8e]
 11: (clone()+0x6d) [0x7f32d1471e1d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

while running:

    $ lrm=0 i=0
    $ while true; do
        ( ceph osd pool rmsnap data data1 & )
        ( ceph osd pool mksnap data data1 & )
        if [[ $i -gt 2 ]]; then
          ( ceph osd pool rmsnap dev-test test-snap$lrm & )
          lrm=$((lrm+1))
        fi
        ceph osd pool mksnap dev-test test-snap$i
        i=$((i+1))
        echo
      done

My guess is that this results from exposing uncommitted state; given that we've been fixing this sort of thing lately, that would explain why we're no longer able to trigger it on next. It would be nice to backport the patch if that's indeed the case. I'm also going to try to trigger this on cuttlefish's HEAD.
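To illustrate the failure mode: if the command handler validates a `rmsnap` against uncommitted (pending) state, two racing `rmsnap` requests can both pass the existence check, and the second one then trips the `assert(snaps.count(s))` inside `remove_snap()`. The sketch below is a deliberate simplification, not the real `pg_pool_t` (which maps `snapid_t` to `pool_snap_info_t`); the names and the `remove_snap_checked` variant are hypothetical, for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical simplification of pg_pool_t's snapshot bookkeeping.
struct pool_t {
    std::map<uint64_t, std::string> snaps;

    void add_snap(uint64_t s, const std::string& name) {
        snaps[s] = name;
    }

    // Mirrors the failing precondition in osd/osd_types.cc:648:
    // the snap must already exist. If two rmsnap commands are both
    // validated against state that doesn't yet reflect the first
    // removal, the second call reaches this assert and aborts.
    void remove_snap(uint64_t s) {
        assert(snaps.count(s));  // FAILED assert(snaps.count(s))
        snaps.erase(s);
    }

    // Defensive variant: check the authoritative state and reject
    // a stale request instead of asserting.
    bool remove_snap_checked(uint64_t s) {
        if (!snaps.count(s))
            return false;
        snaps.erase(s);
        return true;
    }
};
```

With this model, `remove_snap(1)` after a prior removal of snap 1 aborts the process, while `remove_snap_checked(1)` simply returns false.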

Actions #1

Updated by Joao Eduardo Luis over 10 years ago

  • Assignee set to Joao Eduardo Luis
Actions #2

Updated by Joao Eduardo Luis over 10 years ago

I can now confirm this is also really easy to trigger on cuttlefish HEAD.

Actions #3

Updated by Joao Eduardo Luis over 10 years ago

  • Description updated (diff)
Actions #4

Updated by Joao Eduardo Luis over 10 years ago

  • Status changed from 12 to Pending Backport
  • Backport set to cuttlefish

Well, duh.

This was fixed by Sage (and reviewed by me) in commit d90683fdeda15b726dcf0a7cab7006c31e99f146.

Actions #5

Updated by Sage Weil over 10 years ago

  • Priority changed from Normal to Urgent
Actions #6

Updated by Sage Weil over 10 years ago

  • Status changed from Pending Backport to Resolved