Bug #4250

closed

mon: crash in finish_proposal after recovery

Added by Sage Weil about 11 years ago. Updated about 11 years ago.

Status: Resolved
Priority: Urgent
Category: Monitor
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2013-02-23 09:25:32.171986 7f5dd6d96700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f5dd6d96700

 ceph version 0.57-493-g704db85 (704db850131643b26bafe6594946cacce483c171)
 1: ceph-mon() [0x59cb6a]
 2: (()+0xfcb0) [0x7f5ddb8f7cb0]
 3: (Paxos::finish_proposal()+0x133) [0x4d9c83]
 4: (Paxos::handle_accept(MMonPaxos*)+0x77c) [0x4dad3c]
 5: (Paxos::dispatch(PaxosServiceMessage*)+0x24b) [0x4de63b]
 6: (Monitor::_ms_dispatch(Message*)+0x145f) [0x4b72ef]
 7: (Monitor::ms_dispatch(Message*)+0x32) [0x4cd962]
 8: (DispatchQueue::entry()+0x341) [0x6b0e11]
 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x64002d]
 10: (()+0x7e9a) [0x7f5ddb8efe9a]
 11: (clone()+0x6d) [0x7f5dda0a84bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---

from job
ubuntu@teuthology:/a/sage-2013-02-23_08:44:35-regression-master-testing-basic/10218$ cat orig.config.yaml 
kernel:
  kdb: true
  sha1: 92a49fb0f79f3300e6e50ddf56238e70678e4202
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      mon.b:
        clock offset: 10
    log-whitelist:
    - slow request
    sha1: 704db850131643b26bafe6594946cacce483c171
  s3tests:
    branch: master
  workunit:
    sha1: 704db850131643b26bafe6594946cacce483c171
roles:
- - mon.a
  - mon.d
  - mon.g
  - mon.j
  - mon.m
  - mon.p
  - mon.s
  - osd.0
- - mon.b
  - mon.e
  - mon.h
  - mon.k
  - mon.n
  - mon.q
  - mon.t
  - mds.a
- - mon.c
  - mon.f
  - mon.i
  - mon.l
  - mon.o
  - mon.r
  - mon.u
  - osd.1
tasks:
- chef: null
- clock: null
- install: null
- ceph:
    log-whitelist:
    - slow request
    - .*clock.*skew.*
    - clocks not synchronized
    wait-for-healthy: false
- mon_clock_skew_check:
    expect-skew: true
Actions #1

Updated by Sage Weil about 11 years ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by Joao Eduardo Luis about 11 years ago

I haven't been able to reproduce this, but given the stack trace, I have a feeling that it was fixed by 98408f5ca4f2396838002be739cb2f5d15b7aac3

Actions #3

Updated by Tamilarasi muthamizhan about 11 years ago

recent logs: ubuntu@teuthology:/a/teuthology-2013-02-25_01:00:05-regression-master-testing-gcov/11462

Actions #5

Updated by Joao Eduardo Luis about 11 years ago

Unless something was triggered that I'm not aware of, it doesn't appear that it did. Tamil's update was in fact #4256 (prior to being patched).

Actions #6

Updated by Tamilarasi muthamizhan about 11 years ago

Logs: ubuntu@teuthology:/a/teuthology-2013-03-07_01:00:05-regression-next-testing-basic/17589
     0> 2013-03-07 10:50:25.886062 7f92f89ef700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f92f89ef700

 ceph version 0.58-351-ga58eec9 (a58eec90caf3a3d04c9e7bd4e6b9c160b6b69175)
 1: ceph-mon() [0x580cba]
 2: (()+0xfcb0) [0x7f92fd75acb0]
 3: (Paxos::finish_proposal()+0x133) [0x4dcda3]
 4: (Paxos::handle_accept(MMonPaxos*)+0x77c) [0x4ddd0c]
 5: (Paxos::dispatch(PaxosServiceMessage*)+0x24b) [0x4e181b]
 6: (Monitor::_ms_dispatch(Message*)+0x145f) [0x4ba2af]
 7: (Monitor::ms_dispatch(Message*)+0x32) [0x4d0d12]
 8: (DispatchQueue::entry()+0x341) [0x695441]
 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x62472d]
 10: (()+0x7e9a) [0x7f92fd752e9a]
 11: (clone()+0x6d) [0x7f92fbd02cbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

ubuntu@teuthology:/a/teuthology-2013-03-07_01:00:05-regression-next-testing-basic/17589$ cat config.yaml 
kernel: &id001
  kdb: true
  sha1: 2f60d3028438dd1fef122d37786ee685d727e8a7
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      mon.b:
        clock offset: 10
    log-whitelist:
    - slow request
    sha1: a58eec90caf3a3d04c9e7bd4e6b9c160b6b69175
  s3tests:
    branch: next
  workunit:
    sha1: a58eec90caf3a3d04c9e7bd4e6b9c160b6b69175
roles:
- - mon.a
  - mon.d
  - mon.g
  - mon.j
  - mon.m
  - mon.p
  - mon.s
  - osd.0
- - mon.b
  - mon.e
  - mon.h
  - mon.k
  - mon.n
  - mon.q
  - mon.t
  - mds.a
- - mon.c
  - mon.f
  - mon.i
  - mon.l
  - mon.o
  - mon.r
  - mon.u
  - osd.1
targets:
  ubuntu@plana14.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCs3IrZeF7C0YL6VU9avVBEyNgKwbiH45BBAuxQMCUyVTsJHy60uThveH1PBt6TuMvRILpujJCYp9kFcLAy+kjyTbZBHZmkoX44wT6AzqLJgGpJzpAEgd3ZpR8Nx9oTuZ0B3le8JFiK90wgJbJrsbpi9x8FtVUZQH+8PilAwza8P6ZSKzwv1dIeCxCqtkZ2oFMIzspLgLLAZ4gkZGfMc43PubouSr4b8TnfTyay4imcJyc6lhAAOhFng6ebcBuCK09QQJ7c3Y5Tgiqh82UqUAO9RTFmeLkXRsWHJ8L1N8+PvtBmQ83neQvql/+cTHz8lDUjomNLOrfUGkj6K4583Tq3
  ubuntu@plana15.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCrdOMpeQVLQ+RrMyCLqxOoU/uNzrq/WmYYHhE9yAJSAeD652SCOzMCaChBawwFypiHB/1Zv++PGx2mIceuh8BpAjs0iWoWwj39TDMsB8GYm2A5qFK9BfG080rc8LtmNX//IX3IdbwzxKIM3odcrg1sdQ4p6zLMQYiuwUb5+8clItH7Vl8SzgT6Y+NNyXuwQRZ2JqCcnuV22fSpcfEYVh3HtjXw/G6k/NmdPnP3lab5kQzYsio9A2WmlGmtHHntRMZ+syMCPZI6Rn7rySElxLoet9WqK0qxcusHmPZf1N4gBre0fYnSK7ix6N7TRXlI86TA5Z/VHmkqDaSyuO4YUYoL
  ubuntu@plana66.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDKx2XFNsNesOeyXr45nVdT6jd4xrHGvtyU/Bf5YRNrbxyESvxzeeAX9WZi2oDS1b1OSRnu7mvSOr5IcAEs+N5sNFMyQpx0rPO2WLCI3O8BVm37AC0phqAvrZeYM4bTw5oQ/59J9k0+2KlzwZ9z1xrdHko1NOlZ5C7y9+R2y8z+tnfJKI7yXeGjWSVsxLML8l5SzZ4PTW+EaTK+W94gSBMKOAvW7fbMPkyHft/QQyyg9maCJ674LQ6nQSL9JskVCi2F/njx0f/OL5ej8TPvL90qPZ3nn7ueAfokk//TGi/Actuerd2isdON/yOoRTouzCItYFlw7Zk9/iuEFJhISpDz
tasks:
- internal.lock_machines:
  - 3
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- install: null
- ceph:
    log-whitelist:
    - slow request
    - .*clock.*skew.*
    - clocks not synchronized
    wait-for-healthy: false
- mon_clock_skew_check:
    expect-skew: true

Actions #7

Updated by Tamilarasi muthamizhan about 11 years ago

ubuntu@teuthology:/a/teuthology-2013-03-06_01:00:04-regression-master-testing-gcov/17076

Actions #8

Updated by Joao Eduardo Luis about 11 years ago

Patch commit:2fa8bc2384005cb301158cc9a52682ebcb65efad on wip-4250 has fixed this. Still letting it run through teuthology a bit longer to make sure.

Actions #9

Updated by Joao Eduardo Luis about 11 years ago

  • Subject changed from mon: crash in finish_proposal during clock skew check to mon: crash in finish_proposal after recovery

This bug has nothing to do with timechecks. It was caused by the proposal that Paxos itself makes after a recovery in order to run a learned old value. Because this proposal is made directly to Paxos, skipping the proposal queue, we ended up dereferencing list.front() in finish_proposal() when list.size() == 0.
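
For illustration only, here is a minimal sketch of the failure pattern described above and of one way to guard against it. This is not the actual Ceph code or the actual patch: MiniPaxos, QueuedProposal, queue_proposal() and pending() are hypothetical names made up for this example. Only the idea of calling front() on an empty list in finish_proposal() comes from the comment above.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>

// Hypothetical stand-in for a queued proposal; not Ceph's actual type.
struct QueuedProposal {
  std::string value;
  std::function<void()> on_finished;  // completion callback, may be empty
};

// Hypothetical miniature of the pattern: proposals normally pass through a
// queue, but a proposal made directly (e.g. after recovery) never does.
class MiniPaxos {
 public:
  // Normal path: the proposal is queued before being proposed.
  void queue_proposal(std::string value, std::function<void()> cb) {
    proposals_.push_back({std::move(value), std::move(cb)});
  }

  // Called once a proposal has been accepted. Returns true if a queued
  // entry was completed, false if the accepted proposal never went
  // through the queue and there is nothing to complete.
  bool finish_proposal() {
    if (proposals_.empty()) {
      // Calling proposals_.front() here would be undefined behavior,
      // which is the kind of access that can show up as a segfault.
      return false;
    }
    QueuedProposal done = std::move(proposals_.front());
    proposals_.pop_front();
    if (done.on_finished) {
      done.on_finished();
    }
    return true;
  }

  std::size_t pending() const { return proposals_.size(); }

 private:
  std::list<QueuedProposal> proposals_;
};
```

In this sketch, calling finish_proposal() on a freshly constructed MiniPaxos returns false instead of touching front(), which mirrors the "proposal made directly, queue is empty" case. Whether the real fix checks for emptiness or handles the direct proposal differently is not stated in this ticket.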

Actions #10

Updated by Joao Eduardo Luis about 11 years ago

  • Status changed from 12 to Fix Under Review
Actions #11

Updated by Sage Weil about 11 years ago

  • Status changed from Fix Under Review to Resolved