Bug #4250
closed
mon: crash in finish_proposal after recovery
Added by Sage Weil about 11 years ago.
Updated about 11 years ago.
Description
2013-02-23 09:25:32.171986 7f5dd6d96700 -1 *** Caught signal (Segmentation fault) **
in thread 7f5dd6d96700
ceph version 0.57-493-g704db85 (704db850131643b26bafe6594946cacce483c171)
1: ceph-mon() [0x59cb6a]
2: (()+0xfcb0) [0x7f5ddb8f7cb0]
3: (Paxos::finish_proposal()+0x133) [0x4d9c83]
4: (Paxos::handle_accept(MMonPaxos*)+0x77c) [0x4dad3c]
5: (Paxos::dispatch(PaxosServiceMessage*)+0x24b) [0x4de63b]
6: (Monitor::_ms_dispatch(Message*)+0x145f) [0x4b72ef]
7: (Monitor::ms_dispatch(Message*)+0x32) [0x4cd962]
8: (DispatchQueue::entry()+0x341) [0x6b0e11]
9: (DispatchQueue::DispatchThread::entry()+0xd) [0x64002d]
10: (()+0x7e9a) [0x7f5ddb8efe9a]
11: (clone()+0x6d) [0x7f5dda0a84bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
from job
ubuntu@teuthology:/a/sage-2013-02-23_08:44:35-regression-master-testing-basic/10218$ cat orig.config.yaml
kernel:
  kdb: true
  sha1: 92a49fb0f79f3300e6e50ddf56238e70678e4202
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      mon.b:
        clock offset: 10
    log-whitelist:
    - slow request
    sha1: 704db850131643b26bafe6594946cacce483c171
  s3tests:
    branch: master
  workunit:
    sha1: 704db850131643b26bafe6594946cacce483c171
roles:
- - mon.a
  - mon.d
  - mon.g
  - mon.j
  - mon.m
  - mon.p
  - mon.s
  - osd.0
- - mon.b
  - mon.e
  - mon.h
  - mon.k
  - mon.n
  - mon.q
  - mon.t
  - mds.a
- - mon.c
  - mon.f
  - mon.i
  - mon.l
  - mon.o
  - mon.r
  - mon.u
  - osd.1
tasks:
- chef: null
- clock: null
- install: null
- ceph:
    log-whitelist:
    - slow request
    - .*clock.*skew.*
    - clocks not synchronized
    wait-for-healthy: false
- mon_clock_skew_check:
    expect-skew: true
- Priority changed from Normal to Urgent
recent logs: ubuntu@teuthology:/a/teuthology-2013-02-25_01:00:05-regression-master-testing-gcov/11462
Unless something was triggered that I'm not aware of, it doesn't appear that it did. Tamil's update was in fact #4256 (prior to being patched).
Logs: ubuntu@teuthology:/a/teuthology-2013-03-07_01:00:05-regression-next-testing-basic/17589
0> 2013-03-07 10:50:25.886062 7f92f89ef700 -1 *** Caught signal (Segmentation fault) **
in thread 7f92f89ef700
ceph version 0.58-351-ga58eec9 (a58eec90caf3a3d04c9e7bd4e6b9c160b6b69175)
1: ceph-mon() [0x580cba]
2: (()+0xfcb0) [0x7f92fd75acb0]
3: (Paxos::finish_proposal()+0x133) [0x4dcda3]
4: (Paxos::handle_accept(MMonPaxos*)+0x77c) [0x4ddd0c]
5: (Paxos::dispatch(PaxosServiceMessage*)+0x24b) [0x4e181b]
6: (Monitor::_ms_dispatch(Message*)+0x145f) [0x4ba2af]
7: (Monitor::ms_dispatch(Message*)+0x32) [0x4d0d12]
8: (DispatchQueue::entry()+0x341) [0x695441]
9: (DispatchQueue::DispatchThread::entry()+0xd) [0x62472d]
10: (()+0x7e9a) [0x7f92fd752e9a]
11: (clone()+0x6d) [0x7f92fbd02cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
ubuntu@teuthology:/a/teuthology-2013-03-07_01:00:05-regression-next-testing-basic/17589$ cat config.yaml
kernel: &id001
  kdb: true
  sha1: 2f60d3028438dd1fef122d37786ee685d727e8a7
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      mon.b:
        clock offset: 10
    log-whitelist:
    - slow request
    sha1: a58eec90caf3a3d04c9e7bd4e6b9c160b6b69175
  s3tests:
    branch: next
  workunit:
    sha1: a58eec90caf3a3d04c9e7bd4e6b9c160b6b69175
roles:
- - mon.a
  - mon.d
  - mon.g
  - mon.j
  - mon.m
  - mon.p
  - mon.s
  - osd.0
- - mon.b
  - mon.e
  - mon.h
  - mon.k
  - mon.n
  - mon.q
  - mon.t
  - mds.a
- - mon.c
  - mon.f
  - mon.i
  - mon.l
  - mon.o
  - mon.r
  - mon.u
  - osd.1
targets:
ubuntu@plana14.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCs3IrZeF7C0YL6VU9avVBEyNgKwbiH45BBAuxQMCUyVTsJHy60uThveH1PBt6TuMvRILpujJCYp9kFcLAy+kjyTbZBHZmkoX44wT6AzqLJgGpJzpAEgd3ZpR8Nx9oTuZ0B3le8JFiK90wgJbJrsbpi9x8FtVUZQH+8PilAwza8P6ZSKzwv1dIeCxCqtkZ2oFMIzspLgLLAZ4gkZGfMc43PubouSr4b8TnfTyay4imcJyc6lhAAOhFng6ebcBuCK09QQJ7c3Y5Tgiqh82UqUAO9RTFmeLkXRsWHJ8L1N8+PvtBmQ83neQvql/+cTHz8lDUjomNLOrfUGkj6K4583Tq3
ubuntu@plana15.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCrdOMpeQVLQ+RrMyCLqxOoU/uNzrq/WmYYHhE9yAJSAeD652SCOzMCaChBawwFypiHB/1Zv++PGx2mIceuh8BpAjs0iWoWwj39TDMsB8GYm2A5qFK9BfG080rc8LtmNX//IX3IdbwzxKIM3odcrg1sdQ4p6zLMQYiuwUb5+8clItH7Vl8SzgT6Y+NNyXuwQRZ2JqCcnuV22fSpcfEYVh3HtjXw/G6k/NmdPnP3lab5kQzYsio9A2WmlGmtHHntRMZ+syMCPZI6Rn7rySElxLoet9WqK0qxcusHmPZf1N4gBre0fYnSK7ix6N7TRXlI86TA5Z/VHmkqDaSyuO4YUYoL
ubuntu@plana66.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDKx2XFNsNesOeyXr45nVdT6jd4xrHGvtyU/Bf5YRNrbxyESvxzeeAX9WZi2oDS1b1OSRnu7mvSOr5IcAEs+N5sNFMyQpx0rPO2WLCI3O8BVm37AC0phqAvrZeYM4bTw5oQ/59J9k0+2KlzwZ9z1xrdHko1NOlZ5C7y9+R2y8z+tnfJKI7yXeGjWSVsxLML8l5SzZ4PTW+EaTK+W94gSBMKOAvW7fbMPkyHft/QQyyg9maCJ674LQ6nQSL9JskVCi2F/njx0f/OL5ej8TPvL90qPZ3nn7ueAfokk//TGi/Actuerd2isdON/yOoRTouzCItYFlw7Zk9/iuEFJhISpDz
tasks:
- internal.lock_machines:
  - 3
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- install: null
- ceph:
    log-whitelist:
    - slow request
    - .*clock.*skew.*
    - clocks not synchronized
    wait-for-healthy: false
- mon_clock_skew_check:
    expect-skew: true
ubuntu@teuthology:/a/teuthology-2013-03-06_01:00:04-regression-master-testing-gcov/17076
Patch commit:2fa8bc2384005cb301158cc9a52682ebcb65efad on wip-4250 has fixed this. Still letting it run through teuthology a bit longer to make sure.
- Subject changed from mon: crash in finish_proposal during clock skew check to mon: crash in finish_proposal after recovery
This bug has nothing to do with timechecks. It was triggered by the proposal that Paxos itself makes after a recovery in order to run a learned old value. Because that proposal is made directly to Paxos, skipping the proposal queue, finish_proposal() ended up dereferencing list.front() when list.size() == 0.
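For illustration only, here is a minimal sketch of that failure mode and one way to guard against it. This is not the code from the patch on wip-4250; the ProposalQueue type, its pending member, and the finish_front() helper are hypothetical names made up for this example.

```cpp
#include <list>
#include <string>
#include <utility>

// Hypothetical stand-in for a queue of pending proposals.
// It is NOT the real Ceph data structure.
struct ProposalQueue {
  std::list<std::string> pending;

  // Calling pending.front() on an empty std::list is undefined behavior
  // and can show up as a segmentation fault. This helper checks first.
  //
  // Returns true and moves the oldest queued value into *out, or returns
  // false when nothing was queued (for example, when a proposal was made
  // directly and never went through the queue).
  bool finish_front(std::string* out) {
    if (pending.empty()) {
      return false;  // nothing to finish; do not touch front()
    }
    *out = std::move(pending.front());
    pending.pop_front();
    return true;
  }
};
```

With a guard like this, finishing a proposal that never passed through the queue becomes a no-op for the queue instead of a crash.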
- Status changed from 12 to Fix Under Review
- Status changed from Fix Under Review to Resolved