Bug #4250
mon: crash in finish_proposal after recovery
0%
Description
2013-02-23 09:25:32.171986 7f5dd6d96700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f5dd6d96700

 ceph version 0.57-493-g704db85 (704db850131643b26bafe6594946cacce483c171)
 1: ceph-mon() [0x59cb6a]
 2: (()+0xfcb0) [0x7f5ddb8f7cb0]
 3: (Paxos::finish_proposal()+0x133) [0x4d9c83]
 4: (Paxos::handle_accept(MMonPaxos*)+0x77c) [0x4dad3c]
 5: (Paxos::dispatch(PaxosServiceMessage*)+0x24b) [0x4de63b]
 6: (Monitor::_ms_dispatch(Message*)+0x145f) [0x4b72ef]
 7: (Monitor::ms_dispatch(Message*)+0x32) [0x4cd962]
 8: (DispatchQueue::entry()+0x341) [0x6b0e11]
 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x64002d]
 10: (()+0x7e9a) [0x7f5ddb8efe9a]
 11: (clone()+0x6d) [0x7f5dda0a84bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
from job
ubuntu@teuthology:/a/sage-2013-02-23_08:44:35-regression-master-testing-basic/10218$ cat orig.config.yaml
kernel:
  kdb: true
  sha1: 92a49fb0f79f3300e6e50ddf56238e70678e4202
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      mon.b:
        clock offset: 10
    log-whitelist:
    - slow request
    sha1: 704db850131643b26bafe6594946cacce483c171
  s3tests:
    branch: master
  workunit:
    sha1: 704db850131643b26bafe6594946cacce483c171
roles:
- - mon.a
  - mon.d
  - mon.g
  - mon.j
  - mon.m
  - mon.p
  - mon.s
  - osd.0
- - mon.b
  - mon.e
  - mon.h
  - mon.k
  - mon.n
  - mon.q
  - mon.t
  - mds.a
- - mon.c
  - mon.f
  - mon.i
  - mon.l
  - mon.o
  - mon.r
  - mon.u
  - osd.1
tasks:
- chef: null
- clock: null
- install: null
- ceph:
    log-whitelist:
    - slow request
    - .*clock.*skew.*
    - clocks not synchronized
    wait-for-healthy: false
- mon_clock_skew_check:
    expect-skew: true
Updated by Joao Eduardo Luis about 11 years ago
I haven't been able to reproduce this, but given the stack trace, I have a feeling that it was fixed by 98408f5ca4f2396838002be739cb2f5d15b7aac3
Updated by Tamilarasi muthamizhan about 11 years ago
recent logs: ubuntu@teuthology:/a/teuthology-2013-02-25_01:00:05-regression-master-testing-gcov/11462
Updated by Ian Colle about 11 years ago
Has this been seen since http://tracker.ceph.com/projects/ceph/repository/revisions/98408f5ca4f2396838002be739cb2f5d15b7aac3 was committed?
Updated by Joao Eduardo Luis about 11 years ago
Unless it was triggered somewhere I'm not aware of, it doesn't appear to have been seen since. Tamil's update was in fact #4256 (prior to being patched).
Updated by Tamilarasi muthamizhan about 11 years ago
Logs: ubuntu@teuthology:/a/teuthology-2013-03-07_01:00:05-regression-next-testing-basic/17589

 0> 2013-03-07 10:50:25.886062 7f92f89ef700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f92f89ef700

 ceph version 0.58-351-ga58eec9 (a58eec90caf3a3d04c9e7bd4e6b9c160b6b69175)
 1: ceph-mon() [0x580cba]
 2: (()+0xfcb0) [0x7f92fd75acb0]
 3: (Paxos::finish_proposal()+0x133) [0x4dcda3]
 4: (Paxos::handle_accept(MMonPaxos*)+0x77c) [0x4ddd0c]
 5: (Paxos::dispatch(PaxosServiceMessage*)+0x24b) [0x4e181b]
 6: (Monitor::_ms_dispatch(Message*)+0x145f) [0x4ba2af]
 7: (Monitor::ms_dispatch(Message*)+0x32) [0x4d0d12]
 8: (DispatchQueue::entry()+0x341) [0x695441]
 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x62472d]
 10: (()+0x7e9a) [0x7f92fd752e9a]
 11: (clone()+0x6d) [0x7f92fbd02cbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

ubuntu@teuthology:/a/teuthology-2013-03-07_01:00:05-regression-next-testing-basic/17589$ cat config.yaml
kernel: &id001
  kdb: true
  sha1: 2f60d3028438dd1fef122d37786ee685d727e8a7
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 500
      mon.b:
        clock offset: 10
    log-whitelist:
    - slow request
    sha1: a58eec90caf3a3d04c9e7bd4e6b9c160b6b69175
  s3tests:
    branch: next
  workunit:
    sha1: a58eec90caf3a3d04c9e7bd4e6b9c160b6b69175
roles:
- - mon.a
  - mon.d
  - mon.g
  - mon.j
  - mon.m
  - mon.p
  - mon.s
  - osd.0
- - mon.b
  - mon.e
  - mon.h
  - mon.k
  - mon.n
  - mon.q
  - mon.t
  - mds.a
- - mon.c
  - mon.f
  - mon.i
  - mon.l
  - mon.o
  - mon.r
  - mon.u
  - osd.1
targets:
  ubuntu@plana14.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCs3IrZeF7C0YL6VU9avVBEyNgKwbiH45BBAuxQMCUyVTsJHy60uThveH1PBt6TuMvRILpujJCYp9kFcLAy+kjyTbZBHZmkoX44wT6AzqLJgGpJzpAEgd3ZpR8Nx9oTuZ0B3le8JFiK90wgJbJrsbpi9x8FtVUZQH+8PilAwza8P6ZSKzwv1dIeCxCqtkZ2oFMIzspLgLLAZ4gkZGfMc43PubouSr4b8TnfTyay4imcJyc6lhAAOhFng6ebcBuCK09QQJ7c3Y5Tgiqh82UqUAO9RTFmeLkXRsWHJ8L1N8+PvtBmQ83neQvql/+cTHz8lDUjomNLOrfUGkj6K4583Tq3
  ubuntu@plana15.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCrdOMpeQVLQ+RrMyCLqxOoU/uNzrq/WmYYHhE9yAJSAeD652SCOzMCaChBawwFypiHB/1Zv++PGx2mIceuh8BpAjs0iWoWwj39TDMsB8GYm2A5qFK9BfG080rc8LtmNX//IX3IdbwzxKIM3odcrg1sdQ4p6zLMQYiuwUb5+8clItH7Vl8SzgT6Y+NNyXuwQRZ2JqCcnuV22fSpcfEYVh3HtjXw/G6k/NmdPnP3lab5kQzYsio9A2WmlGmtHHntRMZ+syMCPZI6Rn7rySElxLoet9WqK0qxcusHmPZf1N4gBre0fYnSK7ix6N7TRXlI86TA5Z/VHmkqDaSyuO4YUYoL
  ubuntu@plana66.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDKx2XFNsNesOeyXr45nVdT6jd4xrHGvtyU/Bf5YRNrbxyESvxzeeAX9WZi2oDS1b1OSRnu7mvSOr5IcAEs+N5sNFMyQpx0rPO2WLCI3O8BVm37AC0phqAvrZeYM4bTw5oQ/59J9k0+2KlzwZ9z1xrdHko1NOlZ5C7y9+R2y8z+tnfJKI7yXeGjWSVsxLML8l5SzZ4PTW+EaTK+W94gSBMKOAvW7fbMPkyHft/QQyyg9maCJ674LQ6nQSL9JskVCi2F/njx0f/OL5ej8TPvL90qPZ3nn7ueAfokk//TGi/Actuerd2isdON/yOoRTouzCItYFlw7Zk9/iuEFJhISpDz
tasks:
- internal.lock_machines:
  - 3
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- install: null
- ceph:
    log-whitelist:
    - slow request
    - .*clock.*skew.*
    - clocks not synchronized
    wait-for-healthy: false
- mon_clock_skew_check:
    expect-skew: true
Updated by Tamilarasi muthamizhan about 11 years ago
ubuntu@teuthology:/a/teuthology-2013-03-06_01:00:04-regression-master-testing-gcov/17076
Updated by Joao Eduardo Luis about 11 years ago
Patch commit:2fa8bc2384005cb301158cc9a52682ebcb65efad on wip-4250 has fixed this. Still letting it run through teuthology a bit longer to make sure.
Updated by Joao Eduardo Luis about 11 years ago
- Subject changed from mon: crash in finish_proposal during clock skew check to mon: crash in finish_proposal after recovery
This bug has nothing to do with timechecks. It was caused by the proposal that Paxos itself makes after a recovery in order to run an old value it has learned. Because this proposal is made directly to Paxos, skipping the proposal queue, finish_proposal ended up dereferencing list.front() when list.size() == 0.
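For readers unfamiliar with this failure mode, here is a minimal sketch of the pattern described above. It is not the actual Ceph code or the actual patch: the names Proposal, ProposalQueue, queue_proposal, and this finish_proposal are hypothetical stand-ins. It only illustrates that calling front() on an empty std::list is undefined behavior, and that a proposal which bypassed the queue has to be handled by checking for emptiness first.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <utility>

// Hypothetical stand-in for a queued proposal (not Ceph's real type).
struct Proposal {
  std::string value;
};

// Hypothetical model of a proposal queue (not Ceph's real Paxos class).
struct ProposalQueue {
  std::list<Proposal> pending;

  // Normal path: the proposal is queued before it is proposed.
  void queue_proposal(std::string v) {
    pending.push_back(Proposal{std::move(v)});
  }

  // Called once the in-flight proposal has been accepted.
  // Returns true and fills *out if a queued proposal was finished.
  // Returns false if the proposal did not come from the queue
  // (for example, one issued directly after a recovery).
  bool finish_proposal(std::string* out) {
    if (pending.empty()) {
      // Calling pending.front() here would be undefined behavior and
      // can crash, which is the kind of fault described in this ticket.
      return false;
    }
    *out = std::move(pending.front().value);
    pending.pop_front();
    return true;
  }
};
```

With this guard, a finish_proposal call for a proposal that skipped the queue simply reports that there was nothing to pop, while queued proposals are finished in order as before.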
Updated by Joao Eduardo Luis about 11 years ago
- Status changed from 12 to Fix Under Review
Updated by Sage Weil about 11 years ago
- Status changed from Fix Under Review to Resolved