Actions
Bug #4798
closedmon: message stuck in processing loop
% Done:
0%
Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
2013-04-23 13:26:15.939153 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) dispatch osd_alive(want up_thru 1200 have 1200) v1 from osd.1 10.214.131.24:6809/7704 2013-04-23 13:26:15.943422 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) waiting for paxos -> readable (v1200) 2013-04-23 13:26:15.947816 7f0610c40700 1 mon.a@1(leader).paxos(paxos active c 3721..3743) is_readable now=2013-04-23 13:26:15.947823 lease_expire=2013-04-23 13:23:48.390189 has v0 lc 3743 2013-04-23 13:26:15.947855 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) dispatch mon_command(osd out 1 v 0) v1 from client.? 10.214.131.24:0/19036 2013-04-23 13:26:15.952132 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) waiting for paxos -> readable (v0) 2013-04-23 13:26:15.956504 7f0610c40700 1 mon.a@1(leader).paxos(paxos active c 3721..3743) is_readable now=2013-04-23 13:26:15.956512 lease_expire=2013-04-23 13:23:48.390189 has v0 lc 3743 2013-04-23 13:26:15.956544 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) dispatch pool_op(create unmanaged snap pool 3 auid 0 tid 4871 name v1201) v4 from client.4140 10.214.131.23:0/1005490 2013-04-23 13:26:15.960836 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) waiting for paxos -> readable (v1201) 2013-04-23 13:26:15.965238 7f0610c40700 1 mon.a@1(leader).paxos(paxos active c 3721..3743) is_readable now=2013-04-23 13:26:15.965245 lease_expire=2013-04-23 13:23:48.390189 has v0 lc 3743 2013-04-23 13:26:15.965273 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) dispatch osd_alive(want up_thru 1200 have 1200) v1 from osd.1 10.214.131.24:6809/7704 2013-04-23 13:26:15.969558 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) waiting for paxos -> readable (v1200) 2013-04-23 13:26:15.973857 7f0610c40700 1 mon.a@1(leader).paxos(paxos active c 3721..3743) is_readable now=2013-04-23 13:26:15.973863 lease_expire=2013-04-23 13:23:48.390189 has v0 lc 3743 2013-04-23 13:26:15.973895 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) dispatch osd_alive(want up_thru 1200 have 1200) v1 from osd.1 10.214.131.24:6809/7704 2013-04-23 13:26:15.978275 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) waiting for paxos -> readable (v1200) 2013-04-23 13:26:15.982584 7f0610c40700 1 mon.a@1(leader).paxos(paxos active c 3721..3743) is_readable now=2013-04-23 13:26:15.982590 lease_expire=2013-04-23 13:23:48.390189 has v0 lc 3743 2013-04-23 13:26:15.982623 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) dispatch mon_command(osd out 1 v 0) v1 from client.? 10.214.131.24:0/19036 2013-04-23 13:26:15.987016 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) waiting for paxos -> readable (v0) 2013-04-23 13:26:15.991305 7f0610c40700 1 mon.a@1(leader).paxos(paxos active c 3721..3743) is_readable now=2013-04-23 13:26:15.991311 lease_expire=2013-04-23 13:23:48.390189 has v0 lc 3743 2013-04-23 13:26:15.991345 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) dispatch osd_alive(want up_thru 1202 have 1202) v1 from osd.4 10.214.131.23:6810/5822 2013-04-23 13:26:15.995715 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) waiting for paxos -> readable (v1202) 2013-04-23 13:26:16.000009 7f0610c40700 1 mon.a@1(leader).paxos(paxos active c 3721..3743) is_readable now=2013-04-23 13:26:16.000015 lease_expire=2013-04-23 13:23:48.390189 has v0 lc 3743 2013-04-23 13:26:16.000048 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) dispatch osd_alive(want up_thru 1200 have 1200) v1 from osd.1 10.214.131.24:6809/7704 2013-04-23 13:26:16.004324 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) waiting for paxos -> readable (v1200) 2013-04-23 13:26:16.008751 7f0610c40700 1 mon.a@1(leader).paxos(paxos active c 3721..3743) is_readable now=2013-04-23 13:26:16.008758 lease_expire=2013-04-23 13:23:48.390189 has v0 lc 3743 2013-04-23 13:26:16.008793 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) dispatch osd_alive(want up_thru 1200 have 1200) v1 from osd.1 10.214.131.24:6809/7704 2013-04-23 13:26:16.013072 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) waiting for paxos -> readable (v1200) 2013-04-23 13:26:16.017480 7f0610c40700 1 mon.a@1(leader).paxos(paxos active c 3721..3743) is_readable now=2013-04-23 13:26:16.017487 lease_expire=2013-04-23 13:23:48.390189 has v0 lc 3743 2013-04-23 13:26:16.017521 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) dispatch mon_command(osd out 1 v 0) v1 from client.? 10.214.131.24:0/19036 2013-04-23 13:26:16.021811 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) waiting for paxos -> readable (v0) 2013-04-23 13:26:16.026204 7f0610c40700 1 mon.a@1(leader).paxos(paxos active c 3721..3743) is_readable now=2013-04-23 13:26:16.026211 lease_expire=2013-04-23 13:23:48.390189 has v0 lc 3743 2013-04-23 13:26:16.026244 7f0610c40700 10 mon.a@1(leader).paxosservice(osdmap) dispatch osd_alive(want up_thru 1200 have 1200) v1 from osd.1 10.214.131.24:6809/7704
with same messages looping. clients can't connection, mons not forming quorum.
Updated by Greg Farnum about 11 years ago
Is it actually a loop or is the command getting re-sent?
Updated by Sage Weil about 11 years ago
- Status changed from New to In Progress
about to test a fix. the problem is that routed_request are resent to the new leader, even if that is us.. so it is proxying to self. i didn't follow that rabbit hole to see why that broke later, but it's clearly somewhere between wrong and not optimal so i'm fixing that first and see if it resolves things.
Updated by Greg Farnum about 11 years ago
Okay, but of course the only reason we're seeing this is that the monitors aren't forming a quorum, right? So that's the root cause and we'll need to deal with both issues.
Updated by Sage Weil about 11 years ago
from the logs it looks like starvation from these messages is preventing any new message processing (and thus quorum)
Updated by Sage Weil about 11 years ago
- Status changed from In Progress to Fix Under Review
Updated by Sage Weil about 11 years ago
- Status changed from Fix Under Review to Resolved
Actions