Bug #1865
closed
mon: need to disconnect clients when we drop out of quorum
0%
Description
From sepia4:/tmp/cephtest/archive/log/osd.0.log:
2011-12-29 11:07:53.150535 7f27672fd700 data -> (7.0)
2011-12-29 11:07:53.150546 7f27672fd700 rbd -> (7.0)
2011-12-29 11:08:53.370240 7f276d50b700 osd.0 7 OSD::ms_handle_reset()
2011-12-29 11:08:53.370269 7f276d50b700 osd.0 7 OSD::ms_handle_reset() s=0x78b9480
2011-12-29 11:08:53.370514 7f27672fd700 -- 10.3.14.131:6800/1994 >> 10.3.14.168:0/434436905 pipe(0x116f280 sd=15 pgs=0 cs=0 l=0).accept peer addr is really 10.3.14.168:0/434436905 (socket is 10.3.14.168:44479/0)
2011-12-29 11:08:53.370675 7f27672fd700 data -> (7.0)
2011-12-29 11:08:53.370687 7f27672fd700 rbd -> (7.0)
2011-12-29 11:09:53.590090 7f276d50b700 osd.0 7 OSD::ms_handle_reset()
2011-12-29 11:09:53.590130 7f276d50b700 osd.0 7 OSD::ms_handle_reset() s=0x78b9480
2011-12-29 11:09:53.590328 7f27672fd700 -- 10.3.14.131:6800/1994 >> 10.3.14.168:0/434436905 pipe(0x62bac80 sd=15 pgs=0 cs=0 l=0).accept peer addr is really 10.3.14.168:0/434436905 (socket is 10.3.14.168:44480/0)
2011-12-29 11:09:53.590457 7f27672fd700 data -> (7.0)
2011-12-29 11:09:53.590468 7f27672fd700 rbd -> (7.0)
2011-12-29 11:10:53.810084 7f276d50b700 osd.0 7 OSD::ms_handle_reset()
2011-12-29 11:10:53.810125 7f276d50b700 osd.0 7 OSD::ms_handle_reset() s=0x78b9480
2011-12-29 11:10:53.810329 7f27672fd700 -- 10.3.14.131:6800/1994 >> 10.3.14.168:0/434436905 pipe(0x62baa00 sd=17 pgs=0 cs=0 l=0).accept peer addr is really 10.3.14.168:0/434436905 (socket is 10.3.14.168:44481/0)
The teuthology job is nightly_coverage_2011-12-28-b/5273, and is still running. The relevant machines are sepia38, sepia4, and sepia41.
Updated by Sage Weil over 12 years ago
- Subject changed from osd getting wrong peer address to mon: need to disconnect clients when we drop out of quorum
- Category set to Monitor
- Target version set to v0.40
the kernel client is repeatedly reconnecting to a down osd and resending its requests because its osdmap is out of date. that is because it's connected to a monitor that is out of the quorum, but whose process is still alive.
Updated by Sage Weil over 12 years ago
the ceph-mon is deadlocked by
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8111b7ee>] sleep_on_page+0xe/0x20
[<ffffffff8111ba33>] wait_on_page_bit+0x73/0x80
[<ffffffff8111be93>] filemap_fdatawait_range+0x113/0x1a0
[<ffffffff8111bf4b>] filemap_fdatawait+0x2b/0x30
[<ffffffff8119ece5>] sync_inodes_sb+0x1d5/0x260
[<ffffffff811a32c0>] __sync_filesystem+0x80/0x90
[<ffffffff811a32ef>] sync_one_sb+0x1f/0x30
[<ffffffff81177ecf>] iterate_supers+0x7f/0xe0
[<ffffffff811a3345>] sys_sync+0x45/0x70
because there is a kernel mount on the same node, the osd is down, and the client is connected to this monitor and isn't getting a new session.
i think we need an active ping mechanism, where the mon says "i am still in quorum" or else the client will disconnect and try someone else.
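For illustration only, the client side of such a ping scheme could look roughly like the sketch below. It is not Ceph code: the class name `MonSession`, its fields, and the round-robin failover policy are all assumptions made up for this example. The idea is just that the client records each "still in quorum" message and, if none arrives within a timeout, drops the session and tries the next monitor.

```python
from dataclasses import dataclass


@dataclass
class MonSession:
    """Hypothetical client-side view of a monitor session (not Ceph code)."""
    monitors: list          # candidate monitor names, tried round-robin
    ping_timeout: float     # seconds allowed without an "in quorum" ping
    current: int = 0        # index of the monitor we are connected to
    last_ping: float = 0.0  # time of the last "in quorum" ping we saw

    def on_quorum_ping(self, now: float) -> None:
        # The connected monitor told us it is still in quorum.
        self.last_ping = now

    def tick(self, now: float) -> str:
        # Called periodically; fail over if the monitor has gone quiet.
        if now - self.last_ping > self.ping_timeout:
            self.current = (self.current + 1) % len(self.monitors)
            self.last_ping = now  # give the new monitor a full timeout window
        return self.monitors[self.current]
```

With a 10 second timeout, a client that last heard a ping at t=8 stays on its monitor at t=15 and moves to the next one at t=19. Time is passed in explicitly so the logic can be exercised without real clocks or sockets.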
Updated by Greg Farnum over 12 years ago
- Status changed from New to Duplicate
Adding active ping requirements to the monitors is contrary to the direction we want to take them with clients, though! We need a better solution than that. :(
In any case, the current title is a duplicate of #1831. Resolving as such.