Bug #1865: mon: need to disconnect clients when we drop out of quorum - Ceph - Ceph

Bug #1865

closed

mon: need to disconnect clients when we drop out of quorum

Added by Josh Durgin over 12 years ago. Updated over 12 years ago.

Status: Duplicate
Priority: Normal
Category: Monitor
Target version: v0.40

Description

From sepia4:/tmp/cephtest/archive/log/osd.0.log:

2011-12-29 11:07:53.150535 7f27672fd700 data -> (7.0)
2011-12-29 11:07:53.150546 7f27672fd700 rbd -> (7.0)
2011-12-29 11:08:53.370240 7f276d50b700 osd.0 7 OSD::ms_handle_reset()
2011-12-29 11:08:53.370269 7f276d50b700 osd.0 7 OSD::ms_handle_reset() s=0x78b9480
2011-12-29 11:08:53.370514 7f27672fd700 -- 10.3.14.131:6800/1994 >> 10.3.14.168:0/434436905 pipe(0x116f280 sd=15 pgs=0 cs=0 l=0).accept peer addr is really 10.3.14.168:0/434436905 (socket is 10.3.14.168:44479/0)
2011-12-29 11:08:53.370675 7f27672fd700 data -> (7.0)
2011-12-29 11:08:53.370687 7f27672fd700 rbd -> (7.0)
2011-12-29 11:09:53.590090 7f276d50b700 osd.0 7 OSD::ms_handle_reset()
2011-12-29 11:09:53.590130 7f276d50b700 osd.0 7 OSD::ms_handle_reset() s=0x78b9480
2011-12-29 11:09:53.590328 7f27672fd700 -- 10.3.14.131:6800/1994 >> 10.3.14.168:0/434436905 pipe(0x62bac80 sd=15 pgs=0 cs=0 l=0).accept peer addr is really 10.3.14.168:0/434436905 (socket is 10.3.14.168:44480/0)
2011-12-29 11:09:53.590457 7f27672fd700 data -> (7.0)
2011-12-29 11:09:53.590468 7f27672fd700 rbd -> (7.0)
2011-12-29 11:10:53.810084 7f276d50b700 osd.0 7 OSD::ms_handle_reset()
2011-12-29 11:10:53.810125 7f276d50b700 osd.0 7 OSD::ms_handle_reset() s=0x78b9480
2011-12-29 11:10:53.810329 7f27672fd700 -- 10.3.14.131:6800/1994 >> 10.3.14.168:0/434436905 pipe(0x62baa00 sd=17 pgs=0 cs=0 l=0).accept peer addr is really 10.3.14.168:0/434436905 (socket is 10.3.14.168:44481/0)

The teuthology job is nightly_coverage_2011-12-28-b/5273, and is still running. The relevant machines are sepia38, sepia4, and sepia41.
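The reset lines in the log excerpt above repeat at a steady interval, which is easier to see when computed. The following is a minimal sketch, assuming only the timestamp layout visible in the excerpt; `reset_intervals` is a hypothetical helper name, not part of Ceph or teuthology.

```python
from datetime import datetime

# Three of the reset lines from the osd.0 log excerpt, copied verbatim.
LOG_LINES = [
    "2011-12-29 11:08:53.370240 7f276d50b700 osd.0 7 OSD::ms_handle_reset()",
    "2011-12-29 11:09:53.590090 7f276d50b700 osd.0 7 OSD::ms_handle_reset()",
    "2011-12-29 11:10:53.810084 7f276d50b700 osd.0 7 OSD::ms_handle_reset()",
]


def reset_intervals(lines):
    """Return the seconds between consecutive ms_handle_reset() lines."""
    stamps = []
    for line in lines:
        # Skip the companion "ms_handle_reset() s=0x..." lines and anything else.
        if not line.rstrip().endswith("OSD::ms_handle_reset()"):
            continue
        date, clock = line.split()[:2]
        stamps.append(datetime.strptime(f"{date} {clock}", "%Y-%m-%d %H:%M:%S.%f"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    for seconds in reset_intervals(LOG_LINES):
        print(f"{seconds:.1f} s between resets")
```

For these three lines, both intervals come out at roughly 60.2 seconds, so the same peer is reconnecting about once a minute.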

Related issues 1 (0 open — 1 closed)

Updated by Sage Weil over 12 years ago

Subject changed from "osd getting wrong peer address" to "mon: need to disconnect clients when we drop out of quorum"
Category set to Monitor
Target version set to v0.40

the kernel client is repeatedly reconnecting to a down osd and resending its requests because its osdmap is out of date. That is because it's connected to a monitor that is out of the quorum, but whose process is still alive.

Updated by Sage Weil over 12 years ago

the ceph-mon is deadlocked in sys_sync:

[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8111b7ee>] sleep_on_page+0xe/0x20
[<ffffffff8111ba33>] wait_on_page_bit+0x73/0x80
[<ffffffff8111be93>] filemap_fdatawait_range+0x113/0x1a0
[<ffffffff8111bf4b>] filemap_fdatawait+0x2b/0x30
[<ffffffff8119ece5>] sync_inodes_sb+0x1d5/0x260
[<ffffffff811a32c0>] __sync_filesystem+0x80/0x90
[<ffffffff811a32ef>] sync_one_sb+0x1f/0x30
[<ffffffff81177ecf>] iterate_supers+0x7f/0xe0
[<ffffffff811a3345>] sys_sync+0x45/0x70

because there is a kernel mount on the same node, the osd is down, and the client is connected to this monitor and isn't getting a new session.

i think we need an active ping mechanism, where the mon says "i am still in quorum" or else the client will disconnect and try someone else.
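A minimal sketch of the client side of that idea, written under loud assumptions: `MonSession`, `handle_quorum_ping`, `tick`, and the timeout value are all hypothetical names and numbers chosen for illustration. This is not Ceph's MonClient API and not what was eventually implemented. It only shows the proposed rule: if the current monitor stops saying it is in quorum, drop it and try another.

```python
import time


class MonSession:
    """Hypothetical client-side view of the monitor it is connected to."""

    def __init__(self, mon_names, ping_timeout=30.0, clock=time.monotonic):
        if not mon_names:
            raise ValueError("need at least one monitor")
        self.mon_names = list(mon_names)
        self.ping_timeout = ping_timeout  # arbitrary value, an assumption
        self.clock = clock                # injectable so the logic is testable
        self.current = 0
        self.last_ping = clock()

    @property
    def mon(self):
        """Name of the monitor the client is currently using."""
        return self.mon_names[self.current]

    def handle_quorum_ping(self):
        """Record an "i am still in quorum" message from the current monitor."""
        self.last_ping = self.clock()

    def tick(self):
        """Switch to the next monitor if the current one has gone quiet.

        Returns True when a switch happened, False otherwise.
        """
        if self.clock() - self.last_ping <= self.ping_timeout:
            return False
        self.current = (self.current + 1) % len(self.mon_names)
        self.last_ping = self.clock()  # give the new monitor a full timeout
        return True
```

A caller would invoke `tick()` periodically and `handle_quorum_ping()` whenever such a message arrives. With this rule, a monitor that is alive but out of quorum, as in this report, would stop sending pings and the client would move on instead of staying attached to it.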

Updated by Greg Farnum over 12 years ago

Status changed from New to Duplicate

Adding active ping requirements to the monitors is contrary to the direction we want to take them with clients, though! We need a better solution than that. :(

In any case, the current title is a duplicate of #1831. Resolving as such.
