Bug #1865

mon: need to disconnect clients when we drop out of quorum

Added by Josh Durgin almost 8 years ago. Updated almost 8 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: Monitor
Target version:
Start date: 12/29/2011
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

From sepia4:/tmp/cephtest/archive/log/osd.0.log:

2011-12-29 11:07:53.150535 7f27672fd700 data -> (7.0)
2011-12-29 11:07:53.150546 7f27672fd700 rbd -> (7.0)
2011-12-29 11:08:53.370240 7f276d50b700 osd.0 7 OSD::ms_handle_reset()
2011-12-29 11:08:53.370269 7f276d50b700 osd.0 7 OSD::ms_handle_reset() s=0x78b9480
2011-12-29 11:08:53.370514 7f27672fd700 -- 10.3.14.131:6800/1994 >> 10.3.14.168:0/434436905 pipe(0x116f280 sd=15 pgs=0 cs=0 l=0).accept peer addr is really 10.3.14.168:0/434436905 (socket is 10.3.14.168:44479/0)
2011-12-29 11:08:53.370675 7f27672fd700 data -> (7.0)
2011-12-29 11:08:53.370687 7f27672fd700 rbd -> (7.0)
2011-12-29 11:09:53.590090 7f276d50b700 osd.0 7 OSD::ms_handle_reset()
2011-12-29 11:09:53.590130 7f276d50b700 osd.0 7 OSD::ms_handle_reset() s=0x78b9480
2011-12-29 11:09:53.590328 7f27672fd700 -- 10.3.14.131:6800/1994 >> 10.3.14.168:0/434436905 pipe(0x62bac80 sd=15 pgs=0 cs=0 l=0).accept peer addr is really 10.3.14.168:0/434436905 (socket is 10.3.14.168:44480/0)
2011-12-29 11:09:53.590457 7f27672fd700 data -> (7.0)
2011-12-29 11:09:53.590468 7f27672fd700 rbd -> (7.0)
2011-12-29 11:10:53.810084 7f276d50b700 osd.0 7 OSD::ms_handle_reset()
2011-12-29 11:10:53.810125 7f276d50b700 osd.0 7 OSD::ms_handle_reset() s=0x78b9480
2011-12-29 11:10:53.810329 7f27672fd700 -- 10.3.14.131:6800/1994 >> 10.3.14.168:0/434436905 pipe(0x62baa00 sd=17 pgs=0 cs=0 l=0).accept peer addr is really 10.3.14.168:0/434436905 (socket is 10.3.14.168:44481/0)

The teuthology job is nightly_coverage_2011-12-28-b/5273, and is still running. The relevant machines are sepia38, sepia4, and sepia41.


Related issues

Related to Ceph - Bug #1831: mon: should not accept (and should disconnect) session when not in quorum (Resolved, 12/14/2011)

History

#1 Updated by Sage Weil almost 8 years ago

  • Subject changed from osd getting wrong peer address to mon: need to disconnect clients when we drop out of quorum
  • Category set to Monitor
  • Target version set to v0.40

The kernel client is repeatedly reconnecting to a down osd and resending its requests because its osdmap is out of date. The map is stale because the client is connected to a monitor that is out of the quorum, even though the monitor process is still alive.

#2 Updated by Sage Weil almost 8 years ago

the ceph-mon is deadlocked by

[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8111b7ee>] sleep_on_page+0xe/0x20
[<ffffffff8111ba33>] wait_on_page_bit+0x73/0x80
[<ffffffff8111be93>] filemap_fdatawait_range+0x113/0x1a0
[<ffffffff8111bf4b>] filemap_fdatawait+0x2b/0x30
[<ffffffff8119ece5>] sync_inodes_sb+0x1d5/0x260
[<ffffffff811a32c0>] __sync_filesystem+0x80/0x90
[<ffffffff811a32ef>] sync_one_sb+0x1f/0x30
[<ffffffff81177ecf>] iterate_supers+0x7f/0xe0
[<ffffffff811a3345>] sys_sync+0x45/0x70

because there is a kernel mount on the same node, the osd is down, and the client is connected to this monitor and isn't getting a new session.

I think we need an active ping mechanism, where the mon says "i am still in quorum"; if the client stops hearing that, it disconnects and tries another monitor.
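
A minimal sketch of the client side of such a ping scheme, in Python for brevity. This is not Ceph code: `HuntingClient`, `MonSession`, `handle_quorum_ack`, and the 30-second `ack_timeout` are all hypothetical names and values chosen for illustration. It only shows the idea that a client which stops hearing "i am still in quorum" drops its session and moves to the next monitor in its list.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class MonSession:
    """Client-side record of one monitor connection (hypothetical)."""
    name: str
    last_quorum_ack: float  # when we last heard "i am still in quorum"


@dataclass
class HuntingClient:
    """Hypothetical client that abandons a monitor that stops acking quorum."""
    mons: List[str]
    ack_timeout: float = 30.0  # assumed value, not a Ceph default
    clock: Callable[[], float] = time.monotonic
    session: Optional[MonSession] = None
    _next: int = 0

    def connect(self) -> None:
        # Pick the next monitor round-robin and start a fresh session.
        name = self.mons[self._next % len(self.mons)]
        self._next += 1
        self.session = MonSession(name, self.clock())

    def handle_quorum_ack(self, mon_name: str) -> None:
        # Called when the current monitor reports it is still in quorum.
        if self.session and self.session.name == mon_name:
            self.session.last_quorum_ack = self.clock()

    def tick(self) -> bool:
        """Periodic check; returns True if we switched to another monitor."""
        if self.session is None:
            self.connect()
            return False
        if self.clock() - self.session.last_quorum_ack > self.ack_timeout:
            self.connect()  # silent too long: try someone else
            return True
        return False
```

In the scenario from this ticket, a monitor that is alive but out of quorum would stop sending acks, so `tick()` would eventually move the client to a different monitor that can hand it a current osdmap.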

#3 Updated by Sage Weil almost 8 years ago

  • Position set to 30

#4 Updated by Greg Farnum almost 8 years ago

  • Status changed from New to Duplicate

Adding active ping requirements to the monitors is contrary to the direction we want to take them with clients, though! We need a better solution than that. :(

In any case, the current title is a duplicate of #1831. Resolving as such.
