Project

General

Profile

Bug #24052

repeated eviction of idle client until some IO happens

Added by Dan van der Ster 7 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
05/08/2018
Due date:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
multimds
Pull request ID:

Description

We see repeated eviction of idle client sessions. We have client_reconnect_stale on the ceph-fuse clients, and these on the mds:

mds session blacklist on evict = false
mds session blacklist on timeout = false

For a few clients, an mds is looping every 300s trying to evict the same unresponsive client, e.g.

2018-05-08 22:51:33.410599 mds.cephflax-mds-2a4cfd0e2c mds.2 137.138.52.159:6800/3875509242 650 : cluster [WRN] evicting unresponsive client  (31740949), after 300.403721 seconds
2018-05-08 22:56:35.451850 mds.cephflax-mds-2a4cfd0e2c mds.2 137.138.52.159:6800/3875509242 653 : cluster [WRN] evicting unresponsive client  (31740949), after 300.224156 seconds

For example, mds.2 thinks it has this stale client session:

    {
        "id": 31740949,
        "num_leases": 0,
        "num_caps": 1,
        "state": "stale",
        "replay_requests": 0,
        "completed_requests": 0,
        "reconnecting": false,
        "inst": "client.31740949 128.142.160.86:0/2973670471",
        "client_metadata": {}
    }

The session doesn't seem to reconnect properly. Instead, the stale session has the same behaviour no matter if the session is auto-evicted, or if I do it manually with:

   ceph tell mds.`hostname -s` client evict id=31740949

In either case, the session disappears for a few seconds from the mds session ls, then re-appears in state=open:

 {
        "id": 31740949,
        "num_leases": 0,
        "num_caps": 1,
        "state": "open",
        "replay_requests": 0,
        "completed_requests": 0,
        "reconnecting": false,
        "inst": "client.31740949 128.142.160.86:0/2973670471",
        "client_metadata": {}
    }

But then after some time (mayeb 45 or 60s) the session is marked as stale again and we restart the loop.

Note that this client was first evicted during an mds restart, in which it took far longer than the 45s timeout to attempt to reconnect:

2018-05-08 15:02:50.819655 mds.cephflax-mds-2a4cfd0e2c mds.2 137.138.52.159:6800/3875509242 401 : cluster [WRN] evicting unresponsive client hpc057.cern.ch:xxx (31740949), after waiting 45 seconds during MDS startup
2018-05-08 15:04:18.066339 mds.cephflax-mds-2a4cfd0e2c mds.2 137.138.52.159:6800/3875509242 408 : cluster [INF] denied reconnect attempt (mds is up:active) from client.31740949 128.142.160.86:0/2973670471 after 132.332634 (allowed interval 45)

With `debug_ms=1` and `debug_client=10` on the client side, I don't see any messages at all coming from mds.2 when the eviction occurs.

When I check on the client side with the mds_sessions asok command, I see the client only has open sessions with mds.0 and mds.1.

However, as soon as I list the root of the mount point (`ls /hpcscratch`) from the client, then the session is fixed -- it is no longer stale:

[
    {
        "id": 31740949,
        "num_leases": 0,
        "num_caps": 5,
        "state": "open",
        "replay_requests": 0,
        "completed_requests": 0,
        "reconnecting": false,
        "inst": "client.31740949 128.142.160.86:0/2973670471",
        "client_metadata": {
            "ceph_sha1": "52085d5249a80c5f5121a76d6288429f35e4e77b",
            "ceph_version": "ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)",
            "entity_id": "xxx",
            "hostname": "hpc057.cern.ch",
            "mount_point": "/hpcscratch",
            "pid": "1084268",
            "root": "/volumes/_nogroup/393f2dcc-6b09-44d7-8d20-0e84b072ed26" 
        }
    }
]

Is this the expected behaviour? Would it be possible to trigger that full reconnection even on clients that are not doing any IO? Or is there a better way to handle these type of sessions?

I posted the fuse log at ceph-post-file: 0cbec52d-a6e3-4fad-9ecd-d70dfcea6a59

The first activity with mds.2 is at 2018-05-08 23:46:18.217138, when I `ls` the mountpoint.

And unfortunately I didn't have a high enough log level on mds.2 at this time, but it did log this:

2018-05-08 23:46:18.218814 7fd112df7700  0 -- 137.138.52.159:6800/3875509242 >> 128.142.160.86:0/2973670471 conn(0x555b44e23800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=1 existing_state=STATE_STANDBY
2018-05-08 23:46:18.218831 7fd112df7700  0 -- 137.138.52.159:6800/3875509242 >> 128.142.160.86:0/2973670471 conn(0x555b44e23800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing

Finally, please note also as possibly useful info that this folder (a manila share) is pinned to mds.2.

  "/volumes/_nogroup/393f2dcc-6b09-44d7-8d20-0e84b072ed26",
  2,
  2

Related issues

Copied to fs - Backport #24295: luminous: repeated eviction of idle client until some IO happens Resolved
Copied to fs - Backport #24296: mimic: repeated eviction of idle client until some IO happens Resolved

History

#1 Updated by Patrick Donnelly 7 months ago

  • Assignee set to Zheng Yan
  • Priority changed from Normal to High
  • Target version set to v13.2.0

#2 Updated by Zheng Yan 7 months ago

It's not expected. could you upload client log with debug_ms=1

#3 Updated by Dan van der Ster 7 months ago

The log for that client at 128.142.160.86 is here: ceph-post-file: dd10811e-2790-43e4-b0a9-135725f70209

Thanks for your help.

#4 Updated by Zheng Yan 7 months ago

  • Status changed from New to Need Review

#6 Updated by Patrick Donnelly 7 months ago

  • Target version changed from v13.2.0 to v14.0.0
  • Backport set to mimic,luminous
  • Affected Versions deleted (v12.2.4)
  • Component(FS) deleted (Client, ceph-fuse)

#7 Updated by Patrick Donnelly 7 months ago

  • Status changed from Need Review to Pending Backport

#8 Updated by Nathan Cutler 7 months ago

  • Copied to Backport #24295: luminous: repeated eviction of idle client until some IO happens added

#9 Updated by Nathan Cutler 7 months ago

  • Copied to Backport #24296: mimic: repeated eviction of idle client until some IO happens added

#10 Updated by Nathan Cutler 4 months ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF