Bug #61394

closed

qa/quincy: "cluster [WRN] evicting unresponsive client smithi152 (4298), after 303.726 seconds" in cluster log

Added by Kotresh Hiremath Ravishankar 11 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Description: fs/bugs/client_trim_caps/{begin/{0-install 1-ceph 2-logrotate} centos_latest clusters/small-cluster conf/{client mds mon osd} objectstore/bluestore-bitmap overrides/{ignorelist_health ignorelist_wrongly_marked_down no_client_pidfile} tasks/trim-i24137}

Log: http://qa-proxy.ceph.com/teuthology/yuriw-2023-05-23_15:23:11-fs-wip-yuri10-testing-2023-05-18-0815-quincy-distro-default-smithi/7284019/teuthology.log

Job Link: https://pulpito.ceph.com/yuriw-2023-05-23_15:23:11-fs-wip-yuri10-testing-2023-05-18-0815-quincy-distro-default-smithi/7284019/

Failure Reason:
"2023-05-24T00:18:22.682443+0000 mds.a (mds.0) 1 : cluster [WRN] evicting unresponsive client smithi152 (4298), after 303.726 seconds" in cluster log


Related issues 1 (0 open, 1 closed)

Related to CephFS - Bug #24137: client: segfault in trim_caps (Resolved, Patrick Donnelly, 05/15/2018)

Actions #1

Updated by Milind Changire 11 months ago

  • Assignee set to Xiubo Li
Actions #2

Updated by Patrick Donnelly 11 months ago

Client log here:

/teuthology/yuriw-2023-05-23_15:23:11-fs-wip-yuri10-testing-2023-05-18-0815-quincy-distro-default-smithi/7284019/remote/smithi152/log/ceph-client.admin.115487.log.gz

Actions #3

Updated by Patrick Donnelly 11 months ago

  • Related to Bug #24137: client: segfault in trim_caps added
Actions #4

Updated by Xiubo Li 11 months ago

I checked the logs: the client just stopped suddenly for some reason, and I haven't found any suspicious log entries yet.

Actions #5

Updated by Xiubo Li 11 months ago

Patrick,

From reading src/test/fs/test_trim_caps.cc and the test case, the expected result is that the two mounts in the parent and child processes shouldn't crash, right? The log messages in src/test/fs/test_trim_caps.cc are just a little confusing.

If that's true, then the test actually passed and there was no crash. The child process just exited without unmounting, leaving the MDS tracking the now-dead mount.

Maybe we should explicitly unmount before the child process exits.
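
For illustration only (not the actual ceph_test_trim_caps code): a minimal libcephfs sketch of the suggestion above, where the child explicitly unmounts and releases its handle before exiting so the MDS can close the session right away instead of waiting for the autoclose timeout. The overall shape (both processes creating their own mount after fork) is an assumption about the test's structure.

#include <cephfs/libcephfs.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }

    /* Both the parent and the child create their own mount. */
    struct ceph_mount_info *cmount = NULL;
    if (ceph_create(&cmount, NULL) != 0 ||
        ceph_conf_read_file(cmount, NULL) != 0 ||
        ceph_mount(cmount, "/") != 0) {
        fprintf(stderr, "mount failed\n");
        return 1;
    }

    /* ... test workload exercising cap trimming would go here ... */

    if (pid == 0) {
        /* Child: unmount cleanly instead of just exiting, so the MDS does
         * not keep a stale session around until it gets evicted. */
        ceph_unmount(cmount);
        ceph_release(cmount);
        printf("child exits\n");
        _exit(0);
    }

    /* Parent: wait for the child, then tear down its own mount. */
    waitpid(pid, NULL, 0);
    ceph_unmount(cmount);
    ceph_release(cmount);
    return 0;
}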

Actions #6

Updated by Xiubo Li 11 months ago

  • Status changed from New to In Progress
Actions #8

Updated by Xiubo Li 11 months ago

Patrick Donnelly wrote:

Client log here:

/teuthology/yuriw-2023-05-23_15:23:11-fs-wip-yuri10-testing-2023-05-18-0815-quincy-distro-default-smithi/7284019/remote/smithi152/log/ceph-client.admin.115487.log.gz

From this test case:

2023-05-24T00:12:37.908 DEBUG:teuthology.orchestra.run.smithi152:> sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph_test_trim_caps
2023-05-24T00:12:37.974 INFO:teuthology.orchestra.run.smithi152.stdout:child exits

...

2023-05-24T00:20:15.215 INFO:tasks.ceph:Checking cluster log for badness...
2023-05-24T00:20:15.215 DEBUG:teuthology.orchestra.run.smithi152:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v 'overall HEALTH_' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(MDS_FAILED\)' | egrep -v '\(MDS_DEGRADED\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(MDS_DAMAGE\)' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v 'overall HEALTH_' | egrep -v '\(OSD_DOWN\)' | egrep -v '\(OSD_' | egrep -v 'but it is still running' | egrep -v 'is not responding' | head -n 1
2023-05-24T00:20:15.287 INFO:teuthology.orchestra.run.smithi152.stdout:2023-05-24T00:18:22.682443+0000 mds.a (mds.0) 1 : cluster [WRN] evicting unresponsive client smithi152 (4298), after 303.726 seconds
2023-05-24T00:20:15.287 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2023-05-24T00:20:15.288 DEBUG:teuthology.orchestra.run.smithi152:> sudo egrep '\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v 'overall HEALTH_' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(MDS_FAILED\)' | egrep -v '\(MDS_DEGRADED\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(MDS_DAMAGE\)' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v 'overall HEALTH_' | egrep -v '\(OSD_DOWN\)' | egrep -v '\(OSD_' | egrep -v 'but it is still running' | egrep -v 'is not responding' | head -n 1
2023-05-24T00:20:15.358 DEBUG:teuthology.orchestra.run.smithi152:> sudo egrep '\[ERR\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v 'overall HEALTH_' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(MDS_FAILED\)' | egrep -v '\(MDS_DEGRADED\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(MDS_DAMAGE\)' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v 'overall HEALTH_' | egrep -v '\(OSD_DOWN\)' | egrep -v '\(OSD_' | egrep -v 'but it is still running' | egrep -v 'is not responding' | head -n 1
2023-05-24T00:20:15.429 DEBUG:teuthology.orchestra.run.smithi152:> sudo egrep '\[WRN\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v 'overall HEALTH_' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(MDS_FAILED\)' | egrep -v '\(MDS_DEGRADED\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(MDS_DAMAGE\)' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v 'overall HEALTH_' | egrep -v '\(OSD_DOWN\)' | egrep -v '\(OSD_' | egrep -v 'but it is still running' | egrep -v 'is not responding' | head -n 1
2023-05-24T00:20:15.499 INFO:teuthology.orchestra.run.smithi152.stdout:2023-05-24T00:18:22.682443+0000 mds.a (mds.0) 1 : cluster [WRN] evicting unresponsive client smithi152 (4298), after 303.726 seconds

By the time the cluster log was checked for badness, the ceph_test_trim_caps test had already been finished for around 8 minutes, and the MDS waits 5 minutes before declaring a client unresponsive and evicting it.

So if the badness check had run within about 5 minutes of the test finishing, the test would have passed.
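
For illustration only (not MDS code): a tiny sketch of the timing argument above, using the roughly 300-second eviction threshold seen in the log; the helper name and signature are made up for this note.

#include <chrono>

using namespace std::chrono;

// Returns true when the cluster-log check would find the eviction warning.
bool would_see_eviction_warning(seconds time_since_child_exit,
                                seconds evict_threshold = seconds(300))
{
    // The child never unmounted, so its last cap renewal is effectively the
    // moment it exited; once the threshold elapses the MDS evicts the client
    // and logs the [WRN] line that the badness check then trips over.
    return time_since_child_exit >= evict_threshold;
}

// would_see_eviction_warning(minutes(8)); // true  -- the failing run above
// would_see_eviction_warning(minutes(4)); // false -- the passing run below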

I found a successful run, which only took around 4 minutes:
https://pulpito.ceph.com/yuriw-2023-05-31_21:56:15-fs-wip-yuri6-testing-2023-05-31-0933-quincy-distro-default-smithi/7292740/

2023-06-01T03:09:18.613 INFO:teuthology.task.exec:Running commands on role client.0 host ubuntu@smithi148.front.sepia.ceph.com
2023-06-01T03:09:18.614 DEBUG:teuthology.orchestra.run.smithi148:> sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph_test_trim_caps
2023-06-01T03:09:18.680 INFO:teuthology.orchestra.run.smithi148.stdout:child exits

...

2023-06-01T03:12:31.217 INFO:tasks.ceph:Checking cluster log for badness...
2023-06-01T03:12:31.218 DEBUG:teuthology.orchestra.run.smithi148:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v 'overall HEALTH_' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(MDS_FAILED\)' | egrep -v '\(MDS_DEGRADED\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(MDS_DAMAGE\)' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v 'overall HEALTH_' | egrep -v '\(OSD_DOWN\)' | egrep -v '\(OSD_' | egrep -v 'but it is still running' | egrep -v 'is not responding' | head -n 1

Actions #9

Updated by Xiubo Li 11 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 51890
Actions #11

Updated by Venky Shankar 10 months ago

Xiubo - PTAL at https://pulpito.ceph.com/vshankar-2023-06-20_10:07:44-fs-wip-vshankar-testing-20230620.052303-testing-default-smithi/7308836/

Kotresh pointed out this tracker during standup last week. When I debugged this before I knew about this tracker, the client was not renewing its caps with the MDS (request_renew message to the MDS), causing the MDS to evict and blocklist the client. Do you think the issue is the same as mentioned in note-8?

Actions #12

Updated by Xiubo Li 10 months ago

Venky Shankar wrote:

Xiubo - PTAL at https://pulpito.ceph.com/vshankar-2023-06-20_10:07:44-fs-wip-vshankar-testing-20230620.052303-testing-default-smithi/7308836/

Kotresh pointed out this tracker during standup last week. When I debugged this before I knew about this tracker, the client was not renewing its caps with the MDS (request_renew message to the MDS), causing the MDS to evict and blocklist the client. Do you think the issue is the same as mentioned in note-8?

Yeah, it looks like a similar issue, but this one is a kclient.

I will debug it tomorrow in detail.

Actions #13

Updated by Xiubo Li 10 months ago

Venky Shankar wrote:

Xiubo - PTAL at https://pulpito.ceph.com/vshankar-2023-06-20_10:07:44-fs-wip-vshankar-testing-20230620.052303-testing-default-smithi/7308836/

Kotresh pointed out this tracker during standup last week. When I debugged this before I knew about this tracker, the client was not renewing its caps with the MDS (request_renew message to the MDS), causing the MDS to evict and blocklist the client. Do you think the issue is the same as mentioned in note-8?

I checked the logs and found that the kclient detected that mds.0 was hung and then tried to reconnect, but was rejected by mds.0:

2023-06-20T14:00:24.527847+00:00 smithi202 kernel: [ 5065.714029] ceph: test_dummy_encryption mode enabled
2023-06-20T14:00:24.635919+00:00 smithi202 kernel: [ 5065.822431] libceph: mon0 (1)172.21.15.33:6789 session established
2023-06-20T14:00:24.635941+00:00 smithi202 kernel: [ 5065.822980] libceph: client6105 fsid 054ad731-bfaf-4587-9674-7b36e070bce0
2023-06-20T14:00:24.635944+00:00 smithi202 kernel: [ 5065.823118] ceph: test_dummy_encryption mode enabled
2023-06-20T14:05:30.663955+00:00 smithi202 kernel: [ 5371.858135] libceph: mds0 (1)172.21.15.33:6837 socket closed (con state OPEN)
2023-06-20T14:05:30.847883+00:00 smithi202 kernel: [ 5372.040949] ceph: [054ad731-bfaf-4587-9674-7b36e070bce0 6105] check_session_state: mds0 hung
2023-06-20T14:05:30.847895+00:00 smithi202 kernel: [ 5372.041437] libceph: mds0 (1)172.21.15.33:6837 session reset
2023-06-20T14:05:30.847896+00:00 smithi202 kernel: [ 5372.041442] ceph: [054ad731-bfaf-4587-9674-7b36e070bce0 6105] mds_peer_reset: mds0 closed our session
2023-06-20T14:05:30.847896+00:00 smithi202 kernel: [ 5372.041455] ceph: [054ad731-bfaf-4587-9674-7b36e070bce0 6105] send_mds_reconnect: mds0 reconnect start
2023-06-20T14:05:30.847897+00:00 smithi202 kernel: [ 5372.042226] ceph: [054ad731-bfaf-4587-9674-7b36e070bce0 6105] handle_session: mds0 reconnect denied
2023-06-20T14:05:31.251921+00:00 smithi202 kernel: [ 5372.443744] libceph: mon0 (1)172.21.15.33:6789 session established
2023-06-20T14:05:31.251941+00:00 smithi202 kernel: [ 5372.446011] libceph: client6108 fsid 054ad731-bfaf-4587-9674-7b36e070bce0
2023-06-20T14:05:31.251942+00:00 smithi202 kernel: [ 5372.446060] ceph: test_dummy_encryption mode enabled

Meanwhile, mds.0 didn't receive any ping (cap renewal) from client.6105, and 300 seconds later it evicted the client:

2023-06-20T14:05:25.254+0000 7fa2d4855640 20 mds.0.server deferring marking session client.6105 v1:172.21.15.202:0/2228992724 stale since it holds no caps
2023-06-20T14:05:25.254+0000 7fa2d4855640 20 mds.0.server laggiest active session is client.4664 v1:172.21.15.202:0/3945822218 and renewed caps recently (2.03597s ago)
...
2023-06-20T14:05:30.254+0000 7fa2d4855640 10 mds.0.server find_idle_sessions. last cleared laggy state 5672.85s ago
2023-06-20T14:05:30.254+0000 7fa2d4855640 20 mds.0.server evicting session client.6105 v1:172.21.15.202:0/2228992724 since autoclose has arrived
2023-06-20T14:05:30.254+0000 7fa2d4855640 20 mds.0.server laggiest active session is client.4664 v1:172.21.15.202:0/3945822218 and renewed caps recently (7.03589s ago)
2023-06-20T14:05:30.298+0000 7fa2d4855640  0 log_channel(cluster) log [WRN] : evicting unresponsive client smithi202 (6105), after 305.619 seconds
2023-06-20T14:05:30.298+0000 7fa2d4855640 10 mds.0.server autoclosing stale session client.6105 v1:172.21.15.202:0/2228992724 last renewed caps 305.619s ago
...
2023-06-20T14:05:30.642+0000 7fa2d6058640 10 mds.0.server kill_session 0x5606c4189400
2023-06-20T14:05:30.642+0000 7fa2d6058640 10 mds.0.server journal_close_session : client.6105 v1:172.21.15.202:0/2228992724 pending_prealloc_inos [] free_prealloc_inos [0x1000000b7f7~0x164,0x1000000b95c~0x284] delegated_inos []
2023-06-20T14:05:30.642+0000 7fa2d6058640 20 mds.0.sessionmap mark_projected s=0x5606c4189400 name=client.6105 pv=45762 -> 45763
2023-06-20T14:05:30.642+0000 7fa2d6058640 10 mds.0.inotable: project_release_ids [0x1000000b7f7~0x164,0x1000000b95c~0x284] to [0x1000000b3f3~0x86,0x1000000b47a~0xf4,0x1000000b587~0x26d,0x1000000bbe0~0xffffff4420]/[0x1000000b3f3~0x86,0x1000000b47a~0xf4,0x1000000b587~0x26d,0x1000000bbe0~0xffffff4420]
...
2023-06-20T14:05:30.658+0000 7fa2d004c640 10 remove_session: mds.metrics: session=0x5606c4189400, client=client.6105 v1:172.21.15.202:0/2228992724
2023-06-20T14:05:30.658+0000 7fa2d004c640 10 mds.0.sessionmap remove_session s=0x5606c4189400 name=client.6105
...
2023-06-20T14:05:30.842+0000 7fa2d985f640 10 mds.c parse_caps: parsing auth_cap_str='allow *'
2023-06-20T14:05:30.842+0000 7fa2d6058640 10 mds.c  new session 0x5606c10a6500 for client.6105 v1:172.21.15.202:0/2228992724 con 0x5606bd6fd800
2023-06-20T14:05:30.842+0000 7fa2d6058640 10 mds.c parse_caps: parsing auth_cap_str='allow *'
2023-06-20T14:05:30.842+0000 7fa2d6058640 10 mds.c ms_handle_accept v1:172.21.15.202:0/2228992724 con 0x5606bd6fd800 session 0x5606c10a6500
2023-06-20T14:05:30.842+0000 7fa2d6058640  1 -- [v2:172.21.15.33:6834/1215596373,v1:172.21.15.33:6837/1215596373] <== client.6105 v1:172.21.15.202:0/2228992724 1 ==== client_reconnect(1 caps 1 realms ) v3 ==== 0+0+98 (unknown 0 0 3797099533) 0x5606bd4aaea0 con 0x5606bd6fd800
2023-06-20T14:05:30.842+0000 7fa2d6058640  7 mds.0.server handle_client_reconnect client.6105
2023-06-20T14:05:30.842+0000 7fa2d6058640 20 mds.0.11 get_session have 0x5606c10a6500 client.6105 v1:172.21.15.202:0/2228992724 state closed
2023-06-20T14:05:30.842+0000 7fa2d6058640  0 mds.0.server  ignoring msg from not-open sessionclient_reconnect(1 caps 1 realms ) v3

It seems to be a network issue, because client.6105 from node 172.21.15.202 is not the only client affected; other clients from the same node show it too:

2023-06-20T14:05:20.254+0000 7fa2d4855640 20 mds.0.server laggiest active session is client.4664 v1:172.21.15.202:0/3945822218 and renewed caps recently (17.0037s ago)
Actions #14

Updated by Xiubo Li 10 months ago

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo - PTAL at https://pulpito.ceph.com/vshankar-2023-06-20_10:07:44-fs-wip-vshankar-testing-20230620.052303-testing-default-smithi/7308836/

Kotresh pointed out this tracker during standup last week. When I debugged this before I knew about this tracker, the client was not renewing its caps with the MDS (request_renew message to the MDS), causing the MDS to evict and blocklist the client. Do you think the issue is the same as mentioned in note-8?

I checked the logs and found that the kclient detected that mds.0 was hung and then tried to reconnect, but was rejected by mds.0:

[...]

Meanwhile, mds.0 didn't receive any ping (cap renewal) from client.6105, and 300 seconds later it evicted the client:

[...]

It seems to be a network issue, because client.6105 from node 172.21.15.202 is not the only client affected; other clients from the same node show it too:

[...]

I raised a new tracker to follow this up: https://tracker.ceph.com/issues/61843. It's a kclient bug.

Actions #15

Updated by Rishabh Dave 7 months ago

  • Status changed from Fix Under Review to Resolved