Bug #61394
qa/quincy: "cluster [WRN] evicting unresponsive client smithi152 (4298), after 303.726 seconds" in cluster log
Description
Description: fs/bugs/client_trim_caps/{begin/{0-install 1-ceph 2-logrotate} centos_latest clusters/small-cluster conf/{client mds mon osd} objectstore/bluestore-bitmap overrides/{ignorelist_health ignorelist_wrongly_marked_down no_client_pidfile} tasks/trim-i24137}
Failure Reason:
"2023-05-24T00:18:22.682443+0000 mds.a (mds.0) 1 : cluster [WRN] evicting unresponsive client smithi152 (4298), after 303.726 seconds" in cluster log
Related issues
History
#1 Updated by Milind Changire 6 months ago
- Assignee set to Xiubo Li
#2 Updated by Patrick Donnelly 6 months ago
Client log here:
/teuthology/yuriw-2023-05-23_15:23:11-fs-wip-yuri10-testing-2023-05-18-0815-quincy-distro-default-smithi/7284019/remote/smithi152/log/ceph-client.admin.115487.log.gz
#3 Updated by Patrick Donnelly 6 months ago
- Related to Bug #24137: client: segfault in trim_caps added
#5 Updated by Xiubo Li 6 months ago
Patrick,
From reading src/test/fs/test_trim_caps.cc and the test case, the expected result is that the two mounts in the parent and child processes shouldn't crash, right? It's just that the logging in src/test/fs/test_trim_caps.cc is a little confusing.
If that's true, then the test actually passed and there was no crash at all. The child process simply exited without unmounting, leaving the MDS to keep tracking the killed mount.
Maybe we should explicitly unmount before the child process exits.
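A minimal sketch of what that could look like, assuming the test uses the libcephfs C API (ceph_unmount()/ceph_release() are the standard teardown calls; the helper name and surrounding structure here are illustrative, not the actual test_trim_caps.cc code):

#include <cephfs/libcephfs.h>
#include <unistd.h>

/*
 * Hypothetical child-process teardown: unmount and release the client
 * before exiting, so the MDS sees a clean session close instead of an
 * abandoned session that it later evicts as "unresponsive".
 */
static void child_cleanup_and_exit(struct ceph_mount_info *cmount, int status)
{
  if (cmount) {
    ceph_unmount(cmount);   /* cleanly closes the MDS session */
    ceph_release(cmount);   /* frees the client instance */
  }
  _exit(status);
}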
#7 Updated by Kotresh Hiremath Ravishankar 6 months ago
another instance in reef:
https://pulpito.ceph.com/yuriw-2023-05-28_14:46:14-fs-reef-release-distro-default-smithi/7288948
#8 Updated by Xiubo Li 6 months ago
Patrick Donnelly wrote:
Client log here:
/teuthology/yuriw-2023-05-23_15:23:11-fs-wip-yuri10-testing-2023-05-18-0815-quincy-distro-default-smithi/7284019/remote/smithi152/log/ceph-client.admin.115487.log.gz
From this test case:
2023-05-24T00:12:37.908 DEBUG:teuthology.orchestra.run.smithi152:> sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph_test_trim_caps
2023-05-24T00:12:37.974 INFO:teuthology.orchestra.run.smithi152.stdout:child exits ...
2023-05-24T00:20:15.215 INFO:tasks.ceph:Checking cluster log for badness...
2023-05-24T00:20:15.215 DEBUG:teuthology.orchestra.run.smithi152:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v 'overall HEALTH_' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(MDS_FAILED\)' | egrep -v '\(MDS_DEGRADED\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(MDS_DAMAGE\)' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v 'overall HEALTH_' | egrep -v '\(OSD_DOWN\)' | egrep -v '\(OSD_' | egrep -v 'but it is still running' | egrep -v 'is not responding' | head -n 1
2023-05-24T00:20:15.287 INFO:teuthology.orchestra.run.smithi152.stdout:2023-05-24T00:18:22.682443+0000 mds.a (mds.0) 1 : cluster [WRN] evicting unresponsive client smithi152 (4298), after 303.726 seconds
2023-05-24T00:20:15.287 WARNING:tasks.ceph:Found errors (ERR|WRN|SEC) in cluster log
2023-05-24T00:20:15.288 DEBUG:teuthology.orchestra.run.smithi152:> sudo egrep '\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v 'overall HEALTH_' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(MDS_FAILED\)' | egrep -v '\(MDS_DEGRADED\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(MDS_DAMAGE\)' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v 'overall HEALTH_' | egrep -v '\(OSD_DOWN\)' | egrep -v '\(OSD_' | egrep -v 'but it is still running' | egrep -v 'is not responding' | head -n 1
2023-05-24T00:20:15.358 DEBUG:teuthology.orchestra.run.smithi152:> sudo egrep '\[ERR\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v 'overall HEALTH_' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(MDS_FAILED\)' | egrep -v '\(MDS_DEGRADED\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(MDS_DAMAGE\)' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v 'overall HEALTH_' | egrep -v '\(OSD_DOWN\)' | egrep -v '\(OSD_' | egrep -v 'but it is still running' | egrep -v 'is not responding' | head -n 1
2023-05-24T00:20:15.429 DEBUG:teuthology.orchestra.run.smithi152:> sudo egrep '\[WRN\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v 'overall HEALTH_' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(MDS_FAILED\)' | egrep -v '\(MDS_DEGRADED\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(MDS_DAMAGE\)' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v 'overall HEALTH_' | egrep -v '\(OSD_DOWN\)' | egrep -v '\(OSD_' | egrep -v 'but it is still running' | egrep -v 'is not responding' | head -n 1
2023-05-24T00:20:15.499 INFO:teuthology.orchestra.run.smithi152.stdout:2023-05-24T00:18:22.682443+0000 mds.a (mds.0) 1 : cluster [WRN] evicting unresponsive client smithi152 (4298), after 303.726 seconds
By the time the cluster-log badness check runs, the ceph_test_trim_caps test has already been finished for around 8 minutes, while the MDS only waits 5 minutes before marking the client unresponsive and evicting it.
So if the badness check ran within roughly 5 minutes of the test finishing, the run would pass.
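For reference, the approximate timeline from the failing run above (assuming the MDS was running with the default mds_session_autoclose of 300 seconds): the child exits at 00:12:37, the MDS logs the eviction warning at 00:18:22 once the idle session has passed the 300-second autoclose window, and the cluster-log scan only runs at 00:20:15, roughly 7.5 minutes after the child exited, so the warning is already in the log by then.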
I found a successful run, which only took around 4 minutes:
https://pulpito.ceph.com/yuriw-2023-05-31_21:56:15-fs-wip-yuri6-testing-2023-05-31-0933-quincy-distro-default-smithi/7292740/
2023-06-01T03:09:18.613 INFO:teuthology.task.exec:Running commands on role client.0 host ubuntu@smithi148.front.sepia.ceph.com
2023-06-01T03:09:18.614 DEBUG:teuthology.orchestra.run.smithi148:> sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph_test_trim_caps
2023-06-01T03:09:18.680 INFO:teuthology.orchestra.run.smithi148.stdout:child exits ...
2023-06-01T03:12:31.217 INFO:tasks.ceph:Checking cluster log for badness...
2023-06-01T03:12:31.218 DEBUG:teuthology.orchestra.run.smithi148:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v 'overall HEALTH_' | egrep -v '\(FS_DEGRADED\)' | egrep -v '\(MDS_FAILED\)' | egrep -v '\(MDS_DEGRADED\)' | egrep -v '\(FS_WITH_FAILED_MDS\)' | egrep -v '\(MDS_DAMAGE\)' | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | egrep -v '\(FS_INLINE_DATA_DEPRECATED\)' | egrep -v 'overall HEALTH_' | egrep -v '\(OSD_DOWN\)' | egrep -v '\(OSD_' | egrep -v 'but it is still running' | egrep -v 'is not responding' | head -n 1
#11 Updated by Venky Shankar 5 months ago
Xiubo - PTAL at https://pulpito.ceph.com/vshankar-2023-06-20_10:07:44-fs-wip-vshankar-testing-20230620.052303-testing-default-smithi/7308836/
Kotresh pointed out this tracker during standup last week. When I debugged this before I knew about this tracker, the client was not renewing its caps with the MDS (request_renew message to the MDS), causing the MDS to evict and blocklist the client. Do you think the issue is the same as mentioned in note-8?
#12 Updated by Xiubo Li 5 months ago
Venky Shankar wrote:
Xiubo - PTAL at https://pulpito.ceph.com/vshankar-2023-06-20_10:07:44-fs-wip-vshankar-testing-20230620.052303-testing-default-smithi/7308836/
Kotresh pointed out this tracker during standup last week. When I debugged this before I knew about this tracker, the client was not renewing its caps with the MDS (request_renew message to the MDS), causing the MDS to evict and blocklist the client. Do you think the issue is the same as mentioned in note-8?
Yeah, it looks like a similar issue, but this one involves a kclient (kernel client).
I will debug it tomorrow in detail.
#13 Updated by Xiubo Li 5 months ago
Venky Shankar wrote:
Xiubo - PTAL at https://pulpito.ceph.com/vshankar-2023-06-20_10:07:44-fs-wip-vshankar-testing-20230620.052303-testing-default-smithi/7308836/
Kotresh pointed out this tracker during standup last week. When I debugged this before I knew about this tracker, the client was not renewing its caps with the MDS (request_renew message to the MDS), causing the MDS to evict and blocklist the client. Do you think the issue is the same as mentioned in note-8?
I checked the logs and found that the kclient detected mds.0 as hung and then tried to reconnect, but was rejected by mds.0:
2023-06-20T14:00:24.527847+00:00 smithi202 kernel: [ 5065.714029] ceph: test_dummy_encryption mode enabled
2023-06-20T14:00:24.635919+00:00 smithi202 kernel: [ 5065.822431] libceph: mon0 (1)172.21.15.33:6789 session established
2023-06-20T14:00:24.635941+00:00 smithi202 kernel: [ 5065.822980] libceph: client6105 fsid 054ad731-bfaf-4587-9674-7b36e070bce0
2023-06-20T14:00:24.635944+00:00 smithi202 kernel: [ 5065.823118] ceph: test_dummy_encryption mode enabled
2023-06-20T14:05:30.663955+00:00 smithi202 kernel: [ 5371.858135] libceph: mds0 (1)172.21.15.33:6837 socket closed (con state OPEN)
2023-06-20T14:05:30.847883+00:00 smithi202 kernel: [ 5372.040949] ceph: [054ad731-bfaf-4587-9674-7b36e070bce0 6105] check_session_state: mds0 hung
2023-06-20T14:05:30.847895+00:00 smithi202 kernel: [ 5372.041437] libceph: mds0 (1)172.21.15.33:6837 session reset
2023-06-20T14:05:30.847896+00:00 smithi202 kernel: [ 5372.041442] ceph: [054ad731-bfaf-4587-9674-7b36e070bce0 6105] mds_peer_reset: mds0 closed our session
2023-06-20T14:05:30.847896+00:00 smithi202 kernel: [ 5372.041455] ceph: [054ad731-bfaf-4587-9674-7b36e070bce0 6105] send_mds_reconnect: mds0 reconnect start
2023-06-20T14:05:30.847897+00:00 smithi202 kernel: [ 5372.042226] ceph: [054ad731-bfaf-4587-9674-7b36e070bce0 6105] handle_session: mds0 reconnect denied
2023-06-20T14:05:31.251921+00:00 smithi202 kernel: [ 5372.443744] libceph: mon0 (1)172.21.15.33:6789 session established
2023-06-20T14:05:31.251941+00:00 smithi202 kernel: [ 5372.446011] libceph: client6108 fsid 054ad731-bfaf-4587-9674-7b36e070bce0
2023-06-20T14:05:31.251942+00:00 smithi202 kernel: [ 5372.446060] ceph: test_dummy_encryption mode enabled
Meanwhile, mds.0 didn't receive any ping (cap renewal) from client.6105 and, 300 seconds later, simply evicted the client:
2023-06-20T14:05:25.254+0000 7fa2d4855640 20 mds.0.server deferring marking session client.6105 v1:172.21.15.202:0/2228992724 stale since it holds no caps
2023-06-20T14:05:25.254+0000 7fa2d4855640 20 mds.0.server laggiest active session is client.4664 v1:172.21.15.202:0/3945822218 and renewed caps recently (2.03597s ago)
...
2023-06-20T14:05:30.254+0000 7fa2d4855640 10 mds.0.server find_idle_sessions. last cleared laggy state 5672.85s ago
2023-06-20T14:05:30.254+0000 7fa2d4855640 20 mds.0.server evicting session client.6105 v1:172.21.15.202:0/2228992724 since autoclose has arrived
2023-06-20T14:05:30.254+0000 7fa2d4855640 20 mds.0.server laggiest active session is client.4664 v1:172.21.15.202:0/3945822218 and renewed caps recently (7.03589s ago)
2023-06-20T14:05:30.298+0000 7fa2d4855640 0 log_channel(cluster) log [WRN] : evicting unresponsive client smithi202 (6105), after 305.619 seconds
2023-06-20T14:05:30.298+0000 7fa2d4855640 10 mds.0.server autoclosing stale session client.6105 v1:172.21.15.202:0/2228992724 last renewed caps 305.619s ago
...
2023-06-20T14:05:30.642+0000 7fa2d6058640 10 mds.0.server kill_session 0x5606c4189400
2023-06-20T14:05:30.642+0000 7fa2d6058640 10 mds.0.server journal_close_session : client.6105 v1:172.21.15.202:0/2228992724 pending_prealloc_inos [] free_prealloc_inos [0x1000000b7f7~0x164,0x1000000b95c~0x284] delegated_inos []
2023-06-20T14:05:30.642+0000 7fa2d6058640 20 mds.0.sessionmap mark_projected s=0x5606c4189400 name=client.6105 pv=45762 -> 45763
2023-06-20T14:05:30.642+0000 7fa2d6058640 10 mds.0.inotable: project_release_ids [0x1000000b7f7~0x164,0x1000000b95c~0x284] to [0x1000000b3f3~0x86,0x1000000b47a~0xf4,0x1000000b587~0x26d,0x1000000bbe0~0xffffff4420]/[0x1000000b3f3~0x86,0x1000000b47a~0xf4,0x1000000b587~0x26d,0x1000000bbe0~0xffffff4420]
...
2023-06-20T14:05:30.658+0000 7fa2d004c640 10 remove_session: mds.metrics: session=0x5606c4189400, client=client.6105 v1:172.21.15.202:0/2228992724
2023-06-20T14:05:30.658+0000 7fa2d004c640 10 mds.0.sessionmap remove_session s=0x5606c4189400 name=client.6105
...
2023-06-20T14:05:30.842+0000 7fa2d985f640 10 mds.c parse_caps: parsing auth_cap_str='allow *'
2023-06-20T14:05:30.842+0000 7fa2d6058640 10 mds.c new session 0x5606c10a6500 for client.6105 v1:172.21.15.202:0/2228992724 con 0x5606bd6fd800
2023-06-20T14:05:30.842+0000 7fa2d6058640 10 mds.c parse_caps: parsing auth_cap_str='allow *'
2023-06-20T14:05:30.842+0000 7fa2d6058640 10 mds.c ms_handle_accept v1:172.21.15.202:0/2228992724 con 0x5606bd6fd800 session 0x5606c10a6500
2023-06-20T14:05:30.842+0000 7fa2d6058640 1 -- [v2:172.21.15.33:6834/1215596373,v1:172.21.15.33:6837/1215596373] <== client.6105 v1:172.21.15.202:0/2228992724 1 ==== client_reconnect(1 caps 1 realms ) v3 ==== 0+0+98 (unknown 0 0 3797099533) 0x5606bd4aaea0 con 0x5606bd6fd800
2023-06-20T14:05:30.842+0000 7fa2d6058640 7 mds.0.server handle_client_reconnect client.6105
2023-06-20T14:05:30.842+0000 7fa2d6058640 20 mds.0.11 get_session have 0x5606c10a6500 client.6105 v1:172.21.15.202:0/2228992724 state closed
2023-06-20T14:05:30.842+0000 7fa2d6058640 0 mds.0.server ignoring msg from not-open sessionclient_reconnect(1 caps 1 realms ) v3
This looks like a network issue, because client.6105 from node 172.21.15.202 is not the only one affected; other clients on the same node also show lagging cap renewals, for example:
2023-06-20T14:05:20.254+0000 7fa2d4855640 20 mds.0.server laggiest active session is client.4664 v1:172.21.15.202:0/3945822218 and renewed caps recently (17.0037s ago)
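For context, the find_idle_sessions/autoclose behaviour visible in the MDS log above boils down to a check along these lines (a simplified, illustrative model only; the real logic lives in the MDS Server code and is much more involved):

#include <chrono>

// Simplified model of the autoclose decision: a session whose last cap
// renewal is older than the autoclose window (default 300s) gets evicted,
// whether or not the client process is still alive on the other end.
struct SessionInfo {
  std::chrono::steady_clock::time_point last_cap_renew;
};

bool should_autoclose(const SessionInfo &s,
                      std::chrono::steady_clock::time_point now,
                      std::chrono::seconds session_autoclose = std::chrono::seconds(300))
{
  return (now - s.last_cap_renew) > session_autoclose;
}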
#14 Updated by Xiubo Li 5 months ago
Xiubo Li wrote:
Venky Shankar wrote:
Xiubo - PTAL at https://pulpito.ceph.com/vshankar-2023-06-20_10:07:44-fs-wip-vshankar-testing-20230620.052303-testing-default-smithi/7308836/
Kotresh pointed out this tracker during standup last week. When I debugged this before I knew about this tracker, the client was not renewing its caps with the MDS (request_renew message to the MDS), causing the MDS to evict and blocklist the client. Do you think the issue is the same as mentioned in note-8?
I checked the logs and found that the kclient detected mds.0 as hung and then tried to reconnect, but was rejected by mds.0:
[...]
Meanwhile, mds.0 didn't receive any ping (cap renewal) from client.6105 and, 300 seconds later, simply evicted the client:
[...]
This looks like a network issue, because client.6105 from node 172.21.15.202 is not the only one affected; other clients on the same node also show lagging cap renewals, for example:
[...]
Raised a new tracker to follow this: https://tracker.ceph.com/issues/61843. It's a kclient bug.
#15 Updated by Rishabh Dave 3 months ago
- Status changed from Fix Under Review to Resolved