Project

General

Profile

Actions

Bug #65803

open

mds: some asok commands wait with asok thread blocked

Added by Leonid Usov 14 days ago. Updated 4 days ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Q/A
Tags:
backport_processed
Backport:
squid,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS, qa-suite, quiesce
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Teuthology script is often running slow, as much as it is not able to keep timed events, arriving late and finding a quiesce set already expired.

Example: https://pulpito.ceph.com/pdonnell-2024-05-03_22:48:16-fs-wip-pdonnell-testing-20240503.163550-debug-distro-default-smithi/7688644/

2024-05-04T03:45:45.852 DEBUG:tasks.quiescer.fs.[cephfs]:Running ceph command: 'tell mds.24472 quiesce db / --timeout 6
0 --expiration 90.0 --await-for 42.0'

2024-05-04T03:45:54.567 INFO:tasks.quiescer.fs.[cephfs]:Successfully quiesced set '19e440bc', quiesce took 11.6 seconds
. Will release after: 30.0

2024-05-04T03:46:27.347 DEBUG:tasks.quiescer.fs.[cephfs]:Running ceph command: 'tell mds.24472 quiesce db --set-id 19e440bc --release --await'
// 1 minute difference!
2024-05-04T03:47:25.002 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't release set '19e440bc' with rc: 1 (EPERM), stdout:
{
    "epoch": 34,
    "leader": 24472,
    "set_version": 15,
    "sets": {
        "19e440bc": {
            "version": 15,
            "age_ref": 0.4,
            "state": {
                "name": "EXPIRED",
                "age": 0.0
            },
            "timeout": 60.0,
            "expiration": 90.0,
            "members": {
                "file:/": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCED",
                        "age": 90.0
                    }
                }
            }
        }
    }
}

Other similar failures:

https://pulpito.ceph.com/pdonnell-2024-05-03_22:48:16-fs-wip-pdonnell-testing-20240503.163550-debug-distro-default-smithi/7688507/
https://pulpito.ceph.com/pdonnell-2024-05-03_22:48:16-fs-wip-pdonnell-testing-20240503.163550-debug-distro-default-smithi/7688512/
https://pulpito.ceph.com/pdonnell-2024-05-03_22:48:16-fs-wip-pdonnell-testing-20240503.163550-debug-distro-default-smithi/7688477/


Related issues 2 (2 open0 closed)

Copied to CephFS - Backport #66046: squid: mds: some asok commands wait with asok thread blockedNewLeonid UsovActions
Copied to CephFS - Backport #66047: reef: mds: some asok commands wait with asok thread blockedNewLeonid UsovActions
Actions #1

Updated by Leonid Usov 14 days ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 57274
Actions #2

Updated by Patrick Donnelly 13 days ago

The actual issue is that `flush journal` command is synchronous: it's blocking the admin socket thread in the mds:

# flush journal command:
2024-05-04T03:45:54.479 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:45:54.473+0000 7f56fa833740  4 client.15918 mds_command: new command op to 24472 tid=0 multi_id=0 [{"prefix": "flush journal"}]
...
# the quiesce command is learning commands:
2024-05-04T03:46:29.309 INFO:teuthology.orchestra.run.smithi117.stderr:2024-05-04T03:46:29.302+0000 7f49ea688b80  4 client.15936 mds_command: new command op to 24472 tid=0 multi_id=0 [{"prefix": "get_command_descriptions"}]
...
# flush journal command reply:
2024-05-04T03:47:24.173 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.171+0000 7f56d4ff9640  1 -- 172.21.15.117:0/248605477 <== mds.0 v2:172.21.15.175:6838/267841121 1 ==== command_reply(tid 0: 0 ) ==== 8+0+44 (crc 0 0 0) 0x557e7cf9e050 con 0x557e7cc241c0
2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.171+0000 7f56d4ff9640 10 client.15918 handle_command_reply: tid=0
2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.171+0000 7f56fa833740  1 client.15918 shutdown
2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.171+0000 7f56fa833740  1 -- 172.21.15.117:0/248605477 >> [v2:172.21.15.117:3300/0,v1:172.21.15.117:6789/0] conn(0x557e7cc901b0 msgr2=0x557e7d041a60 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.171+0000 7f56fa833740  1 --2- 172.21.15.117:0/248605477 >> [v2:172.21.15.117:3300/0,v1:172.21.15.117:6789/0] conn(0x557e7cc901b0 0x557e7d041a60 secure :-1 s=READY pgs=577 cs=0 l=1 rev1=1 crypto rx=0x7f56e4004a80 tx=0x7f56e40040f0 comp rx=0 tx=0).stop
2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740  1 -- 172.21.15.117:0/248605477 shutdown_connections
2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740  1 --2- 172.21.15.117:0/248605477 >> [v2:172.21.15.175:6838/267841121,v1:172.21.15.175:6839/267841121] conn(0x557e7cc241c0 0x557e7d0b3bc0 crc :-1 s=READY pgs=86 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop
2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740  1 --2- 172.21.15.117:0/248605477 >> [v2:172.21.15.117:3300/0,v1:172.21.15.117:6789/0] conn(0x557e7cc901b0 0x557e7d041a60 unknown :-1 s=CLOSED pgs=577 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop
2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740  1 --2- 172.21.15.117:0/248605477 >> [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] conn(0x557e7cc6f4d0 0x557e7d040b30 unknown :-1 s=CLOSED pgs=0 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop
2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740  1 --2- 172.21.15.117:0/248605477 >> [v2:172.21.15.175:3300/0,v1:172.21.15.175:6789/0] conn(0x557e7cc672f0 0x557e7d0405f0 unknown :-1 s=CLOSED pgs=0 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop
2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740  1 -- 172.21.15.117:0/248605477 >> 172.21.15.117:0/248605477 conn(0x557e7cc84300 msgr2=0x557e7d02b970 unknown :-1 s=STATE_NONE l=0).mark_down
2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56d4ff9640  0 client.15918 ms_handle_reset on v2:172.21.15.175:6838/267841121
2024-05-04T03:47:24.175 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740  1 -- 172.21.15.117:0/248605477 shutdown_connections
2024-05-04T03:47:24.175 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740  1 -- 172.21.15.117:0/248605477 wait complete.
2024-05-04T03:47:24.175 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740 20 client.15918 trim_cache size 0 max 16384
2024-05-04T03:47:24.175 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stdout:{
2024-05-04T03:47:24.175 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stdout:    "message": "",
2024-05-04T03:47:24.175 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stdout:    "return_code": 0
2024-05-04T03:47:24.175 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stdout:}
...
# get_command_descriptions reply:
2024-05-04T03:47:24.176 INFO:teuthology.orchestra.run.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f49bd7fa700  1 -- 172.21.15.117:0/3225639302 <== mds.0 v2:172.21.15.175:6838/267841121 1 ==== command_reply(tid 0: 0 ) ==== 8+0+23648 (crc 0 0 0) 0x557a6f2d4340 con 0x557a6f3be920
2024-05-04T03:47:24.176 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.173+0000 7f56f38aa640  1 -- 172.21.15.117:0/2016595433 >> [v2:172.21.15.117:3300/0,v1:172.21.15.117:6789/0] conn(0x7f56ec103da0 msgr2=0x7f56ec10bad0 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2024-05-04T03:47:24.177 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.173+0000 7f56f38aa640  1 --2- 172.21.15.117:0/2016595433 >> [v2:172.21.15.117:3300/0,v1:172.21.15.117:6789/0] conn(0x7f56ec103da0 0x7f56ec10bad0 secure :-1 s=READY pgs=573 cs=0 l=1 rev1=1 crypto rx=0x7f56dc00c890 tx=0x7f56dc00cd50 comp rx=0 tx=0).stop
2024-05-04T03:47:24.177 INFO:teuthology.orchestra.run.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f49bd7fa700 10 client.15936 handle_command_reply: tid=0
2024-05-04T03:47:24.177 INFO:teuthology.orchestra.run.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f49ea688b80  1 client.15936 shutdown
Actions #3

Updated by Patrick Donnelly 12 days ago

  • Subject changed from Teuthology is not keeping up with the tight timing constraints of the quiescer to mds: some asok commands wait with asok thread blocked
  • Category set to Correctness/Safety
  • Target version set to v20.0.0
  • Source set to Q/A
  • Backport set to squid,reef
  • Component(FS) MDS added
Actions #4

Updated by Patrick Donnelly 4 days ago

  • Status changed from Fix Under Review to Pending Backport
Actions #5

Updated by Backport Bot 4 days ago

  • Copied to Backport #66046: squid: mds: some asok commands wait with asok thread blocked added
Actions #6

Updated by Backport Bot 4 days ago

  • Copied to Backport #66047: reef: mds: some asok commands wait with asok thread blocked added
Actions #7

Updated by Backport Bot 4 days ago

  • Tags set to backport_processed
Actions

Also available in: Atom PDF