Actions
Bug #65803
openmds: some asok commands wait with asok thread blocked
Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:
0%
Source:
Q/A
Tags:
backport_processed
Backport:
squid,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS, qa-suite, quiesce
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Teuthology script is often running slow, as much as it is not able to keep timed events, arriving late and finding a quiesce set already expired.
2024-05-04T03:45:45.852 DEBUG:tasks.quiescer.fs.[cephfs]:Running ceph command: 'tell mds.24472 quiesce db / --timeout 6 0 --expiration 90.0 --await-for 42.0' 2024-05-04T03:45:54.567 INFO:tasks.quiescer.fs.[cephfs]:Successfully quiesced set '19e440bc', quiesce took 11.6 seconds . Will release after: 30.0 2024-05-04T03:46:27.347 DEBUG:tasks.quiescer.fs.[cephfs]:Running ceph command: 'tell mds.24472 quiesce db --set-id 19e440bc --release --await' // 1 minute difference! 2024-05-04T03:47:25.002 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't release set '19e440bc' with rc: 1 (EPERM), stdout: { "epoch": 34, "leader": 24472, "set_version": 15, "sets": { "19e440bc": { "version": 15, "age_ref": 0.4, "state": { "name": "EXPIRED", "age": 0.0 }, "timeout": 60.0, "expiration": 90.0, "members": { "file:/": { "excluded": false, "state": { "name": "QUIESCED", "age": 90.0 } } } } } }
Other similar failures:
https://pulpito.ceph.com/pdonnell-2024-05-03_22:48:16-fs-wip-pdonnell-testing-20240503.163550-debug-distro-default-smithi/7688507/
https://pulpito.ceph.com/pdonnell-2024-05-03_22:48:16-fs-wip-pdonnell-testing-20240503.163550-debug-distro-default-smithi/7688512/
https://pulpito.ceph.com/pdonnell-2024-05-03_22:48:16-fs-wip-pdonnell-testing-20240503.163550-debug-distro-default-smithi/7688477/
Updated by Leonid Usov 14 days ago
- Status changed from New to Fix Under Review
- Pull request ID set to 57274
Updated by Patrick Donnelly 13 days ago
The actual issue is that `flush journal` command is synchronous: it's blocking the admin socket thread in the mds:
# flush journal command: 2024-05-04T03:45:54.479 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:45:54.473+0000 7f56fa833740 4 client.15918 mds_command: new command op to 24472 tid=0 multi_id=0 [{"prefix": "flush journal"}] ... # the quiesce command is learning commands: 2024-05-04T03:46:29.309 INFO:teuthology.orchestra.run.smithi117.stderr:2024-05-04T03:46:29.302+0000 7f49ea688b80 4 client.15936 mds_command: new command op to 24472 tid=0 multi_id=0 [{"prefix": "get_command_descriptions"}] ... # flush journal command reply: 2024-05-04T03:47:24.173 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.171+0000 7f56d4ff9640 1 -- 172.21.15.117:0/248605477 <== mds.0 v2:172.21.15.175:6838/267841121 1 ==== command_reply(tid 0: 0 ) ==== 8+0+44 (crc 0 0 0) 0x557e7cf9e050 con 0x557e7cc241c0 2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.171+0000 7f56d4ff9640 10 client.15918 handle_command_reply: tid=0 2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.171+0000 7f56fa833740 1 client.15918 shutdown 2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.171+0000 7f56fa833740 1 -- 172.21.15.117:0/248605477 >> [v2:172.21.15.117:3300/0,v1:172.21.15.117:6789/0] conn(0x557e7cc901b0 msgr2=0x557e7d041a60 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down 2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.171+0000 7f56fa833740 1 --2- 172.21.15.117:0/248605477 >> [v2:172.21.15.117:3300/0,v1:172.21.15.117:6789/0] conn(0x557e7cc901b0 0x557e7d041a60 secure :-1 s=READY pgs=577 cs=0 l=1 rev1=1 crypto rx=0x7f56e4004a80 tx=0x7f56e40040f0 comp rx=0 tx=0).stop 2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740 1 -- 172.21.15.117:0/248605477 shutdown_connections 2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740 1 --2- 172.21.15.117:0/248605477 >> [v2:172.21.15.175:6838/267841121,v1:172.21.15.175:6839/267841121] conn(0x557e7cc241c0 0x557e7d0b3bc0 crc :-1 s=READY pgs=86 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop 2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740 1 --2- 172.21.15.117:0/248605477 >> [v2:172.21.15.117:3300/0,v1:172.21.15.117:6789/0] conn(0x557e7cc901b0 0x557e7d041a60 unknown :-1 s=CLOSED pgs=577 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop 2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740 1 --2- 172.21.15.117:0/248605477 >> [v2:172.21.15.138:3300/0,v1:172.21.15.138:6789/0] conn(0x557e7cc6f4d0 0x557e7d040b30 unknown :-1 s=CLOSED pgs=0 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop 2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740 1 --2- 172.21.15.117:0/248605477 >> [v2:172.21.15.175:3300/0,v1:172.21.15.175:6789/0] conn(0x557e7cc672f0 0x557e7d0405f0 unknown :-1 s=CLOSED pgs=0 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop 2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740 1 -- 172.21.15.117:0/248605477 >> 172.21.15.117:0/248605477 conn(0x557e7cc84300 msgr2=0x557e7d02b970 unknown :-1 s=STATE_NONE l=0).mark_down 2024-05-04T03:47:24.174 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56d4ff9640 0 client.15918 ms_handle_reset on v2:172.21.15.175:6838/267841121 2024-05-04T03:47:24.175 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740 1 -- 172.21.15.117:0/248605477 shutdown_connections 2024-05-04T03:47:24.175 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740 1 -- 172.21.15.117:0/248605477 wait complete. 2024-05-04T03:47:24.175 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f56fa833740 20 client.15918 trim_cache size 0 max 16384 2024-05-04T03:47:24.175 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stdout:{ 2024-05-04T03:47:24.175 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stdout: "message": "", 2024-05-04T03:47:24.175 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stdout: "return_code": 0 2024-05-04T03:47:24.175 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stdout:} ... # get_command_descriptions reply: 2024-05-04T03:47:24.176 INFO:teuthology.orchestra.run.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f49bd7fa700 1 -- 172.21.15.117:0/3225639302 <== mds.0 v2:172.21.15.175:6838/267841121 1 ==== command_reply(tid 0: 0 ) ==== 8+0+23648 (crc 0 0 0) 0x557a6f2d4340 con 0x557a6f3be920 2024-05-04T03:47:24.176 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.173+0000 7f56f38aa640 1 -- 172.21.15.117:0/2016595433 >> [v2:172.21.15.117:3300/0,v1:172.21.15.117:6789/0] conn(0x7f56ec103da0 msgr2=0x7f56ec10bad0 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down 2024-05-04T03:47:24.177 INFO:teuthology.task.background_exec.ubuntu@smithi117.front.sepia.ceph.com.smithi117.stderr:2024-05-04T03:47:24.173+0000 7f56f38aa640 1 --2- 172.21.15.117:0/2016595433 >> [v2:172.21.15.117:3300/0,v1:172.21.15.117:6789/0] conn(0x7f56ec103da0 0x7f56ec10bad0 secure :-1 s=READY pgs=573 cs=0 l=1 rev1=1 crypto rx=0x7f56dc00c890 tx=0x7f56dc00cd50 comp rx=0 tx=0).stop 2024-05-04T03:47:24.177 INFO:teuthology.orchestra.run.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f49bd7fa700 10 client.15936 handle_command_reply: tid=0 2024-05-04T03:47:24.177 INFO:teuthology.orchestra.run.smithi117.stderr:2024-05-04T03:47:24.172+0000 7f49ea688b80 1 client.15936 shutdown
Updated by Patrick Donnelly 12 days ago
- Subject changed from Teuthology is not keeping up with the tight timing constraints of the quiescer to mds: some asok commands wait with asok thread blocked
- Category set to Correctness/Safety
- Target version set to v20.0.0
- Source set to Q/A
- Backport set to squid,reef
- Component(FS) MDS added
Updated by Patrick Donnelly 4 days ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 4 days ago
- Copied to Backport #66046: squid: mds: some asok commands wait with asok thread blocked added
Updated by Backport Bot 4 days ago
- Copied to Backport #66047: reef: mds: some asok commands wait with asok thread blocked added
Actions