Bug #50927
openrbd_mirror_stress.sh got stuck on ceph_test_rbd_mirror_random_write
0%
Description
This is on octopus and not really clear what happened, but since ceph_test_rbd_mirror_random_write has been broken for a while (#49117), noting just in case:
2021-05-20T19:03:15.278 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stderr:+ timeout 5s ceph_test_rbd_mirror_random_write --cluster cluster2 mirror test --debug-rbd=20 --debug-journaler=20 2021-05-20T19:03:15.655 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:100 IOs, 61108224 bytes 2021-05-20T19:03:15.823 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:200 IOs, 85588992 bytes 2021-05-20T19:03:15.984 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:300 IOs, 116281344 bytes 2021-05-20T19:03:16.111 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:400 IOs, 134831104 bytes 2021-05-20T19:03:16.199 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:500 IOs, 153641984 bytes 2021-05-20T19:03:16.428 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:600 IOs, 205452288 bytes 2021-05-20T19:03:16.582 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:700 IOs, 244076544 bytes 2021-05-20T19:03:16.752 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:800 IOs, 278330368 bytes 2021-05-20T19:03:16.927 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:900 IOs, 321813504 bytes 2021-05-20T19:03:17.050 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1000 IOs, 351192064 bytes 2021-05-20T19:03:17.431 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1100 IOs, 480282624 bytes 2021-05-20T19:03:17.573 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1200 IOs, 505250816 bytes 2021-05-20T19:03:17.745 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1300 IOs, 546650112 bytes 2021-05-20T19:03:18.015 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1400 IOs, 628509696 bytes 2021-05-20T19:03:18.203 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1500 IOs, 677974016 bytes 2021-05-20T19:03:18.306 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1600 IOs, 685206528 bytes 2021-05-20T19:03:18.509 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1700 IOs, 760324096 bytes 2021-05-20T19:03:18.693 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1800 IOs, 792964096 bytes 2021-05-20T19:03:18.841 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1900 IOs, 821672960 bytes 2021-05-20T19:03:18.942 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:2000 IOs, 829261824 bytes 2021-05-20T19:03:19.102 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:2100 IOs, 882796544 bytes 2021-05-20T19:03:19.424 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:2200 IOs, 975109120 bytes 2021-05-20T19:03:19.600 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:2300 IOs, 1020017664 bytes 2021-05-20T19:03:19.747 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:2400 IOs, 1059738624 bytes 2021-05-20T19:03:19.943 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:2500 IOs, 1099682816 bytes 2021-05-20T19:03:20.188 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:2600 IOs, 1100299264 bytes 2021-05-20T19:04:11.889 INFO:tasks.rbd_mirror_thrash:kill cluster1.client.mirror.0
It should have timed out at 19:03:20 like this:
2021-05-20T19:02:50.218 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:5500 IOs, 2298831872 bytes 2021-05-20T19:02:50.379 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stderr:+ error_code=124 2021-05-20T19:02:50.380 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stderr:+ set -e 2021-05-20T19:02:50.380 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stderr:+ [ 124 -eq 124 ] 2021-05-20T19:02:50.381 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stderr:+ return 0
Updated by Ilya Dryomov almost 3 years ago
- Assignee set to Ilya Dryomov
Updated by Ilya Dryomov almost 3 years ago
This appears to be a bug in bionic's timeout(1)!
ubuntu 26206 0.0 0.0 12048 2612 ? S 13:08 0:00 timeout 5s ceph_test_rbd_mirror_random_write --cluster cluster2 mirror image_15 --debug-rbd=20 --debug-journaler=20 ubuntu 26207 0.0 0.0 0 0 ? Z 13:08 0:01 [ceph_test_rbd_m] <defunct>
ceph_test_rbd_mirror_random_write has gotten timeout's SIGTERM and exited but timeout is forever sleeping in sigsuspend(), apparently having missed SIGCHLD for ceph_test_rbd_mirror_random_write.
Looks exactly like one of the regressions introduced in 8.27:
timeout: fix race possibly terminating wrong process
https://github.com/coreutils/coreutils/commit/2f69dba5df8caaf9eda658c1808b1379e9949f22
and fixed in 8.29:
timeout: fix regression when invoked with blocked SIGCHLD
https://github.com/coreutils/coreutils/commit/cccd01b4dae1061dcad2ae14bea976ed86268d9b
timeout: fix a small race that would ignore command exit
https://github.com/coreutils/coreutils/commit/cbf35912da66a17a6113d5a434214dd7651f403a
bionic ships 8.28 with a few unrelated patches:
$ timeout --version timeout (GNU coreutils) 8.28