Project

General

Profile

Actions

Bug #50927

open

rbd_mirror_stress.sh got stuck on ceph_test_rbd_mirror_random_write

Added by Ilya Dryomov almost 3 years ago. Updated almost 3 years ago.

Status:
New
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://qa-proxy.ceph.com/teuthology/yuriw-2021-05-20_14:49:52-rbd-wip-yuri2-testing-2021-05-19-1635-octopus-distro-basic-smithi/6125214/teuthology.log

This is on octopus and not really clear what happened, but since ceph_test_rbd_mirror_random_write has been broken for a while (#49117), noting just in case:

2021-05-20T19:03:15.278 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stderr:+ timeout 5s ceph_test_rbd_mirror_random_write --cluster cluster2 mirror test --debug-rbd=20 --debug-journaler=20
2021-05-20T19:03:15.655 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:100 IOs, 61108224 bytes                                                                          
2021-05-20T19:03:15.823 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:200 IOs, 85588992 bytes                                                                          
2021-05-20T19:03:15.984 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:300 IOs, 116281344 bytes
2021-05-20T19:03:16.111 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:400 IOs, 134831104 bytes                                                                         
2021-05-20T19:03:16.199 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:500 IOs, 153641984 bytes                                                                         
2021-05-20T19:03:16.428 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:600 IOs, 205452288 bytes                                                                         
2021-05-20T19:03:16.582 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:700 IOs, 244076544 bytes                                                                         
2021-05-20T19:03:16.752 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:800 IOs, 278330368 bytes                                                                         
2021-05-20T19:03:16.927 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:900 IOs, 321813504 bytes                                                                         
2021-05-20T19:03:17.050 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1000 IOs, 351192064 bytes                                                                        
2021-05-20T19:03:17.431 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1100 IOs, 480282624 bytes                                                                        
2021-05-20T19:03:17.573 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1200 IOs, 505250816 bytes                                                                        
2021-05-20T19:03:17.745 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1300 IOs, 546650112 bytes                                                                        
2021-05-20T19:03:18.015 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1400 IOs, 628509696 bytes                                                                        
2021-05-20T19:03:18.203 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1500 IOs, 677974016 bytes                                                                        
2021-05-20T19:03:18.306 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1600 IOs, 685206528 bytes                                                                        
2021-05-20T19:03:18.509 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1700 IOs, 760324096 bytes                                                                        
2021-05-20T19:03:18.693 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1800 IOs, 792964096 bytes                                                                        
2021-05-20T19:03:18.841 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:1900 IOs, 821672960 bytes                                                                        
2021-05-20T19:03:18.942 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:2000 IOs, 829261824 bytes                                                                        
2021-05-20T19:03:19.102 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:2100 IOs, 882796544 bytes                                                                        
2021-05-20T19:03:19.424 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:2200 IOs, 975109120 bytes                                                                        
2021-05-20T19:03:19.600 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:2300 IOs, 1020017664 bytes                                                                       
2021-05-20T19:03:19.747 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:2400 IOs, 1059738624 bytes                                                                       
2021-05-20T19:03:19.943 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:2500 IOs, 1099682816 bytes                                                                       
2021-05-20T19:03:20.188 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:2600 IOs, 1100299264 bytes                                                                       
2021-05-20T19:04:11.889 INFO:tasks.rbd_mirror_thrash:kill cluster1.client.mirror.0

It should have timed out at 19:03:20 like this:

2021-05-20T19:02:50.218 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stdout:5500 IOs, 2298831872 bytes
2021-05-20T19:02:50.379 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stderr:+ error_code=124
2021-05-20T19:02:50.380 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stderr:+ set -e
2021-05-20T19:02:50.380 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stderr:+ [ 124 -eq 124 ]
2021-05-20T19:02:50.381 INFO:tasks.workunit.cluster1.client.mirror.smithi081.stderr:+ return 0
Actions #2

Updated by Ilya Dryomov almost 3 years ago

This appears to be a bug in bionic's timeout(1)!

ubuntu     26206  0.0  0.0  12048  2612 ?        S    13:08   0:00 timeout 5s ceph_test_rbd_mirror_random_write --cluster cluster2 mirror image_15 --debug-rbd=20 --debug-journaler=20
ubuntu     26207  0.0  0.0      0     0 ?        Z    13:08   0:01 [ceph_test_rbd_m] <defunct>

ceph_test_rbd_mirror_random_write has gotten timeout's SIGTERM and exited but timeout is forever sleeping in sigsuspend(), apparently having missed SIGCHLD for ceph_test_rbd_mirror_random_write.

Looks exactly like one of the regressions introduced in 8.27:

timeout: fix race possibly terminating wrong process
https://github.com/coreutils/coreutils/commit/2f69dba5df8caaf9eda658c1808b1379e9949f22

and fixed in 8.29:

timeout: fix regression when invoked with blocked SIGCHLD
https://github.com/coreutils/coreutils/commit/cccd01b4dae1061dcad2ae14bea976ed86268d9b

timeout: fix a small race that would ignore command exit
https://github.com/coreutils/coreutils/commit/cbf35912da66a17a6113d5a434214dd7651f403a

bionic ships 8.28 with a few unrelated patches:

$ timeout --version
timeout (GNU coreutils) 8.28
Actions

Also available in: Atom PDF