Bug #17794
open
realm pull failure in teuthology with Unknown error 2202
0%
Description
2016-11-02T22:53:56.634 INFO:teuthology.orchestra.run.mira016.stderr:2016-11-02 22:53:56.618534 7f6c2aeef880 20 sending request to http://mira112.front.sepia.ceph.com:7280/admin/realm?name=realm0
2016-11-02T22:53:56.635 INFO:teuthology.orchestra.run.mira016.stderr:request failed: (2202) Unknown error 2202
2016-11-02T22:53:56.635 INFO:teuthology.orchestra.run.mira016.stderr:2016-11-02 22:53:56.623529 7f6c2aeef880 0 curl_easy_perform returned status 7 error: Failed to connect to mira112.front.sepia.ceph.com port 7280: Connection refused
Updated by Casey Bodley over 7 years ago
I actually saw this once in my own local testing when I started the gateways under valgrind - valgrind apparently returns control to bash before radosgw finishes starting up.
I pushed a branch to ceph-qa-suite that will retry this command for 30 seconds: https://github.com/ceph/ceph-qa-suite/pull/1250
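For illustration only, the retry approach described above could look roughly like the sketch below. It is not the code from the linked pull request; the helper name `retry_until` and its parameters are hypothetical, and the caller would supply a function that runs the 'realm pull' command and reports whether it succeeded.

```python
import time


def retry_until(fn, timeout=30.0, interval=1.0,
                clock=time.monotonic, sleep=time.sleep):
    """Call fn() until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper, not part of teuthology. Returns the number of
    attempts made; raises TimeoutError if another attempt would not fit
    inside the deadline.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        if fn():
            return attempts
        # Give up if waiting one more interval would pass the deadline.
        if clock() + interval > deadline:
            raise TimeoutError(
                "gave up after %d attempts in %.1fs" % (attempts, timeout))
        sleep(interval)
```

A caller would pass something like a function that runs the admin command and returns True on a zero exit status. Note that a retry like this only helps if the gateway comes up within the timeout.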
Updated by Casey Bodley over 7 years ago
That fix to retry the 'realm pull' does not appear to solve the issue. It reproduced again [1] and all 10 tries failed the same way. Looking for a log of the master's gateway, I found none - though there was a valgrind log [2] complaining about a cryptic InvalidJump. And the teuthology.log also contained:
2016-11-11T19:35:00.531 INFO:teuthology.misc:Shutting down rgw daemons...
2016-11-11T19:35:00.531 DEBUG:tasks.rgw.client.0:waiting for process to exit
2016-11-11T19:35:00.532 INFO:teuthology.orchestra.run:waiting for 300
2016-11-11T19:35:07.467 INFO:tasks.rgw.client.0.mira026.stderr:daemon-helper: command crashed with signal 11
2016-11-11T19:35:12.533 INFO:tasks.rgw.client.0:Stopped
[1] http://qa-proxy.ceph.com/teuthology/cbodley-2016-11-11_12:43:56-rgw-master---basic-mira/540989/teuthology.log
[2] http://qa-proxy.ceph.com/teuthology/cbodley-2016-11-11_12:43:56-rgw-master---basic-mira/540989/remote/mira026/log/valgrind/client.0.log.gz
Updated by Casey Bodley over 7 years ago
So, despite the 'command crashed with signal 11' and valgrind output, I don't think we're actually segfaulting here. I ran this teuthology job with interactive-on-error and found that after a minute or so, the gateway would respond and I could send a successful 'radosgw-admin realm pull' from the other node.
I think the real issue here is in the way we're starting the gateways. When we restart a radosgw instance under valgrind, we're using the --foreground flag. This means it won't return control until radosgw is killed. It also means that the next commands won't wait for radosgw to fork/finish starting up, so we can't avoid races like this.
We need to find a way to run radosgw without --foreground, and make sure that startup completes before continuing on with the rgw task.
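One possible way to "make sure that startup completes" is to poll the gateway's port until it accepts a TCP connection before moving on. The sketch below is an assumption about how that could be done, not the fix that was adopted; `wait_for_port` is a hypothetical helper, and the `connect` parameter exists only so the function can be exercised without a real gateway.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=0.5,
                  connect=socket.create_connection):
    """Return True once host:port accepts a TCP connection.

    Hypothetical helper, not part of teuthology. Returns False if no
    connection succeeds before `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            conn = connect((host, port), timeout=interval)
        except OSError:
            # Connection refused or timed out: the gateway is not up yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
        else:
            conn.close()
            return True
```

For the failure in the description, this would mean calling `wait_for_port("mira112.front.sepia.ceph.com", 7280)` before issuing the 'realm pull'. An accepted connection only shows that the port is listening, so a stricter check might also send an HTTP request.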
Updated by Konstantin Shalygin 20 days ago
- Source changed from other to Community (dev)