realm pull failure in teuthology with Unknown error 2202
2016-11-02T22:53:56.634 INFO:teuthology.orchestra.run.mira016.stderr:2016-11-02 22:53:56.618534 7f6c2aeef880 20 sending request to http://mira112.front.sepia.ceph.com:7280/admin/realm?name=realm0
2016-11-02T22:53:56.635 INFO:teuthology.orchestra.run.mira016.stderr:request failed: (2202) Unknown error 2202
2016-11-02T22:53:56.635 INFO:teuthology.orchestra.run.mira016.stderr:2016-11-02 22:53:56.623529 7f6c2aeef880 0 curl_easy_perform returned status 7 error: Failed to connect to mira112.front.sepia.ceph.com port 7280: Connection refused
#1 Updated by Casey Bodley over 3 years ago
I actually saw this once in my own local testing when I started the gateways under valgrind - valgrind apparently returns control to bash before radosgw finishes starting up.
I pushed a branch to ceph-qa-suite that will retry this command for 30 seconds: https://github.com/ceph/ceph-qa-suite/pull/1250
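The retry amounts to re-running the command until it succeeds or a deadline passes. A minimal sketch of that idea, not the code from the pull request: `retry_until` and its parameters are hypothetical, the 3-second interval is an assumption, and `run_cmd` stands in for whatever runs 'radosgw-admin realm pull' and returns its exit status.

```python
import time


def retry_until(run_cmd, timeout=30.0, interval=3.0, sleep=time.sleep):
    """Call run_cmd() until it returns 0 or `timeout` seconds have elapsed.

    Returns the exit status of the last attempt.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = run_cmd()
        # Stop on success, or once the deadline has passed.
        if status == 0 or time.monotonic() >= deadline:
            return status
        sleep(interval)
```

Retrying only helps if the gateway comes up within the window; a slower startup would still exhaust every attempt.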
#2 Updated by Casey Bodley over 3 years ago
That fix to retry the 'realm pull' does not appear to solve the issue. It reproduced again, and all 10 tries failed the same way. Looking for a log of the master's gateway, I found none, though there was a valgrind log complaining about a cryptic InvalidJump. The teuthology.log also contained:
2016-11-11T19:35:00.531 INFO:teuthology.misc:Shutting down rgw daemons...
2016-11-11T19:35:00.531 DEBUG:tasks.rgw.client.0:waiting for process to exit
2016-11-11T19:35:00.532 INFO:teuthology.orchestra.run:waiting for 300
2016-11-11T19:35:07.467 INFO:tasks.rgw.client.0.mira026.stderr:daemon-helper: command crashed with signal 11
2016-11-11T19:35:12.533 INFO:tasks.rgw.client.0:Stopped
#3 Updated by Casey Bodley over 3 years ago
So, despite the 'command crashed with signal 11' and valgrind output, I don't think we're actually segfaulting here. I ran this teuthology job with interactive-on-error and found that after a minute or so, the gateway would respond and I could send a successful 'radosgw-admin realm pull' from the other node.
I think the real issue here is in the way we're starting the gateways. When we restart a radosgw instance under valgrind, we're using the --foreground flag. This means it won't return control until radosgw is killed. It also means that the next commands won't wait for radosgw to fork/finish starting up, so we can't avoid races like this.
We need to find a way to run radosgw without --foreground, and make sure that startup completes before continuing on with the rgw task.
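One way to make sure startup has completed, sketched here under assumptions: after launching the gateway, poll its TCP port until a connection succeeds, and only then continue. The helper `wait_for_port`, its timeout, and its interval are hypothetical and not part of the rgw task. The sketch targets Python 3.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connect to (host, port) succeeds.

    Returns False if no connection succeeds within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Refused connections, timeouts, and DNS errors are all OSError.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For the failure above, the check would be something like `wait_for_port('mira112.front.sepia.ceph.com', 7280)` before sending the 'realm pull'. An accepted connection only shows that the frontend is listening; it does not prove the gateway is fully initialized.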