Bug #17794

realm pull failure in teuthology with Unknown error 2202

Added by Orit Wasserman over 3 years ago. Updated over 3 years ago.

Status: New
Priority: Normal
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

http://qa-proxy.ceph.com/teuthology/owasserm-2016-11-02_22:33:41-rgw-wip-orit-testing---basic-mira/512520/teuthology.log

2016-11-02T22:53:56.634 INFO:teuthology.orchestra.run.mira016.stderr:2016-11-02 22:53:56.618534 7f6c2aeef880 20 sending request to http://mira112.front.sepia.ceph.com:7280/admin/realm?name=realm0
2016-11-02T22:53:56.635 INFO:teuthology.orchestra.run.mira016.stderr:request failed: (2202) Unknown error 2202
2016-11-02T22:53:56.635 INFO:teuthology.orchestra.run.mira016.stderr:2016-11-02 22:53:56.623529 7f6c2aeef880 0 curl_easy_perform returned status 7 error: Failed to connect to mira112.front.sepia.ceph.com port 7280: Connection refused

History

#1 Updated by Casey Bodley over 3 years ago

I actually saw this once in my own local testing when I started the gateways under valgrind - valgrind apparently returns control to bash before radosgw finishes starting up.

I pushed a branch to ceph-qa-suite that will retry this command for 30 seconds: https://github.com/ceph/ceph-qa-suite/pull/1250
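
For illustration only, a retry of this kind could look roughly like the sketch below. It is not the code from the pull request: the helper name retry_until, the 3-second interval, and the injectable sleep/clock parameters are all assumptions made for the example.

```python
import time

def retry_until(action, timeout=30.0, interval=3.0,
                sleep=time.sleep, clock=time.monotonic):
    """Call action() until it returns True or `timeout` seconds run out.

    Hypothetical helper: `action` stands in for something like running
    'radosgw-admin realm pull' and reporting whether it succeeded.
    Returns the number of attempts made on success.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        if action():
            return attempts
        # Give up if another sleep would take us past the deadline.
        if clock() + interval > deadline:
            raise RuntimeError(
                "still failing after %d attempts" % attempts)
        sleep(interval)
```

Here action would wrap the remote command and return True on a zero exit status. A retry like this only helps if the gateway comes up within the timeout.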

#2 Updated by Casey Bodley over 3 years ago

That fix to retry the 'realm pull' does not appear to solve the issue. It reproduced again [1] and all 10 tries failed the same way. Looking for a log of the master's gateway, I found none - though there was a valgrind log [2] complaining about a cryptic InvalidJump. And the teuthology.log also contained:

2016-11-11T19:35:00.531 INFO:teuthology.misc:Shutting down rgw daemons...
2016-11-11T19:35:00.531 DEBUG:tasks.rgw.client.0:waiting for process to exit
2016-11-11T19:35:00.532 INFO:teuthology.orchestra.run:waiting for 300
2016-11-11T19:35:07.467 INFO:tasks.rgw.client.0.mira026.stderr:daemon-helper: command crashed with signal 11
2016-11-11T19:35:12.533 INFO:tasks.rgw.client.0:Stopped

[1] http://qa-proxy.ceph.com/teuthology/cbodley-2016-11-11_12:43:56-rgw-master---basic-mira/540989/teuthology.log
[2] http://qa-proxy.ceph.com/teuthology/cbodley-2016-11-11_12:43:56-rgw-master---basic-mira/540989/remote/mira026/log/valgrind/client.0.log.gz

#3 Updated by Casey Bodley over 3 years ago

So, despite the 'command crashed with signal 11' and valgrind output, I don't think we're actually segfaulting here. I ran this teuthology job with interactive-on-error and found that after a minute or so, the gateway would respond and I could send a successful 'radosgw-admin realm pull' from the other node.

I think the real issue here is in the way we're starting the gateways. When we restart a radosgw instance under valgrind, we're using the --foreground flag, so the process won't return control until radosgw is killed. As a result, the commands that follow have no way to wait for radosgw to fork and finish starting up, and we can't avoid races like this.

We need to find a way to run radosgw without --foreground, and make sure that startup completes before continuing on with the rgw task.
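
One possible way to check that startup has completed, sketched here under assumptions, is to poll the gateway's port until it accepts TCP connections before moving on. The function name wait_for_port and the default timeout are hypothetical. This does not address running without --foreground, and an open port only shows that the frontend is listening, not that every subsystem is ready.

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once host:port accepts a TCP connection.

    Returns False if no connection succeeds before `timeout` seconds
    have passed. Hypothetical helper, not part of the rgw task.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A refused connection raises ConnectionRefusedError,
            # which is a subclass of OSError.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the failure in the description, the call would be something like wait_for_port("mira112.front.sepia.ceph.com", 7280) before issuing the 'realm pull'.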
