Bug #17794: realm pull failure in teuthology with Unknown error 2202 - rgw - Ceph

Bug #17794

open

realm pull failure in teuthology with Unknown error 2202

Added by Orit Wasserman over 7 years ago. Updated 20 days ago.

Status:

New

Priority:

Normal

Assignee:

Orit Wasserman

Source:

Community (dev)

Severity:

3 - minor

Description

http://qa-proxy.ceph.com/teuthology/owasserm-2016-11-02_22:33:41-rgw-wip-orit-testing---basic-mira/512520/teuthology.log

2016-11-02T22:53:56.634 INFO:teuthology.orchestra.run.mira016.stderr:2016-11-02 22:53:56.618534 7f6c2aeef880 20 sending request to http://mira112.front.sepia.ceph.com:7280/admin/realm?name=realm0
2016-11-02T22:53:56.635 INFO:teuthology.orchestra.run.mira016.stderr:request failed: (2202) Unknown error 2202
2016-11-02T22:53:56.635 INFO:teuthology.orchestra.run.mira016.stderr:2016-11-02 22:53:56.623529 7f6c2aeef880 0 curl_easy_perform returned status 7 error: Failed to connect to mira112.front.sepia.ceph.com port 7280: Connection refused

Updated by Casey Bodley over 7 years ago

I actually saw this once in my own local testing when I started the gateways under valgrind - valgrind apparently returns control to bash before radosgw finishes starting up.

I pushed a branch to ceph-qa-suite that will retry this command for 30 seconds: https://github.com/ceph/ceph-qa-suite/pull/1250
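
For illustration, a retry of this kind can be sketched in Python roughly as below. This is not the code from the pull request above; `retry_command` and its parameters are hypothetical names, and the defaults of 10 tries spaced 3 seconds apart are only an assumption chosen to add up to about 30 seconds.

```python
import subprocess
import time


def retry_command(args, tries=10, delay=3.0, run=subprocess.run, sleep=time.sleep):
    """Run `args` until it exits with status 0.

    Makes at most `tries` attempts, sleeping `delay` seconds between them.
    Returns True as soon as one attempt succeeds, False if all of them fail.
    `run` and `sleep` are injectable so the helper can be exercised without
    spawning a real process (hypothetical helper, not teuthology API).
    """
    for attempt in range(1, tries + 1):
        result = run(args)
        if result.returncode == 0:
            return True
        # Only sleep if another attempt is still coming.
        if attempt < tries:
            sleep(delay)
    return False
```

A call might look like `retry_command(["radosgw-admin", "realm", "pull", "--url", "http://mira112.front.sepia.ceph.com:7280", "--access-key", ACCESS_KEY, "--secret", SECRET])`, where `ACCESS_KEY` and `SECRET` are placeholders. A bounded retry like this only helps if the gateway comes up within the retry window.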

Updated by Casey Bodley over 7 years ago

That fix to retry the 'realm pull' does not appear to solve the issue. It reproduced again [1] and all 10 tries failed the same way. Looking for a log of the master's gateway, I found none - though there was a valgrind log [2] complaining about a cryptic InvalidJump. And the teuthology.log also contained:

2016-11-11T19:35:00.531 INFO:teuthology.misc:Shutting down rgw daemons...
2016-11-11T19:35:00.531 DEBUG:tasks.rgw.client.0:waiting for process to exit
2016-11-11T19:35:00.532 INFO:teuthology.orchestra.run:waiting for 300
2016-11-11T19:35:07.467 INFO:tasks.rgw.client.0.mira026.stderr:daemon-helper: command crashed with signal 11
2016-11-11T19:35:12.533 INFO:tasks.rgw.client.0:Stopped

[1] http://qa-proxy.ceph.com/teuthology/cbodley-2016-11-11_12:43:56-rgw-master---basic-mira/540989/teuthology.log
[2] http://qa-proxy.ceph.com/teuthology/cbodley-2016-11-11_12:43:56-rgw-master---basic-mira/540989/remote/mira026/log/valgrind/client.0.log.gz

Updated by Casey Bodley over 7 years ago

So, despite the 'command crashed with signal 11' and valgrind output, I don't think we're actually segfaulting here. I ran this teuthology job with interactive-on-error and found that after a minute or so, the gateway would respond and I could send a successful 'radosgw-admin realm pull' from the other node.

I think the real issue here is in the way we're starting the gateways. When we restart a radosgw instance under valgrind, we're using the --foreground flag, so the process won't return control until radosgw is killed. That also means the commands that follow have no way to wait for radosgw to fork/finish starting up, so we can't avoid races like this.

We need to find a way to run radosgw without --foreground, and make sure that startup completes before continuing on with the rgw task.
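
One possible way to check that startup has completed, independent of how radosgw is launched, is to poll the gateway's port until it accepts TCP connections. The sketch below is only an assumption about what such a check could look like, not something proposed in this ticket; `wait_for_port` and its defaults are hypothetical, and an accepted connection only shows that the frontend is listening, not that the gateway is fully initialized.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True once a connection is accepted, or False if `timeout`
    seconds pass without one. Each attempt waits at most `interval`
    seconds, and failed attempts are spaced `interval` seconds apart.
    (Hypothetical helper, not part of teuthology.)
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A refused connection raises ConnectionRefusedError, an OSError.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the failure in the description this would be something like `wait_for_port("mira112.front.sepia.ceph.com", 7280, timeout=120)` before running 'realm pull'. Since the gateway took a minute or so to respond under valgrind, the timeout would need to be generous.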

Updated by Konstantin Shalygin 20 days ago

Source changed from other to Community (dev)
