Bug #47002

python-rados: connection error

Added by Dimitri Savineau over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Urgent
Assignee: Brad Hubbard
Category: -
Target version: -
% Done: 0%
Regression: Yes
Severity: 2 - major
Pull request ID: 36807

Description

When using ceph and ceph-iscsi from chacra repositories, I am not able to start rbd-target-api anymore (it was working fine a few days ago) in a non-containerized environment.

# rpm -qa | egrep '(ceph|rados)'
libcephfs2-16.0.0-4355.g4fa1eaf.el8.x86_64
python3-cephfs-16.0.0-4355.g4fa1eaf.el8.x86_64
python3-ceph-common-16.0.0-4355.g4fa1eaf.el8.x86_64
ceph-common-16.0.0-4355.g4fa1eaf.el8.x86_64
ceph-iscsi-3.4-27.g6193889.el8.noarch
librados2-16.0.0-4355.g4fa1eaf.el8.x86_64
python3-rados-16.0.0-4355.g4fa1eaf.el8.x86_64
python3-ceph-argparse-16.0.0-4355.g4fa1eaf.el8.x86_64
libradosstriper1-16.0.0-4355.g4fa1eaf.el8.x86_64

The rbd-target-api log output shows a permission denied error when trying to connect to the ceph cluster, even though every requirement is present (ceph config, pool, keyring, etc.).

2020-08-12 17:42:32,786    ERROR [rbd-target-api:2993:<module>()] - Unable to connect to the cluster (keyring missing?) - [errno 13] RADOS permission denied (error connecting to the cluster)
2020-08-12 17:42:32,787 CRITICAL [rbd-target-api:2764:halt()] - Unable to open/read the configuration object

I'm using the ceph admin keyring for this and I'm able to run some ceph commands with that keyring.

# ls -hl /etc/ceph/
total 16K
-rw-------. 1 ceph ceph  151 Aug 12 14:43 ceph.client.admin.keyring
-rw-r--r--. 1 ceph ceph  455 Aug 12 14:43 ceph.conf
-rw-------. 1 root root 1012 Aug 12 16:59 iscsi-gateway.cfg
-rw-r--r--. 1 root root   92 Aug 12 09:14 rbdmap

When trying the cluster connection manually via python (which is what is currently implemented in ceph-iscsi [1]), we can see that the error is coming from rados.pyx [2].

>>> import rados
>>> cluster = rados.Rados(conffile='/etc/ceph/ceph.conf', name='client.admin')
>>> cluster.connect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "rados.pyx", line 1003, in rados.Rados.connect
rados.PermissionDeniedError: [errno 13] RADOS permission denied (error connecting to the cluster)

It's not clear to me what changed in the rados python binding recently that could explain this failure.

When adding the clustername parameter to the Rados class, it works again.

>>> import rados
>>> cluster = rados.Rados(clustername='ceph', conffile='/etc/ceph/ceph.conf', name='client.admin')
>>> cluster.connect()
>>>

[1] https://github.com/ceph/ceph-iscsi/blob/master/ceph_iscsi_config/common.py#L37-L49
[2] https://github.com/ceph/ceph/blob/master/src/pybind/rados/rados.pyx


Related issues: 1 (0 open, 1 closed)

Has duplicate: rgw-testing - Bug #47060: All ragweed tests failing (Resolved, Ali Maredia)

Actions #1

Updated by Neha Ojha over 3 years ago

  • Priority changed from Normal to Urgent

Can you roughly point out a sha1 or a timeline for when you observed this in master? https://github.com/ceph/ceph/pull/33912 is the only recent change I can see in this area that could possibly be responsible. Perhaps you could try reverting 7e4275390c713b8728a3deab7058bf5e21c424cb and see if that helps.

Actions #2

Updated by Dimitri Savineau over 3 years ago

Something around 10-12 August.

When I tried to debug this on August 12th, it was already failing. I can try to find more information in the CI logs.

At the moment I can only tell you that 4fa1eaf is what was tested on August 12th.

Actions #3

Updated by Neha Ojha over 3 years ago

Dimitri Savineau wrote:

> Something around 10-12 August.
>
> When I tried to debug this on August 12th, it was already failing. I can try to find more information in the CI logs.

https://github.com/ceph/ceph/pull/33912 merged in the beginning of July, so if you have a successful run from before then and none after that, it would explain why.

> At the moment I can only tell you that 4fa1eaf is what was tested on August 12th.

Is reverting 7e4275390c713b8728a3deab7058bf5e21c424cb and testing an option for you?

Actions #4

Updated by Dimitri Savineau over 3 years ago

> https://github.com/ceph/ceph/pull/33912 merged in the beginning of July, so if you have a successful run from before then and none after that, it would explain why.

I thought about that one, but the merge date doesn't really match.

This PR was merged on 8 Jul.

I found successful CI jobs until 10 Aug (7ce4f9a4fd).
I also found an older failure (compared to 4fa1eaf) on 11 Aug (740480660a).

> Is reverting 7e4275390c713b8728a3deab7058bf5e21c424cb and testing an option for you?

I can give it a try.

Actions #5

Updated by Kefu Chai over 3 years ago

I don't think 7e4275390c713b8728a3deab7058bf5e21c424cb is the offending change.

And I am able to connect to my vstart cluster using:

$ export PYTHONPATH=/var/ssd/ceph/src/pybind:/var/ssd/ceph/build/lib/cython_modules/lib.3
$ python3
>>> import rados
>>> cluster = rados.Rados(conffile='/var/ssd/ceph/build/ceph.conf', name='client.admin')
>>> cluster.connect()
Actions #6

Updated by Neha Ojha over 3 years ago

  • Status changed from New to Need More Info
Actions #7

Updated by Brad Hubbard over 3 years ago

  • Has duplicate Bug #47060: All ragweed tests failing added
Actions #8

Updated by Brad Hubbard over 3 years ago

  • Assignee set to Brad Hubbard

I can reproduce this and I'm looking into it.

Actions #9

Updated by Brad Hubbard over 3 years ago

  • Status changed from Need More Info to New

So, I can reproduce this on teuthology but not locally. Kefu mentioned that https://github.com/ceph/ceph/pull/36516 may have caused this but it's not at all clear how. I tried to trace this with gdb but it is insanely difficult to follow once you cross the cython boundary. I'll re-engage on this in the morning.

[edit] Meant to mention that we could try backing out 36516 to see if that fixes it (note that a lot of the code in those commits is crimson specific). I might try that tomorrow if no one beats me to it [/edit]

Actions #10

Updated by Brad Hubbard over 3 years ago

There's something quite different about the local environment (vstart cluster) and the teuthology environment. For example:

Teuthology.

$ python -c "import rados;rados = rados.Rados(clustername='ceph', conffile=\"/etc/ceph/ceph.conf\", name='client.admin');rados.connect();print(rados.get_cluster_stats())" 
{'kb': 1132462080, 'kb_used': 13240608, 'kb_avail': 1119221472, 'num_objects': 411}
$ python -c "import rados;rados = rados.Rados(clustername='fred', conffile=\"/etc/ceph/ceph.conf\", name='client.admin');rados.connect();print(rados.get_cluster_stats())" 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "rados.pyx", line 1003, in rados.Rados.connect
rados.ObjectNotFound: [errno 2] RADOS object not found (error connecting to the cluster)

Local.

$ python -c "import rados;rados = rados.Rados(clustername='ceph', conffile=\"/etc/ceph/ceph.conf\", name='client.admin');rados.connect();print(rados.get_cluster_stats())" 
{'kb': 317718504, 'kb_used': 3146232, 'kb_avail': 314572272, 'num_objects': 1}
$ python -c "import rados;rados = rados.Rados(clustername='fred', conffile=\"/etc/ceph/ceph.conf\", name='client.admin');rados.connect();print(rados.get_cluster_stats())"
{'kb': 317718504, 'kb_used': 3146232, 'kb_avail': 314572272, 'num_objects': 1}

So in the teuthology environment the 'clustername' argument is crucial, whereas locally it's ignored. Need to think about what's causing that difference.

Actions #11

Updated by Brad Hubbard over 3 years ago

Progress update.

I eliminated anything to do with python by writing a small C client that just reads the conf file and connects.
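
Something like the following minimal sketch (illustrative only, not the exact client I used; the conf path is the one from this report) is enough to take python out of the picture:

#include <stdio.h>
#include <rados/librados.h>

int main(void)
{
  /* sketch: minimal librados client that reads the conf and connects */
  rados_t cluster;
  int r;

  /* connect as client.admin ("admin" = id without the "client." prefix) */
  r = rados_create(&cluster, "admin");
  if (r < 0) {
    fprintf(stderr, "rados_create failed: %d\n", r);
    return 1;
  }

  r = rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
  if (r < 0) {
    fprintf(stderr, "rados_conf_read_file failed: %d\n", r);
    return 1;
  }

  /* this is where the connection error surfaces */
  r = rados_connect(cluster);
  if (r < 0) {
    fprintf(stderr, "rados_connect failed: %d\n", r);
    return 1;
  }

  printf("connected\n");
  rados_shutdown(cluster);
  return 0;
}

Compile with something like 'gcc client.c -lrados' and run it on an affected node; it fails in rados_connect() the same way the python binding does.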

The actual cause of the failure to connect is that it cannot find a cephx keyring for the client and therefore disables cephx. The ENOENT error comes from the following code in MonClient::init() (src/mon/MonClient.cc).

int MonClient::init()
{
  ldout(cct, 10) << __func__ << dendl;

  entity_name = cct->_conf->name;

  auth_registry.refresh_config();

  std::lock_guard l(monc_lock);
  keyring.reset(new KeyRing);
  if (auth_registry.is_supported_method(messenger->get_mytype(),
                                        CEPH_AUTH_CEPHX)) {
    // this should succeed, because auth_registry just checked!
    int r = keyring->from_ceph_context(cct);
    if (r != 0) {
      // but be somewhat graceful in case there was a race condition
      lderr(cct) << "keyring not found" << dendl;
      return r;
    }
  }
  if (!auth_registry.any_supported_methods(messenger->get_mytype())) {
    return -ENOENT;
  }

The difference between the teuthology test and the local test, at least at one level, is that the local ceph.conf has a 'keyring' line whereas the teuthology ceph.conf does not. This investigation was badly hampered by the fact that I could not get debugging output: anything I did to enable debugging caused the issue to disappear (just your typical heisenbug). Testing with the commits from https://github.com/ceph/ceph/pull/36516 reverted does not resolve the issue.
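
For illustration, an explicit entry like the following in ceph.conf (path hypothetical, not the actual local conf) pins the keyring location and sidesteps the default search list entirely, which would explain why the local setup is immune:

[global]
# example only; the actual local vstart conf differs
keyring = /etc/ceph/ceph.client.admin.keyring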

Still looking for the answer to 'what happened?' but we are closing in.

Actions #12

Updated by Brad Hubbard over 3 years ago

I have a candidate patch for this that I should be able to test tomorrow morning APAC time (it's building now).

Actions #13

Updated by Brad Hubbard over 3 years ago

Ok, I've had a chance to test this. Without my patch, in rados_connect we see the following value for 'keyring', and the connection fails with ENOENT.

Breakpoint 1, librados::v14_2_0::RadosClient::connect (this=0x4bd290) at /home/brad/working/src/ceph/src/librados/RadosClient.cc:212
212     {
(gdb)  p cct->_conf->keyring
$1 = "/etc/ceph/.client.admin.keyring,/etc/ceph/.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin," 

With the patch we get the following and the connection succeeds.

Breakpoint 5, librados::v14_2_0::RadosClient::connect (this=0x4bd290) at /home/brad/working/src/ceph/src/librados/RadosClient.cc:212
212     {
(gdb) p cct->_conf->keyring
$7 = "/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin," 
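
The difference between the two values is just the $cluster metavariable: the default for the 'keyring' option is a template along the lines of /etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, so an empty cluster name expands the first two candidates into hidden dotfiles that never exist. A toy illustration of the expansion (not the actual config code):

#include <stdio.h>

/* expand the assumed default keyring template for a given
   cluster name and entity name */
static void keyring_candidates(const char *cluster, const char *name)
{
  printf("/etc/ceph/%s.%s.keyring,/etc/ceph/%s.keyring,"
         "/etc/ceph/keyring,/etc/ceph/keyring.bin\n",
         cluster, name, cluster);
}

int main(void)
{
  keyring_candidates("ceph", "client.admin"); /* the working value above */
  keyring_candidates("", "client.admin");     /* the broken value: ".client.admin.keyring" */
  return 0;
}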

I've satisfied myself regarding the lingering questions I had about this issue so we are good to go.

Actions #14

Updated by Brad Hubbard over 3 years ago

  • Status changed from New to In Progress
Actions #15

Updated by Brad Hubbard over 3 years ago

  • Pull request ID set to 36807
Actions #17

Updated by Kefu Chai over 3 years ago

  • Status changed from In Progress to Resolved