Bug #8566

Bug #8565: Calamari Install hangs forever when ceph is not there.

Calamari Installation -- Asks for ceph installation when ceph is already there.

Added by Warren Usui about 7 years ago. Updated about 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Backend (services)
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

After encountering 8565, I closed the browser and installed Ceph manually.

I brought the browser back up, and the message that I see is:

New Calamari Installation

This appears to be the first time you have started Calamari and there are no clusters currently configured.

1 Ceph servers are connected to Calamari, but no Ceph cluster has been created yet. Please use ceph-deploy to create a cluster; please see the Inktank Ceph Enterprise documentation for more details.

8565 apparently connected the server, even though it looked like things hung.

At this point, it seems like this message should not be displayed because Ceph is installed on that system.

Associated revisions

Revision b7245cd2 (diff)
Added by John Spray about 7 years ago

salt: load ceph modules at function scope

This is what already happened until the timeout
stuff in 58242677e. Imports were moved up to
module scope for defining exception classes, result
was that whole module needed reloading on transition
from ceph not installed to ceph installed.

Revert to importing at function scope, and make our
exceptions not inherit from rados.Error so that they
don't care.

Fixes: #8566

Signed-off-by: John Spray <>
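
As a rough illustration of the approach described in the commit message above (not the actual Calamari code), the sketch below defines an exception class that inherits from plain Exception, so that defining it needs no Ceph import, and defers the import of the bindings to function scope. The names ClusterQueryError and load_bindings are hypothetical; only the module name "rados" comes from this ticket.

```python
import importlib


class ClusterQueryError(Exception):
    """Hypothetical module-level exception.

    It derives from Exception rather than from a class in the Ceph
    bindings, so this module can be loaded before Ceph is installed.
    """


def load_bindings(module_name="rados"):
    """Import the bindings at call time instead of at module load time."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Report the failure with our own exception type; a later call
        # will simply try the import again.
        raise ClusterQueryError("bindings unavailable: %s" % exc)
```

With this layout, a process that loaded the module before the bindings existed does not need the whole module reloaded once they are installed; the next call to load_bindings() just succeeds.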

History

#1 Updated by Yan-Fa Li about 7 years ago

Warren, what does "I installed ceph manually" mean? Can you write down the exact steps? Are you using ceph-deploy? Is it installing salt minions? There is currently no way for Calamari to discover an existing cluster automagically. At least I'm not aware of one, John may well have written it and I've just never used it.

@John, how do we bootstrap calamari onto an existing cluster? I don't believe I've ever tested that.

Thanks

Yan

#2 Updated by Dan Mick about 7 years ago

As in #8565, it doesn't make sense to install the minions on hosts that aren't part of a cluster already. My guess would be that if the salt-minion daemons had been restarted, they would have communicated with the calamari master and been recognized, but, it's an order of installation we're not really trying to handle.

#3 Updated by John Spray about 7 years ago

The key missing piece of information in this report is what "Ceph is installed" means -- just the packages, or is there actually a Ceph cluster up and running?

In the former case, this is not a bug: the UI is being reasonable in not proceeding until it can see a Ceph cluster. In the latter case, something is wrong.

Commenting separately on #8565 about whether it's legitimate to add servers before they have ceph clusters on.

#4 Updated by John Spray about 7 years ago

Yan-Fa Li wrote:

There is currently no way for Calamari to discover an existing cluster automagically. At least I'm not aware of one, John may well have written it and I've just never used it.

@John, how do we bootstrap calamari onto an existing cluster? I don't believe I've ever tested that.

Connecting servers already running a Ceph cluster to Calamari is the typical case.

I think we're having some language problems when talking about this stuff: what things like "installed", "bootstrap", "automagically" and even "cluster" mean in practice. I'm going to put some words here in the hope of getting a clearer common vocabulary.

Servers are not detected: they are connected explicitly to calamari by the administrator using "ceph-deploy calamari connect <server>". Subsequently the administrator may authorize these servers using the Calamari UI.

Once mon servers are connected and authorized, 0 or more clusters (where cluster means one ceph FSID) will be detected within a few seconds. This part is automatic: once some mon servers from a cluster are connected and authorized, the cluster will appear in the /api/v2/cluster resource.

Calamari (the backend) doesn't care whether a Ceph cluster is already present at the point where a server is connected, or if it is added later. It will detect clusters whenever they appear.
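
Since detection is described as automatic, a client that wants to wait for a cluster could poll rather than assume it is already there. The sketch below is only an illustration under assumptions: fetch_clusters is a hypothetical callable standing in for whatever reads the /api/v2/cluster resource, and it is assumed to return a list that is empty until a cluster has been detected. The shape of the real API response is not shown in this ticket.

```python
import time


def wait_for_clusters(fetch_clusters, timeout=60.0, interval=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until fetch_clusters() returns a non-empty list or time runs out.

    fetch_clusters is a hypothetical callable supplied by the caller.
    Returns the first non-empty result, or [] on timeout.
    """
    deadline = clock() + timeout
    while True:
        clusters = fetch_clusters()
        if clusters:
            return clusters
        if clock() >= deadline:
            return []
        sleep(interval)
```

The sleep and clock parameters exist only so the loop can be exercised without real delays.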

We used to talk about "bootstrapping" Calamari onto Ceph servers, using the mechanism of running a single command from a Ceph server that downloaded a script that proceeded to download needed packages and set things up. That mechanism still exists, but it is replaced in the ICE process by the "ceph-deploy calamari connect" mechanism.

#5 Updated by Warren Usui about 7 years ago

In the previous episode, here's what I did:

1. Reimaged two vms

2. Established an ssh connection between the two vms.

3. Ran ice_setup.py on machine A

4. Ran calamari-ctl initialize on machine A.

5. Started a browser and went to machine A.

6. Ran ceph-deploy calamari connect

7. Clicked the ADD button on the browser to add the host to be recognized by Calamari.

8. Went out to get some coffee.

9. Came back and ^C'ed out of the screen that said Accept Request sent (the green button in the lower right of the screen did not work).

The bug being reported here was that the Accept Request sent window took several minutes and still did not indicate completion.

Now continuing the story...

1. I exit the browser.

2. I run the following on A:

ceph-deploy new B
ceph-deploy install B
ceph-deploy mon create B
ceph-deploy gatherkeys B
ceph-deploy osd prepare B:/tmp/x
ceph-deploy osd prepare B:/tmp/y
ceph-deploy osd activate B:/tmp/x
ceph-deploy osd activate B:/tmp/y
ceph-deploy mds create B

Now I start the browser again, go to A, and get:

New Calamari Installation

This appears to be the first time you have started Calamari and there are no clusters currently configured.

1 Ceph servers are connected to Calamari, but no Ceph cluster has been created yet. Please use ceph-deploy to create a cluster; please see the Inktank Ceph Enterprise documentation for more details.

#6 Updated by Warren Usui about 7 years ago

In the previous episode, here's what I did:

1. Reimaged two vms

2. Established an ssh connection between the two vms.

3. Ran ice_setup.py on machine A

4. Ran calamari-ctl initialize on machine A.

5. Started a browser and went to machine A.

6. Ran ceph-deploy calamari connect

7. Clicked the ADD button on the browser to add the host to be recognized by Calamari.

8. Went out to get some coffee.

9. Came back and ^C'ed out of the screen that said Accept Request sent (the green button in the lower right of the screen did not work).

Now continuing the story...

1. I exit the browser.

2. I run the following on A:

ceph-deploy new B
ceph-deploy install B
ceph-deploy mon create B
ceph-deploy gatherkeys B
ceph-deploy osd prepare B:/tmp/x
ceph-deploy osd prepare B:/tmp/y
ceph-deploy osd activate B:/tmp/x
ceph-deploy osd activate B:/tmp/y
ceph-deploy mds create B

Now I start the browser again, go to A, and get:

New Calamari Installation

This appears to be the first time you have started Calamari and there are no clusters currently configured.

1 Ceph servers are connected to Calamari, but no Ceph cluster has been created yet. Please use ceph-deploy to create a cluster; please see the Inktank Ceph Enterprise documentation for more details.

So it seems like ceph is installed on B, but the browser says that it is not.

#7 Updated by Yan-Fa Li about 7 years ago

So the thing I'm not sure of, and I'm not sure what the docs say about it, is whether you have to kick the salt-minions after you have installed ceph, or whether they can automatically rediscover the host. @JohnW and @JohnS, what have we documented, and what is the expected behavior of:

1. connect calamari first to servers
2. install ceph afterwards

Do the minions on the newly minted ceph nodes periodically check for a new ceph install on their respective hosts?

What is the expected behavior in this use case?

#8 Updated by John Spray about 7 years ago

Yan-Fa Li wrote:

Do the minions on the newly minted ceph nodes periodically check for a new ceph install on their respective hosts?

What is the expected behavior in this use case?

Yes, they are checking every 10 seconds. They will detect the new cluster immediately after it's created.

If that's not happening, from server B:
  • Check that "ceph status" is working
  • Grab the output of "salt-call ceph.get_heartbeats"

And, as usual, look through the logs for any errors etc.
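
The two checks above could be gathered in one go with something like the following sketch. The helper name run_checks and its return format are made up for illustration; only the two commands come from this comment, and they may need to be run with root privileges.

```python
import subprocess

# The two commands suggested above; run them on the Ceph server.
CHECKS = (["ceph", "status"], ["salt-call", "ceph.get_heartbeats"])


def run_checks(commands=CHECKS, run=subprocess.run):
    """Run each command and collect (return code, combined output).

    A command that cannot be started at all (for example because it is
    not installed) is recorded as (None, error text).
    """
    results = {}
    for argv in commands:
        key = " ".join(argv)
        try:
            proc = run(argv, stdout=subprocess.PIPE,
                       stderr=subprocess.STDOUT, universal_newlines=True)
            results[key] = (proc.returncode, proc.stdout)
        except OSError as exc:
            results[key] = (None, str(exc))
    return results
```

Printing the returned dictionary gives output that can be pasted into a ticket alongside the logs.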

#9 Updated by Warren Usui about 7 years ago

Ceph Status at this point:

sudo ceph status
    cluster 41b672d7-6b4d-47d4-9a8d-1ff40d2961dd
     health HEALTH_WARN 104 pgs degraded; 88 pgs incomplete; 88 pgs stuck inactive; 192 pgs stuck unclean; recovery 24/36 objects degraded (66.667%)
     monmap e1: 1 mons at {vpm017=10.214.138.72:6789/0}, election epoch 2, quorum 0 vpm017
     mdsmap e3: 1/1/1 up {0=vpm017=up:creating}
     osdmap e10: 2 osds: 2 up, 2 in
      pgmap v17: 192 pgs, 3 pools, 1090 bytes data, 12 objects
            23873 MB used, 171 GB / 196 GB avail
            24/36 objects degraded (66.667%)
                 104 active+degraded+remapped
                  88 incomplete
salt-call ceph.get_heartbeats displayed:
[INFO    ] Executing command 'repoquery --queryformat="%{NAME}_|-%{VERSION}_|-%{RELEASE}_|-%{ARCH}_|-%{REPOID}" --all --pkgnarrow=installed' in directory '/root'
local:
    ----------
    - boot_time:
        1402447428
    - ceph_version:
        0.81-0.el6
    - services:
        ----------
        ceph-mds.vpm017:
            ----------
            cluster:
                ceph
            fsid:
                41b672d7-6b4d-47d4-9a8d-1ff40d2961dd
            id:
                vpm017
            status:
                None
            type:
                mds
            version:
                0.81
        ceph-mon.vpm017:
            ----------
            cluster:
                ceph
            fsid:
                41b672d7-6b4d-47d4-9a8d-1ff40d2961dd
            id:
                vpm017
            status:
                ----------
                election_epoch:
                    2
                extra_probe_peers:
                monmap:
                    ----------
                    created:
                        0.000000
                    epoch:
                        1
                    fsid:
                        41b672d7-6b4d-47d4-9a8d-1ff40d2961dd
                    modified:
                        0.000000
                    mons:
                        ----------
                        - addr:
                            10.214.138.72:6789/0
                        - name:
                            vpm017
                        - rank:
                            0
                name:
                    vpm017
                outside_quorum:
                quorum:
                    - 0
                rank:
                    0
                state:
                    leader
                sync_provider:
            type:
                mon
            version:
                0.81
        ceph-osd.0:
            ----------
            cluster:
                ceph
            fsid:
                41b672d7-6b4d-47d4-9a8d-1ff40d2961dd
            id:
                0
            status:
                None
            type:
                osd
            version:
                0.81
        ceph-osd.1:
            ----------
            cluster:
                ceph
            fsid:
                41b672d7-6b4d-47d4-9a8d-1ff40d2961dd
            id:
                1
            status:
                None
            type:
                osd
            version:
                0.81
    ----------
    - 41b672d7-6b4d-47d4-9a8d-1ff40d2961dd:
        ----------
        fsid:
            41b672d7-6b4d-47d4-9a8d-1ff40d2961dd
        name:
            ceph
        versions:
            ----------
            config:
                63e00d5933fd064daa50e3f79dba0819
            health:
                577d9e00416be5a3ff90e5ae0fc0fc56
            mds_map:
                3
            mon_map:
                1
            mon_status:
                2
            osd_map:
                10
            pg_summary:
                3762226bd4cc305c4afe0660d0d20484

There were no errors in ceph.log or cluster/ceph.log.

/var/log/calamari/cthulhu.log was:

2014-06-10 21:43:33,215 - ERROR - cthulhu Recovery failed
Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.6/site-packages/calamari_cthulhu-0.1-py2.6.egg/cthulhu/manager/manager.py", line 257, in start
    self._recover()
  File "/opt/calamari/venv/lib/python2.6/site-packages/calamari_cthulhu-0.1-py2.6.egg/cthulhu/manager/manager.py", line 189, in _recover
    for server in session.query(Server).all():
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2241, in all
    return list(self)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2353, in __iter__
    return self._execute_and_instances(context)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2366, in _execute_and_instances
    close_with_result=True)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2357, in _connection_from_session
    **kw)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 799, in connection
    close_with_result=close_with_result)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 803, in _connection_for_bind
    return self.transaction._connection_for_bind(engine)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 299, in _connection_for_bind
    conn = bind.contextual_connect()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1661, in contextual_connect
    self.pool.connect(),
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 272, in connect
    return _ConnectionFairy(self).checkout()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 431, in __init__
    rec = self._connection_record = pool._do_get()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 788, in _do_get
    con = self._create_connection()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 225, in _create_connection
    return _ConnectionRecord(self)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 318, in __init__
    self.connection = self.__connect()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 379, in __connect
    connection = self.__pool._creator()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/engine/strategies.py", line 80, in connect
    return dialect.connect(*cargs, **cparams)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/engine/default.py", line 283, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/opt/calamari/venv/lib/python2.6/site-packages/psycopg2/__init__.py", line 164, in connect
    conn = _connect(dsn, connection_factory=connection_factory, async=async)
  File "/opt/calamari/venv/lib/python2.6/site-packages/psycogreen/gevent.py", line 29, in gevent_wait_callback
    state = conn.poll()
OperationalError: (OperationalError) could not connect to server: Connection refused
    Is the server running on host "localhost" and accepting
    TCP/IP connections on port 5432?
 None None
2014-06-10 21:43:34,863 - ERROR - cthulhu Recovery failed
Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.6/site-packages/calamari_cthulhu-0.1-py2.6.egg/cthulhu/manager/manager.py", line 257, in start
    self._recover()
  File "/opt/calamari/venv/lib/python2.6/site-packages/calamari_cthulhu-0.1-py2.6.egg/cthulhu/manager/manager.py", line 189, in _recover
    for server in session.query(Server).all():
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2241, in all
    return list(self)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2353, in __iter__
    return self._execute_and_instances(context)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2366, in _execute_and_instances
    close_with_result=True)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2357, in _connection_from_session
    **kw)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 799, in connection
    close_with_result=close_with_result)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 803, in _connection_for_bind
    return self.transaction._connection_for_bind(engine)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 299, in _connection_for_bind
    conn = bind.contextual_connect()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1661, in contextual_connect
    self.pool.connect(),
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 272, in connect
    return _ConnectionFairy(self).checkout()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 431, in __init__
    rec = self._connection_record = pool._do_get()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 788, in _do_get
    con = self._create_connection()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 225, in _create_connection
    return _ConnectionRecord(self)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 318, in __init__
    self.connection = self.__connect()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 379, in __connect
    connection = self.__pool._creator()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/engine/strategies.py", line 80, in connect
    return dialect.connect(*cargs, **cparams)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/engine/default.py", line 283, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/opt/calamari/venv/lib/python2.6/site-packages/psycopg2/__init__.py", line 164, in connect
    conn = _connect(dsn, connection_factory=connection_factory, async=async)
  File "/opt/calamari/venv/lib/python2.6/site-packages/psycogreen/gevent.py", line 29, in gevent_wait_callback
    state = conn.poll()
OperationalError: (OperationalError) could not connect to server: Connection refused
    Is the server running on host "localhost" and accepting
    TCP/IP connections on port 5432?
 None None
2014-06-10 21:43:36,508 - ERROR - cthulhu Recovery failed
Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.6/site-packages/calamari_cthulhu-0.1-py2.6.egg/cthulhu/manager/manager.py", line 257, in start
    self._recover()
  File "/opt/calamari/venv/lib/python2.6/site-packages/calamari_cthulhu-0.1-py2.6.egg/cthulhu/manager/manager.py", line 189, in _recover
    for server in session.query(Server).all():
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2241, in all
    return list(self)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2353, in __iter__
    return self._execute_and_instances(context)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2366, in _execute_and_instances
    close_with_result=True)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2357, in _connection_from_session
    **kw)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 799, in connection
    close_with_result=close_with_result)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 803, in _connection_for_bind
    return self.transaction._connection_for_bind(engine)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 299, in _connection_for_bind
    conn = bind.contextual_connect()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1661, in contextual_connect
    self.pool.connect(),
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 272, in connect
    return _ConnectionFairy(self).checkout()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 431, in __init__
    rec = self._connection_record = pool._do_get()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 788, in _do_get
    con = self._create_connection()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 225, in _create_connection
    return _ConnectionRecord(self)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 318, in __init__
    self.connection = self.__connect()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 379, in __connect
    connection = self.__pool._creator()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/engine/strategies.py", line 80, in connect
    return dialect.connect(*cargs, **cparams)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/engine/default.py", line 283, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/opt/calamari/venv/lib/python2.6/site-packages/psycopg2/__init__.py", line 164, in connect
    conn = _connect(dsn, connection_factory=connection_factory, async=async)
  File "/opt/calamari/venv/lib/python2.6/site-packages/psycogreen/gevent.py", line 29, in gevent_wait_callback
    state = conn.poll()
OperationalError: (OperationalError) could not connect to server: Connection refused
    Is the server running on host "localhost" and accepting
    TCP/IP connections on port 5432?
 None None
2014-06-10 21:43:39,157 - ERROR - cthulhu Recovery failed
Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.6/site-packages/calamari_cthulhu-0.1-py2.6.egg/cthulhu/manager/manager.py", line 257, in start
    self._recover()
  File "/opt/calamari/venv/lib/python2.6/site-packages/calamari_cthulhu-0.1-py2.6.egg/cthulhu/manager/manager.py", line 189, in _recover
    for server in session.query(Server).all():
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2241, in all
    return list(self)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2353, in __iter__
    return self._execute_and_instances(context)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2366, in _execute_and_instances
    close_with_result=True)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2357, in _connection_from_session
    **kw)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 799, in connection
    close_with_result=close_with_result)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 803, in _connection_for_bind
    return self.transaction._connection_for_bind(engine)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 299, in _connection_for_bind
    conn = bind.contextual_connect()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1661, in contextual_connect
    self.pool.connect(),
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 272, in connect
    return _ConnectionFairy(self).checkout()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 431, in __init__
    rec = self._connection_record = pool._do_get()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 788, in _do_get
    con = self._create_connection()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 225, in _create_connection
    return _ConnectionRecord(self)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 318, in __init__
    self.connection = self.__connect()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 379, in __connect
    connection = self.__pool._creator()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/engine/strategies.py", line 80, in connect
    return dialect.connect(*cargs, **cparams)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/engine/default.py", line 283, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/opt/calamari/venv/lib/python2.6/site-packages/psycopg2/__init__.py", line 164, in connect
    conn = _connect(dsn, connection_factory=connection_factory, async=async)
  File "/opt/calamari/venv/lib/python2.6/site-packages/psycogreen/gevent.py", line 29, in gevent_wait_callback
    state = conn.poll()
OperationalError: (OperationalError) could not connect to server: Connection refused
    Is the server running on host "localhost" and accepting
    TCP/IP connections on port 5432?
 None None
2014-06-10 21:43:42,806 - ERROR - cthulhu Recovery failed
Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.6/site-packages/calamari_cthulhu-0.1-py2.6.egg/cthulhu/manager/manager.py", line 257, in start
    self._recover()
  File "/opt/calamari/venv/lib/python2.6/site-packages/calamari_cthulhu-0.1-py2.6.egg/cthulhu/manager/manager.py", line 189, in _recover
    for server in session.query(Server).all():
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2241, in all
    return list(self)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2353, in __iter__
    return self._execute_and_instances(context)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2366, in _execute_and_instances
    close_with_result=True)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/query.py", line 2357, in _connection_from_session
    **kw)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 799, in connection
    close_with_result=close_with_result)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 803, in _connection_for_bind
    return self.transaction._connection_for_bind(engine)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/orm/session.py", line 299, in _connection_for_bind
    conn = bind.contextual_connect()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1661, in contextual_connect
    self.pool.connect(),
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 272, in connect
    return _ConnectionFairy(self).checkout()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 431, in __init__
    rec = self._connection_record = pool._do_get()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 788, in _do_get
    con = self._create_connection()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 225, in _create_connection
    return _ConnectionRecord(self)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 318, in __init__
    self.connection = self.__connect()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/pool.py", line 379, in __connect
    connection = self.__pool._creator()
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/engine/strategies.py", line 80, in connect
    return dialect.connect(*cargs, **cparams)
  File "/opt/calamari/venv/lib/python2.6/site-packages/sqlalchemy/engine/default.py", line 283, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/opt/calamari/venv/lib/python2.6/site-packages/psycopg2/__init__.py", line 164, in connect
    conn = _connect(dsn, connection_factory=connection_factory, async=async)
  File "/opt/calamari/venv/lib/python2.6/site-packages/psycogreen/gevent.py", line 29, in gevent_wait_callback
    state = conn.poll()
OperationalError: (OperationalError) could not connect to server: Connection refused
    Is the server running on host "localhost" and accepting
    TCP/IP connections on port 5432?
 None None
2014-06-10 22:03:05,552 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:03:05,607 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:03:30,559 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:03:30,613 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:03:55,564 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:03:55,619 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:04:20,570 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:04:20,624 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:04:45,574 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:04:45,628 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:05:10,581 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:05:10,635 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:05:35,586 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:05:35,640 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:06:00,592 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:06:00,645 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:06:25,597 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:06:25,651 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:06:50,601 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:06:50,655 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:07:15,606 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:07:15,660 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:07:40,611 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:07:40,666 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:08:05,617 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:08:05,671 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:08:30,624 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:08:30,678 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:08:55,629 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:08:55,683 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:09:20,633 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:09:20,689 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:09:45,638 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:09:45,695 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:10:10,642 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:10:10,700 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:10:35,647 - WARNING - cthulhu Re-opening connection to salt-master
2014-06-10 22:10:35,705 - WARNING - cthulhu Re-opening connection to salt-master

/var/log/calamari/http_error.log was:

http://pastebin.com/7WwXq6im

/var/log/calamari/calamari.log showed

2014-06-10 21:03:18,883 - metric_access - django.request Not Found: /favicon.ico
2014-06-10 21:16:57,535 - metric_access - django.request Not Found: /favicon.ico

#10 Updated by Warren Usui about 7 years ago

Note that reinstalling the calamari server (starting with a clean system) still results in this problem.

#11 Updated by Warren Usui about 7 years ago

I have left these systems up. Vpm016 is the server. Vpm017 is the client and ceph machine.

When in doubt the passwd is admin.

#12 Updated by Dan Mick about 7 years ago

...and the user is 'root' (root/admin)

#13 Updated by Dan Mick about 7 years ago

vpm017 has no minion running; its log shows:

2014-06-10 23:21:29,652 [salt.crypt                                  ][CRITICAL] The Salt Master server's public key did not authenticate!
The master may need to be updated if it is a version of Salt lower than 2014.1.4, or
If you are confident that you are connecting to a valid Salt Master, then remove the master public key and restart the Salt Minion.
The master public key can be found at:
/etc/salt/pki/minion/minion_master.pub
2014-06-10 23:21:29,656 [salt.crypt                                  ][ERROR   ] The master key has changed, the salt master could have been subverted, verify salt master's public key

#14 Updated by Dan Mick about 7 years ago

Warren says this is likely from the server being reinstalled, so I will try to fix up the server key on the minion while he tries to reproduce back to the "first server installed" state on two other VMs. (Edit: manually installing the new server key and starting salt-minion made the cluster reporting start, as expected)

#15 Updated by Dan Mick about 7 years ago

OK. I think I understand this. salt-minion was started when there was no Ceph, and thus no rados.py; salt-minion imports ceph.py at the beginning of time, which tries and fails to import rados, and never tries again.

I think the fix is to get ceph.py to retry the import rados. Testing: stop salt-minion, restart, things work. Stop, rename rados.py to notrados.py, delete rados.py?; the browser still gets information from cthulhu, which is becoming stale, but "salt '*' ceph.get_heartbeats" from the master proves that salt-minion is broken again. Reinstate rados.py and restart the minion, and the call from the master works.

That should allow testing my idea for a fix, which is to put the try: import rados inside get_heartbeats().
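
A minimal sketch of that idea, assuming nothing about the real ceph.py beyond what is said above: the import is attempted inside the function, so a minion that started before Ceph was installed picks the module up on a later call. The module_name parameter and the returned dictionary are made up for illustration; the real get_heartbeats() returns much more, as the output in comment #9 shows.

```python
import importlib


def get_heartbeats(module_name="rados"):
    """Illustrative stand-in for the salt module function.

    The import happens at function scope, so it is retried on every
    call instead of failing once when the module is first loaded.
    """
    try:
        bindings = importlib.import_module(module_name)
    except ImportError:
        # Bindings are not installed (yet): report nothing this time.
        return None
    return {"bindings": bindings.__name__}
```

A failed import is not remembered between calls, which is why the retry can succeed once the package appears. If the bindings are installed as new files while the process keeps running, importlib.invalidate_caches() may also be needed before the import sees them.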

#16 Updated by Dan Mick about 7 years ago

That works. However, I do not understand what's going on with CephError()'s variant definition, or how to adapt it to the idea of dynamically retrying the import.

#17 Updated by Dan Mick about 7 years ago

  • Category set to Backend (services)
  • Status changed from New to Resolved
  • Assignee set to John Spray
  • Target version set to v1.2-dev11
  • Source changed from other to Q/A

John had a different fix.

#18 Updated by Warren Usui about 7 years ago

This appears to be fixed.

#19 Updated by Warren Usui about 7 years ago

  • Status changed from Resolved to Closed
