Bug #43447

mgr/diskprediction: diskprediction module fails to initialize with newer SciPy versions

Added by Volker Theile about 2 months ago. Updated 25 days ago.

Status: New
Priority: Normal
Assignee:
Category: diskprediction_cloud
Target version:
% Done: 0%
Source: Development
Tags:
Backport:
Regression: Yes
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

In the latest master it seems that `ceph -h` does not list the manager module commands. Because of that, it is not possible to start a vstart environment: vstart waits indefinitely for the dashboard module to start.

vstart uses `ceph -h` to check whether a module is loaded, see https://github.com/ceph/ceph/blob/master/src/vstart.sh#L972.
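For illustration, here is a minimal Python sketch of the same polling idea (vstart.sh itself implements this in shell, and the exact pattern it greps for may differ):

# Sketch only, not vstart.sh itself: poll `ceph -h` until the dashboard
# module's commands show up in the help output. A mgr module's commands
# are only listed once the module has finished initializing, so a hung
# module keeps this loop spinning forever.
import subprocess
import time

CEPH = ["/ceph/build/bin/ceph", "-c", "/ceph/build/ceph.conf",
        "-k", "/ceph/build/keyring"]

def dashboard_commands_listed():
    result = subprocess.run(CEPH + ["-h"], capture_output=True, text=True)
    return "dashboard" in result.stdout

while not dashboard_commands_listed():
    print("waiting for mgr dashboard module to start")
    time.sleep(1)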

# start-ceph.sh 
=== mon.c === 
Stopping Ceph mon.c on ceph-master-docker...done
=== mon.b === 
Stopping Ceph mon.b on ceph-master-docker...done
=== mon.a === 
Stopping Ceph mon.a on ceph-master-docker...done
=== mgr.x === 
Stopping Ceph mgr.x on ceph-master-docker...done
** going verbose **
rm -f core* 
hostname ceph-master-docker
ip 192.168.178.24
port 40763
/ceph/build/bin/ceph-authtool --create-keyring --gen-key --name=mon. /ceph/build/keyring --cap mon 'allow *' 
creating /ceph/build/keyring
/ceph/build/bin/ceph-authtool --gen-key --name=client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *' /ceph/build/keyring 
/ceph/build/bin/monmaptool --create --clobber --addv a [v2:192.168.178.24:40763,v1:192.168.178.24:40764] --addv b [v2:192.168.178.24:40765,v1:192.168.178.24:40766] --addv c [v2:192.168.178.24:40767,v1:192.168.178.24:40768] --print /tmp/ceph_monmap.30775 
/ceph/build/bin/monmaptool: monmap file /tmp/ceph_monmap.30775
/ceph/build/bin/monmaptool: generated fsid 78ca53ba-40d0-4716-9d53-f24f254e2d0d
epoch 0
fsid 78ca53ba-40d0-4716-9d53-f24f254e2d0d
last_changed 2020-01-02T16:52:44.629708+0000
created 2020-01-02T16:52:44.629708+0000
min_mon_release 0 (unknown)
0: [v2:192.168.178.24:40763/0,v1:192.168.178.24:40764/0] mon.a
1: [v2:192.168.178.24:40765/0,v1:192.168.178.24:40766/0] mon.b
2: [v2:192.168.178.24:40767/0,v1:192.168.178.24:40768/0] mon.c
/ceph/build/bin/monmaptool: writing epoch 0 to /tmp/ceph_monmap.30775 (3 monitors)
rm -rf -- /ceph/build/dev/mon.a 
mkdir -p /ceph/build/dev/mon.a 
/ceph/build/bin/ceph-mon --mkfs -c /ceph/build/ceph.conf -i a --monmap=/tmp/ceph_monmap.30775 --keyring=/ceph/build/keyring 
rm -rf -- /ceph/build/dev/mon.b 
mkdir -p /ceph/build/dev/mon.b 
/ceph/build/bin/ceph-mon --mkfs -c /ceph/build/ceph.conf -i b --monmap=/tmp/ceph_monmap.30775 --keyring=/ceph/build/keyring 
rm -rf -- /ceph/build/dev/mon.c 
mkdir -p /ceph/build/dev/mon.c 
/ceph/build/bin/ceph-mon --mkfs -c /ceph/build/ceph.conf -i c --monmap=/tmp/ceph_monmap.30775 --keyring=/ceph/build/keyring 
rm -- /tmp/ceph_monmap.30775 
/ceph/build/bin/ceph-mon -i a -c /ceph/build/ceph.conf 
/ceph/build/bin/ceph-mon -i b -c /ceph/build/ceph.conf 
/ceph/build/bin/ceph-mon -i c -c /ceph/build/ceph.conf 
Populating config ...

[mgr]
    mgr/telemetry/enable = false
    mgr/telemetry/nag = false
Setting debug configs ...
creating /ceph/build/dev/mgr.x/keyring
/ceph/build/bin/ceph -c /ceph/build/ceph.conf -k /ceph/build/keyring -i /ceph/build/dev/mgr.x/keyring auth add mgr.x mon 'allow profile mgr' mds 'allow *' osd 'allow *' 
added key for mgr.x
/ceph/build/bin/ceph -c /ceph/build/ceph.conf -k /ceph/build/keyring config set mgr mgr/dashboard/x/ssl_server_port 41763 --force 
/ceph/build/bin/ceph -c /ceph/build/ceph.conf -k /ceph/build/keyring config set mgr mgr/restful/x/server_port 42763 --force 
Starting mgr.x
/ceph/build/bin/ceph-mgr -i x -c /ceph/build/ceph.conf 
/ceph/build/bin/ceph -c /ceph/build/ceph.conf -k /ceph/build/keyring -h 
waiting for mgr dashboard module to start
/ceph/build/bin/ceph -c /ceph/build/ceph.conf -k /ceph/build/keyring -h 
waiting for mgr dashboard module to start
/ceph/build/bin/ceph -c /ceph/build/ceph.conf -k /ceph/build/keyring -h 
waiting for mgr dashboard module to start
...

The configuration:

# cat /ceph/build/ceph.conf
; generated by vstart.sh on Thu 02 Jan 2020 04:52:44 PM UTC
[client.vstart.sh]
        num mon = 3
        num osd = 3
        num mds = 3
        num mgr = 1
        num rgw = 1
        num ganesha = 0

[global]
        fsid = 55d5f021-9cff-41bb-861e-17dcb5df051e
        osd failsafe full ratio = .99
        mon osd full ratio = .99
        mon osd nearfull ratio = .99
        mon osd backfillfull ratio = .99
        erasure code dir = /ceph/build/lib
        plugin dir = /ceph/build/lib
        filestore fd cache size = 32
        run dir = /ceph/build/out
        crash dir = /ceph/build/out
        enable experimental unrecoverable data corrupting features = *
        osd_crush_chooseleaf_type = 0
        debug asok assert abort = true

        ms bind msgr2 = true
        ms bind msgr1 = true

        lockdep = true
        auth cluster required = cephx
        auth service required = cephx
        auth client required = cephx
[client]
        keyring = /ceph/build/keyring
        log file = /ceph/build/out/$name.$pid.log
        admin socket = /tmp/ceph-asok.O14tIJ/$name.$pid.asok

        ; needed for s3tests
        rgw crypt s3 kms backend = testing
        rgw crypt s3 kms encryption keys = testkey-1=YmluCmJvb3N0CmJvb3N0LWJ1aWxkCmNlcGguY29uZgo= testkey-2=aWIKTWFrZWZpbGUKbWFuCm91dApzcmMKVGVzdGluZwo=
        rgw crypt require ssl = false
        ; uncomment the following to set LC days as the value in seconds;
        ; needed for passing lc time based s3-tests (can be verbose)
        ; rgw lc debug interval = 10
        ; The following settings are for SSE-KMS with Vault
        ;rgw crypt s3 kms backend = vault
        ;rgw crypt vault auth = token
        ;rgw crypt vault token file = /ceph/build/vault.token
        ;rgw crypt vault addr = http://127.0.0.1:8200
        ;rgw crypt vault secret engine = kv
        ;rgw crypt vault prefix = /v1/kv/data
        ;rgw crypt vault secret engine = transit
        ;rgw crypt vault prefix = /v1/transit/export/encryption-key/

[cephfs-shell]
        debug shell = true

[client.rgw.8000]
        rgw frontends = beast port=8000
        admin socket = /ceph/build/out/radosgw.8000.asok
[mds]

        log file = /ceph/build/out/$name.log
        admin socket = /tmp/ceph-asok.O14tIJ/$name.asok
        chdir = "" 
        pid file = /ceph/build/out/$name.pid
        heartbeat file = /ceph/build/out/$name.heartbeat

        mds data = /ceph/build/dev/mds.$id
        mds root ino uid = 0
        mds root ino gid = 0

[mgr]
        mgr data = /ceph/build/dev/mgr.$id
        mgr module path = /ceph/src/pybind/mgr
        cephadm path = /ceph/src/cephadm/cephadm

        log file = /ceph/build/out/$name.log
        admin socket = /tmp/ceph-asok.O14tIJ/$name.asok
        chdir = "" 
        pid file = /ceph/build/out/$name.pid
        heartbeat file = /ceph/build/out/$name.heartbeat

[osd]

        log file = /ceph/build/out/$name.log
        admin socket = /tmp/ceph-asok.O14tIJ/$name.asok
        chdir = "" 
        pid file = /ceph/build/out/$name.pid
        heartbeat file = /ceph/build/out/$name.heartbeat

        osd_check_max_object_name_len_on_startup = false
        osd data = /ceph/build/dev/osd$id
        osd journal = /ceph/build/dev/osd$id/journal
        osd journal size = 100
        osd class tmp = out
        osd class dir = /ceph/build/lib
        osd class load list = *
        osd class default list = *
        osd fast shutdown = false

        filestore wbthrottle xfs ios start flusher = 10
        filestore wbthrottle xfs ios hard limit = 20
        filestore wbthrottle xfs inodes hard limit = 30
        filestore wbthrottle btrfs ios start flusher = 10
        filestore wbthrottle btrfs ios hard limit = 20
        filestore wbthrottle btrfs inodes hard limit = 30
        bluestore fsck on mount = true
        bluestore block create = true
        bluestore block db path = /ceph/build/dev/osd$id/block.db.file
        bluestore block db size = 1073741824
        bluestore block db create = true
        bluestore block wal path = /ceph/build/dev/osd$id/block.wal.file
        bluestore block wal size = 1048576000
        bluestore block wal create = true

        ; kstore
        kstore fsck on mount = true
        osd objectstore = bluestore

[mon]
        mgr initial modules = dashboard restful iostat

        log file = /ceph/build/out/$name.log
        admin socket = /tmp/ceph-asok.O14tIJ/$name.asok
        chdir = "" 
        pid file = /ceph/build/out/$name.pid
        heartbeat file = /ceph/build/out/$name.heartbeat

        debug mon = 20
        debug paxos = 20
        debug auth = 20
        debug mgrc = 20
        debug ms = 1

        mon cluster log file = /ceph/build/out/cluster.mon.$id.log
        osd pool default erasure code profile = plugin=jerasure technique=reed_sol_van k=2 m=1 crush-failure-domain=osd
[mon.a]
        host = ceph-master-docker
        mon data = /ceph/build/dev/mon.a
[mon.b]
        host = ceph-master-docker
        mon data = /ceph/build/dev/mon.b
[mon.c]
        host = ceph-master-docker
        mon data = /ceph/build/dev/mon.c
[global]
        mon host =  [v2:192.168.178.24:40763,v1:192.168.178.24:40764] [v2:192.168.178.24:40765,v1:192.168.178.24:40766] [v2:192.168.178.24:40767,v1:192.168.178.24:40768]
[mgr.x]
        host = ceph-master-docker

The dashboard module is enabled:

# /ceph/build/bin/ceph -c /ceph/build/ceph.conf -k /ceph/build/keyring mgr module ls              
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2020-01-02T16:55:29.438+0000 7f6dd3fa1700 -1 WARNING: all dangerous and experimental features are enabled.
2020-01-02T16:55:29.462+0000 7f6dd3fa1700 -1 WARNING: all dangerous and experimental features are enabled.
{
    "enabled_modules": [
        "dashboard",
        "iostat",
        "restful" 
    ],
    "disabled_modules": []
}
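
Note that `mgr module ls` only reports which modules are enabled, not which ones have finished initializing; `ceph -h` reflects the commands that modules have actually registered. A hypothetical Python check that makes the distinction visible:

# Hypothetical illustration: dashboard shows up as enabled even while
# the mgr is hung and `ceph -h` still lacks its commands, i.e.
# "enabled" is not the same as "initialized".
import json
import subprocess

out = subprocess.run(
    ["/ceph/build/bin/ceph", "-c", "/ceph/build/ceph.conf",
     "-k", "/ceph/build/keyring",
     "mgr", "module", "ls", "--format", "json"],
    capture_output=True, text=True).stdout
modules = json.loads(out)
print("dashboard" in modules["enabled_modules"])  # True despite the hang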

mgr.x.log (103 KB) Volker Theile, 01/02/2020 05:10 PM

History

#1 Updated by Volker Theile about 2 months ago

#2 Updated by Volker Theile about 2 months ago

  • Assignee set to Sage Weil

@Sage I assigned the issue to you because you already worked on that issue some time ago: https://github.com/ceph/ceph/pull/30217/commits/81280455112b1c8f4495ef63b9a5e236e5c65465

#3 Updated by Kiefer Chang about 2 months ago

The mgr hangs when loading the `diskprediction_local` module.

Some experiments:
- Moving the diskprediction_local module folder out of the source tree so it is not loaded --> mgr starts.
- The newly created container ships SciPy 1.4.1; temporarily replacing it with SciPy 1.3.2 --> mgr starts.

This may be related to #42764.
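
For reference, a quick way to see which SciPy version an environment resolves, plus a hypothetical guard (not part of the actual module) based on the experiments above:

# Print the SciPy version the interpreter picks up; per the experiments
# above, 1.4.1 hangs the mgr while 1.3.2 works.
from distutils.version import LooseVersion
import scipy

print(scipy.__version__)
# Hypothetical defensive check: fail the import loudly instead of
# letting the whole mgr block (assumes the 1.4 boundary seen here).
if LooseVersion(scipy.__version__) >= LooseVersion("1.4.0"):
    raise ImportError("scipy %s hangs ceph-mgr in this setup; "
                      "use scipy < 1.4" % scipy.__version__)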

#4 Updated by Volker Theile about 2 months ago

Kiefer, you made my day. I can confirm that removing the src/pybind/mgr/diskprediction_local/ directory fixes the issue. I assume that downgrading scipy will fix it as well.

#5 Updated by Patrick Seidensal about 1 month ago

I ran into this today and removing `diskprediction_local` from `src/pybind/mgr` resolved the issue for me.

#6 Updated by Patrick Seidensal about 1 month ago

@Kiefer Chang

That explains why a new build of my container seems to have broken vstart! Thanks for the investigation.

#7 Updated by Lenz Grimmer 25 days ago

FWIW, I ran into this issue myself a few days ago and was not able to determine the root cause. Downgrading SciPy to version 1.3.2 fixed the issue for me.

We should investigate why newer SciPy versions cause the diskprediction module to get stuck during initialization.
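
One possible starting point (a hedged sketch, assuming the hang occurs while the module imports its scientific dependencies, which this ticket has not confirmed): time those imports in a plain interpreter and compare with the behavior inside ceph-mgr.

# Hypothetical diagnostic: time the heavyweight imports the predictor
# typically needs. In a plain interpreter these should return within
# seconds; if the same imports never return under ceph-mgr, the hang is
# specific to the mgr's embedded Python environment.
import time

for name in ("numpy", "scipy", "sklearn"):
    t0 = time.time()
    try:
        mod = __import__(name)
        print("%s %s imported in %.2fs"
              % (name, getattr(mod, "__version__", "?"), time.time() - t0))
    except ImportError as exc:
        print("%s not importable: %s" % (name, exc))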

#8 Updated by Lenz Grimmer 25 days ago

  • Project changed from Ceph to mgr
  • Subject changed from vstart: ceph -h does not show manager module commands to mgr/diskprediction: diskprediction module fails to initialize with newer SciPy versions
  • Category changed from ceph cli to diskprediction_cloud
  • Regression changed from No to Yes
  • Severity changed from 3 - minor to 2 - major

Updated component and severity according to the latest findings.
