Same issues here.
The cluster has been upgraded from Jewel to Kraken and then to Luminous on Ubuntu 16.04, ceph-mgr was deployed with ceph-deploy while the cluster was already http://xor.meo.ws/OyvoZF8v0aWq0D-rOOg2y6u03fp_yzYv.txtrunning Luminous.
Cluster health:
cluster:
id: ceb0b908-8566-405c-8bd5-88a5562d4ec9
health: HEALTH_WARN
pool eu-de-fks-1.rgw.buckets.data has many more objects per pg than average (too few pgs?)
all OSDs are running luminous or later but require_osd_release < luminous
no active mgr
services:
mon: 3 daemons, quorum fvz-meph-de-fks-d,fvz-meph-de-fks-f,fvz-meph-de-fks-e
mgr: no daemons active
mds: cfs-1/1/1 up {0=fvz-meph-de-fks-e=up:active}, 1 up:standby-replay
osd: 50 osds: 50 up, 50 in
rgw: 1 daemon active
data:
pools: 27 pools, 15084 pgs
objects: 19864k objects, 62350 GB
usage: 121 TB used, 75816 GB / 195 TB avail
pgs: 15080 active+clean
4 active+clean+scrubbing+deep
io:
client: 971 kB/s rd, 272 kB/s wr, 24 op/s rd, 7 op/s wr
Related log entries after initial deployment of ceph-mgr:
2017-09-11 06:41:53.535025 7fb5aa7b8500 0 set uid:gid to 64045:64045 (ceph:ceph)
2017-09-11 06:41:53.535048 7fb5aa7b8500 0 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown), pid 17031
2017-09-11 06:41:53.536853 7fb5aa7b8500 0 pidfile_write: ignore empty --pid-file
2017-09-11 06:41:53.541880 7fb5aa7b8500 1 mgr send_beacon standby
2017-09-11 06:41:54.547383 7fb5a1aec700 1 mgr handle_mgr_map Activating!
2017-09-11 06:41:54.547575 7fb5a1aec700 1 mgr handle_mgr_map I am now activating
2017-09-11 06:41:54.650677 7fb59dae4700 1 mgr start Creating threads for 0 modules
2017-09-11 06:41:54.650696 7fb59dae4700 1 mgr send_beacon active
2017-09-11 06:41:55.542252 7fb59eae6700 1 mgr send_beacon active
2017-09-11 06:41:55.542627 7fb59eae6700 1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
2017-09-11 06:41:57.542697 7fb59eae6700 1 mgr send_beacon active
[... lots of "send_beacon active" messages ...]
2017-09-11 07:29:29.640892 7fb59eae6700 1 mgr send_beacon active
2017-09-11 07:29:30.866366 7fb59d2e3700 -1 *** Caught signal (Aborted) **
in thread 7fb59d2e3700 thread_name:ms_dispatch
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
1: (()+0x3de6b4) [0x55f6640e16b4]
2: (()+0x11390) [0x7fb5a8fef390]
3: (gsignal()+0x38) [0x7fb5a7f7f428]
4: (abort()+0x16a) [0x7fb5a7f8102a]
5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fb5a88c284d]
6: (()+0x8d6b6) [0x7fb5a88c06b6]
7: (()+0x8d701) [0x7fb5a88c0701]
8: (()+0x8d919) [0x7fb5a88c0919]
9: (()+0x2318ad) [0x55f663f348ad]
10: (()+0x3e91bd) [0x55f6640ec1bd]
11: (DaemonPerfCounters::update(MMgrReport*)+0x821) [0x55f663f96651]
12: (DaemonServer::handle_report(MMgrReport*)+0x1ae) [0x55f663f9b79e]+
13: (DaemonServer::ms_dispatch(Message*)+0x64) [0x55f663fa8d64]
14: (DispatchQueue::entry()+0xf4a) [0x55f664438f3a]
15: (DispatchQueue::DispatchThread::entry()+0xd) [0x55f6641dc44d]
16: (()+0x76ba) [0x7fb5a8fe56ba]
17: (clone()+0x6d) [0x7fb5a80513dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
[...]
-19> 2017-09-11 07:29:30.637691 7fb59d2e3700 1 -- 5.9.136.6:6800/17031 <== osd.60 136.243.22.99:6812/23851 562 ==== mgrreport(osd.60 +0-0 packed 4966) v4 ==== 4989+0+0 (1558199989 0 0) 0x55f66feb1500 con 0x55f66dda9800
-18> 2017-09-11 07:29:30.637721 7fb59d2e3700 4 mgr.server handle_report from 0x55f66dda9800 osd,60
-17> 2017-09-11 07:29:30.641446 7fb5a42f1700 5 -- 5.9.136.6:6800/17031 >> 136.243.22.99:6812/23851 conn(0x55f66dda9800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=1). rx osd.60 seq 563 0x55f66feb1b00 pg_stats(285 pgs tid 0 v 0) v1
-16> 2017-09-11 07:29:30.641523 7fb59d2e3700 1 -- 5.9.136.6:6800/17031 <== osd.60 136.243.22.99:6812/23851 563 ==== pg_stats(285 pgs tid 0 v 0) v1 ==== 164246+0+0 (1188816834 0 0) 0x55f66feb1b00 con 0x55f66dda9800
-15> 2017-09-11 07:29:30.765765 7fb5a4af2700 5 -- 5.9.136.6:6800/17031 >> 5.9.155.119:6801/10380 conn(0x55f66d9e3800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=1). rx osd.7 seq 570 0x55f670207c00 mgrreport(osd.7 +0-0 packed 4966) v4
-14> 2017-09-11 07:29:30.765836 7fb59d2e3700 1 -- 5.9.136.6:6800/17031 <== osd.7 5.9.155.119:6801/10380 570 ==== mgrreport(osd.7 +0-0 packed 4966) v4 ==== 4988+0+0 (1276362683 0 0) 0x55f670207c00 con 0x55f66d9e3800
-13> 2017-09-11 07:29:30.765865 7fb59d2e3700 4 mgr.server handle_report from 0x55f66d9e3800 osd,7
-12> 2017-09-11 07:29:30.768446 7fb5a4af2700 5 -- 5.9.136.6:6800/17031 >> 5.9.155.119:6801/10380 conn(0x55f66d9e3800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=1). rx osd.7 seq 571 0x55f66fe64300 pg_stats(244 pgs tid 0 v 0) v1
-11> 2017-09-11 07:29:30.768518 7fb59d2e3700 1 -- 5.9.136.6:6800/17031 <== osd.7 5.9.155.119:6801/10380 571 ==== pg_stats(244 pgs tid 0 v 0) v1 ==== 140636+0+0 (841178967 0 0) 0x55f66fe64300 con 0x55f66d9e3800
-10> 2017-09-11 07:29:30.800429 7fb5a4af2700 5 -- 5.9.136.6:0/1560413667 >> 5.9.136.6:6789/0 conn(0x55f66d75f800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=7780524 cs=1 l=1). rx mon.0 seq 3378 0x55f670f7d680 log(1 entries from seq 5470 at 2017-09-11 07:29:29.494911) v1
-9> 2017-09-11 07:29:30.800497 7fb5a1aec700 1 -- 5.9.136.6:0/1560413667 <== mon.0 5.9.136.6:6789/0 3378 ==== log(1 entries from seq 5470 at 2017-09-11 07:29:29.494911) v1 ==== 303+0+0 (1078424852 0 0) 0x55f670f7d680 con 0x55f66d75f800
-8> 2017-09-11 07:29:30.800542 7fb5a1aec700 4 mgr ms_dispatch active log(1 entries from seq 5470 at 2017-09-11 07:29:29.494911) v1
-7> 2017-09-11 07:29:30.800555 7fb5a1aec700 4 mgr ms_dispatch log(1 entries from seq 5470 at 2017-09-11 07:29:29.494911) v1
-6> 2017-09-11 07:29:30.843630 7fb5a3af0700 5 -- 5.9.136.6:6800/17031 >> 5.9.136.6:0/7871 conn(0x55f66de53800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1 cs=1 l=1). rx mon.0 seq 573 0x55f66fa4d500 mgrreport(mon.fvz-meph-de-fks-d +0-0 packed 1630) v4
-5> 2017-09-11 07:29:30.843701 7fb59d2e3700 1 -- 5.9.136.6:6800/17031 <== mon.0 5.9.136.6:0/7871 573 ==== mgrreport(mon.fvz-meph-de-fks-d +0-0 packed 1630) v4 ==== 1668+0+0 (925211312 0 0) 0x55f66fa4d500 con 0x55f66de53800
-4> 2017-09-11 07:29:30.843732 7fb59d2e3700 4 mgr.server handle_report from 0x55f66de53800 mon,fvz-meph-de-fks-d
-3> 2017-09-11 07:29:30.849910 7fb5a42f1700 5 -- 5.9.136.6:6800/17031 >> 136.243.138.74:0/3991168366 conn(0x55f66f5ce800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=1). rx client.43334541 seq 3 0x55f66f693e00 mgrreport(rgw.docker +103-0 packed 2294) v4
-2> 2017-09-11 07:29:30.849984 7fb59d2e3700 1 -- 5.9.136.6:6800/17031 <== client.43334541 136.243.138.74:0/3991168366 3 ==== mgrreport(rgw.docker +103-0 packed 2294) v4 ==== 10526+0+0 (1417840409 0 0) 0x55f66f693e00 con 0x55f66f5ce800
-1> 2017-09-11 07:29:30.850014 7fb59d2e3700 4 mgr.server handle_report from 0x55f66f5ce800 rgw,docker
0> 2017-09-11 07:29:30.866366 7fb59d2e3700 -1 *** Caught signal (Aborted) **
in thread 7fb59d2e3700 thread_name:ms_dispatch
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
1: (()+0x3de6b4) [0x55f6640e16b4]
2: (()+0x11390) [0x7fb5a8fef390]
3: (gsignal()+0x38) [0x7fb5a7f7f428]
4: (abort()+0x16a) [0x7fb5a7f8102a]
5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fb5a88c284d]
6: (()+0x8d6b6) [0x7fb5a88c06b6]
7: (()+0x8d701) [0x7fb5a88c0701]
8: (()+0x8d919) [0x7fb5a88c0919]
9: (()+0x2318ad) [0x55f663f348ad]
10: (()+0x3e91bd) [0x55f6640ec1bd]
11: (DaemonPerfCounters::update(MMgrReport*)+0x821) [0x55f663f96651]
12: (DaemonServer::handle_report(MMgrReport*)+0x1ae) [0x55f663f9b79e]
13: (DaemonServer::ms_dispatch(Message*)+0x64) [0x55f663fa8d64]
14: (DispatchQueue::entry()+0xf4a) [0x55f664438f3a]
15: (DispatchQueue::DispatchThread::entry()+0xd) [0x55f6641dc44d]
16: (()+0x76ba) [0x7fb5a8fe56ba]
17: (clone()+0x6d) [0x7fb5a80513dd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
[...]
--- end dump of recent events ---
I tried to manually run ceph-mgr with
/usr/bin/ceph-mgr -f --cluster ceph --id $HOSTNAME --setuser ceph --setgroup ceph
which immediately fails to keep running for longer than a few seconds.
stdout and logfile are attached, objdump: http://xor.meo.ws/oxo2q8h_oKAG6q7mARvNKkR_JdYjn89B.txt