Bug #21197
closedcrash on DaemonPerfCounters::update
0%
Description
-1> 2017-08-31 17:11:27.825845 7f2f0f390700 4 mgr.server handle_report from 0x55cadc00e800 mds,b 0> 2017-08-31 17:11:27.846157 7f2f0f390700 -1 *** Caught signal (Aborted) ** in thread 7f2f0f390700 thread_name:ms_dispatch ceph version 12.1.2-1333-gb8aafc6 (b8aafc63ce9b8af3be065449cba78f5d8cb6ee70) mimic (dev) 1: (()+0x3dd7d4) [0x55cacf6c57d4] 2: (()+0x11390) [0x7f2f1a85c390] 3: (gsignal()+0x38) [0x7f2f197ec428] 4: (abort()+0x16a) [0x7f2f197ee02a] 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f2f1a12f84d] 6: (()+0x8d6b6) [0x7f2f1a12d6b6] 7: (()+0x8d701) [0x7f2f1a12d701] 8: (()+0x8d919) [0x7f2f1a12d919] 9: (()+0x2306fd) [0x55cacf5186fd] 10: (()+0x3e82dd) [0x55cacf6d02dd] 11: (DaemonPerfCounters::update(MMgrReport*)+0x821) [0x55cacf57a6a1] 12: (DaemonServer::handle_report(MMgrReport*)+0x1ae) [0x55cacf57f7ee] 13: (DaemonServer::ms_dispatch(Message*)+0x64) [0x55cacf58cdc4] 14: (DispatchQueue::entry()+0xf4a) [0x55cacfa17eda] 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x55cacf7b8a7d] 16: (()+0x76ba) [0x7f2f1a8526ba] 17: (clone()+0x6d) [0x7f2f198be3dd] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher
Files
Updated by Chang Liu over 6 years ago
#10 0x000055cacf5186fd in ceph::buffer::list::copy_in (crc_reset=<optimized out>, src=<optimized out>, len=<optimized out>, off=<optimized out>, this=<optimized out>) at /home/liuchang/WorkSpace/ceph/src/common/buffer.cc:1862 #11 0x000055cacf6d02dd in ceph::buffer::list::iterator_impl<false>::copy (this=0x7f2f0f38e600, len=<optimized out>, dest=0x7f2f0f38e5f0 "") at /home/liuchang/WorkSpace/ceph/src/common/buffer.cc:1214 #12 0x000055cacf57a6a1 in decode_raw<ceph_le<unsigned long long> > (p=..., t=...) at /home/liuchang/WorkSpace/ceph/src/include/encoding.h:61 #13 decode (p=..., v=<synthetic pointer>) at /home/liuchang/WorkSpace/ceph/src/include/encoding.h:107 #14 DaemonPerfCounters::update (this=this@entry=0x55cadbfcfb18, report=report@entry=0x55cadbfc9e00) at /home/liuchang/WorkSpace/ceph/src/mgr/DaemonState.cc:142 #15 0x000055cacf57f7ee in DaemonServer::handle_report (this=this@entry=0x55cadab7f350, m=m@entry=0x55cadbfc9e00) at /home/liuchang/WorkSpace/ceph/src/mgr/DaemonServer.cc:406 #16 0x000055cacf58cdc4 in DaemonServer::ms_dispatch (this=0x55cadab7f350, m=0x55cadbfc9e00) at /home/liuchang/WorkSpace/ceph/src/mgr/DaemonServer.cc:257 #17 0x000055cacfa17eda in Messenger::ms_deliver_dispatch (m=0x55cadbfc9e00, this=0x55cadaac4800) at /home/liuchang/WorkSpace/ceph/src/msg/Messenger.h:625 #18 DispatchQueue::entry (this=0x55cadaac4980) at /home/liuchang/WorkSpace/ceph/src/msg/DispatchQueue.cc:197 #19 0x000055cacf7b8a7d in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /home/liuchang/WorkSpace/ceph/src/msg/DispatchQueue.h:101 #20 0x00007f2f1a8526ba in start_thread (arg=0x7f2f0f390700) at pthread_create.c:333 #21 0x00007f2f198be3dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109 (gdb) f 14 #14 DaemonPerfCounters::update (this=this@entry=0x55cadbfcfb18, report=report@entry=0x55cadbfc9e00) at /home/liuchang/WorkSpace/ceph/src/mgr/DaemonState.cc:142 142 ::decode(val, p); (gdb) l 135 130 131 const auto now = ceph_clock_now(); 132 133 // Parse packed data according to declared set of types 134 bufferlist::iterator p = report->packed.begin(); 135 DECODE_START(1, p); 136 for (const auto &t_path : declared_types) { 137 const auto &t = types.at(t_path); 138 uint64_t val = 0; 139 uint64_t avgcount = 0; (gdb) l 140 uint64_t avgcount2 = 0; 141 142 ::decode(val, p); 143 if (t.type & PERFCOUNTER_LONGRUNAVG) { 144 ::decode(avgcount, p); 145 ::decode(avgcount2, p); 146 } 147 // TODO: interface for insertion of avgs 148 instances[t_path].push(now, val); 149 } (gdb) down #13 decode (p=..., v=<synthetic pointer>) at /home/liuchang/WorkSpace/ceph/src/include/encoding.h:107 107 WRITE_INTTYPE_ENCODER(uint64_t, le64) (gdb) down #12 0x000055cacf57a6a1 in decode_raw<ceph_le<unsigned long long> > (p=..., t=...) at /home/liuchang/WorkSpace/ceph/src/include/encoding.h:61 61 p.copy(sizeof(t), (char*)&t); (gdb) down #11 0x000055cacf6d02dd in ceph::buffer::list::iterator_impl<false>::copy (this=0x7f2f0f38e600, len=<optimized out>, dest=0x7f2f0f38e5f0 "") at /home/liuchang/WorkSpace/ceph/src/common/buffer.cc:1214 1214 assert(p->length() > 0);
version is "ceph version 12.1.2-1333-gb8aafc6 (b8aafc63ce9b8af3be065449cba78f5d8cb6ee70) mimic (dev)"
Updated by Brad Hubbard over 6 years ago
- Category set to ceph-mgr
- Priority changed from Normal to High
Updated by Katie Holly over 6 years ago
- File stdout.txt stdout.txt added
- File logs.txt logs.txt added
Same issues here.
The cluster has been upgraded from Jewel to Kraken and then to Luminous on Ubuntu 16.04, ceph-mgr was deployed with ceph-deploy while the cluster was already http://xor.meo.ws/OyvoZF8v0aWq0D-rOOg2y6u03fp_yzYv.txtrunning Luminous.
Cluster health:
cluster: id: ceb0b908-8566-405c-8bd5-88a5562d4ec9 health: HEALTH_WARN pool eu-de-fks-1.rgw.buckets.data has many more objects per pg than average (too few pgs?) all OSDs are running luminous or later but require_osd_release < luminous no active mgr services: mon: 3 daemons, quorum fvz-meph-de-fks-d,fvz-meph-de-fks-f,fvz-meph-de-fks-e mgr: no daemons active mds: cfs-1/1/1 up {0=fvz-meph-de-fks-e=up:active}, 1 up:standby-replay osd: 50 osds: 50 up, 50 in rgw: 1 daemon active data: pools: 27 pools, 15084 pgs objects: 19864k objects, 62350 GB usage: 121 TB used, 75816 GB / 195 TB avail pgs: 15080 active+clean 4 active+clean+scrubbing+deep io: client: 971 kB/s rd, 272 kB/s wr, 24 op/s rd, 7 op/s wr
Related log entries after initial deployment of ceph-mgr:
2017-09-11 06:41:53.535025 7fb5aa7b8500 0 set uid:gid to 64045:64045 (ceph:ceph) 2017-09-11 06:41:53.535048 7fb5aa7b8500 0 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown), pid 17031 2017-09-11 06:41:53.536853 7fb5aa7b8500 0 pidfile_write: ignore empty --pid-file 2017-09-11 06:41:53.541880 7fb5aa7b8500 1 mgr send_beacon standby 2017-09-11 06:41:54.547383 7fb5a1aec700 1 mgr handle_mgr_map Activating! 2017-09-11 06:41:54.547575 7fb5a1aec700 1 mgr handle_mgr_map I am now activating 2017-09-11 06:41:54.650677 7fb59dae4700 1 mgr start Creating threads for 0 modules 2017-09-11 06:41:54.650696 7fb59dae4700 1 mgr send_beacon active 2017-09-11 06:41:55.542252 7fb59eae6700 1 mgr send_beacon active 2017-09-11 06:41:55.542627 7fb59eae6700 1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs 2017-09-11 06:41:57.542697 7fb59eae6700 1 mgr send_beacon active [... lots of "send_beacon active" messages ...] 2017-09-11 07:29:29.640892 7fb59eae6700 1 mgr send_beacon active 2017-09-11 07:29:30.866366 7fb59d2e3700 -1 *** Caught signal (Aborted) ** in thread 7fb59d2e3700 thread_name:ms_dispatch ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc) 1: (()+0x3de6b4) [0x55f6640e16b4] 2: (()+0x11390) [0x7fb5a8fef390] 3: (gsignal()+0x38) [0x7fb5a7f7f428] 4: (abort()+0x16a) [0x7fb5a7f8102a] 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fb5a88c284d] 6: (()+0x8d6b6) [0x7fb5a88c06b6] 7: (()+0x8d701) [0x7fb5a88c0701] 8: (()+0x8d919) [0x7fb5a88c0919] 9: (()+0x2318ad) [0x55f663f348ad] 10: (()+0x3e91bd) [0x55f6640ec1bd] 11: (DaemonPerfCounters::update(MMgrReport*)+0x821) [0x55f663f96651] 12: (DaemonServer::handle_report(MMgrReport*)+0x1ae) [0x55f663f9b79e]+ 13: (DaemonServer::ms_dispatch(Message*)+0x64) [0x55f663fa8d64] 14: (DispatchQueue::entry()+0xf4a) [0x55f664438f3a] 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x55f6641dc44d] 16: (()+0x76ba) [0x7fb5a8fe56ba] 17: (clone()+0x6d) [0x7fb5a80513dd] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- begin dump of recent events --- [...] -19> 2017-09-11 07:29:30.637691 7fb59d2e3700 1 -- 5.9.136.6:6800/17031 <== osd.60 136.243.22.99:6812/23851 562 ==== mgrreport(osd.60 +0-0 packed 4966) v4 ==== 4989+0+0 (1558199989 0 0) 0x55f66feb1500 con 0x55f66dda9800 -18> 2017-09-11 07:29:30.637721 7fb59d2e3700 4 mgr.server handle_report from 0x55f66dda9800 osd,60 -17> 2017-09-11 07:29:30.641446 7fb5a42f1700 5 -- 5.9.136.6:6800/17031 >> 136.243.22.99:6812/23851 conn(0x55f66dda9800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=1). rx osd.60 seq 563 0x55f66feb1b00 pg_stats(285 pgs tid 0 v 0) v1 -16> 2017-09-11 07:29:30.641523 7fb59d2e3700 1 -- 5.9.136.6:6800/17031 <== osd.60 136.243.22.99:6812/23851 563 ==== pg_stats(285 pgs tid 0 v 0) v1 ==== 164246+0+0 (1188816834 0 0) 0x55f66feb1b00 con 0x55f66dda9800 -15> 2017-09-11 07:29:30.765765 7fb5a4af2700 5 -- 5.9.136.6:6800/17031 >> 5.9.155.119:6801/10380 conn(0x55f66d9e3800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=1). rx osd.7 seq 570 0x55f670207c00 mgrreport(osd.7 +0-0 packed 4966) v4 -14> 2017-09-11 07:29:30.765836 7fb59d2e3700 1 -- 5.9.136.6:6800/17031 <== osd.7 5.9.155.119:6801/10380 570 ==== mgrreport(osd.7 +0-0 packed 4966) v4 ==== 4988+0+0 (1276362683 0 0) 0x55f670207c00 con 0x55f66d9e3800 -13> 2017-09-11 07:29:30.765865 7fb59d2e3700 4 mgr.server handle_report from 0x55f66d9e3800 osd,7 -12> 2017-09-11 07:29:30.768446 7fb5a4af2700 5 -- 5.9.136.6:6800/17031 >> 5.9.155.119:6801/10380 conn(0x55f66d9e3800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=1). rx osd.7 seq 571 0x55f66fe64300 pg_stats(244 pgs tid 0 v 0) v1 -11> 2017-09-11 07:29:30.768518 7fb59d2e3700 1 -- 5.9.136.6:6800/17031 <== osd.7 5.9.155.119:6801/10380 571 ==== pg_stats(244 pgs tid 0 v 0) v1 ==== 140636+0+0 (841178967 0 0) 0x55f66fe64300 con 0x55f66d9e3800 -10> 2017-09-11 07:29:30.800429 7fb5a4af2700 5 -- 5.9.136.6:0/1560413667 >> 5.9.136.6:6789/0 conn(0x55f66d75f800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=7780524 cs=1 l=1). rx mon.0 seq 3378 0x55f670f7d680 log(1 entries from seq 5470 at 2017-09-11 07:29:29.494911) v1 -9> 2017-09-11 07:29:30.800497 7fb5a1aec700 1 -- 5.9.136.6:0/1560413667 <== mon.0 5.9.136.6:6789/0 3378 ==== log(1 entries from seq 5470 at 2017-09-11 07:29:29.494911) v1 ==== 303+0+0 (1078424852 0 0) 0x55f670f7d680 con 0x55f66d75f800 -8> 2017-09-11 07:29:30.800542 7fb5a1aec700 4 mgr ms_dispatch active log(1 entries from seq 5470 at 2017-09-11 07:29:29.494911) v1 -7> 2017-09-11 07:29:30.800555 7fb5a1aec700 4 mgr ms_dispatch log(1 entries from seq 5470 at 2017-09-11 07:29:29.494911) v1 -6> 2017-09-11 07:29:30.843630 7fb5a3af0700 5 -- 5.9.136.6:6800/17031 >> 5.9.136.6:0/7871 conn(0x55f66de53800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1 cs=1 l=1). rx mon.0 seq 573 0x55f66fa4d500 mgrreport(mon.fvz-meph-de-fks-d +0-0 packed 1630) v4 -5> 2017-09-11 07:29:30.843701 7fb59d2e3700 1 -- 5.9.136.6:6800/17031 <== mon.0 5.9.136.6:0/7871 573 ==== mgrreport(mon.fvz-meph-de-fks-d +0-0 packed 1630) v4 ==== 1668+0+0 (925211312 0 0) 0x55f66fa4d500 con 0x55f66de53800 -4> 2017-09-11 07:29:30.843732 7fb59d2e3700 4 mgr.server handle_report from 0x55f66de53800 mon,fvz-meph-de-fks-d -3> 2017-09-11 07:29:30.849910 7fb5a42f1700 5 -- 5.9.136.6:6800/17031 >> 136.243.138.74:0/3991168366 conn(0x55f66f5ce800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=1). rx client.43334541 seq 3 0x55f66f693e00 mgrreport(rgw.docker +103-0 packed 2294) v4 -2> 2017-09-11 07:29:30.849984 7fb59d2e3700 1 -- 5.9.136.6:6800/17031 <== client.43334541 136.243.138.74:0/3991168366 3 ==== mgrreport(rgw.docker +103-0 packed 2294) v4 ==== 10526+0+0 (1417840409 0 0) 0x55f66f693e00 con 0x55f66f5ce800 -1> 2017-09-11 07:29:30.850014 7fb59d2e3700 4 mgr.server handle_report from 0x55f66f5ce800 rgw,docker 0> 2017-09-11 07:29:30.866366 7fb59d2e3700 -1 *** Caught signal (Aborted) ** in thread 7fb59d2e3700 thread_name:ms_dispatch ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc) 1: (()+0x3de6b4) [0x55f6640e16b4] 2: (()+0x11390) [0x7fb5a8fef390] 3: (gsignal()+0x38) [0x7fb5a7f7f428] 4: (abort()+0x16a) [0x7fb5a7f8102a] 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fb5a88c284d] 6: (()+0x8d6b6) [0x7fb5a88c06b6] 7: (()+0x8d701) [0x7fb5a88c0701] 8: (()+0x8d919) [0x7fb5a88c0919] 9: (()+0x2318ad) [0x55f663f348ad] 10: (()+0x3e91bd) [0x55f6640ec1bd] 11: (DaemonPerfCounters::update(MMgrReport*)+0x821) [0x55f663f96651] 12: (DaemonServer::handle_report(MMgrReport*)+0x1ae) [0x55f663f9b79e] 13: (DaemonServer::ms_dispatch(Message*)+0x64) [0x55f663fa8d64] 14: (DispatchQueue::entry()+0xf4a) [0x55f664438f3a] 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x55f6641dc44d] 16: (()+0x76ba) [0x7fb5a8fe56ba] 17: (clone()+0x6d) [0x7fb5a80513dd] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- [...] --- end dump of recent events ---
I tried to manually run ceph-mgr with
/usr/bin/ceph-mgr -f --cluster ceph --id $HOSTNAME --setuser ceph --setgroup ceph
which immediately fails to keep running for longer than a few seconds.
stdout and logfile are attached, objdump: http://xor.meo.ws/oxo2q8h_oKAG6q7mARvNKkR_JdYjn89B.txt
Updated by Katie Holly over 6 years ago
TLDR at EOF
Some more information to my setup:
I'm running a Docker swarm with 15 radosgw instances, the used Docker image is self-built and includes a static "--name=client.rgw.docker" argument. All radosgw instances share the exact same Docker image and as such also the same "client.rgw.docker" name. This was purely a design choice to simplify deployment without having to keep track of handed out keys.
Brad Hubbard and Ben Hines both helped me debug this issue on the ceph-users list and Ben noted that the shared --name argument on all radosgw instances might confuse ceph-mgr in some way.
I updated the Docker image to automatically generate client keys when starting a new radosgw instance and they no longer share a single --name, but have all completely unique, random names which seems to have fixed this issue for me.
TLDR/Long story short: Running more than one radosgw instance with the exact same --name value causes this ceph-mgr crashes for me, especially when running 15 of them at the same time. Generating an unique key or each radosgw instance solves this issue.
Updated by John Spray over 6 years ago
- Assignee set to John Spray
Thanks for the debugging -- when the RGW <-> mgr stuff was added we knew that it would require unique daemon names to produce sane output, but obviously it shouldn't crash when two daemons have the same name.
Fix incoming.
Updated by John Spray over 6 years ago
- Status changed from In Progress to Fix Under Review
- Backport set to luminous
Updated by Sage Weil over 6 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler over 6 years ago
- Copied to Backport #21656: luminous: crash on DaemonPerfCounters::update added
Updated by John Spray about 6 years ago
- Status changed from Pending Backport to Resolved
Updated by lin zhou over 5 years ago
I update from 12.2.5 to 12.2.7.and face this same problem.
mu deploy multi-rgw and using the same client name.
0> 2018-07-25 17:12:36.088942 7fa2bb20b700 -1 *** Caught signal (Aborted) ** in thread 7fa2bb20b700 thread_name:ms_dispatch ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable) 1: (()+0x3fc0ec) [0x55d0cc9bb0ec] 2: (()+0xf890) [0x7fa2c98b6890] 3: (gsignal()+0x37) [0x7fa2c88e7067] 4: (abort()+0x148) [0x7fa2c88e8448] 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fa2c91d4b3d] 6: (()+0x5ebb6) [0x7fa2c91d2bb6] 7: (()+0x5ec01) [0x7fa2c91d2c01] 8: (()+0x5ee19) [0x7fa2c91d2e19] 9: (std::__throw_out_of_range(char const*)+0x66) [0x7fa2c9228b76] 10: (DaemonPerfCounters::update(MMgrReport*)+0xd87) [0x55d0cc861b37] 11: (DaemonServer::handle_report(MMgrReport*)+0x246) [0x55d0cc86a916] 12: (DaemonServer::ms_dispatch(Message*)+0x47) [0x55d0cc877d27] 13: (DispatchQueue::entry()+0x7ca) [0x55d0cccd239a] 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x55d0ccaa852d] 15: (()+0x8064) [0x7fa2c98af064] 16: (clone()+0x6d) [0x7fa2c899a62d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Brad Hubbard over 5 years ago
- Status changed from Resolved to 12
Reopening.
See last comment and https://bugzilla.redhat.com/show_bug.cgi?id=1634964. this does not appear to be fixed in luminous yet.
Updated by Mykola Golub over 5 years ago
- Status changed from 12 to Resolved
This a different issue than initially reported -- a regression introduced when adding avgcount support. It is tracked in #36244.