Project

General

Profile

Bug #21197

crash on DaemonPerfCounters::update

Added by Chang Liu 4 months ago. Updated 3 months ago.

Status:
Pending Backport
Priority:
High
Assignee:
Category:
ceph-mgr
Target version:
-
Start date:
08/31/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No

Description

    -1> 2017-08-31 17:11:27.825845 7f2f0f390700  4 mgr.server handle_report from 0x55cadc00e800 mds,b
     0> 2017-08-31 17:11:27.846157 7f2f0f390700 -1 *** Caught signal (Aborted) **
 in thread 7f2f0f390700 thread_name:ms_dispatch                                  

 ceph version 12.1.2-1333-gb8aafc6 (b8aafc63ce9b8af3be065449cba78f5d8cb6ee70) mimic (dev)
 1: (()+0x3dd7d4) [0x55cacf6c57d4]                                               
 2: (()+0x11390) [0x7f2f1a85c390]                                                
 3: (gsignal()+0x38) [0x7f2f197ec428]                                            
 4: (abort()+0x16a) [0x7f2f197ee02a]                                             
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f2f1a12f84d]            
 6: (()+0x8d6b6) [0x7f2f1a12d6b6]                                                
 7: (()+0x8d701) [0x7f2f1a12d701]                                                
 8: (()+0x8d919) [0x7f2f1a12d919]                                                
 9: (()+0x2306fd) [0x55cacf5186fd]                                               
 10: (()+0x3e82dd) [0x55cacf6d02dd]                                              
 11: (DaemonPerfCounters::update(MMgrReport*)+0x821) [0x55cacf57a6a1]            
 12: (DaemonServer::handle_report(MMgrReport*)+0x1ae) [0x55cacf57f7ee]           
 13: (DaemonServer::ms_dispatch(Message*)+0x64) [0x55cacf58cdc4]                 
 14: (DispatchQueue::entry()+0xf4a) [0x55cacfa17eda]                             
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x55cacf7b8a7d]               
 16: (()+0x76ba) [0x7f2f1a8526ba]                                                
 17: (clone()+0x6d) [0x7f2f198be3dd]                                             
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---                                                           
   0/ 5 none                                                                     
   0/ 1 lockdep                                                                  
   0/ 1 context                                                                  
   1/ 1 crush                                                                    
   1/ 5 mds                                                                      
   1/ 5 mds_balancer                                                             
   1/ 5 mds_locker                                                               
   1/ 5 mds_log                                                                  
   1/ 5 mds_log_expire                                                           
   1/ 5 mds_migrator                                                             
   0/ 1 buffer                                                                   
   0/ 1 timer                                                                    
   0/ 1 filer                                                                    
   0/ 1 striper                                                                  
   0/ 1 objecter                                                                 
   0/ 5 rados                                                                    
   0/ 5 rbd                                                                      
   0/ 5 rbd_mirror                                                               
   0/ 5 rbd_replay                                                               
   0/ 5 journaler                                                                
   0/ 5 objectcacher  

stdout.txt View (3.15 KB) Katie Holly, 09/12/2017 01:33 AM

logs.txt View (155 KB) Katie Holly, 09/12/2017 01:33 AM


Related issues

Copied to mgr - Backport #21656: luminous: crash on DaemonPerfCounters::update In Progress

History

#1 Updated by Chang Liu 4 months ago

#10 0x000055cacf5186fd in ceph::buffer::list::copy_in (crc_reset=<optimized out>, src=<optimized out>, len=<optimized out>, off=<optimized out>, this=<optimized out>)
    at /home/liuchang/WorkSpace/ceph/src/common/buffer.cc:1862
#11 0x000055cacf6d02dd in ceph::buffer::list::iterator_impl<false>::copy (this=0x7f2f0f38e600, len=<optimized out>, dest=0x7f2f0f38e5f0 "") at /home/liuchang/WorkSpace/ceph/src/common/buffer.cc:1214
#12 0x000055cacf57a6a1 in decode_raw<ceph_le<unsigned long long> > (p=..., t=...) at /home/liuchang/WorkSpace/ceph/src/include/encoding.h:61
#13 decode (p=..., v=<synthetic pointer>) at /home/liuchang/WorkSpace/ceph/src/include/encoding.h:107
#14 DaemonPerfCounters::update (this=this@entry=0x55cadbfcfb18, report=report@entry=0x55cadbfc9e00) at /home/liuchang/WorkSpace/ceph/src/mgr/DaemonState.cc:142
#15 0x000055cacf57f7ee in DaemonServer::handle_report (this=this@entry=0x55cadab7f350, m=m@entry=0x55cadbfc9e00) at /home/liuchang/WorkSpace/ceph/src/mgr/DaemonServer.cc:406
#16 0x000055cacf58cdc4 in DaemonServer::ms_dispatch (this=0x55cadab7f350, m=0x55cadbfc9e00) at /home/liuchang/WorkSpace/ceph/src/mgr/DaemonServer.cc:257
#17 0x000055cacfa17eda in Messenger::ms_deliver_dispatch (m=0x55cadbfc9e00, this=0x55cadaac4800) at /home/liuchang/WorkSpace/ceph/src/msg/Messenger.h:625
#18 DispatchQueue::entry (this=0x55cadaac4980) at /home/liuchang/WorkSpace/ceph/src/msg/DispatchQueue.cc:197
#19 0x000055cacf7b8a7d in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /home/liuchang/WorkSpace/ceph/src/msg/DispatchQueue.h:101
#20 0x00007f2f1a8526ba in start_thread (arg=0x7f2f0f390700) at pthread_create.c:333
#21 0x00007f2f198be3dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) f 14
#14 DaemonPerfCounters::update (this=this@entry=0x55cadbfcfb18, report=report@entry=0x55cadbfc9e00) at /home/liuchang/WorkSpace/ceph/src/mgr/DaemonState.cc:142
142         ::decode(val, p);
(gdb) l 135
130
131       const auto now = ceph_clock_now();
132
133       // Parse packed data according to declared set of types
134       bufferlist::iterator p = report->packed.begin();
135       DECODE_START(1, p);
136       for (const auto &t_path : declared_types) {
137         const auto &t = types.at(t_path);
138         uint64_t val = 0;
139         uint64_t avgcount = 0;
(gdb) l
140         uint64_t avgcount2 = 0;
141
142         ::decode(val, p);
143         if (t.type & PERFCOUNTER_LONGRUNAVG) {
144           ::decode(avgcount, p);
145           ::decode(avgcount2, p);
146         }
147         // TODO: interface for insertion of avgs
148         instances[t_path].push(now, val);
149       }
(gdb) down
#13 decode (p=..., v=<synthetic pointer>) at /home/liuchang/WorkSpace/ceph/src/include/encoding.h:107
107     WRITE_INTTYPE_ENCODER(uint64_t, le64)
(gdb) down
#12 0x000055cacf57a6a1 in decode_raw<ceph_le<unsigned long long> > (p=..., t=...) at /home/liuchang/WorkSpace/ceph/src/include/encoding.h:61
61        p.copy(sizeof(t), (char*)&t);
(gdb) down
#11 0x000055cacf6d02dd in ceph::buffer::list::iterator_impl<false>::copy (this=0x7f2f0f38e600, len=<optimized out>, dest=0x7f2f0f38e5f0 "") at /home/liuchang/WorkSpace/ceph/src/common/buffer.cc:1214
1214          assert(p->length() > 0);

version is "ceph version 12.1.2-1333-gb8aafc6 (b8aafc63ce9b8af3be065449cba78f5d8cb6ee70) mimic (dev)"

#2 Updated by Brad Hubbard 3 months ago

  • Category set to ceph-mgr
  • Priority changed from Normal to High

#3 Updated by Katie Holly 3 months ago

Same issues here.

The cluster has been upgraded from Jewel to Kraken and then to Luminous on Ubuntu 16.04, ceph-mgr was deployed with ceph-deploy while the cluster was already http://xor.meo.ws/OyvoZF8v0aWq0D-rOOg2y6u03fp_yzYv.txtrunning Luminous.

Cluster health:

  cluster:
    id:     ceb0b908-8566-405c-8bd5-88a5562d4ec9
    health: HEALTH_WARN
            pool eu-de-fks-1.rgw.buckets.data has many more objects per pg than average (too few pgs?)
            all OSDs are running luminous or later but require_osd_release < luminous
            no active mgr

  services:
    mon: 3 daemons, quorum fvz-meph-de-fks-d,fvz-meph-de-fks-f,fvz-meph-de-fks-e
    mgr: no daemons active
    mds: cfs-1/1/1 up  {0=fvz-meph-de-fks-e=up:active}, 1 up:standby-replay
    osd: 50 osds: 50 up, 50 in
    rgw: 1 daemon active

  data:
    pools:   27 pools, 15084 pgs
    objects: 19864k objects, 62350 GB
    usage:   121 TB used, 75816 GB / 195 TB avail
    pgs:     15080 active+clean
             4     active+clean+scrubbing+deep

  io:
    client:   971 kB/s rd, 272 kB/s wr, 24 op/s rd, 7 op/s wr

Related log entries after initial deployment of ceph-mgr:

2017-09-11 06:41:53.535025 7fb5aa7b8500  0 set uid:gid to 64045:64045 (ceph:ceph)
2017-09-11 06:41:53.535048 7fb5aa7b8500  0 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown), pid 17031
2017-09-11 06:41:53.536853 7fb5aa7b8500  0 pidfile_write: ignore empty --pid-file
2017-09-11 06:41:53.541880 7fb5aa7b8500  1 mgr send_beacon standby
2017-09-11 06:41:54.547383 7fb5a1aec700  1 mgr handle_mgr_map Activating!
2017-09-11 06:41:54.547575 7fb5a1aec700  1 mgr handle_mgr_map I am now activating
2017-09-11 06:41:54.650677 7fb59dae4700  1 mgr start Creating threads for 0 modules
2017-09-11 06:41:54.650696 7fb59dae4700  1 mgr send_beacon active
2017-09-11 06:41:55.542252 7fb59eae6700  1 mgr send_beacon active
2017-09-11 06:41:55.542627 7fb59eae6700  1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
2017-09-11 06:41:57.542697 7fb59eae6700  1 mgr send_beacon active
[... lots of "send_beacon active" messages ...]
2017-09-11 07:29:29.640892 7fb59eae6700  1 mgr send_beacon active
2017-09-11 07:29:30.866366 7fb59d2e3700 -1 *** Caught signal (Aborted) **
 in thread 7fb59d2e3700 thread_name:ms_dispatch

 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
 1: (()+0x3de6b4) [0x55f6640e16b4]
 2: (()+0x11390) [0x7fb5a8fef390]
 3: (gsignal()+0x38) [0x7fb5a7f7f428]
 4: (abort()+0x16a) [0x7fb5a7f8102a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fb5a88c284d]
 6: (()+0x8d6b6) [0x7fb5a88c06b6]
 7: (()+0x8d701) [0x7fb5a88c0701]
 8: (()+0x8d919) [0x7fb5a88c0919]
 9: (()+0x2318ad) [0x55f663f348ad]
 10: (()+0x3e91bd) [0x55f6640ec1bd]
 11: (DaemonPerfCounters::update(MMgrReport*)+0x821) [0x55f663f96651]
 12: (DaemonServer::handle_report(MMgrReport*)+0x1ae) [0x55f663f9b79e]+
 13: (DaemonServer::ms_dispatch(Message*)+0x64) [0x55f663fa8d64]
 14: (DispatchQueue::entry()+0xf4a) [0x55f664438f3a]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x55f6641dc44d]
 16: (()+0x76ba) [0x7fb5a8fe56ba]
 17: (clone()+0x6d) [0x7fb5a80513dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
[...]
   -19> 2017-09-11 07:29:30.637691 7fb59d2e3700  1 -- 5.9.136.6:6800/17031 <== osd.60 136.243.22.99:6812/23851 562 ==== mgrreport(osd.60 +0-0 packed 4966) v4 ==== 4989+0+0 (1558199989 0 0) 0x55f66feb1500 con 0x55f66dda9800
   -18> 2017-09-11 07:29:30.637721 7fb59d2e3700  4 mgr.server handle_report from 0x55f66dda9800 osd,60
   -17> 2017-09-11 07:29:30.641446 7fb5a42f1700  5 -- 5.9.136.6:6800/17031 >> 136.243.22.99:6812/23851 conn(0x55f66dda9800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=1). rx osd.60 seq 563 0x55f66feb1b00 pg_stats(285 pgs tid 0 v 0) v1
   -16> 2017-09-11 07:29:30.641523 7fb59d2e3700  1 -- 5.9.136.6:6800/17031 <== osd.60 136.243.22.99:6812/23851 563 ==== pg_stats(285 pgs tid 0 v 0) v1 ==== 164246+0+0 (1188816834 0 0) 0x55f66feb1b00 con 0x55f66dda9800
   -15> 2017-09-11 07:29:30.765765 7fb5a4af2700  5 -- 5.9.136.6:6800/17031 >> 5.9.155.119:6801/10380 conn(0x55f66d9e3800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=1). rx osd.7 seq 570 0x55f670207c00 mgrreport(osd.7 +0-0 packed 4966) v4
   -14> 2017-09-11 07:29:30.765836 7fb59d2e3700  1 -- 5.9.136.6:6800/17031 <== osd.7 5.9.155.119:6801/10380 570 ==== mgrreport(osd.7 +0-0 packed 4966) v4 ==== 4988+0+0 (1276362683 0 0) 0x55f670207c00 con 0x55f66d9e3800
   -13> 2017-09-11 07:29:30.765865 7fb59d2e3700  4 mgr.server handle_report from 0x55f66d9e3800 osd,7
   -12> 2017-09-11 07:29:30.768446 7fb5a4af2700  5 -- 5.9.136.6:6800/17031 >> 5.9.155.119:6801/10380 conn(0x55f66d9e3800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=1). rx osd.7 seq 571 0x55f66fe64300 pg_stats(244 pgs tid 0 v 0) v1
   -11> 2017-09-11 07:29:30.768518 7fb59d2e3700  1 -- 5.9.136.6:6800/17031 <== osd.7 5.9.155.119:6801/10380 571 ==== pg_stats(244 pgs tid 0 v 0) v1 ==== 140636+0+0 (841178967 0 0) 0x55f66fe64300 con 0x55f66d9e3800
   -10> 2017-09-11 07:29:30.800429 7fb5a4af2700  5 -- 5.9.136.6:0/1560413667 >> 5.9.136.6:6789/0 conn(0x55f66d75f800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=7780524 cs=1 l=1). rx mon.0 seq 3378 0x55f670f7d680 log(1 entries from seq 5470 at 2017-09-11 07:29:29.494911) v1
    -9> 2017-09-11 07:29:30.800497 7fb5a1aec700  1 -- 5.9.136.6:0/1560413667 <== mon.0 5.9.136.6:6789/0 3378 ==== log(1 entries from seq 5470 at 2017-09-11 07:29:29.494911) v1 ==== 303+0+0 (1078424852 0 0) 0x55f670f7d680 con 0x55f66d75f800
    -8> 2017-09-11 07:29:30.800542 7fb5a1aec700  4 mgr ms_dispatch active log(1 entries from seq 5470 at 2017-09-11 07:29:29.494911) v1
    -7> 2017-09-11 07:29:30.800555 7fb5a1aec700  4 mgr ms_dispatch log(1 entries from seq 5470 at 2017-09-11 07:29:29.494911) v1
    -6> 2017-09-11 07:29:30.843630 7fb5a3af0700  5 -- 5.9.136.6:6800/17031 >> 5.9.136.6:0/7871 conn(0x55f66de53800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1 cs=1 l=1). rx mon.0 seq 573 0x55f66fa4d500 mgrreport(mon.fvz-meph-de-fks-d +0-0 packed 1630) v4
    -5> 2017-09-11 07:29:30.843701 7fb59d2e3700  1 -- 5.9.136.6:6800/17031 <== mon.0 5.9.136.6:0/7871 573 ==== mgrreport(mon.fvz-meph-de-fks-d +0-0 packed 1630) v4 ==== 1668+0+0 (925211312 0 0) 0x55f66fa4d500 con 0x55f66de53800
    -4> 2017-09-11 07:29:30.843732 7fb59d2e3700  4 mgr.server handle_report from 0x55f66de53800 mon,fvz-meph-de-fks-d
    -3> 2017-09-11 07:29:30.849910 7fb5a42f1700  5 -- 5.9.136.6:6800/17031 >> 136.243.138.74:0/3991168366 conn(0x55f66f5ce800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=3 cs=1 l=1). rx client.43334541 seq 3 0x55f66f693e00 mgrreport(rgw.docker +103-0 packed 2294) v4
    -2> 2017-09-11 07:29:30.849984 7fb59d2e3700  1 -- 5.9.136.6:6800/17031 <== client.43334541 136.243.138.74:0/3991168366 3 ==== mgrreport(rgw.docker +103-0 packed 2294) v4 ==== 10526+0+0 (1417840409 0 0) 0x55f66f693e00 con 0x55f66f5ce800
    -1> 2017-09-11 07:29:30.850014 7fb59d2e3700  4 mgr.server handle_report from 0x55f66f5ce800 rgw,docker
     0> 2017-09-11 07:29:30.866366 7fb59d2e3700 -1 *** Caught signal (Aborted) **
 in thread 7fb59d2e3700 thread_name:ms_dispatch

 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
 1: (()+0x3de6b4) [0x55f6640e16b4]
 2: (()+0x11390) [0x7fb5a8fef390]
 3: (gsignal()+0x38) [0x7fb5a7f7f428]
 4: (abort()+0x16a) [0x7fb5a7f8102a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fb5a88c284d]
 6: (()+0x8d6b6) [0x7fb5a88c06b6]
 7: (()+0x8d701) [0x7fb5a88c0701]
 8: (()+0x8d919) [0x7fb5a88c0919]
 9: (()+0x2318ad) [0x55f663f348ad]
 10: (()+0x3e91bd) [0x55f6640ec1bd]
 11: (DaemonPerfCounters::update(MMgrReport*)+0x821) [0x55f663f96651]
 12: (DaemonServer::handle_report(MMgrReport*)+0x1ae) [0x55f663f9b79e]
 13: (DaemonServer::ms_dispatch(Message*)+0x64) [0x55f663fa8d64]
 14: (DispatchQueue::entry()+0xf4a) [0x55f664438f3a]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x55f6641dc44d]
 16: (()+0x76ba) [0x7fb5a8fe56ba]
 17: (clone()+0x6d) [0x7fb5a80513dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
[...]
--- end dump of recent events ---

I tried to manually run ceph-mgr with

/usr/bin/ceph-mgr -f --cluster ceph --id $HOSTNAME --setuser ceph --setgroup ceph

which immediately fails to keep running for longer than a few seconds.
stdout and logfile are attached, objdump: http://xor.meo.ws/oxo2q8h_oKAG6q7mARvNKkR_JdYjn89B.txt

#4 Updated by Katie Holly 3 months ago

TLDR at EOF

Some more information to my setup:

I'm running a Docker swarm with 15 radosgw instances, the used Docker image is self-built and includes a static "--name=client.rgw.docker" argument. All radosgw instances share the exact same Docker image and as such also the same "client.rgw.docker" name. This was purely a design choice to simplify deployment without having to keep track of handed out keys.

Brad Hubbard and Ben Hines both helped me debug this issue on the ceph-users list and Ben noted that the shared --name argument on all radosgw instances might confuse ceph-mgr in some way.

I updated the Docker image to automatically generate client keys when starting a new radosgw instance and they no longer share a single --name, but have all completely unique, random names which seems to have fixed this issue for me.

TLDR/Long story short: Running more than one radosgw instance with the exact same --name value causes this ceph-mgr crashes for me, especially when running 15 of them at the same time. Generating an unique key or each radosgw instance solves this issue.

#5 Updated by John Spray 3 months ago

  • Assignee set to John Spray

Thanks for the debugging -- when the RGW <-> mgr stuff was added we knew that it would require unique daemon names to produce sane output, but obviously it shouldn't crash when two daemons have the same name.

Fix incoming.

#6 Updated by John Spray 3 months ago

  • Status changed from New to In Progress

#7 Updated by John Spray 3 months ago

  • Status changed from In Progress to Need Review
  • Backport set to luminous

#8 Updated by Sage Weil 3 months ago

  • Status changed from Need Review to Pending Backport

#9 Updated by Nathan Cutler 2 months ago

  • Copied to Backport #21656: luminous: crash on DaemonPerfCounters::update added

Also available in: Atom PDF