Project

General

Profile

Actions

Bug #40029

closed

ceph-mon: Caught signal (Aborted) in (CrushWrapper::update_choose_args(CephContext*)+0x2fa) [0x7f516505614a]

Added by Iain Buclaw almost 5 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When adding a new osd, all primary monitors crashed.

2019-05-27 09:57:24.536 7f515ca31700  0 log_channel(cluster) log [DBG] : osdmap e23907: 57 total, 55 up, 55 in
2019-05-27 09:57:24.536 7f515ca31700  1 mon.host-247@1(leader).auth v48405 _upgrade_format_to_mimic upgrading from format 2 to 3
2019-05-27 09:57:24.536 7f515ca31700  0 log_channel(cluster) log [DBG] : mgrmap e1234: host-257(active), standbys: host-259, host-252, host-247, host-262
2019-05-27 09:57:24.536 7f515ca31700  0 log_channel(cluster) log [WRN] : Health check failed: 1/5 mons down, quorum host-247,host-262,host-259,host-257 (MON_DOWN)
2019-05-27 09:57:24.544 7f515ca31700  0 mon.host-247@1(leader).osd e23907 create-or-move crush item name 'osd.55' initial_weight 0.4411 at location {host=host-382,root=default}
2019-05-27 09:57:24.548 7f515ca31700 -1 *** Caught signal (Aborted) **
 in thread 7f515ca31700 thread_name:fn_monstore

 ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
 1: /usr/bin/ceph-mon() [0x849490]
 2: (()+0x11390) [0x7f5164556390]
 3: (gsignal()+0x38) [0x7f516319e428]
 4: (abort()+0x16a) [0x7f51631a002a]
 5: (tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem)+0x22e) [0x7f51640995ce]
 6: (()+0x1375f) [0x7f516408d75f]
 7: (operator delete[](void*)+0x1fd) [0x7f51640b066d]
 8: (CrushWrapper::update_choose_args(CephContext*)+0x2fa) [0x7f516505614a]
 9: (CrushWrapper::remove_root(int)+0x1cf) [0x7f516505af3f]
 10: (CrushWrapper::remove_root(int)+0x70) [0x7f516505ade0]
 11: (CrushWrapper::trim_roots_with_class()+0x1a7) [0x7f516505d047]
 12: (CrushWrapper::rebuild_roots_with_classes()+0xc2) [0x7f516506acf2]
 13: (CrushWrapper::insert_item(CephContext*, int, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0x450) [0x7f516506c290]
 14: (CrushWrapper::create_or_move_item(CephContext*, int, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0xf2) [0x7f516506d132]
 15: (OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&)+0x9ff5) [0x7ad945]
 16: (OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>)+0x252) [0x7c6ae2]
 17: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x170) [0x7c6e40]
 18: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x996) [0x757626]
 19: (PaxosService::C_RetryMessage::_finish(int)+0x61) [0x6bd281]
 20: (C_MonOp::finish(int)+0x43) [0x66dcb3]
 21: (Context::complete(int)+0x9) [0x66cec9]
 22: (void finish_contexts<Context>(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xb3) [0x6744c3]
 23: (Paxos::finish_round()+0x98) [0x74d928]
 24: (Paxos::commit_finish()+0x565) [0x74f865]
 25: (C_Committed::finish(int)+0x31) [0x753711]
 26: (Context::complete(int)+0x9) [0x66cec9]
 27: (MonitorDBStore::C_DoTransaction::finish(int)+0x97) [0x750ef7]
 28: (Context::complete(int)+0x9) [0x66cec9]
 29: (Finisher::finisher_thread_entry()+0x12e) [0x7f5164c61b5e]
 30: (()+0x76ba) [0x7f516454c6ba]
 31: (clone()+0x6d) [0x7f516327041d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Looking at the crush tree, the osds its crashing on are all not moved into a host.

ID  CLASS WEIGHT   REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS TYPE NAME       
 -1       26.90276        -  27 TiB 7.2 TiB  19 TiB 27.11 1.00   - root default    
 -2        0.88199        - 883 GiB 323 GiB 560 GiB 36.60 1.35   -     host host-247 
  0   ssd  0.44099  1.00000 442 GiB 163 GiB 279 GiB 36.90 1.36  77         osd.0   
  1   ssd  0.44099  1.00000 442 GiB 160 GiB 281 GiB 36.30 1.34  77         osd.1   
 -3        0.88199        - 897 GiB 375 GiB 522 GiB 41.80 1.54   -     host host-248 
  2   ssd  0.44099  1.00000 442 GiB 170 GiB 271 GiB 38.61 1.42  79         osd.2   
  3   ssd  0.44099  1.00000 456 GiB 205 GiB 251 GiB 44.90 1.66  77         osd.3   
 -7        0.88199        - 883 GiB 344 GiB 539 GiB 38.96 1.44   -     host host-252 
 10   ssd  0.44099  1.00000 442 GiB 172 GiB 270 GiB 38.84 1.43  81         osd.10  
 11   ssd  0.44099  1.00000 442 GiB 173 GiB 269 GiB 39.08 1.44  77         osd.11  
 -8        0.88199        - 883 GiB 282 GiB 602 GiB 31.88 1.18   -     host host-253 
 12   ssd  0.44099  1.00000 442 GiB 163 GiB 278 GiB 36.96 1.36  80         osd.12  
 13   ssd  0.44099  1.00000 442 GiB 118 GiB 323 GiB 26.80 0.99  59         osd.13  
 -9        0.88199        - 883 GiB 331 GiB 552 GiB 37.50 1.38   -     host host-254 
 14   ssd  0.44099  1.00000 442 GiB 165 GiB 277 GiB 37.32 1.38  74         osd.14  
 15   ssd  0.44099  1.00000 442 GiB 166 GiB 275 GiB 37.68 1.39  78         osd.15  
-10        0.88199        - 883 GiB 342 GiB 541 GiB 38.72 1.43   -     host host-255 
 16   ssd  0.44099  1.00000 442 GiB 161 GiB 281 GiB 36.37 1.34  75         osd.16  
 17   ssd  0.44099  1.00000 442 GiB 181 GiB 260 GiB 41.07 1.52  87         osd.17  
-11        0.88199        - 883 GiB 343 GiB 540 GiB 38.90 1.44   -     host host-256 
 18   ssd  0.44099  1.00000 442 GiB 171 GiB 271 GiB 38.68 1.43  80         osd.18  
 19   ssd  0.44099  1.00000 442 GiB 173 GiB 269 GiB 39.11 1.44  84         osd.19  
-12        0.88199        - 883 GiB 334 GiB 549 GiB 37.83 1.40   -     host host-257 
 20   ssd  0.44099  1.00000 442 GiB 171 GiB 271 GiB 38.72 1.43  82         osd.20  
 21   ssd  0.44099  1.00000 442 GiB 163 GiB 278 GiB 36.95 1.36  78         osd.21  
-13        0.88199        - 883 GiB 326 GiB 557 GiB 36.89 1.36   -     host host-258 
 22   ssd  0.44099  1.00000 442 GiB 166 GiB 275 GiB 37.66 1.39  88         osd.22  
 23   ssd  0.44099  1.00000 442 GiB 159 GiB 282 GiB 36.11 1.33  78         osd.23  
-14        0.88199        - 883 GiB 350 GiB 533 GiB 39.65 1.46   -     host host-259 
 24   ssd  0.44099  1.00000 442 GiB 170 GiB 272 GiB 38.40 1.42  81         osd.24  
 25   ssd  0.44099  1.00000 442 GiB 181 GiB 261 GiB 40.90 1.51  88         osd.25  
-15        0.88199        - 883 GiB 335 GiB 549 GiB 37.88 1.40   -     host host-260 
 26   ssd  0.44099  1.00000 442 GiB 169 GiB 272 GiB 38.36 1.42  83         osd.26  
 27   ssd  0.44099  1.00000 442 GiB 165 GiB 276 GiB 37.41 1.38  81         osd.27  
-16        0.88199        - 883 GiB 350 GiB 533 GiB 39.66 1.46   -     host host-261 
 28   ssd  0.44099  1.00000 442 GiB 172 GiB 269 GiB 39.00 1.44  87         osd.28  
 29   ssd  0.44099  1.00000 442 GiB 178 GiB 263 GiB 40.33 1.49  84         osd.29  
-17        0.88199        - 883 GiB 341 GiB 542 GiB 38.60 1.42   -     host host-262 
 30   ssd  0.44099  1.00000 442 GiB 171 GiB 270 GiB 38.83 1.43  88         osd.30  
 31   ssd  0.44099  1.00000 442 GiB 169 GiB 272 GiB 38.37 1.42  86         osd.31  
-18        0.88199        - 883 GiB 324 GiB 559 GiB 36.74 1.36   -     host host-263 
 32   ssd  0.44099  1.00000 442 GiB 151 GiB 291 GiB 34.20 1.26  73         osd.32  
 33   ssd  0.44099  1.00000 442 GiB 173 GiB 268 GiB 39.29 1.45  81         osd.33  
-23        0.88199        - 883 GiB 344 GiB 539 GiB 38.92 1.44   -     host host-349 
  4   ssd  0.44099  1.00000 442 GiB 172 GiB 270 GiB 38.96 1.44  86         osd.4   
  5   ssd  0.44099  1.00000 442 GiB 172 GiB 270 GiB 38.88 1.43  90         osd.5   
-21        0.88199        - 883 GiB 318 GiB 565 GiB 35.99 1.33   -     host host-350 
  6   ssd  0.44099  1.00000 442 GiB 161 GiB 281 GiB 36.46 1.35  79         osd.6   
  7   ssd  0.44099  1.00000 442 GiB 157 GiB 285 GiB 35.52 1.31  75         osd.7   
-19        0.88199        - 883 GiB 331 GiB 552 GiB 37.49 1.38   -     host host-351 
  8   ssd  0.44099  1.00000 442 GiB 158 GiB 284 GiB 35.75 1.32  77         osd.8   
  9   ssd  0.44099  1.00000 442 GiB 173 GiB 268 GiB 39.23 1.45  86         osd.9   
-25        0.88199        - 883 GiB 342 GiB 541 GiB 38.69 1.43   -     host host-352 
 34   ssd  0.44099  1.00000 442 GiB 172 GiB 269 GiB 38.97 1.44  83         osd.34  
 35   ssd  0.44099  1.00000 442 GiB 170 GiB 272 GiB 38.41 1.42  85         osd.35  
-38        0.88199        - 883 GiB 327 GiB 556 GiB 37.00 1.36   -     host host-353 
 36   ssd  0.44099  1.00000 442 GiB 159 GiB 282 GiB 36.03 1.33  79         osd.36  
 37   ssd  0.44099  1.00000 442 GiB 168 GiB 274 GiB 37.97 1.40  84         osd.37  
-47        0.88217        - 903 GiB  78 GiB 825 GiB  8.63 0.32   -     host host-369 
 41   ssd  0.44109  1.00000 452 GiB  40 GiB 412 GiB  8.77 0.32  29         osd.41  
 43   ssd  0.44109  1.00000 452 GiB  38 GiB 413 GiB  8.49 0.31  27         osd.43  
-53        0.88217        - 903 GiB  56 GiB 848 GiB  6.19 0.23   -     host host-370 
 46   ssd  0.44109  1.00000 452 GiB  54 GiB 398 GiB 11.91 0.44  33         osd.46  
 61   ssd  0.44109  1.00000 452 GiB 2.1 GiB 450 GiB  0.46 0.02  13         osd.61  
-55        0.88217        - 903 GiB 101 GiB 803 GiB 11.14 0.41   -     host host-372 
 48   ssd  0.44109  1.00000 452 GiB  49 GiB 402 GiB 10.94 0.40  31         osd.48  
 51   ssd  0.44109  1.00000 452 GiB  51 GiB 401 GiB 11.33 0.42  36         osd.51  
-49        0.88217        - 903 GiB  87 GiB 816 GiB  9.67 0.36   -     host host-373 
 42   ssd  0.44109  1.00000 452 GiB  45 GiB 407 GiB 10.00 0.37  30         osd.42  
 47   ssd  0.44109  1.00000 452 GiB  42 GiB 410 GiB  9.33 0.34  21         osd.47  
-61        0.88217        - 903 GiB 142 GiB 762 GiB 15.70 0.58   -     host host-374 
 53   ssd  0.44109  1.00000 452 GiB  79 GiB 373 GiB 17.52 0.65  35         osd.53  
 64   ssd  0.44109  1.00000 452 GiB  63 GiB 389 GiB 13.87 0.51  34         osd.64  
-44        0.88217        - 903 GiB  88 GiB 815 GiB  9.78 0.36   -     host host-375 
 40   ssd  0.44109  1.00000 452 GiB  43 GiB 409 GiB  9.53 0.35  29         osd.40  
 49   ssd  0.44109  1.00000 452 GiB  45 GiB 406 GiB 10.02 0.37  27         osd.49  
-51        0.88217        - 903 GiB 110 GiB 794 GiB 12.14 0.45   -     host host-376 
 45   ssd  0.44109  1.00000 452 GiB  68 GiB 384 GiB 14.98 0.55  40         osd.45  
 56   ssd  0.44109  1.00000 452 GiB  42 GiB 410 GiB  9.30 0.34  25         osd.56  
-63        0.88217        - 903 GiB  80 GiB 823 GiB  8.86 0.33   -     host host-377 
 54   ssd  0.44109  1.00000 452 GiB  43 GiB 409 GiB  9.55 0.35  22         osd.54  
 58   ssd  0.44109  1.00000 452 GiB  37 GiB 415 GiB  8.17 0.30  26         osd.58  
-42        0.88217        - 903 GiB  86 GiB 818 GiB  9.49 0.35   -     host host-378 
 39   ssd  0.44109  1.00000 452 GiB  40 GiB 411 GiB  8.93 0.33  25         osd.39  
 44   ssd  0.44109  1.00000 452 GiB  45 GiB 406 GiB 10.05 0.37  32         osd.44  
-59        0.88217        - 903 GiB  77 GiB 827 GiB  8.48 0.31   -     host host-379 
 52   ssd  0.44109  1.00000 452 GiB  47 GiB 404 GiB 10.51 0.39  28         osd.52  
 68   ssd  0.44109  1.00000 452 GiB  29 GiB 423 GiB  6.45 0.24  20         osd.68  
-40        0.44109        - 452 GiB  47 GiB 405 GiB 10.34 0.38   -     host host-380 
 38   ssd  0.44109  1.00000 452 GiB  47 GiB 405 GiB 10.34 0.38  34         osd.38  
-57        0.88217        - 903 GiB  56 GiB 848 GiB  6.17 0.23   -     host host-381 
 50   ssd  0.44109  1.00000 452 GiB  33 GiB 418 GiB  7.35 0.27  21         osd.50  
 63   ssd  0.44109  1.00000 452 GiB  23 GiB 429 GiB  4.98 0.18  25         osd.63  
 55              0        0     0 B     0 B     0 B     0    0   0 osd.55          
 57              0        0     0 B     0 B     0 B     0    0   0 osd.57          
 59   ssd        0        0     0 B     0 B     0 B     0    0   0 osd.59          
 60   ssd        0        0     0 B     0 B     0 B     0    0   0 osd.60          
 62   ssd        0        0     0 B     0 B     0 B     0    0   0 osd.62          
 65   ssd        0        0     0 B     0 B     0 B     0    0   0 osd.65          
 66              0        0     0 B     0 B     0 B     0    0   0 osd.66          
 67   ssd        0        0     0 B     0 B     0 B     0    0   0 osd.67          
                      TOTAL  27 TiB 7.2 TiB  19 TiB 27.11                          
MIN/MAX VAR: 0.02/1.66  STDDEV: 13.96

Fixed the crash by removing all osds from the crush tree (55, 57, 59, 60, 62, 65, 66, 67), although had to constantly restart the crashing monitors in order to get the first change in. Once all were removed, cluster became stable again.


Files

ceph-mon.host-369.log.xz (603 KB) ceph-mon.host-369.log.xz Iain Buclaw, 05/28/2019 09:43 AM
crushmap.txt (14.2 KB) crushmap.txt Iain Buclaw, 05/28/2019 10:18 AM

Related issues 1 (0 open1 closed)

Related to RADOS - Bug #39978: Adding OSD to Luminous Cluster will crash the active monDuplicate05/20/2019

Actions
Actions #1

Updated by Iain Buclaw almost 5 years ago

Recreating the OSDs, it seems that the monitors consistently crash when creating osd.57. And they consistently recover after removing osd.57.

Actions #2

Updated by Iain Buclaw almost 5 years ago

Happens on any host I create osd.57 on.

Actions #3

Updated by Zheng Yan almost 5 years ago

  • Project changed from CephFS to Ceph
Actions #4

Updated by Iain Buclaw almost 5 years ago

Attached logs of primary monitor with:
debug mon 10
debug ms 1

Started prior to osd-57 being added, and stopped after osd-57 was removed.

Actions #5

Updated by Iain Buclaw almost 5 years ago

I removed osd.0 and osd.1 from host-247, and re-ran deployment of osds to host-371. Both got added successfully.

Pushed out osd.57 to host-382, and the monitors crashed again. So it really is only this one osd number that can't be added to the crushmap for whatever reason.

Attaching crush map, in case it helps debugging.

Actions #6

Updated by Patrick Donnelly almost 5 years ago

  • Project changed from Ceph to RADOS
  • Start date deleted (05/27/2019)
  • Component(RADOS) Monitor added
Actions #7

Updated by Neha Ojha almost 5 years ago

  • Priority changed from Normal to Urgent
Actions #8

Updated by Greg Farnum almost 5 years ago

  • Related to Bug #39978: Adding OSD to Luminous Cluster will crash the active mon added
Actions #9

Updated by Greg Farnum over 4 years ago

  • Priority changed from Urgent to High
Actions #10

Updated by Florian Haas over 4 years ago

  • Affected Versions v12.2.12 added

We just bumped into this, on Luminous (12.2.12). It actually caused us momentary loss of quorum.

Sequence of events was this:

  • We run ceph-ansible to add OSDs on a new node (cc-ceph-osd26-fra1).
  • First OSD being added (osd.59 in our case) is added to the crushmap, but does not create a host bucket for itself.
  • Current mon leader crashes, second mon takes over as leader. 2/3 quorum.
  • The ceph osd new process for osd.59 still runs, fails over to other mon, crashes that mon too. 1/3, loss of quorum.

At this stage, we did not dare restart our third mon. But restarting any of the other two would crash them with SIGABRT again.

Recovery:

  • We kill the ceph osd new process for osd.59.
  • We restart the second mon, which rejoins successfully. 2/3 quorum.
  • We remove osd.59 from the crushmap and osdmap with ceph osd purge.
  • We restart the third mon, which rejoins successfully. 3/3 quorum.

I'm really curious as to where this is coming from. We can of course work around it by manually adding CRUSH buckets for hosts, but even if we don't, then surely it shouldn't crash the mon?

Log files

First failing mon (ceph-mon.cc-ceph-osd11-fra1)

   -51> 2019-08-22 10:08:11.896352 7f93797ab700  0 log_channel(cluster) log [DBG] : osdmap e302401: 296 total, 295 up, 295 in
   -50> 2019-08-22 10:08:11.896363 7f93797ab700 10 log_client _send_to_monlog to self
   -49> 2019-08-22 10:08:11.896366 7f93797ab700 10 log_client  log_queue is 6 last_log 6 sent 5 num 6 unsent 1 sending 1
   -48> 2019-08-22 10:08:11.896372 7f93797ab700 10 log_client  will send 2019-08-22 10:08:11.896361 mon.cc-ceph-osd11-fra1 mon.0 10.93.0.111:6789/0 6 : cluster [DBG] osdmap e302401: 296 total, 295 up, 295 in
   -47> 2019-08-22 10:08:11.896388 7f93797ab700  1 -- 10.93.0.111:6789/0 --> 10.93.0.111:6789/0 -- log(1 entries from seq 6 at 2019-08-22 10:08:11.896361) v1 -- 0x563995da2000 con 0
   -46> 2019-08-22 10:08:11.896394 7f937e244700  5 -- 10.93.0.111:6789/0 >> 10.93.0.113:6789/0 conn(0x563995a60000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1202231 cs=1 l=0). rx mon.2 seq 1358036438 0x563995abbc00 forward(osd_beacon(pgs [6.f9,2.a3,3.185,3.33a,3.3ee,3.3fd,7.7a,4.2dd,4.e4,2.1eb,6.40] lec 302400 v302401) v1 caps allow profile osd tid 1732978 con_features 4611087853746454523) v3
   -45> 2019-08-22 10:08:11.896481 7f937e244700  5 -- 10.93.0.111:6789/0 >> 10.93.0.113:6789/0 conn(0x563995a60000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1202231 cs=1 l=0). rx mon.2 seq 1358036439 0x563995aba400 forward(osd_beacon(pgs [2.47,6.119,2.157,6.75,3.3af,4.398,4.2d7,2.1a8,4.61,3.2,2.3e3,2.130] lec 302400 v302401) v1 caps allow profile osd tid 1732979 con_features 4611087853746454523) v3
   -44> 2019-08-22 10:08:11.896559 7f937e244700  5 -- 10.93.0.111:6789/0 >> 10.93.0.113:6789/0 conn(0x563995a60000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1202231 cs=1 l=0). rx mon.2 seq 1358036440 0x563995ab8c00 forward(log(2 entries from seq 648 at 2019-08-22 10:08:00.682075) v1 caps allow profile osd tid 1732980 con_features 4611087853746454523) v3
   -43> 2019-08-22 10:08:11.896615 7f93797ab700  0 log_channel(cluster) log [DBG] : mgrmap e144: cc-ceph-osd11-fra1(active), standbys: cc-ceph-osd13-fra1, cc-ceph-osd12-fra1
   -42> 2019-08-22 10:08:11.896624 7f93797ab700 10 log_client _send_to_monlog to self
   -41> 2019-08-22 10:08:11.896625 7f93797ab700 10 log_client  log_queue is 7 last_log 7 sent 6 num 7 unsent 1 sending 1
   -40> 2019-08-22 10:08:11.896629 7f93797ab700 10 log_client  will send 2019-08-22 10:08:11.896622 mon.cc-ceph-osd11-fra1 mon.0 10.93.0.111:6789/0 7 : cluster [DBG] mgrmap e144: cc-ceph-osd11-fra1(active), standbys: cc-ceph-osd13-fra1, cc-ceph-osd12-fra1
   -39> 2019-08-22 10:08:11.896618 7f937ea45700  5 -- 10.93.0.111:6789/0 >> 10.93.0.112:6789/0 conn(0x563995a2c800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1198034 cs=1 l=0). rx mon.1 seq 1809490043 0x563995d01600 forward(mdsbeacon(2515805112/cc-ceph-osd13-fra1 up:standby seq 301415 v18) v7 caps allow profile mds tid 741101 con_features 575862587620376571) v3
   -38> 2019-08-22 10:08:11.896638 7f93797ab700  1 -- 10.93.0.111:6789/0 --> 10.93.0.111:6789/0 -- log(1 entries from seq 7 at 2019-08-22 10:08:11.896622) v1 -- 0x563995d09980 con 0
   -37> 2019-08-22 10:08:11.896716 7f937e244700  5 -- 10.93.0.111:6789/0 >> 10.93.0.113:6789/0 conn(0x563995a60000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1202231 cs=1 l=0). rx mon.2 seq 1358036441 0x563995ad1500 forward(mgrbeacon mgr.cc-ceph-osd11-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517363527, 10.93.0.111:6800/1219802, 1) v6 caps allow profile mgr tid 1732981 con_features 4611087853746454523) v3
   -36> 2019-08-22 10:08:11.896729 7f937ea45700  5 -- 10.93.0.111:6789/0 >> 10.93.0.112:6789/0 conn(0x563995a2c800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1198034 cs=1 l=0). rx mon.1 seq 1809490044 0x563995d01900 forward(mgrbeacon mgr.cc-ceph-osd13-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517349297, -, 0) v6 caps allow profile mgr tid 741102 con_features 4611087853746454523) v3
   -35> 2019-08-22 10:08:11.896793 7f937ea45700  5 -- 10.93.0.111:6789/0 >> 10.93.0.112:6789/0 conn(0x563995a2c800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1198034 cs=1 l=0). rx mon.1 seq 1809490045 0x563995cec100 forward(mgrbeacon mgr.cc-ceph-osd12-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517870358, -, 0) v6 caps allow profile mgr tid 741103 con_features 4611087853746454523) v3
   -34> 2019-08-22 10:08:11.896815 7f937e244700  5 -- 10.93.0.111:6789/0 >> 10.93.0.113:6789/0 conn(0x563995a60000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1202231 cs=1 l=0). rx mon.2 seq 1358036442 0x563995ad0f00 forward(monmgrreport(0 checks) v1 caps allow profile mgr tid 1732982 con_features 4611087853746454523) v3
   -33> 2019-08-22 10:08:11.896857 7f937ea45700  5 -- 10.93.0.111:6789/0 >> 10.93.0.112:6789/0 conn(0x563995a2c800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1198034 cs=1 l=0). rx mon.1 seq 1809490046 0x563995cea300 forward(log(1 entries from seq 53885 at 2019-08-22 10:08:11.435591) v1 caps allow * tid 741104 con_features 4611087853746454523) v3
   -32> 2019-08-22 10:08:11.896861 7f937e244700  5 -- 10.93.0.111:6789/0 >> 10.93.0.113:6789/0 conn(0x563995a60000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1202231 cs=1 l=0). rx mon.2 seq 1358036443 0x563995ad0c00 forward(log(1 entries from seq 100463 at 2019-08-22 10:08:11.435479) v1 caps allow * tid 1732983 con_features 4611087853746454523) v3
   -31> 2019-08-22 10:08:11.896937 7f937ea45700  5 -- 10.93.0.111:6789/0 >> 10.93.0.112:6789/0 conn(0x563995a2c800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1198034 cs=1 l=0). rx mon.1 seq 1809490047 0x563995cea900 forward(mgrbeacon mgr.cc-ceph-osd13-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517349297, -, 0) v6 caps allow profile mgr tid 741105 con_features 4611087853746454523) v3
   -30> 2019-08-22 10:08:11.897175 7f93797ab700  0 log_channel(cluster) log [INF] : overall HEALTH_OK
   -29> 2019-08-22 10:08:11.897180 7f93797ab700 10 log_client _send_to_monlog to self
   -28> 2019-08-22 10:08:11.897181 7f93797ab700 10 log_client  log_queue is 8 last_log 8 sent 7 num 8 unsent 1 sending 1
   -27> 2019-08-22 10:08:11.897183 7f93797ab700 10 log_client  will send 2019-08-22 10:08:11.897180 mon.cc-ceph-osd11-fra1 mon.0 10.93.0.111:6789/0 8 : cluster [INF] overall HEALTH_OK
   -26> 2019-08-22 10:08:11.897187 7f93797ab700  1 -- 10.93.0.111:6789/0 --> 10.93.0.111:6789/0 -- log(1 entries from seq 8 at 2019-08-22 10:08:11.897180) v1 -- 0x563995d09c00 con 0
   -25> 2019-08-22 10:08:11.897206 7f93797ab700  5 mon.cc-ceph-osd11-fra1@0(leader).paxos(paxos active c 181280484..181281212) is_readable = 1 - now=2019-08-22 10:08:11.897207 lease_expire=2019-08-22 10:08:16.894606 has v0 lc 181281212
   -24> 2019-08-22 10:08:11.897223 7f93797ab700  5 mon.cc-ceph-osd11-fra1@0(leader).paxos(paxos active c 181280484..181281212) is_readable = 1 - now=2019-08-22 10:08:11.897224 lease_expire=2019-08-22 10:08:16.894606 has v0 lc 181281212
   -23> 2019-08-22 10:08:11.897235 7f93797ab700  5 mon.cc-ceph-osd11-fra1@0(leader).paxos(paxos active c 181280484..181281212) is_readable = 1 - now=2019-08-22 10:08:11.897235 lease_expire=2019-08-22 10:08:16.894606 has v0 lc 181281212
   -22> 2019-08-22 10:08:11.897246 7f93797ab700  5 mon.cc-ceph-osd11-fra1@0(leader).paxos(paxos active c 181280484..181281212) is_readable = 1 - now=2019-08-22 10:08:11.897247 lease_expire=2019-08-22 10:08:16.894606 has v0 lc 181281212
   -21> 2019-08-22 10:08:11.897257 7f93797ab700  5 mon.cc-ceph-osd11-fra1@0(leader).paxos(paxos active c 181280484..181281212) is_readable = 1 - now=2019-08-22 10:08:11.897257 lease_expire=2019-08-22 10:08:16.894606 has v0 lc 181281212
   -20> 2019-08-22 10:08:11.897266 7f93797ab700  5 mon.cc-ceph-osd11-fra1@0(leader).paxos(paxos active c 181280484..181281212) is_readable = 1 - now=2019-08-22 10:08:11.897267 lease_expire=2019-08-22 10:08:16.894606 has v0 lc 181281212
   -19> 2019-08-22 10:08:11.897364 7f93797ab700  0 mon.cc-ceph-osd11-fra1@0(leader).osd e302401 create-or-move crush item name 'osd.59' initial_weight 1.6374 at location {host=cc-ceph-osd26-fra1,root=default}
   -18> 2019-08-22 10:08:11.900484 7f937da43700  1 -- 10.93.0.111:6789/0 >> - conn(0x563995a72800 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=30 -
   -17> 2019-08-22 10:08:11.900858 7f937da43700  2 -- 10.93.0.111:6789/0 >> 10.93.0.10:0/561290038 conn(0x563995a72800 :6789 s=STATE_ACCEPTING_WAIT_SEQ pgs=1 cs=1 l=1).handle_connect_msg accept write reply msg done
   -16> 2019-08-22 10:08:11.900917 7f937da43700  2 -- 10.93.0.111:6789/0 >> 10.93.0.10:0/561290038 conn(0x563995a72800 :6789 s=STATE_ACCEPTING_WAIT_SEQ pgs=1 cs=1 l=1)._process_connection accept get newly_acked_seq 0
   -15> 2019-08-22 10:08:11.901137 7f937da43700  5 -- 10.93.0.111:6789/0 >> 10.93.0.10:0/561290038 conn(0x563995a72800 :6789 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1 cs=1 l=1). rx client.? seq 1 0x563995dd8000 auth(proto 0 35 bytes epoch 0) v1
   -14> 2019-08-22 10:08:11.902284 7f937e244700  1 -- 10.93.0.111:6789/0 >> - conn(0x563995ddc000 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=31 -
   -13> 2019-08-22 10:08:11.902419 7f937da43700  1 -- 10.93.0.111:6789/0 >> 10.93.0.10:0/561290038 conn(0x563995a72800 :6789 s=STATE_OPEN pgs=1 cs=1 l=1).read_bulk peer close file descriptor 30
   -12> 2019-08-22 10:08:11.902450 7f937da43700  1 -- 10.93.0.111:6789/0 >> 10.93.0.10:0/561290038 conn(0x563995a72800 :6789 s=STATE_OPEN pgs=1 cs=1 l=1).read_until read failed
   -11> 2019-08-22 10:08:11.902460 7f937da43700  1 -- 10.93.0.111:6789/0 >> 10.93.0.10:0/561290038 conn(0x563995a72800 :6789 s=STATE_OPEN pgs=1 cs=1 l=1).process read tag failed
   -10> 2019-08-22 10:08:11.902472 7f937da43700  1 -- 10.93.0.111:6789/0 >> 10.93.0.10:0/561290038 conn(0x563995a72800 :6789 s=STATE_OPEN pgs=1 cs=1 l=1).fault on lossy channel, failing
    -9> 2019-08-22 10:08:11.902477 7f937da43700  2 -- 10.93.0.111:6789/0 >> 10.93.0.10:0/561290038 conn(0x563995a72800 :6789 s=STATE_OPEN pgs=1 cs=1 l=1)._stop
    -8> 2019-08-22 10:08:11.902557 7f937e244700  2 -- 10.93.0.111:6789/0 >> 10.93.0.84:0/2316970930 conn(0x563995ddc000 :6789 s=STATE_ACCEPTING_WAIT_SEQ pgs=2 cs=1 l=1).handle_connect_msg accept write reply msg done
    -7> 2019-08-22 10:08:11.902660 7f937e244700  2 -- 10.93.0.111:6789/0 >> 10.93.0.84:0/2316970930 conn(0x563995ddc000 :6789 s=STATE_ACCEPTING_WAIT_SEQ pgs=2 cs=1 l=1)._process_connection accept get newly_acked_seq 0
    -6> 2019-08-22 10:08:11.902762 7f937e244700  5 -- 10.93.0.111:6789/0 >> 10.93.0.84:0/2316970930 conn(0x563995ddc000 :6789 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2 cs=1 l=1). rx client.? seq 1 0x563995da4300 auth(proto 0 35 bytes epoch 0) v1
    -5> 2019-08-22 10:08:11.904886 7f937e244700  1 -- 10.93.0.111:6789/0 >> 10.93.0.84:0/2316970930 conn(0x563995ddc000 :6789 s=STATE_OPEN pgs=2 cs=1 l=1).read_bulk peer close file descriptor 31
    -4> 2019-08-22 10:08:11.904907 7f937e244700  1 -- 10.93.0.111:6789/0 >> 10.93.0.84:0/2316970930 conn(0x563995ddc000 :6789 s=STATE_OPEN pgs=2 cs=1 l=1).read_until read failed
    -3> 2019-08-22 10:08:11.904914 7f937e244700  1 -- 10.93.0.111:6789/0 >> 10.93.0.84:0/2316970930 conn(0x563995ddc000 :6789 s=STATE_OPEN pgs=2 cs=1 l=1).process read tag failed
    -2> 2019-08-22 10:08:11.904920 7f937e244700  1 -- 10.93.0.111:6789/0 >> 10.93.0.84:0/2316970930 conn(0x563995ddc000 :6789 s=STATE_OPEN pgs=2 cs=1 l=1).fault on lossy channel, failing
    -1> 2019-08-22 10:08:11.904933 7f937e244700  2 -- 10.93.0.111:6789/0 >> 10.93.0.84:0/2316970930 conn(0x563995ddc000 :6789 s=STATE_OPEN pgs=2 cs=1 l=1)._stop
     0> 2019-08-22 10:08:11.905527 7f93797ab700 -1 *** Caught signal (Aborted) **
 in thread 7f93797ab700 thread_name:ms_dispatch

 ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
 1: (()+0x9e6334) [0x56398b15f334]
 2: (()+0x11390) [0x7f9384bea390]
 3: (gsignal()+0x38) [0x7f9383130428]
 4: (abort()+0x16a) [0x7f938313202a]
 5: (tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem)+0x22e) [0x7f93841a75ce]
 6: (()+0x1375f) [0x7f938419b75f]
 7: (operator delete[](void*)+0x1fd) [0x7f93841be66d]
 8: (CrushWrapper::update_choose_args(CephContext*)+0x2a6) [0x56398b0b8cb6]
 9: (CrushWrapper::remove_root(int)+0x1ac) [0x56398b0c054c]
 10: (CrushWrapper::remove_root(int)+0x6d) [0x56398b0c040d]
 11: (CrushWrapper::trim_roots_with_class()+0x1b7) [0x56398b0c0b37]
 12: (CrushWrapper::rebuild_roots_with_classes()+0xbd) [0x56398b0d191d]
 13: (CrushWrapper::insert_item(CephContext*, int, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0x5b0) [0x56398b0d3630]
 14: (CrushWrapper::create_or_move_item(CephContext*, int, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0x25a) [0x56398b0d474a]
 15: (OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > >&)+0x47c4) [0x56398ad1e654]
 16: (OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>)+0x2e6) [0x56398ad408e6]
 17: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x283) [0x56398ad40d23]
 18: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x9b4) [0x56398acbeb84]
 19: (PaxosService::C_RetryMessage::_finish(int)+0x5e) [0x56398abcabde]
 20: (Context::complete(int)+0x9) [0x56398ab73c39]
 21: (void finish_contexts<Context>(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0x20b) [0x56398ab7df1b]
 22: (Paxos::finish_round()+0x2a6) [0x56398acb57e6]
 23: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xf9d) [0x56398acb6bdd]
 24: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x263) [0x56398acb7533]
 25: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xefe) [0x56398ab6961e]
 26: (Monitor::_ms_dispatch(Message*)+0x6db) [0x56398ab6a1ab]
 27: (Monitor::ms_dispatch(Message*)+0x23) [0x56398ab9b2c3]
 28: (DispatchQueue::entry()+0xf4a) [0x56398b10605a]
 29: (DispatchQueue::DispatchThread::entry()+0xd) [0x56398ae99dcd]
 30: (()+0x76ba) [0x7f9384be06ba]
 31: (clone()+0x6d) [0x7f938320241d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mon.cc-ceph-osd11-fra1.log
--- end dump of recent events ---

Second failing mon (ceph-mon.cc-ceph-osd12-fra1)

   -35> 2019-08-22 10:10:25.685801 7fc992a39700 10 log_client _send_to_monlog to self
   -34> 2019-08-22 10:10:25.685802 7fc992a39700 10 log_client  log_queue is 13 last_log 13 sent 12 num 13 unsent 1 sending 1
   -33> 2019-08-22 10:10:25.685804 7fc992a39700 10 log_client  will send 2019-08-22 10:10:25.685800 mon.cc-ceph-osd12-fra1 mon.1 10.93.0.112:6789/0 13 : cluster [DBG] osd.299 10.93.0.120:6815/2718523 reported failed by osd.117 10.93.0.112:6823/3868055
   -32> 2019-08-22 10:10:25.685811 7fc992a39700  1 -- 10.93.0.112:6789/0 --> 10.93.0.112:6789/0 -- log(1 entries from seq 13 at 2019-08-22 10:10:25.685800) v1 -- 0x560f45291480 con 0
   -31> 2019-08-22 10:10:25.685813 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578919 0x560f45b07900 forward(mgrbeacon mgr.cc-ceph-osd13-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517349297, -, 0) v6 caps allow profile mgr tid 1733651 con_features 4611087853746454523) v3
   -30> 2019-08-22 10:10:25.685842 7fc992a39700  5 mon.cc-ceph-osd12-fra1@1(leader).paxos(paxos active c 181280484..181281234) is_readable = 1 - now=2019-08-22 10:10:25.685843 lease_expire=2019-08-22 10:10:30.682781 has v0 lc 181281234
   -29> 2019-08-22 10:10:25.685932 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578920 0x560f45c21600 forward(mgrbeacon mgr.cc-ceph-osd11-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517363527, 10.93.0.111:6800/1219802, 1) v6 caps allow profile mgr tid 1733652 con_features 4611087853746454523) v3
   -28> 2019-08-22 10:10:25.685986 7fc992a39700  0 mon.cc-ceph-osd12-fra1@1(leader).osd e302401 create-or-move crush item name 'osd.59' initial_weight 1.6374 at location {host=cc-ceph-osd26-fra1,root=default}
   -27> 2019-08-22 10:10:25.686137 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578921 0x560f45c20100 forward(monmgrreport(0 checks) v1 caps allow profile mgr tid 1733653 con_features 4611087853746454523) v3
   -26> 2019-08-22 10:10:25.686214 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578922 0x560f45c20700 forward(mgrbeacon mgr.cc-ceph-osd12-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517870358, -, 0) v6 caps allow profile mgr tid 1733654 con_features 4611087853746454523) v3
   -25> 2019-08-22 10:10:25.686281 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578923 0x560f45c21c00 forward(log(1 entries from seq 505 at 2019-08-22 10:10:10.525732) v1 caps allow profile osd tid 1733655 con_features 4611087853746454523) v3
   -24> 2019-08-22 10:10:25.686353 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578924 0x560f45c1ec00 forward(mdsbeacon(2515805112/cc-ceph-osd13-fra1 up:standby seq 301447 v18) v7 caps allow profile mds tid 1733656 con_features 575862587620376571) v3
   -23> 2019-08-22 10:10:25.686475 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578925 0x560f45b9d000 forward(mgrbeacon mgr.cc-ceph-osd13-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517349297, -, 0) v6 caps allow profile mgr tid 1733657 con_features 4611087853746454523) v3
   -22> 2019-08-22 10:10:25.686569 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578926 0x560f45b07000 forward(mgrbeacon mgr.cc-ceph-osd11-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517363527, 10.93.0.111:6800/1219802, 1) v6 caps allow profile mgr tid 1733658 con_features 4611087853746454523) v3
   -21> 2019-08-22 10:10:25.686805 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578927 0x560f45b05e00 forward(monmgrreport(0 checks) v1 caps allow profile mgr tid 1733659 con_features 4611087853746454523) v3
   -20> 2019-08-22 10:10:25.686906 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578928 0x560f45b06400 forward(log(1 entries from seq 100496 at 2019-08-22 10:10:17.182894) v1 caps allow * tid 1733660 con_features 4611087853746454523) v3
   -19> 2019-08-22 10:10:25.687021 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578929 0x560f45b04000 forward(mgrbeacon mgr.cc-ceph-osd12-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517870358, -, 0) v6 caps allow profile mgr tid 1733661 con_features 4611087853746454523) v3
   -18> 2019-08-22 10:10:25.687144 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578930 0x560f45b86c00 forward(mgrbeacon mgr.cc-ceph-osd13-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517349297, -, 0) v6 caps allow profile mgr tid 1733662 con_features 4611087853746454523) v3
   -17> 2019-08-22 10:10:25.687220 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578931 0x560f45b89300 forward(mgrbeacon mgr.cc-ceph-osd11-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517363527, 10.93.0.111:6800/1219802, 1) v6 caps allow profile mgr tid 1733663 con_features 4611087853746454523) v3
   -16> 2019-08-22 10:10:25.687371 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578932 0x560f45b72c00 forward(monmgrreport(0 checks) v1 caps allow profile mgr tid 1733664 con_features 4611087853746454523) v3
   -15> 2019-08-22 10:10:25.687442 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578933 0x560f45b72f00 forward(mgrbeacon mgr.cc-ceph-osd12-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517870358, -, 0) v6 caps allow profile mgr tid 1733665 con_features 4611087853746454523) v3
   -14> 2019-08-22 10:10:25.687543 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578934 0x560f45b73b00 forward(mdsbeacon(2515805112/cc-ceph-osd13-fra1 up:standby seq 301448 v18) v7 caps allow profile mds tid 1733666 con_features 575862587620376571) v3
   -13> 2019-08-22 10:10:25.687760 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578935 0x560f45b72600 forward(mgrbeacon mgr.cc-ceph-osd13-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517349297, -, 0) v6 caps allow profile mgr tid 1733667 con_features 4611087853746454523) v3
   -12> 2019-08-22 10:10:25.687855 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578936 0x560f45d46600 forward(mgrbeacon mgr.cc-ceph-osd11-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517363527, 10.93.0.111:6800/1219802, 1) v6 caps allow profile mgr tid 1733668 con_features 4611087853746454523) v3
   -11> 2019-08-22 10:10:25.687984 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578937 0x560f45d46f00 forward(monmgrreport(0 checks) v1 caps allow profile mgr tid 1733669 con_features 4611087853746454523) v3
   -10> 2019-08-22 10:10:25.688071 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578938 0x560f45d47500 forward(mgrbeacon mgr.cc-ceph-osd12-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517870358, -, 0) v6 caps allow profile mgr tid 1733670 con_features 4611087853746454523) v3
    -9> 2019-08-22 10:10:25.688140 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578939 0x560f45d47b00 forward(mgrbeacon mgr.cc-ceph-osd13-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517349297, -, 0) v6 caps allow profile mgr tid 1733671 con_features 4611087853746454523) v3
    -8> 2019-08-22 10:10:25.688228 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578940 0x560f45d48700 forward(mgrbeacon mgr.cc-ceph-osd11-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517363527, 10.93.0.111:6800/1219802, 1) v6 caps allow profile mgr tid 1733672 con_features 4611087853746454523) v3
    -7> 2019-08-22 10:10:25.688354 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578941 0x560f45d49000 forward(monmgrreport(0 checks) v1 caps allow profile mgr tid 1733673 con_features 4611087853746454523) v3
    -6> 2019-08-22 10:10:25.688462 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578942 0x560f45d49300 forward(log(5 entries from seq 299620 at 2019-08-22 10:10:15.077964) v1 caps allow profile mgr tid 1733674 con_features 4611087853746454523) v3
    -5> 2019-08-22 10:10:25.688596 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578943 0x560f45d49900 forward(mgrbeacon mgr.cc-ceph-osd12-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517870358, -, 0) v6 caps allow profile mgr tid 1733675 con_features 4611087853746454523) v3
    -4> 2019-08-22 10:10:25.688666 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578944 0x560f45bb2600 forward(mdsbeacon(2515805112/cc-ceph-osd13-fra1 up:standby seq 301449 v18) v7 caps allow profile mds tid 1733676 con_features 575862587620376571) v3
    -3> 2019-08-22 10:10:25.688761 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578945 0x560f45bb3e00 forward(mgrbeacon mgr.cc-ceph-osd13-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517349297, -, 0) v6 caps allow profile mgr tid 1733677 con_features 4611087853746454523) v3
    -2> 2019-08-22 10:10:25.688837 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578946 0x560f45bb5000 forward(mgrbeacon mgr.cc-ceph-osd11-fra1(2ae305d1-6742-4b1c-af69-825a3bae8b53,2517363527, 10.93.0.111:6800/1219802, 1) v6 caps allow profile mgr tid 1733678 con_features 4611087853746454523) v3
    -1> 2019-08-22 10:10:25.688952 7fc99685c700  5 -- 10.93.0.112:6789/0 >> 10.93.0.113:6789/0 conn(0x560f45939800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1205585 cs=1 l=0). rx mon.2 seq 1634578947 0x560f4507a700 forward(monmgrreport(0 checks) v1 caps allow profile mgr tid 1733679 con_features 4611087853746454523) v3
     0> 2019-08-22 10:10:25.695552 7fc992a39700 -1 *** Caught signal (Aborted) **
 in thread 7fc992a39700 thread_name:ms_dispatch

 ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
 1: (()+0x9e6334) [0x560f3b8f0334]
 2: (()+0x11390) [0x7fc99ebc1390]
 3: (gsignal()+0x38) [0x7fc99d107428]
 4: (abort()+0x16a) [0x7fc99d10902a]
 5: (tcmalloc::Log(tcmalloc::LogMode, char const*, int, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem, tcmalloc::LogItem)+0x22e) [0x7fc99e17e5ce]
 6: (()+0x1375f) [0x7fc99e17275f]
 7: (operator delete[](void*)+0x1fd) [0x7fc99e19566d]
 8: (CrushWrapper::update_choose_args(CephContext*)+0x2a6) [0x560f3b849cb6]
 9: (CrushWrapper::remove_root(int)+0x1ac) [0x560f3b85154c]
 10: (CrushWrapper::remove_root(int)+0x6d) [0x560f3b85140d]
 11: (CrushWrapper::trim_roots_with_class()+0x1b7) [0x560f3b851b37]
 12: (CrushWrapper::rebuild_roots_with_classes()+0xbd) [0x560f3b86291d]
 13: (CrushWrapper::insert_item(CephContext*, int, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0x5b0) [0x560f3b864630]
 14: (CrushWrapper::create_or_move_item(CephContext*, int, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0x25a) [0x560f3b86574a]
 15: (OSDMonitor::prepare_command_impl(boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > >&)+0x47c4) [0x560f3b4af654]
 16: (OSDMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>)+0x2e6) [0x560f3b4d18e6]
 17: (OSDMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x283) [0x560f3b4d1d23]
 18: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x9b4) [0x560f3b44fb84]
 19: (PaxosService::C_RetryMessage::_finish(int)+0x5e) [0x560f3b35bbde]
 20: (Context::complete(int)+0x9) [0x560f3b304c39]
 21: (void finish_contexts<Context>(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0x20b) [0x560f3b30ef1b]
 22: (Paxos::finish_round()+0x2a6) [0x560f3b4467e6]
 23: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xf9d) [0x560f3b447bdd]
 24: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x263) [0x560f3b448533]
 25: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xefe) [0x560f3b2fa61e]
 26: (Monitor::_ms_dispatch(Message*)+0x6db) [0x560f3b2fb1ab]
 27: (Monitor::ms_dispatch(Message*)+0x23) [0x560f3b32c2c3]
 28: (DispatchQueue::entry()+0xf4a) [0x560f3b89705a]
 29: (DispatchQueue::DispatchThread::entry()+0xd) [0x560f3b62adcd]
 30: (()+0x76ba) [0x7fc99ebb76ba]
 31: (clone()+0x6d) [0x7fc99d1d941d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mon.cc-ceph-osd12-fra1.log
--- end dump of recent events ---
Actions #11

Updated by Florian Haas over 4 years ago

With thanks to Paul Emmerich in https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/QGY75UVQEAT2SUHHKZC2K5ONC5C2DDCV/?sort=date, we've been able to track this down further.

The culprit appears to be the choose_args map that is present in both our own cluster's crushmap, and the one that Iain included in the original bug report.

If we use crushtool, offline, on a crushmap previously exported from the cluster where we ran into the bug, we get (on 12.2.12):

$ crushtool -i crushmap --update-item 59 0.01949 osd.59 --loc host nonexistent -o crushmap-update-nonexistent-to-nonexistent
*** Caught signal (Segmentation fault) **
 in thread 7f5caadd8500 thread_name:crushtool
 ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
 1: (()+0x1eea4) [0x562c391b4ea4]
 2: (()+0x11390) [0x7f5ca1c79390]
 3: (CrushWrapper::update_choose_args(CephContext*)+0x471) [0x7f5ca2565a31]
 4: (CrushWrapper::remove_root(int)+0x1ac) [0x7f5ca256d0fc]
 5: (CrushWrapper::remove_root(int)+0x6d) [0x7f5ca256cfbd]
 6: (CrushWrapper::trim_roots_with_class()+0x1b7) [0x7f5ca256d6e7]
 7: (CrushWrapper::rebuild_roots_with_classes()+0xbd) [0x7f5ca257e4cd]
 8: (CrushWrapper::insert_item(CephContext*, int, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0x5b0) [0x7f5ca25801e0]
 9: (CrushWrapper::update_item(CephContext*, int, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0x3b5) [0x7f5ca25818b5]
 10: (main()+0xfdd) [0x562c391a1b1d]
 11: (__libc_start_main()+0xf0) [0x7f5ca1122830]
 12: (_start()+0x29) [0x562c391a79e9]
2019-08-26 09:42:50.903779 7f5caadd8500 -1 *** Caught signal (Segmentation fault) **
 in thread 7f5caadd8500 thread_name:crushtool

 ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
 1: (()+0x1eea4) [0x562c391b4ea4]
 2: (()+0x11390) [0x7f5ca1c79390]
 3: (CrushWrapper::update_choose_args(CephContext*)+0x471) [0x7f5ca2565a31]
 4: (CrushWrapper::remove_root(int)+0x1ac) [0x7f5ca256d0fc]
 5: (CrushWrapper::remove_root(int)+0x6d) [0x7f5ca256cfbd]
 6: (CrushWrapper::trim_roots_with_class()+0x1b7) [0x7f5ca256d6e7]
 7: (CrushWrapper::rebuild_roots_with_classes()+0xbd) [0x7f5ca257e4cd]
 8: (CrushWrapper::insert_item(CephContext*, int, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0x5b0) [0x7f5ca25801e0]
 9: (CrushWrapper::update_item(CephContext*, int, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0x3b5) [0x7f5ca25818b5]
 10: (main()+0xfdd) [0x562c391a1b1d]
 11: (__libc_start_main()+0xf0) [0x7f5ca1122830]
 12: (_start()+0x29) [0x562c391a79e9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -19> 2019-08-26 09:42:50.896107 7f5caadd8500  5 asok(0x562c3a422bd0) register_command perfcounters_dump hook 0x562c3a4a0500
   -18> 2019-08-26 09:42:50.896177 7f5caadd8500  5 asok(0x562c3a422bd0) register_command 1 hook 0x562c3a4a0500
   -17> 2019-08-26 09:42:50.896185 7f5caadd8500  5 asok(0x562c3a422bd0) register_command perf dump hook 0x562c3a4a0500
   -16> 2019-08-26 09:42:50.896188 7f5caadd8500  5 asok(0x562c3a422bd0) register_command perfcounters_schema hook 0x562c3a4a0500
   -15> 2019-08-26 09:42:50.896190 7f5caadd8500  5 asok(0x562c3a422bd0) register_command perf histogram dump hook 0x562c3a4a0500
   -14> 2019-08-26 09:42:50.896193 7f5caadd8500  5 asok(0x562c3a422bd0) register_command 2 hook 0x562c3a4a0500
   -13> 2019-08-26 09:42:50.896195 7f5caadd8500  5 asok(0x562c3a422bd0) register_command perf schema hook 0x562c3a4a0500
   -12> 2019-08-26 09:42:50.896200 7f5caadd8500  5 asok(0x562c3a422bd0) register_command perf histogram schema hook 0x562c3a4a0500
   -11> 2019-08-26 09:42:50.896207 7f5caadd8500  5 asok(0x562c3a422bd0) register_command perf reset hook 0x562c3a4a0500
   -10> 2019-08-26 09:42:50.896209 7f5caadd8500  5 asok(0x562c3a422bd0) register_command config show hook 0x562c3a4a0500
    -9> 2019-08-26 09:42:50.896211 7f5caadd8500  5 asok(0x562c3a422bd0) register_command config help hook 0x562c3a4a0500
    -8> 2019-08-26 09:42:50.896224 7f5caadd8500  5 asok(0x562c3a422bd0) register_command config set hook 0x562c3a4a0500
    -7> 2019-08-26 09:42:50.896226 7f5caadd8500  5 asok(0x562c3a422bd0) register_command config get hook 0x562c3a4a0500
    -6> 2019-08-26 09:42:50.896228 7f5caadd8500  5 asok(0x562c3a422bd0) register_command config diff hook 0x562c3a4a0500
    -5> 2019-08-26 09:42:50.896236 7f5caadd8500  5 asok(0x562c3a422bd0) register_command config diff get hook 0x562c3a4a0500
    -4> 2019-08-26 09:42:50.896239 7f5caadd8500  5 asok(0x562c3a422bd0) register_command log flush hook 0x562c3a4a0500
    -3> 2019-08-26 09:42:50.896241 7f5caadd8500  5 asok(0x562c3a422bd0) register_command log dump hook 0x562c3a4a0500
    -2> 2019-08-26 09:42:50.896242 7f5caadd8500  5 asok(0x562c3a422bd0) register_command log reopen hook 0x562c3a4a0500
    -1> 2019-08-26 09:42:50.896286 7f5caadd8500  5 asok(0x562c3a422bd0) register_command dump_mempools hook 0x562c3a4a3a68
     0> 2019-08-26 09:42:50.903779 7f5caadd8500 -1 *** Caught signal (Segmentation fault) **
 in thread 7f5caadd8500 thread_name:crushtool

 ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
 1: (()+0x1eea4) [0x562c391b4ea4]
 2: (()+0x11390) [0x7f5ca1c79390]
 3: (CrushWrapper::update_choose_args(CephContext*)+0x471) [0x7f5ca2565a31]
 4: (CrushWrapper::remove_root(int)+0x1ac) [0x7f5ca256d0fc]
 5: (CrushWrapper::remove_root(int)+0x6d) [0x7f5ca256cfbd]
 6: (CrushWrapper::trim_roots_with_class()+0x1b7) [0x7f5ca256d6e7]
 7: (CrushWrapper::rebuild_roots_with_classes()+0xbd) [0x7f5ca257e4cd]
 8: (CrushWrapper::insert_item(CephContext*, int, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0x5b0) [0x7f5ca25801e0]
 9: (CrushWrapper::update_item(CephContext*, int, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)+0x3b5) [0x7f5ca25818b5]
 10: (main()+0xfdd) [0x562c391a1b1d]
 11: (__libc_start_main()+0xf0) [0x7f5ca1122830]
 12: (_start()+0x29) [0x562c391a79e9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent       500
  max_new         1000
  log_file 
--- end dump of recent events ---
Segmentation fault (core dumped)

The same is true for add-item. Adding the OSD to a host that already has a bucket in the crushmap is working fine.

If we then decompile the crushmap, remove the choose_args map, and recompile the crushmap, the exact same command causes no segfault, but instead creates a new host bucket named nonexistent, and a new device osd.59 in that host bucket.

On a Nautilus1 (14.2.2) cluster though, crushtool acts differently, and does not crash even with the original choose_args map left untouched:

  • we see a new OSD in the crushmap,
  • we see a new host in the crushmap,
  • we see a weight_set for the new buckets in the choose_args map.

So I think the issue might have been fixed in one of these commits:

$ git log --grep choose_args --oneline --no-merges nautilus ^mimic
ad4a9d00a6 crush/CrushWrapper: make update_choose_args less chatty
0f6a36e1e4 crushtool: make --reweight re-sum choose_args weight-sets too
a75ffcd17c crush: update choose_args on bucket removal
564ef28a40 crush: update choose_args on bucket removal, resize, or position mismatch

Perhaps one of those would still need be needed as a backport, in order to make this work for pre-Nautilus users?

[1] We skipped mimic because we already know that this problem is present there, because that's the release this bug report was originally filed for.

Actions #12

Updated by Florian Haas over 4 years ago

Should add one more thing: the only clusters bitten by this issue would be those that, at any time, ran the balancer module in the default crush-compat mode.

If you ran the balancer in upmap mode from the get-go, then the choose_args map would never have been added to your crushmap. However, once crush-compat mode does add the map, there is apparently no way to subsequently remove it — other than decompiling, editing, compiling, and reinjecting the crushmap.

Actions #13

Updated by Neha Ojha over 3 years ago

  • Status changed from New to Rejected

This only seems to be a problem with luminous and mimic, which are EOL now.

Actions #14

Updated by Neha Ojha over 3 years ago

  • Status changed from Rejected to Resolved
Actions #15

Updated by Norman Shen about 3 years ago

Florian Haas wrote:

With thanks to Paul Emmerich in https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/QGY75UVQEAT2SUHHKZC2K5ONC5C2DDCV/?sort=date, we've been able to track this down further.

The culprit appears to be the choose_args map that is present in both our own cluster's crushmap, and the one that Iain included in the original bug report.

If we use crushtool, offline, on a crushmap previously exported from the cluster where we ran into the bug, we get (on 12.2.12):

[...]

The same is true for add-item. Adding the OSD to a host that already has a bucket in the crushmap is working fine.

If we then decompile the crushmap, remove the choose_args map, and recompile the crushmap, the exact same command causes no segfault, but instead creates a new host bucket named nonexistent, and a new device osd.59 in that host bucket.

On a Nautilus1 (14.2.2) cluster though, crushtool acts differently, and does not crash even with the original choose_args map left untouched:

  • we see a new OSD in the crushmap,
  • we see a new host in the crushmap,
  • we see a weight_set for the new buckets in the choose_args map.

So I think the issue might have been fixed in one of these commits:

[...]

Perhaps one of those would still need be needed as a backport, in order to make this work for pre-Nautilus users?

[1] We skipped mimic because we already know that this problem is present there, because that's the release this bug report was originally filed for.

Hi, we seem to also hit by this issue. but updating crushmap manually does not seems to reproduce the error log reported here. could osd id be randomly chosen?

Actions #16

Updated by Norman Shen about 3 years ago

Florian Haas wrote:

With thanks to Paul Emmerich in https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/QGY75UVQEAT2SUHHKZC2K5ONC5C2DDCV/?sort=date, we've been able to track this down further.

The culprit appears to be the choose_args map that is present in both our own cluster's crushmap, and the one that Iain included in the original bug report.

If we use crushtool, offline, on a crushmap previously exported from the cluster where we ran into the bug, we get (on 12.2.12):

[...]

The same is true for add-item. Adding the OSD to a host that already has a bucket in the crushmap is working fine.

If we then decompile the crushmap, remove the choose_args map, and recompile the crushmap, the exact same command causes no segfault, but instead creates a new host bucket named nonexistent, and a new device osd.59 in that host bucket.

On a Nautilus1 (14.2.2) cluster though, crushtool acts differently, and does not crash even with the original choose_args map left untouched:

  • we see a new OSD in the crushmap,
  • we see a new host in the crushmap,
  • we see a weight_set for the new buckets in the choose_args map.

So I think the issue might have been fixed in one of these commits:

[...]

Perhaps one of those would still need be needed as a backport, in order to make this work for pre-Nautilus users?

[1] We skipped mimic because we already know that this problem is present there, because that's the release this bug report was originally filed for.

Please find https://tracker.ceph.com/issues/40029 and https://github.com/ceph/ceph/commit/ccda488815eeca608a58092d0d917c8a68a7b93c be useful.. I think it could explain crushtool's failure.

this change has been merged to 12.2.13..

Actions

Also available in: Atom PDF