
Bug #41891

global osd crash in DynamicPerfStats::add_to_reports

Added by Marcin Gibula over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
nautilus
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):

d771a7bad3f0f90ecacd2538e284e04138c655ad59404f60f0be90263acff9ad

Crash signature (v2):

Description

Hi,
during routine host maintenance, I've encountered a massive OSD crash across the entire cluster. Ceph is the newest Nautilus (14.2.3), recently upgraded from Luminous. The sequence of events was:

1. Restart OSDs on one of the hosts (systemctl restart ceph-osd.target)
2. These OSDs marked themselves down
3. Multiple OSDs on different hosts crashed 1-2 seconds later

Backtrace of crashing OSDs:
0> 2019-09-17 12:05:07.449 7f1f13555700 -1 *** Caught signal (Aborted) **
in thread 7f1f13555700 thread_name:safe_timer

ceph version 14.2.3 (0f776cf838a1ae3130b2b73dc26be9c95c6ccc39) nautilus (stable)
1: (()+0x11390) [0x7f1f2d608390]
2: (gsignal()+0x38) [0x7f1f2cb33428]
3: (abort()+0x16a) [0x7f1f2cb3502a]
4: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x1654965]
5: (__cxxabiv1::__terminate(void (*)())+0x6) [0x16489b6]
6: /usr/bin/ceph-osd() [0x1648a01]
7: /usr/bin/ceph-osd() [0x1654bb4]
8: (std::__throw_out_of_range(char const*)+0x3f) [0x16602cf]
9: (DynamicPerfStats::add_to_reports(std::map<OSDPerfMetricQuery, std::set<OSDPerfMetricLimit, std::less<OSDPerfMetricLimit>, std::allocator<OSDPerfMetricLimit> >, std::less<OSDPerfMetricQuery>, std::allocator<std::pair<OSDPerfMetricQuery const, std::set<OSDPerfMetricLimit, std::less<OSDPerfMetricLimit>, std::allocator<OSDPerfMetricLimit> > > > > const&, std::map<OSDPerfMetricQuery, OSDPerfMetricReport, std::less<OSDPerfMetricQuery>, std::allocator<std::pair<OSDPerfMetricQuery const, OSDPerfMetricReport> > >*)+0x54d) [0x94becd]
10: (OSD::get_perf_reports(std::map<OSDPerfMetricQuery, OSDPerfMetricReport, std::less<OSDPerfMetricQuery>, std::allocator<std::pair<OSDPerfMetricQuery const, OSDPerfMetricReport> > >*)+0xb38) [0x8f19c8]
11: (MgrClient::_send_report()+0x39a) [0x114051a]
12: (MgrClient::_send_stats()+0x22) [0x1140942]
13: (FunctionContext::finish(int)+0x2c) [0x929abc]
14: (Context::complete(int)+0x9) [0x916599]
15: (SafeTimer::timer_thread()+0x190) [0xf21eb0]
16: (SafeTimerThread::entry()+0xd) [0xf2377d]
17: (()+0x76ba) [0x7f1f2d5fe6ba]
18: (clone()+0x6d) [0x7f1f2cc0541d]

Cluster status:

2019-09-17 12:05:01.216878 mon.hv-ceph-01 [INF] osd.200 marked itself down
2019-09-17 12:05:01.230932 mon.hv-ceph-01 [INF] osd.217 marked itself down
2019-09-17 12:05:01.234446 mon.hv-ceph-01 [INF] osd.198 marked itself down
2019-09-17 12:05:01.234785 mon.hv-ceph-01 [INF] osd.204 marked itself down
2019-09-17 12:05:01.235130 mon.hv-ceph-01 [INF] osd.202 marked itself down
2019-09-17 12:05:01.235447 mon.hv-ceph-01 [INF] osd.203 marked itself down
2019-09-17 12:05:01.235793 mon.hv-ceph-01 [INF] osd.195 marked itself down
2019-09-17 12:05:01.236136 mon.hv-ceph-01 [INF] osd.201 marked itself down
2019-09-17 12:05:01.236342 mon.hv-ceph-01 [INF] osd.197 marked itself down
2019-09-17 12:05:01.239945 mon.hv-ceph-01 [INF] osd.199 marked itself down
2019-09-17 12:05:01.242013 mon.hv-ceph-01 [INF] osd.196 marked itself down
2019-09-17 12:05:01.243770 mon.hv-ceph-01 [INF] osd.194 marked itself down
2019-09-17 12:05:01.866272 mon.hv-ceph-01 [WRN] Health check failed: 12 osds down (OSD_DOWN)
2019-09-17 12:05:01.866297 mon.hv-ceph-01 [WRN] Health check failed: 2 hosts (12 osds) down (OSD_HOST_DOWN) <== one host in reality, two different crush trees
2019-09-17 12:05:02.893107 mon.hv-ceph-01 [WRN] Health check failed: Reduced data availability: 3 pgs inactive, 69 pgs peering (PG_AVAILABILITY)
2019-09-17 12:05:03.372306 mon.hv-ceph-01 [INF] osd.134 failed (root=hdd1-root,host=ceph-hdd-08-hdd1) (connection refused reported by osd.95)
2019-09-17 12:05:03.396364 mon.hv-ceph-01 [INF] osd.166 failed (root=hdd1-root,host=ceph-hdd-05-hdd1) (connection refused reported by osd.212)
2019-09-17 12:05:03.396432 mon.hv-ceph-01 [INF] osd.140 failed (root=hdd1-root,host=ceph-hdd-09-hdd1) (connection refused reported by osd.188)
2019-09-17 12:05:03.397809 mon.hv-ceph-01 [INF] osd.177 failed (root=hdd1-root,host=ceph-hdd-08-hdd1) (connection refused reported by osd.244)
2019-09-17 12:05:03.451098 mon.hv-ceph-01 [INF] osd.80 failed (root=hdd1-root,host=ceph-hdd-06-hdd1) (connection refused reported by osd.184)
2019-09-17 12:05:03.474105 mon.hv-ceph-01 [INF] osd.40 failed (root=hdd1-root,host=ceph-hdd-04-hdd1) (connection refused reported by osd.99)
2019-09-17 12:05:03.516318 mon.hv-ceph-01 [INF] osd.153 failed (root=hdd1-root,host=ceph-hdd-09-hdd1) (connection refused reported by osd.210)
2019-09-17 12:05:03.531219 mon.hv-ceph-01 [INF] osd.154 failed (root=hdd1-root,host=ceph-hdd-09-hdd1) (connection refused reported by osd.239)
2019-09-17 12:05:03.536304 mon.hv-ceph-01 [INF] osd.77 failed (root=hdd1-root,host=ceph-hdd-06-hdd1) (connection refused reported by osd.240)
2019-09-17 12:05:03.545708 mon.hv-ceph-01 [INF] osd.74 failed (root=hdd1-root,host=ceph-hdd-05-hdd1) (connection refused reported by osd.246)
2019-09-17 12:05:03.558471 mon.hv-ceph-01 [INF] osd.130 failed (root=hdd1-root,host=ceph-hdd-08-hdd1) (connection refused reported by osd.191)
2019-09-17 12:05:03.589367 mon.hv-ceph-01 [INF] osd.50 failed (root=hdd1-root,host=ceph-hdd-04-hdd1) (connection refused reported by osd.213)
2019-09-17 12:05:03.598871 mon.hv-ceph-01 [INF] osd.133 failed (root=hdd1-root,host=ceph-hdd-08-hdd1) (connection refused reported by osd.240)
2019-09-17 12:05:03.601311 mon.hv-ceph-01 [INF] osd.155 failed (root=hdd1-root,host=ceph-hdd-09-hdd1) (connection refused reported by osd.72)
2019-09-17 12:05:03.725224 mon.hv-ceph-01 [INF] osd.148 failed (root=hdd1-root,host=ceph-hdd-09-hdd1) (connection refused reported by osd.47)

After the OSDs came back up, the cluster seems stable.


Related issues

Copied to RADOS - Backport #42095: nautilus: global osd crash in DynamicPerfStats::add_to_reports Resolved

History

#1 Updated by Marcin Gibula over 4 years ago

As the crash seems to be related to stats reporting - I don't know if it is connected, but it happened soon after eliminating the "Legacy BlueStore stats reporting detected" error (by running the ceph-bluestore-tool repair).

#2 Updated by Mykola Golub over 4 years ago

Marcin, are you using the `rbd perf image iotop|iostat` commands? Or maybe the prometheus mgr module with rbd per-image stats enabled?

Also, it would be nice to see the exact location in the code where the exception was thrown. Could you install the ceph-osd debug symbols on one of the osd nodes, run `gdb /usr/bin/ceph-osd`, and inside gdb issue:

  l *DynamicPerfStats::add_to_reports+0x54d

and provide the output.

Are you still observing the crashes?

#3 Updated by Marcin Gibula over 4 years ago

Yes, I use "rbd perf image iotop/iostat" (one of the reasons for the upgrade :-) ). I'm not exporting per-image data with prometheus yet, though I am using the prometheus module.

(gdb) l *DynamicPerfStats::add_to_reports+0x54d
0x94becd is in DynamicPerfStats::add_to_reports(std::map<OSDPerfMetricQuery, std::set<OSDPerfMetricLimit, std::less<OSDPerfMetricLimit>, std::allocator<OSDPerfMetricLimit> >, std::less<OSDPerfMetricQuery>, std::allocator<std::pair<OSDPerfMetricQuery const, std::set<OSDPerfMetricLimit, std::less<OSDPerfMetricLimit>, std::allocator<OSDPerfMetricLimit> > > > > const&, std::map<OSDPerfMetricQuery, OSDPerfMetricReport, std::less<OSDPerfMetricQuery>, std::allocator<std::pair<OSDPerfMetricQuery const, OSDPerfMetricReport> > >*) (/usr/include/c++/7/bits/stl_map.h:542).
537 in /usr/include/c++/7/bits/stl_map.h

I don't have a copy of this header on the server, but I've checked on another system, and it's std::map<>::at().
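
For illustration only, a minimal self-contained C++ sketch of that failure mode (not the actual Ceph code; the map contents and key names here are made up): std::map::at() throws std::out_of_range when the key is missing, and an exception that escapes the safe_timer thread goes through std::terminate() and abort(), matching frames 4-8 of the backtrace above.

  // Minimal sketch of the failure mode -- not the actual Ceph code.
  // Key names and contents are hypothetical; the point is that
  // std::map::at() throws std::out_of_range for a missing key.
  #include <iostream>
  #include <map>
  #include <stdexcept>
  #include <string>

  int main() {
    std::map<std::string, int> reports;   // stand-in for the per-query report map
    reports["query_a"] = 1;

    try {
      // Throws std::out_of_range because "query_b" is not in the map,
      // like an at() lookup on a query/pg entry that has gone away.
      std::cout << reports.at("query_b") << std::endl;
    } catch (const std::out_of_range& e) {
      // In the OSD's safe_timer thread nothing catches this, so it ends in
      // std::terminate() -> __verbose_terminate_handler() -> abort().
      std::cerr << "out_of_range: " << e.what() << std::endl;
    }
    return 0;
  }

With the try/catch removed, the program aborts through the same __verbose_terminate_handler() path that shows up in the backtrace.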

Since the cluster was stable after this crash, I proceeded with maintenance (I was migrating configs from ceph.conf into the mons and restarting OSD hosts one by one) and had a couple more of exactly the same crashes across the cluster. It was not deterministic, though; most host restarts went without any issues.

After all restarts were completed, the cluster is stable and healthy.

#4 Updated by Mykola Golub over 4 years ago

  • Status changed from New to In Progress
  • Assignee set to Mykola Golub

#5 Updated by Mykola Golub over 4 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 30454

#6 Updated by Mykola Golub over 4 years ago

Marcin, I believe I know the cause and am now discussing the fix [1]. A workaround could be not to use "rbd perf image iotop/iostat" when you expect osds to go down/up (i.e. many pg state changes).

[1] https://github.com/ceph/ceph/pull/30454

#7 Updated by Mykola Golub over 4 years ago

  • Backport set to nautilus

#8 Updated by Marcin Gibula over 4 years ago

I don't believe this command was running at the time; however, the "rbd_support" mgr module was active. Could this be the reason? Is it safe to unload it until a fixed version is released?

#9 Updated by Jason Dillaman over 4 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)

#10 Updated by Marcin Gibula over 4 years ago

Answering myself - it seems that rbd_support cannot be disabled anyway:

  # ceph mgr module disable rbd_support
  Error EINVAL: module 'rbd_support' cannot be disabled (always-on)

#11 Updated by Kefu Chai over 4 years ago

  • Status changed from Fix Under Review to Pending Backport

#12 Updated by Mykola Golub over 4 years ago

  • Copied to Backport #42095: nautilus: global osd crash in DynamicPerfStats::add_to_reports added

#14 Updated by Sage Weil over 4 years ago

  • Crash signature (v1) updated (diff)

#15 Updated by Sage Weil over 4 years ago

  • Crash signature (v1) updated (diff)

#16 Updated by Sage Weil over 4 years ago

  • Status changed from Pending Backport to Resolved
