Bug #47593

closed

Crimson cluster started with OSD>=3 crashes when osd.3 starts up

Added by chunmei liu over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When starting a cluster with OSD=3 (MDS=0 MGR=1 OSD=3 MON=1 ../src/vstart.sh -n --without-dashboard --bluestore -X --crimson), the following errors appear on each OSD and the Crimson cluster cannot start up.
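
The command above passes the daemon counts to vstart.sh as environment variables. A minimal sketch of scripting that invocation for different OSD counts; the helper name `vstart_invocation` is hypothetical, and the flags are copied verbatim from the command in the description.

```python
import os

def vstart_invocation(num_osds, vstart="../src/vstart.sh"):
    """Return (env, argv) for starting a crimson test cluster with num_osds OSDs."""
    # Daemon counts are passed as environment variables, as in the report.
    env = dict(os.environ, MDS="0", MGR="1", OSD=str(num_osds), MON="1")
    # Flags copied verbatim from the command in the description.
    argv = [vstart, "-n", "--without-dashboard", "--bluestore", "-X", "--crimson"]
    return env, argv

if __name__ == "__main__":
    env, argv = vstart_invocation(3)
    print("OSD=%s %s" % (env["OSD"], " ".join(argv)))
    # To actually launch it from the build directory:
    #   import subprocess; subprocess.run(argv, env=env, check=True)
```

Returning the environment and argument list separately keeps the helper free of side effects, so the OSD count can be varied without starting a cluster.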

OSD.1:

INFO 2020-09-22 17:28:10,792 [shard 0] osd - osd.1: now active
INFO 2020-09-22 17:28:10,796 [shard 0] ms - [osd.1(hb_back) v2:172.25.110.24:6809/68942 >> osd.2 v2:172.25.110.24:6812/69472@58453] established: gs=0, pgs=2, cs=0, client_cookie=1148998288105928784, server_cookie=0, in_seq=0, out_seq=0, out_q=0
INFO 2020-09-22 17:28:10,797 [shard 0] ms - [osd.1(hb_front) v2:172.25.110.24:6808/68942 >> osd.2 v2:172.25.110.24:6813/69472@59104] established: gs=0, pgs=2, cs=0, client_cookie=11246246344932318894, server_cookie=0, in_seq=0, out_seq=0, out_q=0
INFO 2020-09-22 17:28:10,801 [shard 0] osd - Heartbeat::Peer: osd.2 connected (send=false)
INFO 2020-09-22 17:28:10,801 [shard 0] osd - Heartbeat::Peer: osd.2 added
INFO 2020-09-22 17:28:10,824 [shard 0] osd - on_activate_complete: requesting recovery
INFO 2020-09-22 17:28:10,827 [shard 0] osd - Exiting state: Started/Primary/Active/Activating, entered at 1600820890.7864268, 0.003305271 spent on 5 events
INFO 2020-09-22 17:28:10,827 [shard 0] osd - Entering state: Started/Primary/Active/WaitLocalRecoveryReserved
INFO 2020-09-22 17:28:10,829 [shard 0] osd - Exiting state: Started/Primary/Active/WaitLocalRecoveryReserved, entered at 1600820890.82751, 0.000819413 spent on 1 events
INFO 2020-09-22 17:28:10,830 [shard 0] osd - Entering state: Started/Primary/Active/WaitRemoteRecoveryReserved
INFO 2020-09-22 17:28:10,934 [shard 0] osd - Exiting state: Started/Primary/Active/WaitRemoteRecoveryReserved, entered at 1600820890.8299105, 0.000980272 spent on 4 events
INFO 2020-09-22 17:28:10,935 [shard 0] osd - Entering state: Started/Primary/Active/Recovering
INFO 2020-09-22 17:28:10,935 [shard 0] osd - start_primary_recovery_ops recovering 0 in pg pg_epoch 15 pg[1.0( v 9'1 lc 0'0 (0'0,9'1] local-lis/les=14/15 n=0 ec=2/2 lis/c=14/0 les/c/f=15/0/0 sis=14) [1,0,2] r=0 lpr=14 pi=[2,14)/1 crt=9'1 mlcod 0'0 active+recovering+degraded , missing missing(1 may_include_deletes = 1)
INFO 2020-09-22 17:28:10,935 [shard 0] osd - start_primary_recovery_ops 1:0c89496c:::INTEL_SSDSC2KG960G8_PHYG9050006G960CGN:head item.need 9'1 (missing) (missing head)
crimson-osd: /home/chunmei/ceph/src/crimson/osd/replicated_recovery_backend.cc:29: virtual seastar::future<> ReplicatedRecoveryBackend::recover_object(const hobject_t&, eversion_t): Assertion `added' failed.
Aborting on shard 0.
Backtrace:
/usr/lib/x86_64-linux-gnu/libasan.so.5+0x000000000006bb2f
0x000000001800ef24
0x0000000017fe7861
0x0000000017e9eb70
0x0000000017e9ece3
0x0000000017f00e2f
0x0000000017f389d6
0x0000000017f38a71
/lib/x86_64-linux-gnu/libpthread.so.0+0x000000000001289f
/lib/x86_64-linux-gnu/libc.so.6+0x000000000003ef46
/lib/x86_64-linux-gnu/libc.so.6+0x00000000000408b0
/lib/x86_64-linux-gnu/libc.so.6+0x0000000000030429
/lib/x86_64-linux-gnu/libc.so.6+0x00000000000304a1
0x0000000014216b62
0x000000001415e041
0x000000001415932e
0x0000000014155e38
0x0000000014123b9e
0x0000000014120e62
0x0000000014124d6b
0x0000000014125846
0x000000001412192d
0x00000000141804a7
0x0000000014151389
0x0000000013c126cd
0x00000000147ed894
0x0000000014adb8d8
0x0000000014ad7066
0x0000000014ad32db
0x0000000014ace8ea
0x0000000014acaca4
0x0000000014ac5d79
0x0000000014ac10c5
0x0000000014abc81b
0x0000000014ab7e6a
0x0000000014ab318e
0x0000000014aad49c
0x0000000014aa6e83
0x0000000014a9916a
0x0000000014a88513
0x0000000014a6ce38
0x00000000135ad0c6
0x0000000013cc167e
0x0000000013c7b582
0x0000000013c7bb9e
0x0000000013c3c695
0x0000000013c0c3d0
0x0000000013b898eb
0x0000000013b89c40
0x00000000140d51ff
0x00000000140e2314
0x00000000140df491
0x00000000140df550
0x00000000140e25ec
0x00000000136ed28c
0x00000000136e7d3c
0x00000000136dd1fc
0x00000000136cf16f
0x00000000136beb4e
0x00000000136cf264
0x00000000136bea3a
0x00000000136fc104
0x0000000017ee51af
0x0000000017ee97ff
0x0000000017eeed27
0x0000000017d8a3f6
0x0000000013707771
/lib/x86_64-linux-gnu/libc.so.6+0x0000000000021b96
0x0000000013508b49
Aborted (core dumped)
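
The backtrace above mixes two kinds of frames: `library+offset` entries and bare addresses, which presumably belong to the crimson-osd binary itself. A minimal sketch, assuming binutils `addr2line` is installed and the binary carries debug info, that collects the bare addresses and builds an `addr2line` command line for them. The function names are hypothetical, and the binary path is shortened from the one shown in the AddressSanitizer output. If the binary is position-independent, the load base may need to be subtracted from each address first.

```python
import re

# A frame that is only a hexadecimal address, e.g. "0x000000001800ef24".
BARE_ADDR = re.compile(r"^0x[0-9a-fA-F]+$")

def bare_addresses(backtrace_text):
    """Return the frames that are bare addresses, in their original order."""
    frames = (line.strip() for line in backtrace_text.splitlines())
    return [f for f in frames if BARE_ADDR.match(f)]

def addr2line_command(binary, addresses):
    """Build an addr2line argv: -f function names, -C demangle, -p one line per frame."""
    return ["addr2line", "-f", "-C", "-p", "-e", binary] + list(addresses)

if __name__ == "__main__":
    sample = """\
/usr/lib/x86_64-linux-gnu/libasan.so.5+0x000000000006bb2f
0x000000001800ef24
0x0000000017fe7861
/lib/x86_64-linux-gnu/libpthread.so.0+0x000000000001289f
"""
    cmd = addr2line_command("build/bin/crimson-osd", bare_addresses(sample))
    print(" ".join(cmd))
```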

OSD.0 and OSD.2:

After OSD.1 crashes, the other two crash on a deadly signal. The stack loops through format.h, formatter.cc, core.h, etc. I can’t tell which crimson code calls this format function.

INFO 2020-09-22 17:28:32,115 [shard 0] ms - [osd.0(hb_front) v2:172.25.110.24:6805/66695 >> osd.1 v2:172.25.110.24:6808/68942] execute_wait(): going to CONNECTING
INFO 2020-09-22 17:28:32,116 [shard 0] ms - [osd.0(hb_front) v2:172.25.110.24:6805/66695 >> osd.1 v2:172.25.110.24:6808/68942] execute_connecting(): fault at CONNECTING, going to WAIT -- std::system_error (error system:111, Connection refused)
WARN 2020-09-22 17:28:32,116 [shard 0] ms - [osd.0(hb_front) v2:172.25.110.24:6805/66695 >> osd.1 v2:172.25.110.24:6808/68942] waiting 3.2 seconds ...
INFO 2020-09-22 17:28:32,117 [shard 0] ms - [osd.0(hb_back) v2:172.25.110.24:6804/66695 >> osd.1 v2:172.25.110.24:6809/68942] execute_wait(): going to CONNECTING
INFO 2020-09-22 17:28:32,118 [shard 0] ms - [osd.0(hb_back) v2:172.25.110.24:6804/66695 >> osd.1 v2:172.25.110.24:6809/68942] execute_connecting(): fault at CONNECTING, going to WAIT -- std::system_error (error system:111, Connection refused)
WARN 2020-09-22 17:28:32,119 [shard 0] ms - [osd.0(hb_back) v2:172.25.110.24:6804/66695 >> osd.1 v2:172.25.110.24:6809/68942] waiting 3.2 seconds ...
AddressSanitizer:DEADLYSIGNAL
=================================================================
==66695==ERROR: AddressSanitizer: stack-overflow on address 0x7ffca8cf2818 (pc 0x7f94b8f8ef79 bp 0x7ffca8cf30c0 sp 0x7ffca8cf2820 T0)
#0 0x7f94b8f8ef78 (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xd4f78)
#1 0x7f94b67841f7 in __cxxabiv1::__vmi_class_type_info::__do_dyncast(long, __cxxabiv1::__class_type_info::__sub_kind, __cxxabiv1::__class_type_info const*, void const*, __cxxabiv1::__class_type_info const*, void const*, __cxxabiv1::__class_type_info::__dyncast_result&) const (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0xa71f7)
#2 0x7f94b6781143 in __dynamic_cast (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0xa4143)
#3 0x7f94b67f7a7d in bool std::has_facet<std::ctype<char> >(std::locale const&) (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x11aa7d)
#4 0x7f94b67ea183 in std::basic_ios<char, std::char_traits<char> >::_M_cache_locale(std::locale const&) (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x10d183)
#5 0x7f94b67ea5ff in std::basic_ios<char, std::char_traits<char> >::init(std::basic_streambuf<char, std::char_traits<char> >*) (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x10d5ff)
#6 0x7f94b6806756 in std::basic_ostream<char, std::char_traits<char> >::basic_ostream(std::basic_streambuf<char, std::char_traits<char> >*) (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x129756)
#7 0x55c203d7a5b8 in void fmt::v6::detail::format_value<char, std::chrono::time_point<ceph::coarse_real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > >(fmt::v6::detail::buffer<char>&, std::chrono::time_point<ceph::coarse_real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const&, fmt::v6::detail::locale_ref) (/home/chunmei/ceph/build/bin/crimson-osd+0x136f45b8)
#8 0x55c203d78aea in std::back_insert_iterator<fmt::v6::detail::buffer<char> > fmt::v6::detail::fallback_formatter<std::chrono::time_point<ceph::coarse_real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, char, void>::format<std::back_insert_iterator<fmt::v6::detail::buffer<char> > >(std::chrono::time_point<ceph::coarse_real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const&, fmt::v6::basic_format_context<std::back_insert_iterator<fmt::v6::detail::buffer<char> >, char>&) (/home/chunmei/ceph/build/bin/crimson-osd+0x136f2aea)
#9 0x55c203d768e6 in void fmt::v6::detail::value<fmt::v6::basic_format_context<std::back_insert_iterator<fmt::v6::detail::buffer<char> >, char> >::format_custom_arg<std::chrono::time_point<ceph::coarse_real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, fmt::v6::detail::fallback_formatter<std::chrono::time_point<ceph::coarse_real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, char, void> >(void const*, fmt::v6::basic_format_parse_context<char, fmt::v6::detail::error_handler>&, fmt::v6::basic_format_context<std::back_insert_iterator<fmt::v6::detail::buffer<char> >, char>&) (/home/chunmei/ceph/build/bin/crimson-osd+0x136f08e6)

Actions #1

Updated by Kefu Chai over 3 years ago

crimson-osd: ../src/crimson/osd/replicated_recovery_backend.cc:29: virtual seastar::future<> ReplicatedRecoveryBackend::recover_object(const hobject_t&, eversion_t): Assertion `added' failed.
Aborted

reproduced locally.
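
The assertion message names a variable `added` in `ReplicatedRecoveryBackend::recover_object`, but the report does not show the code at line 29. One plausible reading, stated here purely as an assumption, is that `added` records whether the object was newly inserted into a table of in-flight recoveries, so the assertion fires when recovery of the same object is requested twice. A toy Python model of that pattern follows; the class and method names are hypothetical, and this is not the crimson code.

```python
class RecoveryTracker:
    """Toy model: tracks objects whose recovery is in flight."""

    def __init__(self):
        self.recovering = {}

    def recover_object(self, soid, need):
        # Mimic an insert-if-absent that reports whether insertion happened.
        added = soid not in self.recovering
        if added:
            self.recovering[soid] = need
        # The assumed invariant: an object is never scheduled twice.
        assert added, "object %s is already being recovered" % soid

    def on_recovered(self, soid):
        # Stop tracking once recovery completes.
        self.recovering.pop(soid, None)

if __name__ == "__main__":
    tracker = RecoveryTracker()
    tracker.recover_object("obj:head", "9'1")
    try:
        tracker.recover_object("obj:head", "9'1")  # second request trips the assert
    except AssertionError as exc:
        print("assertion failed:", exc)
```

Under this reading, the question for the actual fix is why the same object gets scheduled for recovery a second time before the first attempt has completed.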

Actions #2

Updated by Kefu Chai over 3 years ago

  • Status changed from New to Fix Under Review
  • Assignee changed from Xuehan Xu to Kefu Chai
  • Pull request ID set to 37323
Actions #3

Updated by Kefu Chai over 3 years ago

  • Status changed from Fix Under Review to Resolved