Bug #62470 (open)
Rook: OSD Crash Looping / Caught signal (Aborted) / thread_name:tp_osd_tp
Description
Following a discussion on the Rook GitHub (https://github.com/rook/rook/discussions/12713#discussioncomment-6730118), they suggested I open a Ceph issue.
Two OSDs (on different nodes) are crash looping:
- one OSD is in a rapid crash loop: it stays up for only a few seconds of recovery before crashing again
- the other is in a slow crash loop: it stays online for many minutes to an hour. When I find it crashed, the log shows the same message, but it seems able to restart on its own without manual intervention.
Host OS: Ubuntu 22.04.3
Kernel: 5.15.0-78-generic
Nodes: 6
Kubernetes: v1.26.7+k3s1
Ceph Version: 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
Each node has at least 64GiB of RAM, each OSD has RAM request/limit set to 4Gi.
# ceph osd status
ID  HOST   USED   AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  k3s01  32.2G   667G       0        0       0        0  exists
 1  k3s02   265G   434G       3    21.5k       0        0  exists,up
 2  k3s03   258G   441G       0    99.1k       1        0  exists,up
 3  k3s04   181G   518G       0    73.5k       2       90  exists,up
 4  k3s05   157G   542G       0     2457       1       16  exists,up
 5  k3s06   166G   764G       1     5734       0        0  exists,up
 6  k3s06   217G   713G       1    11.1k       1        0  exists,up
 7  k3s04   170G   529G       0     7372       1        0  exists,up
 8  k3s05   178G   521G       1    12.0k       1        0  exists,up
Crashing OSD logs ended with:
debug -17> 2023-08-13T22:09:44.349+0000 7f0ccf8e8700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2023-08-13T22:09:14.353598+0000)
debug -16> 2023-08-13T22:09:44.361+0000 7f0cc0f13700 5 osd.0 pg_epoch: 42252 pg[1.1d( v 42249'12091630 (40626'12088880,42249'12091630] lb 1:b9a58b2d:::rbd_data.2c851fe481b84d.0000000000002f6b:head local-lis/les=41707/41708 n=726 ec=247/5 lis/c=42251/40618 les/c/f=42252/40619/0 sis=42251 pruub=12.915059090s) [1,5,3,0]/[1,5,3] r=-1 lpr=42251 pi=[40618,42251)/1 luod=0'0 crt=42249'12091630 mlcod 0'0 active+remapped pruub 17.178600311s@ mbc={}] exit Started/ReplicaActive/RepRecovering 0.235278 1 0.000069
debug -15> 2023-08-13T22:09:44.361+0000 7f0cc0f13700 5 osd.0 pg_epoch: 42252 pg[1.1d( v 42249'12091630 (40626'12088880,42249'12091630] lb 1:b9a58b2d:::rbd_data.2c851fe481b84d.0000000000002f6b:head local-lis/les=41707/41708 n=726 ec=247/5 lis/c=42251/40618 les/c/f=42252/40619/0 sis=42251 pruub=12.915059090s) [1,5,3,0]/[1,5,3] r=-1 lpr=42251 pi=[40618,42251)/1 luod=0'0 crt=42249'12091630 mlcod 0'0 active+remapped pruub 17.178600311s@ mbc={}] enter Started/ReplicaActive/RepNotRecovering
debug -14> 2023-08-13T22:09:44.361+0000 7f0cbd70c700 5 osd.0 pg_epoch: 42252 pg[1.7c( v 42249'6608333 (40626'6605409,42249'6608333] lb 1:3e449c1a:::rbd_data.9513d469c8a286.0000000000000e85:head local-lis/les=42091/42092 n=113 ec=37674/5 lis/c=42250/42247 les/c/f=42251/42248/0 sis=42250) [8,6,0,3]/[8,6,3,2] r=-1 lpr=42250 pi=[41932,42250)/1 luod=0'0 crt=42249'6608333 lcod 0'0 mlcod 0'0 active+remapped mbc={}] exit Started/ReplicaActive/RepWaitBackfillReserved 1.184208 4 0.000062
debug -13> 2023-08-13T22:09:44.361+0000 7f0cbd70c700 5 osd.0 pg_epoch: 42252 pg[1.7c( v 42249'6608333 (40626'6605409,42249'6608333] lb 1:3e449c1a:::rbd_data.9513d469c8a286.0000000000000e85:head local-lis/les=42091/42092 n=113 ec=37674/5 lis/c=42250/42247 les/c/f=42251/42248/0 sis=42250) [8,6,0,3]/[8,6,3,2] r=-1 lpr=42250 pi=[41932,42250)/1 luod=0'0 crt=42249'6608333 lcod 0'0 mlcod 0'0 active+remapped mbc={}] enter Started/ReplicaActive/RepRecovering
debug -12> 2023-08-13T22:09:44.361+0000 7f0cbcf0b700 5 osd.0 pg_epoch: 42252 pg[1.1d( v 42249'12091630 (40626'12088880,42249'12091630] lb 1:b9a58b2d:::rbd_data.2c851fe481b84d.0000000000002f6b:head local-lis/les=41707/41708 n=726 ec=247/5 lis/c=42251/40618 les/c/f=42252/40619/0 sis=42251 pruub=12.915059090s) [1,5,3,0]/[1,5,3] r=-1 lpr=42251 pi=[40618,42251)/1 luod=0'0 crt=42249'12091630 mlcod 0'0 active+remapped pruub 17.178600311s@ mbc={}] exit Started/ReplicaActive/RepNotRecovering 0.000240 1 0.000086
debug -11> 2023-08-13T22:09:44.361+0000 7f0cbcf0b700 5 osd.0 pg_epoch: 42252 pg[1.1d( v 42249'12091630 (40626'12088880,42249'12091630] lb 1:b9a58b2d:::rbd_data.2c851fe481b84d.0000000000002f6b:head local-lis/les=41707/41708 n=726 ec=247/5 lis/c=42251/40618 les/c/f=42252/40619/0 sis=42251 pruub=12.915059090s) [1,5,3,0]/[1,5,3] r=-1 lpr=42251 pi=[40618,42251)/1 luod=0'0 crt=42249'12091630 mlcod 0'0 active+remapped pruub 17.178600311s@ mbc={}] enter Started/ReplicaActive/RepWaitBackfillReserved
debug -10> 2023-08-13T22:09:44.361+0000 7f0cc0f13700 5 osd.0 pg_epoch: 42252 pg[1.1d( v 42249'12091630 (40626'12088880,42249'12091630] lb 1:b9a58b2d:::rbd_data.2c851fe481b84d.0000000000002f6b:head local-lis/les=41707/41708 n=726 ec=247/5 lis/c=42251/40618 les/c/f=42252/40619/0 sis=42251 pruub=12.915059090s) [1,5,3,0]/[1,5,3] r=-1 lpr=42251 pi=[40618,42251)/1 luod=0'0 crt=42249'12091630 mlcod 0'0 active+remapped pruub 17.178600311s@ mbc={}] exit Started/ReplicaActive/RepWaitBackfillReserved 0.000055 1 0.000178
debug -9> 2023-08-13T22:09:44.361+0000 7f0cc0f13700 5 osd.0 pg_epoch: 42252 pg[1.1d( v 42249'12091630 (40626'12088880,42249'12091630] lb 1:b9a58b2d:::rbd_data.2c851fe481b84d.0000000000002f6b:head local-lis/les=41707/41708 n=726 ec=247/5 lis/c=42251/40618 les/c/f=42252/40619/0 sis=42251 pruub=12.915059090s) [1,5,3,0]/[1,5,3] r=-1 lpr=42251 pi=[40618,42251)/1 luod=0'0 crt=42249'12091630 mlcod 0'0 active+remapped pruub 17.178600311s@ mbc={}] enter Started/ReplicaActive/RepRecovering
debug -8> 2023-08-13T22:09:44.361+0000 7f0cbdf0d700 5 osd.0 pg_epoch: 42252 pg[1.63( v 42249'9387151 (40626'9384104,42249'9387151] lb 1:c6277df9:::rbd_data.8582ba8bb2fb07.0000000000002113:14e0 local-lis/les=42091/42092 n=77 ec=37674/5 lis/c=42250/42247 les/c/f=42251/42248/0 sis=42250) [8,5,0,7]/[8,5,7,2] r=-1 lpr=42250 pi=[41932,42250)/1 luod=0'0 crt=42249'9387151 lcod 0'0 mlcod 0'0 active+remapped mbc={}] exit Started/ReplicaActive/RepRecovering 0.044200 2 0.000085
debug -7> 2023-08-13T22:09:44.361+0000 7f0cbdf0d700 5 osd.0 pg_epoch: 42252 pg[1.63( v 42249'9387151 (40626'9384104,42249'9387151] lb 1:c6277df9:::rbd_data.8582ba8bb2fb07.0000000000002113:14e0 local-lis/les=42091/42092 n=77 ec=37674/5 lis/c=42250/42247 les/c/f=42251/42248/0 sis=42250) [8,5,0,7]/[8,5,7,2] r=-1 lpr=42250 pi=[41932,42250)/1 luod=0'0 crt=42249'9387151 lcod 0'0 mlcod 0'0 active+remapped mbc={}] enter Started/ReplicaActive/RepNotRecovering
debug -6> 2023-08-13T22:09:44.361+0000 7f0cc0712700 5 osd.0 pg_epoch: 42252 pg[1.76( v 42249'8152346 (40612'8149796,42249'8152346] lb 1:6e002bb3:::rbd_data.8582ba8bb2fb07.00000000000017a3:head local-lis/les=42219/42220 n=7 ec=37674/5 lis/c=42250/42247 les/c/f=42251/42248/0 sis=42250) [0,8,6,3]/[8,6,3,2] r=-1 lpr=42250 pi=[41932,42250)/1 luod=0'0 crt=42249'8152346 lcod 0'0 mlcod 0'0 active+remapped mbc={}] exit Started/ReplicaActive/RepNotRecovering 1.096523 6 0.003226
debug -5> 2023-08-13T22:09:44.361+0000 7f0cc0712700 5 osd.0 pg_epoch: 42252 pg[1.76( v 42249'8152346 (40612'8149796,42249'8152346] lb 1:6e002bb3:::rbd_data.8582ba8bb2fb07.00000000000017a3:head local-lis/les=42219/42220 n=7 ec=37674/5 lis/c=42250/42247 les/c/f=42251/42248/0 sis=42250) [0,8,6,3]/[8,6,3,2] r=-1 lpr=42250 pi=[41932,42250)/1 luod=0'0 crt=42249'8152346 lcod 0'0 mlcod 0'0 active+remapped mbc={}] enter Started/ReplicaActive/RepWaitBackfillReserved
debug -4> 2023-08-13T22:09:44.437+0000 7f0cbcf0b700 -1 *** Caught signal (Aborted) **
in thread 7f0cbcf0b700 thread_name:tp_osd_tp
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
1: /lib64/libpthread.so.0(+0x12cf0) [0x7f0ce435ecf0]
2: gsignal()
3: abort()
4: /lib64/libstdc++.so.6(+0x9009b) [0x7f0ce397409b]
5: /lib64/libstdc++.so.6(+0x9653c) [0x7f0ce397a53c]
6: /lib64/libstdc++.so.6(+0x96597) [0x7f0ce397a597]
7: /lib64/libstdc++.so.6(+0x967f8) [0x7f0ce397a7f8]
8: ceph-osd(+0x5b012d) [0x5639c95f512d]
9: (ReplicatedBackend::submit_push_data(ObjectRecoveryInfo const&, bool, bool, bool, bool, interval_set<unsigned long, std::map>&, interval_set<unsigned long, std::map> const&, ceph::buffer::v15_2_0::list, ceph::buffer::v15_2_0::list, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > const&, ceph::os::Transaction*)+0x906) [0x5639c9b07e56]
10: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x307) [0x5639c9b09617]
11: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x2b1) [0x5639c9b09b61]
12: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x260) [0x5639c9b0bd80]
13: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x56) [0x5639c98f9386]
14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x512) [0x5639c988bbb2]
15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1c0) [0x5639c96df760]
16: (ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x5639c99e0729]
17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x115f) [0x5639c97218af]
18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x435) [0x5639c9e55b75]
19: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5639c9e58294]
20: /lib64/libpthread.so.0(+0x81ca) [0x7f0ce43541ca]
21: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
debug -3> 2023-08-13T22:09:44.537+0000 7f0cbff11700 5 osd.0 pg_epoch: 42252 pg[1.27( v 42249'6054220 (40613'6048197,42249'6054220] local-lis/les=42237/42238 n=917 ec=37674/5 lis/c=42251/42237 les/c/f=42252/42238/0 sis=42251) [4,7,0,6]/[4,7,6] r=-1 lpr=42251 pi=[42237,42251)/1 luod=0'0 crt=42249'6054220 mlcod 0'0 active+remapped mbc={}] exit Started/ReplicaActive/RepRecovering 0.441339 1 0.000038
debug -2> 2023-08-13T22:09:44.537+0000 7f0cbff11700 5 osd.0 pg_epoch: 42252 pg[1.27( v 42249'6054220 (40613'6048197,42249'6054220] local-lis/les=42237/42238 n=917 ec=37674/5 lis/c=42251/42237 les/c/f=42252/42238/0 sis=42251) [4,7,0,6]/[4,7,6] r=-1 lpr=42251 pi=[42237,42251)/1 luod=0'0 crt=42249'6054220 mlcod 0'0 active+remapped mbc={}] enter Started/ReplicaActive/RepNotRecovering
debug -1> 2023-08-13T22:09:44.537+0000 7f0cc3718700 5 osd.0 pg_epoch: 42252 pg[6.8( v 40609'46149 (38853'42914,40609'46149] lb 6:1106ac51:::1000001050c.00000000:116 local-lis/les=42091/42092 n=90 ec=251/19 lis/c=42250/42247 les/c/f=42251/42248/0 sis=42250) [0,8,7]/[8,7,5] r=-1 lpr=42250 pi=[41932,42250)/1 luod=0'0 crt=40609'46149 lcod 0'0 mlcod 0'0 active+remapped mbc={}] exit Started/ReplicaActive/RepWaitBackfillReserved 1.360010 4 0.000159
debug 0> 2023-08-13T22:09:44.537+0000 7f0cc3718700 5 osd.0 pg_epoch: 42252 pg[6.8( v 40609'46149 (38853'42914,40609'46149] lb 6:1106ac51:::1000001050c.00000000:116 local-lis/les=42091/42092 n=90 ec=251/19 lis/c=42250/42247 les/c/f=42251/42248/0 sis=42250) [0,8,7]/[8,7,5] r=-1 lpr=42250 pi=[41932,42250)/1 luod=0'0 crt=40609'46149 lcod 0'0 mlcod 0'0 active+remapped mbc={}] enter Started/ReplicaActive/RepRecovering
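The NOTE in the backtrace asks for the executable (or its disassembly) to interpret the in-binary frames. As a sketch of how those frames could be symbolized, assuming the matching ceph-debuginfo symbols for 17.2.6 are installed and the binary path may differ on your system:

```shell
# Hypothetical sketch: symbolize the raw in-binary frames of the backtrace.
# Assumes debug symbols matching ceph-osd 17.2.6 are installed
# (e.g. the ceph-debuginfo package); the offset below is frame 8 from the log.

# Frame 8 was reported as "ceph-osd(+0x5b012d)"; addr2line maps that
# offset inside the binary to a source file and line number:
addr2line -f -C -e /usr/bin/ceph-osd 0x5b012d

# Alternatively, produce the disassembly the log's NOTE asks for:
objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump
```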
Crash dump:
# ceph crash info 2023-08-13T21:42:48.801495Z_149fbd5a-bf67-49be-9382-2ce345594b39
{
"archived": "2023-08-13 21:58:05.518639",
"backtrace": [
"/lib64/libpthread.so.0(+0x12cf0) [0x7f30ecfc8cf0]",
"gsignal()",
"abort()",
"/lib64/libstdc++.so.6(+0x9009b) [0x7f30ec5de09b]",
"/lib64/libstdc++.so.6(+0x9653c) [0x7f30ec5e453c]",
"/lib64/libstdc++.so.6(+0x96597) [0x7f30ec5e4597]",
"/lib64/libstdc++.so.6(+0x967f8) [0x7f30ec5e47f8]",
"ceph-osd(+0x5b012d) [0x56416665712d]",
"(ReplicatedBackend::submit_push_data(ObjectRecoveryInfo const&, bool, bool, bool, bool, interval_set<unsigned long, std::map>&, interval_set<unsigned long, std::map> const&, ceph::buffer::v15_2_0::list, ceph::buffer::v15_2_0::list, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > const&, ceph::os::Transaction*)+0x906) [0x564166b69e56]",
"(ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ceph::os::Transaction*, bool)+0x307) [0x564166b6b617]",
"(ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x2b1) [0x564166b6bb61]",
"(ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x260) [0x564166b6dd80]",
"(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x56) [0x56416695b386]",
"(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x512) [0x5641668edbb2]",
"(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1c0) [0x564166741760]",
"(ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x564166a42729]",
"(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x115f) [0x5641667838af]",
"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x435) [0x564166eb7b75]",
"(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x564166eba294]",
"/lib64/libpthread.so.0(+0x81ca) [0x7f30ecfbe1ca]",
"clone()"
],
"ceph_version": "17.2.6",
"crash_id": "2023-08-13T21:42:48.801495Z_149fbd5a-bf67-49be-9382-2ce345594b39",
"entity_name": "osd.0",
"os_id": "centos",
"os_name": "CentOS Stream",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-osd",
"stack_sig": "b32adeb2b1180482f7ddc835e6eae3ed628c7afc5f9daefd9ceff57362258dbc",
"timestamp": "2023-08-13T21:42:48.801495Z",
"utsname_hostname": "k3s01",
"utsname_machine": "x86_64",
"utsname_release": "5.15.0-78-generic",
"utsname_sysname": "Linux",
"utsname_version": "#85-Ubuntu SMP Fri Jul 7 15:25:09 UTC 2023"
}
Watching the logs, it appeared Ceph was making a little progress with each crash. I left Ceph in a crashloop overnight and by morning it seemed to recover:
# ceph osd status
ID  HOST   USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  k3s01  184G   515G       1    25.5k       0        0  exists,up
 1  k3s02  185G   514G       0        0       0        0  exists,up
 2  k3s03  179G   520G       1    12.0k       1      819  exists,up
 3  k3s04  162G   537G       0     4915       3      106  exists,up
 4  k3s05  169G   530G       0        0       0        0  exists,up
 5  k3s06  192G   738G       0    39.1k       0        0  exists,up
 6  k3s06  205G   725G       1    17.5k       1        0  exists,up
 7  k3s04  165G   534G       1    78.3k       1        0  exists,up
 8  k3s05  167G   532G       0     109k       0        0  exists,up
I saw a reference to a damaged PG, but no scrub information was found:
# ceph health detail
HEALTH_ERR 2 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 2 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 1.5a is active+clean+inconsistent, acting [3,5,1,2]
# rados list-inconsistent-obj 1.5a --format=json-pretty
No scrub information available for pg 1.5a
error 2: (2) No such file or directory
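Since `rados list-inconsistent-obj` reports findings from the most recent scrub, the "No scrub information available" error can occur when no (deep) scrub has completed since the inconsistency was flagged. A sketch of how the information could be regenerated (the PG id is the one from this cluster):

```shell
# Hypothetical sketch: force a deep scrub of the affected PG so that
# inconsistent-object information is (re)populated, then retry listing it.
ceph pg deep-scrub 1.5a

# Check when the deep scrub last completed before retrying:
ceph pg 1.5a query | grep -E 'state|last_deep_scrub'

# Once the scrub has finished, this should return details instead of ENOENT:
rados list-inconsistent-obj 1.5a --format=json-pretty
```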
I ran a repair on it right away:
# ceph pg repair 1.5a
instructing pg 1.5a on osd.3 to repair
# ceph -s
cluster:
id: cb82340a-2eaf-4597-b83e-cc0e62a9d019
health: HEALTH_OK
services:
mon: 3 daemons, quorum o,p,s (age 6d)
mgr: b(active, since 3d), standbys: a
mds: 1/1 daemons up, 1 hot standby
osd: 9 osds: 9 up (since 17m), 9 in (since 17m)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 12 pools, 289 pgs
objects: 151.55k objects, 434 GiB
usage: 1.6 TiB used, 5.0 TiB / 6.6 TiB avail
pgs: 289 active+clean
io:
client: 1.2 KiB/s rd, 687 KiB/s wr, 1 op/s rd, 14 op/s wr
I have no idea what caused the 100+ crashes or what the crash message is reporting, nor why the cluster was able to recover on its own.
Updated by Laura Flores 5 months ago
- Project changed from Ceph to RADOS
- Category deleted (OSD)
Updated by Igor Fedotov 5 months ago
It might be helpful to see the full (or at least last 20K lines of) OSD log preceding the crash.
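In a Rook deployment, collecting such a log might look like the sketch below; the debug levels and the pod/deployment and namespace names are assumptions that will differ per cluster:

```shell
# Hypothetical sketch: raise OSD log verbosity at runtime and capture the
# log preceding the next crash in a Rook/k3s environment.

# Increase debug verbosity for the crashing OSD:
ceph config set osd.0 debug_osd 20
ceph config set osd.0 debug_bluestore 20

# Capture the last 20k lines of the OSD pod's log after the next crash
# (deployment and namespace names are assumptions):
kubectl -n rook-ceph logs deploy/rook-ceph-osd-0 --tail=20000 > osd.0.log

# Revert verbosity once captured, to avoid excessive log volume:
ceph config rm osd.0 debug_osd
ceph config rm osd.0 debug_bluestore
```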
Updated by Radoslaw Zarzynski 5 months ago
- Status changed from New to Need More Info