Bug #54589 (open): OSD crash during node boot

Added by Moritz Roehrich about 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

What I did (a rough shell sketch of these steps follows the list):
- spin up a cluster
- write some data to the filesystem
- shut down one node
- write some more data to the filesystem
- wait for the OSDs on the affected node to be marked `down` and `out`
- write even more data to the filesystem
- boot up that node again
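
For reference, a minimal shell sketch of these steps, assuming a cephadm-managed cluster with the CephFS filesystem mounted at `/mnt/cephfs` and the affected node reachable as `node1` (the mount point and the use of `ssh`/`systemctl poweroff` are assumptions, not from the original report):

```
# Sanity-check the cluster before starting
ceph -s
ceph osd tree

# Write some data through the filesystem (mount point is an assumption)
dd if=/dev/urandom of=/mnt/cephfs/before-shutdown.bin bs=4M count=256

# Shut down one node
ssh node1 systemctl poweroff

# Write some more data while the node is down
dd if=/dev/urandom of=/mnt/cephfs/while-down-1.bin bs=4M count=256

# Wait until the node's OSDs are reported down, then marked out
ceph osd tree

# Write even more data while those OSDs are out
dd if=/dev/urandom of=/mnt/cephfs/while-down-2.bin bs=4M count=256

# Power the node back on (out of band) and watch the OSDs rejoin
ceph -w
```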

What happened:
- One of the four OSDs started up fine
- The other three OSDs had to be started manually with `ceph orch start osd.#` (see the sketch after this list)
- Of those three, two started up with no problems; one crashed repeatedly
- After a long time, when the cluster was already healthy again, the OSD that had been crashing finally started up fine as well
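
A hedged sketch of how the stuck daemons can be inspected and started by hand (osd.18 is taken from the log below; `ceph orch daemon start` is the per-daemon variant of the `ceph orch start` command mentioned above):

```
# List the daemons on the rebooted node and check which OSDs are not running
ceph orch ps node1

# Start a stopped OSD daemon explicitly
ceph orch daemon start osd.18

# Watch the OSD come up (or crash) in the cluster status
ceph -s
```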

Cluster configuration:
Five nodes with four disks each, three storage pools, and one filesystem.
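
A comparable setup can be sketched roughly as follows (pool and filesystem names are placeholders, not from the report; `ceph fs volume create` also creates the filesystem's data and metadata pools):

```
# One CephFS filesystem; cephadm deploys the MDS daemons for it
ceph fs volume create myfs

# Additional storage pools (names are placeholders)
ceph osd pool create pool-a
ceph osd pool create pool-b
```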

Example traces from the log:
```
[...]
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 765, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 766, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 767, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 768, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 769, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 770, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 771, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 772, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 773, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 774, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: /home/abuild/rpmbuild/BUILD/ceph-16.2.7-596-g7d574789716/src/osd/OSD.cc: In function 'void OSDShard::register_and_wake_split_child(PG*)' thread 7f050558f700 time 2022-03-15T11:34:53.622051+0000
Mar 15 12:34:53 node1 conmon[4209]: /home/abuild/rpmbuild/BUILD/ceph-16.2.7-596-g7d574789716/src/osd/OSD.cc: 10722: FAILED ceph_assert(p != pg_slots.end())
Mar 15 12:34:53 node1 conmon[4209]: ceph version 16.2.7-596-g7d574789716 (7d574789716b837713efea9ff29454afaaacf48a) pacific (stable)
Mar 15 12:34:53 node1 conmon[4209]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x560ddab853aa]
Mar 15 12:34:53 node1 conmon[4209]: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x560ddab85585]
Mar 15 12:34:53 node1 conmon[4209]: 3: (OSDShard::register_and_wake_split_child(PG*)+0x7e6) [0x560ddac5e6a6]
Mar 15 12:34:53 node1 conmon[4209]: 4: (OSD::_finish_splits(std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >&)+0x110) [0x560ddac5e830]
Mar 15 12:34:53 node1 conmon[4209]: 5: (Context::complete(int)+0x9) [0x560ddac6ace9]
Mar 15 12:34:53 node1 conmon[4209]: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xd74) [0x560ddac4ced4]
Mar 15 12:34:53 node1 conmon[4209]: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x560ddb2ad54c]
Mar 15 12:34:53 node1 conmon[4209]: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560ddb2b0a10]
Mar 15 12:34:53 node1 conmon[4209]: 9: /lib64/libpthread.so.0(+0xa6ea) [0x7f05265046ea]
Mar 15 12:34:53 node1 conmon[4209]: 10: clone()
Mar 15 12:34:53 node1 conmon[4209]: /home/abuild/rpmbuild/BUILD/ceph-16.2.7-596-g7d574789716/src/osd/OSD.cc: In function 'void OSDShard::register_and_wake_split_child(PG*)' thread 7f050558f700 time 2022-03-15T11:34:53.622051+0000
Mar 15 12:34:53 node1 conmon[4209]: /home/abuild/rpmbuild/BUILD/ceph-16.2.7-596-g7d574789716/src/osd/OSD.cc: 10722: FAILED ceph_assert(p != pg_slots.end())
Mar 15 12:34:53 node1 conmon[4209]: ceph version 16.2.7-596-g7d574789716 (7d574789716b837713efea9ff29454afaaacf48a) pacific (stable)
Mar 15 12:34:53 node1 conmon[4209]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x560ddab853aa]
Mar 15 12:34:53 node1 conmon[4209]: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x560ddab85585]
Mar 15 12:34:53 node1 conmon[4209]: 3: (OSDShard::register_and_wake_split_child(PG*)+0x7e6) [0x560ddac5e6a6]
Mar 15 12:34:53 node1 conmon[4209]: 4: (OSD::_finish_splits(std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >&)+0x110) [0x560ddac5e830]
Mar 15 12:34:53 node1 conmon[4209]: 5: (Context::complete(int)+0x9) [0x560ddac6ace9]
Mar 15 12:34:53 node1 conmon[4209]: 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xd74) [0x560ddac4ced4]
Mar 15 12:34:53 node1 conmon[4209]: 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x560ddb2ad54c]
Mar 15 12:34:53 node1 conmon[4209]: 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560ddb2b0a10]
Mar 15 12:34:53 node1 conmon[4209]: 9: /lib64/libpthread.so.0(+0xa6ea) [0x7f05265046ea]
Mar 15 12:34:53 node1 conmon[4209]: 10: clone()
Mar 15 12:34:53 node1 conmon[4209]: *** Caught signal (Aborted) **
Mar 15 12:34:53 node1 conmon[4209]: in thread 7f0505d90700 thread_name:tp_osd_tp
Mar 15 12:34:53 node1 conmon[4209]:
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 775, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 776, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 777, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 778, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 779, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 780, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 781, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 782, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 783, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 784, got 0 bytes
[...]
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 822, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 823, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: ceph version 16.2.7-596-g7d574789716 (7d574789716b837713efea9ff29454afaaacf48a) pacific (stable)
Mar 15 12:34:53 node1 conmon[4209]: 1: /lib64/libpthread.so.0(+0x168c0) [0x7f05265108c0]
Mar 15 12:34:53 node1 conmon[4209]: 2: gsignal()
Mar 15 12:34:53 node1 conmon[4209]: 3: abort()
Mar 15 12:34:53 node1 conmon[4209]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19c) [0x560ddab853fb]
Mar 15 12:34:53 node1 conmon[4209]: 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x560ddab85585]
Mar 15 12:34:53 node1 conmon[4209]: 6: (OSDShard::register_and_wake_split_child(PG*)+0x7e6) [0x560ddac5e6a6]
Mar 15 12:34:53 node1 conmon[4209]: 7: (OSD::_finish_splits(std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >&)+0x110) [0x560ddac5e830]
Mar 15 12:34:53 node1 conmon[4209]: 8: (Context::complete(int)+0x9) [0x560ddac6ace9]
Mar 15 12:34:53 node1 conmon[4209]: 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xd74) [0x560ddac4ced4]
Mar 15 12:34:53 node1 conmon[4209]: 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x560ddb2ad54c]
Mar 15 12:34:53 node1 conmon[4209]: 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560ddb2b0a10]
Mar 15 12:34:53 node1 conmon[4209]: 12: /lib64/libpthread.so.0(+0xa6ea) [0x7f05265046ea]
Mar 15 12:34:53 node1 conmon[4209]: 13: clone()
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 824, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.233+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 825, got 0 bytes
[...]
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.613+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 842, got 0 bytes
Mar 15 12:34:53 node1 conmon[4209]: debug 2022-03-15T11:34:53.613+0000 7f050458d700 -1 osd.18 1004 failed to load OSD map for epoch 843, got 0 bytes
Mar 15 12:35:02 node1 podman[6410]: 2022-03-15 12:35:02.769198422 +0100 CET m=+0.026565274 container died 96c979d4051ae6376557448542e7ad9d6c0a5108dcd00b7b80b0903b5c633176 (image=registry.suse.de/devel/storage/7.0/pacific/containers/ses/7.1/ceph/ceph@sha256:c1141ebf3f59e7833ab9bd4562f562715c5336ef12271c3e6ec35f19f487cf2d, name=ceph-87cba79e-a37f-11ec-92b3-525400d8d43e-osd-18)
Mar 15 12:35:02 node1 podman[6410]: 2022-03-15 12:35:02.846461212 +0100 CET m=+0.103828054 container remove 96c979d4051ae6376557448542e7ad9d6c0a5108dcd00b7b80b0903b5c633176 (image=registry.suse.de/devel/storage/7.0/pacific/containers/ses/7.1/ceph/ceph@sha256:c1141ebf3f59e7833ab9bd4562f562715c5336ef12271c3e6ec35f19f487cf2d, name=ceph-87cba79e-a37f-11ec-92b3-525400d8d43e-osd-18, org.openbuildservice.disturl=obs://build.suse.de/Devel:Storage:7.0:Pacific/containers/5f45aadcb4178f70ad913abfb504ad12-ceph-image, com.suse.sle.base.reference=registry.suse.com/suse/sle15:15.3.17.11.3, com.suse.rook.created=2022-03-11T10:03:33.430040372Z, com.suse.lifecycle-url=https://www.suse.com/lifecycle, com.suse.ses.disturl=obs://build.suse.de/Devel:Storage:7.0:Pacific/containers/5f45aadcb4178f70ad913abfb504ad12-ceph-image, org.opencontainers.image.source=https://sources.suse.com/SUSE:SLE-15-SP3:Update:CR/sles15-image/f5bf6d4e56940c21f38c6a4b358f6653/, com.suse.ses.version=7.1, com.suse.eula=sle-bci, com.suse.sle.base.created=2022-03-11T06:26:59.775646395Z, com.suse.sle.base.vendor=SUSE LLC, com.suse.sle.base.image-type=sle-bci, com.suse.sle.base.eula=sle-bci, com.suse.sle.base.title=SUSE Linux Enterprise Server 15 SP3 Base Container Image, com.suse.ceph.version=16.2.7.596, com.suse.sle.base.version=15.3.17.11.3, ceph=True, com.suse.ses.url=https://www.suse.com/solutions/software-defined-storage/, com.suse.release-stage=released, com.suse.ses.description=Ceph container image, io.ceph.version=16.2.7.596, com.suse.image-type=sle-bci, com.suse.ses.title=SUSE Enterprise Storage 7.1, com.suse.sle.base.lifecycle-url=https://www.suse.com/lifecycle, org.opencontainers.image.version=16.2.7.596.11.33, org.opencontainers.image.created=2022-03-11T10:03:33.430040372Z, com.suse.sle.base.url=https://www.suse.com/products/server/, org.opencontainers.image.url=https://registry.suse.com/ses/7.1/ceph/ceph:16.2.7.596.11.33, com.suse.sle.base.release-stage=released, org.opensuse.reference=registry.suse.com/ses/7.1/ceph/ceph:16.2.7.596.11.33, com.suse.sle.base.description=Image for containers based on SUSE Linux Enterprise Server 15 SP3., com.suse.sle.base.source=https://sources.suse.com/SUSE:SLE-15-SP3:Update:CR/sles15-image/f5bf6d4e56940c21f38c6a4b358f6653/, org.opencontainers.image.vendor=SUSE LLC, com.suse.ceph.url=https://ceph.com/, com.suse.ses.created=2022-03-11T10:03:33.430040372Z, com.suse.sle.base.disturl=obs://build.suse.de/SUSE:SLE-15-SP3:Update:CR/images/f5bf6d4e56940c21f38c6a4b358f6653-sles15-image, com.suse.ses.reference=registry.suse.com/ses/7.1/ceph/ceph:16.2.7.596.11.33, com.suse.ses.vendor=SUSE LLC, org.opencontainers.image.description=Ceph container image, org.opencontainers.image.title=SUSE Enterprise Storage 7.1)
Mar 15 12:35:02 node1 systemd[1]: : Main process exited, code=exited, status=134/n/a
```

Full log: https://gist.githubusercontent.com/m-ildefons/3c884af8d9eee83c3ec02bf1d8c391a0/raw/a71e6cc9d818ed544a1eadd019055dab8404d460/gistfile1.txt
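
For anyone triaging a similar crash, a hedged sketch of pulling the recorded crash metadata and the daemon's journal on a cephadm cluster (the fsid is taken from the container name in the log above; the crash ID is a placeholder):

```
# List crashes recorded by the cluster's crash module, then inspect one
ceph crash ls
ceph crash info <crash-id>

# On the affected node, read the OSD's journal directly; cephadm names
# the unit ceph-<fsid>@<daemon>
journalctl -u ceph-87cba79e-a37f-11ec-92b3-525400d8d43e@osd.18
```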


Related issues (1 open, 0 closed)

Related to RADOS - Bug #56770: crash: void OSDShard::register_and_wake_split_child(PG*): assert(p != pg_slots.end()) (New)


Updated by Laura Flores over 1 year ago

  • Related to Bug #56770: crash: void OSDShard::register_and_wake_split_child(PG*): assert(p != pg_slots.end()) added