Bug #46366 (Open)
Octopus: Recovery and backfilling causes OSDs to crash after upgrading from nautilus to octopus
% Done: 0%
Description
A customer upgraded their cluster from nautilus to octopus after experiencing issues with OSDs not being able to connect to each other or to the clients/mons/mgrs. The connectivity issues were caused by msgrV2 and by the require_osd_release setting not having been set to nautilus. After fixing this, the OSDs were restarted and all placement groups became active again.
After unsetting the norecover and nobackfill flags, some OSDs started crashing every few minutes. The OSD log, even with high debug settings, doesn't seem to reveal anything; it just stops logging mid log line.
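For reference, the recovery steps described above correspond roughly to the following admin commands (a sketch; the exact commands the customer ran are not in the report):

```shell
# Set the release gate that was missing before the upgrade
# (this was the fix for the msgrV2 connectivity issues):
ceph osd require-osd-release nautilus

# Re-enable recovery and backfill, after which the crashes began:
ceph osd unset norecover
ceph osd unset nobackfill
```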
The systemd journal contains the following messages:
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: *** Caught signal (Segmentation fault) **
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: in thread 557dc6fb3510 thread_name:tp_osd_tp
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: src/tcmalloc.cc:283] Attempt to free invalid pointer 0x363bbb77000
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: *** Caught signal (Aborted) **
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: in thread 557dc6fb3510 thread_name:tp_osd_tp
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: src/tcmalloc.cc:283] Attempt to free invalid pointer 0x363bbb77000
A snippet of the OSD log from around the time of the crash:
2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.17:6836 osd.111 since back 2020-07-05T06:28:30.776006+0200 front 2020-07-05T06:28:30.775261+0200 (oldest deadline 2020-07-05T06:28:53.073588+0200)
2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.37:6901 osd.146 since back 2020-07-05T06:31:01.434299+0200 front 2020-07-05T06:31:01.434534+0200 (oldest deadline 2020-07-05T06:31:27.233589+0200)
2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.38:6929 osd.180 since back 2020-07-05T06:28:18.971489+0200 front 2020-07-05T06:28:18.971597+0200 (oldest deadline 2020-07-05T06:28:50.771298+0200)
2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.38:6891 osd.189 since back 2020-07-05T06:28:18.971678+0200 front 2020-07-05T06:28:18.971894+0200 (oldest deadline 2020-07-05T06:28:44.869635+0200)
2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.48:6836 osd.229 since back 2020-07-05T06:31:07.237691+0200 front 2020-07-05T06:31:07.237226+0200 (oldest deadline 2020-07-05T06:31:30.734951+0200)
2020-07-05T06:35:04.026+0200 7ff24a7e8d80 0 set uid:gid to 64045:64045 (ceph:ceph)
2020-07-05T06:35:04.026+0200 7ff24a7e8d80 0 ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable), process ceph-osd, pid 1667604
2020-07-05T06:35:04.026+0200 7ff24a7e8d80 0 pidfile_write: ignore empty --pid-file
2020-07-05T06:35:04.026+0200 7ff24a7e8d80 1 bdev create path /var/lib/ceph/osd/ceph-127/block type kernel
2020-07-05T06:35:04.026+0200 7ff24a7e8d80 1 bdev(0x55f03b8f6380 /var/lib/ceph/osd/ceph-127/block) open path /var/lib/ceph/osd/ceph-127/block
2020-07-05T06:35:04.030+0200 7ff24a7e8d80 1 bdev(0x55f03b8f6380 /var/lib/ceph/osd/ceph-127/block) open size 12000134430720 (0xae9ffc00000, 11 TiB) block_size 4096 (4 KiB) rotational discard not supported
2020-07-05T06:35:04.030+0200 7ff24a7e8d80 1 bluestore(/var/lib/ceph/osd/ceph-127) _set_cache_sizes cache_size 1073741824 meta 0.4 kv 0.4 data 0.2
2020-07-05T06:35:04.030+0200 7ff24a7e8d80 1 bdev create path /var/lib/ceph/osd/ceph-127/block.db type kernel
2020-07-05T06:35:04.030+0200 7ff24a7e8d80 1 bdev(0x55f03b8f6a80 /var/lib/ceph/osd/ceph-127/block.db) open path /var/lib/ceph/osd/ceph-127/block.db
2020-07-05T06:35:04.030+0200 7ff24a7e8d80 1 bdev(0x55f03b8f6a80 /var/lib/ceph/osd/ceph-127/block.db) open size 128849018880 (0x1e00000000, 120 GiB) block_size 4096 (4 KiB) non-rotational discard supported
2020-07-05T06:35:04.030+0200 7ff24a7e8d80 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-127/block.db size 120 GiB
2020-07-05T06:35:04.030+0200 7ff24a7e8d80 1 bdev create path /var/lib/ceph/osd/ceph-127/block type kernel
2020-07-05T06:35:04.030+0200 7ff24a7e8d80 1 bdev(0x55f03b8f6e00 /var/lib/ceph/osd/ceph-127/block) open path /var/lib/ceph/osd/ceph-127/block
2020-07-05T06:35:04.030+0200 7ff24a7e8d80 1 bdev(0x55f03b8f6e00 /var/lib/ceph/osd/ceph-127/block) open size 12000134430720 (0xae9ffc00000, 11 TiB) block_size 4096 (4 KiB) rotational discard not supported
2020-07-05T06:35:04.030+0200 7ff24a7e8d80 1 bluefs add_block_device bdev 2 path /var/lib/ceph/osd/ceph-127/block size 11 TiB
2020-07-05T06:35:04.030+0200 7ff24a7e8d80 1 bdev create path /var/lib/ceph/osd/ceph-127/block.wal type kernel
2020-07-05T06:35:04.030+0200 7ff24a7e8d80 1 bdev(0x55f03b8f7180 /var/lib/ceph/osd/ceph-127/block.wal) open path /var/lib/ceph/osd/ceph-127/block.wal
2020-07-05T06:35:04.030+0200 7ff24a7e8d80 1 bdev(0x55f03b8f7180 /var/lib/ceph/osd/ceph-127/block.wal) open size 2147483648 (0x80000000, 2 GiB) block_size 4096 (4 KiB) non-rotational discard supported
2020-07-05T06:35:04.030+0200 7ff24a7e8d80 1 bluefs add_block_device bdev 0 path /var/lib/ceph/osd/ceph-127/block.wal size 2 GiB
2020-07-05T06:35:04.030+0200 7ff24a7e8d80 1 bdev(0x55f03b8f7180 /var/lib/ceph/osd/ceph-127/block.wal) close
A gdb backtrace that reveals some more information is attached.
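On a system using systemd-coredump, a backtrace like the attached one can typically be captured along these lines (a sketch of an assumed workflow; the report does not state how the attached backtrace was produced):

```shell
# List recorded core dumps for the crashed OSD process:
coredumpctl list ceph-osd

# Open the most recent ceph-osd core dump in gdb
# (debug symbols for the ceph packages should be installed):
coredumpctl gdb ceph-osd

# Then, inside gdb, dump backtraces for all threads:
#   (gdb) thread apply all bt
```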