Bug #46366

open

Octopus: Recovery and backfilling causes OSDs to crash after upgrading from nautilus to octopus

Added by Wout van Heeswijk almost 4 years ago. Updated over 2 years ago.

Status: Fix Under Review
Priority: Normal
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: octopus
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

A customer upgraded their cluster from nautilus to octopus after experiencing issues with OSDs being unable to connect to each other or to clients, mons and mgrs. The connectivity issues were related to msgrV2 and to the require_osd_release setting not being set to nautilus. After fixing this, the OSDs were restarted and all placement groups became active again.

After unsetting the norecover and nobackfill flags, some OSDs started crashing every few minutes. The OSD log, even with high debug settings, doesn't seem to reveal anything; it just stops logging in the middle of a line.

In the systemd journal there is the following message:

Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: *** Caught signal (Segmentation fault) **
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]:  in thread 557dc6fb3510 thread_name:tp_osd_tp
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: src/tcmalloc.cc:283] Attempt to free invalid pointer 0x363bbb77000
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: *** Caught signal (Aborted) **
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]:  in thread 557dc6fb3510 thread_name:tp_osd_tp
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: src/tcmalloc.cc:283] Attempt to free invalid pointer 0x363bbb77000
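
For triage across many OSDs, the journal messages above can be reduced to a short crash summary. The following is only a minimal sketch: the function name `summarize_crash` and the regular expressions are illustrative assumptions based on the six lines quoted here, not an existing Ceph tool.

```python
import re

# Journal lines copied verbatim from the excerpt above.
JOURNAL = """\
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: *** Caught signal (Segmentation fault) **
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]:  in thread 557dc6fb3510 thread_name:tp_osd_tp
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: src/tcmalloc.cc:283] Attempt to free invalid pointer 0x363bbb77000
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: *** Caught signal (Aborted) **
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]:  in thread 557dc6fb3510 thread_name:tp_osd_tp
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: src/tcmalloc.cc:283] Attempt to free invalid pointer 0x363bbb77000
"""

# Patterns are assumptions derived from the lines above only.
SIGNAL_RE = re.compile(r"ceph-osd\[(\d+)\]: \*\*\* Caught signal \((.+?)\) \*\*")
THREAD_RE = re.compile(r"thread_name:(\S+)")
POINTER_RE = re.compile(r"Attempt to free invalid pointer (0x[0-9a-fA-F]+)")


def summarize_crash(text):
    """Collect pids, signals, thread names and invalid pointers from journal text."""
    summary = {"pids": set(), "signals": [], "threads": set(), "invalid_pointers": set()}
    for line in text.splitlines():
        signal = SIGNAL_RE.search(line)
        if signal:
            summary["pids"].add(int(signal.group(1)))
            summary["signals"].append(signal.group(2))
        thread = THREAD_RE.search(line)
        if thread:
            summary["threads"].add(thread.group(1))
        pointer = POINTER_RE.search(line)
        if pointer:
            summary["invalid_pointers"].add(pointer.group(1))
    return summary


if __name__ == "__main__":
    print(summarize_crash(JOURNAL))
```

Feeding it the output of `journalctl` for each crashing OSD would show whether every crash happens in the same thread pool and with the same tcmalloc complaint, as it does in the excerpt above.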

Snippet of the OSD log from around the time of a crash:

2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.17:6836 osd.111 since back 2020-07-05T06:28:30.776006+0200 front 2020-07-05T06:28:30.775261+0200 (oldest deadline 2020-07-05T06:28:53.073588+0200)
2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.37:6901 osd.146 since back 2020-07-05T06:31:01.434299+0200 front 2020-07-05T06:31:01.434534+0200 (oldest deadline 2020-07-05T06:31:27.233589+0200)
2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.38:6929 osd.180 since back 2020-07-05T06:28:18.971489+0200 front 2020-07-05T06:28:18.971597+0200 (oldest deadline 2020-07-05T06:28:50.771298+0200)
2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.38:6891 osd.189 since back 2020-07-05T06:28:18.971678+0200 front 2020-07-05T06:28:18.971894+0200 (oldest deadline 2020-07-05T06:28:44.869635+0200)
2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.48:6836 osd.229 since back 2020-07-05T06:31:07.237691+0200 front 2020-07-05T06:31:07.237226+0200 (oldest deadline 2020-07-05T06:31:30.734951+0200)
2020-07-05T06:35:04.026+0200 7ff24a7e8d80  0 set uid:gid to 64045:64045 (ceph:ceph)
2020-07-05T06:35:04.026+0200 7ff24a7e8d80  0 ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable), process ceph-osd, pid 1667604
2020-07-05T06:35:04.026+0200 7ff24a7e8d80  0 pidfile_write: ignore empty --pid-file
2020-07-05T06:35:04.026+0200 7ff24a7e8d80  1 bdev create path /var/lib/ceph/osd/ceph-127/block type kernel
2020-07-05T06:35:04.026+0200 7ff24a7e8d80  1 bdev(0x55f03b8f6380 /var/lib/ceph/osd/ceph-127/block) open path /var/lib/ceph/osd/ceph-127/block
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev(0x55f03b8f6380 /var/lib/ceph/osd/ceph-127/block) open size 12000134430720 (0xae9ffc00000, 11 TiB) block_size 4096 (4 KiB) rotational discard not supported
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bluestore(/var/lib/ceph/osd/ceph-127) _set_cache_sizes cache_size 1073741824 meta 0.4 kv 0.4 data 0.2
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev create path /var/lib/ceph/osd/ceph-127/block.db type kernel
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev(0x55f03b8f6a80 /var/lib/ceph/osd/ceph-127/block.db) open path /var/lib/ceph/osd/ceph-127/block.db
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev(0x55f03b8f6a80 /var/lib/ceph/osd/ceph-127/block.db) open size 128849018880 (0x1e00000000, 120 GiB) block_size 4096 (4 KiB) non-rotational discard supported
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-127/block.db size 120 GiB
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev create path /var/lib/ceph/osd/ceph-127/block type kernel
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev(0x55f03b8f6e00 /var/lib/ceph/osd/ceph-127/block) open path /var/lib/ceph/osd/ceph-127/block
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev(0x55f03b8f6e00 /var/lib/ceph/osd/ceph-127/block) open size 12000134430720 (0xae9ffc00000, 11 TiB) block_size 4096 (4 KiB) rotational discard not supported
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bluefs add_block_device bdev 2 path /var/lib/ceph/osd/ceph-127/block size 11 TiB
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev create path /var/lib/ceph/osd/ceph-127/block.wal type kernel
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev(0x55f03b8f7180 /var/lib/ceph/osd/ceph-127/block.wal) open path /var/lib/ceph/osd/ceph-127/block.wal
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev(0x55f03b8f7180 /var/lib/ceph/osd/ceph-127/block.wal) open size 2147483648 (0x80000000, 2 GiB) block_size 4096 (4 KiB) non-rotational discard supported
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bluefs add_block_device bdev 0 path /var/lib/ceph/osd/ceph-127/block.wal size 2 GiB
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev(0x55f03b8f7180 /var/lib/ceph/osd/ceph-127/block.wal) close
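
The heartbeat_check lines at the start of this snippet show which peers osd.127 had stopped hearing from before it went down. A rough sketch of how those lines could be summarized follows; the helper name `silent_peers` and the regular expression are assumptions made for illustration, and "silence" is taken here to mean the log timestamp minus the more recent of the back and front reply times.

```python
import re
from datetime import datetime

# Three of the heartbeat_check lines from the snippet above, each on one line.
LOG = """\
2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.17:6836 osd.111 since back 2020-07-05T06:28:30.776006+0200 front 2020-07-05T06:28:30.775261+0200 (oldest deadline 2020-07-05T06:28:53.073588+0200)
2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.37:6901 osd.146 since back 2020-07-05T06:31:01.434299+0200 front 2020-07-05T06:31:01.434534+0200 (oldest deadline 2020-07-05T06:31:27.233589+0200)
2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.48:6836 osd.229 since back 2020-07-05T06:31:07.237691+0200 front 2020-07-05T06:31:07.237226+0200 (oldest deadline 2020-07-05T06:31:30.734951+0200)
"""

# Pattern is an assumption based on the line layout shown in this report.
HEARTBEAT_RE = re.compile(
    r"^(?P<ts>\S+) \S+ -1 (?P<reporter>osd\.\d+) \d+ heartbeat_check: "
    r"no reply from (?P<addr>\S+) (?P<peer>osd\.\d+) "
    r"since back (?P<back>\S+) front (?P<front>\S+)"
)
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def silent_peers(text):
    """Return (peer, address, seconds of silence) tuples, longest silence first."""
    rows = []
    for line in text.splitlines():
        match = HEARTBEAT_RE.match(line)
        if not match:
            continue  # skip anything that is not a heartbeat_check line
        logged = datetime.strptime(match["ts"], TIME_FORMAT)
        back = datetime.strptime(match["back"], TIME_FORMAT)
        front = datetime.strptime(match["front"], TIME_FORMAT)
        silence = (logged - max(back, front)).total_seconds()
        rows.append((match["peer"], match["addr"], silence))
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for peer, addr, silence in silent_peers(LOG):
        print(f"{peer} at {addr}: no reply for {silence:.1f}s")
```

Run against the full OSD log, a summary like this makes it easier to see whether the silent peers are the OSDs that were crashing at the time.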

A gdb backtrace is attached that reveals some more info.


Files

backtrace.txt (4.7 KB), Wout van Heeswijk, 07/05/2020 12:26 PM
stacktrace.gz (6.04 KB), Wout van Heeswijk, 07/06/2020 05:36 PM
aio.patch (798 Bytes), Wout van Heeswijk, 07/06/2020 05:36 PM