Bug #46366

Octopus: Recovery and backfilling causes OSDs to crash after upgrading from nautilus to octopus

Added by Wout van Heeswijk almost 4 years ago. Updated over 2 years ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
octopus
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

A customer upgraded their cluster from nautilus to octopus after experiencing issues with OSDs not being able to connect to each other or to clients/mons/mgrs. The connectivity issues were related to msgr v2 and the require_osd_release setting not being set to nautilus. After fixing this, the OSDs were restarted and all placement groups became active again.
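
For reference, a fix like the one described above is normally applied with the standard CLI; the commands below are a sketch, not necessarily the exact ones used on this cluster:

ceph mon enable-msgr2                      # enable the msgr v2 protocol on the monitors
ceph osd require-osd-release nautilus      # set require_osd_release to nautilus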

After unsetting the norecover and nobackfill flags, some OSDs started crashing every few minutes. The OSD log, even with high debug settings, doesn't seem to reveal anything; it just stops logging mid line.
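
For reference, those flags were cleared with the usual cluster-wide commands, along the lines of:

ceph osd unset norecover     # allow recovery again
ceph osd unset nobackfill    # allow backfill again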

In the systemd journal there are the following messages:

Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: *** Caught signal (Segmentation fault) **
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]:  in thread 557dc6fb3510 thread_name:tp_osd_tp
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: src/tcmalloc.cc:283] Attempt to free invalid pointer 0x363bbb77000
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: *** Caught signal (Aborted) **
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]:  in thread 557dc6fb3510 thread_name:tp_osd_tp
Jul 05 13:41:50 st0.r23.spod1.rtm0.transip.io ceph-osd[92605]: src/tcmalloc.cc:283] Attempt to free invalid pointer 0x363bbb77000

A snippet of the OSD log from around the time of the crash:

2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.17:6836 osd.111 since back 2020-07-05T06:28:30.776006+0200 front 2020-07-05T06:28:30.775261+0200 (oldest deadline 2020-07-05T06:28:53.073588+0200)
2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.37:6901 osd.146 since back 2020-07-05T06:31:01.434299+0200 front 2020-07-05T06:31:01.434534+0200 (oldest deadline 2020-07-05T06:31:27.233589+0200)
2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.38:6929 osd.180 since back 2020-07-05T06:28:18.971489+0200 front 2020-07-05T06:28:18.971597+0200 (oldest deadline 2020-07-05T06:28:50.771298+0200)
2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.38:6891 osd.189 since back 2020-07-05T06:28:18.971678+0200 front 2020-07-05T06:28:18.971894+0200 (oldest deadline 2020-07-05T06:28:44.869635+0200)
2020-07-05T06:31:33.547+0200 7f8860296700 -1 osd.127 1496224 heartbeat_check: no reply from 10.200.19.48:6836 osd.229 since back 2020-07-05T06:31:07.237691+0200 front 2020-07-05T06:31:07.237226+0200 (oldest deadline 2020-07-05T06:31:30.734951+0200)
2020-07-05T06:35:04.026+0200 7ff24a7e8d80  0 set uid:gid to 64045:64045 (ceph:ceph)
2020-07-05T06:35:04.026+0200 7ff24a7e8d80  0 ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable), process ceph-osd, pid 1667604
2020-07-05T06:35:04.026+0200 7ff24a7e8d80  0 pidfile_write: ignore empty --pid-file
2020-07-05T06:35:04.026+0200 7ff24a7e8d80  1 bdev create path /var/lib/ceph/osd/ceph-127/block type kernel
2020-07-05T06:35:04.026+0200 7ff24a7e8d80  1 bdev(0x55f03b8f6380 /var/lib/ceph/osd/ceph-127/block) open path /var/lib/ceph/osd/ceph-127/block
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev(0x55f03b8f6380 /var/lib/ceph/osd/ceph-127/block) open size 12000134430720 (0xae9ffc00000, 11 TiB) block_size 4096 (4 KiB) rotational discard not supported
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bluestore(/var/lib/ceph/osd/ceph-127) _set_cache_sizes cache_size 1073741824 meta 0.4 kv 0.4 data 0.2
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev create path /var/lib/ceph/osd/ceph-127/block.db type kernel
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev(0x55f03b8f6a80 /var/lib/ceph/osd/ceph-127/block.db) open path /var/lib/ceph/osd/ceph-127/block.db
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev(0x55f03b8f6a80 /var/lib/ceph/osd/ceph-127/block.db) open size 128849018880 (0x1e00000000, 120 GiB) block_size 4096 (4 KiB) non-rotational discard supported
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-127/block.db size 120 GiB
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev create path /var/lib/ceph/osd/ceph-127/block type kernel
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev(0x55f03b8f6e00 /var/lib/ceph/osd/ceph-127/block) open path /var/lib/ceph/osd/ceph-127/block
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev(0x55f03b8f6e00 /var/lib/ceph/osd/ceph-127/block) open size 12000134430720 (0xae9ffc00000, 11 TiB) block_size 4096 (4 KiB) rotational discard not supported
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bluefs add_block_device bdev 2 path /var/lib/ceph/osd/ceph-127/block size 11 TiB
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev create path /var/lib/ceph/osd/ceph-127/block.wal type kernel
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev(0x55f03b8f7180 /var/lib/ceph/osd/ceph-127/block.wal) open path /var/lib/ceph/osd/ceph-127/block.wal
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev(0x55f03b8f7180 /var/lib/ceph/osd/ceph-127/block.wal) open size 2147483648 (0x80000000, 2 GiB) block_size 4096 (4 KiB) non-rotational discard supported
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bluefs add_block_device bdev 0 path /var/lib/ceph/osd/ceph-127/block.wal size 2 GiB
2020-07-05T06:35:04.030+0200 7ff24a7e8d80  1 bdev(0x55f03b8f7180 /var/lib/ceph/osd/ceph-127/block.wal) close

A gdb backtrace is attached that reveals some more info.
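
For anyone wanting to reproduce such a backtrace, something like the following gdb invocation against a core dump should work (the core path is a placeholder):

gdb /usr/bin/ceph-osd /path/to/core --batch -ex "thread apply all bt full" > backtrace.txt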


Files

backtrace.txt (4.7 KB) backtrace.txt Wout van Heeswijk, 07/05/2020 12:26 PM
stacktrace.gz (6.04 KB) stacktrace.gz Wout van Heeswijk, 07/06/2020 05:36 PM
aio.patch (798 Bytes) aio.patch Wout van Heeswijk, 07/06/2020 05:36 PM
Actions #1

Updated by Kefu Chai almost 4 years ago

hi Wout,

what distro are you using? where did the ceph packages come from? could you install the debug symbol packages for ceph-osd and librados2 for a more readable backtrace?
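
On Ubuntu with the download.ceph.com packages that would typically be the -dbg packages; the exact package names below are an assumption and may differ per repository:

apt-get install ceph-osd-dbg librados2-dbg    # assumed package names; verify against the repo in use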

Actions #2

Updated by Wout van Heeswijk almost 4 years ago

Hi Kefu,

Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic
Source package: deb http://download.ceph.com/debian-octopus bionic main

Package: ceph-common
Status: install ok installed
Priority: optional
Section: admin
Installed-Size: 73640
Maintainer: Ceph Maintainers <>
Architecture: amd64
Source: ceph
Version: 15.2.4-1bionic

ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)

uname -a current kernel: 4.15.0-99-generic
uname -a backported kernel: 5.4.0-40-generic

I'm installing the debug symbol packages as I'm typing this.

Actions #3

Updated by Wout van Heeswijk almost 4 years ago

We've also tested with the docker container distribution. It crashes with the same segfault:

*** Caught signal (Segmentation fault) **
 in thread 55918cd11a10 thread_name:tp_osd_tp

The OSD was started with the CLI below (FSID and other identifiers redacted as placeholders).

/bin/docker run --rm --net=host --privileged --group-add=disk --name
ceph-[FSID-PREFIX]-osd.[OSD-ID] -e
CONTAINER_IMAGE=docker.io/ceph/ceph:v15 -e NODE_NAME=[HOSTNAME] -v
/var/run/ceph/[FSID-PREFIX]:/var/run/ceph:z -v
/var/log/ceph/[FSID-PREFIX]:/var/log/ceph:z -v
/var/lib/ceph/[FSID-PREFIX]/crash:/var/lib/ceph/crash:z -v
/var/lib/ceph/[FSID-PREFIX]/osd.[OSD-ID]:/var/lib/ceph/osd/ceph-3:z -v
/var/lib/ceph/[FSID-PREFIX]/osd.[OSD-ID]/config:/etc/ceph/ceph.conf:z -v
/dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
/run/lock/lvm:/run/lock/lvm --entrypoint /usr/bin/ceph-osd
docker.io/ceph/ceph:v15 -n osd.[OSD-ID] -f --setuser ceph --setgroup
ceph --default-log-to-file=false --default-log-to-stderr=true
--default-log-stderr-prefix=debug

Updated by Wout van Heeswijk almost 4 years ago

Wout van Heeswijk wrote:

After reverting https://github.com/ceph/ceph/commit/74be30c8b7c, the crashing ceph-osds seem to be stable.

Related:
https://github.com/ceph/ceph/pull/30061
https://github.com/ceph/ceph/pull/29588

The above commit was identified based on a backtrace captured while running with the attached patch applied.
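
For completeness: reverting a single commit for such a test build is typically done on a local source checkout, roughly as below (the tag is an assumption):

git checkout v15.2.4            # assumed: the release the cluster runs
git revert 74be30c8b7c          # back out the suspect commit
# ...then rebuild and redeploy ceph-osd as usual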

Actions #6

Updated by xie xingguo almost 4 years ago

Wout van Heeswijk wrote:

Wout van Heeswijk wrote:

After reverting https://github.com/ceph/ceph/commit/74be30c8b7c, the crashing ceph-osds seem to be stable.

Related:
https://github.com/ceph/ceph/pull/30061
https://github.com/ceph/ceph/pull/29588

The above commit was identified based on a backtrace captured while running with the attached patch applied.

Hi, Wout
I am wondering how it could be possible for an object to become so fragmented :(
Did you modify the default size of RADOS objects? What kind of app (rbd, rgw, cephfs) are you using?

Actions #7

Updated by Kefu Chai almost 4 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Kefu Chai
  • Backport set to octopus
  • Pull request ID set to 35952

hi, Wout,

thanks for your investigation and patch. I revised it to use boost::container::small_vector; see https://github.com/ceph/ceph/pull/35952
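
For anyone who wants to test the proposed fix before it merges, the PR branch can be fetched directly from GitHub, roughly:

git fetch https://github.com/ceph/ceph.git pull/35952/head:pr-35952   # fetch the PR branch
git checkout pr-35952
# ...then rebuild ceph-osd as usual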

Actions #8

Updated by Wout van Heeswijk almost 4 years ago

Kefu Chai wrote:

hi, Wout,

thanks for your investigation and patch. I revised it to use boost::container::small_vector; see https://github.com/ceph/ceph/pull/35952

No problem, we had a good bug hunt session with our customer! Thank you for this quick fix, Kefu!

Currently some Jenkins tests are failing, but that is not related to the patch; it is caused by a lack of space in the build environment, and other tests are failing for the same reason.

Actions #9

Updated by Wout van Heeswijk almost 4 years ago

xie xingguo wrote:

Wout van Heeswijk wrote:

Wout van Heeswijk wrote:

After reverting https://github.com/ceph/ceph/commit/74be30c8b7c, the crashing ceph-osds seem to be stable.

Related:
https://github.com/ceph/ceph/pull/30061
https://github.com/ceph/ceph/pull/29588

The above commit was identified based on a backtrace captured while running with the attached patch applied.

Hi, Wout
I am wondering how it could be possible for an object to become so fragmented :(
Did you modify the default size of RADOS objects? What kind of app (rbd, rgw, cephfs) are you using?

I created logs with high debug settings yesterday. I've asked whether they can be shared via another route, with the links posted here.
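
The high debug settings were the usual OSD-side debug levels, set with something like the following on the affected OSD (values illustrative):

ceph tell osd.127 config set debug_osd 20
ceph tell osd.127 config set debug_bluestore 20
ceph tell osd.127 config set debug_bdev 20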

Actions #10

Updated by Wout van Heeswijk almost 4 years ago

xie xingguo wrote:

Wout van Heeswijk wrote:

Wout van Heeswijk wrote:

After reverting https://github.com/ceph/ceph/commit/74be30c8b7c, the crashing ceph-osds seem to be stable.

Related:
https://github.com/ceph/ceph/pull/30061
https://github.com/ceph/ceph/pull/29588

The above commit was identified based on a backtrace captured while running with the attached patch applied.

Hi, Wout
I am wondering how it could be possible for an object to become so fragmented :(
Did you modify the default size of RADOS objects? What kind of app (rbd, rgw, cephfs) are you using?

Here are the rather large log files:
https://robing-disk.stackstorage.com/s/eEM0xaCuF6MBdkqg
https://robing-disk.stackstorage.com/s/KfFPQpAfrOgiCLF9

Actions #11

Updated by Wout van Heeswijk over 3 years ago

Kefu Chai wrote:

hi, Wout,

thanks for your investigation and patch. I revised it to use boost::container::small_vector; see https://github.com/ceph/ceph/pull/35952

Hi Kefu,

Do you have any thoughts on this ticket?

Actions #12

Updated by Ruben Kerkhof over 2 years ago

We have finally been able to reproduce this, and it is still an issue in current master (17.0.0-9218-g0f7791fa).
$ sudo rbd create 42on --size 1MB
$ sudo rbd bench --io-type write --io-size 1 --io-threads 1 --io-total 100000 --io-pattern rand 42on
$ sudo rbd info 42on | grep block_name_prefix
block_name_prefix: rbd_data.25a5bdaf5203b
$ sudo ceph osd map rbd rbd_data.25a5bdaf5203b.0000000000000000
osdmap e858 pool 'rbd' (2) object 'rbd_data.25a5bdaf5203b.0000000000000000' -> pg 2.595bd128 (2.8) -> up ([0,2,1], p0) acting ([0,2,1], p0)

Now trigger recovery by marking the primary out:
$ sudo ceph osd out 0
marked out osd.0.

Observe that both replicas crash:
$ sudo ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 0.09448 - 86 GiB 712 MiB 155 MiB 147 KiB 558 MiB 85 GiB 0.81 1.00 - root default
-7 0.03149 - 32 GiB 100 MiB 54 MiB 50 KiB 46 MiB 32 GiB 0.30 0.37 - host osd0
2 ssd 0.01050 1.00000 11 GiB 32 MiB 8.2 MiB 13 KiB 24 MiB 11 GiB 0.29 0.36 3 up osd.2
5 ssd 0.01050 1.00000 11 GiB 42 MiB 31 MiB 18 KiB 11 MiB 11 GiB 0.38 0.47 12 up osd.5
8 ssd 0.01050 1.00000 11 GiB 26 MiB 15 MiB 19 KiB 11 MiB 11 GiB 0.24 0.30 9 up osd.8
-3 0.03149 - 32 GiB 525 MiB 54 MiB 63 KiB 471 MiB 32 GiB 1.59 1.97 - host osd1
1 ssd 0.01050 1.00000 11 GiB 484 MiB 39 MiB 16 KiB 445 MiB 10 GiB 4.41 5.44 0 down osd.1
3 ssd 0.01050 1.00000 11 GiB 19 MiB 7.8 MiB 15 KiB 11 MiB 11 GiB 0.17 0.21 13 up osd.3
6 ssd 0.01050 1.00000 11 GiB 22 MiB 7.3 MiB 32 KiB 15 MiB 11 GiB 0.20 0.25 10 up osd.6
-5 0.03149 - 21 GiB 87 MiB 46 MiB 34 KiB 41 MiB 21 GiB 0.40 0.49 - host osd2
0 ssd 0.01050 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down osd.0
4 ssd 0.01050 1.00000 11 GiB 57 MiB 38 MiB 12 KiB 19 MiB 11 GiB 0.52 0.64 14 up osd.4
7 ssd 0.01050 1.00000 11 GiB 30 MiB 7.8 MiB 22 KiB 22 MiB 11 GiB 0.27 0.33 19 up osd.7
TOTAL 86 GiB 712 MiB 155 MiB 152 KiB 558 MiB 85 GiB 0.81

With the following backtrace:
ceph version 17.0.0-9218-g0f7791fa (0f7791fa24a07d543fb5a2de11c15c3b4c01dccd) quincy (dev)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7fbf0a1733c0]
2: gsignal()
3: abort()
4: /lib/x86_64-linux-gnu/libc.so.6(+0x25729) [0x7fbf09bd6729]
5: /lib/x86_64-linux-gnu/libc.so.6(+0x36f36) [0x7fbf09be7f36]
6: (KernelDevice::aio_submit(IOContext*)+0x924) [0x56312dcf7334]
7: (BlueStore::_do_readv(BlueStore::Collection*, boost::intrusive_ptr<BlueStore::Onode>, interval_set<unsigned long, std::map> const&, ceph::buffer::v15_2_0::list&, unsigned int, unsigned long)+0xcf1) [0x56312d718fa1]
8: (BlueStore::readv(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, interval_set<unsigned long, std::map>&, ceph::buffer::v15_2_0::list&, unsigned int)+0x502) [0x56312d719dd2]
9: (ReplicatedBackend::build_push_op(ObjectRecoveryInfo const&, ObjectRecoveryProgress const&, ObjectRecoveryProgress*, PushOp*, object_stat_sum_t*, bool)+0x3d1) [0x56312d531b91]
10: (ReplicatedBackend::prep_push(std::shared_ptr<ObjectContext>, hobject_t const&, pg_shard_t, eversion_t, interval_set<unsigned long, std::map>&, std::map<hobject_t, interval_set<unsigned long, std::map>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_set<unsigned long, std::map> > > >&, PushOp*, bool, ObcLockManager&&)+0x73d) [0x56312d5340cd]
11: (ReplicatedBackend::prep_push_to_replica(std::shared_ptr<ObjectContext>, hobject_t const&, pg_shard_t, PushOp*, bool)+0x1be) [0x56312d5348ee]
12: (ReplicatedBackend::start_pushes(hobject_t const&, std::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0x30e) [0x56312d5381ce]
13: (ReplicatedBackend::recover_object(hobject_t const&, eversion_t, std::shared_ptr<ObjectContext>, std::shared_ptr<ObjectContext>, PGBackend::RecoveryHandle*)+0x177) [0x56312d5387c7]
14: (PrimaryLogPG::prep_backfill_object_push(hobject_t, eversion_t, std::shared_ptr<ObjectContext>, std::vector<pg_shard_t, std::allocator<pg_shard_t> >, PGBackend::RecoveryHandle*)+0x1c5) [0x56312d25d115]
15: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x6374) [0x56312d2b5104]
16: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0xe1c) [0x56312d2b6ecc]
17: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x25d) [0x56312d0fc8ed]
18: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x56312d413bad]
19: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x8aa) [0x56312d1295da]
20: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x403) [0x56312d8a19e3]
21: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x56312d8a4804]
22: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7fbf0a167609]
23: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Dumping the onode on the primary shows it has 86873 extents:
root@osd2:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0/ '["2.8",{"oid":"rbd_data.25a5bdaf5203b.0000000000000000","key":"","snapid":-2,"hash":1499189544,"max":0,"pool":2,"namespace":"","max":0}]' dump | jq '.onode.extents | length'
86873

The assert will be prevented by https://github.com/ceph/ceph/pull/44065.
The remaining question is why this workload creates so many small extents; shouldn't this be limited by min_alloc_size?
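
As a starting point for that question, the configured allocation sizes can be inspected on a running OSD (illustrative; the value actually in effect is fixed when the OSD is created):

ceph daemon osd.0 config get bluestore_min_alloc_size_ssd
ceph daemon osd.0 config get bluestore_min_alloc_size_hdd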
