Bug #56503 (open)

Deleting large (~850gb) objects causes OSD to crash

Added by Marcin Gibula almost 2 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After deleting a large S3 object (around 850 GB in size), the OSDs in our cluster started becoming laggy and unresponsive, and eventually hit their suicide timeouts.
We've managed to reproduce this with the following steps (a minimal command sketch follows below):

1. Upload a large object
2. Delete it
3. Run radosgw-admin gc list (or gc process)

After that, the OSDs storing the .rgw.gc pool become unstable and eventually crash.
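A minimal reproduction sketch (the aws CLI and the bucket/object names are placeholders; any S3 client pointed at the RGW endpoint should do):

# 1. Upload a large object (ours was ~850 GB, uploaded as multipart).
aws s3 cp ./large-850g.bin s3://test-bucket/large-850g.bin

# 2. Delete it, which queues its tail objects for RGW garbage collection.
aws s3 rm s3://test-bucket/large-850g.bin

# 3. Enumerate or process the GC queue.
radosgw-admin gc list --include-all
radosgw-admin gc process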

ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7f8daa3d0980]
2: pthread_kill()
3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >)+0x472) [0x55943ed4cb02]
4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >)+0x72) [0x55943ed4d2a2]
5: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x498) [0x55943ed72748]
6: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55943ed75c20]
7: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f8daa3c56db]
8: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
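For reference, the frames above show the internal heartbeat / suicide-timeout check (ceph::HeartbeatMap::_check) firing in an OSD op worker thread rather than an ordinary assert. A possible stop-gap, untested here and with values that are only illustrative, is to raise the op-thread timeouts while the oversized GC object is being drained:

# Stop-gap sketch: raise the OSD op-thread heartbeat/suicide timeouts at runtime.
# Option names as in Pacific; the values are examples, not a recommendation.
ceph config set osd osd_op_thread_timeout 90
ceph config set osd osd_op_thread_suicide_timeout 600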

When running with debug_osd=10, the OSD log fills with large numbers of messages like these:

2022-07-08T11:21:05.647+0200 7f57b18a5700 10 osd.0 pg_epoch: 782299 pg[14.2( v 782289'638107506 (778682'638104180,782289'638107506] local-lis/les=782298/782299 n=928 ec=36/36 lis/c=782298/782286 les/c/f=782299/782287/0 sis=782298) [0,160] r=0 lpr=782298 pi=[782286,782298)/3 crt=782289'638107506 lcod 782274'638107504 mlcod 0'0 active+undersized+degraded mbc={}] read got 1024 / 1024 bytes from obj 14:4ee803b6:::gc.21:head
2022-07-08T11:21:05.651+0200 7f57b18a5700 10 osd.0 pg_epoch: 782299 pg[14.2( v 782289'638107506 (778682'638104180,782289'638107506] local-lis/les=782298/782299 n=928 ec=36/36 lis/c=782298/782286 les/c/f=782299/782287/0 sis=782298) [0,160] r=0 lpr=782298 pi=[782286,782298)/3 crt=782289'638107506 lcod 782274'638107504 mlcod 0'0 active+undersized+degraded mbc={}] do_osd_op 14:4ee803b6:::gc.21:head [sync_read 80662864~1024]
2022-07-08T11:21:05.651+0200 7f57b18a5700 10 osd.0 pg_epoch: 782299 pg[14.2( v 782289'638107506 (778682'638104180,782289'638107506] local-lis/les=782298/782299 n=928 ec=36/36 lis/c=782298/782286 les/c/f=782299/782287/0 sis=782298) [0,160] r=0 lpr=782298 pi=[782286,782298)/3 crt=782289'638107506 lcod 782274'638107504 mlcod 0'0 active+undersized+degraded mbc={}] do_osd_op sync_read 80662864~1024
2022-07-08T11:21:05.651+0200 7f57b18a5700 10 osd.0 pg_epoch: 782299 pg[14.2( v 782289'638107506 (778682'638104180,782289'638107506] local-lis/les=782298/782299 n=928 ec=36/36 lis/c=782298/782286 les/c/f=782299/782287/0 sis=782298) [0,160] r=0 lpr=782298 pi=[782286,782298)/3 crt=782289'638107506 lcod 782274'638107504 mlcod 0'0 active+undersized+degraded mbc={}] read got 1024 / 1024 bytes from obj 14:4ee803b6:::gc.21:head
2022-07-08T11:21:05.651+0200 7f57d08e3700 10 osd.0 782301 internal heartbeat not healthy, dropping ping request
2022-07-08T11:21:05.651+0200 7f57d00e2700 10 osd.0 782301 internal heartbeat not healthy, dropping ping request
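The repeated 1024-byte sync_read calls against gc.21 suggest the GC listing is walking a very large GC queue object in tiny chunks, which matches the deleted ~850 GB object. The size of that object can be checked directly (pool and object name taken from the description and log above; run from any client with rados access):

# Diagnostic sketch: inspect the GC object the OSD keeps re-reading.
rados -p .rgw.gc stat gc.21                    # object size in bytes
rados -p .rgw.gc listomapkeys gc.21 | wc -l    # omap-based GC entries, if this deployment still uses them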
