Bug #56503 (open)

Deleting large (~850gb) objects causes OSD to crash

Added by Marcin Gibula almost 2 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After deleting a large S3 object (around 850 GB in size), the OSDs in our cluster started becoming laggy and unresponsive, and eventually hit their suicide timeouts.
We've managed to reproduce this with the following steps (a minimal command sketch follows below):

1. Upload a large object
2. Delete it
3. Run radosgw-admin gc list (or gc process)

After that, the OSDs storing the .rgw.gc pool become unstable and eventually crash.
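A minimal reproduction sketch (the aws CLI and the bucket/object names are placeholders; any S3 client pointed at the RGW endpoint should do):

# 1. Upload a large object (ours was ~850 GB, uploaded as multipart).
aws s3 cp ./large-850g.bin s3://test-bucket/large-850g.bin

# 2. Delete it, which queues its tail objects for RGW garbage collection.
aws s3 rm s3://test-bucket/large-850g.bin

# 3. Enumerate or process the GC queue.
radosgw-admin gc list --include-all
radosgw-admin gc process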

ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7f8daa3d0980]
2: pthread_kill()
3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >)+0x472) [0x55943ed4cb02]
4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >)+0x72) [0x55943ed4d2a2]
5: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x498) [0x55943ed72748]
6: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55943ed75c20]
7: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f8daa3c56db]
8: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
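For reference, the frames above show the internal heartbeat / suicide-timeout check (ceph::HeartbeatMap::_check) firing in an OSD op worker thread rather than an ordinary assert. A possible stop-gap, untested here and with values that are only illustrative, is to raise the op-thread timeouts while the oversized GC object is being drained:

# Stop-gap sketch: raise the OSD op-thread heartbeat/suicide timeouts at runtime.
# Option names as in Pacific; the values are examples, not a recommendation.
ceph config set osd osd_op_thread_timeout 90
ceph config set osd osd_op_thread_suicide_timeout 600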

When running with debug_osd=10, the OSD log fills with large numbers of messages like these:

2022-07-08T11:21:05.647+0200 7f57b18a5700 10 osd.0 pg_epoch: 782299 pg[14.2( v 782289'638107506 (778682'638104180,782289'638107506] local-lis/les=782298/782299 n=928 ec=36/36 lis/c=782298/782286 les/c/f=782299/782287/0 sis=782298) [0,160] r=0 lpr=782298 pi=[782286,782298)/3 crt=782289'638107506 lcod 782274'638107504 mlcod 0'0 active+undersized+degraded mbc={}] read got 1024 / 1024 bytes from obj 14:4ee803b6:::gc.21:head
2022-07-08T11:21:05.651+0200 7f57b18a5700 10 osd.0 pg_epoch: 782299 pg[14.2( v 782289'638107506 (778682'638104180,782289'638107506] local-lis/les=782298/782299 n=928 ec=36/36 lis/c=782298/782286 les/c/f=782299/782287/0 sis=782298) [0,160] r=0 lpr=782298 pi=[782286,782298)/3 crt=782289'638107506 lcod 782274'638107504 mlcod 0'0 active+undersized+degraded mbc={}] do_osd_op 14:4ee803b6:::gc.21:head [sync_read 80662864~1024]
2022-07-08T11:21:05.651+0200 7f57b18a5700 10 osd.0 pg_epoch: 782299 pg[14.2( v 782289'638107506 (778682'638104180,782289'638107506] local-lis/les=782298/782299 n=928 ec=36/36 lis/c=782298/782286 les/c/f=782299/782287/0 sis=782298) [0,160] r=0 lpr=782298 pi=[782286,782298)/3 crt=782289'638107506 lcod 782274'638107504 mlcod 0'0 active+undersized+degraded mbc={}] do_osd_op sync_read 80662864~1024
2022-07-08T11:21:05.651+0200 7f57b18a5700 10 osd.0 pg_epoch: 782299 pg[14.2( v 782289'638107506 (778682'638104180,782289'638107506] local-lis/les=782298/782299 n=928 ec=36/36 lis/c=782298/782286 les/c/f=782299/782287/0 sis=782298) [0,160] r=0 lpr=782298 pi=[782286,782298)/3 crt=782289'638107506 lcod 782274'638107504 mlcod 0'0 active+undersized+degraded mbc={}] read got 1024 / 1024 bytes from obj 14:4ee803b6:::gc.21:head
2022-07-08T11:21:05.651+0200 7f57d08e3700 10 osd.0 782301 internal heartbeat not healthy, dropping ping request
2022-07-08T11:21:05.651+0200 7f57d00e2700 10 osd.0 782301 internal heartbeat not healthy, dropping ping request
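The repeated 1024-byte sync_read calls against gc.21 suggest the GC listing is walking a very large GC queue object in tiny chunks, which matches the deleted ~850 GB object. The size of that object can be checked directly (pool and object name taken from the description and log above; run from any client with rados access):

# Diagnostic sketch: inspect the GC object the OSD keeps re-reading.
rados -p .rgw.gc stat gc.21                    # object size in bytes
rados -p .rgw.gc listomapkeys gc.21 | wc -l    # omap-based GC entries, if this deployment still uses them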
