Bug #49892: rgw_orphan_list.sh causing a crash in the OSD - rgw - Ceph

Bug #49892

closed

rgw_orphan_list.sh causing a crash in the OSD

Added by Ali Maredia about 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Assignee:
Target version: Ceph - v17.0.0
% Done:
Source:
Tags:
Backport: pacific
Regression:
Severity: 2 - major
Reviewed:
Affected Versions: Ceph - v17.0.0
ceph-qa-suite:
Pull request ID: 40553
Crash signature (v1):
Crash signature (v2):

Description

Running the workunit rgw_orphan_list.sh is causing OSDs to crash.

Link to log:
http://qa-proxy.ceph.com/teuthology/amaredia-2021-03-18_16:35:47-rgw:tools-master-distro-basic-smithi/5977705/teuthology.log

Notable lines from log:
2021-03-18T16:58:58.319 INFO:tasks.workunit.client.0.smithi083.stderr:########################################
2021-03-18T16:58:58.319 INFO:tasks.workunit.client.0.smithi083.stderr:# DO ORPHAN LIST
2021-03-18T16:58:58.320 INFO:tasks.workunit.client.0.smithi083.stderr:
2021-03-18T16:58:58.320 INFO:tasks.workunit.client.0.smithi083.stderr:pool="default.rgw.buckets.data"
2021-03-18T16:58:58.320 INFO:tasks.workunit.client.0.smithi083.stderr:+ pool=default.rgw.buckets.data
2021-03-18T16:58:58.320 INFO:tasks.workunit.client.0.smithi083.stderr:
2021-03-18T16:58:58.320 INFO:tasks.workunit.client.0.smithi083.stderr:rgw-orphan-list $pool
2021-03-18T16:58:58.320 INFO:tasks.workunit.client.0.smithi083.stderr:+ rgw-orphan-list default.rgw.buckets.data
2021-03-18T16:58:58.321 INFO:tasks.workunit.client.0.smithi083.stdout:Pool is "default.rgw.buckets.data".
2021-03-18T16:58:58.321 INFO:tasks.workunit.client.0.smithi083.stdout:Note: output files produced will be tagged with the current timestamp -- 20210318165858.
2021-03-18T16:58:58.321 INFO:tasks.workunit.client.0.smithi083.stdout:running 'rados ls' at Thu Mar 18 16:58:58 UTC 2021
2021-03-18T16:58:58.455 INFO:tasks.workunit.client.0.smithi083.stdout:running 'radosgw-admin bucket radoslist' at Thu Mar 18 16:58:58 UTC 2021
2021-03-18T17:18:28.045 INFO:tasks.ceph.mon.a.smithi083.stderr:2021-03-18T17:18:26.100+0000 7f3e97865700 -1 mon.a@0(leader) e1 get_health_metrics reporting 3 slow ops, oldest is log(1 entries from seq 701 at 2021-03-18T17:17:33.275375+0000)
2021-03-18T17:18:42.412 INFO:tasks.ceph.osd.1.smithi083.stderr:2021-03-18T17:18:40.066+0000 7f9d36cd3700 -1 osd.1 32 heartbeat_check: no reply from 172.21.15.83:6804 osd.0 since back 2021-03-18T17:18:39.271090+0000 front 2021-03-18T17:18:16.950626+0000 (oldest deadline 2021-03-18T17:18:38.436615+0000)
2021-03-18T17:18:42.742 INFO:tasks.ceph.osd.1.smithi083.stderr:2021-03-18T17:18:40.066+0000 7f9d36cd3700 -1 osd.1 32 heartbeat_check: no reply from 172.21.15.83:6812 osd.2 since back 2021-03-18T17:18:39.494709+0000 front 2021-03-18T17:18:16.947020+0000 (oldest deadline 2021-03-18T17:18:38.436615+0000)
2021-03-18T17:18:43.928 INFO:tasks.ceph.osd.2.smithi083.stderr:2021-03-18T17:18:43.194+0000 7fed8e7c7700 -1 osd.2 32 heartbeat_check: no reply from 172.21.15.83:6804 osd.0 since back 2021-03-18T17:18:16.954098+0000 front 2021-03-18T17:18:16.953128+0000 (oldest deadline 2021-03-18T17:18:40.726503+0000)
2021-03-18T17:18:43.929 INFO:tasks.ceph.osd.2.smithi083.stderr:2021-03-18T17:18:43.441+0000 7fed8e7c7700 -1 osd.2 32 heartbeat_check: no reply from 172.21.15.83:6820 osd.1 since back 2021-03-18T17:18:22.060493+0000 front 2021-03-18T17:18:16.954306+0000 (oldest deadline 2021-03-18T17:18:40.726503+0000)
2021-03-18T17:18:53.793 INFO:tasks.ceph.osd.2.smithi083.stderr:2021-03-18T17:18:49.691+0000 7fed8e7c7700 -1 osd.2 32 heartbeat_check: no reply from 172.21.15.83:6804 osd.0 since back 2021-03-18T17:18:47.759699+0000 front 2021-03-18T17:18:47.753397+0000 (oldest deadline 2021-03-18T17:18:43.579652+0000)
2021-03-18T17:18:53.820 INFO:tasks.ceph.osd.2.smithi083.stderr:2021-03-18T17:18:51.379+0000 7fed8e7c7700 -1 osd.2 32 heartbeat_check: no reply from 172.21.15.83:6820 osd.1 since back 2021-03-18T17:18:22.060493+0000 front 2021-03-18T17:18:16.954306+0000 (oldest deadline 2021-03-18T17:18:40.726503+0000)
2021-03-18T17:20:15.246 INFO:tasks.ceph.osd.1.smithi083.stderr:2021-03-18T17:20:10.795+0000 7f9d36cd3700 -1 osd.1 32 heartbeat_check: no reply from 172.21.15.83:6804 osd.0 since back 2021-03-18T17:18:39.271090+0000 front 2021-03-18T17:18:16.950626+0000 (oldest deadline 2021-03-18T17:18:38.436615+0000)
2021-03-18T17:20:15.247 INFO:tasks.ceph.osd.1.smithi083.stderr:2021-03-18T17:20:12.656+0000 7f9d36cd3700 -1 osd.1 32 heartbeat_check: no reply from 172.21.15.83:6812 osd.2 since back 2021-03-18T17:19:28.160527+0000 front 2021-03-18T17:18:16.947020+0000 (oldest deadline 2021-03-18T17:18:38.436615+0000)
2021-03-18T17:20:21.221 INFO:tasks.ceph.osd.0.smithi083.stderr:2021-03-18T17:20:19.136+0000 7fd64b08a700 -1 osd.0 32 heartbeat_check: no reply from 172.21.15.83:6820 osd.1 since back 2021-03-18T17:18:16.958205+0000 front 2021-03-18T17:18:16.950401+0000 (oldest deadline 2021-03-18T17:18:40.726505+0000)
2021-03-18T17:20:25.693 INFO:tasks.ceph.osd.0.smithi083.stderr:2021-03-18T17:20:19.555+0000 7fd64b08a700 -1 osd.0 32 heartbeat_check: no reply from 172.21.15.83:6812 osd.2 since back 2021-03-18T17:18:22.060493+0000 front 2021-03-18T17:18:16.954646+0000 (oldest deadline 2021-03-18T17:18:40.726505+0000)
2021-03-18T17:20:55.474 INFO:tasks.ceph.osd.2.smithi083.stderr:2021-03-18T17:20:51.735+0000 7fed8e7c7700 -1 osd.2 32 heartbeat_check: no reply from 172.21.15.83:6804 osd.0 since back 2021-03-18T17:19:05.977959+0000 front 2021-03-18T17:18:47.753397+0000 (oldest deadline 2021-03-18T17:18:49.741126+0000)
2021-03-18T17:20:55.474 INFO:tasks.ceph.osd.2.smithi083.stderr:2021-03-18T17:20:52.423+0000 7fed8e7c7700 -1 osd.2 32 heartbeat_check: no reply from 172.21.15.83:6820 osd.1 since back 2021-03-18T17:18:22.060493+0000 front 2021-03-18T17:18:16.954306+0000 (oldest deadline 2021-03-18T17:18:40.726503+0000)
2021-03-18T17:21:14.684 INFO:tasks.ceph.osd.1.smithi083.stderr:2021-03-18T17:21:10.962+0000 7f9d36cd3700 -1 osd.1 32 get_health_metrics reporting 13 slow ops, oldest is osd_op(client.4242.0:622254 7.0 7:05bf5b68:::notify.1:head [watch ping cookie 93843840136160] snapc 0=[] ondisk+write+known_if_redirected e32)
2021-03-18T17:23:00.484 INFO:tasks.ceph.osd.2.smithi083.stderr:2021-03-18T17:22:57.756+0000 7fed8e7c7700 -1 osd.2 32 get_health_metrics reporting 11 slow ops, oldest is osd_op(client.4202.0:21877 7.15 7:a93a5511:::notify.2:head [watch ping cookie 93977688257024] snapc 0=[] ondisk+write+known_if_redirected e32)
2021-03-18T17:23:35.601 INFO:tasks.ceph.osd.0.smithi083.stderr:2021-03-18T17:23:31.082+0000 7fd64b08a700 -1 osd.0 32 get_health_metrics reporting 39 slow ops, oldest is osd_op(client.4242.0:622220 7.3 7:c609908c:::notify.5:head [watch ping cookie 93843840145968] snapc 0=[] ondisk+write+known_if_redirected e32)
2021-03-18T17:25:27.082 INFO:tasks.ceph.osd.0.smithi083.stderr:2021-03-18T17:25:21.683+0000 7fd64b08a700 -1 osd.0 32 heartbeat_check: no reply from 172.21.15.83:6820 osd.1 since back 2021-03-18T17:18:16.958205+0000 front 2021-03-18T17:18:16.950401+0000 (oldest deadline 2021-03-18T17:18:40.726505+0000)
2021-03-18T17:25:30.770 INFO:tasks.ceph.osd.0.smithi083.stderr:2021-03-18T17:25:25.814+0000 7fd64b08a700 -1 osd.0 32 heartbeat_check: no reply from 172.21.15.83:6812 osd.2 since back 2021-03-18T17:18:22.060493+0000 front 2021-03-18T17:18:16.954646+0000 (oldest deadline 2021-03-18T17:18:40.726505+0000)
2021-03-18T17:26:22.824 INFO:tasks.ceph.osd.2.smithi083.stderr:*** Caught signal (Aborted) **
2021-03-18T17:26:22.824 INFO:tasks.ceph.osd.2.smithi083.stderr: in thread 7fed74701700 thread_name:tp_osd_tp
2021-03-18T17:26:38.297 INFO:tasks.ceph.osd.1.smithi083.stderr:*** Caught signal (Aborted) **
2021-03-18T17:26:38.297 INFO:tasks.ceph.osd.1.smithi083.stderr: in thread 7f9d19406700 thread_name:tp_osd_tp
2021-03-18T17:27:43.451 INFO:tasks.ceph.osd.2.smithi083.stderr: ceph version 17.0.0-2189-g25bc7023 (25bc7023f0c8949e8cbf9fb35124022f6d4f3fb3) quincy (dev)
2021-03-18T17:27:43.451 INFO:tasks.ceph.osd.2.smithi083.stderr: 1: /lib64/libpthread.so.0(+0x12b20) [0x7fed97e2db20]
2021-03-18T17:27:43.451 INFO:tasks.ceph.osd.2.smithi083.stderr: 2: pthread_kill()
2021-03-18T17:27:43.451 INFO:tasks.ceph.osd.2.smithi083.stderr: 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >)+0x48c) [0x556e49a1a54c]
2021-03-18T17:27:43.451 INFO:tasks.ceph.osd.2.smithi083.stderr: 4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >)+0x23e) [0x556e49a1a93e]
2021-03-18T17:27:43.452 INFO:tasks.ceph.osd.2.smithi083.stderr: 5: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b0) [0x556e49a3aba0]
2021-03-18T17:27:43.452 INFO:tasks.ceph.osd.2.smithi083.stderr: 6: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x556e49a3d854]
2021-03-18T17:27:43.452 INFO:tasks.ceph.osd.2.smithi083.stderr: 7: (Thread::_entry_func(void*)+0xd) [0x556e49a2c6bd]
2021-03-18T17:27:43.452 INFO:tasks.ceph.osd.2.smithi083.stderr: 8: /lib64/libpthread.so.0(+0x814a) [0x7fed97e2314a]
2021-03-18T17:27:43.452 INFO:tasks.ceph.osd.2.smithi083.stderr: 9: clone()
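
To see which OSD stopped hearing from which peer in an excerpt like the one above, the `heartbeat_check` lines can be tallied per (reporter, peer) pair. This is only a rough sketch: the regular expression is inferred from the lines quoted in this report rather than from any documented Ceph log format, and `tally_missed_heartbeats` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern inferred from the heartbeat_check lines quoted in this report
# (an assumption, not a documented format):
#   "osd.<reporter> <epoch> heartbeat_check: no reply from <addr> osd.<peer>"
HEARTBEAT_RE = re.compile(
    r"osd\.(\d+) \d+ heartbeat_check: no reply from \S+ osd\.(\d+)"
)


def tally_missed_heartbeats(lines):
    """Count heartbeat_check complaints per (reporting OSD, silent peer) pair."""
    counts = Counter()
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            reporter, peer = int(match.group(1)), int(match.group(2))
            counts[(reporter, peer)] += 1
    return counts
```

Passing the lines of a saved teuthology.log to `tally_missed_heartbeats` returns a `Counter` keyed by `(reporter, peer)`; lines that do not match the assumed pattern are ignored.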

Related issues 1 (0 open — 1 closed)

Updated by J. Eric Ivancich about 3 years ago

It's interesting that `radosgw-admin bucket radoslist` starts at 2021-03-18T16:58:58.455 and then, about 20 minutes later (2021-03-18T17:18:28.045), the OSD heartbeats start going missing.
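
As a quick check of that interval, the two timestamps quoted from the teuthology log can be subtracted directly. The variable names below are illustrative only.

```python
from datetime import datetime

# Timestamps copied from the teuthology log excerpt in the description.
radoslist_start = datetime.fromisoformat("2021-03-18T16:58:58.455")
first_trouble = datetime.fromisoformat("2021-03-18T17:18:28.045")

gap = first_trouble - radoslist_start
print(f"{gap.total_seconds() / 60:.1f} minutes")  # prints "19.5 minutes"
```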

Updated by J. Eric Ivancich about 3 years ago

Project changed from RADOS to rgw
Status changed from New to Triaged

As a result of bisecting, it appears that this PR is causing this failure: https://github.com/ceph/ceph/pull/39399

There may be a related OSD issue.

Moving this back to RGW. Given that this could be indicative of a deeper issue, perhaps this should be raised to URGENT.

Updated by J. Eric Ivancich about 3 years ago

Status changed from Triaged to Fix Under Review

An issue with rgw-orphan-list was traced to PR https://github.com/ceph/ceph/pull/39399. Digging further, listing a large bucket would never terminate; it would instead loop over the same objects repeatedly.
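
A listing that keeps returning the same objects is the kind of behaviour one would expect from marker-based pagination in which the marker fails to advance. The sketch below is purely illustrative and is not the code from either PR: `fetch_page` is a hypothetical callable returning `(entries, next_marker, truncated)`, and the guard simply refuses to continue when a truncated page hands back the marker that was just used.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical page-fetch signature, for illustration only (not a Ceph API):
# given a marker, return (entries, next_marker, truncated).
FetchPage = Callable[[Optional[str]], Tuple[List[str], Optional[str], bool]]


def list_all(fetch_page: FetchPage) -> Iterator[str]:
    """Yield every entry from a paginated listing, failing fast on a stuck marker."""
    marker: Optional[str] = None
    while True:
        entries, next_marker, truncated = fetch_page(marker)
        yield from entries
        if not truncated:
            return
        # If a truncated page returns no marker, or the marker we just used,
        # the next request would fetch the same objects again, forever.
        if next_marker is None or next_marker == marker:
            raise RuntimeError(f"listing marker did not advance past {marker!r}")
        marker = next_marker
```

This guard only catches a marker that fails to move between two consecutive pages; it would not detect a longer cycle of markers, which would need a set of previously seen markers.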

That issue was addressed by https://github.com/ceph/ceph/pull/40553, which may also address the teuthology testing issue described herein.

We'll see what happens with the teuthology run and go from there.

Updated by J. Eric Ivancich about 3 years ago

Target version set to v17.0.0
Backport set to none

Updated by J. Eric Ivancich about 3 years ago

Pull request ID set to 40553

Updated by J. Eric Ivancich about 3 years ago

Status changed from Fix Under Review to Resolved

Updated by Daniel Gryniewicz over 2 years ago

Backport changed from none to pacific

Updated by Konstantin Shalygin over 2 years ago

Status changed from Resolved to Pending Backport

Updated by Konstantin Shalygin over 2 years ago

Copied to Backport #51751: pacific: rgw_orphan_list.sh causing a crash in the OSD added

Updated by Loïc Dachary over 2 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
