Bug #49892

rgw_orphan_list.sh causing a crash in the OSD

Added by Ali Maredia 9 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee: -
Target version:
% Done: 0%
Source:
Tags:
Backport: pacific
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Running the workunit rgw_orphan_list.sh is causing OSDs to crash.

Link to log:
http://qa-proxy.ceph.com/teuthology/amaredia-2021-03-18_16:35:47-rgw:tools-master-distro-basic-smithi/5977705/teuthology.log

Notable lines from log:
2021-03-18T16:58:58.319 INFO:tasks.workunit.client.0.smithi083.stderr:########################################
2021-03-18T16:58:58.319 INFO:tasks.workunit.client.0.smithi083.stderr:# DO ORPHAN LIST
2021-03-18T16:58:58.320 INFO:tasks.workunit.client.0.smithi083.stderr:
2021-03-18T16:58:58.320 INFO:tasks.workunit.client.0.smithi083.stderr:pool="default.rgw.buckets.data"
2021-03-18T16:58:58.320 INFO:tasks.workunit.client.0.smithi083.stderr:+ pool=default.rgw.buckets.data
2021-03-18T16:58:58.320 INFO:tasks.workunit.client.0.smithi083.stderr:
2021-03-18T16:58:58.320 INFO:tasks.workunit.client.0.smithi083.stderr:rgw-orphan-list $pool
2021-03-18T16:58:58.320 INFO:tasks.workunit.client.0.smithi083.stderr:+ rgw-orphan-list default.rgw.buckets.data
2021-03-18T16:58:58.321 INFO:tasks.workunit.client.0.smithi083.stdout:Pool is "default.rgw.buckets.data".
2021-03-18T16:58:58.321 INFO:tasks.workunit.client.0.smithi083.stdout:Note: output files produced will be tagged with the current timestamp -- 20210318165858.
2021-03-18T16:58:58.321 INFO:tasks.workunit.client.0.smithi083.stdout:running 'rados ls' at Thu Mar 18 16:58:58 UTC 2021
2021-03-18T16:58:58.455 INFO:tasks.workunit.client.0.smithi083.stdout:running 'radosgw-admin bucket radoslist' at Thu Mar 18 16:58:58 UTC 2021
2021-03-18T17:18:28.045 INFO:tasks.ceph.mon.a.smithi083.stderr:2021-03-18T17:18:26.100+0000 7f3e97865700 -1 mon.a@0(leader) e1 get_health_metrics reporting 3 slow ops, oldest is log(1 entries from seq 701 at 2021-03-18T17:17:33.275375+0000)
2021-03-18T17:18:42.412 INFO:tasks.ceph.osd.1.smithi083.stderr:2021-03-18T17:18:40.066+0000 7f9d36cd3700 -1 osd.1 32 heartbeat_check: no reply from 172.21.15.83:6804 osd.0 since back 2021-03-18T17:18:39.271090+0000 front 2021-03-18T17:18:16.950626+0000 (oldest deadline 2021-03-18T17:18:38.436615+0000)
2021-03-18T17:18:42.742 INFO:tasks.ceph.osd.1.smithi083.stderr:2021-03-18T17:18:40.066+0000 7f9d36cd3700 -1 osd.1 32 heartbeat_check: no reply from 172.21.15.83:6812 osd.2 since back 2021-03-18T17:18:39.494709+0000 front 2021-03-18T17:18:16.947020+0000 (oldest deadline 2021-03-18T17:18:38.436615+0000)
2021-03-18T17:18:43.928 INFO:tasks.ceph.osd.2.smithi083.stderr:2021-03-18T17:18:43.194+0000 7fed8e7c7700 -1 osd.2 32 heartbeat_check: no reply from 172.21.15.83:6804 osd.0 since back 2021-03-18T17:18:16.954098+0000 front 2021-03-18T17:18:16.953128+0000 (oldest deadline 2021-03-18T17:18:40.726503+0000)
2021-03-18T17:18:43.929 INFO:tasks.ceph.osd.2.smithi083.stderr:2021-03-18T17:18:43.441+0000 7fed8e7c7700 -1 osd.2 32 heartbeat_check: no reply from 172.21.15.83:6820 osd.1 since back 2021-03-18T17:18:22.060493+0000 front 2021-03-18T17:18:16.954306+0000 (oldest deadline 2021-03-18T17:18:40.726503+0000)
2021-03-18T17:18:53.793 INFO:tasks.ceph.osd.2.smithi083.stderr:2021-03-18T17:18:49.691+0000 7fed8e7c7700 -1 osd.2 32 heartbeat_check: no reply from 172.21.15.83:6804 osd.0 since back 2021-03-18T17:18:47.759699+0000 front 2021-03-18T17:18:47.753397+0000 (oldest deadline 2021-03-18T17:18:43.579652+0000)
2021-03-18T17:18:53.820 INFO:tasks.ceph.osd.2.smithi083.stderr:2021-03-18T17:18:51.379+0000 7fed8e7c7700 -1 osd.2 32 heartbeat_check: no reply from 172.21.15.83:6820 osd.1 since back 2021-03-18T17:18:22.060493+0000 front 2021-03-18T17:18:16.954306+0000 (oldest deadline 2021-03-18T17:18:40.726503+0000)
2021-03-18T17:20:15.246 INFO:tasks.ceph.osd.1.smithi083.stderr:2021-03-18T17:20:10.795+0000 7f9d36cd3700 -1 osd.1 32 heartbeat_check: no reply from 172.21.15.83:6804 osd.0 since back 2021-03-18T17:18:39.271090+0000 front 2021-03-18T17:18:16.950626+0000 (oldest deadline 2021-03-18T17:18:38.436615+0000)
2021-03-18T17:20:15.247 INFO:tasks.ceph.osd.1.smithi083.stderr:2021-03-18T17:20:12.656+0000 7f9d36cd3700 -1 osd.1 32 heartbeat_check: no reply from 172.21.15.83:6812 osd.2 since back 2021-03-18T17:19:28.160527+0000 front 2021-03-18T17:18:16.947020+0000 (oldest deadline 2021-03-18T17:18:38.436615+0000)
2021-03-18T17:20:21.221 INFO:tasks.ceph.osd.0.smithi083.stderr:2021-03-18T17:20:19.136+0000 7fd64b08a700 -1 osd.0 32 heartbeat_check: no reply from 172.21.15.83:6820 osd.1 since back 2021-03-18T17:18:16.958205+0000 front 2021-03-18T17:18:16.950401+0000 (oldest deadline 2021-03-18T17:18:40.726505+0000)
2021-03-18T17:20:25.693 INFO:tasks.ceph.osd.0.smithi083.stderr:2021-03-18T17:20:19.555+0000 7fd64b08a700 -1 osd.0 32 heartbeat_check: no reply from 172.21.15.83:6812 osd.2 since back 2021-03-18T17:18:22.060493+0000 front 2021-03-18T17:18:16.954646+0000 (oldest deadline 2021-03-18T17:18:40.726505+0000)
2021-03-18T17:20:55.474 INFO:tasks.ceph.osd.2.smithi083.stderr:2021-03-18T17:20:51.735+0000 7fed8e7c7700 -1 osd.2 32 heartbeat_check: no reply from 172.21.15.83:6804 osd.0 since back 2021-03-18T17:19:05.977959+0000 front 2021-03-18T17:18:47.753397+0000 (oldest deadline 2021-03-18T17:18:49.741126+0000)
2021-03-18T17:20:55.474 INFO:tasks.ceph.osd.2.smithi083.stderr:2021-03-18T17:20:52.423+0000 7fed8e7c7700 -1 osd.2 32 heartbeat_check: no reply from 172.21.15.83:6820 osd.1 since back 2021-03-18T17:18:22.060493+0000 front 2021-03-18T17:18:16.954306+0000 (oldest deadline 2021-03-18T17:18:40.726503+0000)
2021-03-18T17:21:14.684 INFO:tasks.ceph.osd.1.smithi083.stderr:2021-03-18T17:21:10.962+0000 7f9d36cd3700 -1 osd.1 32 get_health_metrics reporting 13 slow ops, oldest is osd_op(client.4242.0:622254 7.0 7:05bf5b68:::notify.1:head [watch ping cookie 93843840136160] snapc 0=[] ondisk+write+known_if_redirected e32)
2021-03-18T17:23:00.484 INFO:tasks.ceph.osd.2.smithi083.stderr:2021-03-18T17:22:57.756+0000 7fed8e7c7700 -1 osd.2 32 get_health_metrics reporting 11 slow ops, oldest is osd_op(client.4202.0:21877 7.15 7:a93a5511:::notify.2:head [watch ping cookie 93977688257024] snapc 0=[] ondisk+write+known_if_redirected e32)
2021-03-18T17:23:35.601 INFO:tasks.ceph.osd.0.smithi083.stderr:2021-03-18T17:23:31.082+0000 7fd64b08a700 -1 osd.0 32 get_health_metrics reporting 39 slow ops, oldest is osd_op(client.4242.0:622220 7.3 7:c609908c:::notify.5:head [watch ping cookie 93843840145968] snapc 0=[] ondisk+write+known_if_redirected e32)
2021-03-18T17:25:27.082 INFO:tasks.ceph.osd.0.smithi083.stderr:2021-03-18T17:25:21.683+0000 7fd64b08a700 -1 osd.0 32 heartbeat_check: no reply from 172.21.15.83:6820 osd.1 since back 2021-03-18T17:18:16.958205+0000 front 2021-03-18T17:18:16.950401+0000 (oldest deadline 2021-03-18T17:18:40.726505+0000)
2021-03-18T17:25:30.770 INFO:tasks.ceph.osd.0.smithi083.stderr:2021-03-18T17:25:25.814+0000 7fd64b08a700 -1 osd.0 32 heartbeat_check: no reply from 172.21.15.83:6812 osd.2 since back 2021-03-18T17:18:22.060493+0000 front 2021-03-18T17:18:16.954646+0000 (oldest deadline 2021-03-18T17:18:40.726505+0000)
2021-03-18T17:26:22.824 INFO:tasks.ceph.osd.2.smithi083.stderr:*** Caught signal (Aborted) **
2021-03-18T17:26:22.824 INFO:tasks.ceph.osd.2.smithi083.stderr: in thread 7fed74701700 thread_name:tp_osd_tp
2021-03-18T17:26:38.297 INFO:tasks.ceph.osd.1.smithi083.stderr:*** Caught signal (Aborted) **
2021-03-18T17:26:38.297 INFO:tasks.ceph.osd.1.smithi083.stderr: in thread 7f9d19406700 thread_name:tp_osd_tp
2021-03-18T17:27:43.451 INFO:tasks.ceph.osd.2.smithi083.stderr: ceph version 17.0.0-2189-g25bc7023 (25bc7023f0c8949e8cbf9fb35124022f6d4f3fb3) quincy (dev)
2021-03-18T17:27:43.451 INFO:tasks.ceph.osd.2.smithi083.stderr: 1: /lib64/libpthread.so.0(+0x12b20) [0x7fed97e2db20]
2021-03-18T17:27:43.451 INFO:tasks.ceph.osd.2.smithi083.stderr: 2: pthread_kill()
2021-03-18T17:27:43.451 INFO:tasks.ceph.osd.2.smithi083.stderr: 3: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >)+0x48c) [0x556e49a1a54c]
2021-03-18T17:27:43.451 INFO:tasks.ceph.osd.2.smithi083.stderr: 4: (ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >)+0x23e) [0x556e49a1a93e]
2021-03-18T17:27:43.452 INFO:tasks.ceph.osd.2.smithi083.stderr: 5: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b0) [0x556e49a3aba0]
2021-03-18T17:27:43.452 INFO:tasks.ceph.osd.2.smithi083.stderr: 6: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x556e49a3d854]
2021-03-18T17:27:43.452 INFO:tasks.ceph.osd.2.smithi083.stderr: 7: (Thread::_entry_func(void*)+0xd) [0x556e49a2c6bd]
2021-03-18T17:27:43.452 INFO:tasks.ceph.osd.2.smithi083.stderr: 8: /lib64/libpthread.so.0(+0x814a) [0x7fed97e2314a]
2021-03-18T17:27:43.452 INFO:tasks.ceph.osd.2.smithi083.stderr: 9: clone()

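To see at a glance which OSD is reporting which peer as unresponsive, the heartbeat_check lines above can be tallied with a short script. This is only a sketch for reading the excerpt: the regular expression and the function name `count_missed_heartbeats` are illustrative and not part of teuthology or Ceph, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches the OSD-side part of a heartbeat_check line, e.g.
# "osd.1 32 heartbeat_check: no reply from 172.21.15.83:6804 osd.0 since ..."
# The pattern is an assumption based on the lines quoted in this report.
HEARTBEAT_RE = re.compile(
    r"(?P<reporter>osd\.\d+) \d+ heartbeat_check: "
    r"no reply from \S+ (?P<peer>osd\.\d+)"
)


def count_missed_heartbeats(lines):
    """Count heartbeat_check reports, keyed by (reporting OSD, silent peer)."""
    counts = Counter()
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            counts[(match["reporter"], match["peer"])] += 1
    return counts


sample = [
    "2021-03-18T17:18:42.412 INFO:tasks.ceph.osd.1.smithi083.stderr:"
    "2021-03-18T17:18:40.066+0000 7f9d36cd3700 -1 osd.1 32 heartbeat_check: "
    "no reply from 172.21.15.83:6804 osd.0 since back "
    "2021-03-18T17:18:39.271090+0000 front 2021-03-18T17:18:16.950626+0000 "
    "(oldest deadline 2021-03-18T17:18:38.436615+0000)",
    "2021-03-18T17:18:43.928 INFO:tasks.ceph.osd.2.smithi083.stderr:"
    "2021-03-18T17:18:43.194+0000 7fed8e7c7700 -1 osd.2 32 heartbeat_check: "
    "no reply from 172.21.15.83:6804 osd.0 since back "
    "2021-03-18T17:18:16.954098+0000 front 2021-03-18T17:18:16.953128+0000 "
    "(oldest deadline 2021-03-18T17:18:40.726503+0000)",
    # Not a heartbeat_check line, so it is ignored by the tally.
    "2021-03-18T17:26:22.824 INFO:tasks.ceph.osd.2.smithi083.stderr:"
    " in thread 7fed74701700 thread_name:tp_osd_tp",
]

for (reporter, peer), n in sorted(count_missed_heartbeats(sample).items()):
    print(f"{reporter} reported no reply from {peer}: {n}")
```

Run against the full teuthology.log, the same function would show whether every OSD is missing replies from every other OSD, as the excerpt suggests.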

Related issues

Copied to rgw - Backport #51751: pacific: rgw_orphan_list.sh causing a crash in the OSD (Resolved)

History

#1 Updated by J. Eric Ivancich 9 months ago

It's interesting that `radosgw-admin bucket radoslist` starts at 2021-03-18T16:58:58.455 and then, about 20 minutes later (2021-03-18T17:18:28.045), the OSD heartbeats go missing.
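For reference, the gap between those two teuthology timestamps can be computed directly. The snippet below uses only the two timestamps quoted in this comment; the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Teuthology timestamps as quoted in the comment above.
FMT = "%Y-%m-%dT%H:%M:%S.%f"
radoslist_start = datetime.strptime("2021-03-18T16:58:58.455", FMT)
first_trouble = datetime.strptime("2021-03-18T17:18:28.045", FMT)

gap = first_trouble - radoslist_start
print(gap)  # 0:19:29.590000
```

So the cluster ran for roughly 19.5 minutes between the start of `radosgw-admin bucket radoslist` and the first sign of trouble in the log.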

#2 Updated by J. Eric Ivancich 8 months ago

  • Project changed from RADOS to rgw
  • Status changed from New to Triaged

As a result of bisecting, it appears that this PR is causing this failure: https://github.com/ceph/ceph/pull/39399

There may be a related OSD issue.

Moving this back to RGW. Given that this could be indicative of a deeper issue, perhaps this should be raised to URGENT.

#3 Updated by J. Eric Ivancich 8 months ago

  • Status changed from Triaged to Fix Under Review

An issue with rgw-orphan-list was tracked to PR https://github.com/ceph/ceph/pull/39399. Digging further, listing a large bucket would not terminate and would instead loop over the same objects repeatedly.

That issue was addressed by https://github.com/ceph/ceph/pull/40553, which may also address the teuthology testing issue described here.

We'll see what happens with the teuthology run and go from there.
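The failure mode described above, a listing that never terminates because it keeps returning the same objects, is typical of marker-based pagination where the marker stops advancing. The sketch below illustrates that general pattern and one defensive check. It is not the RGW code and not the change made in PR 40553; `fetch_page`, `make_pager`, `stuck_pager`, and `list_all` are hypothetical names invented for this illustration.

```python
def list_all(fetch_page, max_pages=10_000):
    """Drain a marker-paginated listing.

    fetch_page(marker) is a hypothetical callable returning
    (entries, next_marker, truncated). The loop fails loudly if the
    marker does not move forward, instead of spinning forever.
    """
    marker = ""
    results = []
    for _ in range(max_pages):
        entries, next_marker, truncated = fetch_page(marker)
        results.extend(entries)
        if not truncated:
            return results
        if next_marker <= marker:
            raise RuntimeError(
                f"listing marker did not advance past {marker!r}"
            )
        marker = next_marker
    raise RuntimeError("listing exceeded max_pages")


def make_pager(keys, page_size):
    """Build a well-behaved pager over a sorted list of keys."""
    def fetch_page(marker):
        remaining = [k for k in keys if k > marker]
        page = remaining[:page_size]
        truncated = len(remaining) > page_size
        next_marker = page[-1] if page else marker
        return page, next_marker, truncated
    return fetch_page


def stuck_pager(marker):
    """A broken pager: always truncated, marker never advances."""
    return ["obj1", "obj2"], "", True


print(list_all(make_pager(["a", "b", "c", "d", "e"], 2)))

try:
    list_all(stuck_pager)
except RuntimeError as err:
    print("detected:", err)
```

Without the marker check, a caller of `stuck_pager` would keep re-reading the same objects indefinitely, which is consistent with the long-running listing seen in the log.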

#4 Updated by J. Eric Ivancich 8 months ago

  • Target version set to v17.0.0
  • Backport set to none

#5 Updated by J. Eric Ivancich 8 months ago

  • Pull request ID set to 40553

#6 Updated by J. Eric Ivancich 8 months ago

  • Status changed from Fix Under Review to Resolved

#7 Updated by Daniel Gryniewicz 4 months ago

  • Backport changed from none to pacific

#8 Updated by Konstantin Shalygin 4 months ago

  • Status changed from Resolved to Pending Backport

#9 Updated by Konstantin Shalygin 4 months ago

  • Copied to Backport #51751: pacific: rgw_orphan_list.sh causing a crash in the OSD added

#10 Updated by Loïc Dachary 3 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
