Project

General

Profile

Actions

Bug #44427

open

osd: stuck during shutdown

Added by Sage Weil about 4 years ago. Updated about 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2020-03-04T12:10:25.576+0000 1480b700 -1 osd.2 174 *** Got signal Terminated ***
...
2020-03-04T12:10:26.413+0000 1480b700  0 osd.2 174 shutdown
...
2020-03-04T12:10:26.516+0000 1480b700  5 PG::get 0x1e850fc0 tag intptr 2 -> 3
2020-03-04T12:10:26.516+0000 1480b700 10 osd.2 pg_epoch: 174 pg[17.17( empty local-lis/les=165/166 n=0 ec=165/165 lis/c=165/165 les/c/f=166/166/0 sis=165) [2,7] r=0 lpr=165 DELETING crt=0'0 mlcod 0'0 active+clean] on_shutdown
2020-03-04T12:10:26.516+0000 1480b700 20 osd.2 pg_epoch: 174 pg[17.17( empty local-lis/les=165/166 n=0 ec=165/165 lis/c=165/165 les/c/f=166/166/0 sis=165) [2,7] r=0 lpr=165 DELETING crt=0'0 mlcod 0'0 active+clean] prepare_stats_for_publish reporting purged_snaps []
2020-03-04T12:10:26.516+0000 1480b700 15 osd.2 pg_epoch: 174 pg[17.17( empty local-lis/les=165/166 n=0 ec=165/165 lis/c=165/165 les/c/f=166/166/0 sis=165) [2,7] r=0 lpr=165 DELETING crt=0'0 mlcod 0'0 active+clean] publish_stats_to_osd 173:15
2020-03-04T12:10:26.517+0000 1480b700 10 osd.2 pg_epoch: 174 pg[17.17( empty local-lis/les=165/166 n=0 ec=165/165 lis/c=165/165 les/c/f=166/166/0 sis=165) [2,7] r=0 lpr=165 DELETING crt=0'0 mlcod 0'0 active+clean] cancel_copy_ops
2020-03-04T12:10:26.517+0000 1480b700 10 osd.2 pg_epoch: 174 pg[17.17( empty local-lis/les=165/166 n=0 ec=165/165 lis/c=165/165 les/c/f=166/166/0 sis=165) [2,7] r=0 lpr=165 DELETING crt=0'0 mlcod 0'0 active+clean] cancel_flush_ops
2020-03-04T12:10:26.517+0000 1480b700 10 osd.2 pg_epoch: 174 pg[17.17( empty local-lis/les=165/166 n=0 ec=165/165 lis/c=165/165 les/c/f=166/166/0 sis=165) [2,7] r=0 lpr=165 DELETING crt=0'0 mlcod 0'0 active+clean] cancel_proxy_ops
2020-03-04T12:10:26.517+0000 1480b700 10 osd.2 pg_epoch: 174 pg[17.17( empty local-lis/les=165/166 n=0 ec=165/165 lis/c=165/165 les/c/f=166/166/0 sis=165) [2,7] r=0 lpr=165 DELETING crt=0'0 mlcod 0'0 active+clean] cancel_manifest_ops
2020-03-04T12:10:26.517+0000 1480b700 10 osd.2 pg_epoch: 174 pg[17.17( empty local-lis/les=165/166 n=0 ec=165/165 lis/c=165/165 les/c/f=166/166/0 sis=165) [2,7] r=0 lpr=165 DELETING crt=0'0 mlcod 0'0 active+clean] clear_backoffs 
2020-03-04T12:10:26.517+0000 1480b700 20 osd.2 pg_epoch: 174 pg[17.17( empty local-lis/les=165/166 n=0 ec=165/165 lis/c=165/165 les/c/f=166/166/0 sis=165) [2,7] r=0 lpr=165 DELETING crt=0'0 mlcod 0'0 active+clean] exit NotTrimming
2020-03-04T12:10:26.517+0000 1480b700 20 osd.2 pg_epoch: 174 pg[17.17( empty local-lis/les=165/166 n=0 ec=165/165 lis/c=165/165 les/c/f=166/166/0 sis=165) [2,7] r=0 lpr=165 DELETING crt=0'0 mlcod 0'0 active+clean] enter NotTrimming
2020-03-04T12:10:26.517+0000 1480b700 10 osd.2 pg_epoch: 174 pg[17.17( empty local-lis/les=165/166 n=0 ec=165/165 lis/c=165/165 les/c/f=166/166/0 sis=165) [2,7] r=0 lpr=165 DELETING crt=0'0 mlcod 0'0 active+clean] on_change
2020-03-04T12:10:26.518+0000 1480b700 10 osd.2 pg_epoch: 174 pg[17.17( empty local-lis/les=165/166 n=0 ec=165/165 lis/c=165/165 les/c/f=166/166/0 sis=165) [2,7] r=0 lpr=165 DELETING crt=0'0 mlcod 0'0 active+clean] clear_async_reads
2020-03-04T12:10:26.518+0000 1480b700 10 osd.2 pg_epoch: 174 pg[17.17( empty local-lis/les=165/166 n=0 ec=165/165 lis/c=165/165 les/c/f=166/166/0 sis=165) [2,7] r=0 lpr=165 DELETING crt=0'0 mlcod 0'0 active+clean] release_backoffs [17:e8000000::::0,17:f0000000::::head)
2020-03-04T12:10:26.518+0000 1480b700 20 osd.2 pg_epoch: 174 pg[17.17( empty local-lis/les=165/166 n=0 ec=165/165 lis/c=165/165 les/c/f=166/166/0 sis=165) [2,7] r=0 lpr=165 DELETING crt=0'0 mlcod 0'0 active+clean] agent_stop
2020-03-04T12:10:26.518+0000 1480b700 10 osd.2 pg_epoch: 174 pg[17.17( empty local-lis/les=165/166 n=0 ec=165/165 lis/c=165/165 les/c/f=166/166/0 sis=165) [2,7] r=0 lpr=165 DELETING crt=0'0 mlcod 0'0 active+clean] cancel_recovery
2020-03-04T12:10:26.518+0000 1480b700 10 osd.2 pg_epoch: 174 pg[17.17( empty local-lis/les=165/166 n=0 ec=165/165 lis/c=165/165 les/c/f=166/166/0 sis=165) [2,7] r=0 lpr=165 DELETING crt=0'0 mlcod 0'0 active+clean] clear_recovery_state
2020-03-04T12:10:26.518+0000 1480b700 10 osd.2 174 clear_ready_to_merge 17.17
2020-03-04T12:10:26.518+0000 1480b700  5 PG::put 0x1e850fc0 tag intptr 3 -> 2
2020-03-04T12:10:26.518+0000 1480b700  5 PG::put 0x15b67b20 tag intptr 2 -> 1
2020-03-04T12:10:26.519+0000 1480b700  5 PG::put 0x15954000 tag intptr 2 -> 1
2020-03-04T12:10:26.519+0000 1480b700  5 PG::put 0x20bc0ec0 tag intptr 2 -> 1
2020-03-04T12:10:26.519+0000 1480b700  5 PG::put 0x21126190 tag intptr 2 -> 1
2020-03-04T12:10:26.519+0000 1480b700  5 PG::put 0x15941f90 tag intptr 2 -> 1
2020-03-04T12:10:26.519+0000 1480b700  5 PG::put 0x20fc5a70 tag intptr 2 -> 1
2020-03-04T12:10:26.519+0000 1480b700  5 PG::put 0x2111f9d0 tag intptr 2 -> 1
2020-03-04T12:10:26.519+0000 1480b700  5 PG::put 0x265a3180 tag intptr 2 -> 1
2020-03-04T12:10:26.519+0000 1480b700  5 PG::put 0x266e7f70 tag intptr 2 -> 1
2020-03-04T12:10:26.519+0000 1480b700  5 PG::put 0x1e4e3af0 tag intptr 2 -> 1
2020-03-04T12:10:26.519+0000 1480b700  5 PG::put 0x20bffcf0 tag intptr 2 -> 1
2020-03-04T12:10:26.519+0000 1480b700  5 PG::put 0x1e850fc0 tag intptr 2 -> 1

and then that thread is stuck. probably here?
  // drain op queue again (in case PGs requeued something)
  op_shardedwq.drain();

it's under valgrind, so gdb attach doesn't really work.

/a/sage-2020-03-03_19:20:19-rados-wip-sage2-testing-2020-03-03-0959-distro-basic-smithi/4821329

Actions #1

Updated by Sage Weil about 4 years ago

/a/sage-2020-03-09_14:07:51-rados-wip-sage4-testing-2020-03-09-0634-distro-basic-smithi/4841228

Actions

Also available in: Atom PDF