Project

General

Profile

Bug #23195

Read operations segfaulting multiple OSDs

Added by Paul Emmerich about 1 year ago. Updated 11 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Correctness/Safety
Target version:
-
Start date:
03/02/2018
Due date:
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:

Description

I'm seeing some OSDs crashing at the same time with (mostly) the same error message related to a reading an erasure coded object.

The crashes occur on different OSDs and the request in the log refers to different PGs, so not a problem specific with some PG or OSD.
However, I believe all crashing PGs were already in a degraded state before the crash.

Here's a log filtered to the relevant parts.

  -426> 2018-03-01 21:08:48.597788 7f580446d700 20 osd.121 op_wq(2) _process empty q, waiting
  -409> 2018-03-01 21:08:48.598627 7f580446d700 20 osd.121 op_wq(2) _process 42.12s0 item PGQueueable(0x561e9fe1ec40 prio 127 cost 0 e15989) queued
  -408> 2018-03-01 21:08:48.598636 7f580446d700 20 osd.121 op_wq(2) _process 42.12s0 item PGQueueable(0x561e9fe1ec40 prio 127 cost 0 e15989) pg 0x561f7ff27000
  -407> 2018-03-01 21:08:48.598640 7f580446d700 10 osd.121 15989 dequeue_op 0x561e9fe1ec40 prio 127 cost 0 latency 0.000586 MOSDECSubOpReadReply(42.12s0 15989/15984 ECSubReadReply(tid=646841, attrs_read=0)) v2 pg pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling]
  -406> 2018-03-01 21:08:48.598673 7f580446d700 20 osd.121 15989 share_map osd.114 172.30.5.34:6800/1299971 15989
  -405> 2018-03-01 21:08:48.598679 7f580446d700 20 osd.121 15989 should_share_map osd.114 172.30.5.34:6800/1299971 15989
  -404> 2018-03-01 21:08:48.598688 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] _handle_message: MOSDECSubOpReadReply(42.12s0 15989/15984 ECSubReadReply(tid=646841, attrs_read=0)) v2
  -403> 2018-03-01 21:08:48.598719 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply: reply ECSubReadReply(tid=646841, attrs_read=0)
  -402> 2018-03-01 21:08:48.598750 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=2
  -401> 2018-03-01 21:08:48.598776 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=9
  -400> 2018-03-01 21:08:48.598803 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=3
  -399> 2018-03-01 21:08:48.598829 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=10
  -398> 2018-03-01 21:08:48.598856 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=4
  -397> 2018-03-01 21:08:48.598883 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=1
  -396> 2018-03-01 21:08:48.598924 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=5
  -395> 2018-03-01 21:08:48.598950 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=0
  -394> 2018-03-01 21:08:48.598976 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=7
  -393> 2018-03-01 21:08:48.599002 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=11
  -392> 2018-03-01 21:08:48.599028 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=6
  -391> 2018-03-01 21:08:48.599066 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=5
  -390> 2018-03-01 21:08:48.599089 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply minimum_to_decode failed
  -389> 2018-03-01 21:08:48.599117 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] send_all_remaining_reads have/error shards=5
  -388> 2018-03-01 21:08:48.599142 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 114(2)
  -387> 2018-03-01 21:08:48.599172 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 116(9)
  -386> 2018-03-01 21:08:48.599199 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 117(10)
  -385> 2018-03-01 21:08:48.599226 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 118(4)
  -384> 2018-03-01 21:08:48.599253 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 119(1)
  -383> 2018-03-01 21:08:48.599281 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 120(5)
  -382> 2018-03-01 21:08:48.599310 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 121(0)
  -381> 2018-03-01 21:08:48.599336 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 122(7)
  -380> 2018-03-01 21:08:48.599364 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 124(11)
  -379> 2018-03-01 21:08:48.599391 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 128(6)
  -378> 2018-03-01 21:08:48.599419 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking backfill 117(3)
  -377> 2018-03-01 21:08:48.599466 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] send_all_remaining_reads Read remaining shards 114(2),116(9),117(3),117(10),118(4),119(1),121(0),122(7),124(11),128(6)
     0> 2018-03-01 21:08:48.606605 7f580446d700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f580446d700 thread_name:tp_osd_tp

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (()+0xa74234) [0x561e29afb234]
 2: (()+0x11390) [0x7f582200a390]
 3: (std::__cxx11::list<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>, std::allocator<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> > >::list(std::__cxx11::list<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>, std::allocator<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> > > const&)+0x44) [0x561e2987dab4]
 4: (ECBackend::send_all_remaining_reads(hobject_t const&, ECBackend::ReadOp&)+0x308) [0x561e29865378]
 5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0x11bb) [0x561e29866aeb]
 6: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x1c4) [0x561e298791b4]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x561e2975bca0]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x543) [0x561e296c09d3]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x561e2953a3b9]
 10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x561e297dd047]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x561e295629ae]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x561e29b43664]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x561e29b466a0]
 14: (()+0x76ba) [0x7f58220006ba]
 15: (clone()+0x6d) [0x7f582107741d]

(Yes, that EC profile has a size of 12 and yes, that's probably too much here).

This happenes every 10-30 Minutes on around 4 different OSDs. We have tried to enable fast_read on the ec pool and that seems to cause fewer OSDs to crash, but that might be a coincidence.

I've also observed a second crash on the primary OSD of a PG, it seems to be quite rare. I've only seen it twice while the others occur all the time.

  -82> 2018-03-01 20:32:00.457043 7fe3a7993700 20 osd.132 op_wq(2) _process empty q, waiting
   -60> 2018-03-01 20:32:00.457288 7fe3a7993700 20 osd.132 op_wq(2) _process 42.42s0 item PGQueueable(0x561aedd29640 prio 127 cost 0 e15964) queued
   -59> 2018-03-01 20:32:00.457294 7fe3a7993700 20 osd.132 op_wq(2) _process 42.42s0 item PGQueueable(0x561aedd29640 prio 127 cost 0 e15964) pg 0x561927d22000
   -57> 2018-03-01 20:32:00.457298 7fe3a7993700 10 osd.132 15964 dequeue_op 0x561aedd29640 prio 127 cost 0 latency 0.000824 MOSDECSubOpReadReply(42.42s0 15964/15923 ECSubReadReply(tid=399089, attrs_read=0)) v2 pg pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped]
   -56> 2018-03-01 20:32:00.457319 7fe3a7993700 20 osd.132 15964 share_map osd.124 172.30.5.38:6802/720160 15964
   -54> 2018-03-01 20:32:00.457341 7fe3a7993700 20 osd.132 15964 should_share_map osd.124 172.30.5.38:6802/720160 15964
   -51> 2018-03-01 20:32:00.457356 7fe3a7993700 10 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] _handle_message: MOSDECSubOpReadReply(42.42s0 15964/15923 ECSubReadReply(tid=399089, attrs_read=0)) v2
   -50> 2018-03-01 20:32:00.457384 7fe3a7993700 10 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply: reply ECSubReadReply(tid=399089, attrs_read=0)
   -48> 2018-03-01 20:32:00.457413 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=6
   -46> 2018-03-01 20:32:00.457433 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=9
   -44> 2018-03-01 20:32:00.457448 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=8
   -42> 2018-03-01 20:32:00.457467 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=10
   -41> 2018-03-01 20:32:00.457483 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=1
   -39> 2018-03-01 20:32:00.457503 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=5
   -37> 2018-03-01 20:32:00.457518 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=11
   -36> 2018-03-01 20:32:00.457535 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=4
   -34> 2018-03-01 20:32:00.457558 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=7
   -32> 2018-03-01 20:32:00.457573 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=2
   -31> 2018-03-01 20:32:00.457589 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=3
   -29> 2018-03-01 20:32:00.457604 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=0
   -26> 2018-03-01 20:32:00.457645 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=6
   -25> 2018-03-01 20:32:00.457661 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=9
   -24> 2018-03-01 20:32:00.457676 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=10
   -23> 2018-03-01 20:32:00.457694 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=1
   -22> 2018-03-01 20:32:00.457712 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=5
   -20> 2018-03-01 20:32:00.457731 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=4
   -19> 2018-03-01 20:32:00.457757 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=7
   -18> 2018-03-01 20:32:00.457775 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=3
   -16> 2018-03-01 20:32:00.457793 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=0
   -12> 2018-03-01 20:32:00.457823 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply Complete: ReadOp(tid=399089, to_read={42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head=read_request_t(to_read=[0,8404992,0], need=112(6),113(9),114(10),115(1),120(5),122(11),124(4),127(7),130(2),130(3),132(0), want_attrs=0)}, complete={42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head=read_result_t(r=0, errors={}, noattrs, returned=(0, 8404992, [112(6),466944, 113(9),466944, 114(8),466944, 114(10),466944, 115(1),466944, 120(5),466944, 122(11),466944, 124(4),466944, 127(7),466944, 130(2),466944, 130(3),466944, 132(0),466944])),42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head=read_result_t(r=0, errors={}, noattrs, returned=(0, 8404992, [112(6),466944, 113(9),466944, 114(10),466944, 115(1),466944, 120(5),466944, 124(4),466944, 127(7),466944, 130(3),466944, 132(0),466944]))}, priority=3, obj_to_source={42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head=112(6),113(9),114(8),114(10),115(1),120(5),122(11),124(4),127(7),130(2),130(3),132(0),42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head=112(6),113(9),114(10),115(1),120(5),124(4),127(7),130(3),132(0)}, source_to_obj={112(6)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head,113(9)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head,114(8)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,114(10)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head,115(1)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head,120(5)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head,122(11)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,124(4)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head,127(7)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head,130(2)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,130(3)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head,132(0)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head}, in_progress=)
     0> 2018-03-01 20:32:00.465086 7fe3a7993700 -1 /build/ceph-12.2.4/src/osd/ECBackend.cc: In function 'void ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)' thread 7fe3a7993700 time 2018-03-01 20:32:00.457921

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x561910bea872]
 2: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x106) [0x56191090a586]
 3: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0x113f) [0x561910912a6f]
 4: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x1c4) [0x5619109251b4]
 5: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x561910807ca0]
 6: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x543) [0x56191076c9d3]
 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x5619105e63b9]
 8: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x561910889047]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x56191060e9ae]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x561910bef664]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x561910bf26a0]
 12: (()+0x76ba) [0x7fe3c0d1d6ba]
 13: (clone()+0x6d) [0x7fe3bfd9441d]

(Filtered for the relevant thread and trimmed to the crashing request)

Full logs are attached.

crash-read-size.txt.gz (301 KB) Paul Emmerich, 03/01/2018 11:43 PM

crash-read-segfault.txt.gz (227 KB) Paul Emmerich, 03/01/2018 11:43 PM


Related issues

Related to RADOS - Bug #21931: osd: src/osd/ECBackend.cc: 2164: FAILED assert((offset + length) <= (range.first.get_off() + range.first.get_len())) Resolved 10/25/2017
Duplicated by RADOS - Bug #21992: osd: src/common/interval_map.h: 161: FAILED assert(len > 0) Duplicate 10/31/2017
Copied to RADOS - Backport #23850: luminous: Read operations segfaulting multiple OSDs Resolved

History

#1 Updated by John Spray about 1 year ago

  • Project changed from Ceph to RADOS
  • Category changed from OSD to Correctness/Safety
  • Component(RADOS) OSD added

#2 Updated by Greg Farnum 12 months ago

  • Related to Bug #21931: osd: src/osd/ECBackend.cc: 2164: FAILED assert((offset + length) <= (range.first.get_off() + range.first.get_len())) added

#3 Updated by Josh Durgin 12 months ago

  • Status changed from New to In Progress
  • Assignee set to Josh Durgin
  • Priority changed from Normal to Urgent

#4 Updated by Josh Durgin 12 months ago

  • Status changed from In Progress to Need Review

https://github.com/ceph/ceph/pull/21273

I'm going to treat this issue as tracking the first crash, in send_all_remaining_reads. http://tracker.ceph.com/issues/21931 can track the 2nd one, which I haven't gotten to the bottom of yet.

#5 Updated by Josh Durgin 12 months ago

For anyone running into the send_all_remaining_reads() crash, a workaround is to use these osd settings:

osd recovery max single start = 1
osd recovery max active = 1

#6 Updated by David Zafman 12 months ago

Change the test-erasure-eio.sh test as following:

diff --git a/qa/standalone/erasure-code/test-erasure-eio.sh b/qa/standalone/erasure-code/test-erasure-eio.sh
index c0515a4..c7f257c 100755
--- a/qa/standalone/erasure-code/test-erasure-eio.sh
+++ b/qa/standalone/erasure-code/test-erasure-eio.sh
@@ -26,6 +26,7 @@ function run() {
     export CEPH_ARGS
     CEPH_ARGS+="--fsid=$(uuidgen) --auth-supported=none " 
     CEPH_ARGS+="--mon-host=$CEPH_MON " 
+    CEPH_ARGS+="--osd-recovery-max-single-start=5 " 

     local funcs=${@:-$(set | sed -n -e 's/^\(TEST_[0-9a-z_]*\) .*/\1/p')}
     for func in $funcs ; do

../qa/run-standalone.sh test-erasure-eio.sh 2>&1 | tee output.log

Produced this crash:

2018-04-06 19:40:58.808 7f2df20e1700 10 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] _handle_message: MOSDECSubOpReadReply(2.0s0 44/43 ECSubReadReply(tid=738, attrs_read=0)) v2
2018-04-06 19:40:58.808 7f2df20e1700 10 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply: reply ECSubReadReply(tid=738, attrs_read=0)
2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply have shard=1
2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply have shard=0
2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply have shard=2
2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply have shard=1
2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply have shard=0
2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply have shard=2
2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply have shard=3
2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply have shard=2
2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply minimum_to_decode failed
2018-04-06 19:40:58.808 7f2df20e1700 10 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] send_all_remaining_reads have/error shards=0,1,2,3
2018-04-06 19:40:58.808 7f2df20e1700 10 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] get_all_avail_shards: checking acting 0(3)
2018-04-06 19:40:58.808 7f2df20e1700 10 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] get_all_avail_shards: checking acting 1(1)
2018-04-06 19:40:58.808 7f2df20e1700 10 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] get_all_avail_shards: checking acting 3(0)
2018-04-06 19:40:58.808 7f2df20e1700 10 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] get_all_avail_shards: checking acting 4(2)
2018-04-06 19:40:58.808 7f2df20e1700 10 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] get_all_avail_shards: checking backfill 2(4)
2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply Complete: ReadOp(tid=738, to_read={2:e8366861:::obj250:head=read_request_t(to_read=[0,8392704,0], need={0(3)=[0,1]}, want_attrs=0)}, complete={2:e5124c98:::obj133:head=read_result_t(r=0, errors={}, noattrs, returned=(0, 8392704, [1(1),4096, 3(0),4096, 4(2),4096])),2:e6e28dcd:::obj87:head=read_result_t(r=0, errors={}, noattrs, returned=(0, 8392704, [1(1),4096, 3(0),4096, 4(2),4096])),2:e8366861:::obj250:head=read_result_t(r=-5, errors={1(1)=-5,3(0)=-5}, noattrs, returned=(0, 8392704, [0(3),4096, 4(2),4096]))}, priority=3, obj_to_source={2:e5124c98:::obj133:head=1(1),3(0),4(2),2:e6e28dcd:::obj87:head=1(1),3(0),4(2),2:e8366861:::obj250:head=0(3),1(1),3(0),4(2)}, source_to_obj={0(3)=2:e8366861:::obj250:head,1(1)=2:e5124c98:::obj133:head,2:e6e28dcd:::obj87:head,2:e8366861:::obj250:head,3(0)=2:e5124c98:::obj133:head,2:e6e28dcd:::obj87:head,2:e8366861:::obj250:head,4(2)=2:e5124c98:::obj133:head,2:e6e28dcd:::obj87:head,2:e8366861:::obj250:head}, in_progress=)
2018-04-06 19:40:58.816 7f2df20e1700 -1 /home/dzafman/ceph/src/osd/ECBackend.cc: In function 'void ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)' thread 7f2df20e1700 time 2018-04-06 19:40:58.812432
/home/dzafman/ceph/src/osd/ECBackend.cc: 1244: FAILED assert(rop.to_read.size() == rop.complete.size())

 ceph version 13.0.1-3375-g5801b5a (5801b5a2731504293b756fcd34ecf08998732586) mimic (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f2e11a20f02]
 2: (()+0x28d0d7) [0x7f2e11a210d7]
 3: (()+0x6aa44e) [0x564aa046244e]
 4: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0xe6e) [0x564aa046a74e]
 5: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x4ca) [0x564aa0478f0a]
 6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x564aa0361777]
 7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x6c3) [0x564aa0310fe3]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x336) [0x564aa0174166]
 9: (PGOpItem::run(OSD*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x564aa03e5c32]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1092) [0x564aa017c122]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x452) [0x7f2e11a25e52]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f2e11a27e50]
 13: (()+0x76ba) [0x7f2e105216ba]
 14: (clone()+0x6d) [0x7f2e0f7ab82d]

#7 Updated by Sage Weil 11 months ago

  • Status changed from Need Review to Resolved

#8 Updated by Josh Durgin 11 months ago

  • Status changed from Resolved to Pending Backport
  • Backport set to luminous

#9 Updated by Nathan Cutler 11 months ago

  • Copied to Backport #23850: luminous: Read operations segfaulting multiple OSDs added

#10 Updated by Sage Weil 11 months ago

  • Duplicated by Bug #21992: osd: src/common/interval_map.h: 161: FAILED assert(len > 0) added

#11 Updated by Nathan Cutler 11 months ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF