Bug #23195
Read operations segfaulting multiple OSDs
0%
Description
I'm seeing some OSDs crashing at the same time with (mostly) the same error message related to a reading an erasure coded object.
The crashes occur on different OSDs and the request in the log refers to different PGs, so not a problem specific with some PG or OSD.
However, I believe all crashing PGs were already in a degraded state before the crash.
Here's a log filtered to the relevant parts.
-426> 2018-03-01 21:08:48.597788 7f580446d700 20 osd.121 op_wq(2) _process empty q, waiting -409> 2018-03-01 21:08:48.598627 7f580446d700 20 osd.121 op_wq(2) _process 42.12s0 item PGQueueable(0x561e9fe1ec40 prio 127 cost 0 e15989) queued -408> 2018-03-01 21:08:48.598636 7f580446d700 20 osd.121 op_wq(2) _process 42.12s0 item PGQueueable(0x561e9fe1ec40 prio 127 cost 0 e15989) pg 0x561f7ff27000 -407> 2018-03-01 21:08:48.598640 7f580446d700 10 osd.121 15989 dequeue_op 0x561e9fe1ec40 prio 127 cost 0 latency 0.000586 MOSDECSubOpReadReply(42.12s0 15989/15984 ECSubReadReply(tid=646841, attrs_read=0)) v2 pg pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] -406> 2018-03-01 21:08:48.598673 7f580446d700 20 osd.121 15989 share_map osd.114 172.30.5.34:6800/1299971 15989 -405> 2018-03-01 21:08:48.598679 7f580446d700 20 osd.121 15989 should_share_map osd.114 172.30.5.34:6800/1299971 15989 -404> 2018-03-01 21:08:48.598688 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] _handle_message: MOSDECSubOpReadReply(42.12s0 15989/15984 ECSubReadReply(tid=646841, attrs_read=0)) v2 -403> 2018-03-01 21:08:48.598719 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply: reply ECSubReadReply(tid=646841, attrs_read=0) -402> 2018-03-01 21:08:48.598750 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=2 -401> 2018-03-01 21:08:48.598776 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=9 -400> 2018-03-01 21:08:48.598803 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=3 -399> 2018-03-01 21:08:48.598829 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=10 -398> 2018-03-01 21:08:48.598856 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=4 -397> 2018-03-01 21:08:48.598883 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=1 -396> 2018-03-01 21:08:48.598924 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=5 -395> 2018-03-01 21:08:48.598950 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=0 -394> 2018-03-01 21:08:48.598976 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=7 -393> 2018-03-01 21:08:48.599002 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=11 -392> 2018-03-01 21:08:48.599028 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=6 -391> 2018-03-01 21:08:48.599066 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply have shard=5 -390> 2018-03-01 21:08:48.599089 7f580446d700 20 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] handle_sub_read_reply minimum_to_decode failed -389> 2018-03-01 21:08:48.599117 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] send_all_remaining_reads have/error shards=5 -388> 2018-03-01 21:08:48.599142 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 114(2) -387> 2018-03-01 21:08:48.599172 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 116(9) -386> 2018-03-01 21:08:48.599199 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 117(10) -385> 2018-03-01 21:08:48.599226 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 118(4) -384> 2018-03-01 21:08:48.599253 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 119(1) -383> 2018-03-01 21:08:48.599281 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 120(5) -382> 2018-03-01 21:08:48.599310 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 121(0) -381> 2018-03-01 21:08:48.599336 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 122(7) -380> 2018-03-01 21:08:48.599364 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 124(11) -379> 2018-03-01 21:08:48.599391 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking acting 128(6) -378> 2018-03-01 21:08:48.599419 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] get_all_avail_shards: checking backfill 117(3) -377> 2018-03-01 21:08:48.599466 7f580446d700 10 osd.121 pg_epoch: 15989 pg[42.12s0( v 15989'33233784 (15948'33223684,15989'33233784] local-lis/les=15984/15985 n=15239 ec=1815/1815 lis/c 15984/14087 les/c/f 15985/14100/0 15984/15984/15915) [121,119,114,117,118,135,128,122,2147483647,116,120,124]/[121,119,114,2147483647,118,120,128,122,2147483647,116,117,124]p121(0) r=0 lpr=15984 pi=[14087,15984)/15 rops=2 bft=117(3),120(10),135(5) crt=15989'33233784 lcod 15989'33233782 mlcod 15989'33233782 active+undersized+degraded+remapped+backfilling] send_all_remaining_reads Read remaining shards 114(2),116(9),117(3),117(10),118(4),119(1),121(0),122(7),124(11),128(6) 0> 2018-03-01 21:08:48.606605 7f580446d700 -1 *** Caught signal (Segmentation fault) ** in thread 7f580446d700 thread_name:tp_osd_tp ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable) 1: (()+0xa74234) [0x561e29afb234] 2: (()+0x11390) [0x7f582200a390] 3: (std::__cxx11::list<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>, std::allocator<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> > >::list(std::__cxx11::list<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>, std::allocator<boost::tuples::tuple<unsigned long, unsigned long, unsigned int, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> > > const&)+0x44) [0x561e2987dab4] 4: (ECBackend::send_all_remaining_reads(hobject_t const&, ECBackend::ReadOp&)+0x308) [0x561e29865378] 5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0x11bb) [0x561e29866aeb] 6: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x1c4) [0x561e298791b4] 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x561e2975bca0] 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x543) [0x561e296c09d3] 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x561e2953a3b9] 10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x561e297dd047] 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x561e295629ae] 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x561e29b43664] 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x561e29b466a0] 14: (()+0x76ba) [0x7f58220006ba] 15: (clone()+0x6d) [0x7f582107741d]
(Yes, that EC profile has a size of 12 and yes, that's probably too much here).
This happenes every 10-30 Minutes on around 4 different OSDs. We have tried to enable fast_read on the ec pool and that seems to cause fewer OSDs to crash, but that might be a coincidence.
I've also observed a second crash on the primary OSD of a PG, it seems to be quite rare. I've only seen it twice while the others occur all the time.
-82> 2018-03-01 20:32:00.457043 7fe3a7993700 20 osd.132 op_wq(2) _process empty q, waiting -60> 2018-03-01 20:32:00.457288 7fe3a7993700 20 osd.132 op_wq(2) _process 42.42s0 item PGQueueable(0x561aedd29640 prio 127 cost 0 e15964) queued -59> 2018-03-01 20:32:00.457294 7fe3a7993700 20 osd.132 op_wq(2) _process 42.42s0 item PGQueueable(0x561aedd29640 prio 127 cost 0 e15964) pg 0x561927d22000 -57> 2018-03-01 20:32:00.457298 7fe3a7993700 10 osd.132 15964 dequeue_op 0x561aedd29640 prio 127 cost 0 latency 0.000824 MOSDECSubOpReadReply(42.42s0 15964/15923 ECSubReadReply(tid=399089, attrs_read=0)) v2 pg pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] -56> 2018-03-01 20:32:00.457319 7fe3a7993700 20 osd.132 15964 share_map osd.124 172.30.5.38:6802/720160 15964 -54> 2018-03-01 20:32:00.457341 7fe3a7993700 20 osd.132 15964 should_share_map osd.124 172.30.5.38:6802/720160 15964 -51> 2018-03-01 20:32:00.457356 7fe3a7993700 10 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] _handle_message: MOSDECSubOpReadReply(42.42s0 15964/15923 ECSubReadReply(tid=399089, attrs_read=0)) v2 -50> 2018-03-01 20:32:00.457384 7fe3a7993700 10 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply: reply ECSubReadReply(tid=399089, attrs_read=0) -48> 2018-03-01 20:32:00.457413 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=6 -46> 2018-03-01 20:32:00.457433 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=9 -44> 2018-03-01 20:32:00.457448 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=8 -42> 2018-03-01 20:32:00.457467 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=10 -41> 2018-03-01 20:32:00.457483 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=1 -39> 2018-03-01 20:32:00.457503 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=5 -37> 2018-03-01 20:32:00.457518 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=11 -36> 2018-03-01 20:32:00.457535 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=4 -34> 2018-03-01 20:32:00.457558 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=7 -32> 2018-03-01 20:32:00.457573 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=2 -31> 2018-03-01 20:32:00.457589 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=3 -29> 2018-03-01 20:32:00.457604 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=0 -26> 2018-03-01 20:32:00.457645 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=6 -25> 2018-03-01 20:32:00.457661 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=9 -24> 2018-03-01 20:32:00.457676 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=10 -23> 2018-03-01 20:32:00.457694 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=1 -22> 2018-03-01 20:32:00.457712 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=5 -20> 2018-03-01 20:32:00.457731 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=4 -19> 2018-03-01 20:32:00.457757 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=7 -18> 2018-03-01 20:32:00.457775 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=3 -16> 2018-03-01 20:32:00.457793 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply have shard=0 -12> 2018-03-01 20:32:00.457823 7fe3a7993700 20 osd.132 pg_epoch: 15964 pg[42.42s0( v 15954'36316063 (15851'36253620,15954'36316063] local-lis/les=15923/15924 n=15533 ec=1815/1815 lis/c 15923/14259 les/c/f 15924/14260/0 15922/15923/15902) [132,115,130,117,124,120,112,127,114,113,135,122]/[132,115,2147483647,130,124,120,112,127,129,113,114,122]p132(0) r=0 lpr=15923 pi=[14259,15923)/17 rops=2 bft=114(8),117(3),130(2),135(10) crt=15954'36316063 lcod 15954'36316062 mlcod 0'0 active+recovering+undersized+degraded+remapped] handle_sub_read_reply Complete: ReadOp(tid=399089, to_read={42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head=read_request_t(to_read=[0,8404992,0], need=112(6),113(9),114(10),115(1),120(5),122(11),124(4),127(7),130(2),130(3),132(0), want_attrs=0)}, complete={42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head=read_result_t(r=0, errors={}, noattrs, returned=(0, 8404992, [112(6),466944, 113(9),466944, 114(8),466944, 114(10),466944, 115(1),466944, 120(5),466944, 122(11),466944, 124(4),466944, 127(7),466944, 130(2),466944, 130(3),466944, 132(0),466944])),42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head=read_result_t(r=0, errors={}, noattrs, returned=(0, 8404992, [112(6),466944, 113(9),466944, 114(10),466944, 115(1),466944, 120(5),466944, 124(4),466944, 127(7),466944, 130(3),466944, 132(0),466944]))}, priority=3, obj_to_source={42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head=112(6),113(9),114(8),114(10),115(1),120(5),122(11),124(4),127(7),130(2),130(3),132(0),42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head=112(6),113(9),114(10),115(1),120(5),124(4),127(7),130(3),132(0)}, source_to_obj={112(6)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head,113(9)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head,114(8)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,114(10)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head,115(1)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head,120(5)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head,122(11)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,124(4)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head,127(7)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head,130(2)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,130(3)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head,132(0)=42:42b87b9e:::rbd_data.41.263ecc2ae8944a.00000000000006ed:head,42:43a9a4a1:::rbd_data.41.277b572ae8944a.0000000000006f6a:head}, in_progress=) 0> 2018-03-01 20:32:00.465086 7fe3a7993700 -1 /build/ceph-12.2.4/src/osd/ECBackend.cc: In function 'void ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)' thread 7fe3a7993700 time 2018-03-01 20:32:00.457921 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x561910bea872] 2: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x106) [0x56191090a586] 3: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0x113f) [0x561910912a6f] 4: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x1c4) [0x5619109251b4] 5: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x561910807ca0] 6: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x543) [0x56191076c9d3] 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x5619105e63b9] 8: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x561910889047] 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x56191060e9ae] 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x561910bef664] 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x561910bf26a0] 12: (()+0x76ba) [0x7fe3c0d1d6ba] 13: (clone()+0x6d) [0x7fe3bfd9441d]
(Filtered for the relevant thread and trimmed to the crashing request)
Full logs are attached.
Related issues
History
#1 Updated by John Spray about 6 years ago
- Project changed from Ceph to RADOS
- Category changed from OSD to Correctness/Safety
- Component(RADOS) OSD added
#2 Updated by Greg Farnum almost 6 years ago
- Related to Bug #21931: osd: src/osd/ECBackend.cc: 2164: FAILED assert((offset + length) <= (range.first.get_off() + range.first.get_len())) added
#3 Updated by Josh Durgin almost 6 years ago
- Status changed from New to In Progress
- Assignee set to Josh Durgin
- Priority changed from Normal to Urgent
#4 Updated by Josh Durgin almost 6 years ago
- Status changed from In Progress to Fix Under Review
https://github.com/ceph/ceph/pull/21273
I'm going to treat this issue as tracking the first crash, in send_all_remaining_reads. http://tracker.ceph.com/issues/21931 can track the 2nd one, which I haven't gotten to the bottom of yet.
#5 Updated by Josh Durgin almost 6 years ago
For anyone running into the send_all_remaining_reads() crash, a workaround is to use these osd settings:
osd recovery max single start = 1 osd recovery max active = 1
#6 Updated by David Zafman almost 6 years ago
Change the test-erasure-eio.sh test as following:
diff --git a/qa/standalone/erasure-code/test-erasure-eio.sh b/qa/standalone/erasure-code/test-erasure-eio.sh index c0515a4..c7f257c 100755 --- a/qa/standalone/erasure-code/test-erasure-eio.sh +++ b/qa/standalone/erasure-code/test-erasure-eio.sh @@ -26,6 +26,7 @@ function run() { export CEPH_ARGS CEPH_ARGS+="--fsid=$(uuidgen) --auth-supported=none " CEPH_ARGS+="--mon-host=$CEPH_MON " + CEPH_ARGS+="--osd-recovery-max-single-start=5 " local funcs=${@:-$(set | sed -n -e 's/^\(TEST_[0-9a-z_]*\) .*/\1/p')} for func in $funcs ; do
../qa/run-standalone.sh test-erasure-eio.sh 2>&1 | tee output.log
Produced this crash:
2018-04-06 19:40:58.808 7f2df20e1700 10 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] _handle_message: MOSDECSubOpReadReply(2.0s0 44/43 ECSubReadReply(tid=738, attrs_read=0)) v2 2018-04-06 19:40:58.808 7f2df20e1700 10 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply: reply ECSubReadReply(tid=738, attrs_read=0) 2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply have shard=1 2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply have shard=0 2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply have shard=2 2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply have shard=1 2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply have shard=0 2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply have shard=2 2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply have shard=3 2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply have shard=2 2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply minimum_to_decode failed 2018-04-06 19:40:58.808 7f2df20e1700 10 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] send_all_remaining_reads have/error shards=0,1,2,3 2018-04-06 19:40:58.808 7f2df20e1700 10 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] get_all_avail_shards: checking acting 0(3) 2018-04-06 19:40:58.808 7f2df20e1700 10 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] get_all_avail_shards: checking acting 1(1) 2018-04-06 19:40:58.808 7f2df20e1700 10 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] get_all_avail_shards: checking acting 3(0) 2018-04-06 19:40:58.808 7f2df20e1700 10 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] get_all_avail_shards: checking acting 4(2) 2018-04-06 19:40:58.808 7f2df20e1700 10 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] get_all_avail_shards: checking backfill 2(4) 2018-04-06 19:40:58.808 7f2df20e1700 20 osd.3 pg_epoch: 44 pg[2.0s0( v 40'301 (40'200,40'301] local-lis/les=43/44 n=301 ec=34/34 lis/c 43/34 les/c/f 44/35/0 42/43/34) [3,1,4,0,2]/[3,1,4,0,2147483647]p3(0) backfill=[2(4)] r=0 lpr=43 pi=[34,43)/1 rops=3 bft=2(4) crt=40'301 lcod 40'300 mlcod 0'0 active+undersized+degraded+remapped+backfilling mbc={0={},1={},2={},3={},4={}}] handle_sub_read_reply Complete: ReadOp(tid=738, to_read={2:e8366861:::obj250:head=read_request_t(to_read=[0,8392704,0], need={0(3)=[0,1]}, want_attrs=0)}, complete={2:e5124c98:::obj133:head=read_result_t(r=0, errors={}, noattrs, returned=(0, 8392704, [1(1),4096, 3(0),4096, 4(2),4096])),2:e6e28dcd:::obj87:head=read_result_t(r=0, errors={}, noattrs, returned=(0, 8392704, [1(1),4096, 3(0),4096, 4(2),4096])),2:e8366861:::obj250:head=read_result_t(r=-5, errors={1(1)=-5,3(0)=-5}, noattrs, returned=(0, 8392704, [0(3),4096, 4(2),4096]))}, priority=3, obj_to_source={2:e5124c98:::obj133:head=1(1),3(0),4(2),2:e6e28dcd:::obj87:head=1(1),3(0),4(2),2:e8366861:::obj250:head=0(3),1(1),3(0),4(2)}, source_to_obj={0(3)=2:e8366861:::obj250:head,1(1)=2:e5124c98:::obj133:head,2:e6e28dcd:::obj87:head,2:e8366861:::obj250:head,3(0)=2:e5124c98:::obj133:head,2:e6e28dcd:::obj87:head,2:e8366861:::obj250:head,4(2)=2:e5124c98:::obj133:head,2:e6e28dcd:::obj87:head,2:e8366861:::obj250:head}, in_progress=) 2018-04-06 19:40:58.816 7f2df20e1700 -1 /home/dzafman/ceph/src/osd/ECBackend.cc: In function 'void ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)' thread 7f2df20e1700 time 2018-04-06 19:40:58.812432 /home/dzafman/ceph/src/osd/ECBackend.cc: 1244: FAILED assert(rop.to_read.size() == rop.complete.size()) ceph version 13.0.1-3375-g5801b5a (5801b5a2731504293b756fcd34ecf08998732586) mimic (dev) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f2e11a20f02] 2: (()+0x28d0d7) [0x7f2e11a210d7] 3: (()+0x6aa44e) [0x564aa046244e] 4: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0xe6e) [0x564aa046a74e] 5: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x4ca) [0x564aa0478f0a] 6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x564aa0361777] 7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x6c3) [0x564aa0310fe3] 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x336) [0x564aa0174166] 9: (PGOpItem::run(OSD*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x564aa03e5c32] 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1092) [0x564aa017c122] 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x452) [0x7f2e11a25e52] 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f2e11a27e50] 13: (()+0x76ba) [0x7f2e105216ba] 14: (clone()+0x6d) [0x7f2e0f7ab82d]
#7 Updated by Sage Weil almost 6 years ago
- Status changed from Fix Under Review to Resolved
#8 Updated by Josh Durgin almost 6 years ago
- Status changed from Resolved to Pending Backport
- Backport set to luminous
#9 Updated by Nathan Cutler almost 6 years ago
- Copied to Backport #23850: luminous: Read operations segfaulting multiple OSDs added
#10 Updated by Sage Weil almost 6 years ago
- Duplicated by Bug #21992: osd: src/common/interval_map.h: 161: FAILED assert(len > 0) added
#11 Updated by Nathan Cutler almost 6 years ago
- Status changed from Pending Backport to Resolved