Project

General

Profile

Actions

Bug #22123

closed

osd: objecter sends out of sync with pg epochs for proxied ops

Added by Sage Weil over 6 years ago. Updated about 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Tiering
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous, jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2017-11-14 00:37:04.496 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7810 3.0 3:0b332f33:::smithi1441403-241 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [sparse-read 0~2033680,omap-get-vals-by-keys,omap-get-keys,omap-get-vals,omap-get-header,getxattrs] snapc 0=[] RETRY=3 ondisk+retry+read+known_if_redirected e95) v8) prio 63 cost 32 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7812 3.0 3:0b332f33:::smithi1441403-241 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [read 0~2033680,checksum] snapc 0=[] RETRY=3 ondisk+retry+read+known_if_redirected e95) v8) prio 63 cost 4 e115)
2017-11-14 00:37:04.496 7f251983e700 10 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] cancel_proxy_ops 3:1288273c:::smithi1441403-126:head requeuing 1 requests
2017-11-14 00:37:04.496 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c7800
2017-11-14 00:37:04.496 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7827 3.0 3:1288273c:::smithi1441403-126:head [stat,delete] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 0 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7811 3.0 3:0b332f33:::smithi1441403-241 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [read 0~2033680,checksum] snapc 0=[] RETRY=3 ondisk+retry+read+known_if_redirected e95) v8) prio 63 cost 4 e115)
2017-11-14 00:37:04.496 7f251983e700 10 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] cancel_proxy_ops 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head requeuing 4 requests
2017-11-14 00:37:04.496 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c9d80
2017-11-14 00:37:04.496 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7826 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [read 0~1] snapc 0=[] RETRY=3 ondisk+retry+read+rwordered+known_if_redirected e95) v8) prio 63 cost 0 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7810 3.0 3:0b332f33:::smithi1441403-241 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [sparse-read 0~2033680,omap-get-vals-by-keys,omap-get-keys,omap-get-vals,omap-get-header,getxattrs] snapc 0=[] RETRY=3 ondisk+retry+read+known_if_redirected e95) v8) prio 63 cost 32 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c7b00
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7825 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [setxattr _header (356),truncate 3193660] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 363 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7827 3.0 3:1288273c:::smithi1441403-126:head [stat,delete] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 0 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c8b80
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7824 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [write 2037739~583136] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 583136 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7826 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [read 0~1] snapc 0=[] RETRY=3 ondisk+retry+read+rwordered+known_if_redirected e95) v8) prio 63 cost 0 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c9c00
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7823 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [write 793660~503513,stat] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 503513 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7825 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [setxattr _header (356),truncate 3193660] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 363 e115)
2017-11-14 00:37:04.497 7f251983e700 10 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] cancel_proxy_ops 3:27587c4d:::smithi1441403-404:head requeuing 4 requests
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c9480
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7809 3.0 3:27587c4d:::smithi1441403-404:head [read 0~1] snapc 0=[] RETRY=3 ondisk+retry+read+rwordered+known_if_redirected e95) v8) prio 63 cost 0 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7824 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [write 2037739~583136] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 583136 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c8580
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7808 3.0 3:27587c4d:::smithi1441403-404:head [setxattr _header (56),truncate 3160046] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 63 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7823 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [write 793660~503513,stat] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 503513 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea9a9aa80
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7807 3.0 3:27587c4d:::smithi1441403-404:head [write 2037128~695820] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 695820 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7809 3.0 3:27587c4d:::smithi1441403-404:head [read 0~1] snapc 0=[] RETRY=3 ondisk+retry+read+rwordered+known_if_redirected e95) v8) prio 63 cost 0 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea9a9a780
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7806 3.0 3:27587c4d:::smithi1441403-404:head [write 760046~753015,stat] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 753015 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7808 3.0 3:27587c4d:::smithi1441403-404:head [setxattr _header (56),truncate 3160046] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 63 e115)
2017-11-14 00:37:04.497 7f251983e700 10 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] cancel_proxy_ops 3:3d6e4248:::smithi1441403-422:head requeuing 5 requests
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c6c00
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7822 3.0 3:3d6e4248:::smithi1441403-422:head [read 0~1] snapc 0=[] RETRY=3 ondisk+retry+read+rwordered+known_if_redirected e95) v8) prio 63 cost 0 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7807 3.0 3:27587c4d:::smithi1441403-404:head [write 2037128~695820] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 695820 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c8e80
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7821 3.0 3:3d6e4248:::smithi1441403-422:head [setxattr _header (56),truncate 3176853] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 63 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7806 3.0 3:27587c4d:::smithi1441403-404:head [write 760046~753015,stat] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 753015 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c8a00
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7820 3.0 3:3d6e4248:::smithi1441403-422:head [write 2836307~340546] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 340546 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7822 3.0 3:3d6e4248:::smithi1441403-422:head [read 0~1] snapc 0=[] RETRY=3 ondisk+retry+read+rwordered+known_if_redirected e95) v8) prio 63 cost 0 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c7c80
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7819 3.0 3:3d6e4248:::smithi1441403-422:head [write 1979257~439478] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 439478 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7821 3.0 3:3d6e4248:::smithi1441403-422:head [setxattr _header (56),truncate 3176853] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 63 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c9600
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7818 3.0 3:3d6e4248:::smithi1441403-422:head [write 776853~428264,stat] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 428264 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7820 3.0 3:3d6e4248:::smithi1441403-422:head [write 2836307~340546] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 340546 e115)
...
2017-11-14 00:37:04.499 7f2514834700 20 osd.2 op_wq(0) _process OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7823 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [write 793660~503513,stat] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 503513 e115) queued

/a/sage-2017-11-13_22:25:20-rados-wip-sage-testing-2017-11-13-0823-distro-basic-smithi/1844914

Related issues 6 (1 open5 closed)

Related to RADOS - Bug #22743: "RadosModel.h: 854: FAILED assert(0)" in upgrade:hammer-x-jewel-distro-basic-smithiNew01/21/2018

Actions
Related to RADOS - Bug #23598: hammer->jewel: ceph_test_rados crashes during radosbench task in jewel rados upgrade testDuplicate04/09/2018

Actions
Has duplicate RADOS - Bug #22064: "RadosModel.h: 865: FAILED assert(0)" in rados-jewel-distro-basic-smithiDuplicateDavid Zafman11/07/2017

Actions
Has duplicate RADOS - Bug #22063: "RadosModel.h: 1703: FAILED assert(!version || comp->get_version64() == version)" inrados-jewel-distro-basic-smithi Duplicate11/07/2017

Actions
Copied to RADOS - Backport #23075: luminous: osd: objecter sends out of sync with pg epochs for proxied opsResolvedPrashant DActions
Copied to RADOS - Backport #23076: jewel: osd: objecter sends out of sync with pg epochs for proxied opsResolvedNathan CutlerActions
Actions #1

Updated by Sage Weil over 6 years ago

enqueue_front
client.4264.0:7826
client.4264.0:7825
client.4264.0:7824
client.4264.0:7823
enqueue_front shuffle
client.4264.0:7809 with client.4264.0:7824 (!!! this should be 7823 !!!)
client.4264.0:7808 with client.4264.0:7823
...

Actions #2

Updated by Sage Weil over 6 years ago

  • Subject changed from osd: misordered ops after enqueue_front shuffling to osd: objecter sends out of sync with pg epochs for proxied ops
  • Priority changed from Urgent to High

Okay, the shuffling is actually fine.

Real problem appears to be that objecter osdmap updates race with pg map updates, and the requeueing of proxied ops can race with teh objecter atcually sending those ops:

- start proxy op A (epoch E)
- start proxy op B (epoch E)
- start proxy op C (epoch E)
- objecter sends none of these because target is down in epoch E
- osdmap update to E+1
- pg cancels requeues A, B
- objecter updates to E+1
- objecter sends C
- pg cancels/requeues C

actually, it's not even that, it's harder:

- pg cancels A, B
- object receives backoff release, sends C
- pg cancels C

2017-11-14 00:37:44.817 7fbba2be1700 10 osd.7 pg_epoch: 134 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=116/117 n=51 ec=20/20 lis/c 116/116 les/c/f 117/117/0 134/134/116) [5,7]/[7,2] r=0 lpr=134 pi=[116,134)/1 crt=97'367 lcod 97'366 mlcod 0'0 remapped] cancel_proxy_write 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head
2017-11-14 00:37:44.817 7fbba2be1700 10 osd.7 pg_epoch: 134 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=116/117 n=51 ec=20/20 lis/c 116/116 les/c/f 117/117/0 134/134/116) [5,7]/[7,2] r=0 lpr=134 pi=[116,134)/1 crt=97'367 lcod 97'366 mlcod 0'0 remapped] cancel_proxy_write 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head
2017-11-14 00:37:44.817 7fbba53e6700  1 -- 172.21.15.1:0/14841 <== osd.3 172.21.15.144:6812/27580 2 ==== osd_backoff(2.0 unblock id 1 [2:00000000::::head,2:40000000::::head) e134) v1 ==== 115+0+0 (1476801449 0 0) 0x55bfc5eae780 con 0x55bfc52a6800
2017-11-14 00:37:44.817 7fbba2be1700 10 osd.7 pg_epoch: 134 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=116/117 n=51 ec=20/20 lis/c 116/116 les/c/f 117/117/0 134/134/116) [5,7]/[7,2] r=0 lpr=134 pi=[116,134)/1 crt=97'367 lcod 97'366 mlcod 0'0 remapped] cancel_proxy_write 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head
2017-11-14 00:37:44.818 7fbba53e6700  1 -- 172.21.15.1:0/14841 --> 172.21.15.144:6812/27580 -- osd_op(client.4264.0:7825 2.0 2:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [setxattr _header (356),truncate 3193660] snapc 0=[] RETRY=1 ondisk+retry+write+ignore_cache+ignore_overlay+known_if_redirected e134) v8 -- ?+0 0x55bfc8c0c680 con 0x55bfc52a6800

one fix might be to simply not cancel proxy ops since we can't cancel them all atomically.

another fix might be to generate a list of all ops to cancel and cancel them all atomically in one objecter call.

Actions #3

Updated by Sage Weil over 6 years ago

  • Project changed from Ceph to RADOS
  • Category set to Tiering
Actions #4

Updated by Greg Farnum over 6 years ago

We could also cancel them in reverse order; that won't break ordering and doesn't require new interfaces.

Actions #5

Updated by Sage Weil over 6 years ago

  • Status changed from In Progress to 12
Actions #6

Updated by David Zafman about 6 years ago

  • Related to Bug #22064: "RadosModel.h: 865: FAILED assert(0)" in rados-jewel-distro-basic-smithi added
Actions #7

Updated by David Zafman about 6 years ago

  • Related to deleted (Bug #22064: "RadosModel.h: 865: FAILED assert(0)" in rados-jewel-distro-basic-smithi)
Actions #8

Updated by David Zafman about 6 years ago

  • Has duplicate Bug #22064: "RadosModel.h: 865: FAILED assert(0)" in rados-jewel-distro-basic-smithi added
Actions #9

Updated by David Zafman about 6 years ago

  • Has duplicate Bug #22063: "RadosModel.h: 1703: FAILED assert(!version || comp->get_version64() == version)" inrados-jewel-distro-basic-smithi added
Actions #10

Updated by Nathan Cutler about 6 years ago

  • Related to Bug #22743: "RadosModel.h: 854: FAILED assert(0)" in upgrade:hammer-x-jewel-distro-basic-smithi added
Actions #11

Updated by Sage Weil about 6 years ago

  • Status changed from 12 to Fix Under Review

https://github.com/ceph/ceph/pull/20484

I opted for the marginally more complex solution of cancelling multiple ops atomically. The objecter change to do it is trivial.

Actions #12

Updated by Sage Weil about 6 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to luminous
Actions #13

Updated by Nathan Cutler about 6 years ago

  • Backport changed from luminous to luminous, jewel
Actions #14

Updated by Nathan Cutler about 6 years ago

  • Copied to Backport #23075: luminous: osd: objecter sends out of sync with pg epochs for proxied ops added
Actions #15

Updated by Nathan Cutler about 6 years ago

  • Copied to Backport #23076: jewel: osd: objecter sends out of sync with pg epochs for proxied ops added
Actions #16

Updated by Nathan Cutler about 6 years ago

  • Status changed from Pending Backport to Resolved
Actions #17

Updated by Nathan Cutler about 6 years ago

Despite the jewel backport of this fix being merged, this problem has reappeared in jewel 10.2.11 integration testing. See #23598

Actions #18

Updated by Nathan Cutler about 6 years ago

  • Related to Bug #23598: hammer->jewel: ceph_test_rados crashes during radosbench task in jewel rados upgrade test added
Actions

Also available in: Atom PDF