Bug #22123

osd: objecter sends out of sync with pg epochs for proxied ops

Added by Sage Weil over 1 year ago. Updated 12 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Tiering
Target version:
-
Start date:
11/14/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
luminous, jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

2017-11-14 00:37:04.496 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7810 3.0 3:0b332f33:::smithi1441403-241 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [sparse-read 0~2033680,omap-get-vals-by-keys,omap-get-keys,omap-get-vals,omap-get-header,getxattrs] snapc 0=[] RETRY=3 ondisk+retry+read+known_if_redirected e95) v8) prio 63 cost 32 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7812 3.0 3:0b332f33:::smithi1441403-241 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [read 0~2033680,checksum] snapc 0=[] RETRY=3 ondisk+retry+read+known_if_redirected e95) v8) prio 63 cost 4 e115)
2017-11-14 00:37:04.496 7f251983e700 10 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] cancel_proxy_ops 3:1288273c:::smithi1441403-126:head requeuing 1 requests
2017-11-14 00:37:04.496 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c7800
2017-11-14 00:37:04.496 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7827 3.0 3:1288273c:::smithi1441403-126:head [stat,delete] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 0 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7811 3.0 3:0b332f33:::smithi1441403-241 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [read 0~2033680,checksum] snapc 0=[] RETRY=3 ondisk+retry+read+known_if_redirected e95) v8) prio 63 cost 4 e115)
2017-11-14 00:37:04.496 7f251983e700 10 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] cancel_proxy_ops 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head requeuing 4 requests
2017-11-14 00:37:04.496 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c9d80
2017-11-14 00:37:04.496 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7826 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [read 0~1] snapc 0=[] RETRY=3 ondisk+retry+read+rwordered+known_if_redirected e95) v8) prio 63 cost 0 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7810 3.0 3:0b332f33:::smithi1441403-241 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [sparse-read 0~2033680,omap-get-vals-by-keys,omap-get-keys,omap-get-vals,omap-get-header,getxattrs] snapc 0=[] RETRY=3 ondisk+retry+read+known_if_redirected e95) v8) prio 63 cost 32 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c7b00
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7825 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [setxattr _header (356),truncate 3193660] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 363 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7827 3.0 3:1288273c:::smithi1441403-126:head [stat,delete] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 0 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c8b80
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7824 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [write 2037739~583136] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 583136 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7826 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [read 0~1] snapc 0=[] RETRY=3 ondisk+retry+read+rwordered+known_if_redirected e95) v8) prio 63 cost 0 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c9c00
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7823 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [write 793660~503513,stat] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 503513 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7825 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [setxattr _header (356),truncate 3193660] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 363 e115)
2017-11-14 00:37:04.497 7f251983e700 10 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] cancel_proxy_ops 3:27587c4d:::smithi1441403-404:head requeuing 4 requests
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c9480
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7809 3.0 3:27587c4d:::smithi1441403-404:head [read 0~1] snapc 0=[] RETRY=3 ondisk+retry+read+rwordered+known_if_redirected e95) v8) prio 63 cost 0 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7824 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [write 2037739~583136] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 583136 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c8580
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7808 3.0 3:27587c4d:::smithi1441403-404:head [setxattr _header (56),truncate 3160046] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 63 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7823 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [write 793660~503513,stat] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 503513 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea9a9aa80
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7807 3.0 3:27587c4d:::smithi1441403-404:head [write 2037128~695820] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 695820 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7809 3.0 3:27587c4d:::smithi1441403-404:head [read 0~1] snapc 0=[] RETRY=3 ondisk+retry+read+rwordered+known_if_redirected e95) v8) prio 63 cost 0 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea9a9a780
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7806 3.0 3:27587c4d:::smithi1441403-404:head [write 760046~753015,stat] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 753015 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7808 3.0 3:27587c4d:::smithi1441403-404:head [setxattr _header (56),truncate 3160046] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 63 e115)
2017-11-14 00:37:04.497 7f251983e700 10 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] cancel_proxy_ops 3:3d6e4248:::smithi1441403-422:head requeuing 5 requests
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c6c00
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7822 3.0 3:3d6e4248:::smithi1441403-422:head [read 0~1] snapc 0=[] RETRY=3 ondisk+retry+read+rwordered+known_if_redirected e95) v8) prio 63 cost 0 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7807 3.0 3:27587c4d:::smithi1441403-404:head [write 2037128~695820] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 695820 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c8e80
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7821 3.0 3:3d6e4248:::smithi1441403-422:head [setxattr _header (56),truncate 3176853] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 63 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7806 3.0 3:27587c4d:::smithi1441403-404:head [write 760046~753015,stat] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 753015 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c8a00
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7820 3.0 3:3d6e4248:::smithi1441403-422:head [write 2836307~340546] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 340546 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7822 3.0 3:3d6e4248:::smithi1441403-422:head [read 0~1] snapc 0=[] RETRY=3 ondisk+retry+read+rwordered+known_if_redirected e95) v8) prio 63 cost 0 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c7c80
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7819 3.0 3:3d6e4248:::smithi1441403-422:head [write 1979257~439478] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 439478 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7821 3.0 3:3d6e4248:::smithi1441403-422:head [setxattr _header (56),truncate 3176853] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 63 e115)
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 pg_epoch: 115 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=96/97 n=51 ec=20/20 lis/c 96/96 les/c/f 97/97/0 115/115/95) [7]/[2,7] r=0 lpr=115 pi=[96,115)/1 crt=97'367 mlcod 0'0 remapped] requeue_op 0x55fea90c9600
2017-11-14 00:37:04.497 7f251983e700 20 osd.2 op_wq(0) _enqueue_front OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7818 3.0 3:3d6e4248:::smithi1441403-422:head [write 776853~428264,stat] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 428264 e115) shuffled w/ OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7820 3.0 3:3d6e4248:::smithi1441403-422:head [write 2836307~340546] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 340546 e115)
...
2017-11-14 00:37:04.499 7f2514834700 20 osd.2 op_wq(0) _process OpQueueItem(3.0 PGOpItem(op=osd_op(client.4264.0:7823 3.0 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [write 793660~503513,stat] snapc 0=[] RETRY=3 ondisk+retry+write+known_if_redirected e95) v8) prio 63 cost 503513 e115) queued

/a/sage-2017-11-13_22:25:20-rados-wip-sage-testing-2017-11-13-0823-distro-basic-smithi/1844914

Related issues

Related to RADOS - Bug #22743: "RadosModel.h: 854: FAILED assert(0)" in upgrade:hammer-x-jewel-distro-basic-smithi New 01/21/2018
Related to RADOS - Bug #23598: hammer->jewel: ceph_test_rados crashes during radosbench task in jewel rados upgrade test Duplicate 04/09/2018
Duplicated by RADOS - Bug #22064: "RadosModel.h: 865: FAILED assert(0)" in rados-jewel-distro-basic-smithi Duplicate 11/07/2017
Duplicated by RADOS - Bug #22063: "RadosModel.h: 1703: FAILED assert(!version || comp->get_version64() == version)" in rados-jewel-distro-basic-smithi Duplicate 11/07/2017
Copied to RADOS - Backport #23075: luminous: osd: objecter sends out of sync with pg epochs for proxied ops Resolved
Copied to RADOS - Backport #23076: jewel: osd: objecter sends out of sync with pg epochs for proxied ops Resolved

History

#1 Updated by Sage Weil over 1 year ago

enqueue_front
client.4264.0:7826
client.4264.0:7825
client.4264.0:7824
client.4264.0:7823
enqueue_front shuffle
client.4264.0:7809 with client.4264.0:7824 (!!! this should be 7823 !!!)
client.4264.0:7808 with client.4264.0:7823
...

#2 Updated by Sage Weil over 1 year ago

  • Subject changed from osd: misordered ops after enqueue_front shuffling to osd: objecter sends out of sync with pg epochs for proxied ops
  • Priority changed from Urgent to High

Okay, the shuffling is actually fine.

Real problem appears to be that objecter osdmap updates race with pg map updates, and the requeueing of proxied ops can race with the objecter actually sending those ops:

- start proxy op A (epoch E)
- start proxy op B (epoch E)
- start proxy op C (epoch E)
- objecter sends none of these because target is down in epoch E
- osdmap update to E+1
- pg cancels/requeues A, B
- objecter updates to E+1
- objecter sends C
- pg cancels/requeues C

actually, it's not even that; it's harder:

- pg cancels A, B
- objecter receives backoff release, sends C
- pg cancels C

2017-11-14 00:37:44.817 7fbba2be1700 10 osd.7 pg_epoch: 134 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=116/117 n=51 ec=20/20 lis/c 116/116 les/c/f 117/117/0 134/134/116) [5,7]/[7,2] r=0 lpr=134 pi=[116,134)/1 crt=97'367 lcod 97'366 mlcod 0'0 remapped] cancel_proxy_write 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head
2017-11-14 00:37:44.817 7fbba2be1700 10 osd.7 pg_epoch: 134 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=116/117 n=51 ec=20/20 lis/c 116/116 les/c/f 117/117/0 134/134/116) [5,7]/[7,2] r=0 lpr=134 pi=[116,134)/1 crt=97'367 lcod 97'366 mlcod 0'0 remapped] cancel_proxy_write 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head
2017-11-14 00:37:44.817 7fbba53e6700  1 -- 172.21.15.1:0/14841 <== osd.3 172.21.15.144:6812/27580 2 ==== osd_backoff(2.0 unblock id 1 [2:00000000::::head,2:40000000::::head) e134) v1 ==== 115+0+0 (1476801449 0 0) 0x55bfc5eae780 con 0x55bfc52a6800
2017-11-14 00:37:44.817 7fbba2be1700 10 osd.7 pg_epoch: 134 pg[3.0( v 97'367 (0'0,97'367] local-lis/les=116/117 n=51 ec=20/20 lis/c 116/116 les/c/f 117/117/0 134/134/116) [5,7]/[7,2] r=0 lpr=134 pi=[116,134)/1 crt=97'367 lcod 97'366 mlcod 0'0 remapped] cancel_proxy_write 3:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head
2017-11-14 00:37:44.818 7fbba53e6700  1 -- 172.21.15.1:0/14841 --> 172.21.15.144:6812/27580 -- osd_op(client.4264.0:7825 2.0 2:14993ae1:::smithi1441403-11 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:head [setxattr _header (356),truncate 3193660] snapc 0=[] RETRY=1 ondisk+retry+write+ignore_cache+ignore_overlay+known_if_redirected e134) v8 -- ?+0 0x55bfc8c0c680 con 0x55bfc52a6800
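
To make the race window concrete, here is a minimal, self-contained C++ sketch. This is not Ceph code: FakeObjecter, op_cancel, and send_pending are hypothetical stand-ins for the objecter's per-op cancel path and its send path. The point is only that each per-op cancel acquires and releases the lock independently, so a concurrently handled backoff release can slip into the gap and send an op the PG has already decided to requeue.

#include <cstdint>
#include <iostream>
#include <mutex>
#include <set>

struct FakeObjecter {
  std::mutex lock;             // guards the pending-op set
  std::set<uint64_t> pending;  // tids of ops not yet sent

  // Cancel a single op; the lock is released again on return.
  void op_cancel(uint64_t tid) {
    std::lock_guard<std::mutex> l(lock);
    pending.erase(tid);
  }

  // Models the send path that runs on e.g. a backoff release:
  // everything still pending goes out on the wire.
  void send_pending() {
    std::lock_guard<std::mutex> l(lock);
    for (uint64_t tid : pending)
      std::cout << "sending op tid " << tid << "\n";
    pending.clear();
  }
};

int main() {
  FakeObjecter o;
  o.pending = {7823, 7824, 7825};  // A, B, C in submission order

  // PG-side cancellation loop: one objecter call per op.
  o.op_cancel(7823);  // cancel A
  o.op_cancel(7824);  // cancel B
  // <-- race window: a backoff release handled here sends C (7825)
  //     even though the PG is about to cancel/requeue it.
  o.send_pending();
  o.op_cancel(7825);  // too late, C already went out
  return 0;
}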

one fix might be to simply not cancel proxy ops, since we can't cancel them all atomically.

another fix might be to generate a list of all ops to cancel and cancel them all atomically in one objecter call.
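
A minimal sketch of that second option, reusing the hypothetical FakeObjecter from the sketch above (the real change would go through the objecter's actual interfaces): collect the tids first, then erase them all under a single acquisition of the lock, so the send path can run strictly before or strictly after the batch but never in the middle.

#include <vector>

// Assumes the FakeObjecter definition from the sketch above.
// Cancel a whole batch in one critical section; send_pending()
// cannot interleave between individual cancels.
void op_cancel_batch(FakeObjecter& o, const std::vector<uint64_t>& tids) {
  std::lock_guard<std::mutex> l(o.lock);
  for (uint64_t tid : tids)
    o.pending.erase(tid);
}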

#3 Updated by Sage Weil over 1 year ago

  • Project changed from Ceph to RADOS
  • Category set to Tiering

#4 Updated by Greg Farnum over 1 year ago

We could also cancel them in reverse order; that won't break ordering and doesn't require new interfaces.
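
A minimal sketch of this idea, again against the hypothetical FakeObjecter from note #2: cancel newest-first, so any op the objecter races to send during the loop is older than everything already cancelled, and the requeued ops still complete behind it in client order.

#include <vector>

// Assumes the FakeObjecter definition from note #2.
// Cancel in reverse submission order; per-op locking is then safe,
// because a racy send can only emit an op older than those already
// cancelled, which preserves client ordering.
void cancel_newest_first(FakeObjecter& o,
                         const std::vector<uint64_t>& tids_oldest_first) {
  for (auto it = tids_oldest_first.rbegin();
       it != tids_oldest_first.rend(); ++it) {
    o.op_cancel(*it);
  }
}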

#5 Updated by Sage Weil over 1 year ago

  • Status changed from In Progress to Verified

#6 Updated by David Zafman about 1 year ago

  • Related to Bug #22064: "RadosModel.h: 865: FAILED assert(0)" in rados-jewel-distro-basic-smithi added

#7 Updated by David Zafman about 1 year ago

  • Related to deleted (Bug #22064: "RadosModel.h: 865: FAILED assert(0)" in rados-jewel-distro-basic-smithi)

#8 Updated by David Zafman about 1 year ago

  • Duplicated by Bug #22064: "RadosModel.h: 865: FAILED assert(0)" in rados-jewel-distro-basic-smithi added

#9 Updated by David Zafman about 1 year ago

  • Duplicated by Bug #22063: "RadosModel.h: 1703: FAILED assert(!version || comp->get_version64() == version)" in rados-jewel-distro-basic-smithi added

#10 Updated by Nathan Cutler about 1 year ago

  • Related to Bug #22743: "RadosModel.h: 854: FAILED assert(0)" in upgrade:hammer-x-jewel-distro-basic-smithi added

#11 Updated by Sage Weil about 1 year ago

  • Status changed from Verified to Need Review

https://github.com/ceph/ceph/pull/20484

I opted for the marginally more complex solution of cancelling multiple ops atomically. The objecter change to do it is trivial.

#12 Updated by Sage Weil about 1 year ago

  • Status changed from Need Review to Pending Backport
  • Backport set to luminous

#13 Updated by Nathan Cutler about 1 year ago

  • Backport changed from luminous to luminous, jewel

#14 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #23075: luminous: osd: objecter sends out of sync with pg epochs for proxied ops added

#15 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #23076: jewel: osd: objecter sends out of sync with pg epochs for proxied ops added

#16 Updated by Nathan Cutler 12 months ago

  • Status changed from Pending Backport to Resolved

#17 Updated by Nathan Cutler 12 months ago

Despite the jewel backport of this fix being merged, this problem has reappeared in jewel 10.2.11 integration testing. See #23598

#18 Updated by Nathan Cutler 12 months ago

  • Related to Bug #23598: hammer->jewel: ceph_test_rados crashes during radosbench task in jewel rados upgrade test added
