Bug #22064
Status: Closed
"RadosModel.h: 865: FAILED assert(0)" in rados-jewel-distro-basic-smithi
Description
Run: http://pulpito.ceph.com/teuthology-2017-11-04_02:00:03-rados-jewel-distro-basic-smithi/
Jobs: '1809520', '1809675', '1809884', '1809567', '1809464'
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2017-11-04_02:00:03-rados-jewel-distro-basic-smithi/1809675/teuthology.log
2017-11-07T14:00:39.083 INFO:tasks.rados.rados.0.smithi026.stdout:3324: got expected ENOENT (src dne)
2017-11-07T14:00:39.085 INFO:tasks.rados.rados.0.smithi026.stdout:3316: finishing write tid 1 to smithi02615974-649 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
2017-11-07T14:00:39.085 INFO:tasks.rados.rados.0.smithi026.stderr:Error: finished tid 1 when last_acked_tid was 6
2017-11-07T14:00:39.085 INFO:tasks.rados.rados.0.smithi026.stderr:./test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7f3ed9ffb700 time 2017-11-07 14:00:39.084107
2017-11-07T14:00:39.085 INFO:tasks.rados.rados.0.smithi026.stderr:./test/osd/RadosModel.h: 865: FAILED assert(0)
2017-11-07T14:00:39.093 INFO:tasks.rados.rados.0.smithi026.stderr: ceph version 10.2.10-42-g917cad9 (917cad9fd037118c2993f7b56eb1305727d007b8)
2017-11-07T14:00:39.094 INFO:tasks.rados.rados.0.smithi026.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x560a2414c385]
2017-11-07T14:00:39.094 INFO:tasks.rados.rados.0.smithi026.stderr: 2: (WriteOp::_finish(TestOp::CallbackInfo*)+0x4ed) [0x560a24135add]
2017-11-07T14:00:39.094 INFO:tasks.rados.rados.0.smithi026.stderr: 3: (write_callback(void*, void*)+0x19) [0x560a241431a9]
2017-11-07T14:00:39.094 INFO:tasks.rados.rados.0.smithi026.stderr: 4: (()+0x9ef9d) [0x7f3ee7687f9d]
2017-11-07T14:00:39.094 INFO:tasks.rados.rados.0.smithi026.stderr: 5: (()+0x87349) [0x7f3ee7670349]
2017-11-07T14:00:39.095 INFO:tasks.rados.rados.0.smithi026.stderr: 6: (()+0x176016) [0x7f3ee775f016]
2017-11-07T14:00:39.095 INFO:tasks.rados.rados.0.smithi026.stderr: 7: (()+0x7e25) [0x7f3ee6a51e25]
2017-11-07T14:00:39.095 INFO:tasks.rados.rados.0.smithi026.stderr: 8: (clone()+0x6d) [0x7f3ee593934d]
2017-11-07T14:00:39.095 INFO:tasks.rados.rados.0.smithi026.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Greg Farnum over 6 years ago
- Project changed from Ceph to RADOS
- Category set to Correctness/Safety
- Status changed from New to 12
Updated by Yuri Weinstein over 6 years ago
Also in http://pulpito.ceph.com/teuthology-2017-12-09_02:00:03-rados-jewel-distro-basic-smithi/
Jobs: '1946659', '1946742', '1946550', '1946560', '1946733', '1946624'
Updated by Nathan Cutler over 6 years ago
- Priority changed from Urgent to Immediate
Raising priority because this is readily reproducible in baseline jewel. (By "baseline jewel" I mean jewel HEAD, currently at
This might be of interest, from /a/teuthology-2017-12-09_02:00:03-rados-jewel-distro-basic-smithi/1946610
2017-12-13T17:49:31.214 INFO:tasks.rados.rados.0.smithi097.stdout:update_object_version oid 372 v 329 (ObjNum 371 snap 0 seq_num 371) dirty exists
2017-12-13T17:49:31.214 INFO:tasks.rados.rados.0.smithi097.stderr:Error: oid 372 write returned error code -34
2017-12-13T17:49:31.214 INFO:tasks.rados.rados.0.smithi097.stderr:Error: oid 372 write returned error code -34
2017-12-13T17:49:31.214 INFO:tasks.rados.rados.0.smithi097.stderr:Error: racing read on 372 returned version 0 rather than version 329
2017-12-13T17:49:31.214 INFO:tasks.rados.rados.0.smithi097.stderr:./test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7f4041ffb700 time 2017-12-13 17:49:31.378082
2017-12-13T17:49:31.215 INFO:tasks.rados.rados.0.smithi097.stderr:./test/osd/RadosModel.h: 892: FAILED assert(0 == "racing read got wrong version")
2017-12-13T17:49:31.215 INFO:tasks.rados.rados.0.smithi097.stderr: ceph version 10.2.10-50-g1557449 (1557449a53febd15f448ce5acd354a3f67e0c0ba)
2017-12-13T17:49:31.215 INFO:tasks.rados.rados.0.smithi097.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5628f1ecff10]
2017-12-13T17:49:31.215 INFO:tasks.rados.rados.0.smithi097.stderr: 2: (WriteOp::_finish(TestOp::CallbackInfo*)+0x4c3) [0x5628f1eaee03]
2017-12-13T17:49:31.215 INFO:tasks.rados.rados.0.smithi097.stderr: 3: (write_callback(void*, void*)+0x19) [0x5628f1ec7779]
2017-12-13T17:49:31.216 INFO:tasks.rados.rados.0.smithi097.stderr: 4: (()+0x9711d) [0x7f404f0ad11d]
2017-12-13T17:49:31.216 INFO:tasks.rados.rados.0.smithi097.stderr: 5: (()+0x81d79) [0x7f404f097d79]
2017-12-13T17:49:31.216 INFO:tasks.rados.rados.0.smithi097.stderr: 6: (()+0x171ace) [0x7f404f187ace]
2017-12-13T17:49:31.216 INFO:tasks.rados.rados.0.smithi097.stderr: 7: (()+0x76ba) [0x7f404e9d66ba]
In other words, the "racing read got wrong version" assert is tripped instead of the "finished tid X when last_acked_tid was Y" assert. (Both asserts are in the same if (waiting_on == 0) conditional block.)
Here's a list of the affected tests:
[1946733] rados/thrash/{0-size-min-size-overrides/2-size-1-min-size.yaml 1-pg-log-overrides/short_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} hobj-sort.yaml msgr-failures/few.yaml msgr/simple.yaml objectstore/filestore-xfs.yaml rados.yaml thrashers/pggrow.yaml workloads/snaps-few-objects.yaml}
[1946550] rados/thrash-erasure-code/{clusters/{fixed-2.yaml openstack.yaml} msgr-failures/few.yaml objectstore/filestore-xfs.yaml rados.yaml thrashers/morepggrow.yaml workloads/ec-rados-plugin=jerasure-k=3-m=1.yaml}
[1946659] rados/thrash/{0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/short_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} hobj-sort.yaml msgr-failures/fastclose.yaml msgr/random.yaml objectstore/filestore-xfs.yaml rados.yaml thrashers/pggrow.yaml workloads/small-objects.yaml}
[1946560] rados/thrash-erasure-code-isa/{arch/x86_64.yaml clusters/{fixed-2.yaml openstack.yaml} msgr-failures/few.yaml objectstore/filestore-xfs.yaml rados.yaml supported/centos_latest.yaml thrashers/pggrow.yaml workloads/ec-rados-plugin=isa-k=2-m=1.yaml}
[1946610] rados/thrash/{0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} hobj-sort.yaml msgr-failures/osd-delay.yaml msgr/random.yaml objectstore/filestore-xfs.yaml rados.yaml thrashers/default.yaml workloads/cache.yaml}
[1946540] rados/thrash/{0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/short_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} hobj-sort.yaml msgr-failures/osd-delay.yaml msgr/random.yaml objectstore/filestore-xfs.yaml rados.yaml thrashers/mapgap.yaml workloads/cache.yaml}
[1946505] rados/upgrade/{hammer-x-singleton/{0-cluster/{openstack.yaml start.yaml} 1-hammer-install/hammer.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml} 6-next-mon/monb.yaml 7-workload/{radosbench.yaml rbd_api.yaml} 8-next-mon/monc.yaml 9-workload/{ec-rados-plugin=jerasure-k=3-m=1.yaml rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml test_cache-pool-snaps.yaml}} rados.yaml}
[1946624] rados/thrash-erasure-code-isa/{arch/x86_64.yaml clusters/{fixed-2.yaml openstack.yaml} msgr-failures/osd-delay.yaml objectstore/filestore-xfs.yaml rados.yaml supported/centos_latest.yaml thrashers/morepggrow.yaml workloads/ec-rados-plugin=isa-k=2-m=1.yaml}
[1946742] rados/thrash-erasure-code/{clusters/{fixed-2.yaml openstack.yaml} msgr-failures/osd-delay.yaml objectstore/filestore-xfs.yaml rados.yaml thrashers/fastread.yaml workloads/ec-rados-plugin=jerasure-k=2-m=1.yaml}
Updated by Nathan Cutler over 6 years ago
I'm going through the jewel rados nightlies now to figure out when it started. Here is a list of SHA1s where the bug is seen, going back in time:
- 1557449a53febd15f448ce5acd354a3f67e0c0ba (tip of jewel as of December 14, 2017)
- 917cad9fd037118c2993f7b56eb1305727d007b8
- 59900b4469c04fa530b018e0bfd0065dbde27872
- 3f0996289daaca32abaa042ab6b784bf8b58361a (integration test of https://github.com/ceph/ceph/pull/17815)
The following runs were free of this bug:
- http://pulpito.ceph.com/smithfarm-2017-09-12_17:11:54-rados-wip-jewel-backports-distro-basic-smithi/ (last integration round before 10.2.10 went to QE)
Updated by Nathan Cutler over 6 years ago
Pushed https://shaman.ceph.com/builds/ceph/wip-jewel-22064/ pointing to 900e8cfc0d3a8057c3528b5e1787560bb6c2f198, which is the last merge before PR#17815 went in.
Also pushed https://shaman.ceph.com/builds/ceph/wip-jewel-22064-with-pr17815 pointing to 066716cd997c94b22f4925e50205561b74a607a8 (the PR#17815 merge commit), and verified via "git diff" that the diff between these two branches is identical to PR#17815.
Updated by Josh Durgin over 6 years ago
- Assignee set to David Zafman
So Nathan seems to have narrowed it down to https://github.com/ceph/ceph/pull/17815 - can you look at this when you're back, David?
Updated by Nathan Cutler over 6 years ago
Actually, I may have seen an instance of the failure in a run that did not include 17815, so please don't take what I wrote as a "smoking gun".
Updated by David Zafman over 6 years ago
Since PR#17815 only changes when scrub is allowed to start, it shouldn't be related to this bug.
Updated by David Zafman over 6 years ago
Yuri Weinstein wrote:
This doesn't look like the same issue. It has been filed as Bug #22063:
2017-11-20T00:40:31.594 INFO:teuthology.orchestra.run.smithi037:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --cluster ceph osd primary-affinity 5 0'
2017-11-20T00:40:31.599 INFO:tasks.rados.rados.0.smithi037.stdout:2853: finishing copy_from to smithi03731985-32
2017-11-20T00:40:31.600 INFO:tasks.rados.rados.0.smithi037.stderr:./test/osd/RadosModel.h: In function 'virtual void CopyFromOp::_finish(TestOp::CallbackInfo*)' thread 7fab197fa700 time 2017-11-20 00:40:31.602028
2017-11-20T00:40:31.600 INFO:tasks.rados.rados.0.smithi037.stderr:./test/osd/RadosModel.h: 1703: FAILED assert(!version || comp->get_version64() == version)
2017-11-20T00:40:31.600 INFO:tasks.rados.rados.0.smithi037.stderr: ceph version 10.2.10-42-g917cad9 (917cad9fd037118c2993f7b56eb1305727d007b8)
2017-11-20T00:40:31.600 INFO:tasks.rados.rados.0.smithi037.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x55f5161cf385]
2017-11-20T00:40:31.600 INFO:tasks.rados.rados.0.smithi037.stderr: 2: (CopyFromOp::_finish(TestOp::CallbackInfo*)+0x4fb) [0x55f5161b2d5b]
2017-11-20T00:40:31.600 INFO:tasks.rados.rados.0.smithi037.stderr: 3: (write_callback(void*, void*)+0x19) [0x55f5161c61a9]
2017-11-20T00:40:31.601 INFO:tasks.rados.rados.0.smithi037.stderr: 4: (()+0x9ef9d) [0x7fab27006f9d]
2017-11-20T00:40:31.601 INFO:tasks.rados.rados.0.smithi037.stderr: 5: (()+0x87349) [0x7fab26fef349]
2017-11-20T00:40:31.601 INFO:tasks.rados.rados.0.smithi037.stderr: 6: (()+0x176016) [0x7fab270de016]
2017-11-20T00:40:31.601 INFO:tasks.rados.rados.0.smithi037.stderr: 7: (()+0x7e25) [0x7fab263d0e25]
2017-11-20T00:40:31.602 INFO:tasks.rados.rados.0.smithi037.stderr: 8: (clone()+0x6d) [0x7fab252b834d]
2017-11-20T00:40:31.602 INFO:tasks.rados.rados.0.smithi037.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by David Zafman over 6 years ago
- Related to Bug #22063: "RadosModel.h: 1703: FAILED assert(!version || comp->get_version64() == version)" in rados-jewel-distro-basic-smithi added
Updated by David Zafman about 6 years ago
- Related to Bug #22123: osd: objecter sends out of sync with pg epochs for proxied ops added
Updated by Josh Durgin about 6 years ago
Looking through the logs more with David, we found this sequence of events in 1946610:
1) osd.5 gets a write to object 372 on the cache tier, and proxies it
2) base tier and cache tier pgs are moved to osd.0 as primary
3) osd.5 resends the proxied write
4) osd.0 tries to promote object 372, reading version 347 from the base tier
5) osd.0's base tier processes the proxied write, making the object version 372
6) osd.0's cache tier continues the promotion, getting -ERANGE for its assert version == 347
So this looks like it's the same as #22123, but easier to trigger on jewel
Updated by David Zafman about 6 years ago
As Josh said, it seems easier to trigger in jewel; this is based on my attempt to reproduce it in master. All 50 master runs passed with a --filter where I tried to re-create an equivalent run.
Master: {0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/short_pg_log.yaml backoff/normal.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-balancer/off.yaml msgr-failures/osd-delay.yaml msgr/random.yaml objectstore/filestore-xfs.yaml rados.yaml rocksdb.yaml thrashers/default.yaml thrashosds-health.yaml workloads/cache.yaml}
Jewel: {0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} fs/xfs.yaml hobj-sort.yaml msgr-failures/osd-delay.yaml msgr/random.yaml rados.yaml thrashers/default.yaml workloads/cache.yaml}
Updated by David Zafman about 6 years ago
- Related to deleted (Bug #22123: osd: objecter sends out of sync with pg epochs for proxied ops)
Updated by David Zafman about 6 years ago
- Is duplicate of Bug #22123: osd: objecter sends out of sync with pg epochs for proxied ops added
Updated by Nathan Cutler about 6 years ago
- Related to Bug #22743: "RadosModel.h: 854: FAILED assert(0)" in upgrade:hammer-x-jewel-distro-basic-smithi added