Project

General

Profile

Bug #22064

"RadosModel.h: 865: FAILED assert(0)" in rados-jewel-distro-basic-smithi

Added by Yuri Weinstein over 1 year ago. Updated over 1 year ago.

Status:
Duplicate
Priority:
Immediate
Assignee:
Category:
Correctness/Safety
Target version:
-
Start date:
11/07/2017
Due date:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
Pull request ID:

Description

Run: http://pulpito.ceph.com/teuthology-2017-11-04_02:00:03-rados-jewel-distro-basic-smithi/
Jobs: '1809520', '1809675', '1809884', '1809567', '1809464'
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2017-11-04_02:00:03-rados-jewel-distro-basic-smithi/1809675/teuthology.log

2017-11-07T14:00:39.083 INFO:tasks.rados.rados.0.smithi026.stdout:3324:  got expected ENOENT (src dne)
2017-11-07T14:00:39.085 INFO:tasks.rados.rados.0.smithi026.stdout:3316:  finishing write tid 1 to smithi02615974-649 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
2017-11-07T14:00:39.085 INFO:tasks.rados.rados.0.smithi026.stderr:Error: finished tid 1 when last_acked_tid was 6
2017-11-07T14:00:39.085 INFO:tasks.rados.rados.0.smithi026.stderr:./test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7f3ed9ffb700 time 2017-11-07 14:00:39.084107
2017-11-07T14:00:39.085 INFO:tasks.rados.rados.0.smithi026.stderr:./test/osd/RadosModel.h: 865: FAILED assert(0)
2017-11-07T14:00:39.093 INFO:tasks.rados.rados.0.smithi026.stderr: ceph version 10.2.10-42-g917cad9 (917cad9fd037118c2993f7b56eb1305727d007b8)
2017-11-07T14:00:39.094 INFO:tasks.rados.rados.0.smithi026.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x560a2414c385]
2017-11-07T14:00:39.094 INFO:tasks.rados.rados.0.smithi026.stderr: 2: (WriteOp::_finish(TestOp::CallbackInfo*)+0x4ed) [0x560a24135add]
2017-11-07T14:00:39.094 INFO:tasks.rados.rados.0.smithi026.stderr: 3: (write_callback(void*, void*)+0x19) [0x560a241431a9]
2017-11-07T14:00:39.094 INFO:tasks.rados.rados.0.smithi026.stderr: 4: (()+0x9ef9d) [0x7f3ee7687f9d]
2017-11-07T14:00:39.094 INFO:tasks.rados.rados.0.smithi026.stderr: 5: (()+0x87349) [0x7f3ee7670349]
2017-11-07T14:00:39.095 INFO:tasks.rados.rados.0.smithi026.stderr: 6: (()+0x176016) [0x7f3ee775f016]
2017-11-07T14:00:39.095 INFO:tasks.rados.rados.0.smithi026.stderr: 7: (()+0x7e25) [0x7f3ee6a51e25]
2017-11-07T14:00:39.095 INFO:tasks.rados.rados.0.smithi026.stderr: 8: (clone()+0x6d) [0x7f3ee593934d]
2017-11-07T14:00:39.095 INFO:tasks.rados.rados.0.smithi026.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Related issues

Related to RADOS - Bug #22063: "RadosModel.h: 1703: FAILED assert(!version || comp->get_version64() == version)" inrados-jewel-distro-basic-smithi Duplicate 11/07/2017
Related to RADOS - Bug #22743: "RadosModel.h: 854: FAILED assert(0)" in upgrade:hammer-x-jewel-distro-basic-smithi New 01/21/2018
Duplicates RADOS - Bug #22123: osd: objecter sends out of sync with pg epochs for proxied ops Resolved 11/14/2017

History

#2 Updated by Greg Farnum over 1 year ago

  • Project changed from Ceph to RADOS
  • Category set to Correctness/Safety
  • Status changed from New to Verified

#3 Updated by Yuri Weinstein over 1 year ago

Also in http://pulpito.ceph.com/teuthology-2017-12-09_02:00:03-rados-jewel-distro-basic-smithi/
Jobs: '1946659', '1946742', '1946550', '1946560', '1946733', '1946624'

#4 Updated by Nathan Cutler over 1 year ago

  • Priority changed from Urgent to Immediate

Raising priority because this is readily reproducible in baseline jewel. (By "baseline jewel" I mean jewel HEAD, currently at

This might be of interest, from /a/teuthology-2017-12-09_02:00:03-rados-jewel-distro-basic-smithi/1946610

2017-12-13T17:49:31.214 INFO:tasks.rados.rados.0.smithi097.stdout:update_object_version oid 372 v 329 (ObjNum 371 snap 0 seq_num 371) dirty exists
2017-12-13T17:49:31.214 INFO:tasks.rados.rados.0.smithi097.stderr:Error: oid 372 write returned error code -34
2017-12-13T17:49:31.214 INFO:tasks.rados.rados.0.smithi097.stderr:Error: oid 372 write returned error code -34
2017-12-13T17:49:31.214 INFO:tasks.rados.rados.0.smithi097.stderr:Error: racing read on 372 returned version 0 rather than version 329
2017-12-13T17:49:31.214 INFO:tasks.rados.rados.0.smithi097.stderr:./test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7f4041ffb700 time 2017-12-13 17:49:31.378082
2017-12-13T17:49:31.215 INFO:tasks.rados.rados.0.smithi097.stderr:./test/osd/RadosModel.h: 892: FAILED assert(0 == "racing read got wrong version")
2017-12-13T17:49:31.215 INFO:tasks.rados.rados.0.smithi097.stderr: ceph version 10.2.10-50-g1557449 (1557449a53febd15f448ce5acd354a3f67e0c0ba)
2017-12-13T17:49:31.215 INFO:tasks.rados.rados.0.smithi097.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5628f1ecff10]
2017-12-13T17:49:31.215 INFO:tasks.rados.rados.0.smithi097.stderr: 2: (WriteOp::_finish(TestOp::CallbackInfo*)+0x4c3) [0x5628f1eaee03]
2017-12-13T17:49:31.215 INFO:tasks.rados.rados.0.smithi097.stderr: 3: (write_callback(void*, void*)+0x19) [0x5628f1ec7779]
2017-12-13T17:49:31.216 INFO:tasks.rados.rados.0.smithi097.stderr: 4: (()+0x9711d) [0x7f404f0ad11d]
2017-12-13T17:49:31.216 INFO:tasks.rados.rados.0.smithi097.stderr: 5: (()+0x81d79) [0x7f404f097d79]
2017-12-13T17:49:31.216 INFO:tasks.rados.rados.0.smithi097.stderr: 6: (()+0x171ace) [0x7f404f187ace]
2017-12-13T17:49:31.216 INFO:tasks.rados.rados.0.smithi097.stderr: 7: (()+0x76ba) [0x7f404e9d66ba]

In other words, the "racing read got wrong version" assert is tripped instead of the "finished tid X when last_acked_tid_was Y" assert. (Both asserts are in the same if (waiting_on == 0) conditional block.)

Here's a list of the affected tests:

[1946733] rados/thrash/{0-size-min-size-overrides/2-size-1-min-size.yaml 1-pg-log-overrides/short_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} hobj-sort.yaml msgr-failures/few.yaml msgr/simple.yaml objectstore/filestore-xfs.yaml rados.yaml thrashers/pggrow.yaml workloads/snaps-few-objects.yaml}

[1946550] rados/thrash-erasure-code/{clusters/{fixed-2.yaml openstack.yaml} msgr-failures/few.yaml objectstore/filestore-xfs.yaml rados.yaml thrashers/morepggrow.yaml workloads/ec-rados-plugin=jerasure-k=3-m=1.yaml}

[1946659] rados/thrash/{0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/short_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} hobj-sort.yaml msgr-failures/fastclose.yaml msgr/random.yaml objectstore/filestore-xfs.yaml rados.yaml thrashers/pggrow.yaml workloads/small-objects.yaml}

[1946560] rados/thrash-erasure-code-isa/{arch/x86_64.yaml clusters/{fixed-2.yaml openstack.yaml} msgr-failures/few.yaml objectstore/filestore-xfs.yaml rados.yaml supported/centos_latest.yaml thrashers/pggrow.yaml workloads/ec-rados-plugin=isa-k=2-m=1.yaml}

[1946610] rados/thrash/{0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} hobj-sort.yaml msgr-failures/osd-delay.yaml msgr/random.yaml objectstore/filestore-xfs.yaml rados.yaml thrashers/default.yaml workloads/cache.yaml}

[1946540] rados/thrash/{0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/short_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} hobj-sort.yaml msgr-failures/osd-delay.yaml msgr/random.yaml objectstore/filestore-xfs.yaml rados.yaml thrashers/mapgap.yaml workloads/cache.yaml}

[1946505] rados/upgrade/{hammer-x-singleton/{0-cluster/{openstack.yaml start.yaml} 1-hammer-install/hammer.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml} 6-next-mon/monb.yaml 7-workload/{radosbench.yaml rbd_api.yaml} 8-next-mon/monc.yaml 9-workload/{ec-rados-plugin=jerasure-k=3-m=1.yaml rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml test_cache-pool-snaps.yaml}} rados.yaml}

[1946624] rados/thrash-erasure-code-isa/{arch/x86_64.yaml clusters/{fixed-2.yaml openstack.yaml} msgr-failures/osd-delay.yaml objectstore/filestore-xfs.yaml rados.yaml supported/centos_latest.yaml thrashers/morepggrow.yaml workloads/ec-rados-plugin=isa-k=2-m=1.yaml}

[1946742] rados/thrash-erasure-code/{clusters/{fixed-2.yaml openstack.yaml} msgr-failures/osd-delay.yaml objectstore/filestore-xfs.yaml rados.yaml thrashers/fastread.yaml workloads/ec-rados-plugin=jerasure-k=2-m=1.yaml}

#5 Updated by Nathan Cutler over 1 year ago

I'm going through the jewel rados nightlies now to figure out when it started. Here is a list of SHA1s where the bug is seen, going back in time:

1557449a53febd15f448ce5acd354a3f67e0c0ba (tip of jewel as of December 14, 2017)
917cad9fd037118c2993f7b56eb1305727d007b8
59900b4469c04fa530b018e0bfd0065dbde27872
3f0996289daaca32abaa042ab6b784bf8b58361a (integration test of https://github.com/ceph/ceph/pull/17815)

Following runs were free of this bug:

#6 Updated by Nathan Cutler over 1 year ago

Pushed https://shaman.ceph.com/builds/ceph/wip-jewel-22064/ pointing to 900e8cfc0d3a8057c3528b5e1787560bb6c2f198 which is that last merge before PR#17815 went in

Also pushed https://shaman.ceph.com/builds/ceph/wip-jewel-22064-with-pr17815 pointing to 066716cd997c94b22f4925e50205561b74a607a8 - the PR#17815 merge commit, and verified via "git diff" that the diff between these two branches is identical to PR#17815.

#7 Updated by Josh Durgin over 1 year ago

  • Assignee set to David Zafman

So Nathan seems to have narrowed it down to https://github.com/ceph/ceph/pull/17815 - can you look at this when you're back David?

#8 Updated by Nathan Cutler over 1 year ago

Actually, I may have seen an instance of the failure in a run that did not include 17815, so please don't take what I wrote as a "smoking gun".

#9 Updated by David Zafman over 1 year ago

As 17815 has to do with when scrub is allowed to start, it wouldn't be related to this bug.

#10 Updated by David Zafman over 1 year ago

Yuri Weinstein wrote:

Also in http://qa-proxy.ceph.com/teuthology/teuthology-2017-11-17_18:17:24-rados-jewel-distro-basic-smithi/1857527/teuthology.log

This doesn't look the same this one. It has been filed as BUG #22063

2017-11-20T00:40:31.594 INFO:teuthology.orchestra.run.smithi037:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --cluster ceph osd primary-affinity 5 0'
2017-11-20T00:40:31.599 INFO:tasks.rados.rados.0.smithi037.stdout:2853: finishing copy_from to smithi03731985-32
2017-11-20T00:40:31.600 INFO:tasks.rados.rados.0.smithi037.stderr:./test/osd/RadosModel.h: In function 'virtual void CopyFromOp::_finish(TestOp::CallbackInfo*)' thread 7fab197fa700 time 2017-11-20 00:40:31.602028
2017-11-20T00:40:31.600 INFO:tasks.rados.rados.0.smithi037.stderr:./test/osd/RadosModel.h: 1703: FAILED assert(!version || comp->get_version64() == version)
2017-11-20T00:40:31.600 INFO:tasks.rados.rados.0.smithi037.stderr: ceph version 10.2.10-42-g917cad9 (917cad9fd037118c2993f7b56eb1305727d007b8)
2017-11-20T00:40:31.600 INFO:tasks.rados.rados.0.smithi037.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x55f5161cf385]
2017-11-20T00:40:31.600 INFO:tasks.rados.rados.0.smithi037.stderr: 2: (CopyFromOp::_finish(TestOp::CallbackInfo*)+0x4fb) [0x55f5161b2d5b]
2017-11-20T00:40:31.600 INFO:tasks.rados.rados.0.smithi037.stderr: 3: (write_callback(void*, void*)+0x19) [0x55f5161c61a9]
2017-11-20T00:40:31.601 INFO:tasks.rados.rados.0.smithi037.stderr: 4: (()+0x9ef9d) [0x7fab27006f9d]
2017-11-20T00:40:31.601 INFO:tasks.rados.rados.0.smithi037.stderr: 5: (()+0x87349) [0x7fab26fef349]
2017-11-20T00:40:31.601 INFO:tasks.rados.rados.0.smithi037.stderr: 6: (()+0x176016) [0x7fab270de016]
2017-11-20T00:40:31.601 INFO:tasks.rados.rados.0.smithi037.stderr: 7: (()+0x7e25) [0x7fab263d0e25]
2017-11-20T00:40:31.602 INFO:tasks.rados.rados.0.smithi037.stderr: 8: (clone()+0x6d) [0x7fab252b834d]
2017-11-20T00:40:31.602 INFO:tasks.rados.rados.0.smithi037.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#11 Updated by David Zafman over 1 year ago

  • Related to Bug #22063: "RadosModel.h: 1703: FAILED assert(!version || comp->get_version64() == version)" inrados-jewel-distro-basic-smithi added

#12 Updated by David Zafman over 1 year ago

  • Related to Bug #22123: osd: objecter sends out of sync with pg epochs for proxied ops added

#13 Updated by Josh Durgin over 1 year ago

Looking through the logs more with David, we found this sequence of events in 1946610:

1) osd.5 gets a write to object 372 on the cache tier, and proxies it
2) base tier and cache tier pgs are moved to osd.0 as primary
3) osd.5 resends the proxied write
4) osd.0 tries to promote object 372, reading version 347 from the base tier
5) osd.0's base tier processes the proxied write, making the object version 372
6) osd.0's cache tier continues the promotion, getting -ERANGE for its assert version == 347

So this looks like it's the same as #22123, but easier to trigger on jewel

#14 Updated by David Zafman over 1 year ago

As Josh said it seems easier to trigger in Jewel. This is based on my attempt to reproduce in master.

All 50 master runs passed with a --filter where I tried to re-create an equivalent run.

Master: {0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/short_pg_log.yaml backoff/normal.yaml ceph.yaml clusters/{fixed-2.yaml openstack.yaml} d-balancer/off.yaml msgr-failures/osd-delay.yaml msgr/random.yaml objectstore/filestore-xfs.yaml rados.yaml rocksdb.yaml thrashers/default.yaml thrashosds-health.yaml workloads/cache.yaml}

Jewel: {0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} fs/xfs.yaml hobj-sort.yaml msgr-failures/osd-delay.yaml msgr/random.yaml rados.yaml thrashers/default.yaml workloads/cache.yaml}

http://pulpito.ceph.com/dzafman-2018-01-24_15:59:27-rados:thrash-wip-zafman-testing-distro-basic-smithi/

#15 Updated by David Zafman over 1 year ago

  • Related to deleted (Bug #22123: osd: objecter sends out of sync with pg epochs for proxied ops)

#16 Updated by David Zafman over 1 year ago

  • Duplicates Bug #22123: osd: objecter sends out of sync with pg epochs for proxied ops added

#17 Updated by David Zafman over 1 year ago

  • Status changed from Verified to Duplicate

#18 Updated by Nathan Cutler over 1 year ago

  • Related to Bug #22743: "RadosModel.h: 854: FAILED assert(0)" in upgrade:hammer-x-jewel-distro-basic-smithi added

Also available in: Atom PDF