Bug #23598

closed

hammer->jewel: ceph_test_rados crashes during radosbench task in jewel rados upgrade test

Added by Nathan Cutler about 6 years ago. Updated almost 6 years ago.

Status: Duplicate
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

Test description: rados/upgrade/{hammer-x-singleton/{0-cluster/{openstack.yaml start.yaml} 1-hammer-install/hammer.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml} 6-next-mon/monb.yaml 7-workload/{radosbench.yaml rbd_api.yaml} 8-next-mon/monc.yaml 9-workload/{ec-rados-plugin=jerasure-k=3-m=1.yaml rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml test_cache-pool-snaps.yaml}} rados.yaml}

Symptom: crash during radosbench

Log excerpt:

2018-04-08T20:51:22.888 INFO:teuthology.task.full_sequential:In full_sequential, running task radosbench...
2018-04-08T20:51:22.888 INFO:tasks.radosbench:Beginning radosbench...

After some time, but still within radosbench:

2018-04-08T21:06:37.005 INFO:tasks.rados.rados.0.smithi130.stderr:./test/osd/RadosModel.h: In function 'virtual void CopyFromOp::_finish(TestOp::CallbackInfo*)' thread 7f11527fc700 time 2018-04-08 21:06:37.005490
2018-04-08T21:06:37.006 INFO:tasks.rados.rados.0.smithi130.stderr:./test/osd/RadosModel.h: 1597: FAILED assert(!version || comp->get_version64() == version)
2018-04-08T21:06:37.006 INFO:tasks.rados.rados.0.smithi130.stderr: ceph version 0.94.10-85-ga8e54ce (a8e54cee69fc2fdc8df27f35ebe1b56444f43317)
2018-04-08T21:06:37.007 INFO:tasks.rados.rados.0.smithi130.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x4e2cb5]
2018-04-08T21:06:37.007 INFO:tasks.rados.rados.0.smithi130.stderr: 2: (CopyFromOp::_finish(TestOp::CallbackInfo*)+0x4bb) [0x4c940b]
2018-04-08T21:06:37.007 INFO:tasks.rados.rados.0.smithi130.stderr: 3: (write_callback(void*, void*)+0x19) [0x4d9e49]
2018-04-08T21:06:37.007 INFO:tasks.rados.rados.0.smithi130.stderr: 4: (()+0x99b4d) [0x7f115f8bcb4d]
2018-04-08T21:06:37.007 INFO:tasks.rados.rados.0.smithi130.stderr: 5: (()+0x73379) [0x7f115f896379]
2018-04-08T21:06:37.007 INFO:tasks.rados.rados.0.smithi130.stderr: 6: (()+0x13eb88) [0x7f115f961b88]
2018-04-08T21:06:37.007 INFO:tasks.rados.rados.0.smithi130.stderr: 7: (()+0x7e25) [0x7f115ec8be25]
2018-04-08T21:06:37.008 INFO:tasks.rados.rados.0.smithi130.stderr: 8: (clone()+0x6d) [0x7f115dd8c34d]
2018-04-08T21:06:37.008 INFO:tasks.rados.rados.0.smithi130.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2018-04-08T21:06:37.008 INFO:tasks.rados.rados.0.smithi130.stderr:terminate called after throwing an instance of 'ceph::FailedAssertion'

Alternatively (in some runs), the crash is:

2018-04-08T09:02:33.922 INFO:tasks.rados.rados.0.smithi016.stderr:Error: finished tid 1 when last_acked_tid was 6
2018-04-08T09:02:33.922 INFO:tasks.rados.rados.0.smithi016.stderr:./test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7f953ffff700 time 2018-04-08 09:02:33.913642
2018-04-08T09:02:33.922 INFO:tasks.rados.rados.0.smithi016.stderr:./test/osd/RadosModel.h: 854: FAILED assert(0)
2018-04-08T09:02:33.922 INFO:tasks.rados.rados.0.smithi016.stderr: ceph version 0.94.10-85-ga8e54ce (a8e54cee69fc2fdc8df27f35ebe1b56444f43317)
2018-04-08T09:02:33.923 INFO:tasks.rados.rados.0.smithi016.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x4e2cb5]
2018-04-08T09:02:33.923 INFO:tasks.rados.rados.0.smithi016.stderr: 2: (WriteOp::_finish(TestOp::CallbackInfo*)+0x4a3) [0x4c9ce3]
2018-04-08T09:02:33.923 INFO:tasks.rados.rados.0.smithi016.stderr: 3: (write_callback(void*, void*)+0x19) [0x4d9e49]
2018-04-08T09:02:33.923 INFO:tasks.rados.rados.0.smithi016.stderr: 4: (()+0x99b4d) [0x7f9559a7ab4d]
2018-04-08T09:02:33.923 INFO:tasks.rados.rados.0.smithi016.stderr: 5: (()+0x73379) [0x7f9559a54379]
2018-04-08T09:02:33.923 INFO:tasks.rados.rados.0.smithi016.stderr: 6: (()+0x13eb88) [0x7f9559b1fb88]
2018-04-08T09:02:33.923 INFO:tasks.rados.rados.0.smithi016.stderr: 7: (()+0x7e25) [0x7f9558e49e25]
2018-04-08T09:02:33.924 INFO:tasks.rados.rados.0.smithi016.stderr: 8: (clone()+0x6d) [0x7f9557f4a34d]
2018-04-08T09:02:33.925 INFO:tasks.rados.rados.0.smithi016.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2018-04-08T09:02:33.925 INFO:tasks.rados.rados.0.smithi016.stderr:terminate called after throwing an instance of 'ceph::FailedAssertion'
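
The "finished tid 1 when last_acked_tid was 6" message suggests the test model expects write completions for an object to arrive with increasing transaction ids. The following Python sketch only illustrates that kind of invariant; it is not the RadosModel.h code, and the `AckTracker` class and its method names are hypothetical.

```python
class AckTracker:
    """Hypothetical helper: remembers the highest acknowledged tid."""

    def __init__(self):
        self.last_acked_tid = 0

    def finish(self, tid):
        # Assumption: a completion is valid only if its tid is strictly
        # greater than every tid acknowledged so far.
        if tid <= self.last_acked_tid:
            raise AssertionError(
                f"Error: finished tid {tid} when last_acked_tid was "
                f"{self.last_acked_tid}"
            )
        self.last_acked_tid = tid


tracker = AckTracker()
for tid in (1, 2, 6):
    tracker.finish(tid)  # in-order completions are accepted

try:
    tracker.finish(1)  # a stale or replayed completion trips the check
except AssertionError as err:
    message = str(err)
    print(message)
```

Under that assumption, a completion for tid 1 arriving after tid 6 was already acknowledged would be consistent with an op being resent or completed twice, which fits the client/OSD version mix noted later in this thread.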

Full log: http://qa-proxy.ceph.com/teuthology/smithfarm-2018-04-08_20:06:36-rados-wip-jewel-backports-distro-basic-smithi/2371658/teuthology.log
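
Since the runs fail with two different assertions, it can help to tally which signature each teuthology.log contains. This is a minimal Python sketch, assuming the log lines have the same shape as the excerpts above; the function name `assertion_signatures` and the regular expression are illustrative, not part of teuthology.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpts above, e.g.
#   ...stderr:./test/osd/RadosModel.h: 854: FAILED assert(0)
ASSERT_RE = re.compile(
    r"stderr:\s*(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<expr>.*)\)\s*$"
)


def assertion_signatures(lines):
    """Count (file, line, expression) triples of failed assertions."""
    counts = Counter()
    for text in lines:
        match = ASSERT_RE.search(text)
        if match:
            key = (match["file"], int(match["line"]), match["expr"])
            counts[key] += 1
    return counts


sample = [
    "2018-04-08T21:06:37.006 INFO:tasks.rados.rados.0.smithi130.stderr:"
    "./test/osd/RadosModel.h: 1597: FAILED assert(!version || "
    "comp->get_version64() == version)",
    "2018-04-08T09:02:33.922 INFO:tasks.rados.rados.0.smithi016.stderr:"
    "./test/osd/RadosModel.h: 854: FAILED assert(0)",
    "2018-04-08T09:02:33.922 INFO:tasks.rados.rados.0.smithi016.stderr:"
    " ceph version 0.94.10-85-ga8e54ce",
]
signatures = assertion_signatures(sample)
for key, count in sorted(signatures.items()):
    print(key, count)
```

To use it on a real run, pass an open log file instead of `sample`, for example `assertion_signatures(open("teuthology.log"))`.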

Reproducibility: HIGH (4 in 4 tries)


Related issues 4 (2 open, 2 closed)

Related to RADOS - Bug #22123: osd: objecter sends out of sync with pg epochs for proxied ops (Resolved, Sage Weil, 11/14/2017)
Related to RADOS - Bug #22063: "RadosModel.h: 1703: FAILED assert(!version || comp->get_version64() == version)" inrados-jewel-distro-basic-smithi (Duplicate, 11/07/2017)
Related to Ceph - Bug #23947: ceph_test_rados dumped core, Error: finished tid 1 when last_acked_tid was 6 (New)
Is duplicate of RADOS - Bug #23290: "/test/osd/RadosModel.h: 854: FAILED assert(0)" in upgrade:hammer-x-jewel-distro-basic-smithi (New, 03/09/2018)

Actions #1

Updated by Nathan Cutler about 6 years ago

Set priority to Urgent because this prevents us from getting a clean rados run in jewel 10.2.11 integration testing.

Actions #2

Updated by Nathan Cutler about 6 years ago

  • Description updated (diff)
Actions #3

Updated by Nathan Cutler about 6 years ago

  • Related to Bug #22123: osd: objecter sends out of sync with pg epochs for proxied ops added
Actions #4

Updated by Nathan Cutler about 6 years ago

This problem did not reproduce this reliably before the current integration run, so one of the following PRs might be implicated:

https://github.com/ceph/ceph/pull/21200 - jewel: osd/PrimaryLogPG: dump snap_trimq size
https://github.com/ceph/ceph/pull/21199 - jewel: osd: replica read can trigger cache promotion
https://github.com/ceph/ceph/pull/21197 - jewel: ceph_authtool: add mode option
https://github.com/ceph/ceph/pull/20381 - jewel: librados: Double free in rados_getxattrs_next
https://github.com/ceph/ceph/pull/18010 - jewel: core: enable rocksdb for filestore

Actions #5

Updated by Nathan Cutler about 6 years ago

  • Related to Bug #22063: "RadosModel.h: 1703: FAILED assert(!version || comp->get_version64() == version)" inrados-jewel-distro-basic-smithi added
Actions #6

Updated by Nathan Cutler about 6 years ago

  • Subject changed from FAILED assert(!version || comp->get_version64() == version) in jewel rados upgrade test to FAILED assert(!version || comp->get_version64() == version) in radosbench in jewel rados upgrade test
Actions #7

Updated by Nathan Cutler about 6 years ago

  • Description updated (diff)
Actions #8

Updated by Nathan Cutler about 6 years ago

rados bisect

Reproducer: --suite rados --filter="rados/upgrade/{hammer-x-singleton/{0-cluster/{openstack.yaml start.yaml} 1-hammer-install/hammer.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml} 6-next-mon/monb.yaml 7-workload/{radosbench.yaml rbd_api.yaml} 8-next-mon/monc.yaml 9-workload/{ec-rados-plugin=jerasure-k=3-m=1.yaml rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml test_cache-pool-snaps.yaml}} rados.yaml}" --num 5

Jewel baseline

wip-jewel-backports

Actions #9

Updated by Nathan Cutler about 6 years ago

Hm hm hm

Actions #10

Updated by Nathan Cutler about 6 years ago

  • Subject changed from FAILED assert(!version || comp->get_version64() == version) in radosbench in jewel rados upgrade test to ceph_test_rados crashes in radosbench task in jewel rados upgrade test
Actions #11

Updated by Nathan Cutler about 6 years ago

  • Subject changed from ceph_test_rados crashes in radosbench task in jewel rados upgrade test to ceph_test_rados crashes during radosbench task in jewel rados upgrade test
Actions #12

Updated by Nathan Cutler about 6 years ago

  • Description updated (diff)
Actions #13

Updated by Greg Farnum about 6 years ago

  • Project changed from Ceph to RADOS

This is a dupe of...something. We can track it down later.

For now, note that the crash is happening with Hammer clients during an upgrade to Jewel.

Actions #14

Updated by Sage Weil almost 6 years ago

  • Subject changed from ceph_test_rados crashes during radosbench task in jewel rados upgrade test to hammer->jewel: ceph_test_rados crashes during radosbench task in jewel rados upgrade test
Actions #15

Updated by Kefu Chai almost 6 years ago

  • Related to Bug #23947: ceph_test_rados dumped core, Error: finished tid 1 when last_acked_tid was 6 added
Actions #16

Updated by Kefu Chai almost 6 years ago

  • Related to Bug #23290: "/test/osd/RadosModel.h: 854: FAILED assert(0)" in upgrade:hammer-x-jewel-distro-basic-smithi added
Actions #17

Updated by Kefu Chai almost 6 years ago

  • Related to deleted (Bug #23290: "/test/osd/RadosModel.h: 854: FAILED assert(0)" in upgrade:hammer-x-jewel-distro-basic-smithi)
Actions #18

Updated by Kefu Chai almost 6 years ago

  • Is duplicate of Bug #23290: "/test/osd/RadosModel.h: 854: FAILED assert(0)" in upgrade:hammer-x-jewel-distro-basic-smithi added
Actions #19

Updated by Kefu Chai almost 6 years ago

  • Status changed from New to Duplicate

The runs in #23290 do not contain any of the PRs mentioned above, so this is not a regression.
