Bug #12983
closedhandle_sub_read: Bad hash for 116/b1271c9d:test-rados-api-burnupi34-7183-66/foo/head
0%
Description
2015-09-06T23:36:57.664 INFO:tasks.ceph.osd.4.plana34.stderr:2015-09-06 23:36:57.655690 7f6f4a868700 -1 log_channel(cluster) log [ERR] : handle_sub_read: Bad hash for 116/b1271c9d:test-rados-api-burnupi34-7183-66/foo/head digest 0x56d5a3a6 expected 0xcf9b62cb 2015-09-06T23:36:57.673 INFO:tasks.ceph.osd.0.burnupi34.stderr:2015-09-06 23:36:57.674749 7f1539cb7700 -1 log_channel(cluster) log [ERR] : handle_sub_read: Bad hash for 116/b1271c9d:test-rados-api-burnupi34-7183-66/foo/head dige st 0x56d5a3a6 expected 0xcf9b62cb 2015-09-06T23:36:57.675 INFO:tasks.ceph.osd.1.burnupi34.stderr:2015-09-06 23:36:57.675847 7f9b6ca8f700 -1 osd.1 pg_epoch: 834 pg[123.5( empty local-les=828 n=0 ec=827 les/c 828/828 827/827/827) [1,2] r=0 lpr=827 crt=0'0 mlcod 0' 0 active+clean NIBBLEWISE] finish_promote unexpected promote error (5) Input/output error 2015-09-06T23:36:57.675 INFO:tasks.workunit.client.0.burnupi34.stdout:test/librados/tier.cc:3117: Failure 2015-09-06T23:36:57.675 INFO:tasks.workunit.client.0.burnupi34.stdout:Value of: ioctx.read("foo", bl, 1, 0) 2015-09-06T23:36:57.675 INFO:tasks.workunit.client.0.burnupi34.stdout: Actual: -5 2015-09-06T23:36:57.675 INFO:tasks.workunit.client.0.burnupi34.stdout:Expected: 1 2015-09-06T23:37:01.035 INFO:tasks.workunit.client.0.burnupi34.stdout:[ FAILED ] LibRadosTwoPoolsECPP.EvictSnap (17607 ms)
/a/sage-2015-09-06_20:14:37-rados-wip-sage-testing---basic-multi/1045602
Updated by Loïc Dachary over 8 years ago
This job run is mostly to convince myself that it runs ok on infernalis. It can be killed if it is known to fail.
filter='rados/thrash/{hobj-sort.yaml 0-size-min-size-overrides/2-size-1-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/fixed-2.yaml fs/ext4.yaml msgr/async.yaml msgr-failures/fastclose.yaml thrashers/default.yaml workloads/rados_api_tests.yaml}' ./virtualenv/bin/teuthology-suite --priority 101 --suite rados --filter="$filter" --suite-branch infernalis --distro ubuntu --email loic@dachary.org --ceph infernalis --machine-type plana,burnupi,mira
teuthology-openstack --throttle 5 --verbose --simultaneous-jobs 50 --key-name myself --key-filename ~/Downloads/myself --priority 101 --suite rados --filter="$filter" --suite-branch infernalis --distro ubuntu --email loic@dachary.org --ceph infernalis
Updated by Loïc Dachary over 8 years ago
Difference between the wip-sage-testing and infernalis
$ git --no-pager log --stat --oneline --graph ceph/infernalis..2d7e4803038ccac959659ebada301d78e77bb0b1 * 2d7e480 Merge remote-tracking branch 'gh/master' into wip-sage-testing * e01c90c Merge branch 'wip-12861' of https://github.com/liewegas/ceph into wip-sage-testing |\ | * 64962aa qa/workunits/rados/test_alloc_hint.sh: sudo to list files | qa/workunits/rados/test_alloc_hint.sh | 2 +- | 1 file changed, 1 insertion(+), 1 deletion(-) * b63e308 Merge branch 'tmap-update' of git://github.com/wonzhq/ceph into wip-sage-testing |\ | * f5bf2c5 osd: force promote for ops which ec base pool can't handle | src/osd/OSD.cc | 58 +++++++++++++++++++++++++++++++++++++++++----------------- | 1 file changed, 41 insertions(+), 17 deletions(-) * 430ce9d Merge branch 'wip-12912' of git://github.com/yuyuyu101/ceph into wip-sage-testing |\ | * 9cc1055 AsyncConnection: Close connection when unregistered connection met WAIT | src/msg/async/AsyncConnection.cc | 4 ++++ | 1 file changed, 4 insertions(+) * 80636ba Merge branch 'wip-scrub-promote' of git://github.com/XinzeChi/ceph into wip-sage-testing * f77949f bug fix: osd: requeue_scrub when kick_object_context_blocked src/osd/ReplicatedPG.cc | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-)
Updated by Sage Weil over 8 years ago
/a/sage-2015-09-07_18:07:23-rados-wip-sage-testing---basic-multi/1046245
Updated by David Zafman over 8 years ago
/a/sage-2015-09-07_18:07:23-rados-wip-sage-testing---basic-multi/1046245
2015-09-07 22:57:03.865461 7f163e0e4700 -1 log_channel(cluster) log [ERR] : handle_sub_read: Bad hash for 116/eb77ba0a:test-rados-api-mira033-22974-66/bam/head digest 0x56d5a3a6 expected 0xcf9b62cb
2015-09-07 22:57:03.863327 7fd18d474700 -1 log_channel(cluster) log [ERR] : handle_sub_read: Bad hash for 116/eb77ba0a:test-rados-api-mira033-22974-66/bam/head digest 0x56d5a3a6 expected 0xcf9b62cb
This logging shows EIO from a 2 out of 3 shards. errors={2(2)=-5,3(0)=-5} due to bad hashes.
It is interesting that the expected digest is the same for both shards and that the computed digest of both shards are the same.
Updated by Sage Weil over 8 years ago
/a/sage-2015-09-15_18:56:52-rados-wip-sage-testing---basic-multi/1058556
Updated by David Zafman over 8 years ago
Compared to a simple run of just the single test case there is an extra setattr. In addition we can see the extra sequencer op count 14530.1.8.
2015-09-07 22:57:00.390817 7fd194be7700 15 filestore(/var/lib/ceph/osd/ceph-3) setattrs 116.2s0_head/0:116/eb77ba0a:test-rados-api-mira033-22974-66/bam/head 2015-09-07 22:57:00.390882 7fd194be7700 10 filestore(/var/lib/ceph/osd/ceph-3) setattrs 116.2s0_head/0:116/eb77ba0a:test-rados-api-mira033-22974-66/bam/head = 0 2015-09-07 22:57:00.390922 7fd194be7700 20 filestore(/var/lib/ceph/osd/ceph-3) fgetattrs 164 getting 'hinfo_key' 2015-09-07 22:57:00.390937 7fd194be7700 15 filestore(/var/lib/ceph/osd/ceph-3) setattrs 116.2s0_head/0:116/eb77ba0a:test-rados-api-mira033-22974-66/bam/head 2015-09-07 22:57:00.391213 7fd194be7700 20 filestore dbobjectmap: seq is 1779 2015-09-07 22:57:00.391308 7fd194be7700 20 filestore set_map_header: setting 1778 oid 0:116/eb77ba0a:test-rados-api-mira033-22974-66/bam/head parent seq 0 2015-09-07 22:57:00.391341 7fd194be7700 10 filestore oid: 0:116/eb77ba0a:test-rados-api-mira033-22974-66/bam/head not skipping op, *spos 14530.1.6 2015-09-07 22:57:00.391344 7fd194be7700 10 filestore > header.spos 0.0.0 2015-09-07 22:57:00.391418 7fd194be7700 10 filestore(/var/lib/ceph/osd/ceph-3) setattrs 116.2s0_head/0:116/eb77ba0a:test-rados-api-mira033-22974-66/bam/head = 0 2015-09-07 22:57:00.391462 7fd194be7700 20 filestore(/var/lib/ceph/osd/ceph-3) fgetattrs 164 getting '_' 2015-09-07 22:57:00.391477 7fd194be7700 20 filestore(/var/lib/ceph/osd/ceph-3) fgetattrs 164 getting 'hinfo_key' 2015-09-07 22:57:00.391490 7fd194be7700 15 filestore(/var/lib/ceph/osd/ceph-3) setattrs 116.2s0_head/0:116/eb77ba0a:test-rados-api-mira033-22974-66/bam/head 2015-09-07 22:57:00.391572 7fd194be7700 10 filestore oid: 0:116/eb77ba0a:test-rados-api-mira033-22974-66/bam/head not skipping op, *spos 14530.1.7 2015-09-07 22:57:00.391576 7fd194be7700 10 filestore > header.spos 0.0.0 2015-09-07 22:57:00.391609 7fd194be7700 10 filestore(/var/lib/ceph/osd/ceph-3) setattrs 116.2s0_head/0:116/eb77ba0a:test-rados-api-mira033-22974-66/bam/head = 0
Extra setattr:
2015-09-07 22:57:00.391648 7fd194be7700 20 filestore(/var/lib/ceph/osd/ceph-3) fgetattrs 164 getting '_' 2015-09-07 22:57:00.391664 7fd194be7700 20 filestore(/var/lib/ceph/osd/ceph-3) fgetattrs 164 getting 'hinfo_key' 2015-09-07 22:57:00.391676 7fd194be7700 15 filestore(/var/lib/ceph/osd/ceph-3) setattrs 116.2s0_head/0:116/eb77ba0a:test-rados-api-mira033-22974-66/bam/head 2015-09-07 22:57:00.391687 7fd194be7700 10 filestore oid: 0:116/eb77ba0a:test-rados-api-mira033-22974-66/bam/head not skipping op, *spos 14530.1.8 2015-09-07 22:57:00.391689 7fd194be7700 10 filestore > header.spos 0.0.0 2015-09-07 22:57:00.391723 7fd194be7700 10 filestore(/var/lib/ceph/osd/ceph-3) setattrs 116.2s0_head/0:116/eb77ba0a:test-rados-api-mira033-22974-66/bam/head = 0
2015-09-07 22:57:00.391730 7fd194be7700 10 journal op_apply_finish 14530 open_ops 1 -> 0, max_applied_seq 14529 -> 14530 2015-09-07 22:57:00.391732 7fd194be7700 10 filestore(/var/lib/ceph/osd/ceph-3) _do_op 0x55d2a64d20f0 seq 14530 r = 0, finisher 0x55d2a5fd23a0 0 2015-09-07 22:57:00.391735 7fd194be7700 10 filestore(/var/lib/ceph/osd/ceph-3) _finish_op 0x55d2a64d20f0 seq 14530 osr(116.2s0 0x55d2a4ff3120)/0x55d2a4ff3120 lat 0.386367
Updated by Sage Weil over 8 years ago
The 'Bad hash' message triggered twice on my wip-sage-testing run that included the wip-12983 branch:
/a/sage-2015-09-24_20:43:22-rados-wip-sage-testing---basic-multi/1067946
/a/sage-2015-09-24_20:43:22-rados-wip-sage-testing---basic-multi/1068002
Updated by Samuel Just over 8 years ago
Harumph, /a/sage-2015-09-24_20:43:22-rados-wip-sage-testing---basic-multi/1068002 has no recovery. This should be fairly straightforward, looking...
Updated by Samuel Just over 8 years ago
2015-09-24 22:48:12.489539 7ff05c984700 1 -- 10.214.132.108:6808/25735 <== client.4345 10.214.132.108:0/2866085996 11 ==== osd_op(client.4345.0:26 test-rados-api-mira034-28047-42/foo [read 0~128,read 0~128] 76.81ece4b2 ack+read+known_if_redirected e328) v6 ==== 258+0+0 (87309738 0 0) 0x7ff08a0bedc0 con 0x7ff08a3e3080
I think sage's failures are a bug in the first patch in david's branch, I'll make a new version of that commit.
Updated by Sage Weil over 8 years ago
/var/lib/teuthworker/archive/sage-2015-09-28_08:14:59-rados-wip-sage-testing---basic-multi/1073896
another build that includes whatever is referenced by the PR