Project

General

Profile

Actions

Bug #18139

closed

LibRadosList.EnumerateObjects failure

Added by Sage Weil over 7 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2016-12-03T20:44:59.112 INFO:tasks.workunit.client.0.smithi074.stdout:                 api_list: [ RUN      ] LibRadosList.EnumerateObjects
2016-12-03T20:44:59.112 INFO:tasks.workunit.client.0.smithi074.stdout:                 api_list: /srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-11.0.2-2202-g0ba4bc1/src/test/librados/list.cc:718: Failure
2016-12-03T20:44:59.113 INFO:tasks.workunit.client.0.smithi074.stdout:                 api_list:       Expected: n_objects
2016-12-03T20:44:59.113 INFO:tasks.workunit.client.0.smithi074.stdout:                 api_list:       Which is: 16
2016-12-03T20:44:59.113 INFO:tasks.workunit.client.0.smithi074.stdout:                 api_list: To be equal to: saw_obj.size()
2016-12-03T20:44:59.113 INFO:tasks.workunit.client.0.smithi074.stdout:                 api_list:       Which is: 17
2016-12-03T20:44:59.113 INFO:tasks.workunit.client.0.smithi074.stdout:                 api_list: [  FAILED  ] LibRadosList.EnumerateObjects (20621 ms)
/a/sage-2016-12-03_19:34:03-rados-wip-sage-testing---basic-smithi/599201 description: rados/thrash/{0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml
clusters/{fixed-2.yaml openstack.yaml} fs/btrfs.yaml hobj-sort.yaml msgr-failures/few.yaml
msgr/async.yaml objectstore/filestore.yaml rados.yaml rocksdb.yaml thrashers/mapgap.yaml
workloads/rados_api_tests.yaml}
Actions #1

Updated by Samuel Just over 7 years ago

  • Assignee set to David Zafman
Actions #2

Updated by David Zafman over 7 years ago

This test verifies 3 things. First, as objects are returned it makes sure there are no duplicates AND inserts the name into an std::set. Second, after processing the elements it makes sure that nothing is missing. Finally, it verifies that there aren't any extra elements. This last check is what failed. Given that the saw_obj is a std::set with 17 elements and the test only created 16 objects, there is 1 extra unique string. We know that all 16 expected elements are present based on the first 2 checks.

Need to examine further how this could happen. The obvious pg split seems like it could get the same element more that one once, but that didn't happen. Where could an extraneous object name come from?

Actions #3

Updated by David Zafman over 7 years ago

Still need to investigate rollforward which I think caused an object "foo" to be created then deleted, but not before it was found by the filestore servicing the pgnls from the test. 4 other curiosities:

The writes to "foo" are 4 bytes (current test writes 128 bytes). Where did that come from?
The snapset is changed, so was this a clone?
The test that is running in a recent sub-test wrote an 18'1 version of "foo" just before this, but this is 18'9.
Why a 19 seconds delay between the 17 writes and the pgnls?


TEST WROTE 16 objects just before this point

2016-12-03 20:44:19.895607 7faf70127700 10 osd.5 pg_epoch: 18 pg[14.6( v 18'8 (0'0,18'8] local-les=18 n=2 ec=17 les/c/f 18/18/0 17/17/17) [5,2,1] r=0 lpr=17 crt=18'8 lcod 18'7 mlcod 18'7 active+clean] do_op 14:602f83fe:::foo:head [write 0~4] ov 0'0 av 18'9 snapc 0=[] snapset 0=[]:[]
2016-12-03 20:44:20.895064 7faf70127700 10 osd.5 pg_epoch: 19 pg[14.6( v 18'9 (0'0,18'9] local-les=18 n=3 ec=17 les/c/f 18/18/0 17/17/17) [5,2,1] r=0 lpr=17 crt=18'9 lcod 18'8 mlcod 18'8 active+clean] do_op 14:602f83fe:::foo:head [write 0~4] ov 18'9 av 19'10 snapc 0=[] snapset 0=[]:[]+head

2016-12-03 20:44:39.685193 7faf6d922700 20 filestore(/var/lib/ceph/osd/ceph-5) objects: [#14:60000000::::head#,#14:602f83fe:::foo:head#,#14:62a1935d:::14:head#,#14:6cac518f:::0:head#]

2016-12-03 20:44:39.707414 7faf6d922700 10 osd.5 pg_epoch: 37 pg[14.6( v 19'10 (0'0,19'10] local-les=18 n=3 ec=17 les/c/f 18/18/0 17/17/17) [5,2,1] r=0 lpr=17 crt=19'10 lcod 18'9 mlcod 18'9 active+clean] do_op 14:602f83fe:::foo:head [delete] ov 19'10 av 37'11 snapc 0=[] snapset 0=[]:[]+head
Actions #4

Updated by David Zafman over 7 years ago

ceph_test_rados_api_misc with pid 15487 wrote 4 bytes to "foo" to pool 14 which belongs to the other test!

Actions #5

Updated by David Zafman over 7 years ago

Caused by new PoolCreationRace test in ceph_test_rados_api_misc added by c0fb48331213e6934d19b7dbeeb83afeed5d7f7a

It creates a pool then uses that pool id + 1 to create an ioctx. If another concurrently running test has added a pool it could have this pool id.

We could either disable this test or find another way to do it.

Actions #6

Updated by David Zafman over 7 years ago

  • Status changed from 12 to 17
Actions #7

Updated by David Zafman over 7 years ago

  • Status changed from 17 to Resolved
Actions

Also available in: Atom PDF