Bug #15425

client PGLS loop and LibRadosWatchNotifyPPTests failure

Added by Samuel Just about 8 years ago. Updated about 8 years ago.

Status: Rejected
Priority: Immediate
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2016-04-06T22:02:17.420 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: test/librados/watch_notify.cc:611: Failure
2016-04-06T22:02:17.421 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: Expected: (ioctx.watch_check(handle)) > (0), actual: 0 vs 0
2016-04-06T22:02:17.421 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: [ FAILED ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioNotify/0, where GetParam() = "" (9 ms)
2016-04-06T22:02:17.421 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: [ RUN ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioNotify/1
2016-04-06T22:02:17.421 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: handle_notify cookie 140375336919424 notify_id 253403070472 notifier_gid 4206
2016-04-06T22:02:17.421 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: [ OK ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioNotify/1 (16 ms)
2016-04-06T22:02:17.421 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: [ RUN ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify2Timeout/0
2016-04-06T22:02:17.422 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: trying...
2016-04-06T22:02:17.422 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: handle_notify cookie 140375336917024 notify_id 253403070465 notifier_gid 4206
2016-04-06T22:02:17.422 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: timed out
2016-04-06T22:02:17.422 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: flushing
2016-04-06T22:02:17.422 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: flushed
2016-04-06T22:02:17.423 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: [ OK ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify2Timeout/0 (3012 ms)
2016-04-06T22:02:17.423 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: [ RUN ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify2Timeout/1
2016-04-06T22:02:17.423 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: test/librados/watch_notify.cc:751: Failure
2016-04-06T22:02:17.423 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: Expected: (ioctx.watch_check(handle)) > (0), actual: 0 vs 0
2016-04-06T22:02:17.423 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: [ FAILED ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify2Timeout/1, where GetParam() = "cache" (12 ms)
2016-04-06T22:02:17.424 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: [----------] 12 tests from LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP (8753 ms total)
2016-04-06T22:02:17.425 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify:
2016-04-06T22:02:17.425 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: [----------] Global test environment tear-down
2016-04-06T22:02:17.425 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: [==========] 24 tests from 4 test cases ran. (70825 ms total)
2016-04-06T22:02:17.425 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: [ PASSED ] 22 tests.
2016-04-06T22:02:17.425 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: [ FAILED ] 2 tests, listed below:
2016-04-06T22:02:17.426 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: [ FAILED ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.AioNotify/0, where GetParam() = ""
2016-04-06T22:02:17.426 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: [ FAILED ] LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify2Timeout/1, where GetParam() = "cache"
2016-04-06T22:02:17.427 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify:
2016-04-06T22:02:17.427 INFO:tasks.workunit.client.0.smithi001.stdout: api_watch_notify: 2 FAILED TESTS

This was a wip-sam-testing run, but the patches should not have affected this code.

sjust@teuthology:/a/samuelj-2016-04-06_16:46:34-rados-wip-sam-testing-distro-basic-smithi/112699/remote$
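
For context, a minimal sketch (not the actual code in test/librados/watch_notify.cc) of the librados C++ pattern the failing assertion exercises: register a watch with watch2(), then call watch_check(), which is expected to return a positive value for a live watch; the failures above show it returning 0.

#include <rados/librados.hpp>
#include <string>

// Illustrative only: names and structure are simplified relative to the real test.
struct WatchCtx : public librados::WatchCtx2 {
  void handle_notify(uint64_t notify_id, uint64_t cookie,
                     uint64_t notifier_id, librados::bufferlist& bl) override {
    // The real test acks the notify via notify_ack(); omitted here for brevity.
  }
  void handle_error(uint64_t cookie, int err) override {}
};

int check_watch(librados::IoCtx& ioctx, const std::string& oid) {
  WatchCtx ctx;
  uint64_t handle;
  int r = ioctx.watch2(oid, &handle, &ctx);   // register the watch
  if (r < 0)
    return r;
  r = ioctx.watch_check(handle);              // expected > 0 for a live watch
  ioctx.unwatch2(handle);
  return r;
}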

Actions #1

Updated by Samuel Just about 8 years ago

wip-sam-testing-07-04-2016

Actions #2

Updated by Samuel Just about 8 years ago

Probably relevant: ceph-osd.1.log.gz is 7.2 GB in size and is full of:

2016-04-07 05:01:29.620509 7fb52c9bf700 1 -- 172.21.15.1:6808/16253 --> 172.21.15.1:0/786444710 -- osd_op_reply(76833 [pgls start_epoch 10] v10'1 uv1 ondisk = 0) v7 -- ?+0 0x7fb555d98840 con 0x7fb555c15800
2016-04-07 05:01:29.620515 7fb52c9bf700 10 osd.1 25 dequeue_op 0x7fb556301a00 finish
2016-04-07 05:01:29.620602 7fb522e8d700 1 -- 172.21.15.1:6808/16253 <== client.4162 172.21.15.1:0/786444710 26 ==== osd_op(client.4162.0:76834 24.0 (undecoded) ack+read+known_if_redirected e25) v7 ==== 186+0+39 (300062115 0 3195274061) 0x7fb554952680 con 0x7fb555c15800
2016-04-07 05:01:29.620612 7fb522e8d700 20 osd.1 25 should_share_map client.4162 172.21.15.1:0/786444710 25
2016-04-07 05:01:29.620616 7fb522e8d700 15 osd.1 25 enqueue_op 0x7fb556301800 prio 63 cost 39 latency 0.000029 osd_op(client.4162.0:76834 24.0 (undecoded) ack+read+known_if_redirected e25) v7
2016-04-07 05:01:29.620629 7fb52f1c4700 10 osd.1 25 dequeue_op 0x7fb556301800 prio 63 cost 39 latency 0.000042 osd_op(client.4162.0:76834 24.0 (undecoded) ack+read+known_if_redirected e25) v7 pg pg[24.0( v 10'1 (0'0,10'1] local-les=10 n=1 ec=9 les/c/f 10/10/0 9/9/9) [1,2] r=0 lpr=9 crt=0'0 lcod 0'0 mlcod 0'0 active+clean]
2016-04-07 05:01:29.620641 7fb52f1c4700 10 osd.1 pg_epoch: 25 pg[24.0( v 10'1 (0'0,10'1] local-les=10 n=1 ec=9 les/c/f 10/10/0 9/9/9) [1,2] r=0 lpr=9 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] handle_message: 0x7fb556301800
2016-04-07 05:01:29.620647 7fb52f1c4700 10 osd.1 pg_epoch: 25 pg[24.0( v 10'1 (0'0,10'1] local-les=10 n=1 ec=9 les/c/f 10/10/0 9/9/9) [1,2] r=0 lpr=9 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] do_pg_op osd_op(client.4162.0:76834 24.0 [pgls start_epoch 10] snapc 0=[] ack+read+known_if_redirected e25) v7

Actions #3

Updated by Samuel Just about 8 years ago

sjust@teuthology:/a/samuelj-2016-04-06_16:46:34-rados-wip-sam-testing-distro-basic-smithi/112440/remote$

This one did not have the API test failures, but it did have the pgls loop. I'm guessing that the pgls loop is what caused the above failure and is the actual bug:

2016-04-07 01:24:38.829880 7fef848aa700 10 osd.1 pg_epoch: 146 pg[28.1( v 10'1 (0'0,10'1] local-les=10 n=1 ec=9 les/c/f 10/10/0 9/9/9) [1,0] r=0 lpr=9 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] pgls pg=28.1 count 1024
2016-04-07 01:24:38.829894 7fef848aa700 20 filestore(/var/lib/ceph/osd/ceph-1) collection_list pool is 28 shard is 255 pgid 28.1
2016-04-07 01:24:38.829896 7fef848aa700 10 filestore(/var/lib/ceph/osd/ceph-1) collection_list first checking temp pool
2016-04-07 01:24:38.829898 7fef848aa700 20 filestore(/var/lib/ceph/osd/ceph-1) collection_list pool is 30 shard is 255 pgid 28.1
2016-04-07 01:24:38.829901 7fef848aa700 20 _collection_list_partial start:GHMIN end:GHMAX-1024 ls.size 0
2016-04-07 01:24:38.829939 7fef848aa700 20 filestore(/var/lib/ceph/osd/ceph-1) objects: 0x7fef848a74c0
2016-04-07 01:24:38.829945 7fef848aa700 10 filestore(/var/lib/ceph/osd/ceph-1) collection_list fall through to non-temp collection, start #-1:00000000::::0#
2016-04-07 01:24:38.829950 7fef848aa700 20 _collection_list_partial start:#-1:00000000::::0# end:GHMAX-1024 ls.size 0
2016-04-07 01:24:38.829991 7fef848aa700 10 osd.1 pg_epoch: 146 pg[28.1( v 10'1 (0'0,10'1] local-les=10 n=1 ec=9 les/c/f 10/10/0 9/9/9) [1,0] r=0 lpr=9 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] pgls result=0 outdata.length()=44
2016-04-07 01:24:38.830008 7fef848aa700 1 -- 172.21.15.60:6804/25150 --> 172.21.15.60:0/4155044961 -- osd_op_reply(289335 [pgls start_epoch 10] v10'1 uv1 ondisk = 0) v7 -- ?+0 0x7fefb28af700 con 0x7fefb1f15180
2016-04-07 01:24:38.830024 7fef848aa700 10 osd.1 146 dequeue_op 0x7fefb2066d00 finish
2016-04-07 01:24:38.830324 7fef76b52700 1 -- 172.21.15.60:6804/25150 <== client.4174 172.21.15.60:0/4155044961 151 ==== osd_op(client.4174.0:289336 28.1 (undecoded) ack+read+known_if_redirected e146) v7 ==== 186+0+39 (3843260203 0 3195274061) 0x7fefb28ad8c0 con 0x7fefb1f15180
2016-04-07 01:24:38.830357 7fef76b52700 20 osd.1 146 should_share_map client.4174 172.21.15.60:0/4155044961 146
2016-04-07 01:24:38.830363 7fef76b52700 15 osd.1 146 enqueue_op 0x7fefb1c1ff00 prio 63 cost 39 latency 0.000069 osd_op(client.4174.0:289336 28.1 (undecoded) ack+read+known_if_redirected e146) v7
2016-04-07 01:24:38.830409 7fef820a5700 10 osd.1 146 dequeue_op 0x7fefb1c1ff00 prio 63 cost 39 latency 0.000115 osd_op(client.4174.0:289336 28.1 (undecoded) ack+read+known_if_redirected e146) v7 pg pg[28.1( v 10'1 (0'0,10'1] local-les=10 n=1 ec=9 les/c/f 10/10/0 9/9/9) [1,0] r=0 lpr=9 crt=0'0 lcod 0'0 mlcod 0'0 active+clean]
2016-04-07 01:24:38.830439 7fef820a5700 10 osd.1 pg_epoch: 146 pg[28.1( v 10'1 (0'0,10'1] local-les=10 n=1 ec=9 les/c/f 10/10/0 9/9/9) [1,0] r=0 lpr=9 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] handle_message: 0x7fefb1c1ff00
2016-04-07 01:24:38.830456 7fef820a5700 10 osd.1 pg_epoch: 146 pg[28.1( v 10'1 (0'0,10'1] local-les=10 n=1 ec=9 les/c/f 10/10/0 9/9/9) [1,0] r=0 lpr=9 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] do_pg_op osd_op(client.4174.0:289336 28.1 [pgls start_epoch 10] snapc 0=[] ack+read+known_if_redirected e146) v7
2016-04-07 01:24:38.830478 7fef820a5700 10 osd.1 pg_epoch: 146 pg[28.1( v 10'1 (0'0,10'1] local-les=10 n=1 ec=9 les/c/f 10/10/0 9/9/9) [1,0] r=0 lpr=9 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] pgls pg=28.1 count 1024
2016-04-07 01:24:38.830491 7fef820a5700 20 filestore(/var/lib/ceph/osd/ceph-1) collection_list pool is 28 shard is 255 pgid 28.1
2016-04-07 01:24:38.830495 7fef820a5700 10 filestore(/var/lib/ceph/osd/ceph-1) collection_list first checking temp pool
2016-04-07 01:24:38.830496 7fef820a5700 20 filestore(/var/lib/ceph/osd/ceph-1) collection_list pool is -30 shard is 255 pgid 28.1
2016-04-07 01:24:38.830499 7fef820a5700 20 _collection_list_partial start:GHMIN end:GHMAX-1024 ls.size 0
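
For reference, the client side driving these ops is just the normal object-listing path. A rough sketch, assuming the standard librados C++ listing API rather than the actual test code, of how a listing loop ends up spinning if the OSD keeps replying with result=0 and an unchanged cursor:

#include <rados/librados.hpp>
#include <iostream>

// Illustrative only: each iterator increment can translate into a pgls op to
// the OSD.  If the reply never advances the listing cursor, the iterator never
// reaches nobjects_end() and the client re-sends pgls indefinitely, producing
// the flood of "[pgls start_epoch 10]" ops seen in the log above.
int list_pool(librados::IoCtx& ioctx) {
  for (librados::NObjectIterator it = ioctx.nobjects_begin();
       it != ioctx.nobjects_end(); ++it) {
    std::cout << it->get_oid() << std::endl;
  }
  return 0;
}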

Actions #4

Updated by Samuel Just about 8 years ago

  • Subject changed from LibRadosWatchNotifyPPTests failure to client PGLS loop and LibRadosWatchNotifyPPTests failure
Actions #5

Updated by Samuel Just about 8 years ago

f16187f68d7f494fd5ed243e163a8ae3cb111cd3 ?

Actions #6

Updated by Samuel Just about 8 years ago

  • Priority changed from Urgent to Immediate
Actions #7

Updated by Sage Weil about 8 years ago

  • Status changed from New to In Progress
  • Assignee set to Sage Weil
  • Source changed from other to Q/A
Actions #8

Updated by Sage Weil about 8 years ago

  • Status changed from In Progress to 12

Actually, I don't think this is a problem with the pgls changes. It looks like a problem in HashIndex or LFNIndex:

A normal collection_list looks like this:

2016-04-07 01:22:04.885616 7fef828a6700 10 osd.1 pg_epoch: 10 pg[28.0( empty local-les=10 n=0 ec=9 les/c/f 10/10/0 9/9/9) [1,5] r=0 lpr=9 crt=0'0 mlcod 0'0 active+clean]  pgls pg=28.0 count 1024
2016-04-07 01:22:04.885626 7fef828a6700 20 filestore(/var/lib/ceph/osd/ceph-1) collection_list pool is 28 shard is 255 pgid 28.0
2016-04-07 01:22:04.885627 7fef828a6700 10 filestore(/var/lib/ceph/osd/ceph-1) collection_list first checking temp pool
2016-04-07 01:22:04.885629 7fef828a6700 20 filestore(/var/lib/ceph/osd/ceph-1) collection_list pool is -30 shard is 255 pgid 28.0
2016-04-07 01:22:04.885631 7fef828a6700 20 _collection_list_partial start:GHMIN end:GHMAX-1024 ls.size 0
2016-04-07 01:22:04.885656 7fef828a6700 20 filestore(/var/lib/ceph/osd/ceph-1) objects: 0x7fef828a34c0
2016-04-07 01:22:04.885660 7fef828a6700 10 filestore(/var/lib/ceph/osd/ceph-1) collection_list fall through to non-temp collection, start #-1:00000000::::0#
2016-04-07 01:22:04.885662 7fef828a6700 20 _collection_list_partial start:#-1:00000000::::0# end:GHMAX-1024 ls.size 0
2016-04-07 01:22:04.885683 7fef828a6700 20 list_by_hash_bitwise prefix 00000000
2016-04-07 01:22:04.885687 7fef828a6700 20 list_by_hash_bitwise prefix 00000000 ob #28:00000000::::head#
2016-04-07 01:22:04.885689 7fef828a6700 20 filestore(/var/lib/ceph/osd/ceph-1) objects: 0x7fef828a34c0
2016-04-07 01:22:04.885693 7fef828a6700 10 osd.1 pg_epoch: 10 pg[28.0( empty local-les=10 n=0 ec=9 les/c/f 10/10/0 9/9/9) [1,5] r=0 lpr=9 crt=0'0 mlcod 0'0 active+clean]  pgls result=1 outdata.length()=44

But the one that gets stuck on 28.1 does not:

2016-04-07 01:22:04.886035 7fef848aa700 10 osd.1 pg_epoch: 10 pg[28.1( v 10'1 (0'0,10'1] local-les=10 n=1 ec=9 les/c/f 10/10/0 9/9/9) [1,0] r=0 lpr=9 crt=0'0 lcod 0'0 mlcod 0'0 active+clean]  pgls pg=28.1 count 1024
2016-04-07 01:22:04.886040 7fef848aa700 20 filestore(/var/lib/ceph/osd/ceph-1) collection_list pool is 28 shard is 255 pgid 28.1
2016-04-07 01:22:04.886042 7fef848aa700 10 filestore(/var/lib/ceph/osd/ceph-1) collection_list first checking temp pool
2016-04-07 01:22:04.886043 7fef848aa700 20 filestore(/var/lib/ceph/osd/ceph-1) collection_list pool is -30 shard is 255 pgid 28.1
2016-04-07 01:22:04.886045 7fef848aa700 20 _collection_list_partial start:GHMIN end:GHMAX-1024 ls.size 0
2016-04-07 01:22:04.886065 7fef848aa700 20 filestore(/var/lib/ceph/osd/ceph-1) objects: 0x7fef848a74c0
2016-04-07 01:22:04.886067 7fef848aa700 10 filestore(/var/lib/ceph/osd/ceph-1) collection_list fall through to non-temp collection, start #-1:00000000::::0#
2016-04-07 01:22:04.886069 7fef848aa700 20 _collection_list_partial start:#-1:00000000::::0# end:GHMAX-1024 ls.size 0
...there should be a call to list_by_hash_bitwise here...
2016-04-07 01:22:04.886093 7fef848aa700 10 osd.1 pg_epoch: 10 pg[28.1( v 10'1 (0'0,10'1] local-les=10 n=1 ec=9 les/c/f 10/10/0 9/9/9) [1,0] r=0 lpr=9 crt=0'0 lcod 0'0 mlcod 0'0 active+clean]  pgls result=0 outdata.length()=44

collection_list definitely errored out, because there is no "objects: ..." line. That could happen in several places, including:

https://github.com/ceph/ceph/blob/master/src/os/filestore/HashIndex.cc#L838
https://github.com/ceph/ceph/blob/master/src/os/filestore/LFNIndex.cc#L400
https://github.com/ceph/ceph/blob/master/src/os/filestore/LFNIndex.cc#L425 (lfn_translate fails?)
https://github.com/ceph/ceph/blob/master/src/os/filestore/HashIndex.cc#L852

The lfn_translate call gets my vote?

Actions #9

Updated by Sage Weil about 8 years ago

https://github.com/liewegas/ceph/commit/wip-15425

might be helpful for finding where the error is.

Also, I notice that the int-returning lfn_translate() returns the result of a bool lfn_parse_object_name_* in several places, though the error checks are for -1, so it's probably okay in master.
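
A generic illustration (not the Ceph code; names are hypothetical) of the hazard noted above: an int-returning translate function forwarding a bool parse result. A parse failure returns false, which converts to 0, so a caller that only checks for a negative return treats the failure as success and simply sees no objects, which matches the silent collection_list failure in the pgls log.

#include <string>

// Hypothetical names for illustration; not the actual FileStore code.
static bool parse_object_name(const std::string& name, std::string* out) {
  if (name.empty())
    return false;                       // parse failure
  *out = name;
  return true;
}

static int translate(const std::string& name, std::string* out) {
  return parse_object_name(name, out);  // bug: false silently becomes 0
}

static int list_objects(const std::string& name, std::string* out) {
  int r = translate(name, out);
  if (r < 0)
    return r;                           // a false/0 "error" never takes this branch
  return 0;                             // caller sees "success" with no entries
}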

Actions #10

Updated by Samuel Just about 8 years ago

  • Status changed from 12 to Rejected

The bug was in the integration branch.
