Bug #23578
large-omap-object-warnings test fails
Description
2018-04-06T16:26:01.808 INFO:tasks.workunit.client.0.smithi141.stdout:
2018-04-06T16:26:01.809 INFO:tasks.workunit.client.0.smithi141.stdout:Cluster ID: 90d70377-127d-4a15-a009-2e4cc84f615d
2018-04-06T16:26:01.809 INFO:tasks.workunit.client.0.smithi141.stdout:Error, got invalid output:
2018-04-06T16:26:01.809 INFO:tasks.workunit.client.0.smithi141.stdout:HEALTH_WARN application not enabled on 1 pool(s)
2018-04-06T16:26:01.809 INFO:tasks.workunit.client.0.smithi141.stdout:POOL_APP_NOT_ENABLED application not enabled on 1 pool(s)
2018-04-06T16:26:01.810 INFO:tasks.workunit.client.0.smithi141.stdout: application not enabled on pool 'large-omap-test-pool'
2018-04-06T16:26:01.810 INFO:tasks.workunit.client.0.smithi141.stdout: use 'ceph osd pool application enable <pool-name> <app-name>', where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications.
2018-04-06T16:26:01.810 INFO:tasks.workunit.client.0.smithi141.stdout:
2018-04-06T16:26:01.810 INFO:tasks.workunit.client.0.smithi141.stderr:Traceback (most recent call last):
2018-04-06T16:26:01.810 INFO:tasks.workunit.client.0.smithi141.stderr: File "/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_large_omap_detection.py", line 130, in <module>
2018-04-06T16:26:01.810 INFO:tasks.workunit.client.0.smithi141.stderr: main()
2018-04-06T16:26:01.810 INFO:tasks.workunit.client.0.smithi141.stderr: File "/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_large_omap_detection.py", line 125, in main
2018-04-06T16:26:01.810 INFO:tasks.workunit.client.0.smithi141.stderr: check_health_output()
2018-04-06T16:26:01.811 INFO:tasks.workunit.client.0.smithi141.stderr: File "/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_large_omap_detection.py", line 120, in check_health_output
2018-04-06T16:26:01.811 INFO:tasks.workunit.client.0.smithi141.stderr: raise Exception
/a/sage-2018-04-06_13:59:16-rados-wip-sage2-testing-2018-04-06-0721-distro-basic-smithi/2362967
I've seen this a few other times in the last couple of weeks.
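The traceback shows check_health_output() raising when the expected large-omap warning never appears in the health output before the timeout. A minimal sketch of that polling pattern (not the actual qa script; the function and helper names here are illustrative, and the real test parses the output more precisely):

```python
# Hedged sketch: poll `ceph health detail` until the large-omap warning
# appears, or raise once a deadline passes -- the failure mode seen in
# the traceback above. `run` is injectable so the loop can be exercised
# without a live cluster.
import subprocess
import time

def wait_for_large_omap_warning(timeout=1200, interval=10,
                                run=lambda cmd: subprocess.check_output(cmd).decode()):
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = run(['ceph', 'health', 'detail'])
        if 'LARGE_OMAP_OBJECTS' in out:
            return out          # warning surfaced; test can proceed
        time.sleep(interval)    # scrub hasn't reported yet; keep waiting
    # Timed out: report whatever health output we last saw, as the test does.
    raise Exception('Error, got invalid output:\n' + out)
```

In this run the only health output ever seen was the unrelated POOL_APP_NOT_ENABLED warning, so a loop like this expired and raised.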
History
#1 Updated by Brad Hubbard almost 6 years ago
- Category set to Tests
- Assignee set to Brad Hubbard
- Source set to Development
We instruct the OSDs to deep-scrub at around 16:04.
2018-04-06T16:04:15.971 INFO:tasks.workunit.client.0.smithi141.stderr:instructed osd(s) 0 to deep-scrub
2018-04-06T16:04:16.222 INFO:tasks.workunit.client.0.smithi141.stderr:instructed osd(s) 1 to deep-scrub
Twenty-two minutes later, at 16:26:01, we time out.
2018-04-06T16:26:01.809 INFO:tasks.workunit.client.0.smithi141.stdout:Error, got invalid output:
But we only start scrubbing the relevant pg on osd.0 at 16:26:10.
2018-04-06 16:26:10.549 7efd8b459700 20 osd.0 pg_epoch: 13 pg[2.2( v 12'2 (0'0,12'2] local-lis/les=11/12 n=1 ec=11/11 lis/c 11/11 les/c/f 12/12/0 11/11/11) [0,1] r=0 lpr=11 crt=12'2 lcod 12'1 mlcod 12'1 active+clean+scrubbing+deep] deep-scrub 2:539eec76:::large-omap-test-object1:head 2:539eec76:::large-omap-test-object1:head(12'2 client.4146.0:3 dirty|omap|data_digest s 11 uv 2 dd 9b6a63e4 alloc_hint [0 0 0])
2018-04-06 16:26:10.549 7efd8b459700 20 osd.0 pg_epoch: 13 pg[2.2( v 12'2 (0'0,12'2] local-lis/les=11/12 n=1 ec=11/11 lis/c 11/11 les/c/f 12/12/0 11/11/11) [0,1] r=0 lpr=11 crt=12'2 lcod 12'1 mlcod 12'1 active+clean+scrubbing+deep] scrub_snapshot_metadata deep-scrub new head 2:539eec76:::large-omap-test-object1:head
2018-04-06 16:26:10.549 7efd8b459700 10 osd.0 pg_epoch: 13 pg[2.2( v 13'3 (0'0,13'3] local-lis/les=11/12 n=1 ec=11/11 lis/c 11/11 les/c/f 12/12/0 11/11/11) [0,1] r=0 lpr=11 luod=12'2 crt=13'3 lcod 12'1 mlcod 12'1 active+clean+scrubbing+deep] scrub_snapshot_metadata (deep-scrub) finish
2018-04-06 16:26:10.553 7efd87451700 10 osd.0 pg_epoch: 13 pg[2.2( v 13'3 (0'0,13'3] local-lis/les=11/12 n=1 ec=11/11 lis/c 11/11 les/c/f 12/12/0 11/11/11) [0,1] r=0 lpr=11 crt=13'3 lcod 12'2 mlcod 12'2 active+clean+scrubbing+deep] deep-scrub got 1/1 objects, 0/0 clones, 1/1 dirty, 1/1 omap, 0/0 pinned, 0/0 hit_set_archive, 11/11 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2018-04-06 16:26:10.553 7efd87451700 0 log_channel(cluster) log [DBG] : 2.2 deep-scrub ok
We start even later on osd.1.
2018-04-06 16:26:21.477 7fc28f5aa700 0 log_channel(cluster) log [DBG] : 2.4 deep-scrub starts
2018-04-06 16:26:21.549 7fc28f5aa700 20 osd.1 pg_epoch: 13 pg[2.4( v 12'2 (0'0,12'2] local-lis/les=11/12 n=1 ec=11/11 lis/c 11/11 les/c/f 12/12/0 11/11/11) [1,0] r=0 lpr=11 crt=12'2 lcod 12'1 mlcod 12'1 active+clean+scrubbing+deep] deep-scrub 2:29c01cbb:::large-omap-test-object2:head 2:29c01cbb:::large-omap-test-object2:head(12'2 client.4146.0:5 dirty|omap|data_digest s 17 uv 2 dd f0fa5029 alloc_hint [0 0 0])
2018-04-06 16:26:21.549 7fc28f5aa700 20 osd.1 pg_epoch: 13 pg[2.4( v 12'2 (0'0,12'2] local-lis/les=11/12 n=1 ec=11/11 lis/c 11/11 les/c/f 12/12/0 11/11/11) [1,0] r=0 lpr=11 crt=12'2 lcod 12'1 mlcod 12'1 active+clean+scrubbing+deep] scrub_snapshot_metadata deep-scrub new head 2:29c01cbb:::large-omap-test-object2:head
2018-04-06 16:26:21.549 7fc28f5aa700 10 osd.1 pg_epoch: 13 pg[2.4( v 13'3 (0'0,13'3] local-lis/les=11/12 n=1 ec=11/11 lis/c 11/11 les/c/f 12/12/0 11/11/11) [1,0] r=0 lpr=11 luod=12'2 crt=13'3 lcod 12'1 mlcod 12'1 active+clean+scrubbing+deep] scrub_snapshot_metadata (deep-scrub) finish
2018-04-06 16:26:21.549 7fc2935b2700 10 osd.1 pg_epoch: 13 pg[2.4( v 13'3 (0'0,13'3] local-lis/les=11/12 n=1 ec=11/11 lis/c 11/11 les/c/f 12/12/0 11/11/11) [1,0] r=0 lpr=11 crt=13'3 lcod 12'2 mlcod 12'2 active+clean+scrubbing+deep] deep-scrub got 1/1 objects, 0/0 clones, 1/1 dirty, 1/1 omap, 0/0 pinned, 0/0 hit_set_archive, 17/17 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2018-04-06 16:26:21.549 7fc2935b2700 0 log_channel(cluster) log [DBG] : 2.4 deep-scrub ok
On osd.0 between "2018-04-06 16:04:20.179 7efd8b459700 0 log_channel(cluster) log [DBG] : 1.2 deep-scrub starts" and "2018-04-06 16:26:06.493 7efd88453700 0 log_channel(cluster) log [DBG] : 2.0 deep-scrub starts" we lost 846 consecutive coin flips in scrub_random_backoff. On osd.1 between "2018-04-06 16:04:45.207 7fc291daf700 0 log_channel(cluster) log [DBG] : 1.7 deep-scrub starts" and "2018-04-06 16:26:06.473 7fc28fdab700 0 log_channel(cluster) log [DBG] : 1.3 deep-scrub starts" we lost 827 coin flips in a row.
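The coin flip referred to is the random scrub backoff: on each scheduling tick the OSD skips scrubbing with probability osd_scrub_backoff_ratio. The real logic is C++ inside the OSD; the following is only a simplified Python model of that mechanism, assuming the default ratio of 0.66, to show why hundreds of consecutive losses would be extraordinarily improbable:

```python
# Hedged sketch of the random scrub backoff (illustrative model, not
# Ceph's actual C++ implementation).
import random

def scrub_random_backoff(backoff_ratio, rng=random.random):
    """Return True if this scrub tick should be skipped (a 'lost' flip).

    With backoff_ratio=0 the flip can never be lost, so scrubs are
    attempted on every tick -- the effect of osd_scrub_backoff_ratio=0.
    """
    return rng() < backoff_ratio

def p_consecutive_losses(backoff_ratio, n):
    """Probability of losing n independent flips in a row."""
    return backoff_ratio ** n

# At the default ratio of 0.66, losing 846 flips in a row has probability
# 0.66**846, on the order of 1e-153 -- effectively impossible, which fits
# the later conclusion that the backoff was not the real cause.
```

This also illustrates why 'osd_scrub_backoff_ratio=0' speeds the test up: with the ratio at zero the backoff never fires, so the requested deep-scrubs start on the next tick.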
I'll create a PR to add 'osd_scrub_backoff_ratio=0' to see if we can avoid this sort of anomaly.
#2 Updated by Brad Hubbard almost 6 years ago
- Status changed from 12 to In Progress
- Backport set to luminous
#3 Updated by Brad Hubbard almost 6 years ago
- Status changed from In Progress to Pending Backport
Just a note that my analysis above was incorrect: the failure was not caused by the lost coin flips, but by a pg map update race in the manager, introduced when handling of the osd scrub command moved from the monitor to the manager. That race is still under discussion, but the PR does resolve this particular failure, since the test now completes much faster and avoids the timeout.
#4 Updated by Nathan Cutler almost 6 years ago
- Copied to Backport #23633: luminous: large-omap-object-warnings test fails added
#5 Updated by Brad Hubbard almost 6 years ago
- Status changed from Pending Backport to Resolved