
Bug #23578

large-omap-object-warnings test fails

Added by Sage Weil almost 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Brad Hubbard
Category:
Tests
Target version:
-
% Done:
0%

Source:
Development
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2018-04-06T16:26:01.808 INFO:tasks.workunit.client.0.smithi141.stdout:
2018-04-06T16:26:01.809 INFO:tasks.workunit.client.0.smithi141.stdout:Cluster ID: 90d70377-127d-4a15-a009-2e4cc84f615d
2018-04-06T16:26:01.809 INFO:tasks.workunit.client.0.smithi141.stdout:Error, got invalid output:
2018-04-06T16:26:01.809 INFO:tasks.workunit.client.0.smithi141.stdout:HEALTH_WARN application not enabled on 1 pool(s)
2018-04-06T16:26:01.809 INFO:tasks.workunit.client.0.smithi141.stdout:POOL_APP_NOT_ENABLED application not enabled on 1 pool(s)
2018-04-06T16:26:01.810 INFO:tasks.workunit.client.0.smithi141.stdout:    application not enabled on pool 'large-omap-test-pool'
2018-04-06T16:26:01.810 INFO:tasks.workunit.client.0.smithi141.stdout:    use 'ceph osd pool application enable <pool-name> <app-name>', where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications.
2018-04-06T16:26:01.810 INFO:tasks.workunit.client.0.smithi141.stdout:
2018-04-06T16:26:01.810 INFO:tasks.workunit.client.0.smithi141.stderr:Traceback (most recent call last):
2018-04-06T16:26:01.810 INFO:tasks.workunit.client.0.smithi141.stderr:  File "/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_large_omap_detection.py", line 130, in <module>
2018-04-06T16:26:01.810 INFO:tasks.workunit.client.0.smithi141.stderr:    main()
2018-04-06T16:26:01.810 INFO:tasks.workunit.client.0.smithi141.stderr:  File "/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_large_omap_detection.py", line 125, in main
2018-04-06T16:26:01.810 INFO:tasks.workunit.client.0.smithi141.stderr:    check_health_output()
2018-04-06T16:26:01.811 INFO:tasks.workunit.client.0.smithi141.stderr:  File "/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_large_omap_detection.py", line 120, in check_health_output
2018-04-06T16:26:01.811 INFO:tasks.workunit.client.0.smithi141.stderr:    raise Exception

/a/sage-2018-04-06_13:59:16-rados-wip-sage2-testing-2018-04-06-0721-distro-basic-smithi/2362967

I've seen this a few other times in the last couple of weeks.
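For context, the check that raises here compares the cluster health output against the expected large-omap warning; the only warning present is POOL_APP_NOT_ENABLED, so the comparison fails. Below is a minimal sketch of that style of check, assuming it shells out to 'ceph health detail' -- this is not the actual workunit code.

#!/usr/bin/env python
# Hedged sketch, not the real test_large_omap_detection.py code: run
# `ceph health detail` and fail if the expected large-omap warning text
# is missing, mirroring the failure mode in the log above.
import subprocess


def check_health_output():
    out = subprocess.check_output(['ceph', 'health', 'detail']).decode()
    # The deep scrubs should have flagged the large omap objects by now;
    # if the warning never appears, raise, as seen in the traceback above.
    if 'large omap objects' not in out.lower():
        print('Error, got invalid output:')
        print(out)
        raise Exception


def main():
    check_health_output()


if __name__ == '__main__':
    main()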


Related issues

Copied to RADOS - Backport #23633: luminous: large-omap-object-warnings test fails (Rejected)

History

#1 Updated by Brad Hubbard almost 6 years ago

  • Category set to Tests
  • Assignee set to Brad Hubbard
  • Source set to Development

We instruct the OSDs to deep-scrub at around 16:04:15.

2018-04-06T16:04:15.971 INFO:tasks.workunit.client.0.smithi141.stderr:instructed osd(s) 0 to deep-scrub                                                                                                                                      
2018-04-06T16:04:16.222 INFO:tasks.workunit.client.0.smithi141.stderr:instructed osd(s) 1 to deep-scrub

Twenty-two minutes later, at 16:26:01, we time out.

2018-04-06T16:26:01.809 INFO:tasks.workunit.client.0.smithi141.stdout:Error, got invalid output:

But we only start scrubbing the relevant pg on osd.0 at 16:26:10.

2018-04-06 16:26:10.549 7efd8b459700 20 osd.0 pg_epoch: 13 pg[2.2( v 12'2 (0'0,12'2] local-lis/les=11/12 n=1 ec=11/11 lis/c 11/11 les/c/f 12/12/0 11/11/11) [0,1] r=0 lpr=11 crt=12'2 lcod 12'1 mlcod 12'1 active+clean+scrubbing+deep] deep-scrub  2:539eec76:::large-omap-test-object1:head 2:539eec76:::large-omap-test-object1:head(12'2 client.4146.0:3 dirty|omap|data_digest s 11 uv 2 dd 9b6a63e4 alloc_hint [0 0 0])                                                             
2018-04-06 16:26:10.549 7efd8b459700 20 osd.0 pg_epoch: 13 pg[2.2( v 12'2 (0'0,12'2] local-lis/les=11/12 n=1 ec=11/11 lis/c 11/11 les/c/f 12/12/0 11/11/11) [0,1] r=0 lpr=11 crt=12'2 lcod 12'1 mlcod 12'1 active+clean+scrubbing+deep] scrub_snapshot_metadata deep-scrub new head  2:539eec76:::large-omap-test-object1:head                                     
2018-04-06 16:26:10.549 7efd8b459700 10 osd.0 pg_epoch: 13 pg[2.2( v 13'3 (0'0,13'3] local-lis/les=11/12 n=1 ec=11/11 lis/c 11/11 les/c/f 12/12/0 11/11/11) [0,1] r=0 lpr=11 luod=12'2 crt=13'3 lcod 12'1 mlcod 12'1 active+clean+scrubbing+deep] scrub_snapshot_metadata (deep-scrub) finish           
2018-04-06 16:26:10.553 7efd87451700 10 osd.0 pg_epoch: 13 pg[2.2( v 13'3 (0'0,13'3] local-lis/les=11/12 n=1 ec=11/11 lis/c 11/11 les/c/f 12/12/0 11/11/11) [0,1] r=0 lpr=11 crt=13'3 lcod 12'2 mlcod 12'2 active+clean+scrubbing+deep] deep-scrub got 1/1 objects, 0/0 clones, 1/1 dirty, 1/1 omap, 0/0 pinned, 0/0 hit_set_archive, 11/11 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.                                                                                       
2018-04-06 16:26:10.553 7efd87451700  0 log_channel(cluster) log [DBG] : 2.2 deep-scrub ok

We start even later on osd.1.

2018-04-06 16:26:21.477 7fc28f5aa700  0 log_channel(cluster) log [DBG] : 2.4 deep-scrub starts
2018-04-06 16:26:21.549 7fc28f5aa700 20 osd.1 pg_epoch: 13 pg[2.4( v 12'2 (0'0,12'2] local-lis/les=11/12 n=1 ec=11/11 lis/c 11/11 les/c/f 12/12/0 11/11/11) [1,0] r=0 lpr=11 crt=12'2 lcod 12'1 mlcod 12'1 active+clean+scrubbing+deep] deep-scrub  2:29c01cbb:::large-omap-test-object2:head 2:29c01cbb:::large-omap-test-object2:head(12'2 client.4146.0:5 dirty|omap|data_digest s 17 uv 2 dd f0fa5029 alloc_hint [0 0 0])
2018-04-06 16:26:21.549 7fc28f5aa700 20 osd.1 pg_epoch: 13 pg[2.4( v 12'2 (0'0,12'2] local-lis/les=11/12 n=1 ec=11/11 lis/c 11/11 les/c/f 12/12/0 11/11/11) [1,0] r=0 lpr=11 crt=12'2 lcod 12'1 mlcod 12'1 active+clean+scrubbing+deep] scrub_snapshot_metadata deep-scrub new head  2:29c01cbb:::large-omap-test-object2:head
2018-04-06 16:26:21.549 7fc28f5aa700 10 osd.1 pg_epoch: 13 pg[2.4( v 13'3 (0'0,13'3] local-lis/les=11/12 n=1 ec=11/11 lis/c 11/11 les/c/f 12/12/0 11/11/11) [1,0] r=0 lpr=11 luod=12'2 crt=13'3 lcod 12'1 mlcod 12'1 active+clean+scrubbing+deep] scrub_snapshot_metadata (deep-scrub) finish
2018-04-06 16:26:21.549 7fc2935b2700 10 osd.1 pg_epoch: 13 pg[2.4( v 13'3 (0'0,13'3] local-lis/les=11/12 n=1 ec=11/11 lis/c 11/11 les/c/f 12/12/0 11/11/11) [1,0] r=0 lpr=11 crt=13'3 lcod 12'2 mlcod 12'2 active+clean+scrubbing+deep] deep-scrub got 1/1 objects, 0/0 clones, 1/1 dirty, 1/1 omap, 0/0 pinned, 0/0 hit_set_archive, 17/17 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2018-04-06 16:26:21.549 7fc2935b2700  0 log_channel(cluster) log [DBG] : 2.4 deep-scrub ok

On osd.0 between "2018-04-06 16:04:20.179 7efd8b459700 0 log_channel(cluster) log [DBG] : 1.2 deep-scrub starts" and "2018-04-06 16:26:06.493 7efd88453700 0 log_channel(cluster) log [DBG] : 2.0 deep-scrub starts" we lost 846 consecutive coin flips in scrub_random_backoff. On osd.1 between "2018-04-06 16:04:45.207 7fc291daf700 0 log_channel(cluster) log [DBG] : 1.7 deep-scrub starts" and "2018-04-06 16:26:06.473 7fc28fdab700 0 log_channel(cluster) log [DBG] : 1.3 deep-scrub starts" we lost 827 coin flips in a row.
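A rough illustration of the coin flip being described, as a hedged Python sketch (the real scrub_random_backoff lives in the OSD's C++ code, and the 0.66 default for osd_scrub_backoff_ratio is an assumption here): each scheduling tick backs off with probability osd_scrub_backoff_ratio, and "losing" a flip means no scrub is scheduled on that tick.

import random

# Assumed default for osd_scrub_backoff_ratio; setting it to 0 means the
# backoff branch is never taken.
OSD_SCRUB_BACKOFF_RATIO = 0.66


def scrub_random_backoff(ratio=OSD_SCRUB_BACKOFF_RATIO):
    # True means this tick "loses" the flip and schedules no scrub.
    return random.random() < ratio


def ticks_until_scrub(ratio=OSD_SCRUB_BACKOFF_RATIO):
    # Count consecutive lost flips before a tick finally schedules a scrub.
    lost = 0
    while scrub_random_backoff(ratio):
        lost += 1
    return lost

With the ratio forced to 0 the flip can never be lost, which is the effect of the change proposed below.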

I'll create a PR to add 'osd_scrub_backoff_ratio=0' to see if we can avoid this sort of anomaly.

#2 Updated by Brad Hubbard almost 6 years ago

  • Status changed from 12 to In Progress
  • Backport set to luminous

#3 Updated by Brad Hubbard almost 6 years ago

  • Status changed from In Progress to Pending Backport

Just a note that my analysis above was incorrect: this was not due to the lost coin flips but to a pg map update race in the manager, introduced when handling of the osd scrub command was moved from the monitor to the manager. That race is still being discussed, but the PR does resolve this particular issue since the test completes much faster, avoiding the timeout.

#4 Updated by Nathan Cutler almost 6 years ago

  • Copied to Backport #23633: luminous: large-omap-object-warnings test fails added

#5 Updated by Brad Hubbard almost 6 years ago

  • Status changed from Pending Backport to Resolved
