Bug #49781

unittest_mempool.check_shard_select failed

Added by Josh Durgin 6 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
nautilus, octopus, pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This test is probabilistic. Recording it here so we can see whether it starts failing more frequently.

From https://jenkins.ceph.com/job/ceph-pull-requests/71240/consoleFull :

[ RUN      ] mempool.check_shard_select
/home/jenkins-build/build/workspace/ceph-pull-requests/src/test/test_mempool.cc:411: Failure
Expected: (VarX) < (200), actual: 361.5 vs 200
[  FAILED  ] mempool.check_shard_select (7 ms)

Related issues

Related to RADOS - Backport #49055: nautilus: pick_a_shard() always select shard 0 Resolved
Related to Ceph - Fix #49896: qa: verify the benefits of mempool cacheline optimization Resolved
Copied to RADOS - Backport #49991: nautilus: unittest_mempool.check_shard_select failed Resolved
Copied to RADOS - Backport #49992: pacific: unittest_mempool.check_shard_select failed Resolved
Copied to RADOS - Backport #49993: octopus: unittest_mempool.check_shard_select failed Resolved

History

#1 Updated by Loïc Dachary 6 months ago

It happened 5 days ago at https://github.com/ceph/ceph/pull/39883#issuecomment-791944956 and is related to https://github.com/ceph/ceph/pull/39651, which introduced the failing test into nautilus 12 days ago.

#3 Updated by singuliere _ 6 months ago

  • Subject changed from unittest_mempool.check_shard_select failed to nautilus: unittest_mempool.check_shard_select failed

#4 Updated by singuliere _ 6 months ago

  • Related to Backport #49055: nautilus: pick_a_shard() always select shard 0 added

#5 Updated by Kefu Chai 6 months ago

This also happens on master:

[ RUN      ] mempool.check_shard_select
../src/test/test_mempool.cc:430: Failure
Expected: (VarX) < (200), actual: 217.5 vs 200
[  FAILED  ] mempool.check_shard_select (10 ms)

https://github.com/ceph/ceph/pull/40060

#6 Updated by Kefu Chai 6 months ago

  • Subject changed from nautilus: unittest_mempool.check_shard_select failed to unittest_mempool.check_shard_select failed

#7 Updated by Kefu Chai 6 months ago

  • Priority changed from Normal to High

#8 Updated by Loïc Dachary 6 months ago

  • Status changed from New to In Progress
  • Assignee set to Loïc Dachary

#9 Updated by Loïc Dachary 6 months ago

Using pthread_self for sharding is not great because the value it returns is opaque. Shifting it by 3 bits turned out not to do any sharding, because the first 8 bits were always the same. A fix merged in master earlier this year gave better results by shifting by 12 bits instead of 3, and it also introduced a test that verifies the variance is not too high.
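For reference, a minimal standalone sketch of what the shard selection boils down to (the 12-bit shift is the value mentioned above; the num_shard_bits value of 5 is assumed here for illustration, the real constant lives in src/include/mempool.h):

#include <pthread.h>
#include <cstddef>

// Assumed for illustration; the real constant is mempool::num_shard_bits.
static constexpr size_t num_shard_bits = 5;

size_t pick_a_shard()
{
  // pthread_self() returns an opaque value; on glibc it is essentially the
  // address of the thread descriptor, so the low bits carry little entropy.
  size_t me = (size_t)pthread_self();
  // Shift by 12 bits (the current fix; the original code shifted by 3)
  // and mask down to a shard index.
  size_t i = (me >> 12) & ((1 << num_shard_bits) - 1);
  return i;
}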

The test fails randomly, and it could probably be fixed by increasing the number of samples (from 100 to 10000, for instance). But the root of the problem is that there is no way to predict the variance: it depends on an opaque value. A better fix would be to use something other than pthread_self.
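To make the failure mode concrete, here is a standalone sketch of the variance check as it exists today (the shard counts below are hypothetical; in the real test they are filled in by worker threads calling pick_a_shard(), and num_shards = 32 is assumed for illustration):

#include <cstdio>
#include <cstddef>

int main()
{
  const size_t num_shards = 32;   // assumed value of mempool::num_shards
  const size_t samples = 100;

  // Hypothetical worst case: every sample lands in the same shard,
  // i.e. no sharding at all.
  int shards[num_shards] = {0};
  shards[0] = samples;

  double EX = (double)samples / (double)num_shards;  // expected count per shard
  double VarX = 0;
  for (size_t i = 0; i < num_shards; i++) {
    VarX += (EX - shards[i]) * (EX - shards[i]);
  }

  // The test asserts EXPECT_LT(VarX, 200); with everything in one slot
  // VarX is in the thousands, which is what the check is meant to catch.
  printf("VarX = %g\n", VarX);
  return 0;
}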

#10 Updated by Loïc Dachary 6 months ago

The test condition should not be too strict because there really is no way to predict the result. It is, however, good to have some sort of failure when things get really bad, like when shifting by only 3 bits resulted in no sharding at all. This patch should achieve the desired result:

diff --git a/src/include/mempool.h b/src/include/mempool.h
index c03aa175cf..fe84f3b8f0 100644
--- a/src/include/mempool.h
+++ b/src/include/mempool.h
@@ -259,7 +259,7 @@ public:
     // Dirt cheap, see:
     //   https://fossies.org/dox/glibc-2.32/pthread__self_8c_source.html
     size_t me = (size_t)pthread_self();
-    size_t i = (me >> 12) & ((1 << num_shard_bits) - 1);
+    size_t i = (me >> CEPH_PAGE_SHIFT) & ((1 << num_shard_bits) - 1);
     return i;
   }

diff --git a/src/test/test_mempool.cc b/src/test/test_mempool.cc
index 99b87d11f2..acddfbcd77 100644
--- a/src/test/test_mempool.cc
+++ b/src/test/test_mempool.cc
@@ -404,7 +404,7 @@ TEST(mempool, btree_map_test)

 TEST(mempool, check_shard_select)
 {
-  const size_t samples = 100;
+  const size_t samples = mempool::num_shards * 100;
   std::atomic_int shards[mempool::num_shards] = {0};
   std::vector<std::thread> workers;
   for (size_t i = 0; i < samples; i++) {
@@ -419,15 +419,17 @@ TEST(mempool, check_shard_select)
   }
   workers.clear();

-  double EX = (double)samples / (double)mempool::num_shards;
-  double VarX = 0;
+  size_t missed = 0;
   for (size_t i = 0; i < mempool::num_shards; i++) {
-    VarX += (EX - shards[i]) * (EX - shards[i]);
+    if (shards[i] == 0) {
+      missed++;
+    }
   }
-  //random gives VarX below 200
-  //when half slots are 0, we get ~300
-  //when all samples go into one slot, we get ~9000
-  EXPECT_LT(VarX, 200);
+
+  // If more than half of the shards did not get anything,
+  // the distribution is bad enough to deserve a failure.
+  EXPECT_LT(missed, mempool::num_shards / 2);
 }

#11 Updated by singuliere _ 6 months ago

  • Pull request ID set to 40167

#12 Updated by Loïc Dachary 6 months ago

Discussion on the mailing list about mempool optimization

#13 Updated by Loïc Dachary 6 months ago

  • Related to Fix #49896: qa: verify the benefits of mempool cacheline optimization added

#15 Updated by Kefu Chai 6 months ago

  • Status changed from In Progress to Pending Backport
  • Backport set to nautilus, octopus, pacific

#16 Updated by Backport Bot 6 months ago

  • Copied to Backport #49991: nautilus: unittest_mempool.check_shard_select failed added

#17 Updated by Backport Bot 6 months ago

  • Copied to Backport #49992: pacific: unittest_mempool.check_shard_select failed added

#18 Updated by Backport Bot 6 months ago

  • Copied to Backport #49993: octopus: unittest_mempool.check_shard_select failed added

#19 Updated by Loïc Dachary 3 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
