Project

General

Profile

Actions

Bug #40519

closed

rbd_mirror/ImageSyncThrottler.cc: 61: FAILED ceph_assert(m_queue.empty())

Added by Mykola Golub almost 5 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous,mimic,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

A crash was observed in one of teuthology runs [1]:

2019-06-23T07:57:02.429 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi039.stderr:/build/ceph-15.0.0-2068-g742df4f/src/tools/rbd_mirror/ImageSyncThrottler.cc: In function 'void rbd::mirror::ImageSyncThrottler<ImageCtxT>::start_op(const string&, Context*) [with ImageCtxT = librbd::ImageCtx; std::string = std::__cxx11::basic_string<char>]' thread 7f5c1aad7700 time 2019-06-23T07:57:02.429701+0000
2019-06-23T07:57:02.429 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi039.stderr:/build/ceph-15.0.0-2068-g742df4f/src/tools/rbd_mirror/ImageSyncThrottler.cc: 61: FAILED ceph_assert(m_queue.empty())
2019-06-23T07:57:02.430 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi039.stderr: ceph version 15.0.0-2068-g742df4f (742df4fd43b2d170a24e2826fd033e137f79b855) octopus (dev)
2019-06-23T07:57:02.430 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi039.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x7f5c2e728037]
2019-06-23T07:57:02.430 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi039.stderr: 2: (()+0x2d3221) [0x7f5c2e728221]
2019-06-23T07:57:02.431 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi039.stderr: 3: (rbd::mirror::ImageSyncThrottler<librbd::ImageCtx>::start_op(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Context*)+0xa71) [0x55c7eb7b13c1]
2019-06-23T07:57:02.431 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi039.stderr: 4: (rbd::mirror::InstanceWatcher<librbd::ImageCtx>::handle_sync_request(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Context*)+0x4ba) [0x55c7eb75b5ea]
2019-06-23T07:57:02.431 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi039.stderr: 5: (rbd::mirror::InstanceWatcher<librbd::ImageCtx>::handle_payload(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rbd::mirror::instance_watcher::SyncRequestPayload const&, librbd::Watcher::C_NotifyAck*)+0x65) [0x55c7eb75c9b5]
2019-06-23T07:57:02.431 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi039.stderr: 6: (rbd::mirror::InstanceWatcher<librbd::ImageCtx>::handle_notify(unsigned long, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&)+0x475) [0x55c7eb761d85]
2019-06-23T07:57:02.431 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi039.stderr: 7: (librbd::Watcher::WatchCtx::handle_notify(unsigned long, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&)+0x5e) [0x55c7eb8c41be]
2019-06-23T07:57:02.431 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi039.stderr: 8: (()+0xaac31) [0x7f5c376f6c31]
2019-06-23T07:57:02.432 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi039.stderr: 9: (()+0xc355c) [0x7f5c3770f55c]
2019-06-23T07:57:02.432 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi039.stderr: 10: (()+0x66119) [0x7f5c376b2119]
2019-06-23T07:57:02.432 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi039.stderr: 11: (Finisher::finisher_thread_entry()+0x19d) [0x7f5c2e7bdefd]
2019-06-23T07:57:02.432 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi039.stderr: 12: (()+0x76db) [0x7f5c2e23d6db]
2019-06-23T07:57:02.432 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi039.stderr: 13: (clone()+0x3f) [0x7f5c2db5a88f]

I believe the cause is incorrect duplicate handling in `ImageSyncThrottler::start_op`. We already check here if there is a request with the same id in in-flight ops (started), but does not check if there is such a request in the queue (waiting).

[1] http://qa-proxy.ceph.com/teuthology/trociny-2019-06-21_06:39:59-rbd-wip-mgolub-testing-distro-basic-smithi/4055998/teuthology.log


Related issues 3 (0 open3 closed)

Copied to rbd - Backport #40592: luminous: rbd_mirror/ImageSyncThrottler.cc: 61: FAILED ceph_assert(m_queue.empty())ResolvedMykola GolubActions
Copied to rbd - Backport #40593: mimic: rbd_mirror/ImageSyncThrottler.cc: 61: FAILED ceph_assert(m_queue.empty())ResolvedMykola GolubActions
Copied to rbd - Backport #40594: nautilus: rbd_mirror/ImageSyncThrottler.cc: 61: FAILED ceph_assert(m_queue.empty())ResolvedMykola GolubActions
Actions #1

Updated by Mykola Golub almost 5 years ago

The test case that reproduces the assertion failure:

TEST_F(TestMockImageSyncThrottler, Duplicate) {
  MockImageSyncThrottler throttler(g_ceph_context);
  throttler.set_max_concurrent_syncs(2);

  C_SaferCond on_start1;
  throttler.start_op("id1", &on_start1); // req (1) in flight (in_flight.size() == 1)
  C_SaferCond on_start2;
  throttler.start_op("id2", &on_start2); // req (2) in flight (in_flight.size() == 2)

  C_SaferCond on_start3;
  throttler.start_op("id3", &on_start3); // req (3) queued (m_queue.size() == 1)
  C_SaferCond on_start4;
  throttler.start_op("id3", &on_start4); // req (4) (dup!) queued (m_queue.size() == 2)
  C_SaferCond on_start5;
  throttler.start_op("id4", &on_start5); // req (5) queued (m_queue.size() == 3)

  throttler.finish_op("id1"); // req (3) in flight (in_flight.size() == 2, m_queue.size() == 2)
  ASSERT_EQ(0, on_start1.wait());

  throttler.finish_op("id2"); // req (4) in flight (in_flight.size() == 1 due to dublicate!, m_queue.size() == 1)
  ASSERT_EQ(0, on_start1.wait());

  C_SaferCond on_start6;
  throttler.start_op("id6", &on_start6); // FAILED ceph_assert(m_queue.empty())

  throttler.finish_op("id3");
  ASSERT_EQ(0, on_start3.wait());
  ASSERT_EQ(-ECANCELED, on_start4.wait());

  throttler.finish_op("id4");
  ASSERT_EQ(0, on_start5.wait());
  throttler.finish_op("id5");
  ASSERT_EQ(0, on_start6.wait());
}

And in the logs one can see many duplicates at that moment (e.g. 10c5c813f2d7), which I believe are due to the trasher:

trociny@teuthology:/a/trociny-2019-06-21_06:39:59-rbd-wip-mgolub-testing-distro-basic-smithi/4055998/remote/smithi039/log$ zgrep '^20.*handle_sync_request: instance_id=4651, sync_id=' cluster2-client.mirror.2.38780.log.gz 
2019-06-23T07:56:49.284+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c9ddb5e0a
2019-06-23T07:56:49.288+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=1107334ad7fb
2019-06-23T07:56:49.288+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c9b4dfcd59
2019-06-23T07:56:49.288+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10be37fff151
2019-06-23T07:56:49.288+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=122bc575f727
2019-06-23T07:56:49.288+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c5c813f2d7
2019-06-23T07:56:49.292+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=110746577dca
2019-06-23T07:56:49.292+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c517ed9396
2019-06-23T07:56:49.292+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c939a2aa83
2019-06-23T07:56:49.292+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c916a6fb2
2019-06-23T07:56:49.292+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10be13accbcd
2019-06-23T07:56:52.024+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c5c813f2d7
2019-06-23T07:56:52.024+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=110746577dca
2019-06-23T07:56:52.024+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c517ed9396
2019-06-23T07:56:52.024+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c939a2aa83
2019-06-23T07:56:52.024+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c916a6fb2
2019-06-23T07:56:52.024+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10be13accbcd
2019-06-23T07:56:52.024+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c5c813f2d7
2019-06-23T07:56:52.024+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=110746577dca
2019-06-23T07:56:52.024+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c517ed9396
2019-06-23T07:56:52.028+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c939a2aa83
2019-06-23T07:56:52.028+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c916a6fb2
2019-06-23T07:56:52.028+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c5c813f2d7
2019-06-23T07:56:52.028+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=110746577dca
2019-06-23T07:56:52.028+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c517ed9396
2019-06-23T07:56:52.028+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c939a2aa83
2019-06-23T07:56:52.028+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c916a6fb2
2019-06-23T07:56:52.028+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c5c813f2d7
2019-06-23T07:56:52.028+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=110746577dca
2019-06-23T07:56:52.028+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c517ed9396
2019-06-23T07:56:52.028+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c939a2aa83
2019-06-23T07:56:52.028+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c916a6fb2
2019-06-23T07:56:52.340+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10be13accbcd
2019-06-23T07:56:52.340+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10be13accbcd
2019-06-23T07:56:52.948+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=110746577dca
2019-06-23T07:56:52.960+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10be13accbcd
2019-06-23T07:56:52.968+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c916a6fb2
2019-06-23T07:56:52.992+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c517ed9396
2019-06-23T07:56:52.992+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c5c813f2d7
2019-06-23T07:56:53.028+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=122bc575f727
2019-06-23T07:56:53.036+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10be13accbcd
2019-06-23T07:56:53.036+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c939a2aa83
2019-06-23T07:56:53.452+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c9b4dfcd59
2019-06-23T07:56:53.464+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=1107334ad7fb
2019-06-23T07:56:53.484+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c9ddb5e0a
2019-06-23T07:56:53.520+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10be37fff151
2019-06-23T07:56:56.596+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c916a6fb2
2019-06-23T07:56:56.596+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c517ed9396
2019-06-23T07:56:56.596+0000 7f5c1aad7700 10 rbd::mirror::InstanceWatcher: 0x55c7ed562000 handle_sync_request: instance_id=4651, sync_id=10c5c813f2d7

Actions #2

Updated by Mykola Golub almost 5 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 28730
Actions #3

Updated by Jason Dillaman almost 5 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to luminous,mimic,nautilus
Actions #4

Updated by Nathan Cutler almost 5 years ago

  • Copied to Backport #40592: luminous: rbd_mirror/ImageSyncThrottler.cc: 61: FAILED ceph_assert(m_queue.empty()) added
Actions #5

Updated by Nathan Cutler almost 5 years ago

  • Copied to Backport #40593: mimic: rbd_mirror/ImageSyncThrottler.cc: 61: FAILED ceph_assert(m_queue.empty()) added
Actions #6

Updated by Nathan Cutler almost 5 years ago

  • Copied to Backport #40594: nautilus: rbd_mirror/ImageSyncThrottler.cc: 61: FAILED ceph_assert(m_queue.empty()) added
Actions #8

Updated by Nathan Cutler over 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved".

Actions

Also available in: Atom PDF