Project

General

Profile

Actions

Bug #38315

closed

journal: stalled io due to race when closing and advancing object set

Added by Mykola Golub about 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I am observing this running DiffIterateTest/1.DiffIterateStress unittest in loop with journaling enabled:

RBD_FEATURES=109 ./bin/unittest_librbd --gtest_filter=DiffIterateTest/1.DiffIterateStress --gtest_repeat=1000

The test may hung and this is probably the cause of sporadic Jenkins failures due to rbd unit tests getting stuck.

Running with debug level shows that it hunged after close_and_advance_object_set, because one of closed objects (ObjectRecorder) did not notified the JournalRecorder about being closed. In the log below note the missing "object journal_data.36.53d5f4b8b8f.18 closed". Also there is a message:

2019-02-13 13:33:49.696 7f7db77fe700 20 ObjectRecorder: 0x7f7db8001fe0 send_appends_aio: journal_data.36.53d5f4b8b8f.19 pending buffers empty

Tracing this, the problem is a race between `ObjectRecorder::handle_append_flushed` and `ObjectRecorder::close`. At the end the `handle_append_flushed` sets `m_aio_scheduled`, releases the lock and calls `send_appends_aio`. It should do this only if `m_pending_buffers` is not empty, and usually it is ok even if it is empty, because the first thing the `send_appends_aio` does after reacquiring the lock is it checks for `m_pending_buffers` and just returns if it is empty. The problem is this small time window when the lock is released, `m_aio_scheduled` is set and `m_pending_buffers` is empty. If during this time interval `ObjectRecorder::close()` is called it will returned `false` due to `m_aio_scheduled` is set and the higher level will wait for "closed" notification, which will never comes.

The regression has been introduced in the master, commit 46159af76cda5e2f4a54c1aefdc6838dabd890b2 (journal: limit in-flight appends), after mimic, and has not been backported.

2019-02-13 13:33:49.696 7f7db6ffd700 20 JournalRecorder: 0x7f7d7c002510 close_and_advance_object_set: closing active object set 4
2019-02-13 13:33:49.696 7f7db77fe700 10 ObjectRecorder: 0x7f7db8001fe0 handle_append_flushed: journal_data.36.53d5f4b8b8f.19 tid=1021, r=0
2019-02-13 13:33:49.696 7f7db77fe700 20 ObjectRecorder: 0x7f7db8001fe0 handle_append_flushed: Future[tag_tid=1, entry_tid=20483, commit_tid=20484] marked safe
2019-02-13 13:33:49.696 7f7db6ffd700 10 JournalRecorder: 0x7f7d7c002510 close_object_set: closing object journal_data.36.53d5f4b8b8f.16
2019-02-13 13:33:49.696 7f7db6ffd700 20 ObjectRecorder: 0x7f7db80022b0 close: journal_data.36.53d5f4b8b8f.16
2019-02-13 13:33:49.696 7f7db6ffd700 10 JournalRecorder: 0x7f7d7c002510 close_object_set: closing object journal_data.36.53d5f4b8b8f.17
2019-02-13 13:33:49.696 7f7db6ffd700 20 ObjectRecorder: 0x7f7db800d3b0 close: journal_data.36.53d5f4b8b8f.17
2019-02-13 13:33:49.696 7f7db6ffd700 10 JournalRecorder: 0x7f7d7c002510 close_object_set: closing object journal_data.36.53d5f4b8b8f.18
2019-02-13 13:33:49.696 7f7db6ffd700 20 ObjectRecorder: 0x7f7db8002990 close: journal_data.36.53d5f4b8b8f.18
2019-02-13 13:33:49.696 7f7db6ffd700 10 JournalRecorder: 0x7f7d7c002510 close_object_set: closing object journal_data.36.53d5f4b8b8f.19
2019-02-13 13:33:49.696 7f7db6ffd700 20 ObjectRecorder: 0x7f7db8001fe0 close: journal_data.36.53d5f4b8b8f.19
2019-02-13 13:33:49.696 7f7db57fa700 10 ObjectRecorder: 0x7f7db8002990 send_appends_aio: journal_data.36.53d5f4b8b8f.18 flushing journal tid=1022
-> 2019-02-13 13:33:49.696 7f7db77fe700 20 ObjectRecorder: 0x7f7db8001fe0 send_appends_aio: journal_data.36.53d5f4b8b8f.19 pending buffers empty
2019-02-13 13:33:49.696 7f7db57fa700 20 ObjectRecorder: 0x7f7db8002990 send_appends_aio: flushing Future[tag_tid=1, entry_tid=20486, commit_tid=20487]
2019-02-13 13:33:49.696 7f7db6ffd700 20 JournalMetadata: 0x7f7db800c270 allocated commit tid: commit_tid=20488 [object_num=19, tag_tid=1, entry_tid=20487]
2019-02-13 13:33:49.696 7f7db77fe700 10 ObjectRecorder: 0x7f7db80022b0 handle_append_flushed: journal_data.36.53d5f4b8b8f.16 tid=1022, r=0
2019-02-13 13:33:49.696 7f7db77fe700 20 ObjectRecorder: 0x7f7db80022b0 handle_append_flushed: Future[tag_tid=1, entry_tid=20484, commit_tid=20485] marked safe
2019-02-13 13:33:49.696 7f7db77fe700 10 JournalRecorder: 0x7f7d7c002510 handle_closed: journal_data.36.53d5f4b8b8f.16
2019-02-13 13:33:49.696 7f7db6ffd700 20 JournalMetadata: 0x7f7db800c270 allocated commit tid: commit_tid=20489 [object_num=16, tag_tid=1, entry_tid=20488]
2019-02-13 13:33:49.696 7f7db77fe700 20 JournalRecorder: 0x7f7d7c002510 handle_closed: object journal_data.36.53d5f4b8b8f.16 closed
2019-02-13 13:33:49.696 7f7db77fe700 10 ObjectRecorder: 0x7f7db800d3b0 handle_append_flushed: journal_data.36.53d5f4b8b8f.17 tid=1022, r=0
2019-02-13 13:33:49.696 7f7db77fe700 20 ObjectRecorder: 0x7f7db800d3b0 handle_append_flushed: Future[tag_tid=1, entry_tid=20485, commit_tid=20486] marked safe
2019-02-13 13:33:49.696 7f7db77fe700 10 JournalRecorder: 0x7f7d7c002510 handle_closed: journal_data.36.53d5f4b8b8f.17
2019-02-13 13:33:49.696 7f7db77fe700 20 JournalRecorder: 0x7f7d7c002510 handle_closed: object journal_data.36.53d5f4b8b8f.17 closed
2019-02-13 13:33:49.696 7f7db6ffd700 20 JournalMetadata: 0x7f7db800c270 allocated commit tid: commit_tid=20490 [object_num=17, tag_tid=1, entry_tid=20489]
2019-02-13 13:33:49.696 7f7db6ffd700 20 JournalMetadata: 0x7f7db800c270 allocated commit tid: commit_tid=20491 [object_num=18, tag_tid=1, entry_tid=20490]
2019-02-13 13:33:49.696 7f7db6ffd700 20 JournalMetadata: 0x7f7db800c270 allocated commit tid: commit_tid=20492 [object_num=19, tag_tid=1, entry_tid=20491]
2019-02-13 13:33:49.696 7f7db77fe700 10 ObjectRecorder: 0x7f7db8002990 handle_append_flushed: journal_data.36.53d5f4b8b8f.18 tid=1022, r=0
2019-02-13 13:33:49.696 7f7db77fe700 20 ObjectRecorder: 0x7f7db8002990 handle_append_flushed: Future[tag_tid=1, entry_tid=20486, commit_tid=20487] marked safe
2019-02-13 13:33:49.696 7f7db77fe700 10 JournalRecorder: 0x7f7d7c002510 handle_closed: journal_data.36.53d5f4b8b8f.18
2019-02-13 13:33:49.696 7f7db77fe700 20 JournalRecorder: 0x7f7d7c002510 handle_closed: object journal_data.36.53d5f4b8b8f.18 closed

Actions #1

Updated by Mykola Golub about 5 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 26425
Actions #2

Updated by Mykola Golub about 5 years ago

  • Status changed from Fix Under Review to Resolved
Actions

Also available in: Atom PDF