Project

General

Profile

Actions

Bug #51100

closed

object_recorder->is_closed() assert failure in JournalRecorder::open_object_set() in nautilus

Added by Ilya Dryomov almost 3 years ago. Updated about 2 years ago.

Status:
Won't Fix - EOL
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://qa-proxy.ceph.com/teuthology/yuriw-2021-06-03_17:02:49-rbd-wip-yuri7-testing-2021-06-03-0726-nautilus-distro-basic-smithi/6150012/teuthology.log

2021-06-04T10:05:56.894 INFO:tasks.qemu.client.0.smithi187.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.21-317-g829d9986/rpm/el7/BUILD/ceph-14.2.21-317-g829d9986/src/journal/JournalRecorder.cc: In function 'void journal::JournalRecorder::open_object_set()' thread 7f6864d3d700 time 2021-06-04 10:05:56.900984
2021-06-04T10:05:56.895 INFO:tasks.qemu.client.0.smithi187.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.21-317-g829d9986/rpm/el7/BUILD/ceph-14.2.21-317-g829d9986/src/journal/JournalRecorder.cc: 266: FAILED ceph_assert(object_recorder->is_closed())
2021-06-04T10:05:56.896 INFO:tasks.qemu.client.0.smithi187.stderr: ceph version 14.2.21-317-g829d9986445 (829d99864457b1c6cc2479d2b96f1a1ddf99f265) nautilus (stable)
2021-06-04T10:05:56.896 INFO:tasks.qemu.client.0.smithi187.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x7f687dd283e7]
2021-06-04T10:05:56.896 INFO:tasks.qemu.client.0.smithi187.stderr: 2: (()+0x25d5af) [0x7f687dd285af]
2021-06-04T10:05:56.896 INFO:tasks.qemu.client.0.smithi187.stderr: 3: (()+0x37437d) [0x7f688bfce37d]
2021-06-04T10:05:56.896 INFO:tasks.qemu.client.0.smithi187.stderr: 4: (()+0x374f84) [0x7f688bfcef84]
2021-06-04T10:05:56.897 INFO:tasks.qemu.client.0.smithi187.stderr: 5: (()+0xabbc9) [0x7f688bd05bc9]
2021-06-04T10:05:56.897 INFO:tasks.qemu.client.0.smithi187.stderr: 6: (()+0x387868) [0x7f688bfe1868]
2021-06-04T10:05:56.897 INFO:tasks.qemu.client.0.smithi187.stderr: 7: (()+0x918b6) [0x7f688b9a08b6]
2021-06-04T10:05:56.897 INFO:tasks.qemu.client.0.smithi187.stderr: 8: (()+0x593b9) [0x7f688b9683b9]
2021-06-04T10:05:56.898 INFO:tasks.qemu.client.0.smithi187.stderr: 9: (Finisher::finisher_thread_entry()+0x16f) [0x7f687ddb345f]
2021-06-04T10:05:56.898 INFO:tasks.qemu.client.0.smithi187.stderr: 10: (()+0x7ea5) [0x7f688c751ea5]
2021-06-04T10:05:56.898 INFO:tasks.qemu.client.0.smithi187.stderr: 11: (clone()+0x6d) [0x7f6888fce9fd]
253 void JournalRecorder::open_object_set() {
254   ceph_assert(m_lock.is_locked());
255 
256   ldout(m_cct, 10) << "opening object set " << m_current_set << dendl;
257 
258   uint8_t splay_width = m_journal_metadata->get_splay_width();
259 
260   lock_object_recorders();
261   for (ObjectRecorderPtrs::iterator it = m_object_ptrs.begin();
262        it != m_object_ptrs.end(); ++it) {
263     ObjectRecorderPtr object_recorder = it->second;
264     uint64_t object_number = object_recorder->get_object_number();
265     if (object_number / splay_width != m_current_set) {
266       ceph_assert(object_recorder->is_closed());
267 
268       // ready to close object and open object in active set
269       create_next_object_recorder(object_recorder);
270     }
271   }
272   unlock_object_recorders();
273 }

Related issues 2 (0 open2 closed)

Related to rbd - Bug #42598: unittest_journal failure during make checkResolved

Actions
Related to rbd - Bug #47880: [journal] object recorder can race while lock is temporarily release for callbacksResolvedJason Dillaman

Actions
Actions #1

Updated by Ilya Dryomov almost 3 years ago

(gdb) bt
#0  0x00007f6888f063d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f6888f07ac8 in __GI_abort () at abort.c:90
#2  0x00007f687dd28436 in ceph::__ceph_assert_fail (assertion=<optimized out>, file=<optimized out>, line=<optimized out>, func=<optimized out>) at /usr/src/debug/ceph-14.2.21-317-g829d9986/src/common/assert.cc:73
#3  0x00007f687dd285af in ceph::__ceph_assert_fail (ctx=...) at /usr/src/debug/ceph-14.2.21-317-g829d9986/src/common/assert.cc:78
#4  0x00007f688bfce37d in journal::JournalRecorder::open_object_set() () at /usr/src/debug/ceph-14.2.21-317-g829d9986/src/journal/ObjectRecorder.h:65
#5  0x00007f688bfcef84 in journal::JournalRecorder::handle_advance_object_set(int) () at /usr/src/debug/ceph-14.2.21-317-g829d9986/src/journal/JournalRecorder.cc:244
#6  0x00007f688bd05bc9 in Context::complete (this=0x5573cbaa0d30, r=<optimized out>) at /usr/src/debug/ceph-14.2.21-317-g829d9986/src/include/Context.h:77
#7  0x00007f688bfe1868 in complete (r=<optimized out>, this=0x5573c5f5e840) at /usr/src/debug/ceph-14.2.21-317-g829d9986/src/include/Context.h:77
#8  journal::utils::rados_ctx_callback (c=<optimized out>, arg=0x5573c5f5e840) at /usr/src/debug/ceph-14.2.21-317-g829d9986/src/journal/Utils.cc:21
#9  0x00007f688b9a08b6 in librados::C_AioComplete::finish (this=0x5573c5048d50, r=<optimized out>) at /usr/src/debug/ceph-14.2.21-317-g829d9986/src/librados/AioCompletionImpl.h:172
#10 0x00007f688b9683b9 in Context::complete (this=0x5573c5048d50, r=<optimized out>) at /usr/src/debug/ceph-14.2.21-317-g829d9986/src/include/Context.h:77
#11 0x00007f687ddb345f in Finisher::finisher_thread_entry() () at /usr/src/debug/ceph-14.2.21-317-g829d9986/src/common/Finisher.cc:67
#12 0x00007f688c751ea5 in start_thread (arg=0x7f6864d3d700) at pthread_create.c:307
#13 0x00007f6888fce9fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Actions #2

Updated by Ilya Dryomov almost 3 years ago

(gdb) p ((journal::ObjectRecorder *)0x5573c5cd63c0)->m_object_closed
$22 = true
(gdb) p ((journal::ObjectRecorder *)0x5573c5cd63c0)->m_in_flight_appends
Python Exception <class 'gdb.error'> There is no member or method named _M_value_field.: 
$23 = std::map with 18 elements
Actions #3

Updated by Mykola Golub almost 3 years ago

  • Status changed from New to Duplicate

The crash was observed in nautilus and I believe it is fixed in newer versions with this PR [1] as a non-related issue caught when testing that PR. See the comment [2] with the same backtrace. The PR [1] is still not backported to nautilus [3].

[1] https://github.com/ceph/ceph/pull/31392
[2] https://github.com/ceph/ceph/pull/31392#issuecomment-553297734
[3] https://tracker.ceph.com/issues/42953

Actions #4

Updated by Ilya Dryomov almost 3 years ago

  • Status changed from Duplicate to New

This assert is still being hit with https://github.com/ceph/ceph/pull/41748 included:

http://qa-proxy.ceph.com/teuthology/dis-2021-06-08_23:01:13-rbd-wip-dis-testing-nautilus-distro-basic-smithi/6161219/teuthology.log

src/journal/JournalRecorder.cc: In function 'void journal::JournalRecorder::open_object_set()' thread 7f278896e700 time 2021-06-08 23:28:33.884593
src/journal/JournalRecorder.cc: 267: FAILED ceph_assert(object_recorder->is_closed())

And also a new one:

http://qa-proxy.ceph.com/teuthology/dis-2021-06-08_23:01:13-rbd-wip-dis-testing-nautilus-distro-basic-smithi/6161222/teuthology.log

src/journal/ObjectRecorder.cc: In function 'void journal::ObjectRecorder::notify_handler_unlock(std::shared_ptr<Mutex>&, bool)' thread 7f7efb1cb700 time 2021-06-08 23:31:58.202888
src/journal/ObjectRecorder.cc: 367: FAILED ceph_assert(m_in_flight_callbacks)
Actions #5

Updated by Ilya Dryomov almost 3 years ago

It doesn't reproduce on master though, so it is definitely a missing backport. Perhaps https://github.com/ceph/ceph/pull/31392 was only part of the fix?

Actions #6

Updated by Ilya Dryomov almost 3 years ago

  • Related to Backport #42953: nautilus: unittest_journal failure during make check added
Actions #7

Updated by Ilya Dryomov almost 3 years ago

  • Related to deleted (Backport #42953: nautilus: unittest_journal failure during make check)
Actions #8

Updated by Ilya Dryomov almost 3 years ago

  • Related to Bug #42598: unittest_journal failure during make check added
Actions #9

Updated by Ilya Dryomov almost 3 years ago

  • Related to Bug #47880: [journal] object recorder can race while lock is temporarily release for callbacks added
Actions #10

Updated by Ilya Dryomov almost 3 years ago

  • Subject changed from object_recorder->is_closed() assert failure in JournalRecorder::open_object_set() to object_recorder->is_closed() assert failure in JournalRecorder::open_object_set() in nautilus
Actions #11

Updated by Ilya Dryomov about 2 years ago

  • Status changed from New to Won't Fix - EOL

nautilus is EOL

Actions

Also available in: Atom PDF