Bug #16773

FAILED assert(m_image_ctx.journal == nullptr)

Added by de lan over 6 years ago. Updated almost 6 years ago.

Jason Dillaman
Target version:
% Done:


3 - minor
Affected Versions:
Pull request ID:
Crash signature (v1):
Crash signature (v2):



when i test the ci suite:rbd:valgrind/{base/install.yaml clusters/{fixed-1.yaml openstack.yaml} fs/xfs.yaml validator/helgrind.yaml workloads/fsx.yaml}
the test was failed by the core dump.

the info list:

2016-07-22T10:53:46.272 write 0x7eeea5f thru 0x7ef2835 (0x3dd7 bytes)
2016-07-22T10:53:49.121 punch from 0x970509 to 0x97dfa4, (0xda9b bytes)
2016-07-22T10:57:52.942 trunc from 0x7ef2836 to 0x72b847
2016-07-22T10:59:43.932 trunc from 0x72b847 to 0x2be90e9
2016-07-22T10:59:44.893 In function 'void librbd::exclusive_lock::AcquireRequest<ImageCtxT>::apply() [with ImageCtxT = librbd::ImageCtx]' thread 18311700 time 2016-07-22 10:59:44.450770
2016-07-22T10:59:44.893 556: FAILED assert(m_image_ctx.journal == nullptr)
2016-07-22T10:59:44.954 ceph version v11.0.0-823-g80bec09 (80bec0939273dfc87322df06d531c9a65411af0c)
2016-07-22T10:59:44.955 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x55c5e4b]
2016-07-22T10:59:44.955 2: (librbd::exclusive_lock::AcquireRequest<librbd::ImageCtx>::apply()+0x246) [0x4fdd706]
2016-07-22T10:59:44.956 3: (librbd::exclusive_lock::AcquireRequest<librbd::ImageCtx>::send_open_journal()+0xaa) [0x4fdd7fa]
2016-07-22T10:59:44.956 4: (librbd::exclusive_lock::AcquireRequest<librbd::ImageCtx>::handle_open_object_map(int*)+0x65) [0x4fde925]
2016-07-22T10:59:44.957 5: (librbd::util::detail::C_StateCallbackAdapter<librbd::exclusive_lock::AcquireRequest<librbd::ImageCtx>, &librbd::exclusive_lock::AcquireRequest<librbd::ImageCtx>::handle_open_object_map, true>::complete(int)+0x1b) [0x4fdebcb]
2016-07-22T10:59:44.957 6: (void librbd::util::detail::rados_state_callback<librbd::object_map::RefreshRequest<librbd::ImageCtx>, &librbd::object_map::RefreshRequest<librbd::ImageCtx>::handle_load, true>(void*, void*)+0x31) [0x5004e71]
2016-07-22T10:59:44.957 7: (librados::C_AioComplete::finish(int)+0x1d) [0x554012d]
2016-07-22T10:59:44.958 8: (()+0x1bf2f9) [0x55242f9]
2016-07-22T10:59:44.958 9: (Finisher::finisher_thread_entry()+0x206) [0x55c50f6]
2016-07-22T10:59:44.959 10: (()+0x9fa7) [0x4c30fa7]
2016-07-22T10:59:44.959 11: (()+0x8182) [0xe3b7182]
2016-07-22T10:59:44.960 12: (clone()+0x6d) [0xeee747d]
2016-07-22T10:59:44.960 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-07-22T10:59:52.850 line 1: 4557 Killed adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/fsx_client.0.log --time-stamp=yes --tool=helgrind ceph_test_librbd_fsx -d -W -R -p 100 -P /home/ubuntu/cephtest/archive -r 1 -w 1 -t 1 -h 1 -l 134217728 -S 0 -N 1000 pool_client.0 image_client.0
2016-07-22T10:59:52.850 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
File "/home/foo2/src/teuthology_master/teuthology/", line 83, in exit
for result in self:
File "/home/foo2/src/teuthology_master/teuthology/", line 101, in next
File "/home/foo2/src/teuthology_master/teuthology/", line 19, in capture_traceback
return func(*args, **kwargs)
File "/home/foo2/src/ceph-qa-suite_master/tasks/", line 102, in _run_one_client
File "/home/foo2/src/teuthology_master/teuthology/orchestra/", line 198, in run
r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
File "/home/foo2/src/teuthology_master/teuthology/orchestra/", line 391, in run
File "/home/foo2/src/teuthology_master/teuthology/orchestra/", line 121, in wait
CommandFailedError: Command failed on plana128 with status 137: 'cd /home/ubuntu/cephtest && adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/fsx_client.0.log --time-stamp=yes --tool=helgrind ceph_test_librbd_fsx -d -W -R -p 100 -P /home/ubuntu/cephtest/archive -r 1 -w 1 -t 1 -h 1 -l 134217728 -S 0 -N 1000 pool_client.0 image_client.0'
2016-07-22T10:59:52.858 ERROR:teuthology.run_tasks:Saw exception from tasks.

log.txt View - test log (313 KB) de lan, 07/22/2016 06:54 AM

Related issues

Copied to rbd - Backport #17134: jewel: FAILED assert(m_image_ctx.journal == nullptr) Resolved


#1 Updated by Jason Dillaman over 6 years ago

  • Assignee deleted (Jason Dillaman)

#2 Updated by Jason Dillaman over 6 years ago

Repeated here:

Both cases seem to involve a shrink followed by a expand resize op (although could be a coincidence).

#3 Updated by Jason Dillaman over 6 years ago

If the rbd image header watch is lost ...

$302 = 0x7f69940b6a60 "librbd::ImageWatcher: 0x7f69940145b0 image watch failed: 140091431348880, (107)", <incomplete sequence \360>
$303 = 0x7f699403e530 "-- --> -- osd_op(client.4117.0:345173 2.a80d9177 rbd_header.10151b739747 [watch unwatch cookie 140091431348880] snapc 0=[] ondisk+write+known_if_redirec"...
$304 = 0x7f699d88d5c0 "-- <== osd.0 908 ==== osd_op_reply(345173 rbd_header.10151b739747 [watch unwatch cookie 140091431348880] v0'0 uv2869 ondisk = 0) v7 ==== 143+0+0 (308780"...

... it's possible for a refresh request to be triggered by an incoming API action and to race with the background release/acquire of the image lock ...

$462 = 0x7f699401b830 "librbd::AioImageRequestWQ: block_writes: 0xa5bed00, num=1\270\001\224i\177" 
$463 = 0x7f6998016d50 "-- --> -- osd_op(client.4117.0:345230 2.a87d6cb8 journal.10151b739747 [watch unwatch cookie 140091431418080] snapc 0=[] ondisk+write+known_if_redirected "...

... since both image::RefreshRequest and exclusive_lock::AcquireRequest can attempt to open the journal ...

#4 Updated by Jason Dillaman over 6 years ago

  • Status changed from New to In Progress
  • Assignee set to Jason Dillaman
  • Backport set to jewel

#5 Updated by Jason Dillaman over 6 years ago

  • Status changed from In Progress to Fix Under Review

#6 Updated by Mykola Golub over 6 years ago

  • Status changed from Fix Under Review to Pending Backport

#7 Updated by Loïc Dachary over 6 years ago

  • Copied to Backport #17134: jewel: FAILED assert(m_image_ctx.journal == nullptr) added

#8 Updated by Nathan Cutler almost 6 years ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF