Bug #16773


FAILED assert(m_image_ctx.journal == nullptr)

Added by de lan almost 8 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Assignee: Jason Dillaman
Target version: -
% Done: 0%
Source: other
Backport: jewel
Regression: No
Severity: 3 - minor
ceph-qa-suite: rbd

Description

Hi!

When I ran the CI suite rbd:valgrind/{base/install.yaml clusters/{fixed-1.yaml openstack.yaml} fs/xfs.yaml validator/helgrind.yaml workloads/fsx.yaml}, the test failed with an assertion failure and a core dump.

The relevant log output:

2016-07-22T10:53:46.272 INFO:teuthology.orchestra.run.plana128.stdout:106 write 0x7eeea5f thru 0x7ef2835 (0x3dd7 bytes)
2016-07-22T10:53:49.121 INFO:teuthology.orchestra.run.plana128.stdout:108 punch from 0x970509 to 0x97dfa4, (0xda9b bytes)
2016-07-22T10:57:52.942 INFO:teuthology.orchestra.run.plana128.stdout:109 trunc from 0x7ef2836 to 0x72b847
2016-07-22T10:59:43.932 INFO:teuthology.orchestra.run.plana128.stdout:110 trunc from 0x72b847 to 0x2be90e9
2016-07-22T10:59:44.893 INFO:teuthology.orchestra.run.plana128.stderr:/srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-11.0.0-823-g80bec09/src/librbd/exclusive_lock/AcquireRequest.cc: In function 'void librbd::exclusive_lock::AcquireRequest<ImageCtxT>::apply() [with ImageCtxT = librbd::ImageCtx]' thread 18311700 time 2016-07-22 10:59:44.450770
2016-07-22T10:59:44.893 INFO:teuthology.orchestra.run.plana128.stderr:/srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-11.0.0-823-g80bec09/src/librbd/exclusive_lock/AcquireRequest.cc: 556: FAILED assert(m_image_ctx.journal == nullptr)
2016-07-22T10:59:44.954 INFO:teuthology.orchestra.run.plana128.stderr: ceph version v11.0.0-823-g80bec09 (80bec0939273dfc87322df06d531c9a65411af0c)
2016-07-22T10:59:44.955 INFO:teuthology.orchestra.run.plana128.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x55c5e4b]
2016-07-22T10:59:44.955 INFO:teuthology.orchestra.run.plana128.stderr: 2: (librbd::exclusive_lock::AcquireRequest<librbd::ImageCtx>::apply()+0x246) [0x4fdd706]
2016-07-22T10:59:44.956 INFO:teuthology.orchestra.run.plana128.stderr: 3: (librbd::exclusive_lock::AcquireRequest<librbd::ImageCtx>::send_open_journal()+0xaa) [0x4fdd7fa]
2016-07-22T10:59:44.956 INFO:teuthology.orchestra.run.plana128.stderr: 4: (librbd::exclusive_lock::AcquireRequest<librbd::ImageCtx>::handle_open_object_map(int*)+0x65) [0x4fde925]
2016-07-22T10:59:44.957 INFO:teuthology.orchestra.run.plana128.stderr: 5: (librbd::util::detail::C_StateCallbackAdapter<librbd::exclusive_lock::AcquireRequest<librbd::ImageCtx>, &librbd::exclusive_lock::AcquireRequest<librbd::ImageCtx>::handle_open_object_map, true>::complete(int)+0x1b) [0x4fdebcb]
2016-07-22T10:59:44.957 INFO:teuthology.orchestra.run.plana128.stderr: 6: (void librbd::util::detail::rados_state_callback<librbd::object_map::RefreshRequest<librbd::ImageCtx>, &librbd::object_map::RefreshRequest<librbd::ImageCtx>::handle_load, true>(void*, void*)+0x31) [0x5004e71]
2016-07-22T10:59:44.957 INFO:teuthology.orchestra.run.plana128.stderr: 7: (librados::C_AioComplete::finish(int)+0x1d) [0x554012d]
2016-07-22T10:59:44.958 INFO:teuthology.orchestra.run.plana128.stderr: 8: (()+0x1bf2f9) [0x55242f9]
2016-07-22T10:59:44.958 INFO:teuthology.orchestra.run.plana128.stderr: 9: (Finisher::finisher_thread_entry()+0x206) [0x55c50f6]
2016-07-22T10:59:44.959 INFO:teuthology.orchestra.run.plana128.stderr: 10: (()+0x9fa7) [0x4c30fa7]
2016-07-22T10:59:44.959 INFO:teuthology.orchestra.run.plana128.stderr: 11: (()+0x8182) [0xe3b7182]
2016-07-22T10:59:44.960 INFO:teuthology.orchestra.run.plana128.stderr: 12: (clone()+0x6d) [0xeee747d]
2016-07-22T10:59:44.960 INFO:teuthology.orchestra.run.plana128.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-07-22T10:59:52.850 INFO:teuthology.orchestra.run.plana128.stderr:bash: line 1: 4557 Killed adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/fsx_client.0.log --time-stamp=yes --tool=helgrind ceph_test_librbd_fsx -d -W -R -p 100 -P /home/ubuntu/cephtest/archive -r 1 -w 1 -t 1 -h 1 -l 134217728 -S 0 -N 1000 pool_client.0 image_client.0
2016-07-22T10:59:52.850 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
File "/home/foo2/src/teuthology_master/teuthology/parallel.py", line 83, in exit
for result in self:
File "/home/foo2/src/teuthology_master/teuthology/parallel.py", line 101, in next
resurrect_traceback(result)
File "/home/foo2/src/teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
return func(*args, **kwargs)
File "/home/foo2/src/ceph-qa-suite_master/tasks/rbd_fsx.py", line 102, in _run_one_client
remote.run(args=args)
File "/home/foo2/src/teuthology_master/teuthology/orchestra/remote.py", line 198, in run
r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
File "/home/foo2/src/teuthology_master/teuthology/orchestra/run.py", line 391, in run
r.wait()
File "/home/foo2/src/teuthology_master/teuthology/orchestra/run.py", line 121, in wait
label=self.label)
CommandFailedError: Command failed on plana128 with status 137: 'cd /home/ubuntu/cephtest && adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage valgrind --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/fsx_client.0.log --time-stamp=yes --tool=helgrind ceph_test_librbd_fsx -d -W -R -p 100 -P /home/ubuntu/cephtest/archive -r 1 -w 1 -t 1 -h 1 -l 134217728 -S 0 -N 1000 pool_client.0 image_client.0'
2016-07-22T10:59:52.858 ERROR:teuthology.run_tasks:Saw exception from tasks.
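For orientation, here is a compilable toy model of the acquire state-machine steps visible in frames 4 -> 3 -> 2 of the backtrace above. All names and structure are simplified assumptions for illustration, not the actual librbd source; the real steps complete asynchronously via callbacks.

#include <cassert>

struct Journal {};  // stand-in for librbd::Journal<ImageCtx>

struct ImageCtx {
  Journal *journal = nullptr;  // expected to stay unset until the lock is acquired
};

// Simplified acquire state machine mirroring the backtrace frames.
class AcquireRequest {
 public:
  explicit AcquireRequest(ImageCtx &ictx) : m_image_ctx(ictx) {}

  void handle_open_object_map(int /*r*/) {  // frame 4
    send_open_journal();
  }

  void send_open_journal() {                // frame 3
    apply();                                // journal "opened", now install it
  }

  void apply() {                            // frame 2, AcquireRequest.cc:556
    // Fires if something else already installed a journal on the image
    // context before the acquire path got here.
    assert(m_image_ctx.journal == nullptr);
    m_image_ctx.journal = &m_journal;
  }

 private:
  ImageCtx &m_image_ctx;
  Journal m_journal;  // the journal this request opened
};

int main() {
  ImageCtx ictx;
  AcquireRequest req(ictx);
  req.handle_open_object_map(0);  // passes: ictx.journal was still nullptr
  return 0;
}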


Files

log.txt (313 KB) - test log - de lan, 07/22/2016 06:54 AM

Related issues 1 (0 open, 1 closed)

Copied to rbd - Backport #17134: jewel: FAILED assert(m_image_ctx.journal == nullptr) (Resolved, Jason Dillaman)
#1 - Updated by Jason Dillaman over 7 years ago

  • Assignee deleted (Jason Dillaman)
#2 - Updated by Jason Dillaman over 7 years ago

Reproduced here: http://pulpito.ceph.com/jdillaman-2016-08-08_08:12:47-rbd-wip-jd-testing---basic-vps/354673/

Both cases seem to involve a shrink followed by an expand resize op (although that could be a coincidence).

#3 - Updated by Jason Dillaman over 7 years ago

If the rbd image header watch is lost ...

$302 = 0x7f69940b6a60 "librbd::ImageWatcher: 0x7f69940145b0 image watch failed: 140091431348880, (107)", <incomplete sequence \360>
$303 = 0x7f699403e530 "-- 172.21.2.17:0/4122897661 --> 172.21.2.17:6800/20170 -- osd_op(client.4117.0:345173 2.a80d9177 rbd_header.10151b739747 [watch unwatch cookie 140091431348880] snapc 0=[] ondisk+write+known_if_redirec"...
$304 = 0x7f699d88d5c0 "-- 172.21.2.17:0/4122897661 <== osd.0 172.21.2.17:6800/20170 908 ==== osd_op_reply(345173 rbd_header.10151b739747 [watch unwatch cookie 140091431348880] v0'0 uv2869 ondisk = 0) v7 ==== 143+0+0 (308780"...

... it's possible for a refresh request to be triggered by an incoming API action and to race with the background release/acquire of the image lock ...

$462 = 0x7f699401b830 "librbd::AioImageRequestWQ: block_writes: 0xa5bed00, num=1\270\001\224i\177" 
$463 = 0x7f6998016d50 "-- 172.21.2.17:0/4122897661 --> 172.21.2.17:6804/1604 -- osd_op(client.4117.0:345230 2.a87d6cb8 journal.10151b739747 [watch unwatch cookie 140091431418080] snapc 0=[] ondisk+write+known_if_redirected "...

... since both image::RefreshRequest and exclusive_lock::AcquireRequest can attempt to open the journal ...
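To make the suspected interleaving concrete, here is a minimal, compilable sketch of the race described above. It is an assumption-laden toy model, not the actual librbd code: the two functions stand in for the acquire and refresh paths, and the losing thread ordering is replayed sequentially so the failure is deterministic.

#include <cassert>

struct Journal {};

struct ImageCtx {
  Journal *journal = nullptr;
};

// Path A: the background lock re-acquire, which insists the journal slot
// is still empty (the assert from this ticket).
void acquire_path(ImageCtx &ictx, Journal *opened) {
  assert(ictx.journal == nullptr);  // FAILED assert(m_image_ctx.journal == nullptr)
  ictx.journal = opened;
}

// Path B: a refresh triggered by an incoming API call after the header
// watch was lost; it also opens and installs a journal.
void refresh_path(ImageCtx &ictx, Journal *opened) {
  if (ictx.journal == nullptr) {
    ictx.journal = opened;
  }
}

int main() {
  ImageCtx ictx;
  Journal ja, jb;

  // In librbd these two paths run on different threads; here the losing
  // interleaving is replayed in order: the refresh installs its journal
  // first, then the acquire path's assert fires.
  refresh_path(ictx, &jb);
  acquire_path(ictx, &ja);  // aborts, matching the backtrace above
  return 0;
}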

#4 - Updated by Jason Dillaman over 7 years ago

  • Status changed from New to In Progress
  • Assignee set to Jason Dillaman
  • Backport set to jewel
#5 - Updated by Jason Dillaman over 7 years ago

  • Status changed from In Progress to Fix Under Review
#6 - Updated by Mykola Golub over 7 years ago

  • Status changed from Fix Under Review to Pending Backport
#7 - Updated by Loïc Dachary over 7 years ago

  • Copied to Backport #17134: jewel: FAILED assert(m_image_ctx.journal == nullptr) added
#8 - Updated by Nathan Cutler about 7 years ago

  • Status changed from Pending Backport to Resolved