Bug #3600

rbd: assert in objectcacher destructor after flatten

Added by Josh Durgin about 8 years ago. Updated almost 8 years ago.


From ubuntu@teuthology:/a/teuthology-2012-12-09_19:00:03-regression-master-testing-gcov/10977:

2012-12-09T22:20:17.777 write     0xef4cf97 thru  0xef503e3       (0x344d bytes)
2012-12-09T22:20:18.566 write     0x18d5c0e5 thru 0x18d5dc04      (0x1b20 bytes)
2012-12-09T22:20:18.613 read      0x3fb40df thru  0x3fb4fc2       (0xee4 bytes)
2012-12-09T22:20:18.616 punch     from 0x5855ff3 to 0x58653fb, (0xf408 bytes)
2012-12-09T22:20:19.382 flatten
2012-12-09T22:20:20.280 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -- pg dump --format=json'
2012-12-09T22:20:20.335 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise -s'
2012-12-09T22:20:20.443 INFO:teuthology.task.thrashosds.ceph_manager:   health HEALTH_OK
   monmap e1: 1 mons at {a=}, election epoch 2, quorum 0 a
   osdmap e27: 6 osds: 6 up, 6 in
    pgmap v59: 80 pgs: 80 active+clean; 30382 KB data, 1195 MB used, 2609 GB / 2750 GB avail
   mdsmap e5: 1/1/1 up {0=a=up:active}
2012-12-09T22:20:20.443 INFO:teuthology.task.thrashosds.ceph_manager:recovered!
2012-12-09T22:20:20.443 INFO:teuthology.task.thrashosds.thrasher:choose_action: min_in 2 min_out 0 min_live 2 min_dead 0
2012-12-09T22:20:20.443 INFO:teuthology.task.thrashosds.thrasher:Removing osd 1, in_osds are: [2, 3, 4, 5, 1, 0]
2012-12-09T22:20:20.444 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise osd out 1'

2012-12-09T22:20:21.547 In function 'ObjectCacher::~ObjectCacher()' thread 7faabdf44780 time 2012-12-09 22:19:56.019607
2012-12-09T22:20:21.548 517: FAILED assert(!i->size())
2012-12-09T22:20:21.567 ceph version 0.55-180-gd4c6a22 (d4c6a22d798ff6de1bee31824144aa33bbf5e608)
2012-12-09T22:20:21.567 1: (ObjectCacher::~ObjectCacher()+0x26f) [0x7faabda838af]
2012-12-09T22:20:21.567 2: (librbd::ImageCtx::~ImageCtx()+0x42) [0x7faabda51432]
2012-12-09T22:20:21.568 3: (librbd::close_image(librbd::ImageCtx*)+0xbf) [0x7faabda66cef]
2012-12-09T22:20:21.568 4: (librbd::refresh_parent(librbd::ImageCtx*)+0xa1) [0x7faabda67c71]
2012-12-09T22:20:21.568 5: (librbd::ictx_refresh(librbd::ImageCtx*)+0xb8d) [0x7faabda5f46d]
2012-12-09T22:20:21.568 6: (librbd::ictx_check(librbd::ImageCtx*)+0x185) [0x7faabda61385]
2012-12-09T22:20:21.568 7: (librbd::info(librbd::ImageCtx*, rbd_image_info_t&, unsigned long)+0x121) [0x7faabda74b41]
2012-12-09T22:20:21.568 8: (rbd_stat()+0x11) [0x7faabda384c1]
2012-12-09T22:20:21.569 9: (check_size()+0x36) [0x404eb6]
2012-12-09T22:20:21.569 10: (test()+0x58d) [0x40707d]
2012-12-09T22:20:21.569 11: (main()+0xb33) [0x4039f3]
2012-12-09T22:20:21.569 12: (__libc_start_main()+0xed) [0x7faabc3ae76d]
2012-12-09T22:20:21.569 13: /tmp/cephtest/binary/usr/local/bin/test_librbd_fsx() [0x403ecd]

Related issues

Duplicates rbd - Bug #3664: osdc/ 517: FAILED assert(!i->size()) Resolved 03/11/2013


#1 Updated by Dan Mick about 8 years ago

Looking at the code, I don't really see a clean "stop caching" mechanism. While I look, I've hacked fsx to do only write/flatten/clone, and I'm running it on a 3-machine cluster and my local vstart cluster to try to reproduce the problem.

Also, the core file was corrupted again :(

It certainly seems possible that there could be I/Os in flight while the cache is being destroyed, and that seems bad. Maybe I'm missing the mechanism that prevents this.

#2 Updated by Dan Mick about 8 years ago

Hm, a second problem: both runs are deadlocked on the mutex IoCtxImpl::selfmanaged_snap_create::mylock. I'll look for the owner next.

#3 Updated by Dan Mick about 8 years ago

The hang in selfmanaged_snap_create seems like a monitor issue: bouncing the monitor, or sitting in gdb long enough for the client to reconnect to the monitor, resolves it. I've sent mail to Joao about the problem in hopes that he can reproduce/investigate.

#4 Updated by Sage Weil about 8 years ago

  • Priority changed from Urgent to High

#5 Updated by Joao Eduardo Luis about 8 years ago

I tried to reproduce this behavior, to no avail.

Some operations in the test do hang for a long time, but after discussing them with Dan, he told me those operations take a considerable amount of time anyway, so no luck there.

The logs weren't very useful in tracking down the problem either, given the low debug levels.

#6 Updated by Tamilarasi muthamizhan about 8 years ago

Recent log: ubuntu@teuthology:/a/teuthology-2012-12-27_19:00:03-regression-next-testing-basic/28713

#7 Updated by Dan Mick about 8 years ago

  • Assignee changed from Dan Mick to Josh Durgin

#8 Updated by Josh Durgin almost 8 years ago

  • Status changed from 12 to Duplicate
