Project

General

Profile

Actions

Bug #1471

closed

osd: destroy_collection on non-empty dir

Added by Damien Churchill over 12 years ago. Updated over 12 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
OSD
Target version:
% Done:

0%

Spent time:
Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This was running 0.34 installed from packages.

Last 30 lines of log:


2011-08-31 15:11:06.005343 7fbb3f5fa760 filestore(/srv/osd7) collection_setattr /srv/osd7/current/0.2a5_head 'info' len 611 = 611
2011-08-31 15:11:06.005349 7fbb3f5fa760 filestore(/srv/osd7) truncate meta/pginfo_0.2a5/0 size 0
2011-08-31 15:11:06.005355 7fbb3f5fa760 filestore(/srv/osd7) lfn_get cid=meta oid=pginfo_0.2a5/0 pathname=/srv/osd7/current/meta/pginfo_0.2a5_0 lfn=pginfo_0.2a5_0 is_lfn=0
2011-08-31 15:11:06.005383 7fbb3f5fa760 filestore(/srv/osd7) truncate meta/pginfo_0.2a5/0 size 0 = 0
2011-08-31 15:11:06.005389 7fbb3f5fa760 filestore(/srv/osd7) write meta/pginfo_0.2a5/0 0~40
2011-08-31 15:11:06.005395 7fbb3f5fa760 filestore(/srv/osd7) lfn_get cid=meta oid=pginfo_0.2a5/0 pathname=/srv/osd7/current/meta/pginfo_0.2a5_0 lfn=pginfo_0.2a5_0 is_lfn=0
2011-08-31 15:11:06.005408 7fbb3f5fa760 filestore(/srv/osd7) queue_flusher ep 0 fd 34 0~40 qlen 20
2011-08-31 15:11:06.005418 7fbb3f5fa760 filestore(/srv/osd7) write meta/pginfo_0.2a5/0 0~40 = 40
2011-08-31 15:11:06.005424 7fbb3f5fa760 journal read_entry 704397312 : seq 3733800 33 bytes
2011-08-31 15:11:06.005429 7fbb3f5fa760 filestore(/srv/osd7) _do_transaction on 0x22ad000
2011-08-31 15:11:06.005434 7fbb3f5fa760 journal read_entry 704405504 : seq 3733801 49 bytes
2011-08-31 15:11:06.005439 7fbb3f5fa760 filestore(/srv/osd7) _do_transaction on 0x22ad000
2011-08-31 15:11:06.005443 7fbb3f5fa760 filestore(/srv/osd7) _destroy_collection /srv/osd7/current/0.2a5_f
2011-08-31 15:11:06.005458 7fbb3f5fa760 filestore(/srv/osd7) _destroy_collection /srv/osd7/current/0.2a5_f = -39
*** Caught signal (Aborted) **
 in thread 0x7fbb3f5fa760
 ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
 1: /usr/bin/cosd() [0x659052]
 2: (()+0xfc60) [0x7fbb3f1f2c60]
 3: (gsignal()+0x35) [0x7fbb3d977d05]
 4: (abort()+0x186) [0x7fbb3d97bab6]
 5: (__assert_fail()+0xf5) [0x7fbb3d9707c5]
 6: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x3f76) [0x6bc146]
 7: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x76) [0x6bc7a6]
 8: (JournalingObjectStore::journal_replay(unsigned long)+0x773) [0x6cce43]
 9: (FileStore::mount()+0x172a) [0x6b3e9a]
 10: (OSD::init()+0x18d) [0x550c6d]
 11: (main()+0x2b56) [0x4b3b06]
 12: (__libc_start_main()+0xff) [0x7fbb3d962eff]
 13: /usr/bin/cosd() [0x4b0b99]

Files

osd.7.log (61.5 KB) osd.7.log Damien Churchill, 08/31/2011 07:13 AM
osd.8.log (290 KB) osd.8.log Damien Churchill, 08/31/2011 07:22 AM
osd.7.log.1 (67.1 KB) osd.7.log.1 Damien Churchill, 08/31/2011 02:43 PM
Actions #1

Updated by Damien Churchill over 12 years ago

Actions #2

Updated by Sage Weil over 12 years ago

  • Category set to OSD
  • Status changed from New to 4
  • Assignee set to Sage Weil
  • Target version set to v0.35

The assert.h #includes were screwed up so there's no line number in the log and I can't tell which assert this was hitting. Can you run cosd in the foreground and reproduce (-f on command line)? Probably 'cosd -f -i 7' will do it.

THanks!

Actions #3

Updated by Damien Churchill over 12 years ago

2011-08-31 20:17:13.966717 7fd5f3840760 filestore(/srv/osd7) collection_setattr /srv/osd7/current/0.2a5_head 'info' len 611
2011-08-31 20:17:13.966740 7fd5f3840760 filestore(/srv/osd7) collection_setattr /srv/osd7/current/0.2a5_head 'info' len 611 = 611
2011-08-31 20:17:13.966745 7fd5f3840760 filestore(/srv/osd7) truncate meta/pginfo_0.2a5/0 size 0
2011-08-31 20:17:13.966750 7fd5f3840760 filestore(/srv/osd7) lfn_get cid=meta oid=pginfo_0.2a5/0 pathname=/srv/osd7/current/meta/pginfo_0.2a5_0 lfn=pginfo_0.2a5_0 is_lfn=0
2011-08-31 20:17:13.966777 7fd5f3840760 filestore(/srv/osd7) truncate meta/pginfo_0.2a5/0 size 0 = 0
2011-08-31 20:17:13.966783 7fd5f3840760 filestore(/srv/osd7) write meta/pginfo_0.2a5/0 0~40
2011-08-31 20:17:13.966787 7fd5f3840760 filestore(/srv/osd7) lfn_get cid=meta oid=pginfo_0.2a5/0 pathname=/srv/osd7/current/meta/pginfo_0.2a5_0 lfn=pginfo_0.2a5_0 is_lfn=0
2011-08-31 20:17:13.966799 7fd5f3840760 filestore(/srv/osd7) queue_flusher ep 0 fd 34 0~40 qlen 20
2011-08-31 20:17:13.966803 7fd5f3840760 filestore(/srv/osd7) write meta/pginfo_0.2a5/0 0~40 = 40
2011-08-31 20:17:13.966806 7fd5f3840760 journal journal_replay: r = 0, op now seq 3733799
2011-08-31 20:17:13.966812 7fd5f3840760 journal read_entry 704397312 : seq 3733800 33 bytes
2011-08-31 20:17:13.966815 7fd5f3840760 journal journal_replay: applying op seq 3733800 (op_seq 3733799)
2011-08-31 20:17:13.966818 7fd5f3840760 filestore(/srv/osd7) _do_transaction on 0x29d0000
2011-08-31 20:17:13.966821 7fd5f3840760 journal journal_replay: r = 0, op now seq 3733800
2011-08-31 20:17:13.966825 7fd5f3840760 journal read_entry 704405504 : seq 3733801 49 bytes
2011-08-31 20:17:13.966828 7fd5f3840760 journal journal_replay: applying op seq 3733801 (op_seq 3733800)
2011-08-31 20:17:13.966831 7fd5f3840760 filestore(/srv/osd7) _do_transaction on 0x29d0000
2011-08-31 20:17:13.966835 7fd5f3840760 filestore(/srv/osd7) _destroy_collection /srv/osd7/current/0.2a5_f
2011-08-31 20:17:13.966848 7fd5f3840760 filestore(/srv/osd7) _destroy_collection /srv/osd7/current/0.2a5_f = -39
os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&)', in thread '0x7fd5f3840760'
os/FileStore.cc: 2539: FAILED assert(0 == "ENOTEMPTY suggests garbage data in osd data dir")
 ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x3f76) [0x6bc1b6]
 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x76) [0x6bc816]
 3: (JournalingObjectStore::journal_replay(unsigned long)+0x773) [0x6ccea3]
 4: (FileStore::mount()+0x172a) [0x6b3f0a]
 5: (OSD::init()+0x18d) [0x550c6d]
 6: (main()+0x2b56) [0x4b3b06]
 7: (__libc_start_main()+0xff) [0x7fd5f1ba8eff]
 8: /usr/bin/cosd() [0x4b0b99]
 ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x3f76) [0x6bc1b6]
 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x76) [0x6bc816]
 3: (JournalingObjectStore::journal_replay(unsigned long)+0x773) [0x6ccea3]
 4: (FileStore::mount()+0x172a) [0x6b3f0a]
 5: (OSD::init()+0x18d) [0x550c6d]
 6: (main()+0x2b56) [0x4b3b06]
 7: (__libc_start_main()+0xff) [0x7fd5f1ba8eff]
 8: /usr/bin/cosd() [0x4b0b99]
*** Caught signal (Aborted) **
 in thread 0x7fd5f3840760
 ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
 1: /usr/bin/cosd() [0x583642]
 2: (()+0xfc60) [0x7fd5f3438c60]
 3: (gsignal()+0x35) [0x7fd5f1bbdd05]
 4: (abort()+0x186) [0x7fd5f1bc1ab6]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5f2474d6d]
 6: (()+0xb9f16) [0x7fd5f2472f16]
 7: (()+0xb9f43) [0x7fd5f2472f43]
 8: (()+0xba03e) [0x7fd5f247303e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x396) [0x5a8876]
 10: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x3f76) [0x6bc1b6]
 11: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x76) [0x6bc816]
 12: (JournalingObjectStore::journal_replay(unsigned long)+0x773) [0x6ccea3]
 13: (FileStore::mount()+0x172a) [0x6b3f0a]
 14: (OSD::init()+0x18d) [0x550c6d]
 15: (main()+0x2b56) [0x4b3b06]
 16: (__libc_start_main()+0xff) [0x7fd5f1ba8eff]
Actions #5

Updated by Sage Weil over 12 years ago

  • Subject changed from osd crash to osd: destroy_collection on non-empty dir
  • Status changed from 4 to In Progress
Actions #6

Updated by Sage Weil over 12 years ago

  • Status changed from In Progress to Resolved

I think the root bug is fixed by f1cae577e3730a73dd5478785160745150095af5 and commit:e0776761a1d6866fda73fa5af217f5be4527d798.

Actions #7

Updated by Wido den Hollander over 12 years ago

I'm actually hitting the same bug with v0.37

It was time to upgrade my old (and good running!) 0.27 cluster to the latest version and I encountered this crash:

Oct 22 21:56:32 node03 osd.0[15433]: 7ff57621a720 filestore(/var/lib/ceph/osd.0) collection_setattr /var/lib/ceph/osd.0/current/3.51_head 'ondisklog' len 17 = 17
Oct 22 21:56:32 node03 osd.0[15433]: 7ff57621a720 filestore(/var/lib/ceph/osd.0) collection_setattr /var/lib/ceph/osd.0/current/3.51_head 'snap_collections' len 244
Oct 22 21:56:32 node03 osd.0[15433]: 7ff57621a720 filestore(/var/lib/ceph/osd.0) collection_setattr /var/lib/ceph/osd.0/current/3.51_head 'snap_collections' len 244 = 244
Oct 22 21:56:32 node03 osd.0[15433]: 7ff57621a720 filestore(/var/lib/ceph/osd.0) _destroy_collection /var/lib/ceph/osd.0/current/_temp
Oct 22 21:56:32 node03 osd.0[15433]: 7ff57621a720 filestore(/var/lib/ceph/osd.0) _destroy_collection /var/lib/ceph/osd.0/current/_temp = -39
Oct 22 21:56:32 node03 osd.0[15433]: os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&)', in thread '7ff57621a720'#012os/FileStore.cc: 2416: FAILED assert(0 == "ENOTEMPTY suggests garbage data in osd data dir")
Oct 22 21:56:32 node03 osd.0[15433]:  ceph version 0.37-96-g1b846f4 (commit:1b846f43f181ac31a1a1911c74b7cf3ae063f35d)#012 1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x410b) [0x6c181b]#012 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x75) [0x6c1e75]#012 3: (JournalingObjectStore::journal_replay(unsigned long)+0x793) [0x6ccfc3]#012 4: (FileStore::mount()+0x2803) [0x6abb13]#012 5: (OSD::convertfs(std::string const&, std::string const&)+0xc8) [0x524f18]#012 6: (main()+0x2c8d) [0x4a55dd]#012 7: (__libc_start_main()+0xfd) [0x7ff5745a9c4d]#012 8: /usr/bin/ceph-osd() [0x4a2529]

Indeed, the directory is not empty:

root@node03:/var/lib/ceph/osd.0/current/_temp# ls
rb.0.3.000000000486_40
root@node03:/var/lib/ceph/osd.0/current/_temp#

The other OSD's are still working their way through the upgrade: "Updating collection...."

Do I have one broken PG on this OSD?

Actions

Also available in: Atom PDF