Bug #1471
closed
osd: destroy_collection on non-empty dir
Added by Damien Churchill over 12 years ago.
Updated over 12 years ago.
Description
This was running 0.34 installed from packages.
Last 30 lines of log:
2011-08-31 15:11:06.005343 7fbb3f5fa760 filestore(/srv/osd7) collection_setattr /srv/osd7/current/0.2a5_head 'info' len 611 = 611
2011-08-31 15:11:06.005349 7fbb3f5fa760 filestore(/srv/osd7) truncate meta/pginfo_0.2a5/0 size 0
2011-08-31 15:11:06.005355 7fbb3f5fa760 filestore(/srv/osd7) lfn_get cid=meta oid=pginfo_0.2a5/0 pathname=/srv/osd7/current/meta/pginfo_0.2a5_0 lfn=pginfo_0.2a5_0 is_lfn=0
2011-08-31 15:11:06.005383 7fbb3f5fa760 filestore(/srv/osd7) truncate meta/pginfo_0.2a5/0 size 0 = 0
2011-08-31 15:11:06.005389 7fbb3f5fa760 filestore(/srv/osd7) write meta/pginfo_0.2a5/0 0~40
2011-08-31 15:11:06.005395 7fbb3f5fa760 filestore(/srv/osd7) lfn_get cid=meta oid=pginfo_0.2a5/0 pathname=/srv/osd7/current/meta/pginfo_0.2a5_0 lfn=pginfo_0.2a5_0 is_lfn=0
2011-08-31 15:11:06.005408 7fbb3f5fa760 filestore(/srv/osd7) queue_flusher ep 0 fd 34 0~40 qlen 20
2011-08-31 15:11:06.005418 7fbb3f5fa760 filestore(/srv/osd7) write meta/pginfo_0.2a5/0 0~40 = 40
2011-08-31 15:11:06.005424 7fbb3f5fa760 journal read_entry 704397312 : seq 3733800 33 bytes
2011-08-31 15:11:06.005429 7fbb3f5fa760 filestore(/srv/osd7) _do_transaction on 0x22ad000
2011-08-31 15:11:06.005434 7fbb3f5fa760 journal read_entry 704405504 : seq 3733801 49 bytes
2011-08-31 15:11:06.005439 7fbb3f5fa760 filestore(/srv/osd7) _do_transaction on 0x22ad000
2011-08-31 15:11:06.005443 7fbb3f5fa760 filestore(/srv/osd7) _destroy_collection /srv/osd7/current/0.2a5_f
2011-08-31 15:11:06.005458 7fbb3f5fa760 filestore(/srv/osd7) _destroy_collection /srv/osd7/current/0.2a5_f = -39
*** Caught signal (Aborted) **
in thread 0x7fbb3f5fa760
ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
1: /usr/bin/cosd() [0x659052]
2: (()+0xfc60) [0x7fbb3f1f2c60]
3: (gsignal()+0x35) [0x7fbb3d977d05]
4: (abort()+0x186) [0x7fbb3d97bab6]
5: (__assert_fail()+0xf5) [0x7fbb3d9707c5]
6: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x3f76) [0x6bc146]
7: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x76) [0x6bc7a6]
8: (JournalingObjectStore::journal_replay(unsigned long)+0x773) [0x6cce43]
9: (FileStore::mount()+0x172a) [0x6b3e9a]
10: (OSD::init()+0x18d) [0x550c6d]
11: (main()+0x2b56) [0x4b3b06]
12: (__libc_start_main()+0xff) [0x7fbb3d962eff]
13: /usr/bin/cosd() [0x4b0b99]
Files
- Category set to OSD
- Status changed from New to 4
- Assignee set to Sage Weil
- Target version set to v0.35
The assert.h #includes were screwed up so there's no line number in the log and I can't tell which assert this was hitting. Can you run cosd in the foreground and reproduce (-f on command line)? Probably 'cosd -f -i 7' will do it.
THanks!
2011-08-31 20:17:13.966717 7fd5f3840760 filestore(/srv/osd7) collection_setattr /srv/osd7/current/0.2a5_head 'info' len 611
2011-08-31 20:17:13.966740 7fd5f3840760 filestore(/srv/osd7) collection_setattr /srv/osd7/current/0.2a5_head 'info' len 611 = 611
2011-08-31 20:17:13.966745 7fd5f3840760 filestore(/srv/osd7) truncate meta/pginfo_0.2a5/0 size 0
2011-08-31 20:17:13.966750 7fd5f3840760 filestore(/srv/osd7) lfn_get cid=meta oid=pginfo_0.2a5/0 pathname=/srv/osd7/current/meta/pginfo_0.2a5_0 lfn=pginfo_0.2a5_0 is_lfn=0
2011-08-31 20:17:13.966777 7fd5f3840760 filestore(/srv/osd7) truncate meta/pginfo_0.2a5/0 size 0 = 0
2011-08-31 20:17:13.966783 7fd5f3840760 filestore(/srv/osd7) write meta/pginfo_0.2a5/0 0~40
2011-08-31 20:17:13.966787 7fd5f3840760 filestore(/srv/osd7) lfn_get cid=meta oid=pginfo_0.2a5/0 pathname=/srv/osd7/current/meta/pginfo_0.2a5_0 lfn=pginfo_0.2a5_0 is_lfn=0
2011-08-31 20:17:13.966799 7fd5f3840760 filestore(/srv/osd7) queue_flusher ep 0 fd 34 0~40 qlen 20
2011-08-31 20:17:13.966803 7fd5f3840760 filestore(/srv/osd7) write meta/pginfo_0.2a5/0 0~40 = 40
2011-08-31 20:17:13.966806 7fd5f3840760 journal journal_replay: r = 0, op now seq 3733799
2011-08-31 20:17:13.966812 7fd5f3840760 journal read_entry 704397312 : seq 3733800 33 bytes
2011-08-31 20:17:13.966815 7fd5f3840760 journal journal_replay: applying op seq 3733800 (op_seq 3733799)
2011-08-31 20:17:13.966818 7fd5f3840760 filestore(/srv/osd7) _do_transaction on 0x29d0000
2011-08-31 20:17:13.966821 7fd5f3840760 journal journal_replay: r = 0, op now seq 3733800
2011-08-31 20:17:13.966825 7fd5f3840760 journal read_entry 704405504 : seq 3733801 49 bytes
2011-08-31 20:17:13.966828 7fd5f3840760 journal journal_replay: applying op seq 3733801 (op_seq 3733800)
2011-08-31 20:17:13.966831 7fd5f3840760 filestore(/srv/osd7) _do_transaction on 0x29d0000
2011-08-31 20:17:13.966835 7fd5f3840760 filestore(/srv/osd7) _destroy_collection /srv/osd7/current/0.2a5_f
2011-08-31 20:17:13.966848 7fd5f3840760 filestore(/srv/osd7) _destroy_collection /srv/osd7/current/0.2a5_f = -39
os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&)', in thread '0x7fd5f3840760'
os/FileStore.cc: 2539: FAILED assert(0 == "ENOTEMPTY suggests garbage data in osd data dir")
ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x3f76) [0x6bc1b6]
2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x76) [0x6bc816]
3: (JournalingObjectStore::journal_replay(unsigned long)+0x773) [0x6ccea3]
4: (FileStore::mount()+0x172a) [0x6b3f0a]
5: (OSD::init()+0x18d) [0x550c6d]
6: (main()+0x2b56) [0x4b3b06]
7: (__libc_start_main()+0xff) [0x7fd5f1ba8eff]
8: /usr/bin/cosd() [0x4b0b99]
ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x3f76) [0x6bc1b6]
2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x76) [0x6bc816]
3: (JournalingObjectStore::journal_replay(unsigned long)+0x773) [0x6ccea3]
4: (FileStore::mount()+0x172a) [0x6b3f0a]
5: (OSD::init()+0x18d) [0x550c6d]
6: (main()+0x2b56) [0x4b3b06]
7: (__libc_start_main()+0xff) [0x7fd5f1ba8eff]
8: /usr/bin/cosd() [0x4b0b99]
*** Caught signal (Aborted) **
in thread 0x7fd5f3840760
ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
1: /usr/bin/cosd() [0x583642]
2: (()+0xfc60) [0x7fd5f3438c60]
3: (gsignal()+0x35) [0x7fd5f1bbdd05]
4: (abort()+0x186) [0x7fd5f1bc1ab6]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5f2474d6d]
6: (()+0xb9f16) [0x7fd5f2472f16]
7: (()+0xb9f43) [0x7fd5f2472f43]
8: (()+0xba03e) [0x7fd5f247303e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x396) [0x5a8876]
10: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x3f76) [0x6bc1b6]
11: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x76) [0x6bc816]
12: (JournalingObjectStore::journal_replay(unsigned long)+0x773) [0x6ccea3]
13: (FileStore::mount()+0x172a) [0x6b3f0a]
14: (OSD::init()+0x18d) [0x550c6d]
15: (main()+0x2b56) [0x4b3b06]
16: (__libc_start_main()+0xff) [0x7fd5f1ba8eff]
- Subject changed from osd crash to osd: destroy_collection on non-empty dir
- Status changed from 4 to In Progress
- Status changed from In Progress to Resolved
I'm actually hitting the same bug with v0.37
It was time to upgrade my old (and good running!) 0.27 cluster to the latest version and I encountered this crash:
Oct 22 21:56:32 node03 osd.0[15433]: 7ff57621a720 filestore(/var/lib/ceph/osd.0) collection_setattr /var/lib/ceph/osd.0/current/3.51_head 'ondisklog' len 17 = 17
Oct 22 21:56:32 node03 osd.0[15433]: 7ff57621a720 filestore(/var/lib/ceph/osd.0) collection_setattr /var/lib/ceph/osd.0/current/3.51_head 'snap_collections' len 244
Oct 22 21:56:32 node03 osd.0[15433]: 7ff57621a720 filestore(/var/lib/ceph/osd.0) collection_setattr /var/lib/ceph/osd.0/current/3.51_head 'snap_collections' len 244 = 244
Oct 22 21:56:32 node03 osd.0[15433]: 7ff57621a720 filestore(/var/lib/ceph/osd.0) _destroy_collection /var/lib/ceph/osd.0/current/_temp
Oct 22 21:56:32 node03 osd.0[15433]: 7ff57621a720 filestore(/var/lib/ceph/osd.0) _destroy_collection /var/lib/ceph/osd.0/current/_temp = -39
Oct 22 21:56:32 node03 osd.0[15433]: os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&)', in thread '7ff57621a720'#012os/FileStore.cc: 2416: FAILED assert(0 == "ENOTEMPTY suggests garbage data in osd data dir")
Oct 22 21:56:32 node03 osd.0[15433]: ceph version 0.37-96-g1b846f4 (commit:1b846f43f181ac31a1a1911c74b7cf3ae063f35d)#012 1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x410b) [0x6c181b]#012 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x75) [0x6c1e75]#012 3: (JournalingObjectStore::journal_replay(unsigned long)+0x793) [0x6ccfc3]#012 4: (FileStore::mount()+0x2803) [0x6abb13]#012 5: (OSD::convertfs(std::string const&, std::string const&)+0xc8) [0x524f18]#012 6: (main()+0x2c8d) [0x4a55dd]#012 7: (__libc_start_main()+0xfd) [0x7ff5745a9c4d]#012 8: /usr/bin/ceph-osd() [0x4a2529]
Indeed, the directory is not empty:
root@node03:/var/lib/ceph/osd.0/current/_temp# ls
rb.0.3.000000000486_40
root@node03:/var/lib/ceph/osd.0/current/_temp#
The other OSD's are still working their way through the upgrade: "Updating collection...."
Do I have one broken PG on this OSD?
Also available in: Atom
PDF