Project

General

Profile

Bug #1471

osd: destroy_collection on non-empty dir

Added by Damien Churchill about 8 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
OSD
Target version:
Start date:
08/31/2011
Due date:
% Done:

0%

Spent time:
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

This was running 0.34 installed from packages.

Last 30 lines of log:


2011-08-31 15:11:06.005343 7fbb3f5fa760 filestore(/srv/osd7) collection_setattr /srv/osd7/current/0.2a5_head 'info' len 611 = 611
2011-08-31 15:11:06.005349 7fbb3f5fa760 filestore(/srv/osd7) truncate meta/pginfo_0.2a5/0 size 0
2011-08-31 15:11:06.005355 7fbb3f5fa760 filestore(/srv/osd7) lfn_get cid=meta oid=pginfo_0.2a5/0 pathname=/srv/osd7/current/meta/pginfo_0.2a5_0 lfn=pginfo_0.2a5_0 is_lfn=0
2011-08-31 15:11:06.005383 7fbb3f5fa760 filestore(/srv/osd7) truncate meta/pginfo_0.2a5/0 size 0 = 0
2011-08-31 15:11:06.005389 7fbb3f5fa760 filestore(/srv/osd7) write meta/pginfo_0.2a5/0 0~40
2011-08-31 15:11:06.005395 7fbb3f5fa760 filestore(/srv/osd7) lfn_get cid=meta oid=pginfo_0.2a5/0 pathname=/srv/osd7/current/meta/pginfo_0.2a5_0 lfn=pginfo_0.2a5_0 is_lfn=0
2011-08-31 15:11:06.005408 7fbb3f5fa760 filestore(/srv/osd7) queue_flusher ep 0 fd 34 0~40 qlen 20
2011-08-31 15:11:06.005418 7fbb3f5fa760 filestore(/srv/osd7) write meta/pginfo_0.2a5/0 0~40 = 40
2011-08-31 15:11:06.005424 7fbb3f5fa760 journal read_entry 704397312 : seq 3733800 33 bytes
2011-08-31 15:11:06.005429 7fbb3f5fa760 filestore(/srv/osd7) _do_transaction on 0x22ad000
2011-08-31 15:11:06.005434 7fbb3f5fa760 journal read_entry 704405504 : seq 3733801 49 bytes
2011-08-31 15:11:06.005439 7fbb3f5fa760 filestore(/srv/osd7) _do_transaction on 0x22ad000
2011-08-31 15:11:06.005443 7fbb3f5fa760 filestore(/srv/osd7) _destroy_collection /srv/osd7/current/0.2a5_f
2011-08-31 15:11:06.005458 7fbb3f5fa760 filestore(/srv/osd7) _destroy_collection /srv/osd7/current/0.2a5_f = -39
*** Caught signal (Aborted) **
 in thread 0x7fbb3f5fa760
 ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
 1: /usr/bin/cosd() [0x659052]
 2: (()+0xfc60) [0x7fbb3f1f2c60]
 3: (gsignal()+0x35) [0x7fbb3d977d05]
 4: (abort()+0x186) [0x7fbb3d97bab6]
 5: (__assert_fail()+0xf5) [0x7fbb3d9707c5]
 6: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x3f76) [0x6bc146]
 7: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x76) [0x6bc7a6]
 8: (JournalingObjectStore::journal_replay(unsigned long)+0x773) [0x6cce43]
 9: (FileStore::mount()+0x172a) [0x6b3e9a]
 10: (OSD::init()+0x18d) [0x550c6d]
 11: (main()+0x2b56) [0x4b3b06]
 12: (__libc_start_main()+0xff) [0x7fbb3d962eff]
 13: /usr/bin/cosd() [0x4b0b99]

osd.7.log View (61.5 KB) Damien Churchill, 08/31/2011 07:13 AM

osd.8.log View (290 KB) Damien Churchill, 08/31/2011 07:22 AM

osd.7.log.1 (67.1 KB) Damien Churchill, 08/31/2011 02:43 PM

History

#1 Updated by Damien Churchill about 8 years ago

#2 Updated by Sage Weil about 8 years ago

  • Category set to OSD
  • Status changed from New to Feedback
  • Assignee set to Sage Weil
  • Target version set to v0.35

The assert.h #includes were screwed up so there's no line number in the log and I can't tell which assert this was hitting. Can you run cosd in the foreground and reproduce (-f on command line)? Probably 'cosd -f -i 7' will do it.

THanks!

#3 Updated by Damien Churchill about 8 years ago

2011-08-31 20:17:13.966717 7fd5f3840760 filestore(/srv/osd7) collection_setattr /srv/osd7/current/0.2a5_head 'info' len 611
2011-08-31 20:17:13.966740 7fd5f3840760 filestore(/srv/osd7) collection_setattr /srv/osd7/current/0.2a5_head 'info' len 611 = 611
2011-08-31 20:17:13.966745 7fd5f3840760 filestore(/srv/osd7) truncate meta/pginfo_0.2a5/0 size 0
2011-08-31 20:17:13.966750 7fd5f3840760 filestore(/srv/osd7) lfn_get cid=meta oid=pginfo_0.2a5/0 pathname=/srv/osd7/current/meta/pginfo_0.2a5_0 lfn=pginfo_0.2a5_0 is_lfn=0
2011-08-31 20:17:13.966777 7fd5f3840760 filestore(/srv/osd7) truncate meta/pginfo_0.2a5/0 size 0 = 0
2011-08-31 20:17:13.966783 7fd5f3840760 filestore(/srv/osd7) write meta/pginfo_0.2a5/0 0~40
2011-08-31 20:17:13.966787 7fd5f3840760 filestore(/srv/osd7) lfn_get cid=meta oid=pginfo_0.2a5/0 pathname=/srv/osd7/current/meta/pginfo_0.2a5_0 lfn=pginfo_0.2a5_0 is_lfn=0
2011-08-31 20:17:13.966799 7fd5f3840760 filestore(/srv/osd7) queue_flusher ep 0 fd 34 0~40 qlen 20
2011-08-31 20:17:13.966803 7fd5f3840760 filestore(/srv/osd7) write meta/pginfo_0.2a5/0 0~40 = 40
2011-08-31 20:17:13.966806 7fd5f3840760 journal journal_replay: r = 0, op now seq 3733799
2011-08-31 20:17:13.966812 7fd5f3840760 journal read_entry 704397312 : seq 3733800 33 bytes
2011-08-31 20:17:13.966815 7fd5f3840760 journal journal_replay: applying op seq 3733800 (op_seq 3733799)
2011-08-31 20:17:13.966818 7fd5f3840760 filestore(/srv/osd7) _do_transaction on 0x29d0000
2011-08-31 20:17:13.966821 7fd5f3840760 journal journal_replay: r = 0, op now seq 3733800
2011-08-31 20:17:13.966825 7fd5f3840760 journal read_entry 704405504 : seq 3733801 49 bytes
2011-08-31 20:17:13.966828 7fd5f3840760 journal journal_replay: applying op seq 3733801 (op_seq 3733800)
2011-08-31 20:17:13.966831 7fd5f3840760 filestore(/srv/osd7) _do_transaction on 0x29d0000
2011-08-31 20:17:13.966835 7fd5f3840760 filestore(/srv/osd7) _destroy_collection /srv/osd7/current/0.2a5_f
2011-08-31 20:17:13.966848 7fd5f3840760 filestore(/srv/osd7) _destroy_collection /srv/osd7/current/0.2a5_f = -39
os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&)', in thread '0x7fd5f3840760'
os/FileStore.cc: 2539: FAILED assert(0 == "ENOTEMPTY suggests garbage data in osd data dir")
 ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x3f76) [0x6bc1b6]
 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x76) [0x6bc816]
 3: (JournalingObjectStore::journal_replay(unsigned long)+0x773) [0x6ccea3]
 4: (FileStore::mount()+0x172a) [0x6b3f0a]
 5: (OSD::init()+0x18d) [0x550c6d]
 6: (main()+0x2b56) [0x4b3b06]
 7: (__libc_start_main()+0xff) [0x7fd5f1ba8eff]
 8: /usr/bin/cosd() [0x4b0b99]
 ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x3f76) [0x6bc1b6]
 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x76) [0x6bc816]
 3: (JournalingObjectStore::journal_replay(unsigned long)+0x773) [0x6ccea3]
 4: (FileStore::mount()+0x172a) [0x6b3f0a]
 5: (OSD::init()+0x18d) [0x550c6d]
 6: (main()+0x2b56) [0x4b3b06]
 7: (__libc_start_main()+0xff) [0x7fd5f1ba8eff]
 8: /usr/bin/cosd() [0x4b0b99]
*** Caught signal (Aborted) **
 in thread 0x7fd5f3840760
 ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
 1: /usr/bin/cosd() [0x583642]
 2: (()+0xfc60) [0x7fd5f3438c60]
 3: (gsignal()+0x35) [0x7fd5f1bbdd05]
 4: (abort()+0x186) [0x7fd5f1bc1ab6]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5f2474d6d]
 6: (()+0xb9f16) [0x7fd5f2472f16]
 7: (()+0xb9f43) [0x7fd5f2472f43]
 8: (()+0xba03e) [0x7fd5f247303e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x396) [0x5a8876]
 10: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x3f76) [0x6bc1b6]
 11: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x76) [0x6bc816]
 12: (JournalingObjectStore::journal_replay(unsigned long)+0x773) [0x6ccea3]
 13: (FileStore::mount()+0x172a) [0x6b3f0a]
 14: (OSD::init()+0x18d) [0x550c6d]
 15: (main()+0x2b56) [0x4b3b06]
 16: (__libc_start_main()+0xff) [0x7fd5f1ba8eff]

#4 Updated by Damien Churchill about 8 years ago

#5 Updated by Sage Weil about 8 years ago

  • Subject changed from osd crash to osd: destroy_collection on non-empty dir
  • Status changed from Feedback to In Progress

#6 Updated by Sage Weil about 8 years ago

  • Status changed from In Progress to Resolved

I think the root bug is fixed by f1cae577e3730a73dd5478785160745150095af5 and commit:e0776761a1d6866fda73fa5af217f5be4527d798.

#7 Updated by Wido den Hollander almost 8 years ago

I'm actually hitting the same bug with v0.37

It was time to upgrade my old (and good running!) 0.27 cluster to the latest version and I encountered this crash:

Oct 22 21:56:32 node03 osd.0[15433]: 7ff57621a720 filestore(/var/lib/ceph/osd.0) collection_setattr /var/lib/ceph/osd.0/current/3.51_head 'ondisklog' len 17 = 17
Oct 22 21:56:32 node03 osd.0[15433]: 7ff57621a720 filestore(/var/lib/ceph/osd.0) collection_setattr /var/lib/ceph/osd.0/current/3.51_head 'snap_collections' len 244
Oct 22 21:56:32 node03 osd.0[15433]: 7ff57621a720 filestore(/var/lib/ceph/osd.0) collection_setattr /var/lib/ceph/osd.0/current/3.51_head 'snap_collections' len 244 = 244
Oct 22 21:56:32 node03 osd.0[15433]: 7ff57621a720 filestore(/var/lib/ceph/osd.0) _destroy_collection /var/lib/ceph/osd.0/current/_temp
Oct 22 21:56:32 node03 osd.0[15433]: 7ff57621a720 filestore(/var/lib/ceph/osd.0) _destroy_collection /var/lib/ceph/osd.0/current/_temp = -39
Oct 22 21:56:32 node03 osd.0[15433]: os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&)', in thread '7ff57621a720'#012os/FileStore.cc: 2416: FAILED assert(0 == "ENOTEMPTY suggests garbage data in osd data dir")
Oct 22 21:56:32 node03 osd.0[15433]:  ceph version 0.37-96-g1b846f4 (commit:1b846f43f181ac31a1a1911c74b7cf3ae063f35d)#012 1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x410b) [0x6c181b]#012 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x75) [0x6c1e75]#012 3: (JournalingObjectStore::journal_replay(unsigned long)+0x793) [0x6ccfc3]#012 4: (FileStore::mount()+0x2803) [0x6abb13]#012 5: (OSD::convertfs(std::string const&, std::string const&)+0xc8) [0x524f18]#012 6: (main()+0x2c8d) [0x4a55dd]#012 7: (__libc_start_main()+0xfd) [0x7ff5745a9c4d]#012 8: /usr/bin/ceph-osd() [0x4a2529]

Indeed, the directory is not empty:

root@node03:/var/lib/ceph/osd.0/current/_temp# ls
rb.0.3.000000000486_40
root@node03:/var/lib/ceph/osd.0/current/_temp#

The other OSD's are still working their way through the upgrade: "Updating collection...."

Do I have one broken PG on this OSD?

Also available in: Atom PDF