Bug #1486
osd: 0-length meta/pginfo_* files
0%
Description
Linux 3.0, Ceph 0.34 + patches (5ae3e13617c9a63d12d12c8506daefd2be14677d, f13ad83d43e883938369e7c06574daa8ff2fc4ee, f1cae577e3730a73dd5478785160745150095af5, 82776e155af9eb67084e1d8f1d6d0093be51cbd0).
Reproducible on start-up, running with all debug logging switched on.
2011-09-01 10:54:05.538156 7f081f4bc720 filestore(/srv/osd13) lfn_get cid=meta oid=pglog_1.136/0 pathname=/srv/osd13/current/meta/pglog_1.136_0 lfn=pglog_1.136_0 is_lfn=0 2011-09-01 10:54:05.538177 7f081f4bc720 filestore(/srv/osd13) FileStore::read meta/pglog_1.136/0 0~273/273 2011-09-01 10:54:05.538192 7f081f4bc720 osd13 7581 pg[1.136( v 6508'159 (259'156,6508'159] n=0 ec=2 les/c 7452/7573 7574/7574/6913) [] r=0 (info mismatch, log(259'156,0'0]) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inacti ve] read_log 0 259'157 (259'156) m 200.00000034/head by mds0.3:3781 2011-08-10 06:20:03.804552 2011-09-01 10:54:05.538203 7f081f4bc720 osd13 7581 pg[1.136( v 6508'159 (259'156,6508'159] n=0 ec=2 les/c 7452/7573 7574/7574/6913) [] r=0 (info mismatch, log(259'156,0'0]) (log bound mismatch, actual=[259'157,259'157]) lcod 0 '0 mlcod 0'0 inactive] read_log 91 259'158 (259'157) m 200.00000034/head by mds0.3:3782 2011-08-10 06:20:03.809742 2011-09-01 10:54:05.538212 7f081f4bc720 osd13 7581 pg[1.136( v 6508'159 (259'156,6508'159] n=0 ec=2 les/c 7452/7573 7574/7574/6913) [] r=0 (info mismatch, log(259'156,0'0]) (log bound mismatch, actual=[259'157,259'158]) lcod 0 '0 mlcod 0'0 inactive] read_log 182 6508'159 (259'158) - 200.00000034/head by mds0.69:717 2011-08-30 10:19:25.620190 2011-09-01 10:54:05.538224 7f081f4bc720 osd13 7581 pg[1.136( v 6508'159 (259'156,6508'159] n=0 ec=2 les/c 7452/7573 7574/7574/6913) [] r=0 lcod 0'0 mlcod 0'0 inactive] read_log done 2011-09-01 10:54:05.538237 7f081f4bc720 osd13 7581 pg[1.136( v 6508'159 (259'156,6508'159] n=0 ec=2 les/c 7452/7573 7574/7574/6913) []/[13,0] r=0 lcod 0'0 mlcod 0'0 inactive] handle_backlog_loaded 2011-09-01 10:54:05.538245 7f081f4bc720 osd13 7581 pg[1.136( v 6508'159 (259'156,6508'159] n=0 ec=2 les/c 7452/7573 7574/7574/6913) []/[13,0] r=0 lcod 0'0 mlcod 0'0 inactive] exit Initial 0.015966 0 0.000000 2011-09-01 10:54:05.538252 7f081f4bc720 osd13 7581 pg[1.136( v 6508'159 (259'156,6508'159] n=0 ec=2 les/c 7452/7573 7574/7574/6913) []/[13,0] r=0 lcod 0'0 mlcod 0'0 inactive] enter Reset 2011-09-01 10:54:05.538258 7f081f4bc720 osd13 7581 load_pgs loaded pg[1.136( v 6508'159 (259'156,6508'159] n=0 ec=2 les/c 7452/7573 7574/7574/6913) []/[13,0] r=0 lcod 0'0 mlcod 0'0 inactive] log(259'156,6508'159] 2011-09-01 10:54:05.538263 7f081f4bc720 osd13 7581 _open_lock_pg 1.78 2011-09-01 10:54:05.538266 7f081f4bc720 osd13 7581 _get_pool 1 107 -> 108 2011-09-01 10:54:05.538278 7f081f4bc720 osd13 7581 pg[1.78( DNE empty n=0 ec=0 les/c 0/0 0/0/0) [] r=0 mlcod 0'0 inactive] enter Initial 2011-09-01 10:54:05.538286 7f081f4bc720 osd13 7581 pg[1.78( DNE empty n=0 ec=0 les/c 0/0 0/0/0) [] r=0 mlcod 0'0 inactive] enter NotTrimming 2011-09-01 10:54:05.538290 7f081f4bc720 filestore(/srv/osd13) collection_getattr /srv/osd13/current/1.78_head 'info' 2011-09-01 10:54:05.538309 7f081f4bc720 filestore(/srv/osd13) collection_getattr /srv/osd13/current/1.78_head 'info' = 307 2011-09-01 10:54:05.538315 7f081f4bc720 filestore(/srv/osd13) read meta/pginfo_1.78/0 0~0 2011-09-01 10:54:05.538320 7f081f4bc720 filestore(/srv/osd13) lfn_get cid=meta oid=pginfo_1.78/0 pathname=/srv/osd13/current/meta/pginfo_1.78_0 lfn=pginfo_1.78_0 is_lfn=0 2011-09-01 10:54:05.538331 7f081f4bc720 filestore(/srv/osd13) FileStore::read meta/pginfo_1.78/0 0~0/0 *** Caught signal (Aborted) ** in thread 0x7f081f4bc720 ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6) 1: cosd() [0x56f084] 2: (()+0xf7c0) [0x7f081f0c77c0] 3: (gsignal()+0x35) [0x7f081d842725] 4: (abort()+0x17b) [0x7f081d843b9b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f081e0c68bd] 6: (()+0xbea66) [0x7f081e0c4a66] 7: (()+0xbea93) [0x7f081e0c4a93] 8: (()+0xbeb8e) [0x7f081e0c4b8e] 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) [0x5a9867] 10: (void decode<unsigned int, PG::Interval>(std::map<unsigned int, PG::Interval, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, PG::Interval> > >&, ceph::buffer::list::iterator&)+0x31) [0x6321b1] 11: (PG::read_state(ObjectStore*)+0x310) [0x627bc0] 12: (OSD::load_pgs()+0x3a0) [0x517a60] 13: (OSD::init()+0x61a) [0x526ffa] 14: (main()+0x1e66) [0x4a4576] 15: (__libc_start_main()+0xed) [0x7f081d82f13d] 16: cosd() [0x4a5e79]
Related issues
History
#1 Updated by Damien Churchill over 12 years ago
#2 Updated by Damien Churchill over 12 years ago
- File osd.13.log View added
#3 Updated by Damien Churchill over 12 years ago
- File osd.11.log View added
#4 Updated by Damien Churchill over 12 years ago
I've attached the logs from the 3 OSDs exhibiting this behaviour. They're unable to start at the moment. Thanks to the commits last night though I've got 3 of the crashed ones starting again now.
#5 Updated by Sage Weil over 12 years ago
- Category set to OSD
- Priority changed from Normal to High
- Target version set to v0.35
This is the same crash Wido was seeing a couple weeks ago. The problem is the 0-length pglog files, which shouldn't exist in the first place.
Do you have logs from before you failed to restart? We need to figure out where they came from, either from someone running with logs and hitting the issue, or figuring out how to reproduce it (and then turning up logging).
#6 Updated by Sage Weil over 12 years ago
- Subject changed from osd: FileStore::read meta crash to osd: 0-length meta/pglog_* files
#7 Updated by Damien Churchill over 12 years ago
I'm afraid I don't have anything decent. I'm going to sort out a way I can store all the log data and run with a higher logging level. Will removing the offending files allow the osd to start?
#8 Updated by Sage Weil over 12 years ago
- Target version changed from v0.35 to v0.36
#9 Updated by Sage Weil over 12 years ago
- translation missing: en.field_position set to 11
#10 Updated by Sage Weil over 12 years ago
- Subject changed from osd: 0-length meta/pglog_* files to osd: 0-length meta/pginfo_* files
#11 Updated by Sage Weil over 12 years ago
- Target version changed from v0.36 to v0.37
#12 Updated by Sage Weil over 12 years ago
- Target version changed from v0.37 to v0.38
#13 Updated by Sage Weil over 12 years ago
I saw this on alexandria, and it was caused by:
1- EMFILE (too many open files)
2- filestore wasn't assering on unexpected errors
3- then a commit happened
not sure what the EMFILE cause was, but I'm pretty sure #2 is the root cause of this. fixed by b5c606230f7e002115d3b86e64dac9dbb4ffedef
#14 Updated by Sage Weil over 12 years ago
- Status changed from New to Resolved