Project

General

Profile

Bug #1486

osd: 0-length meta/pginfo_* files

Added by Damien Churchill about 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
OSD
Target version:
Start date:
09/01/2011
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Linux 3.0, Ceph 0.34 + patches (5ae3e13617c9a63d12d12c8506daefd2be14677d, f13ad83d43e883938369e7c06574daa8ff2fc4ee, f1cae577e3730a73dd5478785160745150095af5, 82776e155af9eb67084e1d8f1d6d0093be51cbd0).

Reproducible on start-up, running with all debug logging switched on.

2011-09-01 10:54:05.538156 7f081f4bc720 filestore(/srv/osd13) lfn_get cid=meta oid=pglog_1.136/0 pathname=/srv/osd13/current/meta/pglog_1.136_0 lfn=pglog_1.136_0 is_lfn=0
2011-09-01 10:54:05.538177 7f081f4bc720 filestore(/srv/osd13) FileStore::read meta/pglog_1.136/0 0~273/273
2011-09-01 10:54:05.538192 7f081f4bc720 osd13 7581 pg[1.136( v 6508'159 (259'156,6508'159] n=0 ec=2 les/c 7452/7573 7574/7574/6913) [] r=0 (info mismatch, log(259'156,0'0]) (log bound mismatch, empty) lcod 0'0 mlcod 0'0 inacti
ve] read_log 0 259'157 (259'156) m 200.00000034/head by mds0.3:3781 2011-08-10 06:20:03.804552
2011-09-01 10:54:05.538203 7f081f4bc720 osd13 7581 pg[1.136( v 6508'159 (259'156,6508'159] n=0 ec=2 les/c 7452/7573 7574/7574/6913) [] r=0 (info mismatch, log(259'156,0'0]) (log bound mismatch, actual=[259'157,259'157]) lcod 0
'0 mlcod 0'0 inactive] read_log 91 259'158 (259'157) m 200.00000034/head by mds0.3:3782 2011-08-10 06:20:03.809742
2011-09-01 10:54:05.538212 7f081f4bc720 osd13 7581 pg[1.136( v 6508'159 (259'156,6508'159] n=0 ec=2 les/c 7452/7573 7574/7574/6913) [] r=0 (info mismatch, log(259'156,0'0]) (log bound mismatch, actual=[259'157,259'158]) lcod 0
'0 mlcod 0'0 inactive] read_log 182 6508'159 (259'158) - 200.00000034/head by mds0.69:717 2011-08-30 10:19:25.620190
2011-09-01 10:54:05.538224 7f081f4bc720 osd13 7581 pg[1.136( v 6508'159 (259'156,6508'159] n=0 ec=2 les/c 7452/7573 7574/7574/6913) [] r=0 lcod 0'0 mlcod 0'0 inactive] read_log done
2011-09-01 10:54:05.538237 7f081f4bc720 osd13 7581 pg[1.136( v 6508'159 (259'156,6508'159] n=0 ec=2 les/c 7452/7573 7574/7574/6913) []/[13,0] r=0 lcod 0'0 mlcod 0'0 inactive] handle_backlog_loaded
2011-09-01 10:54:05.538245 7f081f4bc720 osd13 7581 pg[1.136( v 6508'159 (259'156,6508'159] n=0 ec=2 les/c 7452/7573 7574/7574/6913) []/[13,0] r=0 lcod 0'0 mlcod 0'0 inactive] exit Initial 0.015966 0 0.000000
2011-09-01 10:54:05.538252 7f081f4bc720 osd13 7581 pg[1.136( v 6508'159 (259'156,6508'159] n=0 ec=2 les/c 7452/7573 7574/7574/6913) []/[13,0] r=0 lcod 0'0 mlcod 0'0 inactive] enter Reset
2011-09-01 10:54:05.538258 7f081f4bc720 osd13 7581 load_pgs loaded pg[1.136( v 6508'159 (259'156,6508'159] n=0 ec=2 les/c 7452/7573 7574/7574/6913) []/[13,0] r=0 lcod 0'0 mlcod 0'0 inactive] log(259'156,6508'159]
2011-09-01 10:54:05.538263 7f081f4bc720 osd13 7581 _open_lock_pg 1.78
2011-09-01 10:54:05.538266 7f081f4bc720 osd13 7581 _get_pool 1 107 -> 108
2011-09-01 10:54:05.538278 7f081f4bc720 osd13 7581 pg[1.78( DNE empty n=0 ec=0 les/c 0/0 0/0/0) [] r=0 mlcod 0'0 inactive] enter Initial
2011-09-01 10:54:05.538286 7f081f4bc720 osd13 7581 pg[1.78( DNE empty n=0 ec=0 les/c 0/0 0/0/0) [] r=0 mlcod 0'0 inactive] enter NotTrimming
2011-09-01 10:54:05.538290 7f081f4bc720 filestore(/srv/osd13) collection_getattr /srv/osd13/current/1.78_head 'info'
2011-09-01 10:54:05.538309 7f081f4bc720 filestore(/srv/osd13) collection_getattr /srv/osd13/current/1.78_head 'info' = 307
2011-09-01 10:54:05.538315 7f081f4bc720 filestore(/srv/osd13) read meta/pginfo_1.78/0 0~0
2011-09-01 10:54:05.538320 7f081f4bc720 filestore(/srv/osd13) lfn_get cid=meta oid=pginfo_1.78/0 pathname=/srv/osd13/current/meta/pginfo_1.78_0 lfn=pginfo_1.78_0 is_lfn=0
2011-09-01 10:54:05.538331 7f081f4bc720 filestore(/srv/osd13) FileStore::read meta/pginfo_1.78/0 0~0/0
*** Caught signal (Aborted) **
 in thread 0x7f081f4bc720
 ceph version 0.34 (commit:2f039eeeb745622b866d80feda7afa055e15f6d6)
 1: cosd() [0x56f084]
 2: (()+0xf7c0) [0x7f081f0c77c0]
 3: (gsignal()+0x35) [0x7f081d842725]
 4: (abort()+0x17b) [0x7f081d843b9b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f081e0c68bd]
 6: (()+0xbea66) [0x7f081e0c4a66]
 7: (()+0xbea93) [0x7f081e0c4a93]
 8: (()+0xbeb8e) [0x7f081e0c4b8e]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) [0x5a9867]
 10: (void decode<unsigned int, PG::Interval>(std::map<unsigned int, PG::Interval, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, PG::Interval> > >&, ceph::buffer::list::iterator&)+0x31) [0x6321b1]
 11: (PG::read_state(ObjectStore*)+0x310) [0x627bc0]
 12: (OSD::load_pgs()+0x3a0) [0x517a60]
 13: (OSD::init()+0x61a) [0x526ffa]
 14: (main()+0x1e66) [0x4a4576]
 15: (__libc_start_main()+0xed) [0x7f081d82f13d]
 16: cosd() [0x4a5e79]

osd.3.log View (589 KB) Damien Churchill, 09/01/2011 03:08 AM

osd.13.log View (1.53 MB) Damien Churchill, 09/01/2011 03:09 AM

osd.11.log View (10 MB) Damien Churchill, 09/01/2011 03:13 AM


Related issues

Duplicates Ceph - Bug #1356: OSD crashes during recovery with OSDMap::decode(ceph::buffer::list&) Can't reproduce 08/04/2011

History

#1 Updated by Damien Churchill about 8 years ago

#2 Updated by Damien Churchill about 8 years ago

#3 Updated by Damien Churchill about 8 years ago

#4 Updated by Damien Churchill about 8 years ago

I've attached the logs from the 3 OSDs exhibiting this behaviour. They're unable to start at the moment. Thanks to the commits last night though I've got 3 of the crashed ones starting again now.

#5 Updated by Sage Weil about 8 years ago

  • Category set to OSD
  • Priority changed from Normal to High
  • Target version set to v0.35

This is the same crash Wido was seeing a couple weeks ago. The problem is the 0-length pglog files, which shouldn't exist in the first place.

Do you have logs from before you failed to restart? We need to figure out where they came from, either from someone running with logs and hitting the issue, or figuring out how to reproduce it (and then turning up logging).

#6 Updated by Sage Weil about 8 years ago

  • Subject changed from osd: FileStore::read meta crash to osd: 0-length meta/pglog_* files

#7 Updated by Damien Churchill about 8 years ago

I'm afraid I don't have anything decent. I'm going to sort out a way I can store all the log data and run with a higher logging level. Will removing the offending files allow the osd to start?

#8 Updated by Sage Weil about 8 years ago

  • Target version changed from v0.35 to v0.36

#9 Updated by Sage Weil about 8 years ago

  • translation missing: en.field_position set to 11

#10 Updated by Sage Weil about 8 years ago

  • Subject changed from osd: 0-length meta/pglog_* files to osd: 0-length meta/pginfo_* files

#11 Updated by Sage Weil about 8 years ago

  • Target version changed from v0.36 to v0.37

#12 Updated by Sage Weil about 8 years ago

  • Target version changed from v0.37 to v0.38

#13 Updated by Sage Weil about 8 years ago

I saw this on alexandria, and it was caused by:

1- EMFILE (too many open files)
2- filestore wasn't assering on unexpected errors
3- then a commit happened

not sure what the EMFILE cause was, but I'm pretty sure #2 is the root cause of this. fixed by b5c606230f7e002115d3b86e64dac9dbb4ffedef

#14 Updated by Sage Weil about 8 years ago

  • Status changed from New to Resolved

Also available in: Atom PDF