Project

General

Profile

Actions

Bug #20303

closed

filejournal: Unable to read past sequence ... journal is corrupt

Added by Yuri Weinstein almost 7 years ago. Updated over 4 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
FileStore
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ceph.com/teuthology-2017-06-14_15:26:27-powercycle-master-distro-basic-smithi/
Job: 1285933
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2017-06-14_15:26:27-powercycle-master-distro-basic-smithi/1285933/teuthology.log

2017-06-14T16:34:23.904 INFO:tasks.ceph.osd.2.smithi171.stderr:2017-06-14 16:34:24.005013 7f594688eb80 -1 journal Unable to read past sequence 5171 but header indicates the journal has committed up through 5205, journal is corrupt
2017-06-14T16:34:23.905 INFO:tasks.ceph.osd.2.smithi171.stderr:*** Caught signal (Aborted) **
2017-06-14T16:34:23.905 INFO:tasks.ceph.osd.2.smithi171.stderr: in thread 7f594688eb80 thread_name:ceph-osd
2017-06-14T16:34:23.906 INFO:tasks.ceph.osd.2.smithi171.stderr: ceph version 12.0.3-1621-gb4e75a3 (b4e75a3990c66146424bc24ef8762dd4edb16794) luminous (dev)
2017-06-14T16:34:23.906 INFO:tasks.ceph.osd.2.smithi171.stderr: 1: (()+0x9a7a87) [0x7f5945f78a87]
2017-06-14T16:34:23.906 INFO:tasks.ceph.osd.2.smithi171.stderr: 2: (()+0x10330) [0x7f5943ad8330]
2017-06-14T16:34:23.906 INFO:tasks.ceph.osd.2.smithi171.stderr: 3: (gsignal()+0x37) [0x7f5942afcc37]
2017-06-14T16:34:23.906 INFO:tasks.ceph.osd.2.smithi171.stderr: 4: (abort()+0x148) [0x7f5942b00028]
2017-06-14T16:34:23.907 INFO:tasks.ceph.osd.2.smithi171.stderr: 5: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x8f7) [0x7f5945f13617]
2017-06-14T16:34:23.907 INFO:tasks.ceph.osd.2.smithi171.stderr: 6: (JournalingObjectStore::journal_replay(unsigned long)+0x1d8) [0x7f5945dc8548]
2017-06-14T16:34:23.907 INFO:tasks.ceph.osd.2.smithi171.stderr: 7: (FileStore::mount()+0x33f6) [0x7f5945da3256]
2017-06-14T16:34:23.907 INFO:tasks.ceph.osd.2.smithi171.stderr: 8: (OSD::init()+0x28f) [0x7f5945aafbef]
2017-06-14T16:34:23.907 INFO:tasks.ceph.osd.2.smithi171.stderr: 9: (main()+0x29ca) [0x7f59459baaca]
2017-06-14T16:34:23.907 INFO:tasks.ceph.osd.2.smithi171.stderr: 10: (__libc_start_main()+0xf5) [0x7f5942ae7f45]
2017-06-14T16:34:23.907 INFO:tasks.ceph.osd.2.smithi171.stderr: 11: (()+0x483d36) [0x7f5945a54d36]
2017-06-14T16:34:23.907 INFO:tasks.ceph.osd.2.smithi171.stderr:2017-06-14 16:34:24.006580 7f594688eb80 -1 *** Caught signal (Aborted) **
2017-06-14T16:34:23.907 INFO:tasks.ceph.osd.2.smithi171.stderr: in thread 7f594688eb80 thread_name:ceph-osd
2017-06-14T16:34:23.908 INFO:tasks.ceph.osd.2.smithi171.stderr:
2017-06-14T16:34:23.908 INFO:tasks.ceph.osd.2.smithi171.stderr: ceph version 12.0.3-1621-gb4e75a3 (b4e75a3990c66146424bc24ef8762dd4edb16794) luminous (dev)
Actions #1

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category set to Correctness/Safety
  • Priority changed from Normal to Immediate
  • Component(RADOS) FileStore added

Bumping this priority up since it's an assert on read of committed data, rather than a simple disk write error.

Actions #2

Updated by Sage Weil almost 7 years ago

  • Subject changed from "Caught signal " in powercycle-master-distro-basic-smithi to filejournal: Unable to read past sequence ... journal is corrupt
  • Status changed from New to 12
Actions #3

Updated by Sage Weil almost 7 years ago

  • Status changed from 12 to Need More Info

The logs end long before the event in question. I think in order for us to gather more useful logs for the powercycle tests we need to do something like nfs mount /var/lib/ceph from the mon node (which doesn't get powercycled), mount with sync, and set log max recent = a small number. That will be a fair bit of work to get right...

Actions #4

Updated by Greg Farnum almost 7 years ago

Hmm, seems like that might slow stuff down enough to make it an unrealistic model, so probably not something we should make the defaults. :/

Actions #5

Updated by Kefu Chai almost 7 years ago

have we spotted this problem recently after the first occurrence?

rerunning at http://pulpito.ceph.com/kchai-2017-07-10_10:29:54-powercycle-master-distro-basic-smithi/

Actions #6

Updated by Greg Farnum almost 7 years ago

I don't think we run the powercycle tests very often — they're hard on the hardware. This may not really be an immediate bug but it needed to get some attention (and it definitely makes me uneasy — it's the first time I've seen it reported in a place I couldn't wave it away as misconfigured hardware).

Actions #7

Updated by Kefu Chai almost 7 years ago

http://pulpito.ceph.com/kchai-2017-07-10_10:29:54-powercycle-master-distro-basic-smithi/ failed with

"2017-07-10 10:40:40.569343 mon.a mon.0 172.21.15.2:6789/0 493 : cluster [WRN] MDS health message (mds.0): Behind on trimming (62/30)" in cluster log

and rerunning again at http://pulpito.ceph.com/kchai-2017-07-11_02:07:09-powercycle-master-distro-basic-mira/

Actions #8

Updated by Kefu Chai almost 7 years ago

filed #20566 against fs for "Behind on trimming" warnings from MDS

Actions #9

Updated by Sage Weil almost 7 years ago

  • Assignee set to Kefu Chai
Actions #10

Updated by Sage Weil almost 7 years ago

  • Priority changed from Immediate to Urgent
Actions #11

Updated by Sage Weil about 6 years ago

  • Priority changed from Urgent to High
Actions #12

Updated by Josh Durgin over 4 years ago

  • Status changed from Need More Info to Can't reproduce
Actions

Also available in: Atom PDF