Bug #6003
journal Unable to read past sequence 406 ...
0%
Description
2013-08-15 12:44:50.877768 7f51d4230780 -1 journal Unable to read past sequence 406 but header indicates the journal has committed up through 416, journal is corrupt 2013-08-15 12:44:50.901410 7f51d4230780 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, bool*)' thread 7f51d4230780 time 2013-08-15 12:44:50.877826 os/FileJournal.cc: 1697: FAILED assert(0) ceph version 0.67-125-g4422f21 (4422f21a6586467a63ce6841552d0f60aa849cf1) 1: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x7d6) [0x77e966] 2: (FileJournal::open(unsigned long)+0x733) [0x77bf73] 3: (JournalingObjectStore::journal_replay(unsigned long)+0x271) [0x7cad61] 4: (FileStore::mount()+0x34f3) [0x7a1503] 5: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x66dcba] 6: (OSD::convertfs(std::string const&, std::string const&)+0x47) [0x66e737] 7: (main()+0x1c13) [0x5d0e13] 8: (__libc_start_main()+0xed) [0x7f51d1ccc76d] 9: ceph-osd() [0x5d479d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
job was
ubuntu@teuthology:/a/teuthology-2013-08-15_01:00:05-rados-next-testing-basic-plana/107661$ cat orig.config.yaml kernel: kdb: true sha1: 68d04c9a036bea12d76fdd51e926aef771e8e041 machine_type: plana nuke-on-error: true os_type: ubuntu overrides: admin_socket: branch: next ceph: conf: global: ms inject socket failures: 5000 mon: debug mon: 20 debug ms: 1 debug paxos: 20 fs: xfs log-whitelist: - slow request sha1: 4422f21a6586467a63ce6841552d0f60aa849cf1 ceph-deploy: branch: dev: next conf: client: log file: /var/log/ceph/ceph-$name.$pid.log mon: debug mon: 1 debug ms: 20 debug paxos: 20 install: ceph: sha1: 4422f21a6586467a63ce6841552d0f60aa849cf1 s3tests: branch: next workunit: sha1: 4422f21a6586467a63ce6841552d0f60aa849cf1 roles: - - mon.a - mon.c - osd.0 - osd.1 - osd.2 - - mon.b - mds.a - osd.3 - osd.4 - osd.5 - client.0 tasks: - chef: null - clock.check: null - install: null - ceph: log-whitelist: - wrongly marked me down - objects unfound and apparently lost - thrashosds: chance_pgnum_grow: 2 chance_pgpnum_fix: 1 timeout: 1200 - radosbench: clients: - client.0 time: 1800 teuthology_branch: next
Related issues
Associated revisions
common/buffer.cc: fix rebuild_page_aligned typo
Introduced: 66a9fbe2c7ba59b7cd034c17865adce3432cd2cb
Fixes: #6003
Signed-off-by: Samuel Just <sam.just@inktank.com>
common/buffer.cc: fix rebuild_page_aligned typo
Introduced: 66a9fbe2c7ba59b7cd034c17865adce3432cd2cb
Fixes: #6003
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit cd30e5fbb1d2d29b4fc0c2b13fa3b9b4493c24f0)
History
#1 Updated by Ian Colle over 9 years ago
- Assignee set to Samuel Just
#2 Updated by Sage Weil over 9 years ago
- Status changed from New to Need More Info
#3 Updated by Ian Colle over 9 years ago
Run this tasks repeatedly with logging
#4 Updated by Sage Weil over 9 years ago
- Assignee changed from Samuel Just to Sage Weil
#5 Updated by Sage Weil over 9 years ago
200 runs later and no luck reproducing this with logs.
/var/lib/teuthworker/archive/sage-bug-6003-a
200 passes
#6 Updated by Sage Weil over 9 years ago
- Priority changed from Urgent to High
#7 Updated by Sage Weil over 9 years ago
- Priority changed from High to Urgent
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-08-22_01:00:13-rados-next-testing-basic-plana/804
#8 Updated by Sage Weil over 9 years ago
teuthology-2013-08-23_01:00:10-rados-master-testing-basic-plana/1275
#9 Updated by Sage Weil over 9 years ago
ubuntu@teuthology:/a/teuthology-2013-09-04_17:24:00-rados-master-testing-basic-plana/21102
#10 Updated by Sage Weil over 9 years ago
ubuntu@teuthology:/a/teuthology-2013-09-04_17:24:00-rados-master-testing-basic-plana/21102
#11 Updated by Samuel Just over 9 years ago
ubuntu@teuthology:/a/teuthology-2013-10-10_22:04:00-rados-master-testing-basic-plana/44493/remote
#12 Updated by Sage Weil over 9 years ago
- Assignee deleted (
Sage Weil)
#13 Updated by Sage Weil over 9 years ago
ubuntu@teuthology:/a/teuthology-2013-10-23_19:00:21-rados-dumpling-testing-basic-plana/65408
#14 Updated by Sage Weil over 9 years ago
ubuntu@teuthology:/a/teuthology-2013-10-27_19:00:21-rados-dumpling-testing-basic-plana/71126
#15 Updated by David Zafman over 9 years ago
- File ceph-osd.3.log View added
I hit this same problem in recovery testing. Attached is the complete log.
#16 Updated by Greg Farnum over 9 years ago
/a/teuthology-2013-10-28_23:01:45-kcephfs-master-testing-basic-plana/73927
#17 Updated by Samuel Just over 9 years ago
slider:~samuelj/bug_logs/13-10-31-04:05:43/remote
With journal logging! I seem to be able to reproduce this pretty much constantly. ext4. David's was ext4 as well, iirc.
#18 Updated by Samuel Just over 9 years ago
Kernel: 3.8.0-21-generic
#19 Updated by Sage Weil over 9 years ago
- Status changed from Need More Info to 12
#20 Updated by David Zafman over 9 years ago
This looks like the same problem seen in a unit test.
2013-10-31T15:23:41.910 INFO:teuthology.orchestra.run.err:[10.214.132.32]: 2013-10-31 15:23:41.909749 7f87dfb7d780 1 journal _open /tmp/ceph_test_filejournal.tmp.558768921 fd 8: 209715200 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-10-31T15:23:41.986 INFO:teuthology.orchestra.run.out:[10.214.132.32]: test/test_filejournal.cc:199: Failure
2013-10-31T15:23:41.986 INFO:teuthology.orchestra.run.out:[10.214.132.32]: Value of: j.read_entry(inbl, seq)
2013-10-31T15:23:41.986 INFO:teuthology.orchestra.run.out:[10.214.132.32]: Actual: false
2013-10-31T15:23:41.986 INFO:teuthology.orchestra.run.out:[10.214.132.32]: Expected: true
2013-10-31T15:23:41.986 INFO:teuthology.orchestra.run.err:[10.214.132.32]: 2013-10-31 15:23:41.985155 7f87dfb7d780 -1 journal Unable to read past sequence 0 but header indicates the journal has committed up through 1, journal is corrupt
2013-10-31T15:23:41.989 INFO:teuthology.orchestra.run.out:[10.214.132.32]: [ FAILED ] TestFileJournal.WriteManyVecs (355 ms)
#21 Updated by Sage Weil over 9 years ago
- Assignee set to Samuel Just
#22 Updated by Samuel Just over 9 years ago
- Status changed from 12 to Need More Info
af0c5b415ad5a863940add7c2e8f3c1ac8040ef2 should fix the most recent iteration, but it doesn't explain the instances prior to the typo.
#23 Updated by Samuel Just about 9 years ago
- Assignee deleted (
Samuel Just)
#24 Updated by Samuel Just about 9 years ago
- Status changed from Need More Info to Can't reproduce
#25 Updated by Sage Weil about 9 years ago
- Status changed from Can't reproduce to 12
ubuntu@teuthology:/a/teuthology-2013-12-28_23:00:01-rados-next-distro-basic-plana/18074
-4> 2013-12-29 08:27:34.549948 7f602df89780 2 journal open /var/lib/ceph/osd/ceph-0/journal fsid dad1a3cb-5fce-44d2-ae19-1df8090c97fd fs_op_seq 1285 -3> 2013-12-29 08:27:34.549978 7f602df89780 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway -2> 2013-12-29 08:27:34.549998 7f602df89780 1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 20: 104857600 bytes, block size 4096 bytes, directio = 1, aio = 0 -1> 2013-12-29 08:27:34.571046 7f602df89780 -1 journal Unable to read past sequence 1274 but header indicates the journal has committed up through 1280, journal is corrupt 0> 2013-12-29 08:27:34.573614 7f602df89780 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, bool*)' thread 7f602df89780 time 2013-12-29 08:27:34.571062 os/FileJournal.cc: 1697: FAILED assert(0)
#26 Updated by Sage Weil almost 9 years ago
ubuntu@teuthology:/a/sage-2014-02-08_15:50:23-rados:thrash-wip-agent-testing-basic-plana/72854
#27 Updated by Sage Weil almost 9 years ago
- Status changed from 12 to Need More Info
#28 Updated by David Zafman almost 9 years ago
- Status changed from Need More Info to 12
/a/dzafman-2014-02-28_12:09:58-rados:thrash-wip-7458-testing-basic-plana/111738
2014-03-01 02:09:40.455986 7f2b0c373780 10 filestore(/var/lib/ceph/osd/ceph-2) open_journal at /var/lib/ceph/osd/ceph-2/journal 2014-03-01 02:09:40.456008 7f2b0c373780 0 filestore(/var/lib/ceph/osd/ceph-2) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled 2014-03-01 02:09:40.456014 7f2b0c373780 10 filestore(/var/lib/ceph/osd/ceph-2) list_collections 2014-03-01 02:09:40.456462 7f2b0c373780 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway 2014-03-01 02:09:40.456498 7f2b0c373780 1 journal _open /var/lib/ceph/osd/ceph-2/journal fd 20: 104857600 bytes, block size 4096 bytes, directio = 1, aio = 0 2014-03-01 02:09:40.474869 7f2b0c373780 -1 journal Unable to read past sequence 958 but header indicates the journal has committed up through 967, journal is corrupt 2014-03-01 02:09:40.489832 7f2b0c373780 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, bool*)' thread 7f2b0c373780 time 2014-03-01 02:09:40.474886 os/FileJournal.cc: 1673: FAILED assert(0) ceph version 0.77-618-g123ff9e (123ff9e18a27c6517a291aab1e0613b27d70d711) 1: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x86f) [0x9244af] 2: (FileJournal::open(unsigned long)+0x6ad) [0x921a7d] 3: (JournalingObjectStore::journal_replay(unsigned long)+0x14f) [0x779b6f] 4: (FileStore::mount()+0x30c2) [0x74e812] 5: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x6243ba] 6: (main()+0x1d88) [0x60e7b8] 7: (__libc_start_main()+0xed) [0x7f2b09fd076d] 8: ceph-osd() [0x6129b9]
#29 Updated by Sage Weil almost 9 years ago
- Severity changed from 3 - minor to 2 - major
#30 Updated by Sage Weil almost 9 years ago
- Status changed from 12 to Need More Info
I think we need a log for this.. hopefully we will hit it now that things are cranked up on the nightlies?
#31 Updated by Samuel Just almost 9 years ago
- Status changed from Need More Info to 12
#32 Updated by Sage Weil almost 9 years ago
- Status changed from 12 to Need More Info
#33 Updated by Sage Weil over 8 years ago
- Priority changed from Urgent to High
#34 Updated by Sage Weil over 8 years ago
ubuntu@teuthology:/a/teuthology-2014-06-15_02:30:07-rados-next-testing-basic-plana/308538 (btrfs)
the problem in this case looks like premature trimming. the filestore start seq is X and the first journal entry is X+20 or something. appears that perhaps the btrfs snap wait method returned before it was really durable?
#35 Updated by Sage Weil over 8 years ago
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2014-07-27_02:30:01-rados-next-testing-basic-plana/380261
rados/thrash/{clusters/fixed-2.yaml fs/ext4.yaml msgr-failures/osd-delay.yaml thrashers/default.yaml workloads/radosbench.yaml}
this time on ext4, not btrfs.
The header prints out valid info:
-6> 2014-07-27 15:46:20.583861 7f92aa089780 10 journal read_header -5> 2014-07-27 15:46:20.597188 7f92aa089780 10 journal header: block_size 4096 alignment 4096 max_size 104857600 -4> 2014-07-27 15:46:20.597210 7f92aa089780 10 journal header: start 100716544 -3> 2014-07-27 15:46:20.597213 7f92aa089780 10 journal write_pos 4096 -2> 2014-07-27 15:46:20.597220 7f92aa089780 10 journal open header.fsid = 0d6e2a30-a9e5-4c1d-ba4c-fd203e3cc64f
but a few lines down when we start reading entries we fail. printing the header shows bad info:
$27 = {flags = 41735368, fsid = {uuid = "\000\214{\002\000\000\000\000\300S\231\002\000\000\000"}, block_size = 15938360, alignment = 0, max_size = 140736923900312, start = 140267867036691, committed_up_to = 0, start_seq = 41651200}
including a corrupt stack from
#11 0x00000000009815ef in FileJournal::read_entry (this=0x53b, bl=..., next_seq=<optimized out>, corrupt=0x0) at os/FileJournal.cc:1677 1677 in os/FileJournal.cc
some of FileJournal fields are okay...
(gdb) p write_pos $21 = 10726185
others come from a bad header
(gdb) p read_pos $29 = 140736923900312 (gdb) p header.start $30 = 140267867036691
and other are also corrupt. fn follows header:
(gdb) p fn $31 = {static npos = <optimized out>, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x27cda80 "\275P\274"}}
and these precede it by a bunch
(gdb) p journaled_seq $33 = 140267894734912 (gdb) p /x journaled_seq $34 = 0x7f92aa0a0040 (gdb) p finisher_lock $35 = {name = 0x0, id = -564455648, recursive = 255, lockdep = 127, backtrace = false, _m = {__data = {__lock = 1315, __count = 0, __owner = 1698063408, __nusers = 808673586, __kind = 1698259245, __spins = 1664363829, __list = {__prev = 0x2d633461622d6431, __next = 0x6333653330326466}}, __size = "#\005\000\000\000\000\000\000\060d6e2a30-a9e5-4c1d-ba4c-fd203e3c", __align = 1315}, nlock = 1714697827, locked_by = 15961913567585126656, cct = 0x27bb230, logger = 0x27cd1d8}
and immediately prior is
(gdb) p completions_lock $39 = {name = 0x8 <Address 0x8 out of bounds>, id = -564455576, recursive = 255, lockdep = 127, backtrace = false, _m = {__data = { __lock = -1467469952, __count = 32658, __owner = 0, __nusers = 0, __kind = 41812256, __spins = 0, __list = {__prev = 0x27d76f0, __next = 0x7f92a8883200}}, __size = "\200/\210\250\222\177\000\000\000\000\000\000\000\000\000\000 \001~\002\000\000\000\000\360v}\002\000\000\000\000\000\062\210\250\222\177\000", __align = 140267869450112}, nlock = -1467468512, locked_by = 140267869451536, cct = 0x27ef480, logger = 0x27cd4c0} (gdb) p completions $40 = {<std::_Deque_base<FileJournal::completion_item, std::allocator<FileJournal::completion_item> >> = { _M_impl = {<std::allocator<FileJournal::completion_item>> = {<__gnu_cxx::new_allocator<FileJournal::completion_item>> = {<No data fields>}, <No data fields>}, _M_map = 0x27cd4c0, _M_map_size = 9215206, _M_start = {_M_cur = 0x0, _M_first = 0x27cd420, _M_last = 0x27cd420, _M_node = 0x27cd4c0}, _M_finish = {_M_cur = 0x0, _M_first = 0x0, _M_last = 0x2995408, _M_node = 0x8c7c34}}}, <No data fields>}
so.. looks like something scribbled all over memory... ? :/
#36 Updated by Sage Weil over 8 years ago
- Priority changed from High to Urgent
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2014-08-01_02:32:01-rados-master-testing-basic-plana/392342
description: rados/thrash/{clusters/fixed-2.yaml fs/xfs.yaml msgr-failures/few.yaml thrashers/default.yaml workloads/radosbench.yaml}
this time, the read_entry frame is partly corrupt, too: this is bad, although the other fields look ok:
#11 0x0000000000984eff in FileJournal::read_entry (this=0x59e, bl=..., next_seq=<optimized out>, corrupt=0x0) at os/FileJournal.cc:1678 1678 os/FileJournal.cc: No such file or directory. (gdb) p this $13 = (FileJournal * const) 0x59e (gdb) p pos $14 = 44053728 (gdb) p next_pos $15 = 97988608 (gdb) p seq $16 = 1413 (gdb) up #12 0x00000000009824cd in FileJournal::open (this=0x7fff89c59f30, fs_op_seq=1444) at os/FileJournal.cc:514 514 in os/FileJournal.cc (gdb) p header $17 = {flags = 43775176, fsid = {uuid = "\000\\\232\002\000\000\000\000\300O\246\002\000\000\000"}, block_size = 15946952, alignment = 0, max_size = 140735504818856, start = 140553230486547, committed_up_to = 0, start_seq = 43670528}
*this contents still screwed up, though!
#37 Updated by Shambhu Rajak over 8 years ago
- File Log79set_5.zip added
I saw similar issue hitting my 3 node cluster: node1 - mon.0 osd.0 osd.1 osd.2 node2 - mon.1 osd.3 osd.4 osd.5 node3 - mon.2 client.0 Platform: Ubuntu 14.04 Trusty Snippet from teuthology Log: While reviving osd.4 saw error os/FileJournal.cc: 1693: FAILED assert(0) Snippet from teuthology log(complete Teuthology attached Log79set_5.zip): 2014-09-15T15:02:46.135 INFO:teuthology.task.thrashosds.thrasher:Reviving osd 4 2014-09-15T15:02:46.135 INFO:teuthology.task.ceph.osd.4:Restarting 2014-09-15T15:02:46.135 DEBUG:teuthology.orchestra.run:Running [10.15.16.60]: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-osd -f -i 4' 2014-09-15T15:02:46.693 INFO:teuthology.task.ceph.osd.4:Started 2014-09-15T15:02:46.694 DEBUG:teuthology.orchestra.run:Running [10.15.16.60]: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok dump_ops_in_flight' 2014-09-15T15:02:46.758 INFO:teuthology.task.ceph.osd.4.out:[10.15.16.60]: starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4 /var/lib/ceph/osd/ceph-4/journal 2014-09-15T15:02:46.831 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 2014-09-15 09:32:46.701026 7f3be6f96900 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway 2014-09-15T15:02:46.831 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 2014-09-15 09:32:46.701614 7f3be6f96900 -1 journal Unable to read past sequence 1738 but header indicates the journal has committed up through 1774, journal is corrupt 2014-09-15T15:02:46.866 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, bool*)' thread 7f3be6f96900 time 2014-09-15 09:32:46.701625 2014-09-15T15:02:46.866 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: os/FileJournal.cc: 1693: FAILED assert(0) 2014-09-15T15:02:46.874 INFO:teuthology.task.radosbench.radosbench.0.out:[10.15.16.59]: 336 16 1790 1774 21.1158 0 - 2.98315 2014-09-15T15:02:46.883 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: ceph version 0.84-sd-sprint4 (3215c520e1306f50d0094b5646636c02456c9df4) 2014-09-15T15:02:46.884 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb776db] 2014-09-15T15:02:46.884 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 2: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0xaa4) [0xa3ec34] 2014-09-15T15:02:46.884 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 3: (FileJournal::open(unsigned long)+0x66d) [0xa4055d] 2014-09-15T15:02:46.884 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 4: (JournalingObjectStore::journal_replay(unsigned long)+0x103) [0x92e9b3] 2014-09-15T15:02:46.885 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 5: (FileStore::mount()+0x3eaa) [0x902e3a] 2014-09-15T15:02:46.885 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 6: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x66b9ba] 2014-09-15T15:02:46.885 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 7: (main()+0x23ee) [0x64379e] 2014-09-15T15:02:46.885 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 8: (__libc_start_main()+0xf5) [0x7f3be3f4dec5] 2014-09-15T15:02:46.885 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 9: ceph-osd() [0x65bc07] 2014-09-15T15:02:46.885 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2014-09-15T15:02:46.886 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 2014-09-15 09:32:46.741006 7f3be6f96900 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, bool*)' thread 7f3be6f96900 time 2014-09-15 09:32:46.701625 2014-09-15T15:02:46.886 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: os/FileJournal.cc: 1693: FAILED assert(0) 2014-09-15T15:02:46.886 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 2014-09-15T15:02:46.886 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: ceph version 0.84-sd-sprint4 (3215c520e1306f50d0094b5646636c02456c9df4) 2014-09-15T15:02:46.887 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb776db] 2014-09-15T15:02:46.887 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 2: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0xaa4) [0xa3ec34] 2014-09-15T15:02:46.887 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 3: (FileJournal::open(unsigned long)+0x66d) [0xa4055d] 2014-09-15T15:02:46.887 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 4: (JournalingObjectStore::journal_replay(unsigned long)+0x103) [0x92e9b3] 2014-09-15T15:02:46.887 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 5: (FileStore::mount()+0x3eaa) [0x902e3a] 2014-09-15T15:02:46.888 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 6: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x66b9ba] 2014-09-15T15:02:46.888 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 7: (main()+0x23ee) [0x64379e] 2014-09-15T15:02:46.888 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 8: (__libc_start_main()+0xf5) [0x7f3be3f4dec5] 2014-09-15T15:02:46.888 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 9: ceph-osd() [0x65bc07] 2014-09-15T15:02:46.888 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2014-09-15T15:02:46.888 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 2014-09-15T15:02:47.058 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: -3> 2014-09-15 09:32:46.701026 7f3be6f96900 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway 2014-09-15T15:02:47.058 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: -1> 2014-09-15 09:32:46.701614 7f3be6f96900 -1 journal Unable to read past sequence 1738 but header indicates the journal has committed up through 1774, journal is corrupt 2014-09-15T15:02:47.059 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 0> 2014-09-15 09:32:46.741006 7f3be6f96900 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, bool*)' thread 7f3be6f96900 time 2014-09-15 09:32:46.701625 2014-09-15T15:02:47.059 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: os/FileJournal.cc: 1693: FAILED assert(0) 2014-09-15T15:02:47.059 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 2014-09-15T15:02:47.060 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: ceph version 0.84-sd-sprint4 (3215c520e1306f50d0094b5646636c02456c9df4) 2014-09-15T15:02:47.060 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb776db] 2014-09-15T15:02:47.060 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 2: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0xaa4) [0xa3ec34] 2014-09-15T15:02:47.060 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 3: (FileJournal::open(unsigned long)+0x66d) [0xa4055d] 2014-09-15T15:02:47.060 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 4: (JournalingObjectStore::journal_replay(unsigned long)+0x103) [0x92e9b3] 2014-09-15T15:02:47.061 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 5: (FileStore::mount()+0x3eaa) [0x902e3a] 2014-09-15T15:02:47.061 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 6: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x66b9ba] 2014-09-15T15:02:47.061 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 7: (main()+0x23ee) [0x64379e] 2014-09-15T15:02:47.061 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 8: (__libc_start_main()+0xf5) [0x7f3be3f4dec5] 2014-09-15T15:02:47.061 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 9: ceph-osd() [0x65bc07] 2014-09-15T15:02:47.061 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2014-09-15T15:02:47.062 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 2014-09-15T15:02:47.062 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: terminate called after throwing an instance of 'ceph::FailedAssertion' 2014-09-15T15:02:47.062 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: *** Caught signal (Aborted) ** 2014-09-15T15:02:47.062 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: in thread 7f3be6f96900 2014-09-15T15:02:47.062 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: ceph version 0.84-sd-sprint4 (3215c520e1306f50d0094b5646636c02456c9df4) 2014-09-15T15:02:47.063 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 1: ceph-osd() [0xa8c7ff] 2014-09-15T15:02:47.063 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 2: (()+0x10340) [0x7f3be5c49340] 2014-09-15T15:02:47.063 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 3: (gsignal()+0x39) [0x7f3be3f62bb9] 2014-09-15T15:02:47.063 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 4: (abort()+0x148) [0x7f3be3f65fc8] 2014-09-15T15:02:47.063 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f3be486e6b5] 2014-09-15T15:02:47.063 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 6: (()+0x5e836) [0x7f3be486c836] 2014-09-15T15:02:47.064 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 7: (()+0x5e863) [0x7f3be486c863] 2014-09-15T15:02:47.064 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 8: (()+0x5eaa2) [0x7f3be486caa2] 2014-09-15T15:02:47.064 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xb778c8] 2014-09-15T15:02:47.064 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 10: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0xaa4) [0xa3ec34] 2014-09-15T15:02:47.064 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 11: (FileJournal::open(unsigned long)+0x66d) [0xa4055d] 2014-09-15T15:02:47.064 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 12: (JournalingObjectStore::journal_replay(unsigned long)+0x103) [0x92e9b3] 2014-09-15T15:02:47.065 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 13: (FileStore::mount()+0x3eaa) [0x902e3a] 2014-09-15T15:02:47.065 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 14: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x66b9ba] 2014-09-15T15:02:47.065 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 15: (main()+0x23ee) [0x64379e] 2014-09-15T15:02:47.065 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 16: (__libc_start_main()+0xf5) [0x7f3be3f4dec5] 2014-09-15T15:02:47.065 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 17: ceph-osd() [0x65bc07] 2014-09-15T15:02:47.104 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 2014-09-15 09:32:46.767054 7f3be6f96900 -1 *** Caught signal (Aborted) ** 2014-09-15T15:02:47.104 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: in thread 7f3be6f96900 2014-09-15T15:02:47.104 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 2014-09-15T15:02:47.105 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: ceph version 0.84-sd-sprint4 (3215c520e1306f50d0094b5646636c02456c9df4) 2014-09-15T15:02:47.105 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 1: ceph-osd() [0xa8c7ff] 2014-09-15T15:02:47.105 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 2: (()+0x10340) [0x7f3be5c49340] 2014-09-15T15:02:47.105 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 3: (gsignal()+0x39) [0x7f3be3f62bb9] 2014-09-15T15:02:47.105 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 4: (abort()+0x148) [0x7f3be3f65fc8] 2014-09-15T15:02:47.106 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f3be486e6b5] 2014-09-15T15:02:47.106 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 6: (()+0x5e836) [0x7f3be486c836] 2014-09-15T15:02:47.106 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 7: (()+0x5e863) [0x7f3be486c863] 2014-09-15T15:02:47.106 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 8: (()+0x5eaa2) [0x7f3be486caa2] 2014-09-15T15:02:47.106 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xb778c8] 2014-09-15T15:02:47.106 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 10: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0xaa4) [0xa3ec34] 2014-09-15T15:02:47.107 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 11: (FileJournal::open(unsigned long)+0x66d) [0xa4055d] 2014-09-15T15:02:47.107 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 12: (JournalingObjectStore::journal_replay(unsigned long)+0x103) [0x92e9b3] 2014-09-15T15:02:47.107 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 13: (FileStore::mount()+0x3eaa) [0x902e3a] 2014-09-15T15:02:47.107 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 14: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x66b9ba] 2014-09-15T15:02:47.107 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 15: (main()+0x23ee) [0x64379e] 2014-09-15T15:02:47.108 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 16: (__libc_start_main()+0xf5) [0x7f3be3f4dec5] 2014-09-15T15:02:47.108 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 17: ceph-osd() [0x65bc07] 2014-09-15T15:02:47.108 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2014-09-15T15:02:47.108 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 2014-09-15T15:02:47.110 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 0> 2014-09-15 09:32:46.767054 7f3be6f96900 -1 *** Caught signal (Aborted) ** 2014-09-15T15:02:47.110 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: in thread 7f3be6f96900 2014-09-15T15:02:47.110 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 2014-09-15T15:02:47.110 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: ceph version 0.84-sd-sprint4 (3215c520e1306f50d0094b5646636c02456c9df4) 2014-09-15T15:02:47.110 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 1: ceph-osd() [0xa8c7ff] 2014-09-15T15:02:47.111 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 2: (()+0x10340) [0x7f3be5c49340] 2014-09-15T15:02:47.111 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 3: (gsignal()+0x39) [0x7f3be3f62bb9] 2014-09-15T15:02:47.111 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 4: (abort()+0x148) [0x7f3be3f65fc8] 2014-09-15T15:02:47.111 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f3be486e6b5] 2014-09-15T15:02:47.111 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 6: (()+0x5e836) [0x7f3be486c836] 2014-09-15T15:02:47.111 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 7: (()+0x5e863) [0x7f3be486c863] 2014-09-15T15:02:47.112 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 8: (()+0x5eaa2) [0x7f3be486caa2] 2014-09-15T15:02:47.112 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xb778c8] 2014-09-15T15:02:47.112 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 10: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0xaa4) [0xa3ec34] 2014-09-15T15:02:47.112 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 11: (FileJournal::open(unsigned long)+0x66d) [0xa4055d] 2014-09-15T15:02:47.112 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 12: (JournalingObjectStore::journal_replay(unsigned long)+0x103) [0x92e9b3] 2014-09-15T15:02:47.112 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 13: (FileStore::mount()+0x3eaa) [0x902e3a] 2014-09-15T15:02:47.113 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 14: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x66b9ba] 2014-09-15T15:02:47.113 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 15: (main()+0x23ee) [0x64379e] 2014-09-15T15:02:47.113 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 16: (__libc_start_main()+0xf5) [0x7f3be3f4dec5] 2014-09-15T15:02:47.113 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: 17: ceph-osd() [0x65bc07] 2014-09-15T15:02:47.113 INFO:teuthology.task.ceph.osd.4.err:[10.15.16.60]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
#38 Updated by Sage Weil over 8 years ago
-1> 2014-10-06 16:29:02.286272 7f2368425780 -1 journal Unable to read past sequence 39581 but header indicates the journal has committed up through 39647, journal is corrupt 0> 2014-10-06 16:29:02.297264 7f2368425780 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, bool*)' thread 7f2368425780 time 2014-10-06 16:29:02.286293 os/FileJournal.cc: 1677: FAILED assert(0) ceph version 0.80.6-48-g884f7c4 (884f7c40c4a28d519847d3995c8d98e5837ceaf0) 1: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x86f) [0x97918f] 2: (FileJournal::open(unsigned long)+0x6b5) [0x9766f5] 3: (JournalingObjectStore::journal_replay(unsigned long)+0x26d) [0x8bd39d] 4: (FileStore::mount()+0x30c2) [0x890792] 5: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x61fd8a] 6: (main()+0x1d88) [0x6079b8] 7: (__libc_start_main()+0xed) [0x7f236622776d] 8: ceph-osd() [0x60bca9] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.ubuntu@teuthology:/a/teuthology-2014-10-05_10:00:04-upgrade:firefly-firefly-distro-basic-multi/528734
#39 Updated by Sage Weil about 8 years ago
ubuntu@teuthology:/a/sage-2014-11-20_17:03:30-rados:thrash-wip-watch-notify-distro-basic-multi/611427
#40 Updated by Sage Weil about 8 years ago
ubuntu@teuthology:/a/sage-2014-12-01_11:11:52-rados-next-distro-basic-multi/629379
#41 Updated by Loïc Dachary about 8 years ago
-13> 2015-01-07 16:52:19.063303 7f6a93337780 10 filestore(/var/lib/ceph/osd/ceph-2) list_collections -12> 2015-01-07 16:52:19.088037 7f6a93337780 10 journal journal_replay fs op_seq 2835 -11> 2015-01-07 16:52:19.088052 7f6a93337780 2 journal open /var/lib/ceph/osd/ceph-2/journal fsid 5af9380a-a5a6-4ca2-9310-311ee01ac6cc fs_op_seq 2835 -10> 2015-01-07 16:52:19.088074 7f6a93337780 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway -9> 2015-01-07 16:52:19.088082 7f6a93337780 10 journal _open journal is not a block device, NOT checking disk write cache on '/var/lib/ceph/osd/ceph-2/journal' -8> 2015-01-07 16:52:19.088084 7f6a93337780 1 journal _open /var/lib/ceph/osd/ceph-2/journal fd 20: 104857600 bytes, block size 4096 bytes, directio = 1, aio = 0 -7> 2015-01-07 16:52:19.088094 7f6a93337780 10 journal read_header -6> 2015-01-07 16:52:19.088099 7f6a8dd70700 20 filestore(/var/lib/ceph/osd/ceph-2) sync_entry waiting for max_interval 5.000000 -5> 2015-01-07 16:52:19.092091 7f6a93337780 10 journal header: block_size 4096 alignment 4096 max_size 104857600 -4> 2015-01-07 16:52:19.092105 7f6a93337780 10 journal header: start 101089280 -3> 2015-01-07 16:52:19.092107 7f6a93337780 10 journal write_pos 4096 -2> 2015-01-07 16:52:19.092111 7f6a93337780 10 journal open header.fsid = 5af9380a-a5a6-4ca2-9310-311ee01ac6cc -1> 2015-01-07 16:52:19.096659 7f6a93337780 -1 journal Unable to read past sequence 2751 but header indicates the journal has committed up through 2778, journal is corrupt 0> 2015-01-07 16:52:19.111136 7f6a93337780 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, bool*)' thread 7f6a93337780 time 2015-01-07 16:52:19.096681 os/FileJournal.cc: 1693: FAILED assert(0) ceph version 0.80.7-185-gf43f147 (f43f14748d5efd382edc7f27104f29898e39b0d5) 1: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x86f) [0x978c8f] 2: (FileJournal::open(unsigned long)+0x6b5) [0x9761f5] 3: (JournalingObjectStore::journal_replay(unsigned long)+0x26d) [0x8bca8d] 4: (FileStore::mount()+0x30c2) [0x88fe82] 5: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x61f30a] 6: (main()+0x1d88) [0x606db8] 7: (__libc_start_main()+0xed) [0x7f6a9113a76d] 8: ceph-osd() [0x60b0a9] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
#42 Updated by Samuel Just almost 8 years ago
- Status changed from Need More Info to 7
- Assignee set to Samuel Just
The journalq is being populated with positions one sequence number off in do_read_entry. If we trim committed_thru to a sequence number within the journalq portion populated during replay, we end up with a header.start one forward of where it should be causing FileJournal::open() to assume that it's at the end of the journal. Subsequent calls into read_entry will be with the wrong offset/sequence number combo causing the assert to fail.
#43 Updated by Samuel Just almost 8 years ago
I introduced this in March 2013, the fix should be backported to dumpling, firefly, and giant.
#44 Updated by Samuel Just almost 8 years ago
- Backport set to dumpling, firefly, giant
#45 Updated by David Zafman almost 8 years ago
- Status changed from 7 to Pending Backport
#46 Updated by Loïc Dachary almost 8 years ago
#47 Updated by Loïc Dachary almost 8 years ago
- Description updated (diff)
#48 Updated by Loïc Dachary almost 8 years ago
- firefly backport https://github.com/ceph/ceph/pull/3960
#49 Updated by Loïc Dachary almost 8 years ago
- giant backport https://github.com/ceph/ceph/pull/4088
#50 Updated by Loïc Dachary almost 8 years ago
- dumpling backport https://github.com/ceph/ceph/pull/2611
#51 Updated by Loïc Dachary almost 8 years ago
- Status changed from Pending Backport to Resolved