Project

General

Profile

Bug #9073

OSD with device/partition journals down after fresh deploy or upgrade to 0.83

Added by Mark Kirkwood over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
OSD
Target version:
-
% Done:

100%

Source:
Community (user)
Tags:
Backport:
firefly
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Using a src build (and the packages built from it) on Ubuntu 14.04 x86_64. Ceph version is 0.83-399-gf77449c.

In a fresh install I'm seeing a hang at OSD mkfs with a journal partition, e.g:

$ ceph-osd --id 0 --mkjournal --mkfs --osd-data /data1/cephdata --osd-journal /dev/sdc1

Logs show:

7fbbc89e4800 -1 journal check: ondisk fsid d613adad-6e35-47d8-9f5d-e95f0170b4cd doesn't match expected 5390fcae-2ba8-497c-8dab-7265180bf82f, invalid (someone else's?) journal
7fbbc89e4800 -1 filestore(/data1/cephdata) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory

However that last line does not appear to be the issue, we are stuck after this point, waiting on a futex:

Process 12648 attached
futex(0x7fffaa3fcbac, FUTEX_WAIT_PRIVATE, 1, NULL)

Something strange happening with journals seems to be the cause, and reverting this commit:

commit 4eb18dd487da4cb621dcbecfc475fc0871b356ac
Author: Ma Jianpeng <>
Date: Wed Jul 23 10:10:38 2014 -0700

os/FileJournal: Update the journal header when closing journal
When closing journal, it should check must_write_header and update
journal header if must_write_header alreay set.
It can reduce the nosense journal-replay after restarting osd.

results in a OSD that is up.

Simiarly upgrading existing OSD's with (whole) device journals results in the going down with:

filestore(/var/lib/ceph/osd/ceph-0) mount failed to open journal /var/lib/ceph/osd/ceph-0/journal: (22) Invalid argument

..reverting the above commit makes them rescueable after a mkjournal.

I've attached a script I am using to do the simple fresh install - I was initially using ceph-deploy but wondered if it was causing the issue and wanted to rule that out.

deploy.sh View - Quick script to deply 1 OSD (2.69 KB) Mark Kirkwood, 08/11/2014 09:15 PM

ceph-osd.0.log View - log of osd mkfs with Intel 520 journal (35.9 KB) Mark Kirkwood, 08/13/2014 03:58 PM

hdpard-intel-520.txt View (3.07 KB) Mark Kirkwood, 08/13/2014 06:12 PM

journalblk.txt View (4 KB) Mark Kirkwood, 08/13/2014 06:51 PM

patch.diff View (571 Bytes) jianpeng ma, 08/13/2014 11:49 PM

ceph-osd.0.log View (37.7 KB) Mark Kirkwood, 08/14/2014 01:38 AM

journalblk-before.txt View (4 KB) Mark Kirkwood, 08/14/2014 01:38 AM

journalblk-after.txt View (4 KB) Mark Kirkwood, 08/14/2014 01:38 AM

journalblk-mkjournal.txt View (4 KB) Mark Kirkwood, 08/14/2014 02:06 AM

deploy.sh View (2.14 KB) Mark Kirkwood, 08/14/2014 03:04 AM

journal.diff View (535 Bytes) jianpeng ma, 08/14/2014 07:47 PM

journal.diff View (691 Bytes) jianpeng ma, 08/14/2014 10:58 PM

ceph-osd.strace (194 KB) Mark Kirkwood, 08/14/2014 11:25 PM

journal.diff View (2.29 KB) jianpeng ma, 08/17/2014 07:27 PM

ceph-osd.0.log View (16.5 KB) Mark Kirkwood, 08/17/2014 11:40 PM

ceph-osd.0.log View (16.6 KB) Mark Kirkwood, 08/18/2014 01:10 AM

debug-journal-header-3.diff View (562 Bytes) Mark Kirkwood, 08/18/2014 02:21 AM


Related issues

Related to Ceph - Bug #6003: journal Unable to read past sequence 406 ... Resolved 08/15/2013
Related to Ceph - Bug #9851: crash on journal/filestore shutdown on firefly Resolved 10/21/2014
Duplicated by Ceph - Bug #9768: ceph-osd mkfs hangs Duplicate 10/14/2014

History

#1 Updated by Mark Kirkwood over 9 years ago

A correctio - the 'stuck on a mutex' comment is completely wrong - sorry - I'd attached strace to the ceph-osd process blindly unaware of the existence of a number of threads! Adding some debugging showed we are actually doling stuff like:

2014-08-12 20:42:14.748499 7f9478b3c700 10 journal commit_start nothing to do
2014-08-12 20:42:14.748503 7f9478b3c700 10 journal commit_start
2014-08-12 20:42:14.748516 7f9478b3c700 20 filestore(/data1/cephdata) sync_entry waiting for max_interval 5.000000
2014-08-12 20:42:19.748623 7f9478b3c700 20 filestore(/data1/cephdata) sync_entry woke after 5.000106
2014-08-12 20:42:19.748650 7f9478b3c700 10 journal commit_start max_applied_seq 1, open_ops 0
2014-08-12 20:42:19.748653 7f9478b3c700 10 journal commit_start blocked, all open_ops have completed
2014-08-12 20:42:19.748654 7f9478b3c700 10 journal commit_start nothing to do
2014-08-12 20:42:19.748657 7f9478b3c700 10 journal commit_start
2014-08-12 20:42:19.748664 7f9478b3c700 20 filestore(/data1/cephdata) sync_entry waiting for max_interval 5.000000
2014-08-12 20:42:24.748778 7f9478b3c700 20 filestore(/data1/cephdata) sync_entry woke after 5.000113
2014-08-12 20:42:24.748819 7f9478b3c700 10 journal commit_start max_applied_seq 1, open_ops 0
2014-08-12 20:42:24.748824 7f9478b3c700 10 journal commit_start blocked, all open_ops have completed
2014-08-12 20:42:24.748827 7f9478b3c700 10 journal commit_start nothing to do
2014-08-12 20:42:24.748831 7f9478b3c700 10 journal commit_start
2014-08-12 20:42:24.748844 7f9478b3c700 20 filestore(/data1/cephdata) sync_entry waiting for max_interval 5.000000

But nonetheless not working right.

#2 Updated by Sage Weil over 9 years ago

  • Project changed from CephFS to Ceph
  • Priority changed from Normal to Urgent
  • Source changed from other to Community (user)

#3 Updated by Mark Kirkwood over 9 years ago

On a different machine instead of a hang I am reliably getting:

2014-08-13 12:50:28.253439 7ffc701bb8c0 -1 ** ERROR: error converting store /var/lib/ceph/osd/ceph-0: (22) Invalid argument

Adding in more debugging I'm seeing:

2014-08-13 12:50:28.196096 7ffc701bb8c0 0 filestore(/var/lib/ceph/osd/ceph-0) limited size xattrs
2014-08-13 12:50:28.196286 7ffc701bb8c0 5 filestore(/var/lib/ceph/osd/ceph-0) mount op_seq is 2
2014-08-13 12:50:28.239441 7ffc701bb8c0 20 filestore (init)dbobjectmap: seq is 1
2014-08-13 12:50:28.239478 7ffc701bb8c0 10 filestore(/var/lib/ceph/osd/ceph-0) open_journal at /var/lib/ceph/osd/ceph-0/journal
2014-08-13 12:50:28.239526 7ffc701bb8c0 0 filestore(/var/lib/ceph/osd/ceph-0) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2014-08-13 12:50:28.239535 7ffc701bb8c0 10 filestore(/var/lib/ceph/osd/ceph-0) list_collections
2014-08-13 12:50:28.239735 7ffc701bb8c0 10 journal journal_replay fs op_seq 2
2014-08-13 12:50:28.239752 7ffc701bb8c0 2 journal open /var/lib/ceph/osd/ceph-0/journal fsid fa1a9875-c5d4-4350-a603-4ecfc67a8fff fs_op_seq 2
2014-08-13 12:50:28.239784 7ffc6af6c700 20 filestore(/var/lib/ceph/osd/ceph-0) sync_entry waiting for max_interval 5.000000
2014-08-13 12:50:28.239810 7ffc701bb8c0 10 journal _open_block_device: ignoring osd journal size. We'll use the entire block device (size: 10737418240)
2014-08-13 12:50:28.252349 7ffc701bb8c0 20 journal _check_disk_write_cache: disk write cache is on, but your kernel is new enough to handle it correctly. (fn:/var/lib/ceph/osd/ceph-0/journal)
2014-08-13 12:50:28.252646 7ffc701bb8c0 1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 20: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 1
2014-08-13 12:50:28.252678 7ffc701bb8c0 10 journal read_header
2014-08-13 12:50:28.253070 7ffc701bb8c0 -1 journal read_header error decoding journal header
2014-08-13 12:50:28.253096 7ffc701bb8c0 3 journal journal_replay open failed with (22) Invalid argument
2014-08-13 12:50:28.253143 7ffc701bb8c0 -1 filestore(/var/lib/ceph/osd/ceph-0) mount failed to open journal /var/lib/ceph/osd/ceph-0/journal: (22) Invalid argument

So looks like the invalid argument is coming from FileJournal::read_header() . I have not dug much deeper than that at this point. Atho this looks like a different error, it seems to be intimately tied up with the commit I mentioned prev (ie. reverting it stops this error too).

#4 Updated by Mark Kirkwood over 9 years ago

...or maybe the ::open()

#5 Updated by Sage Weil over 9 years ago

When you say that reverting fixes it, do you mean that it allows an OSD that was erroring out on start to then start, or that without the commit you never end up with a broken journal in the first place?

#6 Updated by jianpeng ma over 9 years ago

Yes, I already found this bug. If journal use aio mode. The bug occur.
The https://github.com/ceph/ceph/pull/2185 can fix this bug.
Mark, can you test this patch?

Thanks!

#7 Updated by Sage Weil over 9 years ago

  • Status changed from New to 7

Mark, I've pushed this as wip-9073.. can you please test?

Thanks, Jianpeng! Sorry I missed the pull request earlier!

#8 Updated by Mark Kirkwood over 9 years ago

Very quick work! Will test...

#9 Updated by Mark Kirkwood over 9 years ago

Building wip-9073. Hmmm still getting the invalid argument error and osd down. I'm guessing this means there are two bugs 1/ the hang (I'll have to check this from home later) 2/ the invalid argument/broken journal.

Some more info re 2/ If I revert commit d613adad then the journal is never broken it seems:

2014-08-13 17:03:23.652694 7f494d0a1800 10 filestore(/ceph0) open_journal at /dev/sdc1
2014-08-13 17:03:23.652766 7f494d0a1800 10 journal _open_block_device: ignoring osd journal size. We'll use the entire block device (size: 10737418240)
2014-08-13 17:03:23.665591 7f494d0a1800 20 journal _check_disk_write_cache: disk write cache is on, but your kernel is new enough to handle it correctly. (fn:/dev/sdc1)
2014-08-13 17:03:23.665803 7f494d0a1800 1 journal _open /dev/sdc1 fd 10: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 1
2014-08-13 17:03:23.665835 7f494d0a1800 10 journal read_header
2014-08-13 17:03:23.666193 7f494d0a1800 -1 journal read_header error decoding journal header
2014-08-13 17:03:23.666272 7f494d0a1800 2 journal create /dev/sdc1 fsid 2121da66-b421-48cb-817d-b56051b06634
2014-08-13 17:03:23.666300 7f494d0a1800 10 journal _open_block_device: ignoring osd journal size. We'll use the entire block device (size: 10737418240)
2014-08-13 17:03:23.675946 7f494d0a1800 20 journal _check_disk_write_cache: disk write cache is on, but your kernel is new enough to handle it correctly. (fn:/dev/sdc1)
2014-08-13 17:03:23.676180 7f494d0a1800 1 journal _open /dev/sdc1 fd 10: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 1
2014-08-13 17:03:23.676208 7f494d0a1800 10 journal header: block_size 4096 alignment 4096 max_size 10737418240
2014-08-13 17:03:23.676210 7f494d0a1800 10 journal header: start 4096
2014-08-13 17:03:23.676211 7f494d0a1800 10 journal write_pos 0
2014-08-13 17:03:23.676926 7f494d0a1800 2 journal create done
2014-08-13 17:03:23.676964 7f494d0a1800 0 filestore(/ceph0) mkjournal created journal on /dev/sdc1
2014-08-13 17:03:23.677012 7f494d0a1800 1 filestore(/ceph0) mkfs done in /ceph0
2014-08-13 17:03:23.677050 7f494d0a1800 5 filestore(/ceph0) basedir /ceph0 journal /dev/sdc1
2014-08-13 17:03:23.677100 7f494d0a1800 10 filestore(/ceph0) mount fsid is 2121da66-b421-48cb-817d-b56051b06634
2014-08-13 17:03:23.677183 7f494d0a1800 0 filestore(/ceph0) backend generic (magic 0xef53)
2014-08-13 17:03:23.701000 7f494d0a1800 0 genericfilestorebackend(/ceph0) detect_features: FIEMAP ioctl is supported and appears to work
2014-08-13 17:03:23.701018 7f494d0a1800 0 genericfilestorebackend(/ceph0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2014-08-13 17:03:23.737354 7f494d0a1800 0 genericfilestorebackend(/ceph0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2014-08-13 17:03:23.809547 7f494d0a1800 0 filestore(/ceph0) limited size xattrs
2014-08-13 17:03:23.809742 7f494d0a1800 5 filestore(/ceph0) mount op_seq is 1
2014-08-13 17:03:23.858738 7f494d0a1800 20 filestore (init)dbobjectmap: seq is 1
2014-08-13 17:03:23.858781 7f494d0a1800 10 filestore(/ceph0) open_journal at /dev/sdc1
2014-08-13 17:03:23.858808 7f494d0a1800 0 filestore(/ceph0) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2014-08-13 17:03:23.858819 7f494d0a1800 10 filestore(/ceph0) list_collections
2014-08-13 17:03:23.858924 7f494d0a1800 10 journal journal_replay fs op_seq 1
2014-08-13 17:03:23.858929 7f494d0a1800 2 journal open /dev/sdc1 fsid 2121da66-b421-48cb-817d-b56051b06634 fs_op_seq 1
2014-08-13 17:03:23.858958 7f494d0a1800 10 journal _open_block_device: ignoring osd journal size. We'll use the entire block device (size: 10737418240)
2014-08-13 17:03:23.858954 7f4948a7b700 20 filestore(/ceph0) sync_entry waiting for max_interval 5.000000
2014-08-13 17:03:23.871152 7f494d0a1800 20 journal _check_disk_write_cache: disk write cache is on, but your kernel is new enough to handle it correctly. (fn:/dev/sdc1)
2014-08-13 17:03:23.871348 7f494d0a1800 1 journal _open /dev/sdc1 fd 16: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 1
2014-08-13 17:03:23.871362 7f494d0a1800 10 journal read_header
2014-08-13 17:03:23.871537 7f494d0a1800 10 journal header: block_size 4096 alignment 4096 max_size 10737418240
2014-08-13 17:03:23.871552 7f494d0a1800 10 journal header: start 4096

#10 Updated by jianpeng ma over 9 years ago

Are you using this command "ceph-osd --id 0 --mkjournal --mkfs --osd-data /data1/cephdata --osd-journal /dev/sdc1"?

The commit d613adad? I don't found this commit.

#11 Updated by Mark Kirkwood over 9 years ago

Yeah I'm using that commmand.

Sorry - messed up the commit hash : 4eb18dd487da4cb621dcbecfc475fc0871b356ac

#12 Updated by jianpeng ma over 9 years ago

I can't reproduce.
From your messages, i can't find any error info.
Or am i missing something?

#13 Updated by Mark Kirkwood over 9 years ago

Interesting... I'm just building wip-9073 on my home machine now, will update you with what I find.

The issue might be something todo with how journal devices are setup - both are partitioned via fdisk (4 primary partitions either 20G or 30G in size), and I've zapped the relevant partition before installing viz:

$ dd if=/dev/zero of=/dev/sdc1 bs=512

The journal devices are ssd
- Crucial m4 (home)
- Intel 520 (work)

#14 Updated by Mark Kirkwood over 9 years ago

to (hopefully) clarify the errors:

- Home machine: osd mkfs hangs (which I've called 1/)
- work machine: osd mkfs fails with ' mount failed to open journal...invalid argument etc' ...and finally 'error converting store 22' (which I've been calling 2/)

#15 Updated by Mark Kirkwood over 9 years ago

I'm happy to report that wip-9073 definitely fixes problem 1/ (the hang).

#16 Updated by Mark Kirkwood over 9 years ago

An aside thought - to rule out weird ssd related stuff I had performed a secure erase on the Crucial m4's while investigating this bug - it might be worth doing the same with the Intel 520's in the work machine. I'll look at that tomorrow!

#17 Updated by Mark Kirkwood over 9 years ago

Doing a secure erase of the 520's changes nothing. Still seeing problem 2/ 'invalid argument' opening the journal.

#18 Updated by jianpeng ma over 9 years ago

Can you print more debuginfo about journal?
From the messages:
journal read_header error decoding journal header

But i don't know whether write journal header to ssd.

#19 Updated by jianpeng ma over 9 years ago

Can you send the 520 disk-info using hdparm to me?
I'll search the lab try to find this ssd.

Thanks!

#20 Updated by jianpeng ma over 9 years ago

From you message, I found:
14-08-14 10:58:01.735317 7f944f5e4800 20 journal check_disk_write_cache: disk write cache is on, but your kernel is new enough to handle it correctly. (_fn:/var/lib/ceph/osd/ceph-0/journal)
2014-08-14 10:58:01.735576 7f944f5e4800 1 journal open /var/lib/ceph/osd/ceph-0/journal fd 20: 15032385536 bytes, block size 4096 bytes, directio = 1, aio = 1
2014-08-14 10:58:01.735595 7f944f5e4800 10 journal read_header
2014-08-14 10:58:01.735953 7f944f5e4800 -1 journal read_header error decoding journal header
2014-08-14 10:58:01.735976 7f944f5e4800 3 journal journal_replay open failed with (22) Invalid argument
2014-08-14 10:58:01.736016 7f944f5e4800 -1 filestore(/var/lib/ceph/osd/ceph-0) mount failed to open journal _/var/lib/ceph/osd/ceph-0/journal
: (22) Invalid argument

Why journal is /var/lib/ceph/osd/ceph-0/journal? I think it should be /dev/sdc1.

#21 Updated by Mark Kirkwood over 9 years ago

The script puts in symlinks (also note slightly different osd data path on the work machine):

$ ls -l /var/lib/ceph/osd
total 0
lrwxrwxrwx 1 root root 6 Aug 14 13:09 ceph-0 -> /ceph0

$ ls l /var/lib/ceph/osd/ceph-0/
total 32
-rw-r--r-
1 root root 37 Aug 14 13:09 ceph_fsid
drwxr-xr-x 4 root root 4096 Aug 14 13:09 current
rw-r--r- 1 root root 37 Aug 14 13:09 fsid
lrwxrwxrwx 1 root root 9 Aug 14 13:09 journal > /dev/sdc1
-rw-r--r-
1 root root 21 Aug 14 13:09 magic
rw-r--r- 1 root root 6 Aug 14 13:09 ready
rw-r--r- 1 root root 4 Aug 14 13:09 store_version
rw-r--r- 1 root root 42 Aug 14 13:09 superblock
rw-r--r- 1 root root 2 Aug 14 13:09 whoami

#22 Updated by Mark Kirkwood over 9 years ago

The info for the Intel 520:

Re more journal debugging - sure, I already have the following set:

[osd]
debug osd = 20
debug filestore = 20
debug journal = 20
debug monc = 20

whart additional ones would you recommend?

#23 Updated by jianpeng ma over 9 years ago

Can you read the 4096 of /dev/sdc1 and send to me?
The journal header is in first 4096 size.

#24 Updated by jianpeng ma over 9 years ago

For the code,there is a logic error.
int r = ::pread(fd, bp.c_str(), bp.length(), 0);
bl.push_back(bp);

try {
bufferlist::iterator p = bl.begin();
::decode(header, p);
}
catch (buffer::error& e) {
derr << "read_header error decoding journal header" << dendl;
return -EINVAL;
}
if (r < 0) {
int err = errno;
dout(0) << "read_header got " << cpp_strerror(err) << dendl;
return -err;
}

It should first check if (r <0) and later to do decode.
But i don't make sure, you read error or journal header already destroy.

#25 Updated by Mark Kirkwood over 9 years ago

Here's the 4096 bytes of sdc1

#26 Updated by jianpeng ma over 9 years ago

hexdump journalblk.txt

0000000 1000 03ce 0000 0000 0001 0000 0000 0000
0000010 bdb9 29ac 51d7 a343 3bbf 1114 622e af19
0000020 1000 0000 1000 0000 0000 8000 0003 0000
0000030 2000 0000 0000 0000 0002 0000 0000 0000
0000040 0003 0000 0000 0000 0000 0000 0000 0000
0000050 0000 0000 0000 0000 0000 0000 0000 0000

The journal header already destroyed. I'll reread the code and found the possible.

#27 Updated by jianpeng ma over 9 years ago

Mark, i can't find the ssd in lab.
And i also can't find the code.But from my two patch, i don't modify code which could cause the journal-header destroyed.
The attachment is a patch which i add some debug info want to find who destroy journal-header.
Can you patch this and reset.

BTW, before do test
1:get the journal-header. The first 4096 of /dev/sdc1
2:apply this patch and test
3:get the journal-header.

At last, send me three files.

Thanks very much!

#28 Updated by Mark Kirkwood over 9 years ago

Shame about no 520, but here are the files.

#29 Updated by Mark Kirkwood over 9 years ago

Also, I note that running

$ sudo ceph-osd -i 0 --mkjournal

results in a journal state that lets the osd start, so hace attached what the 4096 bytes look like after doing the mkjournal.

#30 Updated by jianpeng ma over 9 years ago

The first 8 byte of journal-header destroyed. But the debug info show the content of journal-header is right.

Now the important thing is that i can reproduce this bug.

I think the step which we used may has different.
Can you tell me the step about you include all.

#31 Updated by Mark Kirkwood over 9 years ago

Sure - I'm running the script attached initially - now using a minor variation thereof (attached again).
The only other thing I am doing is dd'ing zero's to the relevant partition beforehand:

$ sudo dd if=/dev/zero of=/dev/sdc1 bs=512

#32 Updated by jianpeng ma over 9 years ago

Are you mean if you zero the journal-disk then the osd can start? Otherwise, it will met this bug.

#33 Updated by Mark Kirkwood over 9 years ago

No, sorry,

$ sudo dd if=/dev/zero of=/dev/sdc1 bs=512
$ sudo ./deploy.sh

is the prescription. The result is osd down. Subsequently running:

$ sudo ceph-osd -i 0 --mkjournal
$ sudo start ceph-osd id=0

gets the osd up. I'm guessing some sort of timing bug with the --mkfs --mkjournal happening at the same time when occurs I'm running deploy.sh that just running --mkjournal fixes.

#34 Updated by jianpeng ma over 9 years ago

Can you use "strace -f ceph-osd .." to trace all syscall?
We may from the info find some clue.

#35 Updated by jianpeng ma over 9 years ago

Hi Mark,
I use different on my hand but i can't reproduce this.

From the deply.sh, for osd operation
1:ceph-osd -i 0 --mkfs --mkjournal ...
2:ceph-osd -i 0

Can you check after step1?
If the journal-header is ok, the first 16 byte should:
0004 0000 0040 0000 0001 0000 0000 0000

For the common ceph-osd -i 0 --mkjournal is ok, I don't sure the journal is /dev/sdc1 or other.
I think if you don't set osd-journal, it maybe ok.

Thanks!

#36 Updated by Mark Kirkwood over 9 years ago

Funny you should mention that, I had just check that myself:

So, just after the mkfs, journal header is:

$ hexdump journalblk-prestart.txt|head -1
0000000 0004 0000 0040 0000 0001 0000 0000 0000

So looks like the startup appears to be breaking the journal?

#37 Updated by Mark Kirkwood over 9 years ago

Hmmm - just checked again and got:

$ hexdump journalblk-prestart.txt|head -1
0000000 3000 02a0 0000 0000 0001 0000 0000 0000

...so I'm scratching my head a bit

#38 Updated by jianpeng ma over 9 years ago

But from the code, when start osd, read journal-header is the first thing for journal.

I don't know the command 'start ceph-osd id=' how to do. But to make simple, can you use ceph-osd -i 0?
I think you should using strace 'strace -f ceph-osd -i 0' to catch all syscall.

Maybe there are some code which can modify journal but we can't notice.

#39 Updated by Mark Kirkwood over 9 years ago

Will do.

#40 Updated by Mark Kirkwood over 9 years ago

Very interesting: somethimes after the mkfs the header looks like:

0000000 b000 02b5 0000 0000 0001 0000 0000 0000
0000010 40f1 5a3e a813 b24b 13a2 f79c 19c6 4980
0000020 1000 0000 1000 0000 0000 8000 0003 0000
0000030 2000 0000 0000 0000 0002 0000 0000 0000
0000040 0003 0000 0000 0000 0000 0000 0000 0000
0000050 0000 0000 0000 0000 0000 0000 0000 0000

other times (still after the mkfs) it looks like:

0000000 0004 0000 0040 0000 0001 0000 0000 0000
0000010 0a00 bff1 07e0 1644 7080 505c 178e d91f
0000020 1000 0000 1000 0000 0000 8000 0003 0000
0000030 2000 0000 0000 0000 0002 0000 0000 0000
0000040 0003 0000 0000 0000 0000 0000 0000 0000
0000050 0000 0000 0000 0000 0000 0000 0000 0000

When it looks the latter, the osd will start. So my feeling is that is is not the startup that is corrupting the header but the mkfs step. Does that seem reasonable? If so it suggests we need to strace the mkfs, not the startup. Thoughts?

#41 Updated by jianpeng ma over 9 years ago

Yes. It's a great step. A strange bug.

The attachment is a patch which add read_header on some place.Can you try to test?
Thanks very much!

#42 Updated by Mark Kirkwood over 9 years ago

With that last patch applied, journal header looks good every mkfs and osd is starting every time.

#43 Updated by Mark Kirkwood over 9 years ago

...suggests a memory overwrite problem - we really need to get the binaries running under valgrind!

#44 Updated by jianpeng ma over 9 years ago

Makr, i don't find the reason. But i think this bug may caused by patch. So i modify my patch and hope the bug don't occcur. Cloud you test again?
Thanks!

#45 Updated by Mark Kirkwood over 9 years ago

Back to seeing the same error (invalid argument) with this latest patch :-(

#46 Updated by jianpeng ma over 9 years ago

Can you using strace to catch the ceh-osd command? Please using strace -f to cache all child process.

Thanks!

#47 Updated by Mark Kirkwood over 9 years ago

The strace attached. So this is the mkfs...and wip-9073 with just the last patch applied.

#48 Updated by jianpeng ma over 9 years ago

Can you paste the journal head after this command. Only first 8byte.

#49 Updated by Mark Kirkwood over 9 years ago

hexdump -n8 journalblk-prestart.txt
0000000 3000 021d 0000 0000

#50 Updated by Sage Weil over 9 years ago

This is starting to sound a lot like #6003!

#51 Updated by jianpeng ma over 9 years ago

I read the #6003. I think they are not the same.
You can see those two files (patch.diff Magnifier (571 Bytes) jianpeng ma, 2014-08-13 23:49 & ceph-osd.0.log (37.7 KB) Mark Kirkwood, 08/14/2014 01:38 am)
At this point, the content is ok. But the journal header still corrupted.

For mkfs, there are three times for journal write, two of those are write journal-header.
Because strace can't strace the content of io_submit.
Can you suggest how to debug this? Mark looks easily to reproduce this bug. This bug maybe resolved this time by you help.

#52 Updated by Sage Weil over 9 years ago

jianpeng ma wrote:

I read the #6003. I think they are not the same.
You can see those two files (patch.diff Magnifier (571 Bytes) jianpeng ma, 2014-08-13 23:49 & ceph-osd.0.log (37.7 KB) Mark Kirkwood, 08/14/2014 01:38 am)
At this point, the content is ok. But the journal header still corrupted.

For mkfs, there are three times for journal write, two of those are write journal-header.
Because strace can't strace the content of io_submit.
Can you suggest how to debug this? Mark looks easily to reproduce this bug. This bug maybe resolved this time by you help.

I would add debug messages with hexdumps of everything given to io_submit. Either that will show the wrong data being written, or there will be a disparity between that and what read_journal reads back shortly after that... which might indicate a mix of buffered vs direct io or something.

Something like:

diff --git a/src/os/FileJournal.cc b/src/os/FileJournal.cc
index b347674..362cb1b 100644
--- a/src/os/FileJournal.cc
+++ b/src/os/FileJournal.cc
@@ -652,6 +652,10 @@ int FileJournal::read_header()
   int r = ::pread(fd, bp.c_str(), bp.length(), 0);
   bl.push_back(bp);

+  dout(20) << __func__ << "\n";
+  bl.hexdump(*_dout);
+  *_dout << dendl;
+
   try {
     bufferlist::iterator p = bl.begin();
     ::decode(header, p);
@@ -692,6 +696,9 @@ bufferptr FileJournal::prepare_header()
     header.committed_up_to = journaled_seq;
   }
   ::encode(header, bl);
+  dout(20) << __func__ << "\n";
+  bl.hexdump(*_dout);
+  *_dout << dendl;
   bufferptr bp = buffer::create_page_aligned(get_top());
   bp.zero();
   memcpy(bp.c_str(), bl.c_str(), bl.length());

maybe?

#53 Updated by Sage Weil over 9 years ago

I've reverted 4eb18dd487da4cb621dcbecfc475fc0871b356ac on next so we can release v0.84. once we sort this out we can reapply on master.

#54 Updated by Mark Kirkwood over 9 years ago

Sage's comment suggested I check something - reverting 4eb18dd487da4cb621dcbecfc475fc0871b356ac from wip-9073 and running the 'usual' test several times (sometomes dd'ing zeros to the device 1st, sometimes not) results in a journal with header:

$ hexdump -n8 journalblk-prestart-1.txt
0000000 0004 0000 0040 0000

every time (18 consecutive runs).

#55 Updated by jianpeng ma over 9 years ago

Hi Mark,
Could you test again? I add more debug message this time.

Thanks!

#56 Updated by Mark Kirkwood over 9 years ago

Here's the log with that patch applied.

#57 Updated by Mark Kirkwood over 9 years ago

Meanwhile, I have been doing a little digging of my own: if I disable dio or aio via

[osd]
journal [d,a]io = false

then the journal header always looks like:

$ hexdump -n8 journalblk.txt
0000000 0004 0000 0040 0000

...and the osd starts up ok. So the 'problem' seems to be connected with dio and/or aio (and perhaps some oddity with how they are implemented on Intel firmware).

#58 Updated by jianpeng ma over 9 years ago

Using my latest patch, is journal-header corrupt?
From my debug info, before io_submit and after aio completed, the content of buffer can't change. It's ok.
Can you send the first 8byte of journal-header?

#59 Updated by Mark Kirkwood over 9 years ago

With only your latest pacth applied to wip-9073 I'm seeing a good journal header:

$ hexdump -n8 journalblk-prestart-19598.txt
0000000 0004 0000 0040 0000

#60 Updated by Mark Kirkwood over 9 years ago

Seems I spoke too soon - a few more runs showed up:

$ hexdump -n8 journalblk-prestart--20864.txt
0000000 7000 0335 0000 0000 0001 0000 0000 0000

#61 Updated by jianpeng ma over 9 years ago

From the latest ceph-osd.o.log. Before io_submit, the content is ok.
I found another issue.
2014-08-18 20:10:09.700185 7f66d2b65700 10 journal* write_finish_thread_entry exit*
2014-08-18 20:10:09.700210 7f66d3b67700 20 journal write_thread_entry woke up
2014-08-18 20:10:09.700223 7f66d3b67700 20 journal prepare_multi_write queue_pos now 8192
2014-08-18 20:10:09.700235 7f66d3b67700 10 journal prepare_header ptr= 0x3388000
2014-08-18 20:10:09.700237 7f66d3b67700 10 journal prepare_header content=0x4 ,0x=40
2014-08-18 20:10:09.700240 7f66d3b67700 15 journal do_aio_write writing 8192~0 + header
2014-08-18 20:10:09.700242 7f66d3b67700 20 journal write_aio_bl 0~4096 seq 0
2014-08-18 20:10:09.700244 7f66d3b67700 10 journal write_aio_bl ptr= 0x3388000
2014-08-18 20:10:09.700246 7f66d3b67700 10 journal write_aio_bl content=0x4 ,0x=40
2014-08-18 20:10:09.700249 7f66d3b67700 20 journal write_aio_bl .. 0~4096 in 1
2014-08-18 20:10:09.700251 7f66d3b67700 10 journal write_aio_bl ptr= 0x3388000
2014-08-18 20:10:09.700252 7f66d3b67700 10 journal write_aio_bl content=0x4 ,0x=40
2014-08-18 20:10:09.700278 7f66d3b67700 20 journal write_aio_bl 8192~0 seq 2
2014-08-18 20:10:09.700282 7f66d3b67700 5 journal put_throttle finished 0 ops and 0 bytes, now 0 ops and 0 bytes
2014-08-18 20:10:09.700285 7f66d3b67700 10 journal write_thread_entry finish

Although, i don't whether this is a bug. But i think write_finish_thread_entry should after write_thread_entry.

If only use aio(don't set dio), can the bug occur?
Your kerne version?

#62 Updated by Mark Kirkwood over 9 years ago

Yeah, disabling dio seems to get a consistently good header (10 consecutive runs)

#63 Updated by Mark Kirkwood over 9 years ago

...oh and kernel is 3.13.0-34-generic (sorry)!

#64 Updated by jianpeng ma over 9 years ago

I have a thought. It's strange.
Using aio, the kernel use user-space to write. But if before write to journal, the use-space freed(ceph-osd exit). What happed?
Are you remember i add read_header the bug can't reproduce. From the log, we see after io_submit, the ceph-osd exit.But it don't check aio return.

Can you apply this:
diff --git a/src/os/FileJournal.cc b/src/os/FileJournal.cc
index e18ceba..26c08bf 100644
--- a/src/os/FileJournal.cc
+++ b/src/os/FileJournal.cc
@ -1114,7 +1114,7 @ void FileJournal::write_thread_entry()
}

#ifdef HAVE_LIBAIO
- if (aio) {
+ if (aio && !write_stop) {
Mutex::Locker locker(aio_lock);
// should we back off to limit aios in flight? try to do this
// adaptively so that we submit larger aios once we have lots of
@ -1165,7 +1165,7 @ void FileJournal::write_thread_entry()
}
#ifdef HAVE_LIBAIO
- if (aio)
+ if (aio && !write_stop)
do_aio_write(bl);
else
do_write(bl);

Maybe this the last patch i can do. I already try my best.

#65 Updated by Mark Kirkwood over 9 years ago

growl, make that 3.13.0-32-generic, typed 'uname -a' in wrong (x)window before!

#66 Updated by Mark Kirkwood over 9 years ago

Lol, you certainly have - been a pleasure debugging this with you!

I actually applied the patch attached in this note, as the web page has mangled yours. With it applied to wip-9073 + previous debugging patch I'm seeing a good header in 10 consecutive runs. So I think you have nailed it!

#67 Updated by jianpeng ma over 9 years ago

What's mean of nail it? sorry, i don't know.

#68 Updated by Mark Kirkwood over 9 years ago

Heh - sorry, means 'really fixed it well'!

#69 Updated by jianpeng ma over 9 years ago

Are you only apply debug-journal-header-3.diff on wip-9073 to test ?

#70 Updated by Mark Kirkwood over 9 years ago

I had your last debugging diff on there as well (I can retest without that if needed).

#71 Updated by jianpeng ma over 9 years ago

I think you should retest only using debug-journal-header-3.diff on wip-9073. And test more times to avoid the bug reoccur.
Thanks very much!

#72 Updated by Mark Kirkwood over 9 years ago

Retested with only debug-journal-header-3.diff on wip-9073. I did 200 test runs, good journal every time.

#73 Updated by jianpeng ma over 9 years ago

Hmm, thanks very much! I'll send the patch.
Thanks again, Mark!

#74 Updated by Mark Kirkwood over 9 years ago

Excellent. Purely out of interest, any idea (now) why we only saw this bug on one particular system?

#75 Updated by jianpeng ma over 9 years ago

My linux version is 3.2 && 3.5. I'll test on 3.13.0-32-generic to find whether kernel cause this bug.

#76 Updated by Mark Kirkwood over 9 years ago

FWIW - checked this myself on my home machine (which was not seeing this last issue recall, only the hang) by rebooting into 3.13.0-32-generic...can not reproduce.

Differences between home and work machine are not that great:

Home:
Haswell chipset (H87)
i7 cpu (4770)
16G RAM
2x WD Black SATA 6 GBit
2x Crucial M4 SATA 6 Gbit
Ubuntu 14.04

Work:
Ivy Bridge chipset (Z77)
i7 cpu (3770K)
16G RAM
2x WD Vraptor SATA 3 Gbit
2x Intel 520 SATA 6 Gbit
Ubuntu 14.04

#77 Updated by Sage Weil over 9 years ago

  • Priority changed from Urgent to High

#78 Updated by Mark Kirkwood over 9 years ago

Doing a little more digging for the cause of 2/ (invalid argument erro). Using unpatched ipw-0973 and changing the journal from:

/dev/sdc1 # Intel 520

to

/dev/sdb5 # WD Velociraptor

results in a good journal header. Switching back to the Intel 520 gets us a bad header (most of the time). This seems to strongly implicate the Intel 520. I'd suggest trying to get hold on one, or else getting access to a system with one or more of 'em (I believe Mark Nelson was doing benchmarking on a Supermico box that had 520's in it).

#79 Updated by Mark Kirkwood over 9 years ago

That should have said unpatched wip-9073.

#80 Updated by Mark Kirkwood over 9 years ago

It might be worth trying an Intel 530 if that is dramatically easier to source - as it is similar to the 520 in the most significant aspect (i.e is a Sandforce based drive). I'll see if I can source one from work too.

#81 Updated by Mark Kirkwood over 9 years ago

A related thought is that the Intel 520s are plugged into the sata 6Gbit ports on the motherboard, so if there are any firmware bugs with these early Ivy Bridge mobos them we could be seeing them. I'll try plugging one of the 520s into a 3Gbit sata port and see if anything changes. Also I'm thinking a motherboard bios update might be worth applying.

#82 Updated by Mark Kirkwood over 9 years ago

Updated the bios on the work machine. No difference.

#83 Updated by Mark Kirkwood over 9 years ago

Plugging one of the 520s into a 3Gbit sata port makes no difference either.

#84 Updated by Sage Weil over 9 years ago

  • Status changed from 7 to Resolved

#85 Updated by Loïc Dachary over 9 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to firefly

The fix for this bug is https://github.com/ceph/ceph/commit/c776a89880fdac270e6334ad8e49fa616d05d0d4 and needs backport to firefly because it showed up as described here http://tracker.ceph.com/issues/9768

#86 Updated by Loïc Dachary over 9 years ago

  • Category set to OSD
  • Assignee set to Loïc Dachary

#87 Updated by Loïc Dachary over 9 years ago

  • Status changed from Pending Backport to Fix Under Review

#88 Updated by Mark Kirkwood over 9 years ago

I don't think this is the patch you want see c776a89880fdac270e6334ad8e49fa616d05d0d4 and acfe62e0aa45bff208e38aeedad56eb156ddf07a in wip-aio-journal.

#89 Updated by Loïc Dachary over 9 years ago

Indeed ! Thanks !

#91 Updated by Samuel Just over 9 years ago

  • Status changed from Fix Under Review to Resolved

#92 Updated by Loïc Dachary over 9 years ago

  • Status changed from Resolved to 7

https://github.com/ceph/ceph/pull/2764 is a better fix. The isolated patch made sense to me at the time but it looks like it was not enough.

#93 Updated by Loïc Dachary over 9 years ago

  • Status changed from 7 to Pending Backport

#94 Updated by Loïc Dachary over 9 years ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100

Also available in: Atom PDF