Bug #1862

filestore: EINVAL on replay

Added by Marco Aroldi about 12 years ago. Updated about 12 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: OSD
% Done: 0%
Regression: No
Severity: 3 - minor

Description

Hello,
I'm testing Ceph 0.39 on two VMs (Ubuntu 11.10) on Hyper-V with all Linux Integration Components installed.
Each VM has 2 OSDs on 2 btrfs-formatted devices (/dev/sdb and /dev/sdc).
First machine: Mon, Mds, Osd.1 and Osd.2
Second machine: Osd.3 and Osd.4

After 1 hour and 20 minutes, I experienced some network trouble, so I rebooted the machines.
At 12:10, osd.1 and osd.2 started to crash. See the attached log.
I'm not able to bring those OSDs back online.

Any help is appreciated.
Marco

osd.1.log.zip (1.07 MB) Marco Aroldi, 12/28/2011 07:25 AM

osd.1.log-0.39-195-ge18b1c9 (27.1 KB) Marco Aroldi, 12/29/2011 03:33 AM

osd.2.log-0.39-195-ge18b1c9 (10.8 KB) Marco Aroldi, 12/29/2011 03:33 AM


Related issues

Duplicates Ceph - Bug #1759: mds/client: truncate size overflow, fails with EINVAL Resolved 11/29/2011

History

#1 Updated by Sage Weil about 12 years ago

  • Category set to OSD
  • Status changed from New to Need More Info
  • Target version set to v0.40

Can you try running the latest master code and restart ceph-osd? Specifically, 7133a2faf0ae0710b7cbd9801c64767172d48faf will dump the contents of the transaction that failed to replay to the log.

Thanks!

#2 Updated by Sage Weil about 12 years ago

  • Subject changed from Osd crashes after a while to filestore: EINVAL on replay

#3 Updated by Marco Aroldi about 12 years ago

Hi Sage,
I'm sorry but I don't understand the steps requested.
Please, could you explain a little bit more?

#4 Updated by Marco Aroldi about 12 years ago

Hi,
I've downloaded and compiled the latest code from the git repository.
I ran "ceph-osd -i 1 --debug_ms 20" and "/etc/init.d/ceph -a restart", but nothing has changed.
See the attached logs.

#5 Updated by Sage Weil about 12 years ago

  • Status changed from Need More Info to Duplicate

Aha, this is actually #1759. If you apply the patch in that bug report it'll get your OSDs up and running again. The master branch also has some additional checks and asserts now that will catch the MDS bug on the MDS side (instead of clobbering the OSD), which should help us find the actual problem. Thanks!

#6 Updated by Marco Aroldi about 12 years ago

Hmmm
I have another problem: I tried the patch in #1759, but I get an error at compile time:

CXX    libos_la-FileStore.lo
os/FileStore.cc: In member function 'int FileStore::_truncate(coll_t, const hobject_t&, uint64_t)':
os/FileStore.cc:2608:21: error: 'int64' was not declared in this scope
os/FileStore.cc:2608:27: error: expected ')' before 'size'
make[3]: *** [libos_la-FileStore.lo] Error 1
make[3]: Leaving directory `/root/ceph/src'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/root/ceph/src'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/root/ceph/src'
make: *** [all-recursive] Error 1

This is my git diff:

diff --git a/src/os/FileStore.cc b/src/os/FileStore.cc
index d25d974..910b52c 100644
--- a/src/os/FileStore.cc
+++ b/src/os/FileStore.cc
@@ -2605,6 +2605,8 @@ int FileStore::_remove(coll_t cid, const hobject_t& oid)
 int FileStore::_truncate(coll_t cid, const hobject_t& oid, uint64_t size) {
   dout(15) << "truncate " << cid << "/" << oid << " size " << size << dendl;
+  if (replaying && (int64)size < 0)
+    return 0;
   int r = lfn_truncate(cid, oid, size);
   dout(10) << "truncate " << cid << "/" << oid << " size " << size << " = " << r << dendl;
   return r;

#7 Updated by Sage Weil about 12 years ago

Marco Aroldi wrote:

Hmmm
I have another problem: I tried the patch in #1759, but I get an error at compile time:

CXX libos_la-FileStore.lo
os/FileStore.cc: In member function 'int FileStore::_truncate(coll_t, const hobject_t&, uint64_t)':
os/FileStore.cc:2608:21: error: 'int64' was not declared in this scope

Whoops, that should be int64_t, not int64.
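
For anyone else hitting this, here is a minimal standalone sketch of what the corrected check does. It is not Ceph code: should_skip_truncate is a hypothetical helper name, and the reading that a size with the top bit set is an overflowed value is an assumption based on the title of #1759 ("truncate size overflow"). The fixed-width type int64_t comes from <cstdint>; plain int64 is not a standard type, which is why the compiler rejected it.

```cpp
#include <cstdint>

// Hypothetical helper, not part of Ceph: mirrors the condition from the
// workaround once int64 is replaced by int64_t. A uint64_t with its top bit
// set turns negative when converted to int64_t on the usual two's-complement
// platforms, so the check flags sizes that look like an overflow.
constexpr bool should_skip_truncate(bool replaying, uint64_t size) {
  return replaying && static_cast<int64_t>(size) < 0;
}

// Compile-time checks of the behaviour described above.
static_assert(should_skip_truncate(true, static_cast<uint64_t>(-1)),
              "overflowed size is skipped during replay");
static_assert(!should_skip_truncate(true, 4096),
              "ordinary size is still truncated during replay");
static_assert(!should_skip_truncate(false, static_cast<uint64_t>(-1)),
              "outside replay the check does not apply");
```

In the patched _truncate, a true result corresponds to the early "return 0", so journal replay skips the bad truncate instead of failing with EINVAL.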

#8 Updated by Marco Aroldi about 12 years ago

Hello Sage,
int64_t did the trick, and now the OSDs are back online again!

Thank you
Marco
