Bug #8300

closed

Regression in 3.14: "No such device or address" reading file content

Added by Markus Blank-Burian almost 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

On kernel 3.14.2 with ceph 0.72.2, reading from some files fails with the error message "No such device or address". Kernels 3.10.36, 3.12.17 and 3.13.11 can read the files' content successfully. The file is visible via "ls", but a simple "cat" fails. I see the same error on different nodes, and it persists across remounts. A ceph-mds debug trace made during the "cat" test is attached. The affected file is "T0.01_N260_S0.005/r21/lowtempshear.h5".
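To see how many files show the same symptom, a directory tree can be scanned with a small script. The following is only a sketch: it assumes the failure surfaces to user space as errno ENXIO (the code behind the "No such device or address" message on Linux), and the function name `find_enxio_files` is hypothetical, not part of any Ceph tooling.

```python
import errno
import os


def find_enxio_files(root, chunk_size=1 << 20):
    """Return paths under root whose content cannot be read due to ENXIO."""
    failing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    # Read the whole file in chunks, since the error may
                    # only appear for some parts of a large file.
                    while fh.read(chunk_size):
                        pass
            except OSError as exc:
                # Only ENXIO is recorded; other errors are ignored here.
                if exc.errno == errno.ENXIO:
                    failing.append(path)
    return failing
```

Calling `find_enxio_files()` with the mount point of the file system would return the list of affected paths; an empty list means no file under that root failed with ENXIO.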


Files

ceph-mds.txt.gz (430 KB), Markus Blank-Burian, 05/07/2014 04:28 AM
kaa-94.txt (540 KB), client trace, Markus Blank-Burian, 05/07/2014 08:02 AM
osdmap.txt (4.17 KB), Markus Blank-Burian, 05/07/2014 08:02 AM
patch (3.99 KB), Ilya Dryomov, 05/08/2014 06:40 AM

Actions #1

Updated by Greg Farnum almost 10 years ago

Are there any messages in dmesg on the affected node? Do you have debugfs enabled?

Actions #2

Updated by Greg Farnum almost 10 years ago

I should note that the MDS is behaving fine according to that log; Zheng thinks there's been a regression in the CRUSH code since nothing else generates an ENXIO.

Actions #3

Updated by Markus Blank-Burian almost 10 years ago

dmesg shows nothing special without debugging enabled. I attached the kernel debug output as well as the osdmap. Could it be a problem that the latter contains non-existent hosts?

Actions #4

Updated by Ilya Dryomov almost 10 years ago

  • Assignee set to Ilya Dryomov

Hi Markus,

Judging by debug output, I'm assuming you can build your own kernels?

Actions #5

Updated by Markus Blank-Burian almost 10 years ago

Yes, we build our own kernels, so patching/testing is possible.

Actions #6

Updated by Ilya Dryomov almost 10 years ago

OK, please try the attached patch (on top of 3.14.2) and see if it fixes the problem.

Actions #7

Updated by Markus Blank-Burian almost 10 years ago

Yes, your patch fixes the problem. Thank you very much for looking into this!

Actions #8

Updated by Ilya Dryomov almost 10 years ago

  • Status changed from New to Resolved

Great, this patch is in 3.15-rc1 ("crush: fix off-by-one errors in total_tries refactor"). I'll make sure it gets into 3.14 stable.
