Project

General

Profile

Actions

Bug #7371

closed

libceph: tid 36856 reply has 196608 bytes we had only 147456 bytes ready

Added by Sage Weil about 10 years ago. Updated almost 10 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
emperor, dumpling
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ubuntu@teuthology:/a/teuthology-2014-02-08_23:05:12-kcephfs-next-testing-basic-plana/73644

tasks:
- chef: null
- clock.check: null
- install: null
- ceph: null
- kclient: null
- workunit:
clients:
all:
- suites/fsx.sh


Files

patch (476 Bytes) patch Zheng Yan, 06/17/2014 04:58 AM
Actions #1

Updated by Zheng Yan about 10 years ago

  • Project changed from CephFS to Ceph
  • Category changed from 53 to OSD

I applied following patch

diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index 826a7e9..9eaec5f 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -1105,7 +1105,7 @@ void ReplicatedPG::do_op(OpRequestRef op)
     op->may_cache() ||
     (m->get_flags() & CEPH_OSD_FLAG_RWORDERED);

-  dout(10) << "do_op " << *m
+  dout(1) << "do_op " << *m
           << (op->may_write() ? " may_write" : "")
           << (op->may_read() ? " may_read" : "")
           << (op->may_cache() ? " may_cache" : "")
@@ -5024,6 +5024,9 @@ void ReplicatedPG::complete_read_ctx(int result, OpContext *ctx)
   }

   reply->add_flags(CEPH_OSD_FLAG_ACK | CEPH_OSD_FLAG_ONDISK);
+
+  dout(1) << "complete_read_ctx " << *reply << dendl;
+
   osd->send_message_osd_client(reply, m->get_connection());
   close_op_ctx(ctx);
 }
diff --git a/src/osd/osd_types.cc b/src/osd/osd_types.cc
index 59a40d8..e49fa5d 100644
--- a/src/osd/osd_types.cc
+++ b/src/osd/osd_types.cc
@@ -4121,6 +4121,7 @@ ostream& operator<<(ostream& out, const OSDOp& op)
       out << " " << op.op.extent.offset << "~" << op.op.extent.length;
       if (op.op.extent.truncate_seq)
        out << " [" << op.op.extent.truncate_seq << "@" << (int64_t)op.op.extent.truncate_size << "]";
+      out << " in " << op.indata.length() << " out " << op.outdata.length();
     }
   } else if (ceph_osd_op_type_attr(op.op.op)) {
     // xattr name

Found following lines in OSD log after kclient complainted "libceph: tid 1212385 reply has 2706423 bytes we had only 129060 bytes ready"

2014-02-16 16:46:05.063490 7f32e31a4700  1 osd.3 pg_epoch: 3421 pg[8.4c( v 3421'1400939 (3421'1399938,3421'1400939] local-les=3416 n=214 ec=27 les/c 3416/3416 3415/3415/3415) [3,2] r=0 lpr=3415 crt=3402'1354393 lcod 3421'1400938 mlcod 3421'1400934 active+clean] do_op osd_op(mds.0.56:6605225 200.0001116c [write 1212385~169 in 169 out 0] 8.4c120e4c ondisk+write e3419) v4 may_write -> write-ordered flags ondisk+write
2014-02-16 16:49:20.083989 7f32e31a4700  1 osd.3 pg_epoch: 3421 pg[7.39( v 3421'277368 (3421'276367,3421'277368] local-les=3416 n=612 ec=26 les/c 3416/3416 3415/3415/3415) [3,2] r=0 lpr=3415 crt=3402'272042 lcod 3421'277367 mlcod 3421'277367 active+clean] do_op osd_op(client.5813.1:1212385 10000956144.00000000 [read 1487881~129060 [2@0] in 0 out 0] 7.dc115239 read e3421) v4 may_read -> read-ordered flags read
2014-02-16 16:49:20.085006 7f32e31a4700  1 osd.3 pg_epoch: 3421 pg[7.39( v 3421'277368 (3421'276367,3421'277368] local-les=3416 n=612 ec=26 les/c 3416/3416 3415/3415/3415) [3,2] r=0 lpr=3415 crt=3402'272042 lcod 3421'277367 mlcod 3421'277367 active+clean] complete_read_ctx osd_op_reply(1212385 10000956144.00000000 [read 1487881~129060 [2@0] in 0 out 2706423] v0'0 uv277368 ondisk = 0) v6
Actions #2

Updated by Zheng Yan about 10 years ago

  • Status changed from New to 7
Actions #3

Updated by Sage Weil about 10 years ago

  • Status changed from 7 to Pending Backport
Actions #4

Updated by Ian Colle about 10 years ago

  • Priority changed from Urgent to High
  • Backport set to emperor, dumpling
Actions #5

Updated by Samuel Just about 10 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (OSD)
Actions #6

Updated by Samuel Just about 10 years ago

  • Project changed from CephFS to Ceph
Actions #7

Updated by Sage Weil almost 10 years ago

  • Status changed from Pending Backport to Resolved

not worried about backporting this now that firefly is out.

Actions #8

Updated by David Halls almost 10 years ago

I'm seeing this on Ubuntu 14.04 (Ceph 0.80).

Actions #9

Updated by Sage Weil almost 10 years ago

Are all of the running ceph-osd daemons 0.80? Is the client cephfs or rbd?

Actions #10

Updated by David Halls almost 10 years ago

Yes - I have two ceph-osd machines and 4 cephfs client machines, all running Ubuntu 14.04 and the Ceph versions from Ubuntu updates.

I'm seeing the issue when running my own tests from Node.js (https://github.com/davedoesdev/qlobber-fsq#test).

Actions #11

Updated by David Halls almost 10 years ago

With a single cephfs client machine, I had no problems.
With two client machines, my tests failed again (missed reads), with this in the syslog of one of the clients:

Jun 16 22:12:08 david-desktop kernel: [  619.208236] libceph: tid 823 reply has 1023 bytes we had only 0 bytes ready

I'm happy to dig into it a bit more. Is there any logs in particular that would be useful?

Actions #12

Updated by Zheng Yan almost 10 years ago

which version of kernel ?

Actions #13

Updated by Zheng Yan almost 10 years ago

can you enable debugging mode (echo module ceph +p > /sys/kernel/debug/dynamic_debug/control) on cephfs. I suspect the user space code requests zero-length read.

Actions #14

Updated by David Halls almost 10 years ago

Kernel is 3.13.0. Happens with 32 or 64 bit. Here's part of the log from just before the error (last line has the error message). Let me know if you need more.

Jun 17 09:17:50 david-desktop kernel: [ 4160.454999] ceph:  __ceph_caps_issued d7bf2c04 cap d7bf53c0 issued pAsLsXsFr
Jun 17 09:17:50 david-desktop kernel: [ 4160.455005] ceph:  check_caps d7bf2c04 file_want pAsxXsxFsxcrwb used p dirty - flushing - issued pAsLsXsFr revoking - retain pAsxLsxXsxFsxcrwbl  AUTHONLY NODELAY
Jun 17 09:17:50 david-desktop kernel: [ 4160.455024] ceph:  mdsc con_put e5404000 (9)
Jun 17 09:17:50 david-desktop kernel: [ 4160.455027] ceph:  mdsc put_session e5404000 10 -> 9
Jun 17 09:17:50 david-desktop kernel: [ 4160.455032] ceph:  mdsc con_put e5404000 (8)
Jun 17 09:17:50 david-desktop kernel: [ 4160.455035] ceph:  mdsc put_session e5404000 9 -> 8
Jun 17 09:17:50 david-desktop kernel: [ 4160.455043] ceph:  do_request waited, got 0
Jun 17 09:17:50 david-desktop kernel: [ 4160.455047] ceph:  do_request e5556c00 done, result 0
Jun 17 09:17:50 david-desktop kernel: [ 4160.455052] ceph:  put_cap_refs d7bf2c04 had p
Jun 17 09:17:50 david-desktop kernel: [ 4160.455055] ceph:  mdsc put_session e5404000 8 -> 7
Jun 17 09:17:50 david-desktop kernel: [ 4160.455059] ceph:  unreserve caps ctx=e5556ddc count=0
Jun 17 09:17:50 david-desktop kernel: [ 4160.455062] ceph:  do_getattr result=0
Jun 17 09:17:50 david-desktop kernel: [ 4160.455074] ceph:  aio_read sync_read hit hole, ppos 2 < size 25601, reading more
Jun 17 09:17:50 david-desktop kernel: [ 4160.455078] ceph:  get_cap_refs d7bf2c04 need Fr want Fc
Jun 17 09:17:50 david-desktop kernel: [ 4160.455082] ceph:  __ceph_caps_issued d7bf2c04 cap d7bf53c0 issued pAsLsXsFr
Jun 17 09:17:50 david-desktop kernel: [ 4160.455086] ceph:  get_cap_refs d7bf2c04 have pAsLsXsFr but not Fc (revoking -)
Jun 17 09:17:50 david-desktop kernel: [ 4160.455090] ceph:  get_cap_refs d7bf2c04 ret 1 got Fr
Jun 17 09:17:50 david-desktop kernel: [ 4160.455095] ceph:  aio_read d7bf2c04 10000030519.fffffffffffffffe 0~0 got cap refs on Fr
Jun 17 09:17:50 david-desktop kernel: [ 4160.455098] ceph:  sync_read on file e702e480 2~0 
Jun 17 09:17:50 david-desktop kernel: [ 4160.456917] libceph: tid 1346 reply has 25599 bytes we had only 0 bytes ready
Actions #15

Updated by Zheng Yan almost 10 years ago

I thing I found the buggy code, but it has been changed in 3.14 kernel. please try 3.14 kernel or the attached patch.

Actions #16

Updated by David Halls almost 10 years ago

Yes that worked on 3.14.1 kernel on Ubuntu 14.04 (mix of 32, 64 bit and wifi and ethernet). Great work on CephFS and thanks for your prompt reply on this issue.

Actions

Also available in: Atom PDF