Project

General

Profile

Actions

Bug #9570

closed

osd crash in FileJournal::WriteFinisher::entry() aio

Added by Sheldon Mustard over 9 years ago. Updated over 7 years ago.

Status:
Rejected
Priority:
High
Assignee:
Category:
OSD
Target version:
-
% Done:

30%

Source:
Support
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Workaround

Try with a kernel newer than 3.13 - as new as the environment allows.

Collect more information

If the problem happens in your environment, increase the in memory logging with:

debug osd = 0/30
debug filestore = 0/30
debug ms = 0/1
debug journal = 0/30

save the core dump and add it to this issue, with the log, when it happens.

Original description

On an OSD running dumpling v0.67.10

   -13> 2014-09-22 16:08:54.484825 7feeb1ebd700  5 --OSD::tracker-- reqid: client.19451671.0:182034, seq: 81280699, time: 2014-09-22 16:08:54.484530, event: header_read, request: osd_sub_op(client.19451671.0:182034 366.33e f4e4873e/rb.0.113a598.238e1f29.000000000e4e/head//366 [] v 2466'680337 snapset=0=[]:[] snapc=0=[]) v7
   -12> 2014-09-22 16:08:54.484842 7feeb1ebd700  5 --OSD::tracker-- reqid: client.19451671.0:182034, seq: 81280699, time: 2014-09-22 16:08:54.484555, event: throttled, request: osd_sub_op(client.19451671.0:182034 366.33e f4e4873e/rb.0.113a598.238e1f29.000000000e4e/head//366 [] v 2466'680337 snapset=0=[]:[] snapc=0=[]) v7
   -11> 2014-09-22 16:08:54.484854 7feeb1ebd700  5 --OSD::tracker-- reqid: client.19451671.0:182034, seq: 81280699, time: 2014-09-22 16:08:54.484722, event: all_read, request: osd_sub_op(client.19451671.0:182034 366.33e f4e4873e/rb.0.113a598.238e1f29.000000000e4e/head//366 [] v 2466'680337 snapset=0=[]:[] snapc=0=[]) v7
   -10> 2014-09-22 16:08:54.484866 7feeb1ebd700  5 --OSD::tracker-- reqid: client.19451671.0:182034, seq: 81280699, time: 2014-09-22 16:08:54.484819, event: dispatched, request: osd_sub_op(client.19451671.0:182034 366.33e f4e4873e/rb.0.113a598.238e1f29.000000000e4e/head//366 [] v 2466'680337 snapset=0=[]:[] snapc=0=[]) v7
    -9> 2014-09-22 16:08:54.484878 7feeb1ebd700  5 --OSD::tracker-- reqid: client.19451671.0:182034, seq: 81280699, time: 2014-09-22 16:08:54.484878, event: waiting_for_osdmap, request: osd_sub_op(client.19451671.0:182034 366.33e f4e4873e/rb.0.113a598.238e1f29.000000000e4e/head//366 [] v 2466'680337 snapset=0=[]:[] snapc=0=[]) v7
    -8> 2014-09-22 16:08:54.484966 7feead6b4700  5 --OSD::tracker-- reqid: client.19451671.0:182034, seq: 81280699, time: 2014-09-22 16:08:54.484966, event: reached_pg, request: osd_sub_op(client.19451671.0:182034 366.33e f4e4873e/rb.0.113a598.238e1f29.000000000e4e/head//366 [] v 2466'680337 snapset=0=[]:[] snapc=0=[]) v7
    -7> 2014-09-22 16:08:54.485003 7feead6b4700  5 --OSD::tracker-- reqid: client.19451671.0:182034, seq: 81280699, time: 2014-09-22 16:08:54.485003, event: started, request: osd_sub_op(client.19451671.0:182034 366.33e f4e4873e/rb.0.113a598.238e1f29.000000000e4e/head//366 [] v 2466'680337 snapset=0=[]:[] snapc=0=[]) v7
    -6> 2014-09-22 16:08:54.485121 7feead6b4700  5 --OSD::tracker-- reqid: client.19451671.0:182034, seq: 81280699, time: 2014-09-22 16:08:54.485121, event: started, request: osd_sub_op(client.19451671.0:182034 366.33e f4e4873e/rb.0.113a598.238e1f29.000000000e4e/head//366 [] v 2466'680337 snapset=0=[]:[] snapc=0=[]) v7
    -5> 2014-09-22 16:08:54.485157 7feead6b4700  5 --OSD::tracker-- reqid: client.19451671.0:182034, seq: 81280699, time: 2014-09-22 16:08:54.485157, event: commit_queued_for_journal_write, request: osd_sub_op(client.19451671.0:182034 366.33e f4e4873e/rb.0.113a598.238e1f29.000000000e4e/head//366 [] v 2466'680337 snapset=0=[]:[] snapc=0=[]) v7
    -4> 2014-09-22 16:08:54.536191 7feeaf6b8700  1 -- 10.10.10.7:6810/25820 <== osd.9 10.10.10.24:0/16694 156020 ==== osd_ping(ping e2466 stamp 2014-09-22 16:08:54.535761) v2 ==== 47+0+0 (2119288752 0 0) 0x5071a8c0 con 0x4f26d840
    -3> 2014-09-22 16:08:54.536226 7feeb06ba700  1 -- 10.10.10.7:6812/25820 <== osd.9 10.10.10.24:0/16694 156020 ==== osd_ping(ping e2466 stamp 2014-09-22 16:08:54.535761) v2 ==== 47+0+0 (2119288752 0 0) 0x516b1e00 con 0x4f33d6e0
    -2> 2014-09-22 16:08:54.536258 7feeaf6b8700  1 -- 10.10.10.7:6810/25820 --> 10.10.10.24:0/16694 -- osd_ping(ping_reply e2466 stamp 2014-09-22 16:08:54.535761) v2 -- ?+0 0x38f711c0 con 0x4f26d840
    -1> 2014-09-22 16:08:54.536436 7feeb06ba700  1 -- 10.10.10.7:6812/25820 --> 10.10.10.24:0/16694 -- osd_ping(ping_reply e2466 stamp 2014-09-22 16:08:54.535761) v2 -- ?+0 0x4cdcdc40 con 0x4f33d6e0
     0> 2014-09-22 16:08:54.558046 7feeca125700 -1 *** Caught signal (Aborted) **
 in thread 7feeca125700

 ceph version 0.67.10 (9d446bd416c52cd785ccf048ca67737ceafcdd7f)
 1: /usr/bin/ceph-osd() [0x7fd50a]
 2: (()+0xfcb0) [0x7feed27e7cb0]
 3: (gsignal()+0x35) [0x7feed0f3e0d5]
 4: (abort()+0x17b) [0x7feed0f4183b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7feed189769d]
 6: (()+0xb5846) [0x7feed1895846]
 7: (()+0xb5873) [0x7feed1895873]
 8: (()+0xb596e) [0x7feed189596e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x8c68df]
 10: (FileJournal::write_finish_thread_entry()+0x6d7) [0x766c67]
 11: (FileJournal::WriteFinisher::entry()+0xd) [0x69cbad]
 12: (()+0x7e9a) [0x7feed27dfe9a]
 13: (clone()+0x6d) [0x7feed0ffc31d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.18.log
--- end dump of recent events ---
2014-09-22 16:08:55.656515 7fd6f29d9780  0 ceph version 0.67.10 (9d446bd416c52cd785ccf048ca67737ceafcdd7f), process ceph-osd, pid 8912
2014-09-22 16:08:55.736899 7fd6f29d9780  1 filestore(/var/lib/ceph/osd/ceph-18) mount detected xfs
2014-09-22 16:08:55.736910 7fd6f29d9780  1 filestore(/var/lib/ceph/osd/ceph-18)  disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs
2014-09-22 16:08:55.745399 7fd6f29d9780  0 filestore(/var/lib/ceph/osd/ceph-18) mount FIEMAP ioctl is supported and appears to work
2014-09-22 16:08:55.745427 7fd6f29d9780  0 filestore(/var/lib/ceph/osd/ceph-18) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2014-09-22 16:08:55.746476 7fd6f29d9780  0 filestore(/var/lib/ceph/osd/ceph-18) mount did NOT detect btrfs
2014-09-22 16:08:55.775236 7fd6f29d9780  0 filestore(/var/lib/ceph/osd/ceph-18) mount syncfs(2) syscall fully supported (by glibc and kernel)
2014-09-22 16:08:55.775847 7fd6f29d9780  0 filestore(/var/lib/ceph/osd/ceph-18) mount found snaps <>
2014-09-22 16:08:56.254811 7fd6f29d9780  0 filestore(/var/lib/ceph/osd/ceph-18) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2014-09-22 16:09:00.301748 7fd6f29d9780  1 journal _open /var/lib/ceph/osd/ceph-18/journal fd 19: 16382951424 bytes, block size 4096 bytes, directio = 1, aio = 1
2014-09-22 16:09:00.620119 7fd6f29d9780  1 journal _open /var/lib/ceph/osd/ceph-18/journal fd 19: 16382951424 bytes, block size 4096 bytes, directio = 1, aio = 1
2014-09-22 16:09:00.651236 7fd6f29d9780  1 journal close /var/lib/ceph/osd/ceph-18/journal
2014-09-22 16:09:00.700334 7fd6f29d9780  1 filestore(/var/lib/ceph/osd/ceph-18) mount detected xfs
2014-09-22 16:09:00.707685 7fd6f29d9780  0 filestore(/var/lib/ceph/osd/ceph-18) mount FIEMAP ioctl is supported and appears to work
2014-09-22 16:09:00.707707 7fd6f29d9780  0 filestore(/var/lib/ceph/osd/ceph-18) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2014-09-22 16:09:00.708656 7fd6f29d9780  0 filestore(/var/lib/ceph/osd/ceph-18) mount did NOT detect btrfs
2014-09-22 16:09:00.731348 7fd6f29d9780  0 filestore(/var/lib/ceph/osd/ceph-18) mount syncfs(2) syscall fully supported (by glibc and kernel)
2014-09-22 16:09:00.731493 7fd6f29d9780  0 filestore(/var/lib/ceph/osd/ceph-18) mount found snaps <>
2014-09-22 16:09:00.744450 7fd6f29d9780  0 filestore(/var/lib/ceph/osd/ceph-18) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2014-09-22 16:09:00.773887 7fd6f29d9780  1 journal _open /var/lib/ceph/osd/ceph-18/journal fd 20: 16382951424 bytes, block size 4096 bytes, directio = 1, aio = 1
2014-09-22 16:09:00.799478 7fd6f29d9780  1 journal _open /var/lib/ceph/osd/ceph-18/journal fd 20: 16382951424 bytes, block size 4096 bytes, directio = 1, aio = 1


Files

ceph-osd.18.log.gz (204 KB) ceph-osd.18.log.gz Sheldon Mustard, 09/30/2014 08:02 AM

Related issues 2 (0 open2 closed)

Related to Ceph - Bug #12100: OSD crash, uexpected aio error in FileJournal.ccRejectedSage Weil06/20/2015

Actions
Related to Ceph - Bug #14558: journal AIO write error message is confusingResolved01/29/2016

Actions
Actions #1

Updated by Samuel Just over 9 years ago

what was the assert?

Actions #2

Updated by Samuel Just over 9 years ago

  • Priority changed from Normal to Urgent
Actions #3

Updated by Samuel Just over 9 years ago

2014-09-22 16:00:20.680448 7fee6abcf700 0 -- 10.10.10.7:6808/25820 >> 10.10.10.16:0/1007485 pipe(0xba12a00 sd=628 :6808 s=0 pgs=0 cs=0 l=1 c=0x5031b6e0).accept replacing existing (lossy) channel (new one lossy=1)
2014-09-22 16:08:54.105116 7feeca125700 -1 journal aio to 165552576~80 got (8192) Unknown error 8192
2014-09-22 16:08:54.334038 7feeca125700 -1 os/FileJournal.cc: In function 'void FileJournal::write_finish_thread_entry()' thread 7feeca125700 time 2014-09-22 16:08:54.118638
os/FileJournal.cc: 1377: FAILED assert(0 == "unexpected aio error")

ceph version 0.67.10 (9d446bd416c52cd785ccf048ca67737ceafcdd7f)
1: (FileJournal::write_finish_thread_entry()+0x6d7) [0x766c67]
2: (FileJournal::WriteFinisher::entry()+0xd) [0x69cbad]
3: (()+0x7e9a) [0x7feed27dfe9a]
4: (clone()+0x6d) [0x7feed0ffc31d]
NOTE: a copy of the executable, or `objdump -rdS &lt;executable&gt;` is needed to interpret this.
Actions #4

Updated by Sage Weil over 9 years ago

  • Assignee set to Sage Weil
Actions #5

Updated by Loïc Dachary over 9 years ago

  • Assignee changed from Sage Weil to Loïc Dachary
Actions #7

Updated by Loïc Dachary over 9 years ago

Sheldon, could you upload the full log somewhere if you still have it ?

Actions #8

Updated by Loïc Dachary over 9 years ago

journal aio to 165552576~80 got (8192) Unknown error 8192

meaning an attempt to write 80 bytes at offset 165552576 claims to have written 8192 bytes instead.
Actions #9

Updated by Loïc Dachary over 9 years ago

Maybe iterating the bufferptr can return bufferptr beyond the size of the bufferlist ? In which case it creates an iovec totaling 8192 or more, hence the return value.

Actions #10

Updated by Loïc Dachary over 9 years ago

I don't see how it could be related to a problem in align_bl or bufferlist::rebuild_align. The worst these could do is to not align the buffer but it will not change the length of the bufferlist.

Iterating over the bufferptr is fragile. If for whatever reason the bufferlist is made of two bufferptr, has a length of 80 which is entirely in the first bufferptr, the second bufferptr will be written as well although it is supposed to contain nothing. A bufferlist can operate normally with extra bufferptr that are just ignored by normal operations.

Actions #11

Updated by Sage Weil over 9 years ago

  • Status changed from New to In Progress
Actions #12

Updated by Loïc Dachary over 9 years ago

Reading the buffer.{h,cc} code it looks like the caller is protected from a situation where a bufferptr leftover can be found.

Actions #13

Updated by Loïc Dachary over 9 years ago

Exploring the idea that maybe the buffers pointed to by the iovec are overriden, mixed up

Actions #15

Updated by Loïc Dachary over 9 years ago

None of the commits in FileJournal.cc from dumpling to master fix something that could cause a problem of that nature.

Actions #16

Updated by Loïc Dachary over 9 years ago

  • Description updated (diff)
Actions #17

Updated by Loïc Dachary over 9 years ago

https://github.com/ceph/ceph/commit/66a9fbe2c7ba59b7cd034c17865adce3432cd2cb and https://github.com/ceph/ceph/commit/cd30e5fbb1d2d29b4fc0c2b13fa3b9b4493c24f0 were not backported to dumpling. As a result https://github.com/ceph/ceph/blob/dumpling/src/os/FileJournal.cc#L940 may not align the bufferlist as expected. I'll need to figure out how bad it is to not align the buffer.

Actions #18

Updated by Loïc Dachary over 9 years ago

https://github.com/ceph/ceph/pull/2611 seems like a good candidate for backport.

Actions #19

Updated by Loïc Dachary over 9 years ago

It turns out that the alignment requirement has to be enforced indeed. On a 3.13 linux kernel the following:

  #define _GNU_SOURCE        /* syscall() is not POSIX */

  #include <stdio.h>        /* for perror() */
  #include <unistd.h>        /* for syscall() */
  #include <sys/syscall.h>    /* for __NR_* definitions */
  #include <linux/aio_abi.h>    /* for AIO types and constants */
  #include <fcntl.h>        /* O_RDWR */
  #include <string.h>        /* memset() */
  #include <inttypes.h>    /* uint64_t */

  inline int io_setup(unsigned nr, aio_context_t *ctxp)
  {
      return syscall(__NR_io_setup, nr, ctxp);
  }

  inline int io_destroy(aio_context_t ctx) 
  {
      return syscall(__NR_io_destroy, ctx);
  }

  inline int io_submit(aio_context_t ctx, long nr,  struct iocb **iocbpp) 
  {
      return syscall(__NR_io_submit, ctx, nr, iocbpp);
  }

  inline int io_getevents(aio_context_t ctx, long min_nr, long max_nr,
          struct io_event *events, struct timespec *timeout)
  {
      return syscall(__NR_io_getevents, ctx, min_nr, max_nr, events, timeout);
  }

int main(int argc, char** argv)
  {
      aio_context_t ctx;
      struct iocb cb;
      struct iocb *cbs[1];
      char *data;
      struct io_event events[1];
      int ret;
      int fd;

      fd = open("/tmp/testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
      if (fd < 0) {
          perror("open error");
          return -1;
      }

      ctx = 0;

      ret = io_setup(128, &ctx);
      if (ret < 0) {
          perror("io_setup error");
          return -1;
      }

        ret = posix_memalign((void**)&data, 2048, 8192);
        if (ret)
          perror("posix_memalign");

        if (!strcmp(argv[1], "noalign"))
          data++;

      /* setup I/O control block */
      memset(&cb, 0, sizeof(cb));
      cb.aio_fildes = fd;
      cb.aio_lio_opcode = IOCB_CMD_PWRITE;

    /* command-specific options */
      cb.aio_buf = (uint64_t)data;
      cb.aio_offset = 0;
      cb.aio_nbytes = 4096;

      cbs[0] = &cb;

      ret = io_submit(ctx, 1, cbs);
      if (ret != 1) {
        if (ret < 0)
              perror("io_submit error");
        else
            fprintf(stderr, "could not sumbit IOs");
        return  -1;
    }

    /* get the reply */
    ret = io_getevents(ctx, 1, 1, events, NULL);
    printf("number of events %d\n", ret);

        printf("res = %d, res2 = %d\n", (int)events[0].res, (int)events[0].res2);

    ret = io_destroy(ctx);
    if (ret < 0) {
        perror("io_destroy error");
        return -1;
    }

     return 0;
  }

shows
gcc -o /tmp/2 /tmp/2.c ; /tmp/2 align ; /tmp/2 noalign 
number of events 1
res = 4096, res2 = 0
number of events 1
res = -22, res2 = 0

It also confirms that the problem is not a direct consequence of an unaligned buffer because it would have failed with -22 EINVAL at io_submit. What we're seeing is that it fails after io_submit with inconsistent results returned by io_getevents. Back to square one.

Actions #20

Updated by Loïc Dachary over 9 years ago

Although aio: protect reqs_available updates from changes in interrupt handlers looks like something that could indeed corrupt the returned event, it was introduced a month before being fixed" and the odds that the kernel used has this problem are small. I'm not even sure a stable release was published in the interval.

Actions #21

Updated by Loïc Dachary over 9 years ago

The aio: v4 ensure access to ctx->ring_pages is correctly serialised for migration buf fix could mix the content of the buffers but not their length.

Actions #22

Updated by Loïc Dachary over 9 years ago

  • Category set to OSD
  • Status changed from In Progress to Need More Info
Actions #23

Updated by Sheldon Mustard over 9 years ago

linux kernel is 3.12.7 and libaio is 0.3.109-2ubuntu1

Actions #24

Updated by Loïc Dachary over 9 years ago

  • Status changed from Need More Info to In Progress

debian/changelog does not show anything that suggest a bug was fixed in libaio after 0.3.109-2 which could relate to this problem.

libaio (0.3.109-4) unstable; urgency=low

  * Now using Standards-Version 3.9.4 (no changes needed).
  * Remove package creation information from debian/copyright.
  * Update upstream git URL in debian/copyright.
  * Add support for x32 (from the Yocto project). (Closes: #702183)
    Thanks to Daniel Schepler <dschepler@gmail.com>.
  * Add support for arm64 (stolen from upstream). (Closes: #702409)
    Thanks to Ian Campbell <ijc@hellion.org.uk>.
  * Add cross-compilation support.
    Based on a patch by Ian Campbell <ijc@hellion.org.uk>.
  * Update debian/watch file to a working URL.

 -- Guillem Jover <guillem@debian.org>  Mon, 06 May 2013 18:56:19 +0200

libaio (0.3.109-3) unstable; urgency=low

  * Escape backslash in man pages. (Closes: #651833)
    Thanks to Stephan Springl <springl-libaio@bfw-online.de>.
  * Do not install man pages now provided by manpages-dev. (Closes: #650108)
  * Use dpkg-buildflags to set CPPFLAGS, CFLAGS and LDFLAGS.
  * Now using Standards-Version 3.9.3 (no changes needed).

 -- Guillem Jover <guillem@debian.org>  Thu, 21 Jun 2012 09:21:59 +0200

Actions #25

Updated by Loïc Dachary over 9 years ago

linux 3.12.7 has been released january 2014. The patches to aio.c that could have been related

are not. The bug they fix cannot lead to that kind of problem.

Actions #26

Updated by Loïc Dachary over 9 years ago

What probably happens is that the aio_info that is sent to io_submit is overriden while waiting for the write to complete.

Actions #27

Updated by Loïc Dachary over 9 years ago

A theory that does not explain the problem, for the record.

<loicd> is it safe to assume the address of an element in a std::list will always point to the same element even when previous elements in the list are erased https://github.com/ceph/ceph/blob/giant/src/os/FileJournal.cc#L1406 ?
<sage> loicd: yeah.. the iterators are stable as long as the referenced element isn't removed
<sage> the ls.erase(p++) pattern is used all over the place
<loicd> but https://github.com/ceph/ceph/blob/giant/src/os/FileJournal.cc#L1304 is not an iterator
<loicd> and it is given to io_submit https://github.com/ceph/ceph/blob/giant/src/os/FileJournal.cc#L1318
<loicd> and surfaces later via io_getevents https://github.com/ceph/ceph/blob/giant/src/os/FileJournal.cc#L1367
<loicd> and I wonder if the content pointed to by the pointer could have changed in the meantime
<loicd> I think it can't happen if the events are received in the same order they are submitted
<loicd> not even
<sjusthm> loicd: it's stable
<loicd> sjusthm: ok :-)

Actions #28

Updated by Loïc Dachary over 9 years ago

  • Status changed from In Progress to Need More Info
  • % Done changed from 0 to 30

I'm out of ideas. There hopefully is enough background information to help with diagnostic when / if it re-surfaces.

Actions #29

Updated by Sage Weil over 9 years ago

  • Priority changed from Urgent to High
Actions #30

Updated by Loïc Dachary over 9 years ago

  • Status changed from Need More Info to Can't reproduce

It looks like it did not show up in the past three months, mark it "Can't reproduce" until it does.

Actions #31

Updated by Josh Durgin about 8 years ago

  • Status changed from Can't reproduce to New

This resurfaced, again with a result of 8192 when writing 80 bytes. This time it was with ceph 0.94.1, but the kernel was the same, 3.13.

Actions #32

Updated by Loïc Dachary about 8 years ago

  • Related to Bug #12100: OSD crash, uexpected aio error in FileJournal.cc added
Actions #33

Updated by Loïc Dachary about 8 years ago

New facts from http://tracker.ceph.com/issues/12100

http://tracker.ceph.com/issues/12100#note-2

The 1 minute load average before the crash was between 1 and 10.
Around the moment the crash happens (15:00:25) the load spikes, going up to 500, while CPU-nice counters go up significantly, too. It stays that hight until I restart the service.
At the moment of restart (15:08:29) the nice-counters drop back to 0 and the load tanks as well.

http://tracker.ceph.com/issues/12100#note-3

the hanging processes do not trigger an automatic respawning.

http://tracker.ceph.com/issues/12100#note-7

last night we saw a crash with this same trace again on one of the 48 OSDs... :(

After about 10 minutes it came back, this time without manual intervention. Probably Upstart took care of it.

I.e. the OSD dumps the trace but does not abort immediately. But seems to abort and be restarted after some time.

http://tracker.ceph.com/issues/12100#note-18

This is a hardware problem--we are correct in failing in this case.

Actions #34

Updated by Loïc Dachary about 8 years ago

  • Has duplicate Bug #11298: aio gets EPERM when update-grub runs added
Actions #35

Updated by Loïc Dachary about 8 years ago

  • Related to Bug #14558: journal AIO write error message is confusing added
Actions #36

Updated by Ilya Dryomov about 8 years ago

  • Has duplicate deleted (Bug #11298: aio gets EPERM when update-grub runs)
Actions #37

Updated by Ilya Dryomov about 8 years ago

Loic Dachary wrote:

journal aio to 165552576~80 got (8192) Unknown error 8192
meaning an attempt to write 80 bytes at offset 165552576 claims to have written 8192 bytes instead.

It's the other way around: 8192 is valid, while 165552576 and 80
(aio_info::off and aio_info::len) are invalid. It attempted to write
8192 bytes and succeeded, but didn't pass the ai->len check.

What I think happens is aio_info gets freed, turning event[i].obj into
a dangling pointer and making "if (event[i].res != ai->len)" the first
use-after-free. The question then is how aio_info gets freed. I'm not
familiar with this code, but from a quick look it seems that critical
sections are properly protected with aio_lock and, because aio_info is
embedded in the list entry, the same iocb can't be submitted twice.

Looking at check_aio_completion(), what would explain the evidence at
hand is a bug in the kernel, where the completion for the iocb somehow
gets delivered more than once. With a small tweak to fs/aio.c to make
it do exactly that I managed to reproduce this, down to the same bad
ai->len:

journal aio to 140381878795808~80 wrote 4096
journal aio to 139193344~80 wrote 4202496

There is at least one report of something similar going on on SO [1].
The answer is wrong - the kernel preserves both obj and data pointers
and both can be used for storing user context - but OP sounds credible
and describes exactly the issue I suspect we are having.

All of the occurrences (#9570, first half of #12100) are on ~3.13.*,
which is around the time fixes for a bunch of refactoring work were
landing into fs/aio.c. I haven't found anything of the above nature
though, so who knows - tracking this down for sure without a core dump
and a more verbose log isn't really feasible.

[1] http://stackoverflow.com/questions/29780028/io-getevents-returns-a-completion-event-twice

Actions #38

Updated by Shinobu Kinjo about 8 years ago

@Ilya

If you were to hit this kind of crash not often but sometimes which is quite annoying, what would you do to know what's exactly going on?

Actions #39

Updated by Ilya Dryomov about 8 years ago

I'd try with a newer kernel - as new as the environment allows. If it's still a problem, I'd bump in-memory (the number to the right of "/") debug level for journal to 30 and make sure I get a core dump when OSD hits that assert.

Actions #40

Updated by Loïc Dachary about 8 years ago

  • Subject changed from osd crash in FileJournal::WriteFinisher::entry() to osd crash in FileJournal::WriteFinisher::entry() aio
Actions #41

Updated by Loïc Dachary about 8 years ago

  • Description updated (diff)
  • Status changed from New to Need More Info
Actions #42

Updated by Samuel Just over 7 years ago

  • Status changed from Need More Info to Rejected

Seems to be a kernel bug.

Actions

Also available in: Atom PDF