Bug #12100

OSD crash, unexpected aio error in FileJournal.cc

Added by Daniel Schneller almost 9 years ago. Updated almost 7 years ago.

Status:
Rejected
Priority:
High
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This morning we had an OSD crash on us with the following trace:

2015-06-20 11:21:37.197601 7f3c3896e700 -1 journal aio to 1027784032~80 wrote 12288
2015-06-20 11:21:37.304686 7f3c3896e700 -1 os/FileJournal.cc: In function 'void FileJournal::write_finish_thread_entry()' thread 7f3c3896e700 time 2015-06-20 11:21:37.208575
os/FileJournal.cc: 1426: FAILED assert(0 == "unexpected aio error")

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc271b]
 2: (FileJournal::write_finish_thread_entry()+0x685) [0xa7e755]
 3: (FileJournal::WriteFinisher::entry()+0xd) [0x92bedd]
 4: (()+0x8182) [0x7f3c45058182]
 5: (clone()+0x6d) [0x7f3c435c3fbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The full log is available at [[https://public.centerdevice.de/e42a98af-69ef-4330-b36c-5858889ba566]]
Attaching it here failed repeatedly.

After that, it recovered:

2015-06-20 11:30:49.799204 7f8c6791d900  0 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-osd, pid 32519
2015-06-20 11:30:49.992485 7f8c6791d900  0 filestore(/var/lib/ceph/osd/ceph-11) backend xfs (magic 0x58465342)
2015-06-20 11:30:49.994221 7f8c6791d900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is supported and appears to work
2015-06-20 11:30:49.994237 7f8c6791d900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-06-20 11:30:50.021112 7f8c6791d900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2015-06-20 11:30:50.021273 7f8c6791d900  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature: extsize is supported and kernel 3.13.0-27-generic >= 3.5
2015-06-20 11:30:50.500414 7f8c6791d900  0 filestore(/var/lib/ceph/osd/ceph-11) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-06-20 11:30:51.601678 7f8c6791d900  1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 20: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-06-20 11:30:51.651860 7f8c6791d900  1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 20: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-06-20 11:30:51.741162 7f8c6791d900  0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2015-06-20 11:30:51.893813 7f8c6791d900  0 osd.11 36845 crush map has features 2200130813952, adjusting msgr requires for clients
2015-06-20 11:30:51.893841 7f8c6791d900  0 osd.11 36845 crush map has features 2200130813952 was 8705, adjusting msgr requires for mons
2015-06-20 11:30:51.893849 7f8c6791d900  0 osd.11 36845 crush map has features 2200130813952, adjusting msgr requires for osds
2015-06-20 11:30:51.893876 7f8c6791d900  0 osd.11 36845 load_pgs
2015-06-20 11:31:19.324981 7f8c6791d900  0 osd.11 36845 load_pgs opened 913 pgs
2015-06-20 11:31:19.331751 7f8c6791d900 -1 osd.11 36845 log_to_monitors {default=true}
2015-06-20 11:31:19.360728 7f8c443c4700  0 osd.11 36845 ignoring osdmap until we have initialized
2015-06-20 11:31:19.360796 7f8c443c4700  0 osd.11 36845 ignoring osdmap until we have initialized
2015-06-20 11:31:20.013535 7f8c6791d900  0 osd.11 36845 done with init, starting boot process

We could not find anything else relevant in our logs.
The disk seems to be fine, at least according to MegaCLI.
The server is a Dell PowerEdge R510.

We found (at least superficially) similar-looking issues here: [[http://tracker.ceph.com/issues/11298]] and here: [[http://tracker.ceph.com/issues/9570]]

This was the first time we saw an issue like this.


Related issues

Related to Ceph - Bug #9570: osd crash in FileJournal::WriteFinisher::entry() aio Rejected 09/22/2014

History

#1 Updated by Daniel Schneller almost 9 years ago

# uname -a
Linux node01 3.13.0-27-generic #50-Ubuntu SMP Thu May 15 18:06:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

#2 Updated by Daniel Schneller over 8 years ago

Happened again today. Different OSD, some more information this time.

2015-06-30 15:00:25.228612 7fc2e87c0700 -1 journal aio to 1042503728~80 wrote 12288
2015-06-30 15:00:25.345467 7fc2e87c0700 -1 os/FileJournal.cc: In function 'void FileJournal::write_finish_thread_entry()' thread 7fc2e87c0700 time 2015-06-30 15:00:25.237046
os/FileJournal.cc: 1426: FAILED assert(0 == "unexpected aio error")

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc271b]
 2: (FileJournal::write_finish_thread_entry()+0x685) [0xa7e755]
 3: (FileJournal::WriteFinisher::entry()+0xd) [0x92bedd]
 4: (()+0x8182) [0x7fc2f4f9b182]
 5: (clone()+0x6d) [0x7fc2f3506fbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Two things strike me as odd/interesting:

Observation 1

The log line right before the assertion says journal aio to 1042503728~80 wrote 12288 this time, and journal aio to 1027784032~80 wrote 12288 the first time it happened 10 days ago (see the original log extract above).
Notice the exact same 12288 at the end.

Not having looked into the code, this seems to be a number of bytes? Hitting the exact same count twice, across different OSDs on different nodes, might be indicative of something?
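
For reference, the 1027784032~80 part reads like Ceph's usual offset~length shorthand, and the trailing number appears to be what the kernel reported back for the completed aio. Judging by the backtrace and the assert text, the check fires in the libaio completion path of FileJournal::write_finish_thread_entry() when those two disagree. A rough sketch of that pattern follows; it is paraphrased from the log output, not the verbatim os/FileJournal.cc source, and the struct and function names here are made up:

#include <libaio.h>
#include <cassert>
#include <cstdint>
#include <iostream>

// Hypothetical bookkeeping for one submitted journal aio.
struct aio_info {
  struct iocb iocb;  // control block handed to io_submit(); must stay first
  uint64_t off;      // journal offset the write was aimed at
  uint64_t len;      // number of bytes submitted
};

void reap_journal_completions(io_context_t ctx) {
  io_event events[16];
  int r = io_getevents(ctx, 1, 16, events, nullptr);
  for (int i = 0; i < r; i++) {
    aio_info *ai = reinterpret_cast<aio_info *>(events[i].obj);
    // events[i].res holds the bytes transferred on success, or a negative
    // errno on failure -- and the field is unsigned, so an error prints as
    // a huge number. Any disagreement with the expected count is fatal.
    if (events[i].res != ai->len) {
      std::cerr << "journal aio to " << ai->off << "~" << ai->len
                << " wrote " << events[i].res << std::endl;
      assert(0 == "unexpected aio error");
    }
  }
}

For what it's worth, 12288 is exactly three 4096-byte journal blocks (the block size reported by journal _open above), so a byte count seems right.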

After the dump of the last 10,000 events, the process apparently received another signal and logged another 124 even more recent events. No idea where that came from:

2015-06-30 15:00:28.121171 7fc2e87c0700 -1 *** Caught signal (Aborted) **
 in thread 7fc2e87c0700

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-osd() [0xacaf4a]
 2: (()+0x10340) [0x7fc2f4fa3340]
 3: (gsignal()+0x39) [0x7fc2f3442bb9]
 4: (abort()+0x148) [0x7fc2f3445fc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fc2f3d4e6b5]
 6: (()+0x5e836) [0x7fc2f3d4c836]
 7: (()+0x5e863) [0x7fc2f3d4c863]
 8: (()+0x5eaa2) [0x7fc2f3d4caa2]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xbc2908]
 10: (FileJournal::write_finish_thread_entry()+0x685) [0xa7e755]
 11: (FileJournal::WriteFinisher::entry()+0xd) [0x92bedd]
 12: (()+0x8182) [0x7fc2f4f9b182]
 13: (clone()+0x6d) [0x7fc2f3506fbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
  -124> 2015-06-30 15:00:25.555041 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.14 10.102.5.12:0/31481 1103257 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:25.551124) v2 ==== 47+0+0 (3941930089 0 0) 0x4ee3f200 con 0x524a9c80
  -123> 2015-06-30 15:00:25.555115 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.5.12:0/31481 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:25.551124) v2 -- ?+0 0x5820a000 con 0x524a9c80
  -122> 2015-06-30 15:00:25.555195 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.14 10.102.5.12:0/31481 1103257 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:25.551124) v2 ==== 47+0+0 (3941930089 0 0) 0x4cfa2600 con 0x52510000
  -121> 2015-06-30 15:00:25.555231 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.5.12:0/31481 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:25.551124) v2 -- ?+0 0x55148e00 con 0x52510000
  -120> 2015-06-30 15:00:25.564608 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.22 10.102.4.12:0/8014 1102542 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:25.561402) v2 ==== 47+0+0 (3877965095 0 0) 0x55f1d600 con 0x158186e0
  -119> 2015-06-30 15:00:25.564641 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.4.12:0/8014 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:25.561402) v2 -- ?+0 0x4cfa2600 con 0x158186e0
  -118> 2015-06-30 15:00:25.564742 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.22 10.102.4.12:0/8014 1102542 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:25.561402) v2 ==== 47+0+0 (3877965095 0 0) 0x3d89e200 con 0x15818b00
  -117> 2015-06-30 15:00:25.564778 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.4.12:0/8014 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:25.561402) v2 -- ?+0 0x4ee3f200 con 0x15818b00
  -116> 2015-06-30 15:00:25.635280 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.23 10.102.4.12:0/18231 1102803 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:25.632257) v2 ==== 47+0+0 (4031570926 0 0) 0x57252600 con 0x51eeac60
  -115> 2015-06-30 15:00:25.635318 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.4.12:0/18231 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:25.632257) v2 -- ?+0 0x55f1d600 con 0x51eeac60
  -114> 2015-06-30 15:00:25.635334 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.23 10.102.4.12:0/18231 1102803 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:25.632257) v2 ==== 47+0+0 (4031570926 0 0) 0x5419c200 con 0x51db4aa0
  -113> 2015-06-30 15:00:25.635376 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.4.12:0/18231 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:25.632257) v2 -- ?+0 0x3d89e200 con 0x51db4aa0
  -112> 2015-06-30 15:00:25.779476 7fc2d5585700  1 -- 10.102.4.14:6807/32098 <== mon.2 10.102.4.13:6789/0 632210 ==== pg_stats_ack(1 pgs tid 606088) v1 ==== 33+0+0 (3895268427 0 0) 0x53bdac00 con 0x4f02b4a0
  -111> 2015-06-30 15:00:25.854132 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.39 10.102.5.13:0/13618 1103262 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:25.845042) v2 ==== 47+0+0 (1592053422 0 0) 0x593b2800 con 0x520846e0
  -110> 2015-06-30 15:00:25.854131 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.39 10.102.5.13:0/13618 1103262 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:25.845042) v2 ==== 47+0+0 (1592053422 0 0) 0x56083a00 con 0x51e2b440
  -109> 2015-06-30 15:00:25.854182 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.5.13:0/13618 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:25.845042) v2 -- ?+0 0x57252600 con 0x520846e0
  -108> 2015-06-30 15:00:25.854239 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.5.13:0/13618 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:25.845042) v2 -- ?+0 0x5419c200 con 0x51e2b440
  -107> 2015-06-30 15:00:25.862097 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.2 10.102.4.11:0/21255 1102605 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:25.853715) v2 ==== 47+0+0 (1283020748 0 0) 0x56a1dc00 con 0x5504c9a0
  -106> 2015-06-30 15:00:25.862129 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.4.11:0/21255 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:25.853715) v2 -- ?+0 0x56083a00 con 0x5504c9a0
  -105> 2015-06-30 15:00:25.862314 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.2 10.102.4.11:0/21255 1102605 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:25.853715) v2 ==== 47+0+0 (1283020748 0 0) 0x57178800 con 0x526fe100
  -104> 2015-06-30 15:00:25.862343 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.4.11:0/21255 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:25.853715) v2 -- ?+0 0x593b2800 con 0x526fe100
  -103> 2015-06-30 15:00:25.873169 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.42 10.102.4.13:0/19830 1102478 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:25.864177) v2 ==== 47+0+0 (3498019742 0 0) 0x530bca00 con 0x52084580
  -102> 2015-06-30 15:00:25.873200 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.4.13:0/19830 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:25.864177) v2 -- ?+0 0x57178800 con 0x52084580
  -101> 2015-06-30 15:00:25.873362 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.42 10.102.4.13:0/19830 1102478 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:25.864177) v2 ==== 47+0+0 (3498019742 0 0) 0x564a3e00 con 0x353f22c0
  -100> 2015-06-30 15:00:25.873390 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.4.13:0/19830 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:25.864177) v2 -- ?+0 0x56a1dc00 con 0x353f22c0
   -99> 2015-06-30 15:00:26.036891 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.4 10.102.5.11:0/10857 1102577 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:26.028849) v2 ==== 47+0+0 (3878660794 0 0) 0x1da81400 con 0x51db4ec0
   -98> 2015-06-30 15:00:26.036930 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.5.11:0/10857 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:26.028849) v2 -- ?+0 0x564a3e00 con 0x51db4ec0
   -97> 2015-06-30 15:00:26.037674 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.4 10.102.5.11:0/10857 1102577 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:26.028849) v2 ==== 47+0+0 (3878660794 0 0) 0x13a3d600 con 0x51eed440
   -96> 2015-06-30 15:00:26.037708 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.5.11:0/10857 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:26.028849) v2 -- ?+0 0x530bca00 con 0x51eed440
   -95> 2015-06-30 15:00:26.081789 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.41 10.102.4.13:0/8812 1102662 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:26.073682) v2 ==== 47+0+0 (2034749784 0 0) 0x4e977000 con 0x526ffde0
   -94> 2015-06-30 15:00:26.081825 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.4.13:0/8812 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:26.073682) v2 -- ?+0 0x13a3d600 con 0x526ffde0
   -93> 2015-06-30 15:00:26.081879 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.41 10.102.4.13:0/8812 1102662 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:26.073682) v2 ==== 47+0+0 (2034749784 0 0) 0x565e7200 con 0x54e6a160
   -92> 2015-06-30 15:00:26.081910 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.4.13:0/8812 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:26.073682) v2 -- ?+0 0x1da81400 con 0x54e6a160
   -91> 2015-06-30 15:00:26.200344 7fc2eb878700  5 osd.24 36849 tick
   -90> 2015-06-30 15:00:26.305506 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.47 10.102.4.13:0/2957 1103463 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:26.297407) v2 ==== 47+0+0 (2473367057 0 0) 0x58703e00 con 0x51ec0aa0
   -89> 2015-06-30 15:00:26.305541 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.4.13:0/2957 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:26.297407) v2 -- ?+0 0x4e977000 con 0x51ec0aa0
   -88> 2015-06-30 15:00:26.305594 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.47 10.102.4.13:0/2957 1103463 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:26.297407) v2 ==== 47+0+0 (2473367057 0 0) 0x3faeae00 con 0x15819b80
   -87> 2015-06-30 15:00:26.305626 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.4.13:0/2957 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:26.297407) v2 -- ?+0 0x565e7200 con 0x15819b80
   -86> 2015-06-30 15:00:26.465075 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.5 10.102.5.11:0/22563 1102252 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:26.456909) v2 ==== 47+0+0 (3607192615 0 0) 0x1e622400 con 0x526fe520
   -85> 2015-06-30 15:00:26.465110 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.5.11:0/22563 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:26.456909) v2 -- ?+0 0x58703e00 con 0x526fe520
   -84> 2015-06-30 15:00:26.465353 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.5 10.102.5.11:0/22563 1102252 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:26.456909) v2 ==== 47+0+0 (3607192615 0 0) 0x56a1ac00 con 0x54e76680
   -83> 2015-06-30 15:00:26.465395 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.5.11:0/22563 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:26.456909) v2 -- ?+0 0x3faeae00 con 0x54e76680
   -82> 2015-06-30 15:00:26.539089 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.13 10.102.5.12:0/20242 1104047 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:26.535966) v2 ==== 47+0+0 (659808342 0 0) 0x45bee200 con 0x52d4c160
   -81> 2015-06-30 15:00:26.539128 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.5.12:0/20242 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:26.535966) v2 -- ?+0 0x56a1ac00 con 0x52d4c160
   -80> 2015-06-30 15:00:26.539117 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.13 10.102.5.12:0/20242 1104047 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:26.535966) v2 ==== 47+0+0 (659808342 0 0) 0x58103c00 con 0x15819e40
   -79> 2015-06-30 15:00:26.539236 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.5.12:0/20242 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:26.535966) v2 -- ?+0 0x1e622400 con 0x15819e40
   -78> 2015-06-30 15:00:26.674646 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.37 10.102.4.13:0/13636 1102858 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:26.664354) v2 ==== 47+0+0 (3291292117 0 0) 0x571c4a00 con 0x52084840
   -77> 2015-06-30 15:00:26.674676 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.37 10.102.4.13:0/13636 1102858 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:26.664354) v2 ==== 47+0+0 (3291292117 0 0) 0x57d64600 con 0x353f29a0
   -76> 2015-06-30 15:00:26.674706 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.4.13:0/13636 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:26.664354) v2 -- ?+0 0x58103c00 con 0x52084840
   -75> 2015-06-30 15:00:26.674990 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.4.13:0/13636 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:26.664354) v2 -- ?+0 0x45bee200 con 0x353f29a0
   -74> 2015-06-30 15:00:26.748251 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.43 10.102.4.13:0/1069 1103624 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:26.740158) v2 ==== 47+0+0 (2278595253 0 0) 0x55147c00 con 0x4550100
   -73> 2015-06-30 15:00:26.748293 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.4.13:0/1069 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:26.740158) v2 -- ?+0 0x571c4a00 con 0x4550100
   -72> 2015-06-30 15:00:26.748299 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.43 10.102.4.13:0/1069 1103624 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:26.740158) v2 ==== 47+0+0 (2278595253 0 0) 0x58208200 con 0x52d4c580
   -71> 2015-06-30 15:00:26.748376 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.4.13:0/1069 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:26.740158) v2 -- ?+0 0x57d64600 con 0x52d4c580
   -70> 2015-06-30 15:00:26.807036 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.7 10.102.5.11:0/4523 1102693 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:26.798505) v2 ==== 47+0+0 (1239403865 0 0) 0x52356800 con 0x51eea580
   -69> 2015-06-30 15:00:26.807077 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.5.11:0/4523 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:26.798505) v2 -- ?+0 0x55147c00 con 0x51eea580
   -68> 2015-06-30 15:00:26.807087 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.7 10.102.5.11:0/4523 1102693 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:26.798505) v2 ==== 47+0+0 (1239403865 0 0) 0x585c6800 con 0x51db5de0
   -67> 2015-06-30 15:00:26.807119 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.5.11:0/4523 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:26.798505) v2 -- ?+0 0x58208200 con 0x51db5de0
   -66> 2015-06-30 15:00:27.042911 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.13 10.102.5.12:0/20242 1104048 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.040475) v2 ==== 47+0+0 (4072290317 0 0) 0x56db8a00 con 0x52d4c160
   -65> 2015-06-30 15:00:27.042907 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.13 10.102.5.12:0/20242 1104048 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.040475) v2 ==== 47+0+0 (4072290317 0 0) 0x436f2200 con 0x15819e40
   -64> 2015-06-30 15:00:27.042965 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.5.12:0/20242 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.040475) v2 -- ?+0 0x585c6800 con 0x52d4c160
   -63> 2015-06-30 15:00:27.043012 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.5.12:0/20242 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.040475) v2 -- ?+0 0x52356800 con 0x15819e40
   -62> 2015-06-30 15:00:27.072978 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.36 10.102.4.13:0/32024 1103765 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.065106) v2 ==== 47+0+0 (1736692310 0 0) 0x56c87000 con 0x524a9de0
   -61> 2015-06-30 15:00:27.073009 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.4.13:0/32024 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.065106) v2 -- ?+0 0x56db8a00 con 0x524a9de0
   -60> 2015-06-30 15:00:27.072999 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.36 10.102.4.13:0/32024 1103765 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.065106) v2 ==== 47+0+0 (1736692310 0 0) 0x56f14e00 con 0x525102c0
   -59> 2015-06-30 15:00:27.073069 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.4.13:0/32024 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.065106) v2 -- ?+0 0x436f2200 con 0x525102c0
   -58> 2015-06-30 15:00:27.109971 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.6 10.102.4.11:0/32243 1102600 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.101757) v2 ==== 47+0+0 (1381236343 0 0) 0x5873de00 con 0x542962c0
   -57> 2015-06-30 15:00:27.110011 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.4.11:0/32243 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.101757) v2 -- ?+0 0x56c87000 con 0x542962c0
   -56> 2015-06-30 15:00:27.110150 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.6 10.102.4.11:0/32243 1102600 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.101757) v2 ==== 47+0+0 (1381236343 0 0) 0x3ebb2800 con 0x51eea840
   -55> 2015-06-30 15:00:27.110183 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.4.11:0/32243 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.101757) v2 -- ?+0 0x56f14e00 con 0x51eea840
   -54> 2015-06-30 15:00:27.172897 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.45 10.102.4.13:0/6834 1103296 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.164865) v2 ==== 47+0+0 (250551736 0 0) 0x546afc00 con 0x52992000
   -53> 2015-06-30 15:00:27.172920 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.45 10.102.4.13:0/6834 1103296 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.164865) v2 ==== 47+0+0 (250551736 0 0) 0xfd6da00 con 0x5298a000
   -52> 2015-06-30 15:00:27.172935 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.4.13:0/6834 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.164865) v2 -- ?+0 0x5873de00 con 0x52992000
   -51> 2015-06-30 15:00:27.173001 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.4.13:0/6834 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.164865) v2 -- ?+0 0x3ebb2800 con 0x5298a000
   -50> 2015-06-30 15:00:27.202267 7fc2eb878700  5 osd.24 36849 tick
   -49> 2015-06-30 15:00:27.316573 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.1 10.102.4.11:0/12300 1102698 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.307806) v2 ==== 47+0+0 (1258189517 0 0) 0x52777000 con 0x52636420
   -48> 2015-06-30 15:00:27.316601 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.4.11:0/12300 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.307806) v2 -- ?+0 0xfd6da00 con 0x52636420
   -47> 2015-06-30 15:00:27.316657 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.1 10.102.4.11:0/12300 1102698 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.307806) v2 ==== 47+0+0 (1258189517 0 0) 0x58d12600 con 0x52636c60
   -46> 2015-06-30 15:00:27.316686 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.4.11:0/12300 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.307806) v2 -- ?+0 0x546afc00 con 0x52636c60
   -45> 2015-06-30 15:00:27.332318 7fc2c0655700  1 -- 10.102.5.14:6808/32098 <== osd.14 10.102.5.12:6820/31481 782509 ==== osd_repop(client.139494231.0:6444256 19.464 5cf11c64/rbd_data.3f700e2395fa2b5.0000000000000100/head//19 v 36849'3610124) v1 ==== 923+0+4878 (1234781237 0 255592221) 0x5a485800 con 0x51d2c420
   -44> 2015-06-30 15:00:27.332594 7fc2c0655700  5 -- op tracker -- seq: 33158056, time: 2015-06-30 15:00:27.329278, event: header_read, op: osd_repop(client.139494231.0:6444256 19.464 5cf11c64/rbd_data.3f700e2395fa2b5.0000000000000100/head//19 v 36849'3610124)
   -43> 2015-06-30 15:00:27.332609 7fc2c0655700  5 -- op tracker -- seq: 33158056, time: 2015-06-30 15:00:27.329280, event: throttled, op: osd_repop(client.139494231.0:6444256 19.464 5cf11c64/rbd_data.3f700e2395fa2b5.0000000000000100/head//19 v 36849'3610124)
   -42> 2015-06-30 15:00:27.332615 7fc2c0655700  5 -- op tracker -- seq: 33158056, time: 2015-06-30 15:00:27.332306, event: all_read, op: osd_repop(client.139494231.0:6444256 19.464 5cf11c64/rbd_data.3f700e2395fa2b5.0000000000000100/head//19 v 36849'3610124)
   -41> 2015-06-30 15:00:27.332621 7fc2c0655700  5 -- op tracker -- seq: 33158056, time: 0.000000, event: dispatched, op: osd_repop(client.139494231.0:6444256 19.464 5cf11c64/rbd_data.3f700e2395fa2b5.0000000000000100/head//19 v 36849'3610124)
   -40> 2015-06-30 15:00:27.332739 7fc2c5d66700  5 -- op tracker -- seq: 33158056, time: 2015-06-30 15:00:27.332739, event: reached_pg, op: osd_repop(client.139494231.0:6444256 19.464 5cf11c64/rbd_data.3f700e2395fa2b5.0000000000000100/head//19 v 36849'3610124)
   -39> 2015-06-30 15:00:27.332769 7fc2c5d66700  5 -- op tracker -- seq: 33158056, time: 2015-06-30 15:00:27.332769, event: started, op: osd_repop(client.139494231.0:6444256 19.464 5cf11c64/rbd_data.3f700e2395fa2b5.0000000000000100/head//19 v 36849'3610124)
   -38> 2015-06-30 15:00:27.334631 7fc2c5d66700  5 -- op tracker -- seq: 33158056, time: 2015-06-30 15:00:27.334631, event: started, op: osd_repop(client.139494231.0:6444256 19.464 5cf11c64/rbd_data.3f700e2395fa2b5.0000000000000100/head//19 v 36849'3610124)
   -37> 2015-06-30 15:00:27.335391 7fc2c5d66700  5 -- op tracker -- seq: 33158056, time: 2015-06-30 15:00:27.335391, event: commit_queued_for_journal_write, op: osd_repop(client.139494231.0:6444256 19.464 5cf11c64/rbd_data.3f700e2395fa2b5.0000000000000100/head//19 v 36849'3610124)
   -36> 2015-06-30 15:00:27.426859 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.18 10.102.5.12:0/28245 1103605 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.423470) v2 ==== 47+0+0 (2453124798 0 0) 0x582bf200 con 0x454ffa0
   -35> 2015-06-30 15:00:27.426890 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.5.12:0/28245 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.423470) v2 -- ?+0 0x52777000 con 0x454ffa0
   -34> 2015-06-30 15:00:27.426954 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.18 10.102.5.12:0/28245 1103605 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.423470) v2 ==== 47+0+0 (2453124798 0 0) 0x1a9b2a00 con 0x52d4c6e0
   -33> 2015-06-30 15:00:27.427007 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.5.12:0/28245 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.423470) v2 -- ?+0 0x58d12600 con 0x52d4c6e0
   -32> 2015-06-30 15:00:27.579890 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.15 10.102.4.12:0/19598 1103583 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.576966) v2 ==== 47+0+0 (1435394218 0 0) 0x47a1ba00 con 0x51ebfe40
   -31> 2015-06-30 15:00:27.579925 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.4.12:0/19598 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.576966) v2 -- ?+0 0x582bf200 con 0x51ebfe40
   -30> 2015-06-30 15:00:27.580138 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.15 10.102.4.12:0/19598 1103583 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.576966) v2 ==== 47+0+0 (1435394218 0 0) 0x3e415e00 con 0x51e2a7e0
   -29> 2015-06-30 15:00:27.580174 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.4.12:0/19598 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.576966) v2 -- ?+0 0x1a9b2a00 con 0x51e2a7e0
   -28> 2015-06-30 15:00:27.588770 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.9 10.102.5.11:0/18201 1103886 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.580764) v2 ==== 47+0+0 (4126241442 0 0) 0x5615fe00 con 0x4550260
   -27> 2015-06-30 15:00:27.588811 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.5.11:0/18201 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.580764) v2 -- ?+0 0x47a1ba00 con 0x4550260
   -26> 2015-06-30 15:00:27.588890 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.9 10.102.5.11:0/18201 1103886 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.580764) v2 ==== 47+0+0 (4126241442 0 0) 0x556a4000 con 0x52d4c420
   -25> 2015-06-30 15:00:27.588919 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.5.11:0/18201 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.580764) v2 -- ?+0 0x3e415e00 con 0x52d4c420
   -24> 2015-06-30 15:00:27.740138 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.38 10.102.4.13:0/19645 1102896 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.731658) v2 ==== 47+0+0 (3569977930 0 0) 0x5739c600 con 0x54296840
   -23> 2015-06-30 15:00:27.740178 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.4.13:0/19645 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.731658) v2 -- ?+0 0x556a4000 con 0x54296840
   -22> 2015-06-30 15:00:27.740293 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.38 10.102.4.13:0/19645 1102896 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.731658) v2 ==== 47+0+0 (3569977930 0 0) 0x5e446400 con 0x52084000
   -21> 2015-06-30 15:00:27.740318 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.4.13:0/19645 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.731658) v2 -- ?+0 0x5615fe00 con 0x52084000
   -20> 2015-06-30 15:00:27.911756 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.7 10.102.5.11:0/4523 1102694 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.903666) v2 ==== 47+0+0 (1748142540 0 0) 0x21d17400 con 0x51eea580
   -19> 2015-06-30 15:00:27.911758 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.7 10.102.5.11:0/4523 1102694 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.903666) v2 ==== 47+0+0 (1748142540 0 0) 0x546afe00 con 0x51db5de0
   -18> 2015-06-30 15:00:27.911785 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.5.11:0/4523 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.903666) v2 -- ?+0 0x5e446400 con 0x51eea580
   -17> 2015-06-30 15:00:27.911837 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.5.11:0/4523 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.903666) v2 -- ?+0 0x5739c600 con 0x51db5de0
   -16> 2015-06-30 15:00:27.937705 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.19 10.102.4.12:0/2799 1102151 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.934576) v2 ==== 47+0+0 (136028817 0 0) 0x47ed5c00 con 0x5504c840
   -15> 2015-06-30 15:00:27.937736 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.4.12:0/2799 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.934576) v2 -- ?+0 0x546afe00 con 0x5504c840
   -14> 2015-06-30 15:00:27.937915 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.19 10.102.4.12:0/2799 1102151 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.934576) v2 ==== 47+0+0 (136028817 0 0) 0x579bfe00 con 0x526fd600
   -13> 2015-06-30 15:00:27.937948 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.4.12:0/2799 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.934576) v2 -- ?+0 0x21d17400 con 0x526fd600
   -12> 2015-06-30 15:00:27.981739 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.16 10.102.5.12:0/4725 1101706 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.977591) v2 ==== 47+0+0 (3699649071 0 0) 0x565fb800 con 0x54e742c0
   -11> 2015-06-30 15:00:27.981740 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.16 10.102.5.12:0/4725 1101706 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:27.977591) v2 ==== 47+0+0 (3699649071 0 0) 0x53561c00 con 0x526fc2c0
   -10> 2015-06-30 15:00:27.981772 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.5.12:0/4725 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.977591) v2 -- ?+0 0x47ed5c00 con 0x54e742c0
    -9> 2015-06-30 15:00:27.981829 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.5.12:0/4725 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:27.977591) v2 -- ?+0 0x579bfe00 con 0x526fc2c0
    -8> 2015-06-30 15:00:28.012088 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.47 10.102.4.13:0/2957 1103464 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:28.003538) v2 ==== 47+0+0 (862050394 0 0) 0x53292800 con 0x51ec0aa0
    -7> 2015-06-30 15:00:28.012087 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.47 10.102.4.13:0/2957 1103464 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:28.003538) v2 ==== 47+0+0 (862050394 0 0) 0x55773000 con 0x15819b80
    -6> 2015-06-30 15:00:28.012137 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.4.13:0/2957 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:28.003538) v2 -- ?+0 0x53561c00 con 0x51ec0aa0
    -5> 2015-06-30 15:00:28.012225 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.4.13:0/2957 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:28.003538) v2 -- ?+0 0x565fb800 con 0x15819b80
    -4> 2015-06-30 15:00:28.048820 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 <== osd.3 10.102.4.11:0/19477 1103430 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:28.041131) v2 ==== 47+0+0 (3377286115 0 0) 0x574d6200 con 0x158182c0
    -3> 2015-06-30 15:00:28.048859 7fc2cfd7a700  1 -- 10.102.5.14:6809/32098 --> 10.102.4.11:0/19477 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:28.041131) v2 -- ?+0 0x53292800 con 0x158182c0
    -2> 2015-06-30 15:00:28.048858 7fc2d157d700  1 -- 10.102.4.14:6808/32098 <== osd.3 10.102.4.11:0/19477 1103430 ==== osd_ping(ping e36849 stamp 2015-06-30 15:00:28.041131) v2 ==== 47+0+0 (3377286115 0 0) 0x57195c00 con 0x4551020
    -1> 2015-06-30 15:00:28.048903 7fc2d157d700  1 -- 10.102.4.14:6808/32098 --> 10.102.4.11:0/19477 -- osd_ping(ping_reply e36849 stamp 2015-06-30 15:00:28.041131) v2 -- ?+0 0x55773000 con 0x4551020
     0> 2015-06-30 15:00:28.121171 7fc2e87c0700 -1 *** Caught signal (Aborted) **
 in thread 7fc2e87c0700

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-osd() [0xacaf4a]
 2: (()+0x10340) [0x7fc2f4fa3340]
 3: (gsignal()+0x39) [0x7fc2f3442bb9]
 4: (abort()+0x148) [0x7fc2f3445fc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fc2f3d4e6b5]
 6: (()+0x5e836) [0x7fc2f3d4c836]
 7: (()+0x5e863) [0x7fc2f3d4c863]
 8: (()+0x5eaa2) [0x7fc2f3d4caa2]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xbc2908]
 10: (FileJournal::write_finish_thread_entry()+0x685) [0xa7e755]
 11: (FileJournal::WriteFinisher::entry()+0xd) [0x92bedd]
 12: (()+0x8182) [0x7fc2f4f9b182]
 13: (clone()+0x6d) [0x7fc2f3506fbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
  ...
  99/99 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.24.log
--- end dump of recent events ---
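
A plausible reading of the double dump (an assumption about the mechanics, not something confirmed from this log alone): the failed assert dumps the recent-event buffer and then calls abort(); ceph-osd also installs a SIGABRT handler (hence the "Caught signal (Aborted)" line above), so the abort itself is caught, logged with its own backtrace, and the buffer, by then holding the 124 events that arrived in the meantime, is flushed a second time. A toy illustration of that two-dump pattern, with invented function names:

#include <csignal>
#include <cstdio>
#include <cstdlib>

// Stand-in for the in-memory ring buffer of recent log events.
static void dump_recent_events(const char *reason) {
  std::printf("--- begin dump of recent events (%s) ---\n", reason);
  std::printf("--- end dump of recent events ---\n");
}

// Hypothetical SIGABRT handler, like the one ceph-osd installs.
static void on_abort(int) {
  // printf is not async-signal-safe; this is purely illustrative.
  std::printf("*** Caught signal (Aborted) ***\n");
  dump_recent_events("second dump: whatever arrived since the assert");
  std::signal(SIGABRT, SIG_DFL);
  std::raise(SIGABRT);  // terminate for real this time
}

// Hypothetical assert-failure path: dump, then abort().
static void assert_fail(const char *msg) {
  std::printf("FAILED assert(%s)\n", msg);
  dump_recent_events("first dump: the last 10,000 events");
  std::abort();  // raises SIGABRT, which lands in on_abort()
}

int main() {
  std::signal(SIGABRT, on_abort);
  assert_fail("0 == \"unexpected aio error\"");
}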

Observation 2

The process was still present (note the D, uninterruptible sleep, state in the ps output below), so Ubuntu's Upstart did not see a problem and hence did not respawn the daemon. This is really troubling, because if that happened overnight, the OSD would remain down until someone logged in and restarted it manually. Luckily this time it happened during office hours, so I was able to react very quickly.

See my (cleaned up) terminal history:

[B|root@node04]  ~ ?  ps aux | grep osd
...
root     32098  6.9  1.1 3157180 1565936 ?     Dsl  May20 4079:33 /usr/bin/ceph-osd --cluster=ceph -i 24 -f

[B|root@node04]  ~ ?  stop ceph-osd id=24
ceph-osd stop/waiting

[B|root@node04]  ~ ?  ps aux | grep 32098

[B|root@node04]  ~ ?  start ceph-osd id=24
ceph-osd (ceph/24) start/running, process 27019

Right after that, the log shows regular startup messages and everything goes back to normal:

2015-06-30 15:08:29.792883 7f08e63b3900  0 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff), process ceph-osd, pid 27019
2015-06-30 15:08:30.039684 7f08e63b3900  0 filestore(/var/lib/ceph/osd/ceph-24) backend xfs (magic 0x58465342)
2015-06-30 15:08:30.041733 7f08e63b3900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-24) detect_features: FIEMAP ioctl is supported and appears to work
2015-06-30 15:08:30.041750 7f08e63b3900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-24) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-06-30 15:08:30.123697 7f08e63b3900  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-24) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2015-06-30 15:08:30.124411 7f08e63b3900  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-24) detect_feature: extsize is supported and kernel 3.13.0-27-generic >= 3.5
2015-06-30 15:08:30.555434 7f08e63b3900  0 filestore(/var/lib/ceph/osd/ceph-24) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-06-30 15:08:31.506477 7f08e63b3900  1 journal _open /var/lib/ceph/osd/ceph-24/journal fd 20: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-06-30 15:08:31.540422 7f08e63b3900  1 journal _open /var/lib/ceph/osd/ceph-24/journal fd 20: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-06-30 15:08:31.747287 7f08e63b3900  0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2015-06-30 15:08:31.780969 7f08e63b3900  0 osd.24 36849 crush map has features 2200130813952, adjusting msgr requires for clients
2015-06-30 15:08:31.781027 7f08e63b3900  0 osd.24 36849 crush map has features 2200130813952 was 8705, adjusting msgr requires for mons
2015-06-30 15:08:31.781035 7f08e63b3900  0 osd.24 36849 crush map has features 2200130813952, adjusting msgr requires for osds
2015-06-30 15:08:31.781064 7f08e63b3900  0 osd.24 36849 load_pgs
2015-06-30 15:08:54.578527 7f08e63b3900  0 osd.24 36849 load_pgs opened 928 pgs
2015-06-30 15:08:54.583080 7f08e63b3900 -1 osd.24 36849 log_to_monitors {default=true}
2015-06-30 15:08:54.589659 7f08c172f700  0 osd.24 36849 ignoring osdmap until we have initialized
2015-06-30 15:08:54.589723 7f08c172f700  0 osd.24 36849 ignoring osdmap until we have initialized
2015-06-30 15:08:56.216715 7f08e63b3900  0 osd.24 36849 done with init, starting boot process

Full logs of both crashes are here: [[https://public.centerdevice.de/cacec857-c279-40ec-a92e-55f931fa468a]]

There is also a Bosun graph there, showing the 1-minute system load average with different CPU counters superimposed.

The 1-minute load average before the crash was between 1 and 10.

Around the moment the crash happens (15:00:25) the load spikes, going up to 500, while the CPU nice counters go up significantly, too. It stays that high until I restart the service.

At the moment of the restart (15:08:29) the nice counters drop back to 0 and the load tanks as well.

(Note that the graph's time axis is UTC, i.e. 2 hours behind the timestamps in the textual logs.)

I hope this can help track down the issue.

#3 Updated by Daniel Schneller over 8 years ago

And again today. Same issue. I see a pattern here. Any support would be appreciated, especially because the hanging processes do not trigger an automatic respawn.

#4 Updated by Pontus Lindgren over 8 years ago

I am also experiencing a similar issue:

2015-08-16 08:11:54.227567 7f13d68de700  0 log_channel(cluster) log [WRN] : 3 slow requests, 3 included below; oldest blocked for > 30.685081 secs
2015-08-16 08:11:54.227579 7f13d68de700 0 log_channel(cluster) log [WRN] : slow request 30.685081 seconds old, received at 2015-08-16 08:11:23.542417: osd_op(client.1109461.0:219374023 rbd_data.10e67e79e2a9e3.000000000001c201 [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 2592768~4096] 5.89587894 ack+ondisk+write e1804) currently waiting for subops from 1,30
2015-08-16 08:11:54.227587 7f13d68de700 0 log_channel(cluster) log [WRN] : slow request 30.682262 seconds old, received at 2015-08-16 08:11:23.545236: osd_repop(client.1109461.0:219374083 5.c63 d6b85c63/rbd_data.10e67e79e2a9e3.000000000001a800/head//5 v 1804'121436) currently started
2015-08-16 08:11:54.227592 7f13d68de700 0 log_channel(cluster) log [WRN] : slow request 30.641702 seconds old, received at 2015-08-16 08:11:23.585797: osd_repop(client.1935041.0:1302764 5.82a 4219482a/rbd_data.1d685c2eb141f2.0000000000003c5f/head//5 v 1804'265055) currently started
2015-08-16 08:11:55.227784 7f13d68de700 0 log_channel(cluster) log [WRN] : 4 slow requests, 1 included below; oldest blocked for > 31.685317 secs
2015-08-16 08:11:55.227808 7f13d68de700 0 log_channel(cluster) log [WRN] : slow request 30.788521 seconds old, received at 2015-08-16 08:11:24.439213: osd_repop(client.1224667.0:34531998 5.abe 2f457abe/rbd_data.12aacc79e2a9e3.0000000000001d9d/head//5 v 1804'27936) currently started
2015-08-16 08:11:56.075649 7f13d3d89700 -1 journal aio to 7994220544~8192 wrote 18446744073709551611
2015-08-16 08:11:56.091460 7f13d3d89700 -1 os/FileJournal.cc: In function 'void FileJournal::write_finish_thread_entry()' thread 7f13d3d89700 time 2015-08-16 08:11:56.076462
os/FileJournal.cc: 1426: FAILED assert(0 == "unexpected aio error")
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xcdb572]
2: (FileJournal::write_finish_thread_entry()+0x847) [0xb9a437]
3: (FileJournal::WriteFinisher::entry()+0xd) [0xa3befd]
4: (()+0x6b50) [0x7f13de90ab50]
5: (clone()+0x6d) [0x7f13dd32695d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#5 Updated by Shinobu Kinjo over 8 years ago

Were there any changes or updates, not only to ceph?

Pontus Lindgren wrote:

I am also experiencing a similar issue: [...]

#6 Updated by Pontus Lindgren over 8 years ago

Not close to this date, no.

In June we deployed the cluster with ceph-deploy, on Dumpling. During June we did some performance testing and tuning and upgraded to Hammer; there have been no package upgrades since then.

Shinobu Kinjo wrote:

Were there any changes or updates, not only to ceph?


#7 Updated by Daniel Schneller over 8 years ago

Just when we thought it wouldn't come back, last night we saw a crash with this same trace again on one of the 48 OSDs... :(
After about 10 minutes the OSD came back, this time without manual intervention. Presumably Upstart took care of it.

#8 Updated by Shinobu Kinjo over 8 years ago

I can confirm that it's quite hard to reproduce. If you could provide the following, it would be a great help in finding a workaround, not only for the issue you are facing but also for a similar issue I've noticed. I can update with more details later.
Could you attach a log, like you did before, whenever you hit the same or a similar error?
I am taking a closer look at the root cause of this issue.

#9 Updated by Shinobu Kinjo over 8 years ago

Can you attach the log file from the OSD node producing that error?

#10 Updated by Sage Weil over 8 years ago

  • Status changed from New to Need More Info
  • Assignee set to Sage Weil
  • Priority changed from Normal to High

18446744073709551611 == 0xfffffffffffffffb -> -5 (EIO). check kern.log for errors on your disk?
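
Spelling that arithmetic out (an illustrative snippet, not from the Ceph tree):

#include <cerrno>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  // The "wrote" value from comment #4: a negative errno that was printed
  // through an unsigned 64-bit field.
  uint64_t wrote = 18446744073709551611ULL;
  // Two's-complement reinterpretation recovers the signed return code.
  int64_t rc = static_cast<int64_t>(wrote);  // -5
  std::printf("%llu == %#llx -> %lld (%s)\n",
              static_cast<unsigned long long>(wrote),
              static_cast<unsigned long long>(wrote),
              static_cast<long long>(rc),
              std::strerror(static_cast<int>(-rc)));  // EIO == 5: "Input/output error"
  return rc == -EIO ? 0 : 1;
}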

#11 Updated by Lukas Pustina over 8 years ago

Sage Weil wrote:

18446744073709551611 == 0xfffffffffffffffb -> -5 (EIO). check kern.log for errors on your disk?

The error just occurred again -- Daniel, the initial reporter, and I are colleagues. I checked kern.log as well as dmesg: kern.log is empty and dmesg does not show any hints.

Anything else I could add for your information?

#12 Updated by Sage Weil over 8 years ago

Lukas Pustina wrote:


The error just occurred again -- Daniel, the initial reporter, and I are colleagues. I checked kern.log as well as dmesg: kern.log is empty and dmesg does not show any hints.

Anything else I could add for your information?

It's interesting that the EIO is always preceded by slow requests... about 30-35 seconds old. What hardware is the journal on? raw block device or file?

#13 Updated by Lukas Pustina over 8 years ago

Sage Weil wrote:

It's interesting that the EIO is always preceded by slow requests... about 30-35 seconds old. What hardware is the journal on? raw block device or file?

The journal is a partition on the same disk as the OSD's data partition, so it's a raw block device. All OSDs were created by ceph-deploy, back on Firefly.

#14 Updated by Lukas Pustina over 8 years ago

Sage Weil wrote:

It's interesting that the EIO is always preceded by slow requests... about 30-35 seconds old. What hardware is the journal on? raw block device or file?

I just grepped our log files and could not find any hint of slow requests. Maybe we have a different situation than Pontus.

#15 Updated by Pontus Lindgren over 8 years ago

Happened again yesterday.

We use two RAID-mirrored SSDs as journals and individual HDDs as OSD data disks. We use them as raw block devices.

Kern.log:

Sep 10 19:51:28 osd9 kernel: [5999553.857899] sd 0:0:0:0: attempting task abort! scmd(ffff88194498e840)
Sep 10 19:51:28 osd9 kernel: [5999553.857904] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:28 osd9 kernel: [5999553.857906] Write(10): 2a 00 04 e4 18 80 00 00 48 00
Sep 10 19:51:28 osd9 kernel: [5999553.857913] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:28 osd9 kernel: [5999553.857914] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:28 osd9 kernel: [5999553.857940] sd 0:0:0:0: attempting task abort! scmd(ffff8802310d5840)
Sep 10 19:51:28 osd9 kernel: [5999553.857949] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:28 osd9 kernel: [5999553.857952] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
Sep 10 19:51:28 osd9 kernel: [5999553.857964] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:28 osd9 kernel: [5999553.857968] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351238] sd 0:0:0:0: task abort: SUCCESS scmd(ffff88194498e840)
Sep 10 19:51:29 osd9 kernel: [5999554.351253] sd 0:0:0:0: attempting task abort! scmd(ffff881db7f2bb00)
Sep 10 19:51:29 osd9 kernel: [5999554.351257] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351258] Write(10): 2a 00 00 1d aa 42 00 00 20 00
Sep 10 19:51:29 osd9 kernel: [5999554.351270] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351273] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351302] sd 0:0:0:0: task abort: SUCCESS scmd(ffff881db7f2bb00)
Sep 10 19:51:29 osd9 kernel: [5999554.351307] sd 0:0:0:0: attempting task abort! scmd(ffff88108c666340)
Sep 10 19:51:29 osd9 kernel: [5999554.351309] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351310] Write(10): 2a 00 04 e4 18 c8 00 00 10 00
Sep 10 19:51:29 osd9 kernel: [5999554.351319] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351322] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351329] sd 0:0:0:0: task abort: SUCCESS scmd(ffff88108c666340)
Sep 10 19:51:29 osd9 kernel: [5999554.351333] sd 0:0:0:0: attempting task abort! scmd(ffff881b020edc80)
Sep 10 19:51:29 osd9 kernel: [5999554.351335] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351336] Write(10): 2a 00 04 e4 18 d8 00 00 28 00
Sep 10 19:51:29 osd9 kernel: [5999554.351345] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351347] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351355] sd 0:0:0:0: task abort: SUCCESS scmd(ffff881b020edc80)
Sep 10 19:51:29 osd9 kernel: [5999554.351359] sd 0:0:0:0: attempting task abort! scmd(ffff881f5b1d2b40)
Sep 10 19:51:29 osd9 kernel: [5999554.351361] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351362] Write(10): 2a 00 04 e4 19 00 00 00 18 00
Sep 10 19:51:29 osd9 kernel: [5999554.351384] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351387] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351391] sd 0:0:0:0: task abort: SUCCESS scmd(ffff881f5b1d2b40)
Sep 10 19:51:29 osd9 kernel: [5999554.351393] sd 0:0:0:0: attempting task abort! scmd(ffff8814c9022940)
Sep 10 19:51:29 osd9 kernel: [5999554.351395] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351395] Write(10): 2a 00 03 2a 90 08 00 00 18 00
Sep 10 19:51:29 osd9 kernel: [5999554.351400] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351401] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351406] sd 0:0:0:0: task abort: SUCCESS scmd(ffff8814c9022940)
Sep 10 19:51:29 osd9 kernel: [5999554.351408] sd 0:0:0:0: attempting task abort! scmd(ffff8814cd85e3c0)
Sep 10 19:51:29 osd9 kernel: [5999554.351409] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351410] Write(10): 2a 00 04 e4 19 18 00 00 10 00
Sep 10 19:51:29 osd9 kernel: [5999554.351414] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351415] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351419] sd 0:0:0:0: task abort: SUCCESS scmd(ffff8814cd85e3c0)
Sep 10 19:51:29 osd9 kernel: [5999554.351421] sd 0:0:0:0: attempting task abort! scmd(ffff881095ec2340)
Sep 10 19:51:29 osd9 kernel: [5999554.351422] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351423] Write(10): 2a 00 04 e4 19 28 00 00 10 00
Sep 10 19:51:29 osd9 kernel: [5999554.351427] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351428] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351432] sd 0:0:0:0: task abort: SUCCESS scmd(ffff881095ec2340)
Sep 10 19:51:29 osd9 kernel: [5999554.351434] sd 0:0:0:0: attempting task abort! scmd(ffff88108ee17340)
Sep 10 19:51:29 osd9 kernel: [5999554.351436] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351436] Write(10): 2a 00 01 5a 08 d8 00 00 10 00
Sep 10 19:51:29 osd9 kernel: [5999554.351441] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351442] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351446] sd 0:0:0:0: task abort: SUCCESS scmd(ffff88108ee17340)
Sep 10 19:51:29 osd9 kernel: [5999554.351448] sd 0:0:0:0: attempting task abort! scmd(ffff88192b4e76c0)
Sep 10 19:51:29 osd9 kernel: [5999554.351449] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351450] Write(10): 2a 00 01 5a 08 e8 00 00 18 00
Sep 10 19:51:29 osd9 kernel: [5999554.351454] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351455] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351459] sd 0:0:0:0: task abort: SUCCESS scmd(ffff88192b4e76c0)
Sep 10 19:51:29 osd9 kernel: [5999554.351461] sd 0:0:0:0: attempting task abort! scmd(ffff88183cf91680)
Sep 10 19:51:29 osd9 kernel: [5999554.351462] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351463] Write(10): 2a 00 03 2a 90 20 00 00 40 00
Sep 10 19:51:29 osd9 kernel: [5999554.351467] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351468] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351472] sd 0:0:0:0: task abort: SUCCESS scmd(ffff88183cf91680)
Sep 10 19:51:29 osd9 kernel: [5999554.351474] sd 0:0:0:0: attempting task abort! scmd(ffff88146e256500)
Sep 10 19:51:29 osd9 kernel: [5999554.351475] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351476] Write(10): 2a 00 03 7e 18 00 00 00 08 00
Sep 10 19:51:29 osd9 kernel: [5999554.351480] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351481] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351485] sd 0:0:0:0: task abort: SUCCESS scmd(ffff88146e256500)
Sep 10 19:51:29 osd9 kernel: [5999554.351487] sd 0:0:0:0: attempting task abort! scmd(ffff8818ad2cc4c0)
Sep 10 19:51:29 osd9 kernel: [5999554.351488] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351489] Write(10): 2a 00 04 2f 5e 38 00 00 78 00
Sep 10 19:51:29 osd9 kernel: [5999554.351493] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351494] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351498] sd 0:0:0:0: task abort: SUCCESS scmd(ffff8818ad2cc4c0)
Sep 10 19:51:29 osd9 kernel: [5999554.351500] sd 0:0:0:0: attempting task abort! scmd(ffff8812b59c2680)
Sep 10 19:51:29 osd9 kernel: [5999554.351502] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351502] Write(10): 2a 00 04 2f 5e b0 00 00 10 00
Sep 10 19:51:29 osd9 kernel: [5999554.351507] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351508] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351512] sd 0:0:0:0: task abort: SUCCESS scmd(ffff8812b59c2680)
Sep 10 19:51:29 osd9 kernel: [5999554.351514] sd 0:0:0:0: attempting task abort! scmd(ffff881f9c59ae40)
Sep 10 19:51:29 osd9 kernel: [5999554.351515] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351516] Write(10): 2a 00 04 2f 5e c0 00 00 08 00
Sep 10 19:51:29 osd9 kernel: [5999554.351520] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351521] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351525] sd 0:0:0:0: task abort: SUCCESS scmd(ffff881f9c59ae40)
Sep 10 19:51:29 osd9 kernel: [5999554.351527] sd 0:0:0:0: attempting task abort! scmd(ffff88192b4e7cc0)
Sep 10 19:51:29 osd9 kernel: [5999554.351528] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351528] Write(10): 2a 00 03 2a 90 60 00 00 18 00
Sep 10 19:51:29 osd9 kernel: [5999554.351533] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351534] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351538] sd 0:0:0:0: task abort: SUCCESS scmd(ffff88192b4e7cc0)
Sep 10 19:51:29 osd9 kernel: [5999554.351540] sd 0:0:0:0: attempting task abort! scmd(ffff88114c6736c0)
Sep 10 19:51:29 osd9 kernel: [5999554.351541] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351541] Write(10): 2a 00 00 1d aa 62 00 00 18 00
Sep 10 19:51:29 osd9 kernel: [5999554.351546] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351547] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351551] sd 0:0:0:0: task abort: SUCCESS scmd(ffff88114c6736c0)
Sep 10 19:51:29 osd9 kernel: [5999554.351552] sd 0:0:0:0: attempting task abort! scmd(ffff8810962241c0)
Sep 10 19:51:29 osd9 kernel: [5999554.351554] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351554] Write(10): 2a 00 03 2a 90 78 00 00 10 00
Sep 10 19:51:29 osd9 kernel: [5999554.351558] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351560] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351564] sd 0:0:0:0: task abort: SUCCESS scmd(ffff8810962241c0)
Sep 10 19:51:29 osd9 kernel: [5999554.351565] sd 0:0:0:0: attempting task abort! scmd(ffff881ed5968940)
Sep 10 19:51:29 osd9 kernel: [5999554.351567] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351567] Write(10): 2a 00 04 2f 5e c8 00 00 10 00
Sep 10 19:51:29 osd9 kernel: [5999554.351572] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351573] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351577] sd 0:0:0:0: task abort: SUCCESS scmd(ffff881ed5968940)
Sep 10 19:51:29 osd9 kernel: [5999554.351579] sd 0:0:0:0: attempting task abort! scmd(ffff88108c666c40)
Sep 10 19:51:29 osd9 kernel: [5999554.351580] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351581] Write(10): 2a 00 01 5a 09 00 00 00 10 00
Sep 10 19:51:29 osd9 kernel: [5999554.351585] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351586] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351590] sd 0:0:0:0: task abort: SUCCESS scmd(ffff88108c666c40)
Sep 10 19:51:29 osd9 kernel: [5999554.351592] sd 0:0:0:0: attempting task abort! scmd(ffff8813b0faac40)
Sep 10 19:51:29 osd9 kernel: [5999554.351593] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351593] Write(10): 2a 00 00 1d aa 7a 00 00 18 00
Sep 10 19:51:29 osd9 kernel: [5999554.351598] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351599] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351603] sd 0:0:0:0: task abort: SUCCESS scmd(ffff8813b0faac40)
Sep 10 19:51:29 osd9 kernel: [5999554.351604] sd 0:0:0:0: attempting task abort! scmd(ffff88183cf91c80)
Sep 10 19:51:29 osd9 kernel: [5999554.351605] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351606] Write(10): 2a 00 04 2f 5e d8 00 00 18 00
Sep 10 19:51:29 osd9 kernel: [5999554.351610] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351611] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351615] sd 0:0:0:0: task abort: SUCCESS scmd(ffff88183cf91c80)
Sep 10 19:51:29 osd9 kernel: [5999554.351617] sd 0:0:0:0: attempting task abort! scmd(ffff881cc0b641c0)
Sep 10 19:51:29 osd9 kernel: [5999554.351618] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351619] Write(10): 2a 00 03 2a 90 88 00 00 18 00
Sep 10 19:51:29 osd9 kernel: [5999554.351623] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351624] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351628] sd 0:0:0:0: task abort: SUCCESS scmd(ffff881cc0b641c0)
Sep 10 19:51:29 osd9 kernel: [5999554.351630] sd 0:0:0:0: attempting task abort! scmd(ffff88165f47fe00)
Sep 10 19:51:29 osd9 kernel: [5999554.351631] sd 0:0:0:0: [sda] CDB:
Sep 10 19:51:29 osd9 kernel: [5999554.351632] Write(10): 2a 00 01 5a 09 10 00 00 18 00
Sep 10 19:51:29 osd9 kernel: [5999554.351636] scsi target0:0:0: handle(0x000a), sas_address(0x5003048001d62c00), phy(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351637] scsi target0:0:0: enclosure_logical_id(0x5003048001d62c3f), slot(0)
Sep 10 19:51:29 osd9 kernel: [5999554.351641] sd 0:0:0:0: task abort: SUCCESS scmd(ffff88165f47fe00)
Sep 10 19:51:29 osd9 kernel: [5999554.353039] sd 0:0:0:0: device_blocked, handle(0x000a)
Sep 10 19:51:29 osd9 kernel: [5999554.601366] sd 0:0:0:0: task abort: SUCCESS scmd(ffff8802310d5840)
Sep 10 19:51:30 osd9 kernel: [5999555.351304] end_request: I/O error, dev sda, sector 0

ceph osd log:

-12> 2015-09-10 19:51:28.705402 7fd5ad038700  0 log_channel(cluster) log [WRN] : 6 slow requests, 6 included below; oldest blocked for > 30.887589 secs
-11> 2015-09-10 19:51:28.705412 7fd5ad038700 0 log_channel(cluster) log [WRN] : slow request 30.013703 seconds old, received at 2015-09-10 19:50:58.691631: osd_repop(client.1229791.0:19771837 5.d78 a7ed7d78/rbd_data.12b6bb79e2a9e3.0000000000014dd4/head//5 v 2511'61470) currently started
-10> 2015-09-10 19:51:28.705417 7fd5ad038700 0 log_channel(cluster) log [WRN] : slow request 30.887589 seconds old, received at 2015-09-10 19:50:57.817744: osd_op(client.1224662.0:279052700 rbd_data.12a6ec2eb141f2.0000000000008408 [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 4038656~4096] 5.c73cbbae ack+ondisk+write e2511) currently waiting for subops from 7,40
-9> 2015-09-10 19:51:28.705422 7fd5ad038700 0 log_channel(cluster) log [WRN] : slow request 30.886017 seconds old, received at 2015-09-10 19:50:57.819316: osd_repop(client.1224662.0:279052739 5.4fa 300484fa/rbd_data.12a6ec2eb141f2.0000000000022c00/head//5 v 2511'178514) currently started
-8> 2015-09-10 19:51:28.705426 7fd5ad038700 0 log_channel(cluster) log [WRN] : slow request 30.884823 seconds old, received at 2015-09-10 19:50:57.820510: osd_repop(client.1224662.0:279052776 5.f67 5672ef67/rbd_data.12a6ec2eb141f2.0000000000022e34/head//5 v 2511'130839) currently started
-7> 2015-09-10 19:51:28.705429 7fd5ad038700 0 log_channel(cluster) log [WRN] : slow request 30.310598 seconds old, received at 2015-09-10 19:50:58.394736: osd_op(client.1935041.0:4301337 rbd_data.1d685c2eb141f2.0000000000003c5e [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 3506176~8192] 5.4b102e2f ack+ondisk+write e2511) currently waiting for subops from 4,44
-6> 2015-09-10 19:51:29.705612 7fd5ad038700 0 log_channel(cluster) log [WRN] : 9 slow requests, 4 included below; oldest blocked for > 31.887796 secs
-5> 2015-09-10 19:51:29.705628 7fd5ad038700 0 log_channel(cluster) log [WRN] : slow request 30.338665 seconds old, received at 2015-09-10 19:50:59.366876: osd_repop(client.2969535.0:29050953 5.bdf e9b45bdf/rbd_data.168e3879e2a9e3.00000000000043d2/head//5 v 2511'39174) currently started
-4> 2015-09-10 19:51:29.705635 7fd5ad038700 0 log_channel(cluster) log [WRN] : slow request 30.337287 seconds old, received at 2015-09-10 19:50:59.368253: osd_repop(client.2969535.0:29050989 5.ff3 dbd5cff3/rbd_data.168e3879e2a9e3.000000000000537e/head//5 v 2511'569823) currently started
-3> 2015-09-10 19:51:29.705642 7fd5ad038700 0 log_channel(cluster) log [WRN] : slow request 30.314158 seconds old, received at 2015-09-10 19:50:59.391382: osd_op(client.2943132.0:26301654 rbd_data.10ee2079e2a9e3.00000000000043fe [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 1777664~4096] 5.bfc0e023 ack+ondisk+write e2511) currently waiting for subops from 4,49
-2> 2015-09-10 19:51:29.705647 7fd5ad038700 0 log_channel(cluster) log [WRN] : slow request 31.294362 seconds old, received at 2015-09-10 19:50:58.411179: osd_repop(client.1226462.0:15526762 5.b08 42f4bb08/rbd_data.12b2222eb141f2.0000000000000c65/head//5 v 2511'222927) currently started
-1> 2015-09-10 19:51:30.064830 7fd5aa2da700 -1 journal aio to 995360768~8192 wrote 18446744073709551611
0> 2015-09-10 19:51:30.080507 7fd5aa2da700 -1 os/FileJournal.cc: In function 'void FileJournal::write_finish_thread_entry()' thread 7fd5aa2da700 time 2015-09-10 19:51:30.065477
os/FileJournal.cc: 1426: FAILED assert(0 == "unexpected aio error")
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xcdb572]
2: (FileJournal::write_finish_thread_entry()+0x847) [0xb9a437]
3: (FileJournal::WriteFinisher::entry()+0xd) [0xa3befd]
4: (()+0x6b50) [0x7fd5b5064b50]
5: (clone()+0x6d) [0x7fd5b3a8095d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 1 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_replay
0/ 0 journaler
0/ 5 objectcacher
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
1/ 3 keyvaluestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/10 civetweb
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
1/ 5 xio
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.30.log
--- end dump of recent events ---
2015-09-10 19:51:30.144096 7fd5aa2da700 -1 *** Caught signal (Aborted) **
in thread 7fd5aa2da700
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
1: /usr/bin/ceph-osd() [0xbef08c]
2: (()+0xf0a0) [0x7fd5b506d0a0]
3: (gsignal()+0x35) [0x7fd5b39d7165]
4: (abort()+0x180) [0x7fd5b39da3e0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5b422d89d]
6: (()+0x63996) [0x7fd5b422b996]
7: (()+0x639c3) [0x7fd5b422b9c3]
8: (()+0x63bee) [0x7fd5b422bbee]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xcdb720]
10: (FileJournal::write_finish_thread_entry()+0x847) [0xb9a437]
11: (FileJournal::WriteFinisher::entry()+0xd) [0xa3befd]
12: (()+0x6b50) [0x7fd5b5064b50]
13: (clone()+0x6d) [0x7fd5b3a8095d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
0> 2015-09-10 19:51:30.144096 7fd5aa2da700 -1 *** Caught signal (Aborted) **
in thread 7fd5aa2da700
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
1: /usr/bin/ceph-osd() [0xbef08c]
2: (()+0xf0a0) [0x7fd5b506d0a0]
3: (gsignal()+0x35) [0x7fd5b39d7165]
4: (abort()+0x180) [0x7fd5b39da3e0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd5b422d89d]
6: (()+0x63996) [0x7fd5b422b996]
7: (()+0x639c3) [0x7fd5b422b9c3]
8: (()+0x63bee) [0x7fd5b422bbee]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xcdb720]
10: (FileJournal::write_finish_thread_entry()+0x847) [0xb9a437]
11: (FileJournal::WriteFinisher::entry()+0xd) [0xa3befd]
12: (()+0x6b50) [0x7fd5b5064b50]
13: (clone()+0x6d) [0x7fd5b3a8095d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 1 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_replay
0/ 0 journaler
0/ 5 objectcacher
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
1/ 3 keyvaluestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/10 civetweb
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
1/ 5 xio
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.30.log
--- end dump of recent events ---
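
A note on the failure itself: the "journal aio to 995360768~8192 wrote 18446744073709551611" line is the key datum. Printed here as an unsigned 64-bit integer, the "wrote" value is actually a negative kernel return code (assuming FileJournal logs the raw aio completion result, i.e. the res field returned by io_getevents()). Decoding it:

echo '18446744073709551611 - 2^64' | bc   # -5 -> EIO   (this crash)
echo '18446744073709551615 - 2^64' | bc   # -1 -> EPERM  (the os-prober variant in later comments)

-EIO is consistent with the SCSI task aborts and the "end_request: I/O error, dev sda" entries in kern.log above.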

#16 Updated by Shinobu Kinjo over 8 years ago

Pontus Lindgren wrote:

Happened again yesterday.

We use two RAID-mirrored SSDs as journals and individual HDDs as OSD data disks. We use them as block devices.

Can you attach the full kernel log, messages, or whatever else the kernel generated, including the following logs?

Kern.log:

[...]

ceph osd log:

[...]

#17 Updated by Pontus Lindgren over 8 years ago

That is everything that kern.log logged that day.

#18 Updated by Sage Weil over 8 years ago

  • Status changed from Need More Info to Rejected

This is a hardware problem; we are correct to fail in this case.

#19 Updated by Pontus Lindgren over 8 years ago

How did you conclude that, Sage?

Do you have any pointers on what can be done?

#20 Updated by Pontus Lindgren about 8 years ago

Lowering the queue depth from 32 to 1 solved this issue for us.

echo 1 > /sys/block/sda/device/queue_depth
echo 1 > /sys/block/sdb/device/queue_depth

We still believe the root cause is that our SSDs are faulty or their firmware is buggy.
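
Note that values written under /sys do not survive a reboot, so this workaround has to be reapplied at boot. A minimal sketch (device names follow the example above; the boot-time hook, e.g. rc.local or a systemd unit, is left to the distribution):

# Reapply the reduced queue depth for the journal SSDs at boot;
# sysfs writes are not persistent.
for dev in sda sdb; do
    echo 1 > /sys/block/"$dev"/device/queue_depth
done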

Daniel Schneller wrote:

This morning we had an OSD crash on us with the following trace:

[...]

The full log is available at [[https://public.centerdevice.de/e42a98af-69ef-4330-b36c-5858889ba566]]
Attaching it here failed repeatedly.

After that, it recovered:
[...]

We could not find anything else relevant in our logs.
The disk seems to be fine, at least according to MegaCLI.
Server is a DELL PowerEdge R510.

We found (at least superficially) similar looking issues here: [[http://tracker.ceph.com/issues/11298]] and here: [[http://tracker.ceph.com/issues/9570]]

This was the first time we saw an issue like this.

#21 Updated by Loïc Dachary about 8 years ago

  • Related to Bug #11298: aio gets EPERM when update-grub runs added

#22 Updated by Loïc Dachary about 8 years ago

  • Related to Bug #9570: osd crash in FileJournal::WriteFinisher::entry() aio added

#23 Updated by Ilya Dryomov about 8 years ago

  • Related to deleted (Bug #11298: aio gets EPERM when update-grub runs)

#24 Updated by Wido den Hollander about 7 years ago

I encountered this issue this morning on several machines where Ubuntu's unattended upgrades kicked in and triggered GRUB's os-prober.

Feb 03 06:53:00 ceph01.ceph.XXXX os-prober[173079]: debug: running /usr/lib/os-probes/mounted/90linux-distro on mounted /dev/sdk1
Feb 03 06:53:00 ceph01.ceph.XXXX os-prober[173083]: debug: running /usr/lib/os-probes/mounted/90solaris on mounted /dev/sdk1
Feb 03 06:53:00 ceph01.ceph.XXXX os-prober[173093]: debug: running /usr/lib/os-probes/50mounted-tests on /dev/sdk2
Feb 03 06:53:00 ceph01.ceph.XXXX kernel: EXT4-fs (sdk2): VFS: Can't find ext4 filesystem
Feb 03 06:53:00 ceph01.ceph.XXXX kernel: EXT4-fs (sdk2): VFS: Can't find ext4 filesystem
Feb 03 06:53:00 ceph01.ceph.XXXX kernel: EXT4-fs (sdk2): VFS: Can't find ext4 filesystem
Feb 03 06:53:00 ceph01.ceph.XXXX kernel: squashfs: SQUASHFS error: Can't find a SQUASHFS superblock on sdk2
Feb 03 06:53:00 ceph01.ceph.XXXX ceph-osd[5854]: 2017-02-03 06:53:00.867436 7f62a6247700 -1 journal aio to 3241111552~4202496 wrote 18446744073709551615
Feb 03 06:53:00 ceph01.ceph.XXXX ceph-osd[5854]: os/filestore/FileJournal.cc: In function 'void FileJournal::write_finish_thread_entry()' thread 7f62a6247700 time 2017-02-03 06:53:00.867457
Feb 03 06:53:00 ceph01.ceph.XXXX ceph-osd[5854]: os/filestore/FileJournal.cc: 1546: FAILED assert(0 == "unexpected aio error")
Feb 03 06:53:00 ceph01.ceph.XXXX ceph-osd[5854]:  ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
Feb 03 06:53:00 ceph01.ceph.XXXX ceph-osd[5854]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5620bb2f7520]
Feb 03 06:53:00 ceph01.ceph.XXXX ceph-osd[5854]:  2: (FileJournal::write_finish_thread_entry()+0xa61) [0x5620bb0816b1]
Feb 03 06:53:00 ceph01.ceph.XXXX ceph-osd[5854]:  3: (FileJournal::WriteFinisher::entry()+0xd) [0x5620bafad92d]

The systems in this case are Dell R430 servers with Samsung PM863a SSDs.

Ceph 10.2.5 with kernel 4.8.

I have seen this before, but only on Dell servers, never on SuperMicro systems with LSI controllers.

#25 Updated by Lukas de Boer about 7 years ago

Wido den Hollander wrote:

[...]

Disabling os-prober seems to help (GRUB_DISABLE_OS_PROBER=true in /etc/default/grub).
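
For reference, a minimal sketch of that change on Debian/Ubuntu (this assumes GRUB_DISABLE_OS_PROBER is not already set in /etc/default/grub; edit the existing line instead if it is):

# Stop GRUB's os-prober from opening OSD/journal block devices during
# grub updates, then regenerate the GRUB configuration.
echo 'GRUB_DISABLE_OS_PROBER=true' >> /etc/default/grub
update-grub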

#26 Updated by buffalo ray almost 7 years ago

Pontus Lindgren wrote:

Lowering the queue depth from 32 to 1 solved this issue for us.

echo 1 > /sys/block/sda/device/queue_depth
echo 1 > /sys/block/sdb/device/queue_depth

We still believe the root cause is that our SSDs are faulty or their firmware is buggy.

[...]

"Lowering the queue depth" works for me, thanks.
But this solution lowering the iops too.
Finally, we update the sas fw to solve the problem.
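
A couple of checks that may help verify the state of a node (device name and tooling are examples; smartctl output varies by drive and controller):

# Current queue depth of the journal device.
cat /sys/block/sda/device/queue_depth
# Firmware revision reported by the drive, to confirm the update took.
smartctl -i /dev/sda | grep -i firmware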
