Bug #55328

closed

OSD crashed due to checksum error

Added by Shinya Hayashi about 2 years ago. Updated over 1 year ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

OSD.14 crashed and produced the following logs.

2022-04-09T02:18:07Z {} debug    -42> 2022-04-09T02:18:07.078+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _prepare_read_ioc    region 0x0: 0x0 reading 0x0~4000
2022-04-09T02:18:07Z {} debug    -41> 2022-04-09T02:18:07.078+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _do_read waiting for aio
2022-04-09T02:18:07Z {} debug    -40> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _generate_read_result_bl  blob Blob(0x55f7eaa36d20 blob([!~4000,0xa5b4000~1000] csum crc32c/0x1000) use_tracker(0x5*0x1000 0x[0,0,0,0,33e]) SharedBlob(0x55f7de2f1b90 sbid 0x0)) need 0x{<0x4000, 0x1000> : [0x4000:4000~33e]}
2022-04-09T02:18:07Z {} debug    -39> 2022-04-09T02:18:07.103+0000 7f488b599700 -1 bluestore(/var/lib/ceph/osd/ceph-14) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x4000, got 0x24dc4dde, expected 0x4c69e4bd, device location [0xa5b4000~1000], logical extent 0x4000~1000, object #-1:a806e935:::osdmap.146:0#
2022-04-09T02:18:07Z {} debug    -38> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _do_read 0x0~433e size 0x433e (17214)
2022-04-09T02:18:07Z {} debug    -37> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _do_read will do buffered read
2022-04-09T02:18:07Z {} debug    -36> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _read_cache  blob Blob(0x55f7eaf0df80 blob([0xa5b0000~4000] csum crc32c/0x1000) use_tracker(0x4*0x1000 0x[1000,1000,1000,1000]) SharedBlob(0x55f7de65de30 sbid 0x0)) need 0x0~4000 cache has 0x[]
2022-04-09T02:18:07Z {} debug    -35> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _read_cache  blob Blob(0x55f7eaa36d20 blob([!~4000,0xa5b4000~1000] csum crc32c/0x1000) use_tracker(0x5*0x1000 0x[0,0,0,0,33e]) SharedBlob(0x55f7de2f1b90 sbid 0x0)) need 0x4000~33e cache has 0x[]
2022-04-09T02:18:07Z {} debug    -34> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _prepare_read_ioc  blob Blob(0x55f7eaa36d20 blob([!~4000,0xa5b4000~1000] csum crc32c/0x1000) use_tracker(0x5*0x1000 0x[0,0,0,0,33e]) SharedBlob(0x55f7de2f1b90 sbid 0x0)) need {<0x4000, 0x1000> : [0x4000:4000~33e]}
2022-04-09T02:18:07Z {} debug    -33> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _prepare_read_ioc    region 0x4000: 0x4000 reading 0x4000~1000
2022-04-09T02:18:07Z {} debug    -32> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _prepare_read_ioc  blob Blob(0x55f7eaf0df80 blob([0xa5b0000~4000] csum crc32c/0x1000) use_tracker(0x4*0x1000 0x[1000,1000,1000,1000]) SharedBlob(0x55f7de65de30 sbid 0x0)) need {<0x0, 0x4000> : [0x0:0~4000]}
2022-04-09T02:18:07Z {} debug    -31> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _prepare_read_ioc    region 0x0: 0x0 reading 0x0~4000
2022-04-09T02:18:07Z {} debug    -30> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _do_read waiting for aio
2022-04-09T02:18:07Z {} debug    -29> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _generate_read_result_bl  blob Blob(0x55f7eaa36d20 blob([!~4000,0xa5b4000~1000] csum crc32c/0x1000) use_tracker(0x5*0x1000 0x[0,0,0,0,33e]) SharedBlob(0x55f7de2f1b90 sbid 0x0)) need 0x{<0x4000, 0x1000> : [0x4000:4000~33e]}
2022-04-09T02:18:07Z {} debug    -28> 2022-04-09T02:18:07.103+0000 7f488b599700 -1 bluestore(/var/lib/ceph/osd/ceph-14) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x4000, got 0x24dc4dde, expected 0x4c69e4bd, device location [0xa5b4000~1000], logical extent 0x4000~1000, object #-1:a806e935:::osdmap.146:0#
2022-04-09T02:18:07Z {} debug    -27> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _do_read 0x0~433e size 0x433e (17214)
2022-04-09T02:18:07Z {} debug    -26> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _do_read will do buffered read
2022-04-09T02:18:07Z {} debug    -25> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _read_cache  blob Blob(0x55f7eaf0df80 blob([0xa5b0000~4000] csum crc32c/0x1000) use_tracker(0x4*0x1000 0x[1000,1000,1000,1000]) SharedBlob(0x55f7de65de30 sbid 0x0)) need 0x0~4000 cache has 0x[]
2022-04-09T02:18:07Z {} debug    -24> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _read_cache  blob Blob(0x55f7eaa36d20 blob([!~4000,0xa5b4000~1000] csum crc32c/0x1000) use_tracker(0x5*0x1000 0x[0,0,0,0,33e]) SharedBlob(0x55f7de2f1b90 sbid 0x0)) need 0x4000~33e cache has 0x[]
2022-04-09T02:18:07Z {} debug    -23> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _prepare_read_ioc  blob Blob(0x55f7eaa36d20 blob([!~4000,0xa5b4000~1000] csum crc32c/0x1000) use_tracker(0x5*0x1000 0x[0,0,0,0,33e]) SharedBlob(0x55f7de2f1b90 sbid 0x0)) need {<0x4000, 0x1000> : [0x4000:4000~33e]}
2022-04-09T02:18:07Z {} debug    -22> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _prepare_read_ioc    region 0x4000: 0x4000 reading 0x4000~1000
2022-04-09T02:18:07Z {} debug    -21> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _prepare_read_ioc  blob Blob(0x55f7eaf0df80 blob([0xa5b0000~4000] csum crc32c/0x1000) use_tracker(0x4*0x1000 0x[1000,1000,1000,1000]) SharedBlob(0x55f7de65de30 sbid 0x0)) need {<0x0, 0x4000> : [0x0:0~4000]}
2022-04-09T02:18:07Z {} debug    -20> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _prepare_read_ioc    region 0x0: 0x0 reading 0x0~4000
2022-04-09T02:18:07Z {} debug    -19> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _do_read waiting for aio
2022-04-09T02:18:07Z {} debug    -18> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _generate_read_result_bl  blob Blob(0x55f7eaa36d20 blob([!~4000,0xa5b4000~1000] csum crc32c/0x1000) use_tracker(0x5*0x1000 0x[0,0,0,0,33e]) SharedBlob(0x55f7de2f1b90 sbid 0x0)) need 0x{<0x4000, 0x1000> : [0x4000:4000~33e]}
2022-04-09T02:18:07Z {} debug    -17> 2022-04-09T02:18:07.103+0000 7f488b599700 -1 bluestore(/var/lib/ceph/osd/ceph-14) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x4000, got 0x24dc4dde, expected 0x4c69e4bd, device location [0xa5b4000~1000], logical extent 0x4000~1000, object #-1:a806e935:::osdmap.146:0#
2022-04-09T02:18:07Z {} debug    -16> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _do_read 0x0~433e size 0x433e (17214)
2022-04-09T02:18:07Z {} debug    -15> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _do_read will do buffered read
2022-04-09T02:18:07Z {} debug    -14> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _read_cache  blob Blob(0x55f7eaf0df80 blob([0xa5b0000~4000] csum crc32c/0x1000) use_tracker(0x4*0x1000 0x[1000,1000,1000,1000]) SharedBlob(0x55f7de65de30 sbid 0x0)) need 0x0~4000 cache has 0x[]
2022-04-09T02:18:07Z {} debug    -13> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _read_cache  blob Blob(0x55f7eaa36d20 blob([!~4000,0xa5b4000~1000] csum crc32c/0x1000) use_tracker(0x5*0x1000 0x[0,0,0,0,33e]) SharedBlob(0x55f7de2f1b90 sbid 0x0)) need 0x4000~33e cache has 0x[]
2022-04-09T02:18:07Z {} debug    -12> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _prepare_read_ioc  blob Blob(0x55f7eaa36d20 blob([!~4000,0xa5b4000~1000] csum crc32c/0x1000) use_tracker(0x5*0x1000 0x[0,0,0,0,33e]) SharedBlob(0x55f7de2f1b90 sbid 0x0)) need {<0x4000, 0x1000> : [0x4000:4000~33e]}
2022-04-09T02:18:07Z {} debug    -11> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _prepare_read_ioc    region 0x4000: 0x4000 reading 0x4000~1000
2022-04-09T02:18:07Z {} debug    -10> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _prepare_read_ioc  blob Blob(0x55f7eaf0df80 blob([0xa5b0000~4000] csum crc32c/0x1000) use_tracker(0x4*0x1000 0x[1000,1000,1000,1000]) SharedBlob(0x55f7de65de30 sbid 0x0)) need {<0x0, 0x4000> : [0x0:0~4000]}
2022-04-09T02:18:07Z {} debug     -9> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _prepare_read_ioc    region 0x0: 0x0 reading 0x0~4000
2022-04-09T02:18:07Z {} debug     -8> 2022-04-09T02:18:07.103+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _do_read waiting for aio
2022-04-09T02:18:07Z {} debug     -7> 2022-04-09T02:18:07.104+0000 7f488b599700 20 bluestore(/var/lib/ceph/osd/ceph-14) _generate_read_result_bl  blob Blob(0x55f7eaa36d20 blob([!~4000,0xa5b4000~1000] csum crc32c/0x1000) use_tracker(0x5*0x1000 0x[0,0,0,0,33e]) SharedBlob(0x55f7de2f1b90 sbid 0x0)) need 0x{<0x4000, 0x1000> : [0x4000:4000~33e]}
2022-04-09T02:18:07Z {} debug     -6> 2022-04-09T02:18:07.104+0000 7f488b599700 -1 bluestore(/var/lib/ceph/osd/ceph-14) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x4000, got 0x24dc4dde, expected 0x4c69e4bd, device location [0xa5b4000~1000], logical extent 0x4000~1000, object #-1:a806e935:::osdmap.146:0#
2022-04-09T02:18:07Z {} debug     -5> 2022-04-09T02:18:07.104+0000 7f488b599700 20 _unpin0x55f7dc238000   #-1:a806e935:::osdmap.146:0# unpinned
2022-04-09T02:18:07Z {} debug     -4> 2022-04-09T02:18:07.104+0000 7f488b599700 10 bluestore(/var/lib/ceph/osd/ceph-14) read meta #-1:a806e935:::osdmap.146:0# 0x0~433e = -5
2022-04-09T02:18:07Z {} debug     -3> 2022-04-09T02:18:07.108+0000 7f488b599700 -1 /root/project/src/ceph/src/osd/OSD.cc: In function 'void OSD::handle_osd_map(MOSDMap*)' thread 7f488b599700 time 2022-04-09T02:18:07.105274+0000
2022-04-09T02:18:07Z {} /root/project/src/ceph/src/osd/OSD.cc: 8061: FAILED ceph_assert(p != added_maps_bl.end())
2022-04-09T02:18:07Z {}
2022-04-09T02:18:07Z {} ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
2022-04-09T02:18:07Z {} 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x55f7d0e4ee61]
2022-04-09T02:18:07Z {} 2: ceph-osd(+0xac6069) [0x55f7d0e4f069]
2022-04-09T02:18:07Z {} 3: (OSD::handle_osd_map(MOSDMap*)+0x15e2) [0x55f7d0f2fe22]
2022-04-09T02:18:07Z {} 4: (OSD::_dispatch(Message*)+0x18b) [0x55f7d0f52ccb]
2022-04-09T02:18:07Z {} 5: (OSD::ms_dispatch(Message*)+0x84) [0x55f7d0f53014]
2022-04-09T02:18:07Z {} 6: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0xb9) [0x55f7d19f8719]
2022-04-09T02:18:07Z {} 7: (DispatchQueue::entry()+0x58f) [0x55f7d19f73cf]
2022-04-09T02:18:07Z {} 8: (DispatchQueue::DispatchThread::entry()+0x11) [0x55f7d1810e11]
2022-04-09T02:18:07Z {} 9: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f489d75c609]
2022-04-09T02:18:07Z {} 10: clone()
2022-04-09T02:18:07Z {}
2022-04-09T02:18:07Z {} debug     -2> 2022-04-09T02:18:07.114+0000 7f488b599700 -1 *** Caught signal (Aborted) **
2022-04-09T02:18:07Z {} in thread 7f488b599700 thread_name:ms_dispatch
2022-04-09T02:18:07Z {}
2022-04-09T02:18:07Z {} ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
2022-04-09T02:18:07Z {} 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f489d7683c0]
2022-04-09T02:18:07Z {} 2: gsignal()
2022-04-09T02:18:07Z {} 3: abort()
2022-04-09T02:18:07Z {} 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ad) [0x55f7d0e4eebc]
2022-04-09T02:18:07Z {} 5: ceph-osd(+0xac6069) [0x55f7d0e4f069]
2022-04-09T02:18:07Z {} 6: (OSD::handle_osd_map(MOSDMap*)+0x15e2) [0x55f7d0f2fe22]
2022-04-09T02:18:07Z {} 7: (OSD::_dispatch(Message*)+0x18b) [0x55f7d0f52ccb]
2022-04-09T02:18:07Z {} 8: (OSD::ms_dispatch(Message*)+0x84) [0x55f7d0f53014]
2022-04-09T02:18:07Z {} 9: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0xb9) [0x55f7d19f8719]
2022-04-09T02:18:07Z {} 10: (DispatchQueue::entry()+0x58f) [0x55f7d19f73cf]
2022-04-09T02:18:07Z {} 11: (DispatchQueue::DispatchThread::entry()+0x11) [0x55f7d1810e11]
2022-04-09T02:18:07Z {} 12: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f489d75c609]
2022-04-09T02:18:07Z {} 13: clone()
2022-04-09T02:18:07Z {} NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This OSD had been running for about a day before it crashed, and no restart had occurred in the meantime.

We rebooted another node while this OSD was running, but we are not sure whether that is related to this issue.

We collected the log of OSD.14. Because it is too large to attach to this ticket, we are sharing a link to it instead.
https://cybozu-my.sharepoint.com/:u:/g/personal/shayashi_cybozu_onmicrosoft_com/EfIMgrTgsdxPnvu9D2d9AcIBpKx_ssP0tZtvLfA6yLtl1w


Files

dd.bin (1000 KB) dd.bin Shinya Hayashi, 05/16/2022 02:42 AM
dd_to_end.zip (383 KB) dd_to_end.zip Shinya Hayashi, 05/17/2022 01:25 AM
Actions #1

Updated by Shinya Hayashi about 2 years ago

We ran the dd command against the disk area in question. The result is as follows.

$ sudo dd if=/dev/mapper/crypt-vdc bs=4096 skip=42420 count=1 status=none
T07:37:55.89317367Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 20 bluefs _replay 0x13f000:  op_file_remove 669
2022-04-10T07:37:55.893249874Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 10 bluefs _read h 0x5578e0784d00 0x140000~1000 from file(ino 1 size 0x140000 mtime 0.000000 allocated 410000 extents [1:0x22f0000~10000,1:0x18e10000~400000])
2022-04-10T07:37:55.893297124Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 20 bluefs _read left 0xd0000 len 0x1000
2022-04-10T07:37:55.893300481Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 20 bluefs _read got 4096
2022-04-10T07:37:55.893302704Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 10 bluefs _replay 0x140000: txn(seq 12040 len 0x63 crc 0xd0a093bd)
2022-04-10T07:37:55.893304748Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 20 bluefs _replay 0x140000:  op_file_update  file(ino 676 size 0x0 mtime 2022-04-09T15:42:58.451633+0000 allocated 0 extents [])
2022-04-10T07:37:55.893306762Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 20 bluefs _replay 0x140000:  op_dir_link  db/OPTIONS-000519.dbtmp to 676
2022-04-10T07:37:55.893308906Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 20 bluefs _replay 0x140000:  op_file_update  file(ino 676 size 0x8f27 mtime 2022-04-09T15:42:58.453299+0000 allocated 10000 extents [1:0x3cd0000~10000])
2022-04-10T07:37:55.89331101Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 10 bluefs _read h 0x5578e0784d00 0x141000~1000 from file(ino 1 size 0x141000 mtime 0.000000 allocated 410000 extents [1:0x22f0000~10000,1:0x18e10000~400000])
2022-04-10T07:37:55.893313214Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 20 bluefs _read left 0xcf000 len 0x1000
2022-04-10T07:37:55.893315258Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 20 bluefs _read got 4096
2022-04-10T07:37:55.893317252Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 10 bluefs _replay 0x141000: txn(seq 12041 len 0x5d crc 0x2cb3eb15)
2022-04-10T07:37:55.893319276Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 20 bluefs _replay 0x141000:  op_file_update  file(ino 677 size 0x0 mtime 2022-04-09T15:48:06.425269+0000 allocated 0 extents [])
2022-04-10T07:37:55.893321369Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 20 bluefs _replay 0x141000:  op_dir_link  db/MANIFEST-000520 to 677
2022-04-10T07:37:55.893323463Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 20 bluefs _replay 0x141000:  op_file_update  file(ino 677 size 0x85f mtime 2022-04-09T15:48:06.426116+0000 allocated 10000 extents [1:0x3d0000~10000])
2022-04-10T07:37:55.893325587Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 10 bluefs _read h 0x5578e0784d00 0x142000~1000 from file(ino 1 size 0x142000 mtime 0.000000 allocated 410000 extents [1:0x22f0000~10000,1:0x18e10000~400000])
2022-04-10T07:37:55.893327863Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 20 bluefs _read left 0xce000 len 0x1000
2022-04-10T07:37:55.893329907Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 20 bluefs _read got 4096
2022-04-10T07:37:55.893332Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 10 bluefs _replay 0x142000: txn(seq 12042 len 0x59 crc 0x2eb1e13e)
2022-04-10T07:37:55.893334105Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 20 bluefs _replay 0x142000:  op_file_update  file(ino 678 size 0x0 mtime 2022-04-09T15:48:06.426888+0000 allocated 0 extents [])
2022-04-10T07:37:55.893340867Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 20 bluefs _replay 0x142000:  op_dir_link  db/000520.dbtmp to 678
2022-04-10T07:37:55.893343893Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 20 bluefs _replay 0x142000:  op_file_update  file(ino 678 size 0x10 mtime 2022-04-09T15:48:06.427505+0000 allocated 10000 extents [1:0x3e0000~10000])
2022-04-10T07:37:55.893346017Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 10 bluefs _read h 0x5578e0784d00 0x143000~1000 from file(ino 1 size 0x143000 mtime 0.000000 allocated 410000 extents
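
For reference, the skip value above follows directly from the device location reported by _verify_csum: with bs=4096, block 42420 starts at byte 42420 * 4096 = 0xa5b4000, i.e. the corrupted extent [0xa5b4000~1000]. A minimal shell sketch of that arithmetic, using the device path from this ticket (the strings/head filtering is only for readability and was not part of the original command):

    # convert the reported device offset into a 4 KiB block index
    echo $((0xa5b4000 / 4096))        # prints 42420, the skip value used above
    # read back that single block and show any printable content
    sudo dd if=/dev/mapper/crypt-vdc bs=4096 skip=$((0xa5b4000 / 4096)) count=1 status=none | strings | head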
Actions #2

Updated by Shinya Hayashi almost 2 years ago

Can anyone respond? I believe this is a serious problem, since the data appears to be corrupted.

Actions #3

Updated by Igor Fedotov almost 2 years ago

  • Project changed from Ceph to bluestore
  • Category deleted (OSD)
Actions #4

Updated by Igor Fedotov almost 2 years ago

Hi Shinya,
Is my description below correct? Does it give you any clue?

1) At 02:18:07 on Apr 09 the OSD detected a checksum failure at offset 0xa5b4000
2022-04-09T02:18:07Z {} debug -6> 2022-04-09T02:18:07.104+0000 7f488b599700 -1 bluestore(/var/lib/ceph/osd/ceph-14) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x4000, got 0x24dc4dde, expected 0x4c69e4bd, device location [0xa5b4000~1000], logical extent 0x4000~1000, object #-1:a806e935:::osdmap.146:0#

2) Later you checked the content of the disk at the offset in question, 0xa5b4000, via dd and got an OSD log snippet dated 07:37:55 on Apr 10.

This evidently means that the log snippet was written to disk later(!) than the checksum failure happened, as the timestamp in the log snippet is newer than the original failure timestamp.

3) Moreover, the BlueFS file timestamp from the log snippet, 15:48:06 on Apr 09, is also later than the original failure:

2022-04-10T07:37:55.893319276Z stderr F debug 2022-04-10T07:37:55.883+0000 7ffbe6035f00 20 bluefs _replay 0x141000: op_file_update file(ino 677 size 0x0 mtime 2022-04-09T15:48:06.425269+0000 allocated 0 extents [])

This means that BlueFS was operational at that point, and it highly likely belongs to a different OSD.

4) The above makes me think that some OSD's logging (from the same OSD or, more likely, another one) somehow shares the OSD device and hence corrupts it. Does that make any sense? IIUC this is a containerized deployment; maybe some Rook misconfiguration?

Additionally, you might try to dd more disk chunks around the offset in question and analyze the content in an attempt to determine the source of this log snippet. I am also curious whether its content is still the same now or whether it has been overwritten with some other data.

Actions #5

Updated by Shinya Hayashi almost 2 years ago

Hi Igor

Thank you for your response.

1) At 02:18:07 on Apr 09 the OSD detected a checksum failure at offset 0xa5b4000

Correct.

This evidently means that the log snippet was written to disk later(!) than the checksum failure happened, as the timestamp in the log snippet is newer than the original failure timestamp.

That is true. However, because we deploy Ceph using Rook, osd.14 kept restarting and crashing after the first crash. So the area may have been overwritten by osd.14 itself when it restarted, not by another OSD.

IIUC this is a containerized deployment; maybe some Rook misconfiguration?

In this test, the Rook configuration is the same as the one we usually use. Also, unless we intentionally reboot the node repeatedly, the test works fine. Although we will double-check the Rook configuration, we do not think Rook is misconfigured.

Additionally, you might try to dd more disk chunks around the offset in question and analyze the content in an attempt to determine the source of this log snippet. I am also curious whether its content is still the same now or whether it has been overwritten with some other data.

Currently, we can still read the same data as in https://tracker.ceph.com/issues/55328#note-1. We read out the surrounding data with the dd command, but we could not find any clues as to who wrote it. The data read out is attached.

Could you give me some advice on what else we should look into?

Actions #6

Updated by Igor Fedotov almost 2 years ago

Hi Shinya,
I can suggest the following steps to proceed with the investigation into the origin of this log snippet (a command sketch follows at the end of this comment):
1) Grep the node's filesystem (and apparently all available OSD containers' filesystems) for some pattern specific to that log, e.g. the first line: "2022-04-10T07:37:55.880+0000 7ffbe6035f00 20 bluefs _replay 0x133000: op_dir_link db/000511.dbtmp to 666".
If it is found in a specific log file (not in a raw disk block), this might point to the OSD in question.

2) Unfortunately, the disk block dump you shared lacks any log output at the beginning, so we don't have OSD startup info for analysis. But the log in this dump is still unfinished, so hopefully we can find the BlueFS replay completion if we proceed with scanning even more disk blocks. Hence I suggest reading more disk blocks following the broken offset in an attempt to find that BlueFS replay completion; below is a sample of how it should look (evidently the uuid/addresses/timestamps may differ):
2022-04-08T02:19:12Z {} debug 2022-04-08T02:19:12.954+0000 7f489d0bef00 10 bluefs _replay 0x29000: stop: uuid 00000000-0000-0000-0000-000000000000 != super.uuid 8f67605c-9740-40e7-9c44-44e298ad8911
2022-04-08T02:19:12Z {} debug 2022-04-08T02:19:12.954+0000 7f489d0bef00 10 bluefs _replay log file size was 0x29000
2022-04-08T02:19:12Z {} debug 2022-04-08T02:19:12.954+0000 7f489d0bef00 10 bluefs _replay done

3) Could you please capture a fresh OSD startup log with debug-bluefs set to 20? I'd like to check whether it somehow matches the log snippet in question, e.g. whether the maximum observed ino id in the fresh log is equal to or higher than the ones in the snippet. If that is not the case, it means osd.14 isn't the culprit.
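
A minimal command sketch of steps 1 and 2 above. The device path, skip value, and dbtmp filename come from this ticket; the /var/log/pods search root and the block count are only assumptions (grepping for the dbtmp filename alone avoids whitespace differences in the log line):

    # step 1: look for the snippet's content in real log files rather than in raw disk blocks
    sudo grep -r "db/000511.dbtmp" /var/log/pods 2>/dev/null
    # step 2: scan more 4 KiB blocks after the broken offset and look for the BlueFS replay completion
    sudo dd if=/dev/mapper/crypt-vdc bs=4096 skip=42420 count=4096 status=none | strings | grep -n "bluefs _replay"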

Actions #7

Updated by Shinya Hayashi almost 2 years ago

Hi Igor,

Thank you for your kind suggestions.

1) Grep the node's filesystem (and apparently all available OSD containers' filesystems) for some pattern specific to that log, e.g. the first line: "2022-04-10T07:37:55.880+0000 7ffbe6035f00 20 bluefs _replay 0x133000: op_dir_link db/000511.dbtmp to 666".

I will try it.

2) Unfortunately, the disk block dump you shared lacks any log output at the beginning.

I read more disk blocks following the broken offset. I found that the date in the log jumps from 2022-04-10 to 2022-05-01 at some point. Although I found a BlueFS replay completion log dated 2022-05-01, I am not sure whether that is what you want.
Perhaps we should reproduce the problem and collect the data again.

- Date jump log

2022-04-10T07:37:55.925401926Z stderr F debug 2022-04-10T07:37:55.917+0000 7ffbe6035f00  4 rocksdb:                Options.max_bytes_for_level_base: 268435456
2022-04-10T07:37:55.92540397Z stderr F debug 2022-04-10T07:37:55.917+0000 7ffbe6035f00  4 rocksdb: Options.level_compaction_dynamic_level_bytes: 0
2022-04-10T07:37:55.925405983Z stderr F debug 2022-04-10T07:37:55.917+0000 7ffbe6035f00  4 rocksdb:          Options.max_by0 7fdad8c25f00 20 bluefs _read left 0x9a000 len 0x1000
2022-05-01T15:40:27.01038562Z stderr F debug 2022-05-01T15:40:27.001+0000 7fdad8c25f00 20 bluefs _read got 4096
2022-05-01T15:40:27.010388135Z stderr F debug 2022-05-01T15:40:27.001+0000 7fdad8c25f00 10 bluefs _replay 0x5d6000: txn(seq 41381 len 0x7e crc 0x2716c058)
2022-05-01T15:40:27.010390248Z stderr F debug 2022-05-01T15:40:27.001+0000 7fdad8c25f00 20 bluefs _replay 0x5d6000:  op_file_update  file(ino 24143 size 0x0 mtime 2022-04-30T15:27:17.158679+0000 allocated 0 extents [])

- BlueFS replay completion log

2022-05-01T15:40:27.076035438Z stderr F debug 2022-05-01T15:40:27.065+0000 7fdad8c25f00 10 bluefs _replay 0xb55000: stop: uuid fd4d665a-e38c-7471-0d9c-2bae80431653 != super.uuid 8f67605c-9740-40e7-9c44-44e298ad8911
2022-05-01T15:40:27.076038784Z stderr F debug 2022-05-01T15:40:27.065+0000 7fdad8c25f00 10 bluefs _replay log file size was 0xb55000
2022-05-01T15:40:27.076042201Z stderr F debug 2022-05-01T15:40:27.065+0000 7fdad8c25f00 10 bluefs _replay done

3) Could you please capture a fresh OSD startup log with debug-bluefs set to 20?

Sorry for the lack of explanation on my part. The log shared on this issue was collected with the following log-level configuration, and it contains the startup log of OSD.14 from a fresh start, so I believe it satisfies your request.
I will share the link to the log again.
https://cybozu-my.sharepoint.com/:u:/g/personal/shayashi_cybozu_onmicrosoft_com/EfIMgrTgsdxPnvu9D2d9AcIBpKx_ssP0tZtvLfA6yLtl1w

    [global]
    debug_bluefs = 20/20
    debug_bluestore = 20/20
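
For reference, an equivalent way to raise the same debug levels at runtime through the ceph CLI would be something like the sketch below; this is only an alternative, not how the logs in this ticket were collected (those used the config-file settings shown above):

    ceph config set osd.14 debug_bluefs 20/20
    ceph config set osd.14 debug_bluestore 20/20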
Actions #8

Updated by Shinya Hayashi almost 2 years ago

I read more disk blocks following the broken offset.

Sorry, I forgot to attach the collected data.

Actions #9

Updated by Igor Fedotov almost 2 years ago

Hey Shinya,

Sorry for the lack of explanation on my part. The log shared on this issue was collected with the following log-level configuration, and it contains the startup log of OSD.14 from a fresh start, so I believe it satisfies your request.
I will share the link to the log again.

The log under the link refers to the OSD startup on Apr 09, while by a fresh one I mean the resulting log from one more new(!) OSD startup attempt.

Sorry for the confusion.

Actions #10

Updated by Shinya Hayashi almost 2 years ago

Hi Igor,

The log under the link refers to the OSD startup on Apr 09, while by a fresh one I mean the resulting log from one more new(!) OSD startup attempt.

I got it. I will try to reproduce the problem and collect the log.
It may take a few weeks or longer.

Actions #11

Updated by Igor Fedotov almost 2 years ago

Shinya Hayashi wrote:

Hi Igor,

The log under the link refers to the OSD startup on Apr 09, while by a fresh one I mean the resulting log from one more new(!) OSD startup attempt.

I got it. I will try to reproduce the problem and collect the log.
It may take a few weeks or longer.

So you've already destroyed this OSD, right?

Actions #12

Updated by Shinya Hayashi almost 2 years ago

Hi Igor,

So you've already destroyed this OSD, right?

Yes. I destroyed it because the logs had been lost.
Since we run Ceph in a Rook environment, logs are written to standard output rather than to files, and they are collected by a dedicated log management stack (Promtail/Loki) that also runs on the same Kubernetes cluster. Due to the restrictions of our testing environment, logs are only kept for several days.

Actions #13

Updated by Igor Fedotov almost 2 years ago

  • Status changed from New to Need More Info
Actions #14

Updated by Shinya Hayashi almost 2 years ago

Hi Igor,

We succeeded in reproducing the problem. This time it occurred on OSD.7. However, there are a few differences.

First, there are so many "bad crc" logs before the first crash.
(Last time, the OSD crashed right after the first "bad crc" log.)

The first "bad crc" logs appeared at 2022-05-19T08:11:40, and the first OSD crash occurred at 2022-05-22T05:09:25. The cause of the crash looks the same as the last time.
(At least, the backtrace looks similar to the last one.)

2022-05-22T05:09:25Z {} debug 2022-05-22T05:09:25.213+0000 7f9a642f1700 20 bluestore(/var/lib/ceph/osd/ceph-7) _prepare_read_ioc    region 0x0: 0x0 reading 0x0~4000
2022-05-22T05:09:25Z {} debug 2022-05-22T05:09:25.213+0000 7f9a642f1700 20 bluestore(/var/lib/ceph/osd/ceph-7) _prepare_read_ioc  blob Blob(0x55a4e991c8c0 blob([!~4000,0x31d3a6000~1000] csum crc32c/0x1000) use_tracker(0x5*0x1000 0x[0,0,0,0,1c5]) SharedBlob(0x55a4f20c4e00 sbid 0x0)) need {<0x4000, 0x1000> : [0x4000:4000~1c5]}
2022-05-22T05:09:25Z {} debug 2022-05-22T05:09:25.213+0000 7f9a642f1700 20 bluestore(/var/lib/ceph/osd/ceph-7) _prepare_read_ioc    region 0x4000: 0x4000 reading 0x4000~1000
2022-05-22T05:09:25Z {} debug 2022-05-22T05:09:25.213+0000 7f9a642f1700 20 bluestore(/var/lib/ceph/osd/ceph-7) _do_read waiting for aio
2022-05-22T05:09:25Z {} /root/project/src/ceph/src/osd/OSD.cc: In function 'void OSD::handle_osd_map(MOSDMap*)' thread 7f9a642f1700 time 2022-05-22T05:09:25.215357+0000
2022-05-22T05:09:25Z {} /root/project/src/ceph/src/osd/OSD.cc: 8061: FAILED ceph_assert(p != added_maps_bl.end())
2022-05-22T05:09:25Z {} debug 2022-05-22T05:09:25.214+0000 7f9a642f1700 20 bluestore(/var/lib/ceph/osd/ceph-7) _generate_read_result_bl  blob Blob(0x55a4e1901b20 blob([0x31d3a2000~4000] csum crc32c/0x1000) use_tracker(0x4*0x1000 0x[1000,1000,1000,1000]) SharedBlob(0x55a4ece488c0 sbid 0x0)) need 0x{<0x0, 0x4000> : [0x0:0~4000]}
2022-05-22T05:09:25Z {} debug 2022-05-22T05:09:25.214+0000 7f9a642f1700 -1 bluestore(/var/lib/ceph/osd/ceph-7) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0xd091aa77, expected 0x616a8cfa, device location [0x31d3a2000~1000], logical extent 0x0~1000, object #-1:a0bee935:::osdmap.163:0#
2022-05-22T05:09:25Z {} debug 2022-05-22T05:09:25.215+0000 7f9a642f1700 20 _unpin0x55a4c1a1a000   #-1:a0bee935:::osdmap.163:0# unpinned
2022-05-22T05:09:25Z {} debug 2022-05-22T05:09:25.215+0000 7f9a642f1700 10 bluestore(/var/lib/ceph/osd/ceph-7) read meta #-1:a0bee935:::osdmap.163:0# 0x0~41c5 = -5
2022-05-22T05:09:25Z {} ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
2022-05-22T05:09:25Z {} 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x55a4b5b60e61]
2022-05-22T05:09:25Z {} 2: ceph-osd(+0xac6069) [0x55a4b5b61069]
2022-05-22T05:09:25Z {} 3: (OSD::handle_osd_map(MOSDMap*)+0x15e2) [0x55a4b5c41e22]
2022-05-22T05:09:25Z {} 4: (OSD::_dispatch(Message*)+0x18b) [0x55a4b5c64ccb]
2022-05-22T05:09:25Z {} 5: (OSD::ms_dispatch(Message*)+0x84) [0x55a4b5c65014]
2022-05-22T05:09:25Z {} 6: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0xb9) [0x55a4b670a719]
2022-05-22T05:09:25Z {} 7: (DispatchQueue::entry()+0x58f) [0x55a4b67093cf]
2022-05-22T05:09:25Z {} 8: (DispatchQueue::DispatchThread::entry()+0x11) [0x55a4b6522e11]
2022-05-22T05:09:25Z {} 9: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f9a764b4609]
2022-05-22T05:09:25Z {} 10: clone()
2022-05-22T05:09:25Z {} *** Caught signal (Aborted) **
2022-05-22T05:09:25Z {} in thread 7f9a642f1700 thread_name:ms_dispatch

Second, the contents of the data stored in the corrupted area look slightly different. (Since there are so many "bad crc" logs, I chose the area that was reported by the log line right before the first OSD crash.)

The actual data is as follows.

sudo dd if=/dev/crypt-disk/by-path/pci-0000:00:09.0 bs=4096 skip=3265442 count=1 status=none
00~10000,1:0x326210000~10000,1:0x3263c0000~20000,1:0x326540000~10000,1:0x326570000~10000,1:0x326790000~20000,1:0x326840000~20000,1:0x326aa0000~10000,1:0x327e20000~10000,1:0x327f60000~10000,1:0x327fb0000~20000,1:0x3283e0000~10000,1:0x328b90000~10000,1:0x3291f0000~10000,1:0x329230000~30000,1:0x329e60000~30000,1:0x329f00000~20000,1:0x32a050000~10000,1:0x32a9f0000~20000,1:0x32ac80000~10000,1:0x32ad80000~10000,1:0x32add0000~10000,1:0x32ae20000~20000,1:0x32ae50000~10000,1:0x32aeb0000~10000,1:0x32af40000~10000,1:0x32af90000~30000,1:0x32b230000~40000,1:0x32b3b0000~20000,1:0x32b540000~10000,1:0x32b650000~10000,1:0x32bb00000~10000,1:0x32bdb0000~10000,1:0x32bf60000~20000,1:0x32bfe0000~30000,1:0x32c1a0000~10000,1:0x32c2b0000~20000,1:0x32c370000~20000,1:0x32c530000~10000,1:0x32c780000~40000,1:0x32cf50000~30000,1:0x330ca0000~50000,1:0x330d20000~30000,1:0x330d80000~10000,1:0x330dc0000~40000,1:0x88110000~20000,1:0x330e00000~30000,1:0x331060000~20000,1:0x88220000~20000,1:0x331080000~60000,1:0x331600000~10000,1:0x3316f0000~20000,1:0x331750000~10000,1:0x331820000~20000,1:0x331880000~20000,1:0x3319b0000~20000,1:0x331ac0000~10000,1:0x332250000~20000,1:0x3328a0000~20000,1:0x332bc0000~20000,1:0x332bf0000~30000,1:0x332e90000~10000,1:0x333440000~10000,1:0x3334a0000~10000,1:0x333570000~10000,1:0x333590000~10000,1:0x3335b0000~10000,1:0x333620000~10000,1:0x3336d0000~10000,1:0x333870000~20000,1:0x3338d0000~10000,1:0x333940000~40000,1:0x333990000~20000,1:0x333a00000~10000,1:0x333b50000~20000,1:0x333d40000~20000,1:0x333f80000~10000,1:0x333fc0000~20000,1:0x3341e0000~20000,1:0x334210000~10000,1:0x3344b0000~10000,1:0x3346f0000~20000,1:0x3347e0000~10000,1:0x334a10000~10000,1:0x334a30000~10000,1:0x334bd0000~20000,1:0x334ea0000~30000,1:0x335a50000~10000,1:0x335b40000~30000,1:0x335be0000~10000,1:0x336220000~20000,1:0x336530000~10000,1:0x336580000~10000,1:0x3365d0000~60000,1:0x336660000~30000,1:0x336830000~20000,1:0x336900000~20000,1:0x336b70000~10000,1:0x336c90000~40000,1:0x336eb0000~20000,1:0x337a90000~10000,1:0x338470000~10000,1:0x3388d0000~10000,1:0x338ae0000~10000,1:0x339110000~10000,1:0x3397b0000~10000,1:0x33a460000~20000,1:0x33a5f0000~60000,1:0x33ac00000~20000,1:0x33ac80000~10000,1:0x888e0000~20000,1:0x33ac90000~1
2022-05-22T01:04:51.882378369Z stderr F 0000,1:0x33b2a0000~10000,1:0x88e60000~20000,1:0x33b3a0000~10000,1:0x33b3e0000~20000,1:0x33b4c0000~10000,1:0x33b8e0000~10000,1:0x88e80000~20000,1:0x33b970000~10000,1:0x33bc10000~30000,1:0x33be50000~20000,1:0x33c1d0000~10000,1:0x33c7d0000~50000,1:0x33d160000~10000,1:0x33d1f0000~20000,1:0x33d2b0000~10000,1:0x33d2f0000~10000,1:0x33d570000~10000,1:0x33d5d0000~20000,1:0x33d7a0000~10000,1:0x33d7c0000~10000,1:0x33d820000~20000,1:0x33d940000~20000,1:0x33dba0000~10000,1:0x33dc80000~30000,1:0x33de10000~10000,1:0x33e0b0000~10000,1:0x33e4c0000~20000,1:0x33e770000~10000,1:0x33eee0000~10000,1:0x33f1f0000~10000,1:0x33f790000~10000,1:0x33f8f0000~10000,1:0x33fd40000~10000,1:0x33ff30000~30000,1:0x3400c0000~30000,1:0x3402a0000~10000,1:0x3402f0000~10000,1:0x3405c0000~10000,1:0x340c00000~10000,1:0x340c40000~10000,1:0x341400000~10000,1:0x341800000~20000,1:0x341f70000~20000,1:0x341fc0000~10000,1:0x342440000~20000,1:0x342b80000~10000,1:0x342c30000~60000,1:0x343550000~10000,1:0x343670000~10000,1:0x3437f0000~10000,1:0x343870000~10000,1:0x343f60000~50000,1:0x344080000~10000,1:0x3442f0000~20000,1:0x344370000~20000,1:0x3446b0000~10000,1:0x344e20000~10000,1:0x344fd0000~30000,1:0x345360000~40000,1:0x345500000~20000,1:0x345910000~10000,1:0x345a40000~30000,1:0x345b90000~10000,1:0x345d50000~40000,1:0x345f90000~40000,1:0x3460d0000~10000,1:0x891d0000~20000,1:0x346340000~20000,1:0x3463d0000~20000,1:0x346470000~20000,1:0x346640000~10000,1:0x346720000~10000,1:0x346ba0000~10000,1:0x346d10000~20000,1:0x3473d0000~10000,1:0x347690000~10000,1:0x347850000~30000,1:0x347890000~10000,1:0x3478c0000~20000,1:0x348240000~30000,1:0x3484f0000~20000,1:0x348890000~10000,1:0x348f90000~40000,1:0x349050000~20000,1:0x3490b0000~20000,1:0x3494a0000~20000,1:0x349880000~20000,1:0x349960000~20000,1:0x34a180000~10000,1:0x34a7f0000~20000,1:0x34a84000

(This area contains the time stamp "2022-05-22T01:04:51.882378369Z".)
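
For reference, this skip value again matches the device location in the last _verify_csum message before the crash: 3265442 * 4096 = 0x31d3a2000, i.e. the corrupted extent [0x31d3a2000~1000]. The same arithmetic as before, using the device path from this comment (the strings/head filtering is only for readability):

    echo $((0x31d3a2000 / 4096))      # prints 3265442, the skip value used above
    sudo dd if=/dev/crypt-disk/by-path/pci-0000:00:09.0 bs=4096 skip=$((0x31d3a2000 / 4096)) count=1 status=none | strings | head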

I also tried running the dd command on the surrounding disk area. Although I could not find the BlueFS replay completion log in the subsequent area, I found OSD.7's own log in the preceding area, containing timestamps like "2022-05-22T01:04:51.881XXXXXX". From this, the corrupted area appears to have been written by OSD.7 itself.

The logs found in the preceding area:

2022-05-22T01:04:51.881287446Z stderr F debug 2022-05-22T01:04:51.880+0000 7f9a662f5700 20 bluestore(/var/lib/ceph/osd/ceph-7) _txc_apply_kv onode 0x55a50591b400 had 1
2022-05-22T01:04:51.881294961Z stderr F debug 2022-05-22T01:04:51.880+0000 7f9a662f5700 20 bluestore(/var/lib/ceph/osd/ceph-7) _txc_apply_kv onode 0x55a4c2a44500 had 1

I checked the Kubernetes (Rook) cluster state and found that a zombie OSD.7 pod was still present. Although that seems suspicious, the disk should be protected by a file-locking mechanism, so the bad crc is not expected behavior.

The log under the link refers to the OSD startup on Apr 09, while by a fresh one I mean the resulting log from one more new(!) OSD startup attempt.

I got it. I will try to reproduce the problem and collect the log.

The OSD startup logs right after the first crash were also collected. They start with the following line.

2022-05-22T05:09:26Z {filename="/var/log/pods/ceph-object-store_rook-ceph-osd-7-6447bcb8dc-pzlsp_6c8e77e0-7820-46bd-9e85-557af9f601ab/osd/1.log", pod="rook-ceph-osd-7-6447bcb8dc-pzlsp"}  debug 2022-05-22T05:09:26.350+0000 7f109bb76f00  0 set uid:gid to 64045:64045 (ceph:ceph)

Anyway, I will keep trying to reproduce the problem so that I can collect logs in a situation closer to the previous one.

Logs:

https://cybozu-my.sharepoint.com/:f:/g/personal/shayashi_cybozu_onmicrosoft_com/EkJWuxMCu35JmARPs3pmmZYBoo0Xqw-gwQAqGhhCZ-Z7qg?e=9jpAUJ

- OSD.7's logs are stored there. The logs are split into one file per day because the entire log is huge.
- The results of the dd commands are also stored. These files were collected with the following commands.

sudo dd if=/dev/crypt-disk/by-path/pci-0000:00:09.0 bs=4096 skip=3250000 count=16000 status=none > dd_before.bin
sudo dd if=/dev/crypt-disk/by-path/pci-0000:00:09.0 bs=4096 skip=3265442 count=16000 status=none > dd_after.bin

Actions #15

Updated by Shinya Hayashi almost 2 years ago

Hi Igor
I am still struggling with this issue, but unfortunately I cannot provide you with logs yet.
After my last report, I was dealing with changes to our testing environment and had to do some troubleshooting unrelated to this issue.
Now I can rerun the test, so I hope to share logs within several weeks.
Sorry for keeping you waiting.

Actions #16

Updated by Shinya Hayashi almost 2 years ago

After my last report, I was dealing with changes to our testing environment.

Unfortunately, since our test environment was changed, I no longer get checksum errors.
I will start running the same test scenario with a newer Ceph version (v16.2.10) in a few weeks, and will run the test script for about a month.
If the problem does not occur in that test, I will close this issue.

Actions #17

Updated by Shinya Hayashi over 1 year ago

Hi Igor

I will start running the same test scenario with a newer Ceph version (v16.2.10) in a few weeks, and will run the test script for about a month.
If the problem does not occur in that test, I will close this issue.

Sorry for the late announcement.
I ran the script for about a month, and the problem did not occur.
So I would like you to close this ticket.

Thank you for your cooperation on this issue.

Actions #18

Updated by Igor Fedotov over 1 year ago

  • Status changed from Need More Info to Closed