Bug #48389 (closed): _do_read bdev-read failed

Added by Seena Fallah over 3 years ago. Updated over 3 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: nautilus, octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I think it happens because of deep scrubbing, as I see the same issue here: https://tracker.ceph.com/issues/36455#note-11

2020-11-27 20:09:03.542 7f15478b8700 -1 bluestore(/var/lib/ceph/osd/ceph-195) _do_read bdev-read failed: (61) No data available
2020-11-27 20:09:03.554 7f15478b8700 -1 /build/ceph-14.2.14/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::_do_read(BlueStore::Collection*, BlueStore::OnodeRef, uint64_t, size_t, ceph::bufferlist&, uint32_t, uint64_t)' thread 7f15478b8700 time 2020-11-27 20:09:03.547900
/build/ceph-14.2.14/src/os/bluestore/BlueStore.cc: 9522: FAILED ceph_assert(r == 0)

 ceph version 14.2.14 (7e94c5afc28f3eaf36151ad1e1457de5f16c4fdf) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x560c9139ceea]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x560c9139d0c5]
 3: (BlueStore::_do_read(BlueStore::Collection*, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int, unsigned long)+0x2cec) [0x560c918e105c]
 4: (BlueStore::read(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int)+0x1bb) [0x560c918e680b]
 5: (ReplicatedBackend::be_deep_scrub(hobject_t const&, ScrubMap&, ScrubMapBuilder&, ScrubMap::object&)+0x2d2) [0x560c9173eb02]
 6: (PGBackend::be_scan_list(ScrubMap&, ScrubMapBuilder&)+0x393) [0x560c91654dd3]
 7: (PG::build_scrub_map_chunk(ScrubMap&, ScrubMapBuilder&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x7b) [0x560c914e463b]
 8: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x161b) [0x560c9151452b]
 9: (PG::scrub(unsigned int, ThreadPool::TPHandle&)+0xaf) [0x560c9151593f]
 10: (PGScrub::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1a) [0x560c916da1ea]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbf5) [0x560c9143eb85]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x560c91a5869c]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560c91a5b860]
 14: (()+0x76db) [0x7f1569dbc6db]
 15: (clone()+0x3f) [0x7f1568b5c71f]

Files

ceph-osd.195.log.tar.gz (335 KB), uploaded by Seena Fallah, 11/27/2020 10:57 PM
Actions #1

Updated by Igor Fedotov over 3 years ago

I think this is another form of https://tracker.ceph.com/issues/48276
The root cause is presumably much the same: an out-of-range offset for a disk op, just with a different appearance.

Could you please share the OSD log (or the last 20000 lines of it) prior to the crash?

Actions #2

Updated by Seena Fallah over 3 years ago

Thanks for your review. Here you go.

Actions #3

Updated by Igor Fedotov over 3 years ago

Thanks for sharing!
Unfortunately, the debug level for bdev is too low, so there isn't much useful info.

I'm wondering if you're able to reproduce such a crash by running a deep fsck on this OSD.
If so, please set debug-bdev to 10 for that run and collect the log.

Also, it makes sense to set debug-bdev to 1/5 (currently it's at 1/3) for the cluster, in an attempt to get a slightly more verbose log if the crash happens again.
Please also check the dmesg output for this host; there might be some additional info on the read failure there.
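
A rough sketch of how the debug-bdev change could be applied (the exact commands and the OSD id 195 below are assumptions for illustration, not taken from this thread):

# persist the more verbose bdev level for all OSDs via the monitor config database
ceph config set osd debug_bdev 1/5

# or inject it into a single running daemon without persisting it
ceph tell osd.195 injectargs '--debug-bdev 1/5'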

Actions #4

Updated by Seena Fallah over 3 years ago

You are right. It seems the disk itself has read errors; this occurred 3 times today, and I'm wondering why Ceph didn't segfault on one of them!

kernel: [244042.830977] sd 1:0:3:0: [sdd] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: [244042.831008] sd 1:0:3:0: [sdd] tag#0 Sense Key : Medium Error [current] [descriptor] 
kernel: [244042.831017] sd 1:0:3:0: [sdd] tag#0 Add. Sense: Unrecovered read error
kernel: [244042.831027] sd 1:0:3:0: [sdd] tag#0 CDB: Read(16) 88 00 00 00 00 02 aa 7c ea 80 00 00 04 00 00 00 
kernel: [244042.831036] print_req_error: critical medium error, dev sdd, sector 11450248760
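
As a general drive-level check (not something suggested in this thread), the SMART data for the device should confirm these medium errors; the device name below is simply the one from the dmesg output above:

# inspect the SMART error log and the reallocated/pending sector counters for sdd
smartctl -a /dev/sdd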

For the deep fsck do you mean running this command?

fsck -pvcf

For the log level you mentioned, I will do it, but isn't it set to 1/5 by default?

Actions #5

Updated by Igor Fedotov over 3 years ago

Seena Fallah wrote:

You are right. It seems the disk itself has read errors; this occurred 3 times today, and I'm wondering why Ceph didn't segfault on one of them!
[...]

For the deep fsck do you mean running this command?
[...]

No, I mean:
ceph-bluestore-tool --path <path-to-osd> --command fsck --deep 1

For the log level you mentioned, I will do it, but isn't it set to 1/5 by default?

According to your OSD log it's 1/3:

-- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
...
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
...
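
Putting this together, a minimal sketch of the deep fsck run with more verbose logging (the OSD path/id, the log destination, and the use of the tool's --log-file/--log-level options to get the extra bdev detail are assumptions for illustration):

# stop the OSD first so the tool has exclusive access to the store
systemctl stop ceph-osd@195

# deep fsck, writing a verbose log to a separate file
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-195 --command fsck --deep 1 \
    --log-file /var/log/ceph/bluestore-fsck-195.log --log-level 10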

Actions #6

Updated by Igor Fedotov over 3 years ago

  • Status changed from New to Triaged
Actions #7

Updated by Igor Fedotov over 3 years ago

Seena,
do you mind if this is closed as invalid?

Actions #8

Updated by Seena Fallah over 3 years ago

Igor Fedotov wrote:

Seena,
do you mind if this is closed as invalid?

I've changed my disk, and it seems the problem was due to a bad sector on that disk, so it makes sense to close this as invalid. +1
Thanks for your help :)

Actions #9

Updated by Igor Fedotov over 3 years ago

  • Status changed from Triaged to Rejected