Bug #48389 (closed): _do_read bdev-read failed

Added by Seena Fallah over 3 years ago. Updated over 3 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: nautilus, octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I think it happens because of deep scrubbing, as I see the same issue here: https://tracker.ceph.com/issues/36455#note-11

2020-11-27 20:09:03.542 7f15478b8700 -1 bluestore(/var/lib/ceph/osd/ceph-195) _do_read bdev-read failed: (61) No data available
2020-11-27 20:09:03.554 7f15478b8700 -1 /build/ceph-14.2.14/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::_do_read(BlueStore::Collection*, BlueStore::OnodeRef, uint64_t, size_t, ceph::bufferlist&, uint32_t, uint64_t)' thread 7f15478b8700 time 2020-11-27 20:09:03.547900
/build/ceph-14.2.14/src/os/bluestore/BlueStore.cc: 9522: FAILED ceph_assert(r == 0)

 ceph version 14.2.14 (7e94c5afc28f3eaf36151ad1e1457de5f16c4fdf) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x560c9139ceea]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x560c9139d0c5]
 3: (BlueStore::_do_read(BlueStore::Collection*, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int, unsigned long)+0x2cec) [0x560c918e105c]
 4: (BlueStore::read(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::v14_2_0::list&, unsigned int)+0x1bb) [0x560c918e680b]
 5: (ReplicatedBackend::be_deep_scrub(hobject_t const&, ScrubMap&, ScrubMapBuilder&, ScrubMap::object&)+0x2d2) [0x560c9173eb02]
 6: (PGBackend::be_scan_list(ScrubMap&, ScrubMapBuilder&)+0x393) [0x560c91654dd3]
 7: (PG::build_scrub_map_chunk(ScrubMap&, ScrubMapBuilder&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x7b) [0x560c914e463b]
 8: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x161b) [0x560c9151452b]
 9: (PG::scrub(unsigned int, ThreadPool::TPHandle&)+0xaf) [0x560c9151593f]
 10: (PGScrub::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1a) [0x560c916da1ea]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbf5) [0x560c9143eb85]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x560c91a5869c]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x560c91a5b860]
 14: (()+0x76db) [0x7f1569dbc6db]
 15: (clone()+0x3f) [0x7f1568b5c71f]

Files

ceph-osd.195.log.tar.gz (335 KB), uploaded by Seena Fallah, 11/27/2020 10:57 PM
Actions #1

Updated by Igor Fedotov over 3 years ago

I think this is another form of https://tracker.ceph.com/issues/48276
The root cause is presumably much the same: an out-of-range offset for a disk op, just with a different appearance.

Could you please share the OSD log (or the last 20000 lines of it) prior to the crash?

Actions #2

Updated by Seena Fallah over 3 years ago

Thanks for your review. Here you go.

Actions #3

Updated by Igor Fedotov over 3 years ago

Thanks for sharing!
Unfortunately, the debug level for bdev is too low, so there isn't much useful info.

I'm wondering if you're able to reproduce such a crash by running a deep fsck on this OSD.
If so, please set debug-bdev to 10 for that run and collect the log.

Also, it makes sense to set debug-bdev to 1/5 (currently it's at 1/3) for the cluster, in an attempt to get a slightly more verbose log if the crash happens again.
Please also check the dmesg output for this host; there might be some additional info on the read failure there.
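
A rough sketch of how the debug-bdev change could be applied (the exact commands and the OSD id 195 below are assumptions for illustration, not taken from this thread):

# persist the more verbose bdev level for all OSDs via the monitor config database
ceph config set osd debug_bdev 1/5

# or inject it into a single running daemon without persisting it
ceph tell osd.195 injectargs '--debug-bdev 1/5'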

Actions #4

Updated by Seena Fallah over 3 years ago

You are right. It seems the disk itself has read errors; this occurred 3 times today, and I'm wondering why Ceph didn't segfault on one of them!

kernel: [244042.830977] sd 1:0:3:0: [sdd] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: [244042.831008] sd 1:0:3:0: [sdd] tag#0 Sense Key : Medium Error [current] [descriptor] 
kernel: [244042.831017] sd 1:0:3:0: [sdd] tag#0 Add. Sense: Unrecovered read error
kernel: [244042.831027] sd 1:0:3:0: [sdd] tag#0 CDB: Read(16) 88 00 00 00 00 02 aa 7c ea 80 00 00 04 00 00 00 
kernel: [244042.831036] print_req_error: critical medium error, dev sdd, sector 11450248760
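
As a general drive-level check (not something suggested in this thread), the SMART data for the device should confirm these medium errors; the device name below is simply the one from the dmesg output above:

# inspect the SMART error log and the reallocated/pending sector counters for sdd
smartctl -a /dev/sdd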

For the deep fsck do you mean running this command?

fsck -pvcf

For the log level you mentioned, I will do it, but isn't it set to 1/5 by default?

Actions #5

Updated by Igor Fedotov over 3 years ago

Seena Fallah wrote:

You are right. It seems the disk itself has read errors; this occurred 3 times today, and I'm wondering why Ceph didn't segfault on one of them!
[...]

For the deep fsck do you mean running this command?
[...]

No, I mean:
ceph-bluestore-tool --path <path-to-osd> --command fsck --deep 1

For the log level you mentioned, I will do it, but isn't it set to 1/5 by default?

According to your OSD log it's 1/3:

-- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
...
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
...
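
Putting this together, a minimal sketch of the deep fsck run with more verbose logging (the OSD path/id, the log destination, and the use of the tool's --log-file/--log-level options to get the extra bdev detail are assumptions for illustration):

# stop the OSD first so the tool has exclusive access to the store
systemctl stop ceph-osd@195

# deep fsck, writing a verbose log to a separate file
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-195 --command fsck --deep 1 \
    --log-file /var/log/ceph/bluestore-fsck-195.log --log-level 10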

Actions #6

Updated by Igor Fedotov over 3 years ago

  • Status changed from New to Triaged
Actions #7

Updated by Igor Fedotov over 3 years ago

Seena,
do you mind if this is closed as invalid?

Actions #8

Updated by Seena Fallah over 3 years ago

Igor Fedotov wrote:

Seena,
do you mind if this is closed as invalid?

I've changed my disk, and it seems the problem was due to a bad sector on that disk, so it makes sense to close this as invalid. +1
Thanks for your help :)

Actions #9

Updated by Igor Fedotov over 3 years ago

  • Status changed from Triaged to Rejected