Bug #21861
osdc: truncating an Object and removing a bh that has a reader waiting on it triggers an assert failure
Status:
New
Priority:
High
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:
0%
Source:
Community (dev)
Tags:
crash
Backport:
luminous,mimic
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
Client, osdc
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
ceph version: jewel 10.2.2
When any OSD's utilization exceeds full_ratio (default 0.95), the cluster enters the full state.
As a result the client purges every oset that is dirty_or_tx. But if a read op comes in, allocates
a new bh in an object, and sends the read op to the OSD while we happen to be purging that same
object, an assert failure results.
//16.55.20.32 read op come in
2017-10-13 16:55:20.324566 7ff400ff9700 3 client.1444221 ll_read 0x7ff40c060920 1000001c635 0~131072
2017-10-13 16:55:20.324569 7ff400ff9700 10 client.1444221 get_caps 1000001c635.head(faked_ino=0 ref=8 ll_ref=5 cap_refs={1024=1,4096=0,8192=1} open={1=1,2=0} mode=100666 size=815191/4194304 mtime=2017-10-13 16:55:20.232085 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=Fw objectset[1000001c635 ts 0/0 objects 1 dirty_or_tx 143801] parents=0x7ff40c0b4aa0 0x7ff40c3b2a90) have pAsxLsXsxFsxcrwb need Fr want Fc but not Fc revoking -
2017-10-13 16:55:20.324580 7ff400ff9700 10 client.1444221 *_read_async* 1000001c635.head(faked_ino=0 ref=8 ll_ref=5 cap_refs={1024=1,2048=1,4096=0,8192=1} open={1=1,2=0} mode=100666 size=815191/4194304 mtime=2017-10-13 16:55:20.232085 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=Fw objectset[1000001c635 ts 0/0 objects 1 dirty_or_tx 143801] parents=0x7ff40c0b4aa0 0x7ff40c3b2a90) 0~131072
2017-10-13 16:55:20.324585 7ff400ff9700 10 client.1444221 min_bytes=131072 max_bytes=16777216 max_periods=4
2017-10-13 16:55:20.324620 7ff400ff9700 1 -- 192.168.12.216:0/3579173032 --> 192.168.12.216:6803/22142 -- osd_op(client.1444221.0:2102 452.9a6db107 1000001c635.00000000 [read 0~131072] snapc 0=[] ack+read+known_if_redirected e6766) v7 -- ?+0 0x7ff404019820 con 0x7ff40c04bef0 //*send the op to osd*
//16.55.25 purge object 1000001c635.head
2017-10-13 16:55:25.214192 7ff420ff9700 4 client.1444221 *_handle_full_flag:* FULL: inode 0x1000001c635.head has dirty objects, purging and setting ENOSPC
2017-10-13 16:55:25.785130 7ff420ff9700 -1 osdc/ObjectCacher.cc: In function 'void ObjectCacher::Object::truncate(loff_t)' thread 7ff420ff9700 time 2017-10-13 16:55:25.214197
osdc/ObjectCacher.cc: 491: *FAILED assert(bh->waitfor_read.empty())*
ceph version attr-v1-file-share-op-code-28-g1411918 (1411918ff2a8358567e32bad78f22ee4974c5975)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7ff42e6ceeb5]
2: (ObjectCacher::Object::truncate(long)+0x266) [0x7ff42e57de56]
3: (ObjectCacher::purge(ObjectCacher::Object*)+0x7a) [0x7ff42e57df2a]
4: (ObjectCacher::purge_set(ObjectCacher::ObjectSet*)+0xab) [0x7ff42e57e16b]
5: (Client::_handle_full_flag(long)+0x1d0) [0x7ff42e4d0540]
6: (Client::handle_osd_map(MOSDMap*)+0x17f) [0x7ff42e4d600f]
7: (Client::ms_dispatch(Message*)+0x5e3) [0x7ff42e540a73]
8: (DispatchQueue::entry()+0x78a) [0x7ff42e7dd1aa]
9: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff42e6b386d]
10: (()+0x7e25) [0x7ff42d337e25]
11: (clone()+0x6d) [0x7ff42c21f34d]
Even when the read op comes back and fills the bh, it cannot affect the cluster: the bh is clean, so it will never be flushed to the OSD.
So I think we can let go of this bh if someone is waiting for it, instead of producing an assert failure. What do you think
about it?
History
#1 Updated by Greg Farnum over 5 years ago
- Project changed from Ceph to CephFS
- Category deleted (Objecter)
- Assignee set to Zheng Yan
Zheng, is this what your recent master PR for ObjectCacher fixed?
#2 Updated by Patrick Donnelly about 5 years ago
- Subject changed from ceph-fuse truncate Object and remove the bh which have someone wait for read on it occur a assert fail. to osdc: truncate Object and remove the bh which have someone wait for read on it occur a assert fail.
- Description updated (diff)
- Category set to Correctness/Safety
- Priority changed from Normal to High
- Target version changed from v12.2.2 to v14.0.0
- Tags set to crash
- Backport set to luminous,mimic
- Component(FS) Client, osdc added
#3 Updated by Zheng Yan over 4 years ago
I think this bug still exists in master
#4 Updated by Patrick Donnelly about 4 years ago
- Target version changed from v14.0.0 to v15.0.0
#5 Updated by Patrick Donnelly over 3 years ago
- Assignee deleted (Zheng Yan)
- Target version deleted (v15.0.0)