Bug #21861

osdc: truncating an Object and removing a bh that has readers waiting on it triggers an assert failure.

Added by Ivan Guan about 1 year ago. Updated about 2 months ago.

Status: New
Priority: High
Assignee:
Category: Correctness/Safety
Target version:
Start date: 10/20/2017
Due date:
% Done: 0%
Source: Community (dev)
Tags: crash
Backport: luminous,mimic
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(FS): Client, osdc
Labels (FS):
Pull request ID:

Description

ceph version: jewel 10.2.2

When a single OSD is filled past the full_ratio (default 0.95), the cluster enters the full state.
As a result, the client purges every oset that has dirty_or_tx data. If a read op comes in, creates a new
bh in an object, and sends a read op to the OSD, and we happen to purge that same object while the read
is still in flight, an assert fails.
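The ordering described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `SketchBh`, `start_read`, `finish_read`, and `purge_would_assert` are hypothetical names invented for illustration and are not part of Ceph's ObjectCacher. Only the checked condition, `waitfor_read.empty()`, is taken from the failed assert in this report.

```cpp
#include <cstdint>
#include <set>

// Toy model of one cached buffer head (bh); NOT Ceph's real BufferHead.
struct SketchBh {
  bool read_in_flight = false;      // read op sent to the OSD, no reply yet
  std::set<uint64_t> waitfor_read;  // ids of readers blocked on this bh
};

// A read misses the cache: a bh is created, a waiter is registered,
// and the read op is sent to the OSD.
inline void start_read(SketchBh& bh, uint64_t reader_id) {
  bh.read_in_flight = true;
  bh.waitfor_read.insert(reader_id);
}

// The OSD reply arrives: the bh is filled and every waiter is woken.
inline void finish_read(SketchBh& bh) {
  bh.read_in_flight = false;
  bh.waitfor_read.clear();
}

// Mirrors the condition behind assert(bh->waitfor_read.empty()):
// a purge that runs while a reader is still waiting would trip it.
inline bool purge_would_assert(const SketchBh& bh) {
  return !bh.waitfor_read.empty();
}
```

In this model, a purge that runs between `start_read` and `finish_read` is the problematic window; before the read starts or after the reply arrives, the condition holds.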

// 16:55:20.32: read op comes in

2017-10-13 16:55:20.324566 7ff400ff9700  3 client.1444221 ll_read 0x7ff40c060920 1000001c635  0~131072
2017-10-13 16:55:20.324569 7ff400ff9700 10 client.1444221 get_caps 1000001c635.head(faked_ino=0 ref=8 ll_ref=5 cap_refs={1024=1,4096=0,8192=1} open={1=1,2=0} mode=100666 size=815191/4194304 mtime=2017-10-13 16:55:20.232085 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=Fw objectset[1000001c635 ts 0/0 objects 1 dirty_or_tx 143801] parents=0x7ff40c0b4aa0 0x7ff40c3b2a90) have pAsxLsXsxFsxcrwb need Fr want Fc but not Fc revoking -
2017-10-13 16:55:20.324580 7ff400ff9700 10 client.1444221 _read_async 1000001c635.head(faked_ino=0 ref=8 ll_ref=5 cap_refs={1024=1,2048=1,4096=0,8192=1} open={1=1,2=0} mode=100666 size=815191/4194304 mtime=2017-10-13 16:55:20.232085 caps=pAsxLsXsxFsxcrwb(0=pAsxLsXsxFsxcrwb) flushing_caps=Fw objectset[1000001c635 ts 0/0 objects 1 dirty_or_tx 143801] parents=0x7ff40c0b4aa0 0x7ff40c3b2a90) 0~131072
2017-10-13 16:55:20.324585 7ff400ff9700 10 client.1444221  min_bytes=131072 max_bytes=16777216 max_periods=4
2017-10-13 16:55:20.324620 7ff400ff9700  1 -- 192.168.12.216:0/3579173032 --> 192.168.12.216:6803/22142 -- osd_op(client.1444221.0:2102 452.9a6db107 1000001c635.00000000 [read 0~131072] snapc 0=[] ack+read+known_if_redirected e6766) v7 -- ?+0 0x7ff404019820 con 0x7ff40c04bef0  // send the op to osd

// 16:55:25: purge object 1000001c635.head
2017-10-13 16:55:25.214192 7ff420ff9700  4 client.1444221 _handle_full_flag: FULL: inode 0x1000001c635.head has dirty objects, purging and setting ENOSPC
2017-10-13 16:55:25.785130 7ff420ff9700 -1 osdc/ObjectCacher.cc: In function 'void ObjectCacher::Object::truncate(loff_t)' thread 7ff420ff9700 time 2017-10-13 16:55:25.214197
osdc/ObjectCacher.cc: 491: FAILED assert(bh->waitfor_read.empty())
ceph version attr-v1-file-share-op-code-28-g1411918 (1411918ff2a8358567e32bad78f22ee4974c5975)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7ff42e6ceeb5]
2: (ObjectCacher::Object::truncate(long)+0x266) [0x7ff42e57de56]
3: (ObjectCacher::purge(ObjectCacher::Object*)+0x7a) [0x7ff42e57df2a]
4: (ObjectCacher::purge_set(ObjectCacher::ObjectSet*)+0xab) [0x7ff42e57e16b]
5: (Client::_handle_full_flag(long)+0x1d0) [0x7ff42e4d0540]
6: (Client::handle_osd_map(MOSDMap*)+0x17f) [0x7ff42e4d600f]
7: (Client::ms_dispatch(Message*)+0x5e3) [0x7ff42e540a73]
8: (DispatchQueue::entry()+0x78a) [0x7ff42e7dd1aa]
9: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff42e6b386d]
10: (()+0x7e25) [0x7ff42d337e25]
11: (clone()+0x6d) [0x7ff42c21f34d]

Even if the read op comes back and fills the bh, it cannot affect the cluster: the bh is clean, so it will never be flushed to the OSD.
So I think we can let this bh through when someone is waiting on it, instead of triggering an assert failure. What do you think
about it?
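One possible reading of this proposal can be sketched as follows. This is a toy model, not a patch: `ToyBufferHead`, `ToyObject`, and `truncate_tolerant` are hypothetical names invented here, the real ObjectCacher::Object::truncate handles more cases (such as trimming a bh that straddles the new size), and whether skipping the bh is actually safe in Ceph is exactly the open question of this report.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-ins for the cacher's buffer head and object.
// All names and fields are simplified assumptions, NOT Ceph's real types.
struct ToyBufferHead {
  uint64_t start = 0;
  uint64_t length = 0;
  // Callbacks of readers blocked on this range (models bh->waitfor_read).
  std::vector<std::function<void(int)>> waitfor_read;
};

struct ToyObject {
  // Buffer heads keyed by their start offset.
  std::map<uint64_t, std::unique_ptr<ToyBufferHead>> data;

  // Drop every bh that starts at or beyond new_size. A bh that still has
  // pending readers is kept instead of tripping an assert.
  // Returns the number of buffer heads that were kept for that reason.
  std::size_t truncate_tolerant(uint64_t new_size) {
    std::size_t kept = 0;
    for (auto it = data.lower_bound(new_size); it != data.end();) {
      if (!it->second->waitfor_read.empty()) {
        ++kept;  // a read is in flight: leave the bh for the reply to fill
        ++it;
      } else {
        it = data.erase(it);  // nobody is waiting: safe to drop
      }
    }
    return kept;
  }
};
```

An alternative design would complete the waiters with an error before removing the bh; the sketch only shows the "skip it" variant.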

History

#1 Updated by Greg Farnum 10 months ago

  • Project changed from Ceph to fs
  • Category deleted (Objecter)
  • Assignee set to Zheng Yan

Zheng, is this what your recent master PR for ObjectCacher fixed?

#2 Updated by Patrick Donnelly 8 months ago

  • Subject changed from ceph-fuse truncate Object and remove the bh which have someone wait for read on it occur a assert fail. to osdc: truncating an Object and removing a bh that has readers waiting on it triggers an assert failure.
  • Description updated (diff)
  • Category set to Correctness/Safety
  • Priority changed from Normal to High
  • Target version changed from v12.2.2 to v14.0.0
  • Tags set to crash
  • Backport set to luminous,mimic
  • Component(FS) Client, osdc added

#3 Updated by Zheng Yan about 2 months ago

I think this bug still exists in master.
