
Bug #41834

qa: EC Pool configuration and slow op warnings for OSDs caused by recent master changes

Added by Patrick Donnelly about 1 year ago. Updated 11 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(RADOS):
Pull request ID:
Crash signature:

Description

See: http://pulpito.ceph.com/pdonnell-2019-09-14_22:39:31-fs-master-distro-basic-smithi/

Recent run of fs suite on master without these failures: http://pulpito.ceph.com/pdonnell-2019-09-10_13:19:19-fs-master-distro-basic-smithi/

Must be one of the changes in this range: e1e962d28a913ac69ca183e7009e39efd2e072f2..faac21f54fc941cc3c2b306e0202728630559835


Related issues

Related to fs - Bug #42516: mds: some mutations have initiated (TrackedOp) set to 0 New

History

#1 Updated by Greg Farnum about 1 year ago

  • Project changed from Ceph to RADOS

#2 Updated by Casey Bodley about 1 year ago

This also shows up in rgw's ec pool tests. In OSD logs, I see slow ops on MOSDECSubOpRead/Reply messages, and they always show an initiated timestamp of 0.000000:

2019-09-18T17:02:45.915+0000 1693b700 20 slow op MOSDECSubOpReadReply(2.14s0 133/23 ECSubReadReply(tid=4368, attrs_read=0)) initiated 0.000000
2019-09-18T17:02:45.915+0000 1693b700 -1 osd.6 133 get_health_metrics reporting 1 slow ops, oldest is MOSDECSubOpReadReply(2.14s0 133/23 ECSubReadReply(tid=4368, attrs_read=0))
2019-09-18T17:02:51.988+0000 1693b700 20 slow op MOSDECSubOpReadReply(2.24s0 133/23 ECSubReadReply(tid=4376, attrs_read=0)) initiated 0.000000
2019-09-18T17:02:51.988+0000 1693b700 -1 osd.6 133 get_health_metrics reporting 1 slow ops, oldest is MOSDECSubOpReadReply(2.24s0 133/23 ECSubReadReply(tid=4376, attrs_read=0))
2019-09-18T17:09:42.971+0000 1693b700 20 slow op MOSDECSubOpReadReply(2.1as0 133/23 ECSubReadReply(tid=5254, attrs_read=0)) initiated 0.000000

So it doesn't look like they're actually slow ops, but the `throttle_stamp` isn't being initialized for these specific messages. I wasn't able to identify any recent changes that would cause that.
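For illustration, a minimal standalone sketch of why an uninitialized stamp shows up as a slow op (this is not the actual Ceph code; the 30s threshold is just an analogue of the OSD's complaint time): the op's age is computed against the epoch, so it always exceeds the threshold.

    // Illustrative sketch: an "initiated" time of 0 (the Unix epoch) makes
    // the op's age roughly 50 years, so it is always reported as slow.
    #include <chrono>
    #include <iostream>

    int main() {
      using clock = std::chrono::system_clock;

      // Hypothetical slow-op threshold, analogous to a 30s complaint time.
      const std::chrono::seconds complaint_time{30};

      // A message whose throttle_stamp was never set ends up with an
      // initiated time of 0 (the epoch).
      clock::time_point initiated{};            // 1970-01-01, i.e. 0.000000
      auto age = clock::now() - initiated;

      if (age > complaint_time) {
        std::cout << "reported as a slow op, age = "
                  << std::chrono::duration_cast<std::chrono::seconds>(age).count()
                  << "s\n";
      }
      return 0;
    }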

see also: http://pulpito.ceph.com/cbodley-2019-09-18_14:51:41-rgw-wip-cbodley-testing-distro-basic-smithi/

#4 Updated by Patrick Donnelly 12 months ago

  • Assignee set to Rishabh Dave

#5 Updated by Rishabh Dave 11 months ago

  • Assignee deleted (Rishabh Dave)

#6 Updated by Brad Hubbard 11 months ago

  • Assignee set to Brad Hubbard

#7 Updated by Brad Hubbard 11 months ago

Just an update.

I'm trying to zero in on this by adding an assert that triggers when a TrackedOp is created with initiated.is_zero(). Unfortunately, I've caught some other situations where TrackedOps are created this way, and I've also had problems with teuthology: it doesn't seem to recognise these asserts (triggered in the mds so far) as actual failures, so the jobs run for 12 hours before they're marked dead :P I'm running another set of tests with additional patches to work around the cases I don't want to catch; hopefully this run will shed more light on the issue.
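For reference, a minimal standalone sketch of that assert-based approach (illustrative only; the Message and TrackedOp types below are stand-ins, not the real Ceph classes): abort as soon as an op is created from a message whose throttle stamp was never set, so the first offending message produces a core dump instead of a slow-op warning much later.

    // Illustrative sketch of asserting on op creation when the message's
    // throttle stamp is still zero.
    #include <cassert>
    #include <ctime>

    struct Message {                 // stand-in for the real Message class
      timespec throttle_stamp{0, 0}; // never set on some code paths
    };

    struct TrackedOp {               // stand-in for the real TrackedOp
      timespec initiated;
      explicit TrackedOp(const Message* m) : initiated(m->throttle_stamp) {
        // Fires as soon as an op is created from a message whose
        // throttle_stamp was never initialized.
        assert(!(initiated.tv_sec == 0 && initiated.tv_nsec == 0) &&
               "TrackedOp created with initiated == 0");
      }
    };

    int main() {
      Message m;          // throttle_stamp left at zero, as in the bug
      TrackedOp op(&m);   // the assert aborts here, producing a core dump
      return 0;
    }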

#8 Updated by Brad Hubbard 11 months ago

Current candidate is this stack. I'll continue with this in the morning, as I need to set up an environment to analyse the core dump.

    -1> 2019-10-23T07:13:22.489+0000 7f58bc0f7700 20 bluestore.MempoolThread(0x55af00eeaa08) _resize_shards cache_size: 2845415832 kv_alloc: 1073741824 kv_used: 1824 meta_alloc: 1073741824 meta_used: 270749 data_alloc: 671088640 data_used: 0
     0> 2019-10-23T07:13:22.493+0000 7f58b40e7700 -1 *** Caught signal (Aborted) **
 in thread 7f58b40e7700 thread_name:ms_local

 ceph version 15.0.0-5664-g15693f6 (15693f6a8661af547dbec60a56e7516f8f54402e) octopus (dev)
 1: (()+0xf5d0) [0x7f58cb3545d0]
 2: (gsignal()+0x37) [0x7f58ca14b2c7]
 3: (abort()+0x148) [0x7f58ca14c9b8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x55aef55a83b8]
 5: (()+0x4ea531) [0x55aef55a8531]
 6: (OpRequest::OpRequest(Message*, OpTracker*)+0x156) [0x55aef5ee0346]
 7: (OpRequest::Ref OpTracker::create_request<OpRequest, Message*>(Message*)+0x42) [0x55aef56e7712]
 8: (OSD::ms_fast_dispatch(Message*)+0x2e0) [0x55aef56adee0]
 9: (DispatchQueue::fast_dispatch(boost::intrusive_ptr<Message> const&)+0x155) [0x55aef5f6bcd5]
 10: (DispatchQueue::run_local_delivery()+0x150) [0x55aef5f6beb0]
 11: (DispatchQueue::LocalDeliveryThread::entry()+0xd) [0x55aef5dd762d]
 12: (()+0x7dd5) [0x7f58cb34cdd5]
 13: (clone()+0x6d) [0x7f58ca21302d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#9 Updated by Brad Hubbard 11 months ago

throttle_stamp is indeed zero.

(gdb) f
#11 0x000055aef56adee0 in OSD::ms_fast_dispatch(Message*) () at /usr/src/debug/ceph-15.0.0-5664-g15693f6/src/osd/OSD.cc:7298
7298      OpRequestRef op = op_tracker.create_request<OpRequest, Message*>(m);
(gdb) p m->throttle_stamp
$7 = {
  tv = {
    tv_sec = 0, 
    tv_nsec = 0
  }
}

And this is an MOSDECSubOpRead, so this is the one I've been trying to catch.

(gdb) p *m                      
$9 = {            
  <RefCountedObject> = {
    _vptr.RefCountedObject = 0x55aef680a0d8 <vtable for MOSDECSubOpRead+16>, 

I'm wondering if this has something to do with the LocalDeliveryThread?

I'm going to test a patch that sets the throttle stamp in DispatchQueue::local_delivery. I guess this may happen in any test where the MOSDECSubOpReads are using a loopback connection?
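A rough standalone sketch of that idea (illustrative only, not the actual patch; the types below are stand-ins for the real Ceph classes): stamp messages on the local-delivery path, since loopback messages never pass through the network reader/throttler that normally sets these stamps.

    // Illustrative analogue of stamping messages in local delivery so that
    // downstream slow-op accounting sees a sane "initiated" time.
    #include <chrono>

    struct Message {                           // stand-in for the real Message
      std::chrono::system_clock::time_point recv_stamp{};
      std::chrono::system_clock::time_point throttle_stamp{};
    };

    // Analogue of DispatchQueue::local_delivery(): give locally delivered
    // messages the timestamps a network-received message would have.
    void local_delivery(Message* m) {
      auto now = std::chrono::system_clock::now();
      if (m->recv_stamp.time_since_epoch().count() == 0)
        m->recv_stamp = now;
      if (m->throttle_stamp.time_since_epoch().count() == 0)
        m->throttle_stamp = now;
      // ... then queue m for fast dispatch as before ...
    }

    int main() {
      Message m;
      local_delivery(&m);   // m now carries non-zero stamps
      return 0;
    }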

#10 Updated by Neha Ojha 11 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 31137

#11 Updated by Patrick Donnelly 11 months ago

  • Status changed from Fix Under Review to Resolved

#12 Updated by Patrick Donnelly about 1 month ago

  • Related to Bug #42516: mds: some mutations have initiated (TrackedOp) set to 0 added
