Project

General

Profile

Actions

Bug #41834

closed

qa: EC Pool configuration and slow op warnings for OSDs caused by recent master changes

Added by Patrick Donnelly over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

See: http://pulpito.ceph.com/pdonnell-2019-09-14_22:39:31-fs-master-distro-basic-smithi/

Recent run of fs suite on master without these failures: http://pulpito.ceph.com/pdonnell-2019-09-10_13:19:19-fs-master-distro-basic-smithi/

Must be one of the changes in this range: e1e962d28a913ac69ca183e7009e39efd2e072f2..faac21f54fc941cc3c2b306e0202728630559835


Related issues 1 (0 open1 closed)

Related to CephFS - Bug #42516: mds: some mutations have initiated (TrackedOp) set to 0ResolvedRamana Raja

Actions
Actions #1

Updated by Greg Farnum over 4 years ago

  • Project changed from Ceph to RADOS
Actions #2

Updated by Casey Bodley over 4 years ago

This shows up in rgw's ec pool tests also. In osd logs, I see slow ops on MOSDECSubOpRead/Reply messages, and they always show a timestamp of 0.000000:

2019-09-18T17:02:45.915+0000 1693b700 20 slow op MOSDECSubOpReadReply(2.14s0 133/23 ECSubReadReply(tid=4368, attrs_read=0)) initiated 0.000000
2019-09-18T17:02:45.915+0000 1693b700 -1 osd.6 133 get_health_metrics reporting 1 slow ops, oldest is MOSDECSubOpReadReply(2.14s0 133/23 ECSubReadReply(tid=4368, attrs_read=0))
2019-09-18T17:02:51.988+0000 1693b700 20 slow op MOSDECSubOpReadReply(2.24s0 133/23 ECSubReadReply(tid=4376, attrs_read=0)) initiated 0.000000
2019-09-18T17:02:51.988+0000 1693b700 -1 osd.6 133 get_health_metrics reporting 1 slow ops, oldest is MOSDECSubOpReadReply(2.24s0 133/23 ECSubReadReply(tid=4376, attrs_read=0))
2019-09-18T17:09:42.971+0000 1693b700 20 slow op MOSDECSubOpReadReply(2.1as0 133/23 ECSubReadReply(tid=5254, attrs_read=0)) initiated 0.000000

So it doesn't look like they're actually slow ops, but the `throttle_stamp` isn't being initialized for these specific messages. I wasn't able to identify any recent changes that would cause that.

see also: http://pulpito.ceph.com/cbodley-2019-09-18_14:51:41-rgw-wip-cbodley-testing-distro-basic-smithi/

Actions #3

Updated by Jason Dillaman over 4 years ago

Actions #4

Updated by Patrick Donnelly over 4 years ago

  • Assignee set to Rishabh Dave
Actions #5

Updated by Rishabh Dave over 4 years ago

  • Assignee deleted (Rishabh Dave)
Actions #6

Updated by Brad Hubbard over 4 years ago

  • Assignee set to Brad Hubbard
Actions #7

Updated by Brad Hubbard over 4 years ago

Just an update.

I'm trying to zero in on this by adding an assert that triggers when a TrackedOp with initiated.is_zero() is created. Unfortunately, I've caught some other situations where TrackedOps are created in this fashion and also had problems with teuthology since it doesn't seem to recognise these asserts (triggered in the mds up to now) as actual failures and therefore run for 12 hours before it marks the job dead :P I'm running another set of tests with additional patches to work around the issues I don't want to catch and hopefully this run will shed more light on the issue.

Actions #8

Updated by Brad Hubbard over 4 years ago

Current candidate is this stack. I'll continue with this in the morning as I need to set up an environment to analyse the coredump.

    -1> 2019-10-23T07:13:22.489+0000 7f58bc0f7700 20 bluestore.MempoolThread(0x55af00eeaa08) _resize_shards cache_size: 2845415832 kv_alloc: 1073741824 kv_used: 1824 meta_alloc: 1073741824 meta_used: 270749 data_alloc: 671088640 data_used: 0
     0> 2019-10-23T07:13:22.493+0000 7f58b40e7700 -1 *** Caught signal (Aborted) **
 in thread 7f58b40e7700 thread_name:ms_local

 ceph version 15.0.0-5664-g15693f6 (15693f6a8661af547dbec60a56e7516f8f54402e) octopus (dev)
 1: (()+0xf5d0) [0x7f58cb3545d0]
 2: (gsignal()+0x37) [0x7f58ca14b2c7]
 3: (abort()+0x148) [0x7f58ca14c9b8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x55aef55a83b8]
 5: (()+0x4ea531) [0x55aef55a8531]
 6: (OpRequest::OpRequest(Message*, OpTracker*)+0x156) [0x55aef5ee0346]
 7: (OpRequest::Ref OpTracker::create_request<OpRequest, Message*>(Message*)+0x42) [0x55aef56e7712]
 8: (OSD::ms_fast_dispatch(Message*)+0x2e0) [0x55aef56adee0]
 9: (DispatchQueue::fast_dispatch(boost::intrusive_ptr<Message> const&)+0x155) [0x55aef5f6bcd5]
 10: (DispatchQueue::run_local_delivery()+0x150) [0x55aef5f6beb0]
 11: (DispatchQueue::LocalDeliveryThread::entry()+0xd) [0x55aef5dd762d]
 12: (()+0x7dd5) [0x7f58cb34cdd5]
 13: (clone()+0x6d) [0x7f58ca21302d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #9

Updated by Brad Hubbard over 4 years ago

throttle_stamp is indeed zero.

(gdb) f
#11 0x000055aef56adee0 in OSD::ms_fast_dispatch(Message*) () at /usr/src/debug/ceph-15.0.0-5664-g15693f6/src/osd/OSD.cc:7298
7298      OpRequestRef op = op_tracker.create_request<OpRequest, Message*>(m);
(gdb) p m->throttle_stamp
$7 = {
  tv = {
    tv_sec = 0, 
    tv_nsec = 0
  }
}

and this is an MOSDECSubOpRead so this is the one I've been trying to catch.

(gdb) p *m                      
$9 = {            
  <RefCountedObject> = {
    _vptr.RefCountedObject = 0x55aef680a0d8 <vtable for MOSDECSubOpRead+16>, 

I'm wondering if this has something to do with the LocalDeliveryThread?

I'm going to test a patch that sets the throttle stamp in DispatchQueue::local_delivery. I guess this may happen in any test where the MOSDECSubOpReads are using a loopback connection?

Actions #10

Updated by Neha Ojha over 4 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 31137
Actions #11

Updated by Patrick Donnelly over 4 years ago

  • Status changed from Fix Under Review to Resolved
Actions #12

Updated by Patrick Donnelly over 3 years ago

  • Related to Bug #42516: mds: some mutations have initiated (TrackedOp) set to 0 added
Actions

Also available in: Atom PDF