Bug #38358

short pg log + cache tier ceph_test_rados out of order reply

Added by Sage Weil about 1 year ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

The combination of

- 1-pg-log-overrides/short_pg_log.yaml

and

- workloads/cache-agent-small.yaml

and any msgr failure injection results in a ceph_test_rados crash like:

2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stdout:3323:  finishing write tid 3 to smithi13913891-294
2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stdout:3323:  finishing write tid 2 to smithi13913891-294
2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stderr:Error: finished tid 2 when last_acked_tid was 3
2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stderr:/build/ceph-14.0.1-3796-g597cd08/src/test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7fdcb4ff9700 time 2019-02-16 12:48:16.152554
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr:/build/ceph-14.0.1-3796-g597cd08/src/test/osd/RadosModel.h: 905: abort()
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: ceph version 14.0.1-3796-g597cd08 (597cd0800d5525c39d588f536bfb01afed545bdb) nautilus (dev)
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x7fdccd2799b7]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 2: (WriteOp::_finish(TestOp::CallbackInfo*)+0x5eb) [0x55d3145cacfb]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 3: (write_callback(void*, void*)+0x19) [0x55d3145e6899]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 4: (()+0x537d6) [0x7fdcd5ea57d6]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 5: (Context::complete(int)+0x9) [0x7fdcd5e89739]
2019-02-16T12:48:16.154 INFO:tasks.rados.rados.0.smithi139.stderr: 6: (Finisher::finisher_thread_entry()+0x16e) [0x7fdccd2be79e]
2019-02-16T12:48:16.154 INFO:tasks.rados.rados.0.smithi139.stderr: 7: (()+0x76db) [0x7fdcccdf86db]
2019-02-16T12:48:16.154 INFO:tasks.rados.rados.0.smithi139.stderr: 8: (clone()+0x3f) [0x7fdccc57b88f]

/a/kchai-2019-02-16_11:36:29-rados-wip-sage-testing-2019-02-16-1748-distro-basic-smithi/3601272
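
The error line in the log ("finished tid 2 when last_acked_tid was 3") suggests a per-object ordering check on write completions. The following is a minimal sketch of such a check, not the code from RadosModel.h; the names WriteTracker, OutOfOrderReply and replay are hypothetical and exist only for illustration.

```python
class OutOfOrderReply(Exception):
    """Raised when a completion arrives with a tid lower than one already acked."""


class WriteTracker:
    """Hypothetical stand-in for the ordering check; not the real test code."""

    def __init__(self):
        self.last_acked_tid = 0

    def finish(self, tid):
        # A completion for an older tid after a newer one means the
        # replies came back out of order.
        if tid < self.last_acked_tid:
            raise OutOfOrderReply(
                f"finished tid {tid} when last_acked_tid was {self.last_acked_tid}"
            )
        self.last_acked_tid = tid


def replay(tids):
    """Feed completions in the given order; return the error text or None."""
    tracker = WriteTracker()
    try:
        for tid in tids:
            tracker.finish(tid)
    except OutOfOrderReply as exc:
        return str(exc)
    return None
```

Calling replay([3, 2]) mirrors the completion order in the log above (tid 3 finishes, then tid 2), while replay([2, 3]) is the in-order case.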

The short pg log in the base tier means that reqids aren't reliably propagated back to the cache tier, which breaks reply ordering when client ops are resent.
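
As a rough illustration of why a short log matters, here is a toy model of a bounded log of request ids. It is a sketch under assumptions, not Ceph's PG log implementation: ReqidLog, resend_detected, and the "client.1:N" id format are all hypothetical. The point is only that once an entry has been trimmed, a resent op can no longer be matched against it.

```python
from collections import deque


class ReqidLog:
    """Toy bounded log of request ids; the oldest entries are trimmed first."""

    def __init__(self, max_entries):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.entries = deque(maxlen=max_entries)

    def record(self, reqid):
        self.entries.append(reqid)

    def is_dup(self, reqid):
        return reqid in self.entries


def resend_detected(max_entries, completed, resent):
    """Return True if a resent reqid is still recognised as a duplicate."""
    log = ReqidLog(max_entries)
    for reqid in completed:
        log.record(reqid)
    return log.is_dup(resent)


# Hypothetical request ids, for illustration only.
completed = ["client.1:1", "client.1:2", "client.1:3", "client.1:4"]
long_log = resend_detected(100, completed, "client.1:2")
short_log = resend_detected(2, completed, "client.1:2")
```

With the long log the resent op is recognised as already applied; with the two-entry log it has been trimmed, so in this model it would be treated as a new op.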


Related issues

Related to RADOS - Bug #24320: out of order reply and/or osd assert with set-chunks-read.yaml Resolved 05/26/2018
Copied to RADOS - Backport #43346: nautilus: short pg log + cache tier ceph_test_rados out of order reply Resolved

History

#1 Updated by Sage Weil about 1 year ago

  • Related to Bug #24320: out of order reply and/or osd assert with set-chunks-read.yaml added

#2 Updated by Sage Weil about 1 year ago

/a/sage-2019-02-21_06:38:51-rados-wip-sage-testing-2019-02-20-2138-distro-basic-smithi/3620775

#3 Updated by Sage Weil about 1 year ago

/a/sage-2019-02-23_23:02:18-rados-wip-sage2-testing-2019-02-23-1354-distro-basic-smithi/3631889

#4 Updated by Neha Ojha about 1 year ago

This is on luminous:

/a/teuthology-2019-02-23_01:30:03-rados-luminous-distro-basic-smithi/3627561/

We recently changed the pg log limits for short_pg_log.yaml, which may be the reason why these failures are popping up more.

#5 Updated by Neha Ojha about 1 year ago

/a/yuriw-2019-03-07_00:04:47-rados-wip_yuri_nautilus_3.6.19-distro-basic-smithi/3675857/

#6 Updated by Sage Weil 9 months ago

Avoiding this in the qa suite as of this PR: https://github.com/ceph/ceph/pull/28658

#7 Updated by Patrick Donnelly 4 months ago

  • Status changed from 12 to New

#8 Updated by Neha Ojha 4 months ago

  • Status changed from New to Pending Backport
  • Backport set to nautilus

Seen in nautilus: /a/yuriw-2019-12-15_16:25:11-rados-wip-yuri-nautilus-baseline_12.13.19-distro-basic-smithi/4605500/

#9 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #43346: nautilus: short pg log + cache tier ceph_test_rados out of order reply added

#10 Updated by Nathan Cutler 2 months ago

  • Pull request ID set to 28658

#11 Updated by Nathan Cutler about 2 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
