Bug #38358
short pg log + cache tier ceph_test_rados out of order reply
Description
The combination of
- 1-pg-log-overrides/short_pg_log.yaml
and
- workloads/cache-agent-small.yaml
and any msgr failure injection
results in a ceph_test_rados crash like:
2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stdout:3323: finishing write tid 3 to smithi13913891-294
2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stdout:3323: finishing write tid 2 to smithi13913891-294
2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stderr:Error: finished tid 2 when last_acked_tid was 3
2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stderr:/build/ceph-14.0.1-3796-g597cd08/src/test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7fdcb4ff9700 time 2019-02-16 12:48:16.152554
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr:/build/ceph-14.0.1-3796-g597cd08/src/test/osd/RadosModel.h: 905: abort()
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: ceph version 14.0.1-3796-g597cd08 (597cd0800d5525c39d588f536bfb01afed545bdb) nautilus (dev)
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x7fdccd2799b7]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 2: (WriteOp::_finish(TestOp::CallbackInfo*)+0x5eb) [0x55d3145cacfb]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 3: (write_callback(void*, void*)+0x19) [0x55d3145e6899]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 4: (()+0x537d6) [0x7fdcd5ea57d6]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 5: (Context::complete(int)+0x9) [0x7fdcd5e89739]
2019-02-16T12:48:16.154 INFO:tasks.rados.rados.0.smithi139.stderr: 6: (Finisher::finisher_thread_entry()+0x16e) [0x7fdccd2be79e]
2019-02-16T12:48:16.154 INFO:tasks.rados.rados.0.smithi139.stderr: 7: (()+0x76db) [0x7fdcccdf86db]
2019-02-16T12:48:16.154 INFO:tasks.rados.rados.0.smithi139.stderr: 8: (clone()+0x3f) [0x7fdccc57b88f]
/a/kchai-2019-02-16_11:36:29-rados-wip-sage-testing-2019-02-16-1748-distro-basic-smithi/3601272
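For readers unfamiliar with the abort in the log above, the following is a rough Python sketch of the kind of ordering check that produces the "finished tid 2 when last_acked_tid was 3" message. It is not the code in RadosModel.h; the names WriteTracker, OutOfOrderReply, and replay are hypothetical, and the sketch only assumes that write completions for one object are expected to arrive in increasing tid order.

```python
class OutOfOrderReply(Exception):
    """Raised where the real test aborts (hypothetical name)."""


class WriteTracker:
    """Toy per-object bookkeeping: remembers the highest tid acked so far."""

    def __init__(self):
        self.last_acked_tid = 0

    def finish(self, tid):
        # A completion for a tid at or below the last acked one means the
        # replies arrived out of order (or the same write was acked twice).
        if tid <= self.last_acked_tid:
            raise OutOfOrderReply(
                f"finished tid {tid} when last_acked_tid was {self.last_acked_tid}"
            )
        self.last_acked_tid = tid


def replay(tids):
    """Feed a sequence of completion tids through a fresh tracker."""
    tracker = WriteTracker()
    for tid in tids:
        tracker.finish(tid)
    return tracker.last_acked_tid
```

Replaying the completions in the order seen in the log (tid 3, then tid 2) trips the check, while an in-order sequence passes.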
The short pg log in the base tier means that reqids aren't reliably propagated back to the cache tier, breaking the ordering when client ops are resent.
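As a loose illustration of why a short log matters for resent ops, here is a toy Python model of duplicate detection backed by a bounded log of request ids. It is only a sketch under simplifying assumptions: BoundedOpLog and resend_after are hypothetical names, the real pg log and cache tier interaction are far more involved, and the reqid strings are made up for the example.

```python
from collections import OrderedDict


class BoundedOpLog:
    """Toy log: remembers at most max_entries recent request ids."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = OrderedDict()  # reqid -> version it was applied at
        self.version = 0

    def submit(self, reqid):
        # A reqid still in the log is recognized as a resend and answered
        # with the original version instead of being applied again.
        if reqid in self.entries:
            return ("dup", self.entries[reqid])
        # Otherwise the op looks new and is applied at a fresh version.
        self.version += 1
        self.entries[reqid] = self.version
        # Trim the oldest entries once the log exceeds its limit.
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)
        return ("applied", self.version)


def resend_after(max_entries, intervening_ops):
    """Apply one op, then other ops, then resend the first op."""
    log = BoundedOpLog(max_entries)
    log.submit("client.1:1")
    for i in range(intervening_ops):
        log.submit(f"client.2:{i}")
    return log.submit("client.1:1")
```

With a long log, `resend_after(100, 5)` recognizes the resend as a duplicate of version 1. With a log of only two entries, `resend_after(2, 5)` has already trimmed the original reqid, so the resend is applied again at a new version, which is the kind of lost history that can surface as reordered replies.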
Related issues
History
#1 Updated by Sage Weil about 5 years ago
- Related to Bug #24320: out of order reply and/or osd assert with set-chunks-read.yaml added
#2 Updated by Sage Weil about 5 years ago
/a/sage-2019-02-21_06:38:51-rados-wip-sage-testing-2019-02-20-2138-distro-basic-smithi/3620775
#3 Updated by Sage Weil about 5 years ago
/a/sage-2019-02-23_23:02:18-rados-wip-sage2-testing-2019-02-23-1354-distro-basic-smithi/3631889
#4 Updated by Neha Ojha about 5 years ago
This is on luminous:
/a/teuthology-2019-02-23_01:30:03-rados-luminous-distro-basic-smithi/3627561/
We recently changed the pg log limits for short_pg_log.yaml, which may be the reason why these failures are popping up more.
#5 Updated by Neha Ojha about 5 years ago
/a/yuriw-2019-03-07_00:04:47-rados-wip_yuri_nautilus_3.6.19-distro-basic-smithi/3675857/
#6 Updated by Sage Weil almost 5 years ago
Avoiding this in the QA suite as of this PR: https://github.com/ceph/ceph/pull/28658
#7 Updated by Patrick Donnelly over 4 years ago
- Status changed from 12 to New
#8 Updated by Neha Ojha over 4 years ago
- Status changed from New to Pending Backport
- Backport set to nautilus
Seen in nautilus: /a/yuriw-2019-12-15_16:25:11-rados-wip-yuri-nautilus-baseline_12.13.19-distro-basic-smithi/4605500/
#9 Updated by Nathan Cutler over 4 years ago
- Copied to Backport #43346: nautilus: short pg log + cache tier ceph_test_rados out of order reply added
#10 Updated by Nathan Cutler about 4 years ago
- Pull request ID set to 28658
#11 Updated by Nathan Cutler about 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".