Bug #38358

short pg log + cache tier ceph_test_rados out of order reply

Added by Sage Weil 3 months ago. Updated 2 months ago.

Status:
Verified
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
02/16/2019
Due date:
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

the combination of

- 1-pg-log-overrides/short_pg_log.yaml

and

- workloads/cache-agent-small.yaml

and any msgr failure injection

results in a ceph_test_rados crash like

2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stdout:3323:  finishing write tid 3 to smithi13913891-294
2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stdout:3323:  finishing write tid 2 to smithi13913891-294
2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stderr:Error: finished tid 2 when last_acked_tid was 3
2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stderr:/build/ceph-14.0.1-3796-g597cd08/src/test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7fdcb4ff9700 time 2019-02-16 12:48:16.152554
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr:/build/ceph-14.0.1-3796-g597cd08/src/test/osd/RadosModel.h: 905: abort()
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: ceph version 14.0.1-3796-g597cd08 (597cd0800d5525c39d588f536bfb01afed545bdb) nautilus (dev)
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x7fdccd2799b7]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 2: (WriteOp::_finish(TestOp::CallbackInfo*)+0x5eb) [0x55d3145cacfb]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 3: (write_callback(void*, void*)+0x19) [0x55d3145e6899]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 4: (()+0x537d6) [0x7fdcd5ea57d6]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 5: (Context::complete(int)+0x9) [0x7fdcd5e89739]
2019-02-16T12:48:16.154 INFO:tasks.rados.rados.0.smithi139.stderr: 6: (Finisher::finisher_thread_entry()+0x16e) [0x7fdccd2be79e]
2019-02-16T12:48:16.154 INFO:tasks.rados.rados.0.smithi139.stderr: 7: (()+0x76db) [0x7fdcccdf86db]
2019-02-16T12:48:16.154 INFO:tasks.rados.rados.0.smithi139.stderr: 8: (clone()+0x3f) [0x7fdccc57b88f]
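
The abort in the trace above comes from an in-order completion check in the test client. A minimal sketch of that kind of check follows; it is a simplified stand-in, not the code from RadosModel.h, and the names AckTracker and on_finish are hypothetical.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for the per-object ack tracking done by the test
// client. It only models the ordering rule, not the real WriteOp logic.
struct AckTracker {
  std::uint64_t last_acked_tid = 0;

  // Returns an empty string when the completion arrives in order.
  // Otherwise returns a message shaped like the one in the log above
  // (the real test aborts at this point instead of returning).
  std::string on_finish(std::uint64_t tid) {
    if (tid <= last_acked_tid) {
      return "finished tid " + std::to_string(tid) +
             " when last_acked_tid was " + std::to_string(last_acked_tid);
    }
    last_acked_tid = tid;
    return "";
  }
};
```

With this model, completing tid 3 and then tid 2 on the same object reproduces the failure condition seen in the log: the second call reports that tid 2 finished after tid 3 was already acked.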

/a/kchai-2019-02-16_11:36:29-rados-wip-sage-testing-2019-02-16-1748-distro-basic-smithi/3601272

The short pg log in the base tier means that reqids aren't reliably propagated back to the cache tier, which breaks reply ordering when client ops are resent.
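
One way to picture why a short log matters is a bounded list of request ids used for duplicate detection. The sketch below is an illustration under that assumption only. The class ReqidLog and its methods are hypothetical, reqids are reduced to bare integers, and none of this is Ceph's actual pg log code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical, simplified model of a log that remembers recently applied
// request ids so that a resent op can be recognized as a duplicate.
class ReqidLog {
 public:
  explicit ReqidLog(std::size_t max_entries) : max_entries_(max_entries) {}

  // Record an applied request and trim the oldest entries past the limit.
  void record(std::uint64_t reqid) {
    entries_.push_back(reqid);
    while (entries_.size() > max_entries_) {
      entries_.pop_front();
    }
  }

  // True only if the request id is still present in the (trimmed) log.
  bool is_dup(std::uint64_t reqid) const {
    return std::find(entries_.begin(), entries_.end(), reqid) != entries_.end();
  }

 private:
  std::size_t max_entries_;
  std::deque<std::uint64_t> entries_;
};
```

In this toy model, a log limited to two entries that records requests 1, 2, and 3 no longer recognizes request 1. A resend of request 1 would then be treated as new rather than as a duplicate, which is the kind of lost information that can let replies come back out of order.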


Related issues

Related to RADOS - Bug #24320: out of order reply and/or osd assert with set-chunks-read.yaml Verified 05/26/2018

History

#1 Updated by Sage Weil 3 months ago

  • Related to Bug #24320: out of order reply and/or osd assert with set-chunks-read.yaml added

#2 Updated by Sage Weil 3 months ago

/a/sage-2019-02-21_06:38:51-rados-wip-sage-testing-2019-02-20-2138-distro-basic-smithi/3620775

#3 Updated by Sage Weil 3 months ago

/a/sage-2019-02-23_23:02:18-rados-wip-sage2-testing-2019-02-23-1354-distro-basic-smithi/3631889

#4 Updated by Neha Ojha 3 months ago

This is on luminous:

/a/teuthology-2019-02-23_01:30:03-rados-luminous-distro-basic-smithi/3627561/

We recently changed the pg log limits for short_pg_log.yaml, which may be why these failures are showing up more often.

#5 Updated by Neha Ojha 2 months ago

/a/yuriw-2019-03-07_00:04:47-rados-wip_yuri_nautilus_3.6.19-distro-basic-smithi/3675857/
