Bug #58940 (closed)

src/osd/PrimaryLogPG.cc: 4284: ceph_abort_msg("out of order op")

Added by Venky Shankar about 1 year ago. Updated 4 months ago.

Status:
Resolved
Priority:
Urgent
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
backport_processed
Backport:
reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This looks similar to https://tracker.ceph.com/issues/48793, but since that tracker is from a couple of years ago, I decided to create a fresh one.

I'm seeing the following OSD crash for a job in fs suite: https://pulpito.ceph.com/vshankar-2023-03-08_15:12:36-fs-wip-vshankar-testing-20230308.112059-testing-default-smithi/7197018/

2023-03-08T15:46:23.413+0000 7f0d2c737700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-2861-ga2b8f27f/rpm/el8/BUILD/ceph-18.0.0-2861-ga2b8f27f/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)' thread 7f0d2c737700 time 2023-03-08T15:46:23.411114+0000     /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-2861-ga2b8f27f/rpm/el8/BUILD/ceph-18.0.0-2861-ga2b8f27f/src/osd/PrimaryLogPG.cc: 4284: ceph_abort_msg("out of order op")

ceph version 18.0.0-2861-ga2b8f27f (a2b8f27f8eef4b6c46c66813a64c178fb019988a) reef (dev)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd7) [0x55a3fdaddc85]
2: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x13f4) [0x55a3fdd96694]
3: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x281e) [0x55a3fdd9cd0e]
4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1c6) [0x55a3fdbec936]
5: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55a3fdf026d9]
6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x115f) [0x55a3fdc01e6f]
7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x435) [0x55a3fe2b3e65]
8: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a3fe2b7064]
9: /lib64/libpthread.so.0(+0x81ca) [0x7f0d554a31ca]
10: clone()
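
For readers new to this abort: the osd.1 log excerpt in comment #1 shows it fires in PrimaryLogPG::execute_ctx() when an op arrives whose client tid is lower than a tid that has already been applied for the same client. A minimal sketch of that kind of per-client ordering guard, with hypothetical names rather than the actual Ceph code:

#include <cstdint>
#include <map>

// Hypothetical sketch only -- the real check lives in PrimaryLogPG::execute_ctx()
// and uses the PG's per-client completion state, not this structure.
class OrderingGuard {
  std::map<uint64_t, uint64_t> last_applied_tid;  // client id -> last applied tid
public:
  // Returns false where the OSD would ceph_abort_msg("out of order op"):
  // the incoming tid is not newer than one already applied for this client.
  bool check_and_record(uint64_t client, uint64_t tid) {
    auto it = last_applied_tid.find(client);
    if (it != last_applied_tid.end() && tid <= it->second) {
      // e.g. "bad op order, already applied 14945 > this 14934"
      return false;
    }
    last_applied_tid[client] = tid;
    return true;
  }
};

In the failing job, tid 14934 for client.25007 shows up after 14945 has already been applied, so this kind of guard trips and the OSD aborts.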

Related issues: 1 (0 open, 1 closed)

Copied to RADOS - Backport #61303: reef: src/osd/PrimaryLogPG.cc: 4284: ceph_abort_msg("out of order op") (Resolved, Aishwarya Mathuria)

#1

Updated by Aishwarya Mathuria about 1 year ago

Logs from osd.1 before it crashed:


2023-03-08T15:48:00.185+0000 7f6665246700 20 osd.1 pg_epoch: 138 pg[3.30( v 138'261 (0'0,138'261] local-lis/les=137/138 n=59 ec=83/83 lis/c=137/83 les/c/f=138/84/0 sis=137) [1,7] r=0 lpr=137 pi=[83,137)/2 crt=138'261 lcod 138'260 mlcod 138'260 active+recovery_wait+undersized+degraded mbc={255={}}] execute_ctx alloc reply 0x56410dec2000 result 0
2023-03-08T15:48:00.185+0000 7f6665246700 10 osd.1 pg_epoch: 138 pg[3.30( v 138'261 (0'0,138'261] local-lis/les=137/138 n=59 ec=83/83 lis/c=137/83 les/c/f=138/84/0 sis=137) [1,7] r=0 lpr=137 pi=[83,137)/2 crt=138'261 lcod 138'260 mlcod 138'260 active+recovery_wait+undersized+degraded mbc={255={}}] calc_trim_to_aggressive limit = 138'261
2023-03-08T15:48:00.185+0000 7f6665246700 20 osd.1 pg_epoch: 138 pg[3.30( v 138'261 (0'0,138'261] local-lis/les=137/138 n=59 ec=83/83 lis/c=137/83 les/c/f=138/84/0 sis=137) [1,7] r=0 lpr=137 pi=[83,137)/2 crt=138'261 lcod 138'260 mlcod 138'260 active+recovery_wait+undersized+degraded mbc={255={}}]  op order client.25007 tid 14934 last was 14945
2023-03-08T15:48:00.185+0000 7f6665246700 -1 osd.1 pg_epoch: 138 pg[3.30( v 138'261 (0'0,138'261] local-lis/les=137/138 n=59 ec=83/83 lis/c=137/83 les/c/f=138/84/0 sis=137) [1,7] r=0 lpr=137 pi=[83,137)/2 crt=138'261 lcod 138'260 mlcod 138'260 active+recovery_wait+undersized+degraded mbc={255={}}] bad op order, already applied 14945 > this 14934

I'm looking into this.

#2

Updated by Rishabh Dave about 1 year ago

Saw this bug on the following teuthology run: http://pulpito.front.sepia.ceph.com/rishabh-2023-03-03_21:39:49-fs-wip-rishabh-2023Mar03-2316-testing-default-smithi/7193878

The error can be reproduced repeatedly with this job. Running it a second time failed the job with a "Failure reason", but the same traceback can be found in the log: https://pulpito.front.sepia.ceph.com/rishabh-2023-03-06_12:50:54-fs-wip-rishabh-2023Mar03-2316-testing-default-smithi/7195138/

#3

Updated by Radoslaw Zarzynski about 1 year ago

  • Priority changed from Normal to Urgent
#5

Updated by Laura Flores about 1 year ago

@Venky, Rishabh, and Patrick, can you provide a good commit and a bad commit?

#6

Updated by Aishwarya Mathuria about 1 year ago

@Patrick, if you can reproduce the issue, would you mind setting debug_osd to 30?

#7

Updated by Neha Ojha about 1 year ago

  • Assignee changed from Radoslaw Zarzynski to Aishwarya Mathuria
#8

Updated by Aishwarya Mathuria about 1 year ago

  • Assignee changed from Aishwarya Mathuria to Radoslaw Zarzynski
  • Priority changed from Urgent to Normal

When we enqueue an op in mClock, we check osd_op_queue_cutoff here: https://github.com/ceph/ceph/blob/main/src/osd/scheduler/mClockScheduler.cc#L447.
By default this value is set to CEPH_MSG_PRIO_HIGH, which is 196.
In some tests we set osd_op_queue_cutoff to debug_random (https://github.com/ceph/ceph/blob/main/src/osd/scheduler/mClockScheduler.h#L161), which means the priority cutoff may come out as CEPH_MSG_PRIO_LOW (64): https://github.com/ceph/ceph/blob/main/src/include/msgr.h#L223
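
To make the routing concrete, here is a simplified, hypothetical sketch of the behaviour described above (not the actual mClockScheduler code): ops whose priority clears the cutoff go to a strict high-priority queue that is always drained before the fair-share queue, so the cutoff value seen at enqueue time decides which queue a prio-127 client op lands in.

#include <cstdint>
#include <deque>

// Simplified, hypothetical sketch -- not the actual mClockScheduler implementation.
constexpr unsigned CEPH_MSG_PRIO_LOW  = 64;   // values from src/include/msgr.h
constexpr unsigned CEPH_MSG_PRIO_HIGH = 196;

struct Op { uint64_t tid; unsigned priority; };

class SketchScheduler {
  std::deque<Op> high_priority_queue;  // strict priority, always served first
  std::deque<Op> sched;                // stand-in for the fair-share dmClock queue
public:
  void enqueue(const Op& op, unsigned cutoff) {
    // If the cutoff is re-evaluated on every enqueue and flips between HIGH
    // and LOW (debug_random), two prio-127 ops from the same client can land
    // in different queues.
    if (op.priority >= cutoff) {
      high_priority_queue.push_back(op);
    } else {
      sched.push_back(op);
    }
  }
  bool dequeue(Op* out) {
    if (!high_priority_queue.empty()) {   // drained before sched...
      *out = high_priority_queue.front();
      high_priority_queue.pop_front();
      return true;
    }
    if (!sched.empty()) {                 // ...so a later op can overtake an earlier one
      *out = sched.front();
      sched.pop_front();
      return true;
    }
    return false;
  }
};

With the cutoff at CEPH_MSG_PRIO_HIGH, a prio-127 client op stays in sched; if a later enqueue happens to see the cutoff at CEPH_MSG_PRIO_LOW, the same-priority op lands in high_priority_queue and will be dequeued first. That mismatch is what the two log excerpts below show.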

These tests had osd_op_queue_cutoff set to debug_random, and I saw that client ops with the same priority were being enqueued differently. The op below gets enqueued to the mClock scheduler as expected (sched count increases to 3):

2023-03-22T21:36:45.242+0000 7f113d104700 15 osd.1 89 enqueue_op osd_op(client.24974.0:4821 3.16 3.2877e8d6 (undecoded) ondisk+write+known_if_redirected e89) v8 prio 127 type 42 cost 81920 latency 0.000087 epoch 89 osd_op(client.24974.0:4821 3.16 3.2877e8d6 (undecoded) ondisk+write+known_if_redirected e89) v8
2023-03-22T21:36:45.242+0000 7f113d104700 20 osd.1 op_wq(6) _enqueue OpSchedulerItem(3.16 PGOpItem(op=osd_op(client.24974.0:4821 3.16 3.2877e8d6 (undecoded) ondisk+write+known_if_redirected e89) v8) class_id 3 cost 81920 e89)
2023-03-22T21:36:45.242+0000 7f113d104700 20 mClockScheduler: enqueue { class_id: 3 client_id: 24974 profile_id: 0 } item_cost: 81920 scaled_cost: 1
2023-03-22T21:36:45.242+0000 7f113d104700 20 mClockScheduler: enqueue client_count: 2 queue_sizes: [  high_priority_queue: 0 sched: 3 ]

However, the next op with the same priority is enqueued to the high_priority_queue:

2023-03-22T21:36:45.242+0000 7f113d104700 15 osd.1 89 enqueue_op osd_op(client.24974.0:4822 3.16 3.2877e8d6 (undecoded) ondisk+write+known_if_redirected e89) v8 prio 127 type 42 cost 102400 latency 0.000150 epoch 89 osd_op(client.24974.0:4822 3.16 3.2877e8d6 (undecoded) ondisk+write+known_if_redirected e89) v8
2023-03-22T21:36:45.242+0000 7f113d104700 20 osd.1 op_wq(6) _enqueue OpSchedulerItem(3.16 PGOpItem(op=osd_op(client.24974.0:4822 3.16 3.2877e8d6 (undecoded) ondisk+write+known_if_redirected e89) v8) class_id 3 cost 102400 e89)
2023-03-22T21:36:45.242+0000 7f113d104700 20 mClockScheduler: enqueue client_count: 2 queue_sizes: [  high_priority_queue: 1 sched: 3 ]

This causes the out-of-order op issue, since the op in the high_priority_queue will be dequeued first even though it was enqueued later.
I think this is because osd_op_queue_cutoff changed from CEPH_MSG_PRIO_HIGH to CEPH_MSG_PRIO_LOW, as it is set to debug_random.
I have a fix for this that I am testing: we shouldn't be checking osd_op_queue_cutoff on every enqueue operation, we should check it only once. I will update the tracker with the test results.
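
A minimal sketch of that idea, with hypothetical names rather than the code from the actual fix (the PR is linked later in this tracker): resolve the cutoff, including any debug_random roll, once at scheduler construction, so every enqueue sees the same value.

#include <random>

// Hypothetical sketch of "resolve the cutoff once" -- not the eventual fix.
constexpr unsigned PRIO_LOW  = 64;
constexpr unsigned PRIO_HIGH = 196;

class CutoffOnceScheduler {
  const unsigned cutoff;  // fixed for the lifetime of the scheduler

  static unsigned resolve_cutoff(bool debug_random) {
    if (debug_random) {
      // Roll the random choice a single time here, not once per enqueue.
      std::random_device rd;
      return (rd() % 2) ? PRIO_LOW : PRIO_HIGH;
    }
    return PRIO_HIGH;  // default described above
  }

public:
  explicit CutoffOnceScheduler(bool debug_random)
    : cutoff(resolve_cutoff(debug_random)) {}

  bool goes_to_high_priority_queue(unsigned prio) const {
    return prio >= cutoff;  // same answer for every op on this scheduler
  }
};

Where the real fix caches the value is up to the PR; the point is that a single resolution removes the chance of two different cutoffs within one client's op stream.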

I am not sure yet why this has shown up only in these tests (still working on that), but all the failed tests have osd_op_queue_cutoff set to debug_random.

Logs from: /a/amathuri-2023-03-22_17:03:44-fs-main-distro-default-smithi/7216331

#9

Updated by Aishwarya Mathuria about 1 year ago

  • Assignee changed from Radoslaw Zarzynski to Aishwarya Mathuria
  • Priority changed from Normal to Urgent
#10

Updated by Radoslaw Zarzynski about 1 year ago

  • Status changed from New to Fix Under Review
  • Backport set to reef
  • Pull request ID set to 50691
#11

Updated by Radoslaw Zarzynski about 1 year ago

A note from bug scrub: this affects reef and main only.

#12

Updated by Neha Ojha 11 months ago

  • Status changed from Fix Under Review to Pending Backport
#13

Updated by Backport Bot 11 months ago

  • Copied to Backport #61303: reef: src/osd/PrimaryLogPG.cc: 4284: ceph_abort_msg("out of order op") added
#14

Updated by Backport Bot 11 months ago

  • Tags set to backport_processed
#15

Updated by Aishwarya Mathuria 4 months ago

  • Status changed from Pending Backport to Resolved