Bug #23879

closed

test_mon_osdmap_prune.sh fails

Added by Kefu Chai about 6 years ago. Updated almost 4 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:
0%

Source:
Development
Tags:
Backport:
mimic, nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2018-04-26T06:50:25.638 INFO:tasks.workunit.client.0.smithi009.stdout:    "osdmap_first_committed": 1,
2018-04-26T06:50:25.638 INFO:tasks.workunit.client.0.smithi009.stdout:    "osdmap_last_committed": 1027,

There is a chance that we fail to trim the osdmap.
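For illustration, the two fields in the log excerpt can be checked mechanically. This is a minimal sketch, assuming the `ceph report` output is available as JSON and that osdmap epochs start at 1; the helper name `untrimmed_epochs` is hypothetical and does not come from the test script.

```python
import json


def untrimmed_epochs(report: dict) -> int:
    """Number of osdmap epochs the monitors still hold (inclusive range)."""
    return report["osdmap_last_committed"] - report["osdmap_first_committed"] + 1


# Values copied from the teuthology log excerpt in the description.
report = json.loads('{"osdmap_first_committed": 1, "osdmap_last_committed": 1027}')

held = untrimmed_epochs(report)
# Assumption: epochs start at 1, so first_committed == 1 means nothing was trimmed.
never_trimmed = report["osdmap_first_committed"] == 1
print(held, never_trimmed)
```

With the values above, every epoch from 1 through 1027 is still held.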

/a//kchai-2018-04-26_05:37:16-rados-master-distro-basic-smithi/2441069/teuthology.log

In http://pulpito.ceph.com/kchai-2018-04-26_05:37:16-rados-master-distro-basic-smithi/ it fails 1 out of 15 times.


Related issues 1 (0 open, 1 closed)

Related to RADOS - Bug #23942: test_mon_osdmap_prune.sh failures (Duplicate, 04/30/2018)

Actions #1

Updated by Kefu Chai about 6 years ago

  • Category set to Correctness/Safety
  • Source set to Development
Actions #2

Updated by Kefu Chai almost 6 years ago

$ zgrep propose_pending remote/*/log/ceph-mon.*.log.gz|grep osd | wc -l
1037

$ for f in remote/*/log/ceph-mon.*.log.gz; do zgrep propose_pending $f | grep osd | head -n1; done
2018-04-26 06:00:15.153 7f305afce700 10 mon.f@0(leader).paxosservice(osdmap 0..0) propose_pending
2018-04-26 06:12:33.218 7f1042aba700 10 mon.g@2(leader).paxosservice(osdmap 1..306) propose_pending
2018-04-26 06:03:38.337 7f3eaa0cc700 10 mon.h@4(leader).paxosservice(osdmap 1..91) propose_pending
2018-04-26 06:08:07.313 7fe295b50700 10 mon.a@1(leader).paxosservice(osdmap 1..176) propose_pending

$ for f in remote/*/log/ceph-mon.*.log.gz; do zgrep propose_pending $f | grep osd | tail -n1; done
2018-04-26 06:44:52.947 7f5bc90a4700 10 mon.f@0(leader).paxosservice(osdmap 1..1026) propose_pending
2018-04-26 06:12:36.198 7f1042aba700 10 mon.g@2(leader).paxosservice(osdmap 1..308) propose_pending
2018-04-26 06:05:10.805 7f3eaa0cc700 10 mon.h@4(leader).paxosservice(osdmap 1..118) propose_pending
2018-04-26 06:33:17.084 7fe295b50700 10 mon.a@1(leader).paxosservice(osdmap 1..1022) propose_pending

So we were constantly proposing, at roughly 0.39 osdmap proposals per second over the ~44-minute run:

In [3]: 1037/(44*60)
Out[3]: 0.3928030303030303
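The same estimate can be derived from the timestamps of the first and last `propose_pending` lines shown above instead of a rounded 44 minutes. A rough sketch, assuming those two lines bound the whole run and that the count of 1037 covers the same interval:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Earliest and latest propose_pending timestamps from the mon logs above.
first = datetime.strptime("2018-04-26 06:00:15.153", FMT)
last = datetime.strptime("2018-04-26 06:44:52.947", FMT)
proposals = 1037  # from the zgrep | wc -l count above

span_seconds = (last - first).total_seconds()
rate = proposals / span_seconds             # proposals per second
seconds_between = span_seconds / proposals  # average gap between proposals
print(f"{span_seconds:.1f}s span, {rate:.2f} proposals/s, "
      f"one every {seconds_between:.1f}s")
```

Either way the result is a proposal every two to three seconds on average.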

I guess that's why the osdmap never got trimmed: PaxosService::maybe_trim() only trims if the PaxosService is active, and the service is not considered active while it is proposing.
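A toy model of that hypothesis, not the Ceph code: each tick the service is either proposing or not, and trimming is only attempted when it is not. The tick granularity, the probability parameter, and the function name `simulate` are all assumptions made for illustration.

```python
import random


def simulate(ticks: int, p_proposing: float, seed: int = 0) -> int:
    """Count ticks on which a toy maybe_trim() would get a chance to trim."""
    rng = random.Random(seed)
    trims = 0
    for _ in range(ticks):
        proposing = rng.random() < p_proposing
        active = not proposing  # toy rule: a proposing service is not active
        if active:              # toy maybe_trim(): only trims when active
            trims += 1
    return trims


# A service that is always proposing never trims; an idle one always can.
print(simulate(1000, 1.0), simulate(1000, 0.0))
```

In this model the number of trim opportunities shrinks as the fraction of time spent proposing grows, which matches the intermittent nature of the failure but does not prove the cause.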

Actions #3

Updated by Josh Durgin almost 6 years ago

  • Related to Bug #23942: test_mon_osdmap_prune.sh failures added
Actions #4

Updated by Josh Durgin almost 6 years ago

  • Priority changed from Normal to Urgent
Actions #5

Updated by Josh Durgin almost 6 years ago

  • Priority changed from Urgent to Normal

Sounds like we need to block for trimming sometimes when there's a constant propose workload.

Actions #6

Updated by Sage Weil almost 6 years ago

  • Status changed from New to 12
  • Assignee set to Joao Eduardo Luis
  • Priority changed from Normal to High

/a/sage-2018-05-23_14:50:29-rados-wip-sage2-testing-2018-05-22-1410-distro-basic-smithi/2576533

Actions #7

Updated by Neha Ojha almost 6 years ago

/a/nojha-2018-06-21_00:18:52-rados-wip-24487-distro-basic-smithi/2686362

Actions #8

Updated by Kefu Chai over 5 years ago

/a/kchai-2018-09-11_09:51:05-rados-wip-kefu-testing-2018-09-10-1219-distro-basic-mira/3005452/teuthology.log

2018-09-11T11:39:19.500 INFO:tasks.workunit.client.0.mira037.stderr:+ (( i < 27 ))
2018-09-11T11:39:19.500 INFO:tasks.workunit.client.0.mira037.stderr:+ echo 'never trimmed up to epoch 11'
2018-09-11T11:39:19.500 INFO:tasks.workunit.client.0.mira037.stderr:+ ceph report
2018-09-11T11:39:19.500 INFO:tasks.workunit.client.0.mira037.stdout:never trimmed up to epoch 11
Actions #9

Updated by Kefu Chai over 5 years ago

2018-09-12T13:24:50.463 INFO:tasks.workunit.client.0.mira092.stderr:+ (( i < 27 ))
2018-09-12T13:24:50.463 INFO:tasks.workunit.client.0.mira092.stderr:+ echo 'never trimmed up to epoch 11'
2018-09-12T13:24:50.463 INFO:tasks.workunit.client.0.mira092.stderr:+ ceph report
2018-09-12T13:24:50.463 INFO:tasks.workunit.client.0.mira092.stdout:never trimmed up to epoch 11

/a//kchai-2018-09-12_11:57:28-rados-wip-kefu-testing-2018-09-12-1250-distro-basic-mira/3010904
Actions #10

Updated by Sage Weil over 5 years ago

  • Priority changed from High to Urgent

/a/sage-2018-10-10_15:50:53-rados-wip-sage-testing-2018-10-10-0850-distro-basic-smithi/3125020

Actions #11

Updated by Neha Ojha over 5 years ago

Joao, we've been seeing this one for a while. Could you please take a look? Thanks!

Actions #12

Updated by Josh Durgin over 5 years ago

  • Priority changed from Urgent to High

We aren't hitting this in recent rados runs anymore.

Actions #13

Updated by Neha Ojha about 5 years ago

  • Priority changed from High to Urgent

Seen in mimic /a/nojha-2019-01-29_03:40:43-rados-wip-37902-mimic-2019-01-28-distro-basic-smithi/3522485/

Actions #15

Updated by Neha Ojha almost 5 years ago

/a/yuriw-2019-04-29_22:14:10-rados-wip-yuri2-testing-2019-04-29-1936-mimic-distro-basic-smithi/3910028

Actions #16

Updated by Neha Ojha almost 5 years ago

/a/yuriw-2019-05-01_19:40:05-rados-wip-yuri3-testing-2019-04-30-1543-mimic-distro-basic-smithi/3916650/

Actions #17

Updated by Sage Weil almost 5 years ago

/a/sage-2019-07-02_17:58:21-rados-wip-sage-testing-2019-07-02-1056-distro-basic-smithi/4087740

Actions #18

Updated by David Zafman almost 5 years ago

  • Backport set to mimic, nautilus
Actions #19

Updated by Greg Farnum over 4 years ago

  • Priority changed from Urgent to High
Actions #20

Updated by Greg Farnum over 4 years ago

  • Assignee deleted (Joao Eduardo Luis)
Actions #21

Updated by Greg Farnum over 4 years ago

Are we really only seeing this about once a month? Is it just a probabilistic failure based on load of the monitor cluster?

Actions #22

Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New
Actions #23

Updated by Neha Ojha almost 4 years ago

  • Status changed from New to Can't reproduce