Bug #23879

test_mon_osdmap_prune.sh fails

Added by Kefu Chai almost 6 years ago. Updated over 3 years ago.

Status: Can't reproduce
Priority: High
Assignee: -
Category: Correctness/Safety
Target version: -
% Done: 0%
Source: Development
Tags:
Backport: mimic, nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2018-04-26T06:50:25.638 INFO:tasks.workunit.client.0.smithi009.stdout:    "osdmap_first_committed": 1,
2018-04-26T06:50:25.638 INFO:tasks.workunit.client.0.smithi009.stdout:    "osdmap_last_committed": 1027,

There is a chance that we fail to trim the osdmap.

/a//kchai-2018-04-26_05:37:16-rados-master-distro-basic-smithi/2441069/teuthology.log

In http://pulpito.ceph.com/kchai-2018-04-26_05:37:16-rados-master-distro-basic-smithi/ the test fails 1 out of 15 times.
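As a rough illustration of what the two report fields above imply, the sketch below counts how many osdmap epochs are still held. It assumes the fields sit at the top level of the `ceph report` JSON; the function name `untrimmed_epochs` and the sample string are made up for this example and are not part of the test script.

```python
import json


def untrimmed_epochs(report_json: str) -> int:
    """Return the number of osdmap epochs between first and last committed."""
    report = json.loads(report_json)
    first = report["osdmap_first_committed"]  # assumed top-level key
    last = report["osdmap_last_committed"]    # assumed top-level key
    return last - first + 1


# Hypothetical minimal report holding only the two values from the log above.
sample = '{"osdmap_first_committed": 1, "osdmap_last_committed": 1027}'
print(untrimmed_epochs(sample))
```

If trimming had happened, `osdmap_first_committed` would have moved above 1 and the count would be far smaller than the number of committed epochs.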


Related issues

Related to RADOS - Bug #23942: test_mon_osdmap_prune.sh failures (Duplicate, 04/30/2018)

History

#1 Updated by Kefu Chai almost 6 years ago

  • Category set to Correctness/Safety
  • Source set to Development

#2 Updated by Kefu Chai almost 6 years ago

$ zgrep propose_pending remote/*/log/ceph-mon.*.log.gz|grep osd | wc -l
1037

$ for f in remote/*/log/ceph-mon.*.log.gz; do zgrep propose_pending $f | grep osd | head -n1; done
2018-04-26 06:00:15.153 7f305afce700 10 mon.f@0(leader).paxosservice(osdmap 0..0) propose_pending
2018-04-26 06:12:33.218 7f1042aba700 10 mon.g@2(leader).paxosservice(osdmap 1..306) propose_pending
2018-04-26 06:03:38.337 7f3eaa0cc700 10 mon.h@4(leader).paxosservice(osdmap 1..91) propose_pending
2018-04-26 06:08:07.313 7fe295b50700 10 mon.a@1(leader).paxosservice(osdmap 1..176) propose_pending

$ for f in remote/*/log/ceph-mon.*.log.gz; do zgrep propose_pending $f | grep osd | tail -n1; done
2018-04-26 06:44:52.947 7f5bc90a4700 10 mon.f@0(leader).paxosservice(osdmap 1..1026) propose_pending
2018-04-26 06:12:36.198 7f1042aba700 10 mon.g@2(leader).paxosservice(osdmap 1..308) propose_pending
2018-04-26 06:05:10.805 7f3eaa0cc700 10 mon.h@4(leader).paxosservice(osdmap 1..118) propose_pending
2018-04-26 06:33:17.084 7fe295b50700 10 mon.a@1(leader).paxosservice(osdmap 1..1022) propose_pending

So we were constantly proposing, roughly 0.39 proposals per second over 44 minutes:

In [3]: 1037/(44*60)
Out[3]: 0.3928030303030303

I guess that's why the osdmap never got trimmed: PaxosService::maybe_trim() only trims if the PaxosService is active, and the service is not considered active while it is proposing.
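The rate estimate can also be derived from the log timestamps instead of a rounded 44 minutes. This is a minimal sketch, not a tool from the Ceph tree: the regular expression and the helper `parse_propose_line` are assumptions based only on the line format quoted above. Note that the 1037 count comes from all monitor logs, while the two timestamps are the first and last mon.f lines, so the result is only an approximation.

```python
import re
from datetime import datetime

# Pattern inferred from the quoted log lines; hypothetical, not a Ceph API.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"paxosservice\(osdmap (?P<first>\d+)\.\.(?P<last>\d+)\) propose_pending"
)


def parse_propose_line(line):
    """Return (timestamp, first_committed, last_committed), or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
    return ts, int(m.group("first")), int(m.group("last"))


first_line = ("2018-04-26 06:00:15.153 7f305afce700 10 "
              "mon.f@0(leader).paxosservice(osdmap 0..0) propose_pending")
last_line = ("2018-04-26 06:44:52.947 7f5bc90a4700 10 "
             "mon.f@0(leader).paxosservice(osdmap 1..1026) propose_pending")

start, _, _ = parse_propose_line(first_line)
end, first_committed, last_committed = parse_propose_line(last_line)

elapsed = (end - start).total_seconds()
rate = 1037 / elapsed  # 1037 is the total propose_pending count from zgrep
print(f"{elapsed:.0f}s elapsed, {rate:.2f} proposals/s, "
      f"osdmap {first_committed}..{last_committed}")
```

The parsed range also shows the symptom directly: the last committed epoch reaches 1026 while the first committed epoch stays at 1.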

#3 Updated by Josh Durgin almost 6 years ago

  • Related to Bug #23942: test_mon_osdmap_prune.sh failures added

#4 Updated by Josh Durgin almost 6 years ago

  • Priority changed from Normal to Urgent

#5 Updated by Josh Durgin almost 6 years ago

  • Priority changed from Urgent to Normal

Sounds like we need to block for trimming sometimes when there's a constant propose workload.
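To make the starvation argument concrete, here is a toy model, not Ceph's code. `ToyService`, `max_skipped`, `keep`, and `run` are all hypothetical names. It assumes only what is stated above: trimming is skipped while a proposal is in flight. The `max_skipped` option stands in for "block for trimming sometimes" by forcing a trim after a number of skipped attempts.

```python
class ToyService:
    """Toy stand-in for a Paxos service; only models the trim gating."""

    def __init__(self, max_skipped=None):
        self.first_committed = 1
        self.last_committed = 1
        self.proposing = False
        self.skipped = 0                # consecutive trims skipped
        self.max_skipped = max_skipped  # None: never force a trim

    def is_active(self):
        # Assumption from the discussion: not active while proposing.
        return not self.proposing

    def maybe_trim(self, keep=10):
        forced = (self.max_skipped is not None
                  and self.skipped >= self.max_skipped)
        if not self.is_active() and not forced:
            self.skipped += 1
            return False
        # A forced trim models waiting for the in-flight proposal to finish.
        self.skipped = 0
        target = self.last_committed - keep
        if target > self.first_committed:
            self.first_committed = target
            return True
        return False


def run(service, ticks):
    """Simulate a constant propose workload: every tick has a proposal."""
    for _ in range(ticks):
        service.proposing = True
        service.last_committed += 1
        service.maybe_trim()
    return service.first_committed


print("never forced:", run(ToyService(), 100))
print("forced after 5 skips:", run(ToyService(max_skipped=5), 100))
```

In this model the unforced service never advances `first_committed`, which matches the `1..1026` range seen in the logs, while the forced variant keeps the retained range bounded.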

#6 Updated by Sage Weil almost 6 years ago

  • Status changed from New to 12
  • Assignee set to Joao Eduardo Luis
  • Priority changed from Normal to High

/a/sage-2018-05-23_14:50:29-rados-wip-sage2-testing-2018-05-22-1410-distro-basic-smithi/2576533

#7 Updated by Neha Ojha almost 6 years ago

/a/nojha-2018-06-21_00:18:52-rados-wip-24487-distro-basic-smithi/2686362

#8 Updated by Kefu Chai over 5 years ago

/a/kchai-2018-09-11_09:51:05-rados-wip-kefu-testing-2018-09-10-1219-distro-basic-mira/3005452/teuthology.log

2018-09-11T11:39:19.500 INFO:tasks.workunit.client.0.mira037.stderr:+ (( i < 27 ))
2018-09-11T11:39:19.500 INFO:tasks.workunit.client.0.mira037.stderr:+ echo 'never trimmed up to epoch 11'
2018-09-11T11:39:19.500 INFO:tasks.workunit.client.0.mira037.stderr:+ ceph report
2018-09-11T11:39:19.500 INFO:tasks.workunit.client.0.mira037.stdout:never trimmed up to epoch 11

#9 Updated by Kefu Chai over 5 years ago

2018-09-12T13:24:50.463 INFO:tasks.workunit.client.0.mira092.stderr:+ (( i < 27 ))
2018-09-12T13:24:50.463 INFO:tasks.workunit.client.0.mira092.stderr:+ echo 'never trimmed up to epoch 11'
2018-09-12T13:24:50.463 INFO:tasks.workunit.client.0.mira092.stderr:+ ceph report
2018-09-12T13:24:50.463 INFO:tasks.workunit.client.0.mira092.stdout:never trimmed up to epoch 11

/a//kchai-2018-09-12_11:57:28-rados-wip-kefu-testing-2018-09-12-1250-distro-basic-mira/3010904

#10 Updated by Sage Weil over 5 years ago

  • Priority changed from High to Urgent

/a/sage-2018-10-10_15:50:53-rados-wip-sage-testing-2018-10-10-0850-distro-basic-smithi/3125020

#11 Updated by Neha Ojha over 5 years ago

Joao, we've been seeing this one for a while. Could you please take a look? Thanks!

#12 Updated by Josh Durgin about 5 years ago

  • Priority changed from Urgent to High

We aren't hitting this in recent rados runs anymore.

#13 Updated by Neha Ojha about 5 years ago

  • Priority changed from High to Urgent

Seen in mimic /a/nojha-2019-01-29_03:40:43-rados-wip-37902-mimic-2019-01-28-distro-basic-smithi/3522485/

#15 Updated by Neha Ojha almost 5 years ago

/a/yuriw-2019-04-29_22:14:10-rados-wip-yuri2-testing-2019-04-29-1936-mimic-distro-basic-smithi/3910028

#16 Updated by Neha Ojha almost 5 years ago

/a/yuriw-2019-05-01_19:40:05-rados-wip-yuri3-testing-2019-04-30-1543-mimic-distro-basic-smithi/3916650/

#17 Updated by Sage Weil over 4 years ago

/a/sage-2019-07-02_17:58:21-rados-wip-sage-testing-2019-07-02-1056-distro-basic-smithi/4087740

#18 Updated by David Zafman over 4 years ago

  • Backport set to mimic, nautilus

#19 Updated by Greg Farnum over 4 years ago

  • Priority changed from Urgent to High

#20 Updated by Greg Farnum over 4 years ago

  • Assignee deleted (Joao Eduardo Luis)

#21 Updated by Greg Farnum over 4 years ago

Are we really only seeing this about once a month? Is it just a probabilistic failure based on load of the monitor cluster?

#22 Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New

#23 Updated by Neha Ojha over 3 years ago

  • Status changed from New to Can't reproduce
