Bug #23879
test_mon_osdmap_prune.sh fails
% Done: 0%
Description
2018-04-26T06:50:25.638 INFO:tasks.workunit.client.0.smithi009.stdout: "osdmap_first_committed": 1,
2018-04-26T06:50:25.638 INFO:tasks.workunit.client.0.smithi009.stdout: "osdmap_last_committed": 1027,
There is a chance that we fail to trim the osdmap.
/a//kchai-2018-04-26_05:37:16-rados-master-distro-basic-smithi/2441069/teuthology.log
http://pulpito.ceph.com/kchai-2018-04-26_05:37:16-rados-master-distro-basic-smithi/ (it fails 1 out of 15 times)
Related issues
History
#1 Updated by Kefu Chai almost 6 years ago
- Category set to Correctness/Safety
- Source set to Development
#2 Updated by Kefu Chai almost 6 years ago
$ zgrep propose_pending remote/*/log/ceph-mon.*.log.gz | grep osd | wc -l
1037
$ for f in remote/*/log/ceph-mon.*.log.gz; do zgrep propose_pending $f | grep osd | head -n1; done
2018-04-26 06:00:15.153 7f305afce700 10 mon.f@0(leader).paxosservice(osdmap 0..0) propose_pending
2018-04-26 06:12:33.218 7f1042aba700 10 mon.g@2(leader).paxosservice(osdmap 1..306) propose_pending
2018-04-26 06:03:38.337 7f3eaa0cc700 10 mon.h@4(leader).paxosservice(osdmap 1..91) propose_pending
2018-04-26 06:08:07.313 7fe295b50700 10 mon.a@1(leader).paxosservice(osdmap 1..176) propose_pending
$ for f in remote/*/log/ceph-mon.*.log.gz; do zgrep propose_pending $f | grep osd | tail -n1; done
2018-04-26 06:44:52.947 7f5bc90a4700 10 mon.f@0(leader).paxosservice(osdmap 1..1026) propose_pending
2018-04-26 06:12:36.198 7f1042aba700 10 mon.g@2(leader).paxosservice(osdmap 1..308) propose_pending
2018-04-26 06:05:10.805 7f3eaa0cc700 10 mon.h@4(leader).paxosservice(osdmap 1..118) propose_pending
2018-04-26 06:33:17.084 7fe295b50700 10 mon.a@1(leader).paxosservice(osdmap 1..1022) propose_pending
So we were constantly proposing, roughly 0.39 proposals per second over the 44-minute run:
In [3]: 1037/(44*60)
Out[3]: 0.3928030303030303
I guess that's why the osdmap never got trimmed: PaxosService::maybe_trim() only trims if the PaxosService is active, and the service is not considered active while it is proposing.
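The hypothesized gating can be modeled with a minimal sketch. This is not the real Ceph implementation; the class and method names (maybe_trim, is_active, proposing) merely mirror PaxosService to illustrate why a constant propose workload starves trimming:

```python
# Hypothetical model of the trim-gating behavior described above.
# Simplified sketch only; the real PaxosService logic lives in Ceph's
# mon/PaxosService.{h,cc} and is considerably more involved.

class PaxosServiceModel:
    def __init__(self):
        self.first_committed = 1
        self.last_committed = 1027   # matches the epochs seen in the report
        self.proposing = False

    def is_active(self):
        # The service is not considered active while a proposal is in flight.
        return not self.proposing

    def maybe_trim(self):
        # Trimming only happens when the service is active; under a
        # constant propose workload this branch is almost never taken.
        if not self.is_active():
            return 0
        trimmed = self.last_committed - self.first_committed
        self.first_committed = self.last_committed
        return trimmed

svc = PaxosServiceModel()
svc.proposing = True         # constant propose workload
print(svc.maybe_trim())      # 0: trim skipped while proposing
svc.proposing = False        # a quiet moment, which the test run never got
print(svc.maybe_trim())      # 1026: backlog finally trimmed
```

At ~0.39 proposals per second the service essentially never observes a quiet moment, so first_committed stays pinned at 1 while last_committed keeps growing.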
#3 Updated by Josh Durgin almost 6 years ago
- Related to Bug #23942: test_mon_osdmap_prune.sh failures added
#4 Updated by Josh Durgin almost 6 years ago
- Priority changed from Normal to Urgent
#5 Updated by Josh Durgin almost 6 years ago
- Priority changed from Urgent to Normal
Sounds like we need to block for trimming sometimes when there's a constant propose workload.
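One way to make that concrete is a backlog threshold check. The threshold value and function name below are purely illustrative assumptions, not an existing Ceph option or API; the sketch just shows the shape of "trim even while proposals are in flight once the backlog is large enough":

```python
# Hypothetical sketch of the suggested fix: under a constant propose
# workload, force a trim once the untrimmed backlog crosses a threshold
# instead of waiting for the service to go idle.
# TRIM_BACKLOG_THRESHOLD and should_force_trim are assumed names, not
# real Ceph tunables or functions.

TRIM_BACKLOG_THRESHOLD = 500

def should_force_trim(first_committed, last_committed):
    """Return True when the untrimmed epoch backlog is large enough that
    we should block and trim even though proposals are still queued."""
    return (last_committed - first_committed) >= TRIM_BACKLOG_THRESHOLD

print(should_force_trim(1, 1027))   # True: the 1026-epoch backlog from this run
print(should_force_trim(1, 118))    # False: small backlog, normal path is fine
```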
#6 Updated by Sage Weil almost 6 years ago
- Status changed from New to 12
- Assignee set to Joao Eduardo Luis
- Priority changed from Normal to High
/a/sage-2018-05-23_14:50:29-rados-wip-sage2-testing-2018-05-22-1410-distro-basic-smithi/2576533
#7 Updated by Neha Ojha almost 6 years ago
/a/nojha-2018-06-21_00:18:52-rados-wip-24487-distro-basic-smithi/2686362
#8 Updated by Kefu Chai over 5 years ago
/a/kchai-2018-09-11_09:51:05-rados-wip-kefu-testing-2018-09-10-1219-distro-basic-mira/3005452/teuthology.log
2018-09-11T11:39:19.500 INFO:tasks.workunit.client.0.mira037.stderr:+ (( i < 27 ))
2018-09-11T11:39:19.500 INFO:tasks.workunit.client.0.mira037.stderr:+ echo 'never trimmed up to epoch 11'
2018-09-11T11:39:19.500 INFO:tasks.workunit.client.0.mira037.stderr:+ ceph report
2018-09-11T11:39:19.500 INFO:tasks.workunit.client.0.mira037.stdout:never trimmed up to epoch 11
#9 Updated by Kefu Chai over 5 years ago
2018-09-12T13:24:50.463 INFO:tasks.workunit.client.0.mira092.stderr:+ (( i < 27 ))
2018-09-12T13:24:50.463 INFO:tasks.workunit.client.0.mira092.stderr:+ echo 'never trimmed up to epoch 11'
2018-09-12T13:24:50.463 INFO:tasks.workunit.client.0.mira092.stderr:+ ceph report
2018-09-12T13:24:50.463 INFO:tasks.workunit.client.0.mira092.stdout:never trimmed up to epoch 11
/a//kchai-2018-09-12_11:57:28-rados-wip-kefu-testing-2018-09-12-1250-distro-basic-mira/3010904
#10 Updated by Sage Weil over 5 years ago
- Priority changed from High to Urgent
/a/sage-2018-10-10_15:50:53-rados-wip-sage-testing-2018-10-10-0850-distro-basic-smithi/3125020
#11 Updated by Neha Ojha over 5 years ago
Joao, we've been seeing this one for a while; could you please take a look? Thanks!
#12 Updated by Josh Durgin about 5 years ago
- Priority changed from Urgent to High
We aren't hitting this in recent rados runs anymore.
#13 Updated by Neha Ojha about 5 years ago
- Priority changed from High to Urgent
Seen in mimic /a/nojha-2019-01-29_03:40:43-rados-wip-37902-mimic-2019-01-28-distro-basic-smithi/3522485/
#15 Updated by Neha Ojha almost 5 years ago
/a/yuriw-2019-04-29_22:14:10-rados-wip-yuri2-testing-2019-04-29-1936-mimic-distro-basic-smithi/3910028
#16 Updated by Neha Ojha almost 5 years ago
/a/yuriw-2019-05-01_19:40:05-rados-wip-yuri3-testing-2019-04-30-1543-mimic-distro-basic-smithi/3916650/
#17 Updated by Sage Weil over 4 years ago
/a/sage-2019-07-02_17:58:21-rados-wip-sage-testing-2019-07-02-1056-distro-basic-smithi/4087740
#18 Updated by David Zafman over 4 years ago
- Backport set to mimic, nautilus
Another failure on mimic, so I assume Nautilus needs a fix too.
http://qa-proxy.ceph.com/teuthology/yuriw-2019-07-09_15:21:18-rados-wip-yuri-testing-2019-07-08-2007-mimic-distro-basic-smithi/4106241
#19 Updated by Greg Farnum over 4 years ago
- Priority changed from Urgent to High
#20 Updated by Greg Farnum over 4 years ago
- Assignee deleted (Joao Eduardo Luis)
#21 Updated by Greg Farnum over 4 years ago
Are we really only seeing this about once a month? Is it just a probabilistic failure based on the load of the monitor cluster?
#22 Updated by Patrick Donnelly over 4 years ago
- Status changed from 12 to New
#23 Updated by Neha Ojha over 3 years ago
- Status changed from New to Can't reproduce