Bug #9864
osd doesn't report new stats for 3 hours when running test LibCephFS.MulticlientSimple
Status: Closed
% Done: 0%
Description
2014-10-21T16:28:26.427 INFO:tasks.workunit.client.0.burnupi46.stdout:[ RUN ] LibCephFS.MulticlientSimple
2014-10-21T19:27:13.440 INFO:tasks.workunit:Stopping ['libcephfs/test.sh'] on client.0...

ubuntu@teuthology:/a/teuthology-2014-10-20_02:30:01-smoke-giant-distro-basic-multi/559318
Updated by Zheng Yan over 9 years ago
There is no mds log or client log, but ceph.log on both burnupi nodes looks strange:
2014-10-21 16:33:42.781647 mon.0 10.214.136.10:6789/0 79 : cluster [INF] pgmap v44: 528 pgs: 528 active+clean; 2273 kB data, 418 MB used, 3719 GB / 3724 GB avail
2014-10-21 16:34:26.082290 mon.0 10.214.136.10:6789/0 80 : cluster [INF] pgmap v45: 528 pgs: 528 active+clean; 2273 kB data, 418 MB used, 3719 GB / 3724 GB avail
2014-10-21 19:27:17.746144 mon.0 10.214.136.10:6789/0 84 : cluster [INF] pgmap v46: 528 pgs: 528 active+clean; 2273 kB data, 418 MB used, 3719 GB / 3724 GB avail; 0 B/s wr, 0 op/s
2014-10-21 19:27:19.889214 mon.0 10.214.136.10:6789/0 85 : cluster [INF] pgmap v47: 528 pgs: 528 active+clean; 2273 kB data, 418 MB used, 3719 GB / 3724 GB avail; 0 B/s wr, 0 op/s
16:34:26 is the time the test hung.
Updated by Sage Weil over 9 years ago
let's add debug to the test yaml so that we have logs next time?
Updated by Zheng Yan over 9 years ago
I think it's a monitor bug. It took about two hours to commit an update:
2014-10-21 16:34:27.133605 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..104) commit_start 105
2014-10-21 16:34:27.165724 7f991605b700 20 mon.a@0(leader).paxos(paxos writing c 1..104) commit_finish 105
2014-10-21 16:34:27.165761 7f991605b700 10 mon.a@0(leader).paxos(paxos writing c 1..105) sending commit to mon.1
2014-10-21 16:34:27.165765 7f991605b700 1 -- 10.214.136.10:6789/0 --> mon.1 10.214.136.14:6789/0 -- paxos(commit lc 105 fc 0 pn 200 opn 0) v3 -- ?+0 0x4e12a80
2014-10-21 16:34:27.165777 7f991605b700 10 mon.a@0(leader).paxos(paxos writing c 1..105) sending commit to mon.2
2014-10-21 16:34:27.165784 7f991605b700 1 -- 10.214.136.10:6789/0 --> mon.2 10.214.136.14:6790/0 -- paxos(commit lc 105 fc 0 pn 200 opn 0) v3 -- ?+0 0x4e13c00
2014-10-21 18:26:46.985877 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105) handle_accept paxos(accept lc 105 fc 0 pn 200 opn 0) v3
2014-10-21 18:26:46.985881 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105) now 0,1 have accepted
2014-10-21 18:26:46.985886 7f9914858700 1 -- 10.214.136.10:6789/0 <== mon.2 10.214.136.14:6790/0 2941 ==== paxos(accept lc 105 fc 0 pn 200 opn 0) v3 ==== 80+0+0 (1565782145 0 0) 0x5844280 con 0x4bc58c0
2014-10-21 18:26:46.985891 7f9914858700 20 mon.a@0(leader) e1 have connection
2014-10-21 18:26:46.985892 7f9914858700 20 mon.a@0(leader) e1 ms_dispatch existing session MonSession: mon.2 10.214.136.14:6790/0 is open allow * for mon.2 10.214.136.14:6790/0
2014-10-21 18:26:46.985895 7f9914858700 20 mon.a@0(leader) e1 caps allow *
2014-10-21 18:26:46.985896 7f9914858700 20 is_capable service=mon command= read on cap allow *
2014-10-21 18:26:46.985897 7f9914858700 20 allow so far , doing grant allow *
2014-10-21 18:26:46.985898 7f9914858700 20 allow all
2014-10-21 18:26:46.985899 7f9914858700 20 is_capable service=mon command= exec on cap allow *
2014-10-21 18:26:46.985899 7f9914858700 20 allow so far , doing grant allow *
2014-10-21 18:26:46.985900 7f9914858700 20 allow all
2014-10-21 18:26:46.985901 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105) handle_accept paxos(accept lc 105 fc 0 pn 200 opn 0) v3
2014-10-21 18:26:46.985903 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105) now 0,1,2 have accepted
2014-10-21 18:26:46.985905 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105) got majority, committing, done with update
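The size of the gap can be checked directly from the two timestamps quoted above; a quick sanity check (nothing Ceph-specific, timestamps copied from the logs):

```python
from datetime import datetime

# Timestamps copied from the two log snippets above: the leader sends
# the last commit for version 105, and the accepts for the next
# proposal (106) arrive almost two hours later.
fmt = "%Y-%m-%d %H:%M:%S.%f"
commit_105 = datetime.strptime("2014-10-21 16:34:27.165784", fmt)
accept_106 = datetime.strptime("2014-10-21 18:26:46.985877", fmt)

gap = accept_106 - commit_105
print(gap)  # about 1 hour 52 minutes
```

So the gap between the two snippets is just under two hours, which is consistent with the analysis below that they belong to two different proposals rather than one slow commit.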
Updated by Zheng Yan over 9 years ago
- Project changed from CephFS to Ceph
- Category set to Monitor
Updated by Sage Weil over 9 years ago
- Subject changed from LibCephFS.MulticlientSimple hang to mon takes 2 hours to commit
Updated by Joao Eduardo Luis over 9 years ago
This is not the monitor taking 2 hours to commit. The log snippets above refer to two different proposals: the first, at 16:34:27.133605, shows the commit of paxos version 105 on the leader, while the second, at 18:26:46.985877, shows the leader handling the peons' accepts for version 106. See below.
Begin of proposal for version 106 on the leader:
2014-10-21 18:26:46.952103 7f9915059700 10 mon.a@0(leader).paxos(paxos active c 1..105) propose_queued 106 819 bytes
2014-10-21 18:26:46.952106 7f9915059700 10 mon.a@0(leader).paxos(paxos updating c 1..105) begin for 106 819 bytes
2014-10-21 18:26:46.984924 7f9915059700 10 mon.a@0(leader).paxos(paxos updating c 1..105) sending begin to mon.1
2014-10-21 18:26:46.984945 7f9915059700 1 -- 10.214.136.10:6789/0 --> mon.1 10.214.136.14:6789/0 -- paxos(begin lc 105 fc 0 pn 200 opn 0) v3 -- ?+0 0x58f0000
2014-10-21 18:26:46.984957 7f9915059700 10 mon.a@0(leader).paxos(paxos updating c 1..105) sending begin to mon.2
2014-10-21 18:26:46.984965 7f9915059700 1 -- 10.214.136.10:6789/0 --> mon.2 10.214.136.14:6790/0 -- paxos(begin lc 105 fc 0 pn 200 opn 0) v3 -- ?+0 0x5844280
Acceptance for version 106 on a selected peon (both show roughly the same):
2014-10-21 18:26:46.985274 7f20a5fd9700 10 mon.c@2(peon).paxos(paxos active c 1..105) handle_begin paxos(begin lc 105 fc 0 pn 200 opn 0) v3
2014-10-21 18:26:46.985278 7f20a5fd9700 10 mon.c@2(peon).paxos(paxos updating c 1..105) accepting value for 106 pn 200
2014-10-21 18:26:46.985578 7f20a5fd9700 1 -- 10.214.136.14:6790/0 --> 10.214.136.10:6789/0 -- paxos(accept lc 105 fc 0 pn 200 opn 0) v3 -- ?+0 0x4bc1400 con 0x46eb760
2014-10-21 18:26:47.018656 7f20a5fd9700 1 -- 10.214.136.14:6790/0 <== mon.0 10.214.136.10:6789/0 3072 ==== paxos(commit lc 106 fc 0 pn 200 opn 0) v3 ==== 911+0+0 (1533841361 0 0) 0x4bc1400 con 0x46eb760
Commit on the leader:
2014-10-21 18:26:46.985901 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105) handle_accept paxos(accept lc 105 fc 0 pn 200 opn 0) v3
2014-10-21 18:26:46.985903 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105) now 0,1,2 have accepted
2014-10-21 18:26:46.985905 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105) got majority, committing, done with update
2014-10-21 18:26:46.985907 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105) commit_start 106
2014-10-21 18:26:47.018352 7f991605b700 20 mon.a@0(leader).paxos(paxos writing c 1..105) commit_finish 106
2014-10-21 18:26:47.018386 7f991605b700 10 mon.a@0(leader).paxos(paxos writing c 1..106) sending commit to mon.1
2014-10-21 18:26:47.018392 7f991605b700 1 -- 10.214.136.10:6789/0 --> mon.1 10.214.136.14:6789/0 -- paxos(commit lc 106 fc 0 pn 200 opn 0) v3 -- ?+0 0x53e2d00
2014-10-21 18:26:47.018403 7f991605b700 10 mon.a@0(leader).paxos(paxos writing c 1..106) sending commit to mon.2
2014-10-21 18:26:47.018419 7f991605b700 1 -- 10.214.136.10:6789/0 --> mon.2 10.214.136.14:6790/0 -- paxos(commit lc 106 fc 0 pn 200 opn 0) v3 -- ?+0 0x53e3200
Commit on a peon:
2014-10-21 18:26:47.018686 7f20a5fd9700 10 mon.c@2(peon).paxos(paxos updating c 1..105) handle_commit on 106
2014-10-21 18:26:47.018688 7f20a5fd9700 10 mon.c@2(peon).paxos(paxos updating c 1..106) store_state [106..106]
2014-10-21 18:26:47.019039 7f20a5fd9700 10 mon.c@2(peon) e1 refresh_from_paxos
I don't know what happened, but I don't see anything obviously wrong in the logs. The only obviously weird thing here is the gap of ~3 hours in the cluster log, which may well be a monitor issue.
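For readers following the log lines, the leader-side accept counting ("now 0,1 have accepted" → "now 0,1,2 have accepted" → "got majority, committing") can be sketched roughly as below. This is a simplified illustrative model, not the actual Paxos code in the monitor; the class and method names are hypothetical, and it assumes, as the log suggests, that the leader only starts the commit once every quorum member has accepted.

```python
class PaxosLeaderSketch:
    """Simplified model of the leader-side accept tracking seen in the
    log lines above (hypothetical; not the real ceph-mon implementation)."""

    def __init__(self, leader, quorum):
        self.quorum = set(quorum)
        # the leader implicitly accepts its own proposal
        self.accepted = {leader}

    def handle_accept(self, rank):
        """Record an accept from a peon; return True once the whole
        quorum has accepted and the commit can start."""
        self.accepted.add(rank)
        print("now %s have accepted" % ",".join(map(str, sorted(self.accepted))))
        if self.accepted == self.quorum:
            print("got majority, committing, done with update")
            return True   # corresponds to commit_start in the log
        return False
```

With a three-monitor quorum {0,1,2} and mon.0 as leader, the accept from mon.1 alone does not trigger the commit; the accept from mon.2 does, matching the order of the log lines above.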
Updated by Joao Eduardo Luis over 9 years ago
Not so weird after all.
The log shows that the last cluster log entry was created because we had some stats to report:
2014-10-21 16:34:25.965364 7f9914858700 10 mon.a@0(leader).pg v44 prepare_pg_stats pg_stats(0 pgs tid 25 v 0) v1 from osd.1
2014-10-21 16:34:25.965370 7f9914858700 10 mon.a@0(leader).pg v44 got osd.1 osd_stat(107 MB used, 929 GB avail, 931 GB total, peers [0,2,3]/[] op hist []) (was osd_stat(107 MB used, 929 GB avail, 931 GB total, peers [0,2,3]/[] op hist []))
2014-10-21 16:34:25.965376 7f9914858700 10 mon.a@0(leader).paxosservice(pgmap 1..44) setting proposal_timer 0x4b46d30 with delay of 0.05
And then it's just a bunch of:
2014-10-21 16:42:26.041432 7f9914858700 10 mon.a@0(leader).pg v45 prepare_pg_stats pg_stats(0 pgs tid 29 v 0) v1 from osd.1
2014-10-21 16:42:26.041436 7f9914858700 10 mon.a@0(leader).pg v45 message contains no new osd|pg stats
Until we get some worthy stats to report:
2014-10-21 19:27:17.602557 7f9914858700 10 mon.a@0(leader).pg v45 prepare_update pg_stats(3 pgs tid 112 v 0) v1 from osd.1 10.214.136.10:6805/44758
2014-10-21 19:27:17.602561 7f9914858700 10 mon.a@0(leader).pg v45 prepare_pg_stats pg_stats(3 pgs tid 112 v 0) v1 from osd.1
2014-10-21 19:27:17.602568 7f9914858700 10 mon.a@0(leader).pg v45 got osd.1 osd_stat(107 MB used, 929 GB avail, 931 GB total, peers [0,2,3]/[] op hist []) (was osd_stat(107 MB used, 929 GB avail, 931 GB total, peers [0,2,3]/[] op hist []))
2014-10-21 19:27:17.602578 7f9914858700 15 mon.a@0(leader).pg v45 got 0.7 reported at 8:21 state active+clean -> active+clean
2014-10-21 19:27:17.602586 7f9914858700 15 mon.a@0(leader).pg v45 got 1.74 reported at 8:851 state active+clean -> active+clean
2014-10-21 19:27:17.602591 7f9914858700 15 mon.a@0(leader).pg v45 got 1.94 reported at 8:23 state active+clean -> active+clean
2014-10-21 19:27:17.602598 7f9914858700 10 mon.a@0(leader).paxosservice(pgmap 1..45) setting proposal_timer 0x4b47580 with delay of 0.05
2014-10-21 19:27:17.652719 7f9915059700 10 mon.a@0(leader).paxosservice(pgmap 1..45) propose_pending
It just takes 3 hours to get some new stats. I don't think this is a monitor issue.
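The behaviour described above, where the monitor queues a pgmap proposal only when an incoming report actually changes something and otherwise logs "message contains no new osd|pg stats", can be sketched as follows. This is an illustrative model with hypothetical names and data shapes, not the real pgmap-handling code:

```python
def prepare_pg_stats(current, report):
    """Fold an incoming stats report into the current pgmap and return
    True if a proposal should be queued.

    'current' and 'report' are plain {pg_id: state} dicts -- hypothetical
    shapes for illustration, not the monitor's real structures.
    """
    changed = {pg: s for pg, s in report.items() if current.get(pg) != s}
    if not changed:
        # corresponds to the repeated 'message contains no new osd|pg
        # stats' lines seen between 16:42 and 19:27
        return False
    current.update(changed)
    return True  # corresponds to setting proposal_timer / propose_pending
```

Under this model, three hours of reports identical to the last committed stats produce no new pgmap versions, which is exactly the quiet stretch in the cluster log above.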
Updated by Joao Eduardo Luis over 9 years ago
- Subject changed from mon takes 2 hours to commit to osd doesn't report new stats for 3 hours when running test LibCephFS.MulticlientSimple
- Status changed from New to 4
Updated by Joao Eduardo Luis over 9 years ago
Looking into the osd logs shows that the osds don't report new stats for the ~3 hours because no pgs are updated in that timeframe.
Updated by Sage Weil over 9 years ago
- Status changed from 4 to Can't reproduce
not enough info to tell why the client test hung. let's see if it happens again!