Bug #9864


osd doesn't report new stats for 3 hours when running test LibCephFS.MulticlientSimple

Added by Sage Weil over 9 years ago. Updated over 9 years ago.

Status: Can't reproduce
Priority: Urgent
Category: Monitor
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2014-10-21T16:28:26.427 INFO:tasks.workunit.client.0.burnupi46.stdout:[ RUN      ] LibCephFS.MulticlientSimple
2014-10-21T19:27:13.440 INFO:tasks.workunit:Stopping ['libcephfs/test.sh'] on client.0...

ubuntu@teuthology:/a/teuthology-2014-10-20_02:30:01-smoke-giant-distro-basic-multi/559318
Actions #1

Updated by Zheng Yan over 9 years ago

There is no mds log or client log, but ceph.log on both burnupi nodes looks strange:

2014-10-21 16:33:42.781647 mon.0 10.214.136.10:6789/0 79 : cluster [INF] pgmap v44: 528 pgs: 528 active+clean; 2273 kB data, 418 MB used, 3719 GB / 3724 GB avail
2014-10-21 16:34:26.082290 mon.0 10.214.136.10:6789/0 80 : cluster [INF] pgmap v45: 528 pgs: 528 active+clean; 2273 kB data, 418 MB used, 3719 GB / 3724 GB avail
2014-10-21 19:27:17.746144 mon.0 10.214.136.10:6789/0 84 : cluster [INF] pgmap v46: 528 pgs: 528 active+clean; 2273 kB data, 418 MB used, 3719 GB / 3724 GB avail; 0 B/s wr, 0 op/s
2014-10-21 19:27:19.889214 mon.0 10.214.136.10:6789/0 85 : cluster [INF] pgmap v47: 528 pgs: 528 active+clean; 2273 kB data, 418 MB used, 3719 GB / 3724 GB avail; 0 B/s wr, 0 op/s

16:34:26 is the time the test hung.
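
Gaps like this are easy to spot mechanically. As a rough illustration (not part of teuthology; the "ceph.log" path and the 10-minute threshold are made-up examples), a few lines of Python can flag long silences between consecutive ceph.log entries:

# flag unusually long gaps between consecutive ceph.log entries
# (hypothetical helper; the path and threshold are just examples)
from datetime import datetime, timedelta

THRESHOLD = timedelta(minutes=10)
FMT = "%Y-%m-%d %H:%M:%S.%f"

prev = None
with open("ceph.log") as f:
    for line in f:
        try:
            ts = datetime.strptime(line[:26], FMT)   # e.g. "2014-10-21 16:34:26.082290"
        except ValueError:
            continue                                 # line doesn't start with a timestamp
        if prev is not None and ts - prev > THRESHOLD:
            print("gap of %s between %s and %s" % (ts - prev, prev, ts))
        prev = ts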

Actions #2

Updated by Sage Weil over 9 years ago

let's add debug to the test yaml so that we have logs next time?

Actions #3

Updated by Zheng Yan over 9 years ago

I think it's a monitor bug. It took about two hours to commit an update:

2014-10-21 16:34:27.133605 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..104) commit_start 105
2014-10-21 16:34:27.165724 7f991605b700 20 mon.a@0(leader).paxos(paxos writing c 1..104) commit_finish 105
2014-10-21 16:34:27.165761 7f991605b700 10 mon.a@0(leader).paxos(paxos writing c 1..105)  sending commit to mon.1
2014-10-21 16:34:27.165765 7f991605b700  1 -- 10.214.136.10:6789/0 --> mon.1 10.214.136.14:6789/0 -- paxos(commit lc 105 fc 0 pn 200 opn 0) v3 -- ?+0 0x4e12a80
2014-10-21 16:34:27.165777 7f991605b700 10 mon.a@0(leader).paxos(paxos writing c 1..105)  sending commit to mon.2
2014-10-21 16:34:27.165784 7f991605b700  1 -- 10.214.136.10:6789/0 --> mon.2 10.214.136.14:6790/0 -- paxos(commit lc 105 fc 0 pn 200 opn 0) v3 -- ?+0 0x4e13c00
2014-10-21 18:26:46.985877 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105) handle_accept paxos(accept lc 105 fc 0 pn 200 opn 0) v3
2014-10-21 18:26:46.985881 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105)  now 0,1 have accepted
2014-10-21 18:26:46.985886 7f9914858700  1 -- 10.214.136.10:6789/0 <== mon.2 10.214.136.14:6790/0 2941 ==== paxos(accept lc 105 fc 0 pn 200 opn 0) v3 ==== 80+0+0 (1565782145 0 0) 0x5844280 con 0x4bc58c0
2014-10-21 18:26:46.985891 7f9914858700 20 mon.a@0(leader) e1 have connection
2014-10-21 18:26:46.985892 7f9914858700 20 mon.a@0(leader) e1 ms_dispatch existing session MonSession: mon.2 10.214.136.14:6790/0 is openallow * for mon.2 10.214.136.14:6790/0
2014-10-21 18:26:46.985895 7f9914858700 20 mon.a@0(leader) e1  caps allow *
2014-10-21 18:26:46.985896 7f9914858700 20 is_capable service=mon command= read on cap allow *
2014-10-21 18:26:46.985897 7f9914858700 20  allow so far , doing grant allow *
2014-10-21 18:26:46.985898 7f9914858700 20  allow all
2014-10-21 18:26:46.985899 7f9914858700 20 is_capable service=mon command= exec on cap allow *
2014-10-21 18:26:46.985899 7f9914858700 20  allow so far , doing grant allow *
2014-10-21 18:26:46.985900 7f9914858700 20  allow all
2014-10-21 18:26:46.985901 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105) handle_accept paxos(accept lc 105 fc 0 pn 200 opn 0) v3
2014-10-21 18:26:46.985903 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105)  now 0,1,2 have accepted
2014-10-21 18:26:46.985905 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105)  got majority, committing, done with update
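
For reference, the gap between the commit_start entry and the handle_accept entries above can be read straight off the timestamps; a quick check (plain Python, nothing Ceph-specific):

from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"
t1 = datetime.strptime("2014-10-21 16:34:27.133605", FMT)  # commit_start 105
t2 = datetime.strptime("2014-10-21 18:26:46.985877", FMT)  # handle_accept
print(t2 - t1)   # 1:52:19.852272, i.e. just under two hours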
Actions #4

Updated by Zheng Yan over 9 years ago

  • Project changed from CephFS to Ceph
  • Category set to Monitor
Actions #5

Updated by Sage Weil over 9 years ago

  • Subject changed from LibCephFS.MulticlientSimple hang to mon takes 2 hours to commit
Actions #6

Updated by Sage Weil over 9 years ago

  • Assignee set to Joao Eduardo Luis
Actions #7

Updated by Joao Eduardo Luis over 9 years ago

This is not the monitor taking 2 hours to commit. The log snippets above refer to two different proposals: the first, at 16:34:27.133605, is the commit of paxos version 105 on the leader, while the second, at 18:26:46.985877, shows the leader handling the accepts for version 106. See below.

Begin of proposal for version 106 on the leader:

2014-10-21 18:26:46.952103 7f9915059700 10 mon.a@0(leader).paxos(paxos active c 1..105) propose_queued 106 819 bytes
2014-10-21 18:26:46.952106 7f9915059700 10 mon.a@0(leader).paxos(paxos updating c 1..105) begin for 106 819 bytes
2014-10-21 18:26:46.984924 7f9915059700 10 mon.a@0(leader).paxos(paxos updating c 1..105)  sending begin to mon.1
2014-10-21 18:26:46.984945 7f9915059700  1 -- 10.214.136.10:6789/0 --> mon.1 10.214.136.14:6789/0 -- paxos(begin lc 105 fc 0 pn 200 opn 0) v3 -- ?+0 0x58f0000
2014-10-21 18:26:46.984957 7f9915059700 10 mon.a@0(leader).paxos(paxos updating c 1..105)  sending begin to mon.2
2014-10-21 18:26:46.984965 7f9915059700  1 -- 10.214.136.10:6789/0 --> mon.2 10.214.136.14:6790/0 -- paxos(begin lc 105 fc 0 pn 200 opn 0) v3 -- ?+0 0x5844280

Acceptance for version 106 on a selected peon (both show roughly the same):

2014-10-21 18:26:46.985274 7f20a5fd9700 10 mon.c@2(peon).paxos(paxos active c 1..105) handle_begin paxos(begin lc 105 fc 0 pn 200 opn 0) v3
2014-10-21 18:26:46.985278 7f20a5fd9700 10 mon.c@2(peon).paxos(paxos updating c 1..105) accepting value for 106 pn 200
2014-10-21 18:26:46.985578 7f20a5fd9700  1 -- 10.214.136.14:6790/0 --> 10.214.136.10:6789/0 -- paxos(accept lc 105 fc 0 pn 200 opn 0) v3 -- ?+0 0x4bc1400 con 0x46eb760
2014-10-21 18:26:47.018656 7f20a5fd9700  1 -- 10.214.136.14:6790/0 <== mon.0 10.214.136.10:6789/0 3072 ==== paxos(commit lc 106 fc 0 pn 200 opn 0) v3 ==== 911+0+0 (1533841361 0 0) 0x4bc1400 con 0x46eb760

Commit on the leader:

2014-10-21 18:26:46.985901 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105) handle_accept paxos(accept lc 105 fc 0 pn 200 opn 0) v3
2014-10-21 18:26:46.985903 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105)  now 0,1,2 have accepted
2014-10-21 18:26:46.985905 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105)  got majority, committing, done with update
2014-10-21 18:26:46.985907 7f9914858700 10 mon.a@0(leader).paxos(paxos updating c 1..105) commit_start 106
2014-10-21 18:26:47.018352 7f991605b700 20 mon.a@0(leader).paxos(paxos writing c 1..105) commit_finish 106
2014-10-21 18:26:47.018386 7f991605b700 10 mon.a@0(leader).paxos(paxos writing c 1..106)  sending commit to mon.1
2014-10-21 18:26:47.018392 7f991605b700  1 -- 10.214.136.10:6789/0 --> mon.1 10.214.136.14:6789/0 -- paxos(commit lc 106 fc 0 pn 200 opn 0) v3 -- ?+0 0x53e2d00
2014-10-21 18:26:47.018403 7f991605b700 10 mon.a@0(leader).paxos(paxos writing c 1..106)  sending commit to mon.2
2014-10-21 18:26:47.018419 7f991605b700  1 -- 10.214.136.10:6789/0 --> mon.2 10.214.136.14:6790/0 -- paxos(commit lc 106 fc 0 pn 200 opn 0) v3 -- ?+0 0x53e3200

Commit on a peon:

2014-10-21 18:26:47.018686 7f20a5fd9700 10 mon.c@2(peon).paxos(paxos updating c 1..105) handle_commit on 106
2014-10-21 18:26:47.018688 7f20a5fd9700 10 mon.c@2(peon).paxos(paxos updating c 1..106) store_state [106..106]
2014-10-21 18:26:47.019039 7f20a5fd9700 10 mon.c@2(peon) e1 refresh_from_paxos
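
To make the relationship between these snippets concrete, here is a minimal sketch of the leader-side round (an illustration only, not the Ceph implementation): as in the log above, the leader starts a new proposal for last_committed + 1 and only commits once every member of the quorum has accepted it.

# minimal sketch of a leader-side paxos round (illustration, not the Ceph code)
class LeaderRound:
    def __init__(self, last_committed, quorum):
        self.last_committed = last_committed   # e.g. 105
        self.quorum = set(quorum)              # e.g. {0, 1, 2}
        self.version = None
        self.accepted = set()

    def begin(self):
        # propose last_committed + 1; the leader implicitly accepts its own value
        self.version = self.last_committed + 1
        self.accepted = {0}
        print("begin for %d" % self.version)

    def handle_accept(self, rank):
        if self.version is None:
            return                             # no proposal in flight
        self.accepted.add(rank)
        print("now %s have accepted" % ",".join(str(r) for r in sorted(self.accepted)))
        if self.accepted == self.quorum:
            print("committing %d, done with update" % self.version)
            self.last_committed, self.version = self.version, None

r = LeaderRound(last_committed=105, quorum={0, 1, 2})
r.begin()            # begin for 106
r.handle_accept(1)   # now 0,1 have accepted -- not committed yet
r.handle_accept(2)   # now 0,1,2 have accepted -> commit 106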

I don't know what happened, but I don't see anything obviously wrong in the logs. The only clearly weird thing here is the ~3 hour gap in the cluster log, which could well be a monitor issue.

Actions #8

Updated by Joao Eduardo Luis over 9 years ago

Not so weird after all.

The mon log shows that the last cluster log entry before the gap was created because we had some stats to report:

2014-10-21 16:34:25.965364 7f9914858700 10 mon.a@0(leader).pg v44 prepare_pg_stats pg_stats(0 pgs tid 25 v 0) v1 from osd.1
2014-10-21 16:34:25.965370 7f9914858700 10 mon.a@0(leader).pg v44  got osd.1 osd_stat(107 MB used, 929 GB avail, 931 GB total, peers [0,2,3]/[] op hist []) (was osd_stat(107 MB used, 929 GB avail, 931 GB total, peers [0,2,3]/[] op hist []))
2014-10-21 16:34:25.965376 7f9914858700 10 mon.a@0(leader).paxosservice(pgmap 1..44)  setting proposal_timer 0x4b46d30 with delay of 0.05

And then it's just a bunch of:

2014-10-21 16:42:26.041432 7f9914858700 10 mon.a@0(leader).pg v45 prepare_pg_stats pg_stats(0 pgs tid 29 v 0) v1 from osd.1
2014-10-21 16:42:26.041436 7f9914858700 10 mon.a@0(leader).pg v45  message contains no new osd|pg stats

Until, about three hours later, we get some stats worth reporting:

2014-10-21 19:27:17.602557 7f9914858700 10 mon.a@0(leader).pg v45 prepare_update pg_stats(3 pgs tid 112 v 0) v1 from osd.1 10.214.136.10:6805/44758
2014-10-21 19:27:17.602561 7f9914858700 10 mon.a@0(leader).pg v45 prepare_pg_stats pg_stats(3 pgs tid 112 v 0) v1 from osd.1
2014-10-21 19:27:17.602568 7f9914858700 10 mon.a@0(leader).pg v45  got osd.1 osd_stat(107 MB used, 929 GB avail, 931 GB total, peers [0,2,3]/[] op hist []) (was osd_stat(107 MB used, 929 GB avail, 931 GB total, peers [0,2,3]/[] op hist []))
2014-10-21 19:27:17.602578 7f9914858700 15 mon.a@0(leader).pg v45  got 0.7 reported at 8:21 state active+clean -> active+clean 
2014-10-21 19:27:17.602586 7f9914858700 15 mon.a@0(leader).pg v45  got 1.74 reported at 8:851 state active+clean -> active+clean
2014-10-21 19:27:17.602591 7f9914858700 15 mon.a@0(leader).pg v45  got 1.94 reported at 8:23 state active+clean -> active+clean
2014-10-21 19:27:17.602598 7f9914858700 10 mon.a@0(leader).paxosservice(pgmap 1..45)  setting proposal_timer 0x4b47580 with delay of 0.05
2014-10-21 19:27:17.652719 7f9915059700 10 mon.a@0(leader).paxosservice(pgmap 1..45) propose_pending

It just takes 3 hours to get some new stats. I don't think this is a monitor issue.
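
In other words, the pgmap only advances when an incoming report actually changes something. A rough sketch of that decision (an illustration, not the PGMonitor code; the dict layout is made up):

# sketch of the "only propose when something changed" decision (illustration only)
def prepare_pg_stats(pgmap, osd_id, osd_stat, pg_stats):
    changed = False

    if pgmap["osd_stats"].get(osd_id) != osd_stat:
        pgmap["osd_stats"][osd_id] = osd_stat      # new osd stats -> worth proposing
        changed = True

    for pgid, stat in pg_stats.items():
        if pgmap["pg_stats"].get(pgid) != stat:
            pgmap["pg_stats"][pgid] = stat         # new pg stats -> worth proposing
            changed = True

    if changed:
        print("setting proposal_timer")            # a new pgmap version will be committed
    else:
        print("message contains no new osd|pg stats")
    return changed

pgmap = {"osd_stats": {}, "pg_stats": {}}
prepare_pg_stats(pgmap, 1, {"kb_used": 107}, {})   # first report -> proposal
prepare_pg_stats(pgmap, 1, {"kb_used": 107}, {})   # nothing new -> no proposal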

Actions #9

Updated by Joao Eduardo Luis over 9 years ago

  • Subject changed from mon takes 2 hours to commit to osd doesn't report new stats for 3 hours when running test LibCephFS.MulticlientSimple
  • Status changed from New to 4
Actions #10

Updated by Joao Eduardo Luis over 9 years ago

Looking into the osd logs shows that the osds don't report new stats for ~3 hours because no pgs are updated in that timeframe.
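
For completeness, the osd-side behaviour can be sketched the same way (again an illustration, not the OSD code): each report only includes pgs whose stats changed since the last report, so an idle cluster produces pg_stats(0 pgs ...) messages for hours.

# sketch of the osd-side report (illustration only): report only changed pgs
def build_pg_stats_report(pgs, last_reported):
    report = {}
    for pgid, stat in pgs.items():
        if last_reported.get(pgid) != stat:
            report[pgid] = stat
            last_reported[pgid] = stat
    return report                                  # empty -> "pg_stats(0 pgs ...)"

pgs = {"0.7": {"state": "active+clean", "num_objects": 12}}
last = dict(pgs)                                   # nothing changed since last tick
print(build_pg_stats_report(pgs, last))            # {}  (matches the "0 pgs" messages)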

Actions #11

Updated by Sage Weil over 9 years ago

  • Status changed from 4 to Can't reproduce

not enough info to tell why the client test hung. let's see if it happens again!
