Bug #14758
Closed: failed TestJournalRepair test on smithi
Description
http://pulpito.ceph.com/teuthology-2016-02-12_14:03:01-fs-jewel---basic-smithi/8218/
2016-02-12T23:58:57.155 INFO:tasks.cephfs_test_runner:======================================================================
2016-02-12T23:58:57.156 INFO:tasks.cephfs_test_runner:ERROR: test_reset (tasks.cephfs.test_journal_repair.TestJournalRepair)
2016-02-12T23:58:57.156 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2016-02-12T23:58:57.156 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2016-02-12T23:58:57.156 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_jewel/tasks/cephfs/test_journal_repair.py", line 245, in test_reset
2016-02-12T23:58:57.157 INFO:tasks.cephfs_test_runner:    self.fs.wait_for_state("up:standby", timeout=60, mds_id=damaged_id)
2016-02-12T23:58:57.157 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_jewel/tasks/cephfs/filesystem.py", line 507, in wait_for_state
2016-02-12T23:58:57.157 INFO:tasks.cephfs_test_runner:    elapsed, goal_state, current_state
2016-02-12T23:58:57.157 INFO:tasks.cephfs_test_runner:RuntimeError: Reached timeout after 60.3698599339 seconds waiting for state up:standby, while in state up:resolve
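For context, self.fs.wait_for_state() in tasks/cephfs/filesystem.py is essentially a poll-until-timeout loop over the daemon's reported state. A minimal sketch of that pattern (simplified names, not the actual teuthology code):

import time

def wait_for_state(get_state, goal_state, timeout):
    # Poll get_state() until it returns goal_state, or raise on timeout.
    started_at = time.time()
    while True:
        current_state = get_state()  # e.g. "up:standby", "up:resolve"
        if current_state == goal_state:
            return
        elapsed = time.time() - started_at
        if elapsed > timeout:
            raise RuntimeError(
                "Reached timeout after {0} seconds waiting for state {1}, "
                "while in state {2}".format(elapsed, goal_state, current_state))
        time.sleep(1)

So the failure above just means the daemon never reached up:standby within 60 seconds; the interesting question is why it was stuck in up:resolve.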
Updated by Greg Farnum about 8 years ago
- Priority changed from Urgent to High
Hmm, we haven't seen this since then, and there have been some smithi runs since, so it's apparently not just infrastructure differences.
Updated by John Spray about 8 years ago
- Status changed from New to Rejected
The link in the report was wrong; it was actually:
http://pulpito.ceph.com/teuthology-2016-02-12_14:03:01-fs-jewel---basic-smithi/8241/
Updated by John Spray about 8 years ago
- Status changed from Rejected to Resolved
In this case, mds.b was going into resolve (rank 0) while mds.a was remaining in standby. Those states are consistent in themselves (rank 1 is damaged), but the daemons are the wrong way round: the test is waiting for 'b' to go to standby, and it's 'a' that ended up there.
The test restarted b, expecting it to go to standby while a remained active at rank 0.
But 'a' restarted too, because it saw itself removed from the mdsmap:
2016-02-13 07:57:48.516326 7f32becb6700 1 mds.a handle_mds_map i (172.21.15.46:6805/32687) dne in the mdsmap, respawning myself
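That "dne in the mdsmap, respawning myself" reaction is standard MDS behaviour: on every new map the daemon checks whether its own gid still exists, and if the mon has dropped it, the only safe response is to respawn and rejoin as a fresh standby. A rough Python model of the check (the real logic is C++ in the MDS's handle_mds_map path):

def handle_mds_map(my_gid, mdsmap_gids, respawn):
    # If the mon no longer knows about us ("dne in the mdsmap"), our view
    # of the cluster is untrustworthy; restart and rejoin as a standby.
    if my_gid not in mdsmap_gids:
        respawn()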
The mon removed it because it thought it had timed out:
2016-02-13 07:57:48.088872 7fa9cf462700 10 mon.a@0(leader).mds e34 no beacon from 4208 172.21.15.46:6805/32687 mds.0.4 up:active since 2016-02-13 07:57:32.668364
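The removal decision is made at tick time: the mon compares the last beacon it recorded for each gid against mds_beacon_grace (15 seconds by default, which matches the ~15.4s gap in the message above). A rough Python model of that check (the real logic is C++ in MDSMonitor; names here are illustrative):

import time

MDS_BEACON_GRACE = 15.0  # default mds_beacon_grace, in seconds

def find_laggy(last_beacon, now=None):
    # last_beacon maps gid -> timestamp of the last beacon the mon *noted*.
    # If beacons arrive but are never noted, the daemon looks silent and
    # gets the "no beacon from ..." treatment above.
    now = time.time() if now is None else now
    return [gid for gid, stamp in last_beacon.items()
            if now - stamp > MDS_BEACON_GRACE]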
Here's the last beacon exchange in active state:
2016-02-13 07:57:44.668380 7f32bbcb0700  1 -- 172.21.15.46:6805/32687 --> 172.21.15.46:6789/0 -- mdsbeacon(4208/a up:active seq 36 v33) v4 -- ?+0 0x7f32d239c600 con 0x7f32d0b84600
2016-02-13 07:57:48.098621 7f32becb6700  1 -- 172.21.15.46:6805/32687 <== mon.0 172.21.15.46:6789/0 61 ==== mdsbeacon(4208/a up:active seq 36 v34) v4 ==== 113+0+0 (2581632639 0 0) 0x7f32d34e1c00 con 0x7f32d0b84600
OK, the mon is doing something wrong here: we can see a beacon being handled at 07:57:41, only seconds before it claims to have heard nothing since 07:57:32:
2016-02-13 07:57:41.681501 7fa9cf462700 10 mon.a@0(leader).mds e33 _updated mds.0 mdsbeacon(4208/a up:active seq 35 v32) v4
2016-02-13 07:57:41.681512 7fa9cf462700  0 log_channel(cluster) log [INF] : mds.0 172.21.15.46:6805/32687 up:active
2016-02-13 07:57:41.681527 7fa9cf462700  1 -- 172.21.15.46:6789/0 --> 172.21.15.46:6789/0 -- log(1 entries from seq 321 at 2016-02-13 07:57:41.681515) v1 -- ?+0 0x7fa9e0a72880 con 0x7fa9e077ad00
2016-02-13 07:57:41.681544 7fa9cf462700  2 mon.a@0(leader) e1 send_reply 0x7fa9e0a42000 0x7fa9e1204d00 mdsbeacon(4208/a up:active seq 35 v33) v4
2016-02-13 07:57:41.681548 7fa9cf462700  1 -- 172.21.15.46:6789/0 --> 172.21.15.46:6805/32687 -- mdsbeacon(4208/a up:active seq 35 v33) v4 -- ?+0 0x7fa9e1204d00 con 0x7fa9e0a0bb00
2016-02-13 07:57:41.681577 7fa9cf462700 10 mon.a@0(leader).mds e33 e33: 2/2/2 up {0=a=up:active,1=b=up:active(laggy or crashed)}
Ah, this was the bug that was fixed by:
commit dff71a5da1de569928619deeb2f832dcb744f5c4
Author: Yan, Zheng <zyan@redhat.com>
Date:   Tue Feb 23 17:11:17 2016 +0800

    mon/MDSMonitor.cc: properly note beacon when health metrics changes

    Fixes: #14684
    Signed-off-by: Yan, Zheng <zyan@redhat.com>

diff --git a/src/mon/MDSMonitor.cc b/src/mon/MDSMonitor.cc
index 4238ffb..79d1500 100644
--- a/src/mon/MDSMonitor.cc
+++ b/src/mon/MDSMonitor.cc
@@ -367,6 +367,7 @@ bool MDSMonitor::preprocess_beacon(MonOpRequestRef op)
     // do not match, to update our stored
     if (!(pending_daemon_health[gid] == m->get_health())) {
       dout(20) << __func__ << " health metrics for gid " << gid << " were updated" << dendl;
+      _note_beacon(m);
       return false;
     }
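To spell the failure mode out: preprocess_beacon() answers routine beacons itself and records them via _note_beacon(), but when a daemon's health metrics have changed it returns false so the beacon falls through to the paxos update path, and before this fix that path never recorded the beacon. A daemon whose health metrics kept changing therefore looked silent to the tick-time grace check even though it was beaconing on schedule. A hedged Python model of that control flow (the real code is the C++ above; structure simplified):

import time

def _note_beacon(last_beacon, gid):
    # Record when we last heard from this daemon; the tick-time grace
    # check reads this map.
    last_beacon[gid] = time.time()

def preprocess_beacon(last_beacon, pending_daemon_health, gid, health):
    # Return True if the beacon is fully handled here, False if it must
    # fall through to the (paxos) update path.
    if pending_daemon_health.get(gid) != health:
        # Health metrics changed, so a map update is needed. Before the
        # fix this returned *without* noting the beacon, so last_beacon
        # went stale and the daemon was eventually declared laggy.
        _note_beacon(last_beacon, gid)  # the one-line fix
        return False
    _note_beacon(last_beacon, gid)  # routine beacon: note it and reply
    return True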