Bug #14758 (closed): failed TestJournalRepair test on smithi

Added by Greg Farnum about 8 years ago. Updated about 8 years ago.

Status: Resolved
Priority: High
% Done: 0%
Source: other
Regression: No
Severity: 3 - minor

Description

http://pulpito.ceph.com/teuthology-2016-02-12_14:03:01-fs-jewel---basic-smithi/8218/

2016-02-12T23:58:57.155 INFO:tasks.cephfs_test_runner:======================================================================
2016-02-12T23:58:57.156 INFO:tasks.cephfs_test_runner:ERROR: test_reset (tasks.cephfs.test_journal_repair.TestJournalRepair)
2016-02-12T23:58:57.156 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2016-02-12T23:58:57.156 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2016-02-12T23:58:57.156 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_jewel/tasks/cephfs/test_journal_repair.py", line 245, in test_reset
2016-02-12T23:58:57.157 INFO:tasks.cephfs_test_runner:    self.fs.wait_for_state("up:standby", timeout=60, mds_id=damaged_id)
2016-02-12T23:58:57.157 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_jewel/tasks/cephfs/filesystem.py", line 507, in wait_for_state
2016-02-12T23:58:57.157 INFO:tasks.cephfs_test_runner:    elapsed, goal_state, current_state
2016-02-12T23:58:57.157 INFO:tasks.cephfs_test_runner:RuntimeError: Reached timeout after 60.3698599339 seconds waiting for state up:standby, while in state up:resolve
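
For context: wait_for_state() in the traceback is a polling helper; it keeps reading the daemon's state from the MDS map and raises the RuntimeError above once the timeout elapses without the goal state being reached. A minimal sketch of that pattern (illustrative Python only, not the actual filesystem.py code; get_mds_state is a hypothetical callback standing in for the real map lookup):

import time

def wait_for_state(get_mds_state, goal_state, timeout, poll_interval=1.0):
    """Poll get_mds_state() until it returns goal_state or the timeout elapses."""
    started_at = time.time()
    while True:
        current_state = get_mds_state()
        if current_state == goal_state:
            return
        elapsed = time.time() - started_at
        if elapsed > timeout:
            # Same shape as the error in the traceback above.
            raise RuntimeError(
                "Reached timeout after {0} seconds waiting for state {1}, "
                "while in state {2}".format(elapsed, goal_state, current_state))
        time.sleep(poll_interval)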
#1 - Updated by Greg Farnum about 8 years ago

  • Priority changed from Urgent to High

Hmm, we haven't seen this since that run, and there have been more smithi runs since then, so it's apparently not just infrastructure differences.

#2 - Updated by John Spray about 8 years ago

  • Status changed from New to Rejected

#3 - Updated by John Spray about 8 years ago

  • Status changed from Rejected to New

#4 - Updated by John Spray about 8 years ago

  • Status changed from New to Resolved

In this case, mds.b was going into resolve (rank 0), while mds.a remained in standby. That's correct (rank 1 is damaged).

The test should be waiting for 'a' to go to standby but it's waiting for 'b' to go to standby.

It restarted b expecting it to go to standby while a remained active.

But 'a' restarted too, because it saw itself removed from the mdsmap:

2016-02-13 07:57:48.516326 7f32becb6700  1 mds.a handle_mds_map i (172.21.15.46:6805/32687) dne in the mdsmap, respawning myself
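
("dne" here is "does not exist": the daemon noticed that its own GID was no longer present in the MDS map and restarted itself so it could re-register with the mon. A trivially simplified sketch of that reaction, in Python rather than the actual C++, with respawn passed in as a callback:)

def handle_mds_map(my_gid, map_gids, respawn):
    """Sketch: an MDS that no longer appears in the map restarts itself."""
    if my_gid not in map_gids:
        respawn()  # corresponds to "dne in the mdsmap, respawning myself"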

The mon removed it because it thought it timed out:

2016-02-13 07:57:48.088872 7fa9cf462700 10 mon.a@0(leader).mds e34 no beacon from 4208 172.21.15.46:6805/32687 mds.0.4 up:active since 2016-02-13 07:57:32.668364
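
The check behind that "no beacon from" line is the mon's beacon-grace timeout: if no beacon has been noted for a daemon within the grace period (the mds_beacon_grace option, 15 seconds by default), the mon treats it as laggy and can drop it from the map. A rough sketch of that check, in Python rather than the actual MDSMonitor C++, assuming a simple gid -> last-noted-beacon-time map:

import time

MDS_BEACON_GRACE = 15.0  # seconds; default value of the mds_beacon_grace option

def find_timed_out(last_beacon, now=None):
    """Return the gids whose last noted beacon is older than the grace period."""
    now = time.time() if now is None else now
    return [gid for gid, last in last_beacon.items()
            if now - last > MDS_BEACON_GRACE]

The timestamps fit: the check above at 07:57:48 sees a last noted beacon at 07:57:32, just over the 15-second grace, even though (as shown below) the daemon was still sending beacons at 07:57:44.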

Here's the last beacon exchange in active state:

2016-02-13 07:57:44.668380 7f32bbcb0700  1 -- 172.21.15.46:6805/32687 --> 172.21.15.46:6789/0 -- mdsbeacon(4208/a up:active seq 36 v33) v4 -- ?+0 0x7f32d239c600 con 0x7f32d0b84600
2016-02-13 07:57:48.098621 7f32becb6700  1 -- 172.21.15.46:6805/32687 <== mon.0 172.21.15.46:6789/0 61 ==== mdsbeacon(4208/a up:active seq 36 v34) v4 ==== 113+0+0 (2581632639 0 0) 0x7f32d34e1c00 con 0x7f32d0b84600

OK, the mon is doing something wrong here, because we can see this beacon getting handled:

2016-02-13 07:57:41.681501 7fa9cf462700 10 mon.a@0(leader).mds e33 _updated mds.0 mdsbeacon(4208/a up:active seq 35 v32) v4
2016-02-13 07:57:41.681512 7fa9cf462700  0 log_channel(cluster) log [INF] : mds.0 172.21.15.46:6805/32687 up:active
2016-02-13 07:57:41.681527 7fa9cf462700  1 -- 172.21.15.46:6789/0 --> 172.21.15.46:6789/0 -- log(1 entries from seq 321 at 2016-02-13 07:57:41.681515) v1 -- ?+0 0x7fa9e0a72880 con 0x7fa9e077ad00
2016-02-13 07:57:41.681544 7fa9cf462700  2 mon.a@0(leader) e1 send_reply 0x7fa9e0a42000 0x7fa9e1204d00 mdsbeacon(4208/a up:active seq 35 v33) v4
2016-02-13 07:57:41.681548 7fa9cf462700  1 -- 172.21.15.46:6789/0 --> 172.21.15.46:6805/32687 -- mdsbeacon(4208/a up:active seq 35 v33) v4 -- ?+0 0x7fa9e1204d00 con 0x7fa9e0a0bb00
2016-02-13 07:57:41.681577 7fa9cf462700 10 mon.a@0(leader).mds e33 e33: 2/2/2 up {0=a=up:active,1=b=up:active(laggy or crashed)}

Ah, this was the bug that was fixed by:

commit dff71a5da1de569928619deeb2f832dcb744f5c4
Author: Yan, Zheng <zyan@redhat.com>
Date:   Tue Feb 23 17:11:17 2016 +0800

    mon/MDSMonitor.cc: properly note beacon when health metrics changes

    Fixes: #14684
    Signed-off-by: Yan, Zheng <zyan@redhat.com>

diff --git a/src/mon/MDSMonitor.cc b/src/mon/MDSMonitor.cc
index 4238ffb..79d1500 100644
--- a/src/mon/MDSMonitor.cc
+++ b/src/mon/MDSMonitor.cc
@@ -367,6 +367,7 @@ bool MDSMonitor::preprocess_beacon(MonOpRequestRef op)
   // do not match, to update our stored
   if (!(pending_daemon_health[gid] == m->get_health())) {
     dout(20) << __func__ << " health metrics for gid " << gid << " were updated" << dendl;
+    _note_beacon(m);
     return false;
   }
