Bug #14758 (closed): failed TestJournalRepair test on smithi

Added by Greg Farnum about 8 years ago. Updated about 8 years ago.

Status: Resolved
Priority: High
% Done: 0%
Source: other
Regression: No
Severity: 3 - minor

Description

http://pulpito.ceph.com/teuthology-2016-02-12_14:03:01-fs-jewel---basic-smithi/8218/

2016-02-12T23:58:57.155 INFO:tasks.cephfs_test_runner:======================================================================
2016-02-12T23:58:57.156 INFO:tasks.cephfs_test_runner:ERROR: test_reset (tasks.cephfs.test_journal_repair.TestJournalRepair)
2016-02-12T23:58:57.156 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2016-02-12T23:58:57.156 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2016-02-12T23:58:57.156 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_jewel/tasks/cephfs/test_journal_repair.py", line 245, in test_reset
2016-02-12T23:58:57.157 INFO:tasks.cephfs_test_runner:    self.fs.wait_for_state("up:standby", timeout=60, mds_id=damaged_id)
2016-02-12T23:58:57.157 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_jewel/tasks/cephfs/filesystem.py", line 507, in wait_for_state
2016-02-12T23:58:57.157 INFO:tasks.cephfs_test_runner:    elapsed, goal_state, current_state
2016-02-12T23:58:57.157 INFO:tasks.cephfs_test_runner:RuntimeError: Reached timeout after 60.3698599339 seconds waiting for state up:standby, while in state up:resolve
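
For context: wait_for_state() in the traceback is a polling helper; it keeps reading the daemon's state from the MDS map and raises the RuntimeError above once the timeout elapses without the goal state being reached. A minimal sketch of that pattern (illustrative Python only, not the actual filesystem.py code; get_mds_state is a hypothetical callback standing in for the real map lookup):

import time

def wait_for_state(get_mds_state, goal_state, timeout, poll_interval=1.0):
    """Poll get_mds_state() until it returns goal_state or the timeout elapses."""
    started_at = time.time()
    while True:
        current_state = get_mds_state()
        if current_state == goal_state:
            return
        elapsed = time.time() - started_at
        if elapsed > timeout:
            # Same shape as the error in the traceback above.
            raise RuntimeError(
                "Reached timeout after {0} seconds waiting for state {1}, "
                "while in state {2}".format(elapsed, goal_state, current_state))
        time.sleep(poll_interval)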
#1 - Updated by Greg Farnum about 8 years ago

  • Priority changed from Urgent to High

Hmm, we haven't seen this since that run, and there have been more smithi runs since then, so it's apparently not just infrastructure differences.

#2 - Updated by John Spray about 8 years ago

  • Status changed from New to Rejected

#3 - Updated by John Spray about 8 years ago

  • Status changed from Rejected to New

#4 - Updated by John Spray about 8 years ago

  • Status changed from New to Resolved

In this case, mds.b was going into resolve (rank 0), while mds.a remained in standby. That's correct (rank 1 is damaged).

The test should be waiting for 'a' to go to standby but it's waiting for 'b' to go to standby.

It restarted b expecting it to go to standby while a remained active.

But 'a' restarted too, because it saw itself removed from the mdsmap:

2016-02-13 07:57:48.516326 7f32becb6700  1 mds.a handle_mds_map i (172.21.15.46:6805/32687) dne in the mdsmap, respawning myself
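
("dne" here is "does not exist": the daemon noticed that its own GID was no longer present in the MDS map and restarted itself so it could re-register with the mon. A trivially simplified sketch of that reaction, in Python rather than the actual C++, with respawn passed in as a callback:)

def handle_mds_map(my_gid, map_gids, respawn):
    """Sketch: an MDS that no longer appears in the map restarts itself."""
    if my_gid not in map_gids:
        respawn()  # corresponds to "dne in the mdsmap, respawning myself"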

The mon removed it because it thought it timed out:

2016-02-13 07:57:48.088872 7fa9cf462700 10 mon.a@0(leader).mds e34 no beacon from 4208 172.21.15.46:6805/32687 mds.0.4 up:active since 2016-02-13 07:57:32.668364
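
The check behind that "no beacon from" line is the mon's beacon-grace timeout: if no beacon has been noted for a daemon within the grace period (the mds_beacon_grace option, 15 seconds by default), the mon treats it as laggy and can drop it from the map. A rough sketch of that check, in Python rather than the actual MDSMonitor C++, assuming a simple gid -> last-noted-beacon-time map:

import time

MDS_BEACON_GRACE = 15.0  # seconds; default value of the mds_beacon_grace option

def find_timed_out(last_beacon, now=None):
    """Return the gids whose last noted beacon is older than the grace period."""
    now = time.time() if now is None else now
    return [gid for gid, last in last_beacon.items()
            if now - last > MDS_BEACON_GRACE]

The timestamps fit: the check above at 07:57:48 sees a last noted beacon at 07:57:32, just over the 15-second grace, even though (as shown below) the daemon was still sending beacons at 07:57:44.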

Here's the last beacon exchange in active state:

2016-02-13 07:57:44.668380 7f32bbcb0700  1 -- 172.21.15.46:6805/32687 --> 172.21.15.46:6789/0 -- mdsbeacon(4208/a up:active seq 36 v33) v4 -- ?+0 0x7f32d239c600 con 0x7f32d0b84600
2016-02-13 07:57:48.098621 7f32becb6700  1 -- 172.21.15.46:6805/32687 <== mon.0 172.21.15.46:6789/0 61 ==== mdsbeacon(4208/a up:active seq 36 v34) v4 ==== 113+0+0 (2581632639 0 0) 0x7f32d34e1c00 con 0x7f32d0b84600

OK, the mon is doing something wrong here, because we can see this beacon getting handled:

2016-02-13 07:57:41.681501 7fa9cf462700 10 mon.a@0(leader).mds e33 _updated mds.0 mdsbeacon(4208/a up:active seq 35 v32) v4
2016-02-13 07:57:41.681512 7fa9cf462700  0 log_channel(cluster) log [INF] : mds.0 172.21.15.46:6805/32687 up:active
2016-02-13 07:57:41.681527 7fa9cf462700  1 -- 172.21.15.46:6789/0 --> 172.21.15.46:6789/0 -- log(1 entries from seq 321 at 2016-02-13 07:57:41.681515) v1 -- ?+0 0x7fa9e0a72880 con 0x7fa9e077ad00
2016-02-13 07:57:41.681544 7fa9cf462700  2 mon.a@0(leader) e1 send_reply 0x7fa9e0a42000 0x7fa9e1204d00 mdsbeacon(4208/a up:active seq 35 v33) v4
2016-02-13 07:57:41.681548 7fa9cf462700  1 -- 172.21.15.46:6789/0 --> 172.21.15.46:6805/32687 -- mdsbeacon(4208/a up:active seq 35 v33) v4 -- ?+0 0x7fa9e1204d00 con 0x7fa9e0a0bb00
2016-02-13 07:57:41.681577 7fa9cf462700 10 mon.a@0(leader).mds e33 e33: 2/2/2 up {0=a=up:active,1=b=up:active(laggy or crashed)}

Ah, this was the bug that was fixed by:

commit dff71a5da1de569928619deeb2f832dcb744f5c4
Author: Yan, Zheng <zyan@redhat.com>
Date:   Tue Feb 23 17:11:17 2016 +0800

    mon/MDSMonitor.cc: properly note beacon when health metrics changes

    Fixes: #14684
    Signed-off-by: Yan, Zheng <zyan@redhat.com>

diff --git a/src/mon/MDSMonitor.cc b/src/mon/MDSMonitor.cc
index 4238ffb..79d1500 100644
--- a/src/mon/MDSMonitor.cc
+++ b/src/mon/MDSMonitor.cc
@@ -367,6 +367,7 @@ bool MDSMonitor::preprocess_beacon(MonOpRequestRef op)
   // do not match, to update our stored
   if (!(pending_daemon_health[gid] == m->get_health())) {
     dout(20) << __func__ << " health metrics for gid " << gid << " were updated" << dendl;
+    _note_beacon(m);
     return false;
   }
