Bug #7679

closed

mds: stuck on TMAP2OMAP check incorrectly

Added by Sage Weil about 10 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: documentation
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ubuntu@teuthology:/a/teuthology-2014-03-10_10:33:21-upgrade:dumpling-x:parallel-firefly---basic-plana/124951

The mds was restarted after the osds, but it still got stuck on the TMAP2OMAP check.

Actions #1

Updated by Sage Weil about 10 years ago

  • Assignee set to Sage Weil
Actions #2

Updated by Sage Weil about 10 years ago

  "osd_xinfo": [
        { "osd": 0,
          "down_stamp": "2014-03-10 13:26:30.366188",
          "laggy_probability": "0.000000",
          "laggy_interval": 0,
          "features": 0},
        { "osd": 1,
          "down_stamp": "2014-03-10 13:27:40.512339",
          "laggy_probability": "0.000000",
          "laggy_interval": 0,
          "features": 0},
        { "osd": 2,
          "down_stamp": "2014-03-10 13:28:53.373336",
          "laggy_probability": "0.000000",
          "laggy_interval": 0,
          "features": 0},
        { "osd": 3,
          "down_stamp": "2014-03-10 13:30:07.786623",
          "laggy_probability": "0.000000",
          "laggy_interval": 0,
          "features": 0}],

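A minimal sketch of how the fragment above can be checked mechanically, assuming the full OSD map dump is available as JSON and has been parsed into a dict. The keys "osd_xinfo", "osd" and "features" come from the dump above; the function name and the shortened sample are hypothetical.

```python
import json


def osds_without_features(osd_dump):
    """Return the ids of OSDs whose osd_xinfo entry reports features == 0."""
    return [
        entry["osd"]
        for entry in osd_dump.get("osd_xinfo", [])
        if int(entry.get("features", 0)) == 0
    ]


# Shortened sample with the same shape as the dump in this ticket
# (illustrative values only; the second entry's feature value is made up).
sample = json.loads("""
{ "osd_xinfo": [
    { "osd": 0, "laggy_interval": 0, "features": 0},
    { "osd": 1, "laggy_interval": 0, "features": 1}]}
""")

print(osds_without_features(sample))
```

Any id returned here is an OSD the map records with no features, which is the condition the MDS check reacts to.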
Actions #3

Updated by Sage Weil about 10 years ago

b4fbe4f81348be74c654f3dae1c20a961b99c895 and a later commit fixed feature forwarding, which is needed for the osdmap to have the osd features if the boot messages don't go straight to the primary.

more importantly, the osdmap won't have features at all if the mon isn't running a new release.

so: upgrade mons before osds.
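
The ordering rule can be sketched as a sort over daemon names. This is an illustration only, assuming names of the form "role.id" as they appear in the teuthology logs; placing the mds last follows the MDS error message ("upgrade OSDs before starting MDS"). The names ROLE_ORDER and restart_order are hypothetical.

```python
# Hypothetical helper: mons first, then osds, then the mds.
ROLE_ORDER = {"mon": 0, "osd": 1, "mds": 2}


def restart_order(daemons):
    """Order daemon names by role; the sort is stable, so the input
    order is preserved within each role."""
    return sorted(daemons, key=lambda name: ROLE_ORDER[name.split(".", 1)[0]])


print(restart_order(["osd.0", "mds.a", "mon.a", "osd.1", "mon.b"]))
```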

Actions #4

Updated by Sage Weil about 10 years ago

  • Status changed from 12 to Resolved
Actions #5

Updated by Yuri Weinstein over 9 years ago

  • Status changed from Resolved to New

I see a similar problem on the latest runs:

http://qa-proxy.ceph.com/teuthology/teuthology-2014-10-28_17:00:01-upgrade:firefly:older-firefly-distro-basic-multi/576877/teuthology.log

2014-10-30T10:20:12.855 INFO:tasks.ceph.mds.a.plana92.stderr:2014-10-30 10:20:12.854915 7f3c980e1780 -1 mds.-1.-1 *** one or more OSDs do not support TMAP2OMAP; upgrade OSDs before starting MDS (or downgrade MDS) ***

and was able to reproduce for this test http://qa-proxy.ceph.com/teuthology/teuthology-2014-11-02_17:03:01-upgrade:firefly:older-firefly-distro-basic-vps/583041/

So reopening this ticket.

Also, this run http://pulpito.front.sepia.ceph.com/teuthology-2014-11-02_17:00:02-upgrade:firefly:newer-firefly-distro-basic-vps/
seems to be doing "Restarting daemon" in an endless loop (see, for example, job http://qa-proxy.ceph.com/teuthology/teuthology-2014-11-02_17:00:02-upgrade:firefly:newer-firefly-distro-basic-vps/582972/teuthology.log)

Actions #6

Updated by John Spray over 9 years ago

  • Assignee changed from Sage Weil to Yuri Weinstein

I'm confused by the order of operations in the tests: it seems like there is an upgrade, then a restart, then an upgrade, then a restart, then an upgrade (but no restart). For example, looking at /a/teuthology-2014-10-28_17:00:01-upgrade:firefly:older-firefly-distro-basic-multi/576877

grep -e install.upgrade -e Restarting\ daemon -e Upgrading -e config.*firefly teuthology.log
2014-10-30T10:09:46.848 INFO:tasks.ceph.mon.b:Restarting daemon
2014-10-30T10:09:46.852 INFO:tasks.ceph.mon.c:Restarting daemon
2014-10-30T10:09:46.855 INFO:tasks.ceph.mon.a:Restarting daemon
2014-10-30T10:09:46.889 INFO:tasks.ceph.osd.3:Restarting daemon
2014-10-30T10:09:46.892 INFO:tasks.ceph.osd.4:Restarting daemon
2014-10-30T10:09:46.894 INFO:tasks.ceph.osd.5:Restarting daemon
2014-10-30T10:09:46.896 INFO:tasks.ceph.osd.0:Restarting daemon
2014-10-30T10:09:46.899 INFO:tasks.ceph.osd.1:Restarting daemon
2014-10-30T10:09:46.901 INFO:tasks.ceph.osd.2:Restarting daemon
2014-10-30T10:09:46.904 INFO:tasks.ceph.mds.a:Restarting daemon
2014-10-30T10:10:02.518 INFO:teuthology.task.sequential:In sequential, running task install.upgrade...
2014-10-30T10:10:02.519 INFO:teuthology.task.install:project ceph config {'mon.a': {'branch': 'firefly'}, 'mon.b': {'branch': 'firefly'}} overrides {'sha1': '6fd88792e77cdc7ad33ff0acf9b3189a7c525430'}
2014-10-30T10:10:02.519 INFO:teuthology.task.install:remote ubuntu@mira034.front.sepia.ceph.com config {'branch': 'firefly'}
2014-10-30T10:10:02.563 INFO:teuthology.task.install:Upgrading ceph deb packages: ceph, ceph-dbg, ceph-mds, ceph-mds-dbg, ceph-common, ceph-common-dbg, ceph-fuse, ceph-fuse-dbg, ceph-test, ceph-test-dbg, radosgw, radosgw-dbg, python-ceph, libcephfs1, libcephfs1-dbg, libcephfs-java, librados2, librados2-dbg, librbd1, librbd1-dbg
2014-10-30T10:10:32.152 INFO:teuthology.task.install:remote ubuntu@plana92.front.sepia.ceph.com config {'branch': 'firefly'}
2014-10-30T10:10:32.203 INFO:teuthology.task.install:Upgrading ceph deb packages: ceph, ceph-dbg, ceph-mds, ceph-mds-dbg, ceph-common, ceph-common-dbg, ceph-fuse, ceph-fuse-dbg, ceph-test, ceph-test-dbg, radosgw, radosgw-dbg, python-ceph, libcephfs1, libcephfs1-dbg, libcephfs-java, librados2, librados2-dbg, librbd1, librbd1-dbg
2014-10-30T10:11:35.517 INFO:tasks.ceph.osd.0:Restarting daemon
2014-10-30T10:12:11.978 INFO:tasks.ceph.osd.1:Restarting daemon
2014-10-30T10:12:48.443 INFO:tasks.ceph.osd.2:Restarting daemon
2014-10-30T10:13:24.950 INFO:tasks.ceph.osd.3:Restarting daemon
2014-10-30T10:14:08.836 INFO:tasks.ceph.osd.4:Restarting daemon
2014-10-30T10:14:52.560 INFO:tasks.ceph.osd.5:Restarting daemon
2014-10-30T10:16:06.261 INFO:tasks.ceph.mon.a:Restarting daemon
2014-10-30T10:17:16.355 INFO:tasks.ceph.mon.b:Restarting daemon
2014-10-30T10:18:23.296 INFO:tasks.ceph.mon.c:Restarting daemon
2014-10-30T10:19:32.777 INFO:tasks.ceph.mds.a:Restarting daemon
2014-10-30T10:39:11.596 INFO:teuthology.run_tasks:Running task install.upgrade...
2014-10-30T10:39:11.771 INFO:teuthology.task.install:Upgrading ceph deb packages: ceph, ceph-dbg, ceph-mds, ceph-mds-dbg, ceph-common, ceph-common-dbg, ceph-fuse, ceph-fuse-dbg, ceph-test, ceph-test-dbg, radosgw, radosgw-dbg, python-ceph, libcephfs1, libcephfs1-dbg, libcephfs-java, librados2, librados2-dbg, librbd1, librbd1-dbg
2014-10-30T10:40:53.794 INFO:tasks.ceph.mon.a:Restarting daemon
2014-10-30T10:41:47.886 INFO:tasks.ceph.mon.c:Restarting daemon
2014-10-30T10:42:48.666 INFO:tasks.ceph.mon.c:Restarting daemon
2014-10-30T10:43:28.966 INFO:tasks.ceph.mon.b:Restarting daemon
... the mon thrashing goes on

The MDS messages start from just after the 10:19:32.777 MDS restart, so that does seem to be similar to the original problem, but why is the test proceeding to run another install.upgrade at 2014-10-30T10:39:11.596?

Anyway, to make more sense of this I think we need to see the mds/osd/mon logs (there are none here because they are hung/killed runs), and ideally to know exactly what versions of each service are running at the point it's hitting the MDS error, and see the OSD map at that point as well. Hopefully we can see all those things the next time one of these hangs.

Actions #7

Updated by John Spray over 9 years ago

Okay, so we have a better smoking gun: logs in teuthology:~/jcsp/7679. The OSDs all have features set to 0 in the OSD map, so the MDS is behaving correctly.

In this instance the OSDs are being restarted before the mons, so the firefly OSDs are sending their osd_boot messages to a dumpling mon. When the firefly mons start up, the OSD map is not updated to reflect the new features, because the OSDs already booted.

Scrolling up, Sage said that the mons should be restarted before the OSDs. I guess that's still the case, so maybe this is a test regression in the ordering? Sage/Joao probably have a clearer sense of exactly what the right rules are for mon/osd ordering than I do.
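
A rough sketch of how this ordering could be checked from teuthology.log, assuming "Restarting daemon" lines in the format quoted earlier in this ticket and assuming the input is limited to the window of one upgrade round. The function names are hypothetical, and treating "OSD restarted before the last mon restart" as the problem condition is a simplification, not a rule stated by the Ceph developers.

```python
import re

# Matches lines such as:
# 2014-10-30T10:11:35.517 INFO:tasks.ceph.osd.0:Restarting daemon
RESTART_RE = re.compile(
    r"^(\S+) INFO:tasks\.ceph\.((?:mon|osd|mds)\.\w+):Restarting daemon"
)


def restart_sequence(lines):
    """Return (timestamp, daemon) pairs for each restart line, in log order."""
    seq = []
    for line in lines:
        match = RESTART_RE.match(line)
        if match:
            seq.append((match.group(1), match.group(2)))
    return seq


def osds_restarted_before_mons(seq):
    """Return OSDs restarted before the last mon restart in the window.

    An empty list is returned when the window contains no mon restart,
    because there is then nothing to compare against.
    """
    daemons = [daemon for _, daemon in seq]
    mon_positions = [i for i, d in enumerate(daemons) if d.startswith("mon.")]
    if not mon_positions:
        return []
    return [d for d in daemons[:mon_positions[-1]] if d.startswith("osd.")]


# Three lines taken from the grep output in this ticket.
sample = [
    "2014-10-30T10:11:35.517 INFO:tasks.ceph.osd.0:Restarting daemon",
    "2014-10-30T10:16:06.261 INFO:tasks.ceph.mon.a:Restarting daemon",
    "2014-10-30T10:19:32.777 INFO:tasks.ceph.mds.a:Restarting daemon",
]

print(osds_restarted_before_mons(restart_sequence(sample)))
```

On these sample lines osd.0 is flagged, which matches the observation that the OSDs were restarted before the mons in this run.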

Actions #8

Updated by Tamilarasi muthamizhan over 9 years ago

  • Category changed from 1 to documentation
  • Assignee changed from Yuri Weinstein to John Wilkins

https://github.com/ceph/ceph-qa-suite/pull/229 - fixed by Yuri

assigning this to John Wilkins, to make sure we already have this covered in the docs.

Actions #9

Updated by John Wilkins over 9 years ago

  • Status changed from New to In Progress
Actions #10

Updated by John Wilkins over 9 years ago

  • Status changed from In Progress to Resolved

Added a new section for upgrading from Dumpling to Firefly. Reviewed by Tamil.

http://ceph.com/docs/master/install/upgrading-ceph/#dumpling-to-firefly
