Project

General

Profile

Bug #25076

MON crash when upgrading luminous v12.2.7 -> mimic v13.2.0 during ceph-fuse task

Added by Nathan Cutler 12 months ago. Updated 11 months ago.

Status:
Duplicate
Priority:
Normal
Category:
-
Target version:
-
Start date:
07/24/2018
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

Teuthology log: http://qa-proxy.ceph.com/teuthology/smithfarm-2018-07-24_02:10:24-upgrade:luminous-x-mimic-distro-basic-smithi/2807843/teuthology.log

Cluster topology: 2-node cluster, mon.a and mgr.x are on one host, mon.b and mon.c are on the other.

Background: the MONs are being upgraded while ceph-fuse task is running. It looks like one of the MONs crashes.

Packages are upgraded from v12.2.7 to v13.2.0

2018-07-24T06:55:58.424 INFO:teuthology.packaging:The installed version of ceph is 13.2.0-386-g181734b-1xenial
2018-07-24T06:55:58.424 INFO:teuthology.task.install:The correct ceph version 13.2.0-386-g181734b-1xenial is installed.
2018-07-24T06:55:58.425 INFO:teuthology.run_tasks:Running task print...
2018-07-24T06:55:58.653 INFO:teuthology.task.print:**** done install.upgrade both hosts

We start "parallel", which in practice means we will restart the daemons while running ceph-fuse task:

2018-07-24T06:55:58.653 INFO:teuthology.run_tasks:Running task parallel...
2018-07-24T06:55:58.793 INFO:teuthology.task.parallel:starting parallel...
2018-07-24T06:55:58.794 INFO:teuthology.task.parallel:In parallel, running task full_sequential...
2018-07-24T06:55:58.794 INFO:teuthology.task.full_sequential:In full_sequential, running task sequential...
2018-07-24T06:55:58.795 INFO:teuthology.task.sequential:In sequential, running task ceph-fuse...
2018-07-24T06:55:58.813 INFO:tasks.ceph_fuse:Mounting ceph-fuse clients...
2018-07-24T06:55:58.813 INFO:tasks.ceph_fuse:Wait for MDS to reach steady state...

We join the action a little while later - immediately prior to restarting the MONs:

2018-07-24T06:55:59.731 INFO:teuthology.orchestra.run.smithi157:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd pool get cephfs_data pg_num'

This command does not immediately complete. Immediately after that log message the MON restarts begin:

2018-07-24T06:55:59.736 INFO:tasks.ceph.mon.a:Restarting daemon
2018-07-24T06:55:59.736 INFO:tasks.ceph.mon.a:Stopping old one...
2018-07-24T06:55:59.737 DEBUG:tasks.ceph.mon.a:waiting for process to exit
2018-07-24T06:55:59.737 INFO:teuthology.orchestra.run:waiting for 300
2018-07-24T06:55:59.782 INFO:tasks.ceph.mon.a:Stopped
2018-07-24T06:55:59.782 INFO:teuthology.orchestra.run.smithi157:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f --cluster ceph -i a'
2018-07-24T06:55:59.794 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 06:55:59.786 7fe6e733f700 -1 WARNING: all dangerous and experimental features are enabled.

In other words, mon.a is running on smithi157.

2018-07-24T06:55:59.801 INFO:tasks.ceph.mon.a:Started
2018-07-24T06:55:59.801 INFO:tasks.ceph.mon.b:Restarting daemon
2018-07-24T06:55:59.801 INFO:tasks.ceph.mon.b:Stopping old one...
2018-07-24T06:55:59.801 DEBUG:tasks.ceph.mon.b:waiting for process to exit
2018-07-24T06:55:59.801 INFO:teuthology.orchestra.run:waiting for 300
2018-07-24T06:55:59.813 INFO:tasks.ceph.mon.b:Stopped
2018-07-24T06:55:59.813 INFO:teuthology.orchestra.run.smithi066:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f --cluster ceph -i b'
2018-07-24T06:55:59.833 INFO:tasks.ceph.mon.a.smithi157.stderr:2018-07-24 06:55:59.826 7fa0897fe180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:55:59.834 INFO:tasks.ceph.mon.a.smithi157.stderr:2018-07-24 06:55:59.826 7fa0897fe180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:55:59.834 INFO:tasks.ceph.mon.a.smithi157.stderr:2018-07-24 06:55:59.826 7fa0897fe180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:55:59.862 INFO:tasks.ceph.mon.b:Started
2018-07-24T06:55:59.863 INFO:tasks.ceph.mon.c:Restarting daemon
2018-07-24T06:55:59.863 INFO:tasks.ceph.mon.c:Stopping old one...
2018-07-24T06:55:59.863 DEBUG:tasks.ceph.mon.c:waiting for process to exit
2018-07-24T06:55:59.863 INFO:teuthology.orchestra.run:waiting for 300
2018-07-24T06:55:59.876 INFO:tasks.ceph.mon.c:Stopped
2018-07-24T06:55:59.877 INFO:teuthology.orchestra.run.smithi066:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f --cluster ceph -i c'

mon.b and mon.c are running on smithi066.

2018-07-24T06:55:59.927 INFO:tasks.ceph.mon.b.smithi066.stderr:2018-07-24 06:55:59.919 7fca222cc180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:55:59.927 INFO:tasks.ceph.mon.b.smithi066.stderr:2018-07-24 06:55:59.919 7fca222cc180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:55:59.928 INFO:tasks.ceph.mon.b.smithi066.stderr:2018-07-24 06:55:59.923 7fca222cc180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:55:59.945 INFO:tasks.ceph.mon.c:Started
2018-07-24T06:55:59.945 INFO:tasks.ceph.mgr.x:Restarting daemon
2018-07-24T06:55:59.946 INFO:tasks.ceph.mgr.x:Stopping old one...
2018-07-24T06:55:59.946 DEBUG:tasks.ceph.mgr.x:waiting for process to exit
2018-07-24T06:55:59.946 INFO:teuthology.orchestra.run:waiting for 300
2018-07-24T06:55:59.958 INFO:tasks.ceph.mgr.x:Stopped
2018-07-24T06:55:59.958 INFO:teuthology.orchestra.run.smithi157:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mgr -f --cluster ceph -i x'
2018-07-24T06:56:00.000 INFO:tasks.ceph.mon.c.smithi066.stderr:2018-07-24 06:55:59.995 7f817594c180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:56:00.000 INFO:tasks.ceph.mon.c.smithi066.stderr:2018-07-24 06:55:59.995 7f817594c180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:56:00.002 INFO:tasks.ceph.mon.c.smithi066.stderr:2018-07-24 06:55:59.995 7f817594c180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:56:00.027 INFO:tasks.ceph.mgr.x:Started
2018-07-24T06:56:00.027 INFO:teuthology.orchestra.run.smithi157:Running: "sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell 'mon.*' injectargs --mon_health_to_clog=true" 
2018-07-24T06:56:00.091 INFO:tasks.ceph.mgr.x.smithi157.stderr:2018-07-24 06:56:00.082 7fa8b3cc8380 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:56:00.112 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 06:56:00.106 7fc74cbfe700 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:56:05.201 INFO:tasks.ceph.mgr.x.smithi157.stderr:2018-07-24 06:56:05.194 7fa8b3cc8380 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:56:05.206 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 06:56:05.194 7fc74cbfe700 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:56:05.206 INFO:tasks.ceph.mgr.x.smithi157.stderr:2018-07-24 06:56:05.198 7fa8b3cc8380 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:56:05.207 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 06:56:05.198 7fe6e733f700 -1 WARNING: all dangerous and experimental features are enabled.

mgr.x is running on smithi157. During this time, all three MONs and the sole MGR are undergoing restart via the following yaml:

    - ceph.restart:
        daemons:
        - mon.a
        - mon.b
        - mon.c
        - mgr.x
        mon-health-to-clog: false

We get the output from the "ceph osd pool get cephfs_data pg_num" command and immediately run next ceph-fuse task command, which is "ceph fs dump".

2018-07-24T06:56:05.369 INFO:teuthology.orchestra.run.smithi157.stdout:pg_num: 4
2018-07-24T06:56:05.370 INFO:teuthology.orchestra.run.smithi157:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs dump --format=json'

And BOOM - mon.a crashes:

2018-07-24T06:56:05.379 INFO:teuthology.orchestra.run.smithi157.stderr:mon.b: injectargs:mon_health_to_clog = 'true'
2018-07-24T06:56:05.401 INFO:tasks.ceph.mon.a.smithi157.stderr:*** Caught signal (Bus error) **
2018-07-24T06:56:05.401 INFO:tasks.ceph.mon.a.smithi157.stderr: in thread 7fa0787c9700 thread_name:OpHistorySvc
2018-07-24T06:56:05.426 INFO:tasks.ceph.mon.a.smithi157.stderr: ceph version 13.2.0-386-g181734b (181734be624fdf1a164ef34250e58cfe83a1630b) mimic (stable)
2018-07-24T06:56:05.426 INFO:tasks.ceph.mon.a.smithi157.stderr: 1: (()+0x4b3cf0) [0x558522ad5cf0]
2018-07-24T06:56:05.426 INFO:tasks.ceph.mon.a.smithi157.stderr: 2: (()+0x11390) [0x7fa080444390]
2018-07-24T06:56:05.426 INFO:tasks.ceph.mon.a.smithi157.stderr: 3: (RefCountedObject::put() const+0xb8) [0x5585228cbf68]
2018-07-24T06:56:05.426 INFO:tasks.ceph.mon.a.smithi157.stderr: 4: (MonOpRequest::~MonOpRequest()+0x32) [0x5585228cc262]
2018-07-24T06:56:05.427 INFO:tasks.ceph.mon.a.smithi157.stderr: 5: (std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<double, boost::intrusive_ptr<TrackedOp> > >)+0x83) [0x7fa080b57333]
2018-07-24T06:56:05.427 INFO:tasks.ceph.mon.a.smithi157.stderr: 6: (OpHistory::cleanup(utime_t)+0x2b4) [0x7fa080b53fb4]
2018-07-24T06:56:05.427 INFO:tasks.ceph.mon.a.smithi157.stderr: 7: (OpHistory::_insert_delayed(utime_t const&, boost::intrusive_ptr<TrackedOp>)+0x20c) [0x7fa080b5493c]
2018-07-24T06:56:05.427 INFO:tasks.ceph.mon.a.smithi157.stderr: 8: (OpHistoryServiceThread::entry()+0xe9) [0x7fa080b54d19]
2018-07-24T06:56:05.427 INFO:tasks.ceph.mon.a.smithi157.stderr: 9: (()+0x76ba) [0x7fa08043a6ba]
2018-07-24T06:56:05.427 INFO:tasks.ceph.mon.a.smithi157.stderr: 10: (clone()+0x6d) [0x7fa07f15e41d]

Some time later, the "workload" composite task completes, but upgrade-sequence (which is running in parallel) never finishes because the initial ceph.restart fails to restart all the MONs.

The test doesn't continue past this point. There is a two-hour gap, after which something on smithi157 (presumably one of the osds) starts complaining that it can't reach a MON (?):

2018-07-24T07:51:59.381 INFO:teuthology.task.print:**** done rbd/test_librbd_python.sh 2-workload
2018-07-24T09:58:49.684 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 09:58:49.680 7f5c3df3d700  0 -- 172.21.15.157:0/2764616857 >> 172.21.15.157:6800/15306 conn(0x7f5c1801ac00 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2018-07-24T09:58:49.685 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 09:58:49.680 7f5c3df3d700  0 -- 172.21.15.157:0/2764616857 >> 172.21.15.157:6800/15306 conn(0x7f5c1801ac00 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2018-07-24T09:58:49.886 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 09:58:49.884 7f5c3df3d700  0 -- 172.21.15.157:0/2764616857 >> 172.21.15.157:6800/15306 conn(0x7f5c1801ac00 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2018-07-24T09:58:49.887 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 09:58:49.884 7f5c3df3d700  0 -- 172.21.15.157:0/2764616857 >> 172.21.15.157:6800/15306 conn(0x7f5c1801ac00 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2018-07-24T09:58:50.292 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 09:58:50.288 7f5c3df3d700  0 -- 172.21.15.157:0/2764616857 >> 172.21.15.157:6800/15306 conn(0x7f5c1801ac00 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
...
(lots more of these)

Related issues

Duplicates RADOS - Bug #24664: osd: crash in OpTracker::unregister_inflight_op via OSD::get_health_metrics Pending Backport 06/26/2018

History

#1 Updated by Nathan Cutler 12 months ago

  • Description updated (diff)

#2 Updated by Radoslaw Zarzynski 12 months ago

  • Assignee set to Radoslaw Zarzynski

#3 Updated by Radoslaw Zarzynski 12 months ago

It appears the crash can be explained on the same basis as in the case of bug #24664 . Monitors, similarly to OSDs, use OpTracker::visit_ops_in_flight() in the affected way:

  • driven just by the timer (unsynchronized with the rest) and
  • exposing a TrackedOpRef instance (through oldest_op) outside of the critical section protected by OpTracker shard's lock (ShardedTrackingData::ops_in_flight_lock_sharded).

This can lead to calling destructor twice (nref := 0, then nref := 1 and finally the extra nref := 0) in a manner described in the comment to PR #22877.

void Monitor::tick()
{
  // ...

  mgr_client.update_daemon_health(get_health_metrics());
  new_tick();
}

vector<DaemonHealthMetric> Monitor::get_health_metrics()
{
  // ...
  TrackedOpRef oldest_op;
  auto count_slow_ops = [&](TrackedOp& op) {
      // ...
        oldest_op = &op;
      // ...
  };
  if (op_tracker.visit_ops_in_flight(&oldest_secs, count_slow_ops)) {
    // ...
  }
  // ...
}

The bug has been fixed in master and awaits backport (staging).

$ git show 181734be624fdf1a164ef34250e58cfe83a1630b:../src/common/TrackedOp.h
  // ...
  void put() {
    if (--nref == 0) {
      switch (state.load()) {
      case STATE_UNTRACKED:
        _unregistered();
        delete this;
        break;

      case STATE_LIVE:
        mark_event("done");
        tracker->unregister_inflight_op(this);
        break;

      case STATE_HISTORY:
        delete this;
        break;

      default:
        ceph_abort();
      }
    }
  }

#4 Updated by Radoslaw Zarzynski 12 months ago

  • Duplicates Bug #24664: osd: crash in OpTracker::unregister_inflight_op via OSD::get_health_metrics added

#5 Updated by Sage Weil 11 months ago

  • Status changed from New to Duplicate

Also available in: Atom PDF