Bug #25076

Updated by Nathan Cutler over 5 years ago

Teuthology log: http://qa-proxy.ceph.com/teuthology/smithfarm-2018-07-24_02:10:24-upgrade:luminous-x-mimic-distro-basic-smithi/2807843/teuthology.log

Cluster topology: 2-node cluster, mon.a and mgr.x are on one host, mon.b and mon.c are on the other.

Background: the MONs are being upgraded while the ceph-fuse task is running. It looks like one of the MONs crashes.

Packages are upgraded from v12.2.7 to v13.2.0:

<pre>2018-07-24T06:55:58.424 INFO:teuthology.packaging:The installed version of ceph is 13.2.0-386-g181734b-1xenial
2018-07-24T06:55:58.424 INFO:teuthology.task.install:The correct ceph version 13.2.0-386-g181734b-1xenial is installed.
2018-07-24T06:55:58.425 INFO:teuthology.run_tasks:Running task print...
2018-07-24T06:55:58.653 INFO:teuthology.task.print:**** done install.upgrade both hosts
</pre>

We start "parallel", which in practice means we will restart the daemons while running ceph-fuse task:

<pre>2018-07-24T06:55:58.653 INFO:teuthology.run_tasks:Running task parallel...
2018-07-24T06:55:58.793 INFO:teuthology.task.parallel:starting parallel...
2018-07-24T06:55:58.794 INFO:teuthology.task.parallel:In parallel, running task full_sequential...
2018-07-24T06:55:58.794 INFO:teuthology.task.full_sequential:In full_sequential, running task sequential...
2018-07-24T06:55:58.795 INFO:teuthology.task.sequential:In sequential, running task ceph-fuse...
2018-07-24T06:55:58.813 INFO:tasks.ceph_fuse:Mounting ceph-fuse clients...
2018-07-24T06:55:58.813 INFO:tasks.ceph_fuse:Wait for MDS to reach steady state...
</pre>
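
For orientation, the overall task structure implied by these log lines (together with the ceph.restart fragment and the "workload"/"upgrade-sequence" names mentioned below) is roughly the following. This is a sketch reconstructed from the log, not the suite's actual yaml, so the exact layout may differ:

<pre>tasks:
- parallel:
  - workload
  - upgrade-sequence
workload:
  full_sequential:
  - sequential:
    - ceph-fuse:
    # ... fs commands such as "ceph osd pool get cephfs_data pg_num" follow
upgrade-sequence:
  sequential:
  - ceph.restart:
      daemons: [mon.a, mon.b, mon.c, mgr.x]
      mon-health-to-clog: false
  # ... further restarts follow
</pre>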

We pick up the action a little while later, immediately prior to restarting the MONs:

<pre>2018-07-24T06:55:59.731 INFO:teuthology.orchestra.run.smithi157:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd pool get cephfs_data pg_num'
</pre>

This ceph-fuse task command does not complete right away. Immediately after that log message, the MON restarts begin:

<pre>2018-07-24T06:55:59.736 INFO:tasks.ceph.mon.a:Restarting daemon
2018-07-24T06:55:59.736 INFO:tasks.ceph.mon.a:Stopping old one...
2018-07-24T06:55:59.737 DEBUG:tasks.ceph.mon.a:waiting for process to exit
2018-07-24T06:55:59.737 INFO:teuthology.orchestra.run:waiting for 300
2018-07-24T06:55:59.782 INFO:tasks.ceph.mon.a:Stopped
2018-07-24T06:55:59.782 INFO:teuthology.orchestra.run.smithi157:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f --cluster ceph -i a'
2018-07-24T06:55:59.794 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 06:55:59.786 7fe6e733f700 -1 WARNING: all dangerous and experimental features are enabled.
</pre>

In other words, mon.a is running on smithi157.

<pre>2018-07-24T06:55:59.801 INFO:tasks.ceph.mon.a:Started
2018-07-24T06:55:59.801 INFO:tasks.ceph.mon.b:Restarting daemon
2018-07-24T06:55:59.801 INFO:tasks.ceph.mon.b:Stopping old one...
2018-07-24T06:55:59.801 DEBUG:tasks.ceph.mon.b:waiting for process to exit
2018-07-24T06:55:59.801 INFO:teuthology.orchestra.run:waiting for 300
2018-07-24T06:55:59.813 INFO:tasks.ceph.mon.b:Stopped
2018-07-24T06:55:59.813 INFO:teuthology.orchestra.run.smithi066:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f --cluster ceph -i b'
2018-07-24T06:55:59.833 INFO:tasks.ceph.mon.a.smithi157.stderr:2018-07-24 06:55:59.826 7fa0897fe180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:55:59.834 INFO:tasks.ceph.mon.a.smithi157.stderr:2018-07-24 06:55:59.826 7fa0897fe180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:55:59.834 INFO:tasks.ceph.mon.a.smithi157.stderr:2018-07-24 06:55:59.826 7fa0897fe180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:55:59.862 INFO:tasks.ceph.mon.b:Started
2018-07-24T06:55:59.863 INFO:tasks.ceph.mon.c:Restarting daemon
2018-07-24T06:55:59.863 INFO:tasks.ceph.mon.c:Stopping old one...
2018-07-24T06:55:59.863 DEBUG:tasks.ceph.mon.c:waiting for process to exit
2018-07-24T06:55:59.863 INFO:teuthology.orchestra.run:waiting for 300
2018-07-24T06:55:59.876 INFO:tasks.ceph.mon.c:Stopped
2018-07-24T06:55:59.877 INFO:teuthology.orchestra.run.smithi066:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mon -f --cluster ceph -i c'
</pre>

mon.b and mon.c are running on smithi066.

<pre>2018-07-24T06:55:59.927 INFO:tasks.ceph.mon.b.smithi066.stderr:2018-07-24 06:55:59.919 7fca222cc180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:55:59.927 INFO:tasks.ceph.mon.b.smithi066.stderr:2018-07-24 06:55:59.919 7fca222cc180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:55:59.928 INFO:tasks.ceph.mon.b.smithi066.stderr:2018-07-24 06:55:59.923 7fca222cc180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:55:59.945 INFO:tasks.ceph.mon.c:Started
2018-07-24T06:55:59.945 INFO:tasks.ceph.mgr.x:Restarting daemon
2018-07-24T06:55:59.946 INFO:tasks.ceph.mgr.x:Stopping old one...
2018-07-24T06:55:59.946 DEBUG:tasks.ceph.mgr.x:waiting for process to exit
2018-07-24T06:55:59.946 INFO:teuthology.orchestra.run:waiting for 300
2018-07-24T06:55:59.958 INFO:tasks.ceph.mgr.x:Stopped
2018-07-24T06:55:59.958 INFO:teuthology.orchestra.run.smithi157:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage daemon-helper kill ceph-mgr -f --cluster ceph -i x'
2018-07-24T06:56:00.000 INFO:tasks.ceph.mon.c.smithi066.stderr:2018-07-24 06:55:59.995 7f817594c180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:56:00.000 INFO:tasks.ceph.mon.c.smithi066.stderr:2018-07-24 06:55:59.995 7f817594c180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:56:00.002 INFO:tasks.ceph.mon.c.smithi066.stderr:2018-07-24 06:55:59.995 7f817594c180 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:56:00.027 INFO:tasks.ceph.mgr.x:Started
2018-07-24T06:56:00.027 INFO:teuthology.orchestra.run.smithi157:Running: "sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell 'mon.*' injectargs --mon_health_to_clog=true"
2018-07-24T06:56:00.091 INFO:tasks.ceph.mgr.x.smithi157.stderr:2018-07-24 06:56:00.082 7fa8b3cc8380 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:56:00.112 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 06:56:00.106 7fc74cbfe700 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:56:05.201 INFO:tasks.ceph.mgr.x.smithi157.stderr:2018-07-24 06:56:05.194 7fa8b3cc8380 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:56:05.206 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 06:56:05.194 7fc74cbfe700 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:56:05.206 INFO:tasks.ceph.mgr.x.smithi157.stderr:2018-07-24 06:56:05.198 7fa8b3cc8380 -1 WARNING: all dangerous and experimental features are enabled.
2018-07-24T06:56:05.207 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 06:56:05.198 7fe6e733f700 -1 WARNING: all dangerous and experimental features are enabled.
</pre>

mgr.x is running on smithi157. During this time, all three MONs and the sole MGR are being restarted via the following yaml fragment:

<pre>- ceph.restart:
    daemons:
    - mon.a
    - mon.b
    - mon.c
    - mgr.x
    mon-health-to-clog: false
</pre>
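
Note the interplay with the injectargs call at 06:56:00.027 above: the yaml passes mon-health-to-clog: false, and once all three MONs have been restarted the task appears to switch mon_health_to_clog back to true via "ceph tell 'mon.*' injectargs".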

We get the output from the "ceph osd pool get cephfs_data pg_num" command and immediately run the next ceph-fuse task command, which is "ceph fs dump":

<pre>2018-07-24T06:56:05.369 INFO:teuthology.orchestra.run.smithi157.stdout:pg_num: 4
2018-07-24T06:56:05.370 INFO:teuthology.orchestra.run.smithi157:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs dump --format=json'
</pre>

And BOOM, mon.a crashes:

<pre>2018-07-24T06:56:05.379 INFO:teuthology.orchestra.run.smithi157.stderr:mon.b: injectargs:mon_health_to_clog = 'true'
2018-07-24T06:56:05.401 INFO:tasks.ceph.mon.a.smithi157.stderr:*** Caught signal (Bus error) **
2018-07-24T06:56:05.401 INFO:tasks.ceph.mon.a.smithi157.stderr: in thread 7fa0787c9700 thread_name:OpHistorySvc
2018-07-24T06:56:05.426 INFO:tasks.ceph.mon.a.smithi157.stderr: ceph version 13.2.0-386-g181734b (181734be624fdf1a164ef34250e58cfe83a1630b) mimic (stable)
2018-07-24T06:56:05.426 INFO:tasks.ceph.mon.a.smithi157.stderr: 1: (()+0x4b3cf0) [0x558522ad5cf0]
2018-07-24T06:56:05.426 INFO:tasks.ceph.mon.a.smithi157.stderr: 2: (()+0x11390) [0x7fa080444390]
2018-07-24T06:56:05.426 INFO:tasks.ceph.mon.a.smithi157.stderr: 3: (RefCountedObject::put() const+0xb8) [0x5585228cbf68]
2018-07-24T06:56:05.426 INFO:tasks.ceph.mon.a.smithi157.stderr: 4: (MonOpRequest::~MonOpRequest()+0x32) [0x5585228cc262]
2018-07-24T06:56:05.427 INFO:tasks.ceph.mon.a.smithi157.stderr: 5: (std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<double, boost::intrusive_ptr<TrackedOp> > >)+0x83) [0x7fa080b57333]
2018-07-24T06:56:05.427 INFO:tasks.ceph.mon.a.smithi157.stderr: 6: (OpHistory::cleanup(utime_t)+0x2b4) [0x7fa080b53fb4]
2018-07-24T06:56:05.427 INFO:tasks.ceph.mon.a.smithi157.stderr: 7: (OpHistory::_insert_delayed(utime_t const&, boost::intrusive_ptr<TrackedOp>)+0x20c) [0x7fa080b5493c]
2018-07-24T06:56:05.427 INFO:tasks.ceph.mon.a.smithi157.stderr: 8: (OpHistoryServiceThread::entry()+0xe9) [0x7fa080b54d19]
2018-07-24T06:56:05.427 INFO:tasks.ceph.mon.a.smithi157.stderr: 9: (()+0x76ba) [0x7fa08043a6ba]
2018-07-24T06:56:05.427 INFO:tasks.ceph.mon.a.smithi157.stderr: 10: (clone()+0x6d) [0x7fa07f15e41d]
</pre>
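
Reading the backtrace bottom-up: the OpHistorySvc thread is inserting a delayed op (OpHistory::_insert_delayed), which triggers trimming of old history entries (OpHistory::cleanup); erasing an entry destroys the last intrusive_ptr to a MonOpRequest, and the RefCountedObject::put() call in the MonOpRequest destructor is where the bus error is caught. In other words, the freshly restarted 13.2.0 mon.a dies while trimming its op history in a background thread.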

Some time later, the "workload" composite task completes, but "upgrade-sequence" (which runs in parallel with it) never finishes: the initial ceph.restart fails to bring all of the MONs back up.

The test doesn't continue past this point. There is a two-hour gap, after which something on smithi157 (one of the OSDs? mgr.x?) starts complaining that it can't reach the MONs (?):

<pre>2018-07-24T07:51:59.381 INFO:teuthology.task.print:**** done rbd/test_librbd_python.sh 2-workload
2018-07-24T09:58:49.684 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 09:58:49.680 7f5c3df3d700 0 -- 172.21.15.157:0/2764616857 >> 172.21.15.157:6800/15306 conn(0x7f5c1801ac00 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2018-07-24T09:58:49.685 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 09:58:49.680 7f5c3df3d700 0 -- 172.21.15.157:0/2764616857 >> 172.21.15.157:6800/15306 conn(0x7f5c1801ac00 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2018-07-24T09:58:49.886 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 09:58:49.884 7f5c3df3d700 0 -- 172.21.15.157:0/2764616857 >> 172.21.15.157:6800/15306 conn(0x7f5c1801ac00 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2018-07-24T09:58:49.887 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 09:58:49.884 7f5c3df3d700 0 -- 172.21.15.157:0/2764616857 >> 172.21.15.157:6800/15306 conn(0x7f5c1801ac00 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2018-07-24T09:58:50.292 INFO:teuthology.orchestra.run.smithi157.stderr:2018-07-24 09:58:50.288 7f5c3df3d700 0 -- 172.21.15.157:0/2764616857 >> 172.21.15.157:6800/15306 conn(0x7f5c1801ac00 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
...
(lots more of these)
</pre>
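
For what it's worth, BADAUTHORIZER means the peer rejected the connecting client's cephx authorizer. A plausible (but unverified) explanation for it showing up only after a long gap: with the MONs stuck, daemons cannot renew their rotating cephx service keys, so the tickets they present eventually expire and every reconnect attempt is rejected.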

Back