Bug #20876

BADAUTHORIZER on mgr, hung ceph tell mon.*

Added by Sage Weil over 6 years ago. Updated almost 6 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2017-08-02T03:18:33.624 INFO:teuthology.orchestra.run.smithi176:Running: "sudo ceph --cluster ceph tell 'mon.*' injectargs -- --no-mon-health-to-clog" 
...
2017-08-02T06:03:33.790 INFO:teuthology.orchestra.run.smithi176.stderr:2017-08-02 06:03:33.790980 7f8b2b7fe700  0 -- 172.21.15.176:0/1139732091 >> 172.21.15.176:6800/773 conn(0x7f8b10029110 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-02T06:03:33.790 INFO:teuthology.orchestra.run.smithi176.stderr:2017-08-02 06:03:33.791138 7f8b2b7fe700  0 -- 172.21.15.176:0/1139732091 >> 172.21.15.176:6800/773 conn(0x7f8b10029110 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-02T06:03:33.994 INFO:teuthology.orchestra.run.smithi176.stderr:2017-08-02 06:03:33.995237 7f8b2b7fe700  0 -- 172.21.15.176:0/1139732091 >> 172.21.15.176:6800/773 conn(0x7f8b10029110 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-02T06:03:33.994 INFO:teuthology.orchestra.run.smithi176.stderr:2017-08-02 06:03:33.995436 7f8b2b7fe700  0 -- 172.21.15.176:0/1139732091 >> 172.21.15.176:6800/773 conn(0x7f8b10029110 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER

I think the mgr error is a red herring (although possibly also problematic). The real issue is that the ceph tell mon.* command is hung.

Backtrace of the hung ceph CLI process:
Thread 16 (Thread 0x7f8b3a9a7700 (LWP 6735)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f8b3ebab5eb in ceph::logging::Log::entry (this=0x7f8b34062c40) at /build/ceph-12.1.1-944-gb1162cb/src/log/Log.cc:454
#2  0x00007f8b494ba184 in start_thread (arg=0x7f8b3a9a7700) at pthread_create.c:312
#3  0x00007f8b491e737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 15 (Thread 0x7f8b39a29700 (LWP 6738)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f8b3edaf7e6 in WaitUntil (when=..., mutex=..., this=0x7f8b2c04ca00) at /build/ceph-12.1.1-944-gb1162cb/src/common/Cond.h:64
#2  WaitInterval (interval=..., mutex=..., this=0x7f8b2c04ca00) at /build/ceph-12.1.1-944-gb1162cb/src/common/Cond.h:73
#3  CephContextServiceThread::entry (this=0x7f8b2c04c980) at /build/ceph-12.1.1-944-gb1162cb/src/common/ceph_context.cc:135
#4  0x00007f8b494ba184 in start_thread (arg=0x7f8b39a29700) at pthread_create.c:312
#5  0x00007f8b491e737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 14 (Thread 0x7f8b39228700 (LWP 6739)):
#0  0x00007f8b491d9fdd in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f8b3eb65e63 in poll (__timeout=-1, __nfds=2, __fds=0x7f8b39227dc0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  AdminSocket::entry (this=0x7f8b34063790) at /build/ceph-12.1.1-944-gb1162cb/src/common/admin_socket.cc:250
#3  0x00007f8b494ba184 in start_thread (arg=0x7f8b39228700) at pthread_create.c:312
#4  0x00007f8b491e737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 13 (Thread 0x7f8b38a27700 (LWP 6740)):
#0  0x00007f8b491e7a13 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f8b3ef205d4 in EpollDriver::event_wait (this=0x7f8b2c050b60, fired_events=std::vector of length 0, capacity 0, tvp=<optimized out>) at /build/ceph-12.1.1-944-gb1162cb/src/msg/async/EventEpoll.cc:114
#2  0x00007f8b3ecfcb37 in EventCenter::process_events (this=this@entry=0x7f8b2c04fc40, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7f8b38a26d70) at /build/ceph-12.1.1-944-gb1162cb/src/msg/async/Event.cc:395
#3  0x00007f8b3ed0038e in NetworkStack::__lambda4::operator() (__closure=0x7f8b2c0d6850) at /build/ceph-12.1.1-944-gb1162cb/src/msg/async/Stack.cc:51
#4  0x00007f8b3e49ca60 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f8b494ba184 in start_thread (arg=0x7f8b38a27700) at pthread_create.c:312
#6  0x00007f8b491e737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 12 (Thread 0x7f8b2bfff700 (LWP 6741)):
#0  0x00007f8b491e7a13 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f8b3ef205d4 in EpollDriver::event_wait (this=0x7f8b2c07da20, fired_events=std::vector of length 0, capacity 0, tvp=<optimized out>) at /build/ceph-12.1.1-944-gb1162cb/src/msg/async/EventEpoll.cc:114
#2  0x00007f8b3ecfcb37 in EventCenter::process_events (this=this@entry=0x7f8b2c07cb50, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7f8b2bffed70) at /build/ceph-12.1.1-944-gb1162cb/src/msg/async/Event.cc:395
#3  0x00007f8b3ed0038e in NetworkStack::__lambda4::operator() (__closure=0x7f8b2c0d6a60) at /build/ceph-12.1.1-944-gb1162cb/src/msg/async/Stack.cc:51
#4  0x00007f8b3e49ca60 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f8b494ba184 in start_thread (arg=0x7f8b2bfff700) at pthread_create.c:312
#6  0x00007f8b491e737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 11 (Thread 0x7f8b2b7fe700 (LWP 6742)):
#0  0x00007f8b491e7a13 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f8b3ef205d4 in EpollDriver::event_wait (this=0x7f8b2c0aa8e0, fired_events=std::vector of length 0, capacity 0, tvp=<optimized out>) at /build/ceph-12.1.1-944-gb1162cb/src/msg/async/EventEpoll.cc:114
#2  0x00007f8b3ecfcb37 in EventCenter::process_events (this=this@entry=0x7f8b2c0a9a10, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7f8b2b7fdd70) at /build/ceph-12.1.1-944-gb1162cb/src/msg/async/Event.cc:395
#3  0x00007f8b3ed0038e in NetworkStack::__lambda4::operator() (__closure=0x7f8b2c0d6a40) at /build/ceph-12.1.1-944-gb1162cb/src/msg/async/Stack.cc:51
#4  0x00007f8b3e49ca60 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f8b494ba184 in start_thread (arg=0x7f8b2b7fe700) at pthread_create.c:312
#6  0x00007f8b491e737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 10 (Thread 0x7f8b2affd700 (LWP 6743)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f8b47b63cf3 in __gthread_cond_timedwait (__abs_timeout=0x7f8b2affceb0, __mutex=<optimized out>, __cond=0x7f8b2c0dba00) at /usr/include/x86_64-linux-gnu/c++/4.8/bits/gthr-default.h:871
#2  __wait_until_impl<std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > (__atime=<synthetic pointer>, __lock=..., this=0x7f8b2c0dba00) at /usr/include/c++/4.8/condition_variable:160
#3  wait_until<ceph::time_detail::mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > (__atime=..., __lock=..., this=0x7f8b2c0dba00) at /usr/include/c++/4.8/condition_variable:113
#4  ceph::timer_detail::timer<ceph::time_detail::mono_clock>::timer_thread (this=0x7f8b2c0db998) at /build/ceph-12.1.1-944-gb1162cb/src/common/ceph_timer.h:144
#5  0x00007f8b3e49ca60 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8b494ba184 in start_thread (arg=0x7f8b2affd700) at pthread_create.c:312
#7  0x00007f8b491e737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 9 (Thread 0x7f8b2a7fc700 (LWP 6744)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f8b3ebf21d8 in Wait (mutex=..., this=0x7f8b2c04e7f0) at /build/ceph-12.1.1-944-gb1162cb/src/common/Cond.h:48
#2  DispatchQueue::entry (this=0x7f8b2c04e788) at /build/ceph-12.1.1-944-gb1162cb/src/msg/DispatchQueue.cc:208
#3  0x00007f8b3ecd246d in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /build/ceph-12.1.1-944-gb1162cb/src/msg/DispatchQueue.h:101
#4  0x00007f8b494ba184 in start_thread (arg=0x7f8b2a7fc700) at pthread_create.c:312
#5  0x00007f8b491e737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 8 (Thread 0x7f8b29ffb700 (LWP 6745)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f8b3ebf4ccf in Wait (mutex=..., this=0x7f8b2c04e9a0) at /build/ceph-12.1.1-944-gb1162cb/src/common/Cond.h:48
#2  DispatchQueue::run_local_delivery (this=0x7f8b2c04e788) at /build/ceph-12.1.1-944-gb1162cb/src/msg/DispatchQueue.cc:113
#3  0x00007f8b3ecd248d in DispatchQueue::LocalDeliveryThread::entry (this=<optimized out>) at /build/ceph-12.1.1-944-gb1162cb/src/msg/DispatchQueue.h:115
#4  0x00007f8b494ba184 in start_thread (arg=0x7f8b29ffb700) at pthread_create.c:312
#5  0x00007f8b491e737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 7 (Thread 0x7f8b297fa700 (LWP 6746)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f8b3eb8163e in WaitUntil (when=..., mutex=..., this=0x7f8b340672a0) at /build/ceph-12.1.1-944-gb1162cb/src/common/Cond.h:64
#2  SafeTimer::timer_thread (this=0x7f8b34067290) at /build/ceph-12.1.1-944-gb1162cb/src/common/Timer.cc:110
#3  0x00007f8b3eb8294d in SafeTimerThread::entry (this=<optimized out>) at /build/ceph-12.1.1-944-gb1162cb/src/common/Timer.cc:30
#4  0x00007f8b494ba184 in start_thread (arg=0x7f8b297fa700) at pthread_create.c:312
#5  0x00007f8b491e737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 6 (Thread 0x7f8b28ff9700 (LWP 6747)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f8b3eb83b41 in Wait (mutex=..., this=0x7f8b340673b8) at /build/ceph-12.1.1-944-gb1162cb/src/common/Cond.h:48
#2  Finisher::finisher_thread_entry (this=0x7f8b34067358) at /build/ceph-12.1.1-944-gb1162cb/src/common/Finisher.cc:101
#3  0x00007f8b494ba184 in start_thread (arg=0x7f8b28ff9700) at pthread_create.c:312
#4  0x00007f8b491e737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 5 (Thread 0x7f8b23fff700 (LWP 6748)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f8b3eb813b7 in Wait (mutex=..., this=0x7f8b34067800) at /build/ceph-12.1.1-944-gb1162cb/src/common/Cond.h:48
#2  SafeTimer::timer_thread (this=0x7f8b340677f0) at /build/ceph-12.1.1-944-gb1162cb/src/common/Timer.cc:108
#3  0x00007f8b3eb8294d in SafeTimerThread::entry (this=<optimized out>) at /build/ceph-12.1.1-944-gb1162cb/src/common/Timer.cc:30
#4  0x00007f8b494ba184 in start_thread (arg=0x7f8b23fff700) at pthread_create.c:312
#5  0x00007f8b491e737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7f8b237fe700 (LWP 6749)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f8b3eb813b7 in Wait (mutex=..., this=0x7f8b34067a60) at /build/ceph-12.1.1-944-gb1162cb/src/common/Cond.h:48
#2  SafeTimer::timer_thread (this=0x7f8b34067a50) at /build/ceph-12.1.1-944-gb1162cb/src/common/Timer.cc:108
#3  0x00007f8b3eb8294d in SafeTimerThread::entry (this=<optimized out>) at /build/ceph-12.1.1-944-gb1162cb/src/common/Timer.cc:30
#4  0x00007f8b494ba184 in start_thread (arg=0x7f8b237fe700) at pthread_create.c:312
#5  0x00007f8b491e737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 (Thread 0x7f8b22ffd700 (LWP 6750)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f8b3eb83b41 in Wait (mutex=..., this=0x7f8b34067bf0) at /build/ceph-12.1.1-944-gb1162cb/src/common/Cond.h:48
#2  Finisher::finisher_thread_entry (this=0x7f8b34067b90) at /build/ceph-12.1.1-944-gb1162cb/src/common/Finisher.cc:101
#3  0x00007f8b494ba184 in start_thread (arg=0x7f8b22ffd700) at pthread_create.c:312
#4  0x00007f8b491e737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7f8b3b1a8700 (LWP 6754)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f8b47b509ce in Cond::Wait (this=0x7f8b3b1a73b0, mutex=...) at /build/ceph-12.1.1-944-gb1162cb/src/common/Cond.h:48
#2  0x00007f8b47b56433 in librados::RadosClient::mon_command (this=this@entry=0x7f8b340670b0, name="b", cmd=std::vector of length 1, capacity 1 = {...}, inbl=..., outbl=outbl@entry=0x7f8b3b1a75b0, outs=outs@entry=0x7f8b3b1a7500) at /build/ceph-12.1.1-944-gb1162cb/src/librados/RadosClient.cc:879
#3  0x00007f8b47b23ac8 in rados_mon_command_target (cluster=0x7f8b340670b0, name=name@entry=0x7f8b4988b52c "b", cmd=cmd@entry=0x7f8b2c0e33e0, cmdlen=cmdlen@entry=1, inbuf=inbuf@entry=0x7f8b498a352c "", inbuflen=inbuflen@entry=0, outbuf=outbuf@entry=0x7f8b3b1a76b8, outbuflen=outbuflen@entry=0x7f8b3b1a76c0, 
    outs=outs@entry=0x7f8b3b1a76c8, outslen=outslen@entry=0x7f8b3b1a76d0) at /build/ceph-12.1.1-944-gb1162cb/src/librados/librados.cc:3442
#4  0x00007f8b47e673e0 in __pyx_pf_5rados_5Rados_50mon_command (__pyx_v_timeout=<optimized out>, __pyx_v_target=<optimized out>, __pyx_v_inbuf=<optimized out>, __pyx_v_cmd=<optimized out>, __pyx_v_self=0x7f8b3c52c670) at /build/ceph-12.1.1-944-gb1162cb/obj-x86_64-linux-gnu/src/pybind/rados/pyrex/rados.c:13626
#5  __pyx_pw_5rados_5Rados_51mon_command (__pyx_v_self=0x7f8b3c52c670, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at /build/ceph-12.1.1-944-gb1162cb/obj-x86_64-linux-gnu/src/pybind/rados/pyrex/rados.c:13368
#6  0x000000000052714b in PyEval_EvalFrameEx ()
#7  0x0000000000555551 in PyEval_EvalCodeEx ()
#8  0x0000000000525560 in PyEval_EvalFrameEx ()
#9  0x00000000005247ea in PyEval_EvalFrameEx ()
#10 0x0000000000568b3a in ?? ()
#11 0x00000000004c2604 in ?? ()
#12 0x00000000004c8c8b in PyEval_CallObjectWithKeywords ()
#13 0x000000000058b132 in ?? ()
#14 0x00007f8b494ba184 in start_thread (arg=0x7f8b3b1a8700) at pthread_create.c:312
#15 0x00007f8b491e737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7f8b498dd740 (LWP 6731)):
#0  0x00007f8b491dec53 in select () at ../sysdeps/unix/syscall-template.S:81
#1  0x000000000055056a in ?? ()
#2  0x0000000000523f6d in PyEval_EvalFrameEx ()
#3  0x0000000000555551 in PyEval_EvalCodeEx ()
#4  0x0000000000524338 in PyEval_EvalFrameEx ()
#5  0x0000000000555551 in PyEval_EvalCodeEx ()
#6  0x0000000000524338 in PyEval_EvalFrameEx ()
#7  0x0000000000555551 in PyEval_EvalCodeEx ()
#8  0x0000000000525560 in PyEval_EvalFrameEx ()
#9  0x0000000000568b3a in ?? ()
#10 0x0000000000525cb7 in PyEval_EvalFrameEx ()
#11 0x0000000000555551 in PyEval_EvalCodeEx ()
#12 0x0000000000525560 in PyEval_EvalFrameEx ()
#13 0x0000000000555551 in PyEval_EvalCodeEx ()
#14 0x0000000000524338 in PyEval_EvalFrameEx ()
#15 0x0000000000555551 in PyEval_EvalCodeEx ()
#16 0x0000000000525560 in PyEval_EvalFrameEx ()
#17 0x0000000000567d14 in ?? ()
#18 0x0000000000465bf4 in PyRun_FileExFlags ()
#19 0x000000000046612d in PyRun_SimpleFileExFlags ()
#20 0x0000000000466d92 in Py_Main ()
#21 0x00007f8b4910ef45 in __libc_start_main (main=0x466e50 <main>, argc=9, argv=0x7fff285b1238, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff285b1228) at libc-start.c:287
#22 0x0000000000577c2e in _start ()
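
For reference, Thread 2 above is blocked inside librados::RadosClient::mon_command, reached via rados_mon_command_target with target name "b". A minimal sketch of the equivalent blocking call from the Python rados binding (a hypothetical example, not the actual CLI code; the conffile path and client name are assumptions):

import json
import rados

# Connect with admin credentials (path and name are assumptions for illustration).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf', name='client.admin')
cluster.connect()

cmd = json.dumps({"prefix": "injectargs",
                  "injected_args": ["--no-mon-health-to-clog"]})

# With target set, this goes through rados_mon_command_target() and waits
# on a condition variable for the ack -- the same wait Thread 2 is stuck
# in when the target monitor never replies.
ret, outbuf, outs = cluster.mon_command(cmd, b'', target='b')
print(ret, outs)

cluster.shutdown()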

The mon logs show:
root@smithi176:~# grep 172.21.15.176:0/1139732091 /var/log/ceph/ceph-mon.a.log  | grep command
2017-08-02 03:18:33.723273 7f4750583700  1 -- 172.21.15.176:6789/0 <== client.6915 172.21.15.176:0/1139732091 7 ==== mon_command({"prefix": "mon dump", "format": "json"} v 0) v1 ==== 82+0+0 (1584798505 0 0) 0x7f4765113b00 con 0x7f4766829000
2017-08-02 03:18:33.724354 7f4750583700 10 mon.a@0(leader).paxosservice(monmap 1..1) dispatch 0x7f4765113b00 mon_command({"prefix": "mon dump", "format": "json"} v 0) v1 from client.6915 172.21.15.176:0/1139732091 con 0x7f4766829000
2017-08-02 03:18:33.725563 7f4750583700  1 -- 172.21.15.176:6789/0 --> 172.21.15.176:0/1139732091 -- mon_command_ack([{"prefix": "mon dump", "format": "json"}]=0 dumped monmap epoch 1 v1) v1 -- 0x7f4765115cc0 con 0
2017-08-02 03:18:33.730445 7f4750583700  1 -- 172.21.15.176:6789/0 <== client.6915 172.21.15.176:0/1139732091 8 ==== mon_command({"prefix": "get_command_descriptions"} v 0) v1 ==== 80+0+0 (3499370304 0 0) 0x7f4765115cc0 con 0x7f4766829000
2017-08-02 03:18:33.790358 7f4750583700  1 -- 172.21.15.176:6789/0 --> 172.21.15.176:0/1139732091 -- mon_command_ack([{"prefix": "get_command_descriptions"}]=0  v0) v1 -- 0x7f476606eac0 con 0
2017-08-02 03:18:33.860702 7f4750583700  1 -- 172.21.15.176:6789/0 <== client.6915 172.21.15.176:0/1139732091 9 ==== mon_command({"injected_args": ["--no-mon-health-to-clog"], "prefix": "injectargs"} v 0) v1 ==== 112+0+0 (1779917119 0 0) 0x7f476606eac0 con 0x7f4766829000
2017-08-02 03:18:33.882008 7f4750583700  1 -- 172.21.15.176:6789/0 --> 172.21.15.176:0/1139732091 -- mon_command_ack([{"injected_args": ["--no-mon-health-to-clog"], "prefix": "injectargs"}]=0 injectargs:mon_health_to_clog = 'false'  v0) v1 -- 0x7f4765115cc0 con 0
root@smithi176:~# grep 172.21.15.176:0/1139732091 /var/log/ceph/ceph-mon.b.log  | grep command
root@smithi176:~# grep 172.21.15.176:0/1139732091 /var/log/ceph/ceph-mon.c.log  | grep command
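
The greps show this client's traffic (mon dump, get_command_descriptions, injectargs) reaching mon.a only; mon.b and mon.c never log a command from it. That is consistent with the tell fan-out: the client first fetches the monmap and then sends the command to each monitor by name in turn, so a single monitor that never acknowledges (here mon.b, per Thread 2 above) stalls everything after it. A simplified sketch of that fan-out using the Python rados binding (an illustration of the observed behaviour, not the actual ceph CLI source):

import json
import rados

def tell_all_mons(cluster, args):
    # Step 1: "mon dump" to learn the monitor names (the request visible
    # in mon.a's log above).
    ret, out, errs = cluster.mon_command(
        json.dumps({"prefix": "mon dump", "format": "json"}), b'')
    mons = json.loads(out)["mons"]

    # Step 2: send the command to each monitor in turn.  Each directed
    # mon_command blocks until that monitor acks, so one unresponsive
    # monitor prevents the remaining ones from ever being contacted.
    for m in mons:
        cmd = json.dumps({"prefix": "injectargs", "injected_args": args})
        ret, out, errs = cluster.mon_command(cmd, b'', target=m["name"])
        print("mon.%s -> %d %s" % (m["name"], ret, errs))

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf', name='client.admin')
cluster.connect()
tell_all_mons(cluster, ["--no-mon-health-to-clog"])
cluster.shutdown()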

History

#1 Updated by Sage Weil over 6 years ago

  • Project changed from mgr to RADOS

#2 Updated by Sage Weil over 6 years ago

/a/sage-2017-08-02_01:58:49-rados-wip-sage-testing-distro-basic-smithi/1469949

#3 Updated by Sage Weil over 6 years ago

/a/sage-2017-08-12_21:09:40-rados-wip-sage-testing-20170812a-distro-basic-smithi/1518429

2017-08-13T00:59:47.899 INFO:teuthology.orchestra.run.smithi142:Running: "sudo ceph --cluster ceph tell 'mon.*' injectargs -- --no-mon-health-to-clog" 
2017-08-13T00:59:48.107 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 00:59:48.116795 7f095ff37700 -1 WARNING: all dangerous and experimental features are enabled.
2017-08-13T00:59:48.149 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 00:59:48.159082 7f095ff37700 -1 WARNING: all dangerous and experimental features are enabled.
2017-08-13T01:09:34.158 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 01:09:34.167986 7f095d7c1700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6791/0 conn(0x7f093000ca00 :-1 s=STATE_CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1)._try_send injecting socket failure
2017-08-13T01:54:02.858 INFO:tasks.ceph.mon.c.smithi142.stderr:Dumping heap profile to /var/log/ceph//mon.c.profile.0003.heap (3075 MB allocated cumulatively, 12 MB currently in use)
2017-08-13T02:59:48.248 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 02:59:48.245484 7f094ffff700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6814/2833179 conn(0x7f093c023bf0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-13T02:59:48.360 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 02:59:48.245682 7f094ffff700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6814/2833179 conn(0x7f093c023bf0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-13T02:59:48.448 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 02:59:48.446467 7f094ffff700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6814/2833179 conn(0x7f093c023bf0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-13T02:59:48.459 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 02:59:48.446668 7f094ffff700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6814/2833179 conn(0x7f093c023bf0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-13T02:59:48.847 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 02:59:48.847643 7f094ffff700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6814/2833179 conn(0x7f093c023bf0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-13T02:59:49.295 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 02:59:48.847880 7f094ffff700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6814/2833179 conn(0x7f093c023bf0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-13T02:59:49.650 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 02:59:49.649256 7f094ffff700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6814/2833179 conn(0x7f093c023bf0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-13T02:59:49.653 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 02:59:49.649458 7f094ffff700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6814/2833179 conn(0x7f093c023bf0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-13T02:59:51.251 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 02:59:51.251668 7f094ffff700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6814/2833179 conn(0x7f093c023bf0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-13T02:59:51.597 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 02:59:51.251934 7f094ffff700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6814/2833179 conn(0x7f093c023bf0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-13T02:59:54.457 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 02:59:54.455776 7f094ffff700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6814/2833179 conn(0x7f093c023bf0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-13T02:59:54.758 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 02:59:54.456015 7f094ffff700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6814/2833179 conn(0x7f093c023bf0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-13T03:00:00.862 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 03:00:00.863197 7f094ffff700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6814/2833179 conn(0x7f093c023bf0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-13T03:00:01.070 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 03:00:00.863414 7f094ffff700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6814/2833179 conn(0x7f093c023bf0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-13T03:00:13.666 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 03:00:13.666287 7f094ffff700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6814/2833179 conn(0x7f093c023bf0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-13T03:00:14.695 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 03:00:13.666522 7f094ffff700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6814/2833179 conn(0x7f093c023bf0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
2017-08-13T03:00:28.681 INFO:teuthology.orchestra.run.smithi142.stderr:2017-08-13 03:00:28.682178 7f094ffff700  0 -- 172.21.15.142:0/3748075446 >> 172.21.15.142:6814/2833179 conn(0x7f093c023bf0 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
...

#4 Updated by Sage Weil over 6 years ago

/a/sage-2017-08-24_17:38:40-rados-wip-sage-testing2-luminous-20170824a-distro-basic-smithi/1560473

#5 Updated by Sage Weil over 6 years ago

  • Status changed from 12 to Can't reproduce

#6 Updated by Марк Коренберг almost 6 years ago

Triggered on Luminous 12.2.5 again.

The mon quorum worked as expected, but even after restarting all monitors the cluster did not heal: all PGs were "100% unknown" and all I/O on the cluster was stopped.

After full reboot of all nodes, everything healed.

Also, on the CephFS clients at that time:

[480613.330772] ceph: mds0 reconnect start
[480613.383007] ceph: mds0 reconnect start
[480613.538979] libceph: ceph_aes_crypt got bad padding -64 on in_len 32
[480613.539143] libceph: mds0 10.80.20.99:6800 bad authorize reply
[480613.539238] ceph: mds0 reconnect success
[480614.405119] libceph: ceph_aes_crypt got bad padding -38 on in_len 32
[480614.405419] libceph: mds0 10.80.20.99:6800 bad authorize reply
[480615.396934] libceph: ceph_aes_crypt got bad padding -102 on in_len 32
[480615.397180] libceph: mds0 10.80.20.99:6800 bad authorize reply
[480617.380910] libceph: ceph_aes_crypt got bad padding 116 on in_len 32
[480617.381156] libceph: mds0 10.80.20.99:6800 bad authorize reply
[480621.605396] libceph: ceph_aes_crypt got bad padding -48 on in_len 32
[480621.606933] libceph: mds0 10.80.20.99:6800 bad authorize reply
[480629.540864] libceph: ceph_aes_crypt got bad padding 35 on in_len 32
[480629.541519] libceph: mds0 10.80.20.99:6800 bad authorize reply
[480645.412879] libceph: mds0 10.80.20.99:6800 bad authorize reply
[480661.802026] ceph: mds0 recovery completed
[480661.802222] ceph: mds0 recovery completed
