Project

General

Profile

Bug #39150

mon: "FAILED ceph_assert(session_map.sessions.empty())" when out of quorum

Added by Patrick Donnelly 4 months ago. Updated 2 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
Start date:
Due date:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:

Description

2019-04-06T09:27:34.791 INFO:tasks.ceph.mds.b:Sent signal 15
2019-04-06T09:27:34.791 INFO:tasks.ceph.mon.a:Sent signal 15
2019-04-06T09:27:34.792 INFO:tasks.ceph.mon.c:Sent signal 15
2019-04-06T09:27:34.792 INFO:tasks.ceph.mon.b:Sent signal 15
2019-04-06T09:27:34.803 INFO:tasks.ceph.mon.a.smithi085.stderr:2019-04-06 09:27:34.801 7f854e356700 -1 received  signal: Terminated from /usr/bin/python /usr/bin/daemon-helper kill ceph-mon -f --cluster ceph -i a  (PID: 17117) UID: 0
2019-04-06T09:27:34.803 INFO:tasks.ceph.mon.a.smithi085.stderr:2019-04-06 09:27:34.801 7f854e356700 -1 mon.a@0(electing) e1 *** Got Signal Terminated ***
2019-04-06T09:27:34.807 INFO:tasks.ceph.mon.c.smithi180.stderr:2019-04-06 09:27:34.806 7f26a7a1e700 -1 received  signal: Terminated from /usr/bin/python /usr/bin/daemon-helper kill ceph-mon -f --cluster ceph -i c  (PID: 17101) UID: 0
2019-04-06T09:27:34.807 INFO:tasks.ceph.mon.c.smithi180.stderr:2019-04-06 09:27:34.806 7f26a7a1e700 -1 mon.c@2(electing) e1 *** Got Signal Terminated ***
2019-04-06T09:27:34.807 INFO:tasks.ceph.mon.b.smithi180.stderr:2019-04-06 09:27:34.806 7fc14ddde700 -1 received  signal: Terminated from /usr/bin/python /usr/bin/daemon-helper kill ceph-mon -f --cluster ceph -i b  (PID: 17099) UID: 0
2019-04-06T09:27:34.808 INFO:tasks.ceph.mon.b.smithi180.stderr:2019-04-06 09:27:34.806 7fc14ddde700 -1 mon.b@1(electing) e1 *** Got Signal Terminated ***
2019-04-06T09:27:34.808 INFO:tasks.ceph.mds.b.smithi180.stderr:2019-04-06 09:27:34.806 7f97d9dd8700 -1 received  signal: Terminated from /usr/bin/python /usr/bin/daemon-helper kill ceph-mds -f --cluster ceph -i b  (PID: 19872) UID: 0
2019-04-06T09:27:34.808 INFO:tasks.ceph.mds.b.smithi180.stderr:2019-04-06 09:27:34.806 7f97d9dd8700 -1 mds.b *** got signal Terminated ***
2019-04-06T09:27:34.939 INFO:tasks.ceph.mon.c.smithi180.stderr:/build/ceph-15.0.0-122-gcf4d304/src/mon/Monitor.cc: In function 'virtual Monitor::~Monitor()' thread 7f26b7a03340 time 2019-04-06 09:27:34.940966
2019-04-06T09:27:34.939 INFO:tasks.ceph.mon.c.smithi180.stderr:/build/ceph-15.0.0-122-gcf4d304/src/mon/Monitor.cc: 267: FAILED ceph_assert(session_map.sessions.empty())
2019-04-06T09:27:34.941 INFO:tasks.ceph.mon.c.smithi180.stderr: ceph version 15.0.0-122-gcf4d304 (cf4d304f05231b6375986616bc965edc8181a4e1) octopus (dev)
2019-04-06T09:27:34.941 INFO:tasks.ceph.mon.c.smithi180.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f26aebba0d2]
2019-04-06T09:27:34.941 INFO:tasks.ceph.mon.c.smithi180.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f26aebba2ad]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr: 3: (Monitor::~Monitor()+0x962) [0x69dfc2]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr: 4: (Monitor::~Monitor()+0x9) [0x69e039]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr: 5: (main()+0x2801) [0x578df1]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr: 6: (__libc_start_main()+0xf0) [0x7f26ad0fd830]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr: 7: (_start()+0x29) [0x65bee9]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr:*** Caught signal (Aborted) **
2019-04-06T09:27:34.943 INFO:tasks.ceph.mon.c.smithi180.stderr: in thread 7f26b7a03340 thread_name:ceph-mon

From: /ceph/teuthology-archive/pdonnell-2019-04-06_02:21:29-fs-wip-pdonnell-testing-20190405.231924-distro-basic-smithi/3814565/teuthology.log

Seems there were other issues with the mons during that run as well. Mons lost quorum around 08:58:28.846.

History

#1 Updated by Greg Farnum 4 months ago

  • Subject changed from mon: "FAILED ceph_assert(session_map.sessions.empty())" to mon: "FAILED ceph_assert(session_map.sessions.empty())" when out of quorum
  • Priority changed from High to Normal

The monitor was out of quorum for 30 minutes; it probably has to do with holding on to client connections or else not cleaning up the session map from when it was last in. I'm not sure this is high priority though since it's a crash on shutdown in a failure scenario...

#2 Updated by Patrick Donnelly 4 months ago

Greg Farnum wrote:

The monitor was out of quorum for 30 minutes; it probably has to do with holding on to client connections or else not cleaning up the session map from when it was last in. I'm not sure this is high priority though since it's a crash on shutdown in a failure scenario...

Unless you're looking at something different, the lost quorum happened during the test not at shutdown. The mds thrasher had just successfully thrashed (killed and standby took over successfully) an MDS around ~8 seconds earlier.

#3 Updated by Greg Farnum 4 months ago

mon.c timeline:
2019-04-06 08:58:28.846 hits a lease timeout and triggers the election process
2019-04-06 08:58:28.846 first output of "probing" state
2019-04-06 08:58:28.850 first output of "electing" state
2019-04-06 09:27:34.942 crash output line

It does not output the "peon" or "leader" state again in those 29 minutes; it times out 291 elections and starts 294 during that time. I don't know why it happened but mon.c was out of quorum that whole time.

#4 Updated by Sage Weil 4 months ago

mon.c is failing to connect to mon.a:

2019-04-06 09:19:20.484 7f269f20d700  1 --2- [v2:172.21.15.180:3301/0,v1:172.21.15.180:6790/0] >> [v2:172.21.15.85:3300/0,v1:172.21.15.85:6789/0] conn(0x3137200 0x2f77600 secure :-1 s=BANNER_CONNECTING pgs=3162 cs=280 l=0 rx=0x4171da0 tx=0x4b2b080)._handle_peer_banner_payload supported=0 required=0
2019-04-06 09:19:20.484 7f269f20d700  1 --2- [v2:172.21.15.180:3301/0,v1:172.21.15.180:6790/0] >> [v2:172.21.15.85:3300/0,v1:172.21.15.85:6789/0] conn(0x3137200 0x2f77600 secure :-1 s=START_CONNECT pgs=3162 cs=281 l=0 rx=0x4171da0 tx=0x4b2b080)._fault waiting 15.000000

same in the other direction:
2019-04-06 09:19:00.906 7f8546b47700  1 --2- [v2:172.21.15.85:3300/0,v1:172.21.15.85:6789/0] >> [v2:172.21.15.180:3301/0,v1:172.21.15.180:6790/0] conn(0x361b680 0x33f9b80 secure :-1 s=BANNER_CONNECTING pgs=581 cs=284 l=0 rx=0x8ed3ab0 tx=0x4eb4b80)._handle_peer_banner_payload supported=0 required=0
2019-04-06 09:19:00.906 7f8546b47700  1 --2- [v2:172.21.15.85:3300/0,v1:172.21.15.85:6789/0] >> [v2:172.21.15.180:3301/0,v1:172.21.15.180:6790/0] conn(0x361b680 0x33f9b80 secure :-1 s=START_CONNECT pgs=581 cs=285 l=0 rx=0x8ed3ab0 tx=0x4eb4b80)._fault waiting 15.000000

#5 Updated by Sage Weil 4 months ago

(not surprisingly, MON_DOWN is in the ceph.log too, and the run would have failed with that had it not failed for some other reason. will keep an eye out for that!)

#6 Updated by Patrick Donnelly 4 months ago

/ceph/teuthology-archive/pdonnell-2019-04-17_06:12:56-kcephfs-wip-pdonnell-testing-20190417.032809-distro-basic-smithi/3857629/teuthology.log

#7 Updated by Neha Ojha 2 months ago

/a/yuriw-2019-06-07_19:41:42-rados-wip-yuri4-testing-2019-06-07-1600-nautilus-distro-basic-smithi/4012630/

Also available in: Atom PDF