Project

General

Profile

Bug #20667

segv in cephx_verify_authorizing during monc init

Added by Sage Weil 5 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Immediate
Assignee:
-
Category:
-
Target version:
-
Start date:
07/18/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No
Component(RADOS):

Description

  -114> 2017-07-17 18:42:08.896052 7f1159782700  1 accepter.entry start
  -113> 2017-07-17 18:42:08.896173 7f116cfcdc80  1 accepter.start
  -112> 2017-07-17 18:42:08.896218 7f115417d700  1 accepter.entry start
  -111> 2017-07-17 18:42:08.896248 7f116cfcdc80  1 accepter.start
  -110> 2017-07-17 18:42:08.896262 7f116cfa7700  1 -- 0.0.0.0:6816/97757 >> - pipe(0x562cf974b400 sd=26 :6816 s=0 pgs=0 cs=0 l=0 c=0x562cf95e93c0).accept sd=26 -
  -109> 2017-07-17 18:42:08.896287 7f115297a700  1 accepter.entry start
  -108> 2017-07-17 18:42:08.896314 7f116cfcdc80 10 monclient: init
  -107> 2017-07-17 18:42:08.896322 7f116cfcdc80  5 adding auth protocol: cephx
...
     0> 2017-07-17 18:42:08.914160 7f116cfa7700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f116cfa7700 thread_name:ms_pipe_read

 ceph version 12.1.0-1071-gd9c1467 (d9c1467492756d344415dba37829153f45de5368) luminous (rc)
 1: (()+0xa25ba4) [0x562cee6a3ba4]
 2: (()+0x113e0) [0x7f116b4bb3e0]
 3: (cephx_verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list::iterator&, CephXServiceTicketInfo&, ceph::buffer::list&)+0x496) [0x562cee872066]
 4: (CephxAuthorizeHandler::verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list&, ceph::buffer::list&, EntityName&, unsigned long&, AuthCapsInfo&, CryptoKey&, unsigned long*)+0x31a) [0x562cee86409a]
 5: (OSD::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0xf9) [0x562cee0ed6b9]
 6: (SimpleMessenger::verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x6c) [0x562cee76ca6c]
 7: (Pipe::accept()+0x1ada) [0x562cee915bca]
 8: (Pipe::reader()+0x1e98) [0x562cee91b398]
 9: (Pipe::Reader::entry()+0xd) [0x562cee92402d]
 10: (()+0x770a) [0x7f116b4b170a]
 11: (clone()+0x6d) [0x7f116a52882d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

note that monc is just starting up when the incoming connection comes in.

/a/sage-2017-07-17_17:34:45-rados-wip-sage-testing-distro-basic-smithi/1408659

core doesn't have a good backtrace:

Thread 1 (Thread 0x7f116cfa7700 (LWP 97831)):
#0  0x00007f116b4bb2b9 in __flockfile (stream=0x17ddd) at ../sysdeps/pthread/flockfile.c:28
#1  0x0000562cee6a3c2e in reraise_fatal (signum=11) at /build/ceph-12.1.0-1071-gd9c1467/src/global/signal_handler.cc:74
#2  handle_fatal_signal (signum=11) at /build/ceph-12.1.0-1071-gd9c1467/src/global/signal_handler.cc:138
#3  0x00007f116b4bb3e0 in __libc_sigaction (sig=97757, act=0x17e27, oact=0xb) at ../sysdeps/unix/sysv/linux/x86_64/sigaction.c:54
#4  0x0000000000001000 in ?? ()
#5  0x0000000000000000 in ?? ()

History

#1 Updated by Sage Weil 5 months ago

Another instance, this time jewel:

2017-07-18 21:44:13.889278 7f445ca71800 10 osd.4 445 create_logger
2017-07-18 21:44:13.889327 7f445ca71800  1 accepter.accepter.start
2017-07-18 21:44:13.889373 7f445ca71800  1 accepter.accepter.start
2017-07-18 21:44:13.889450 7f445ca71800  1 accepter.accepter.start
2017-07-18 21:44:13.889460 7f445c9e0700  1 -- 0.0.0.0:6801/4712 >> :/0 pipe(0x7f44673e8000 sd=21 :6801 s=0 pgs=0 cs=0 l=0 c=0x7f44673da280).accept sd=21 172.21.15.70:45499/0
2017-07-18 21:44:13.889489 7f445ca71800  1 accepter.accepter.start
2017-07-18 21:44:13.889710 7f445ca71800 -1 osd.4 445 log_to_monitors {default=true}
2017-07-18 21:44:13.889933 7f445ca71800 10 osd.4 445 set_disk_tp_priority class  priority -1
2017-07-18 21:44:13.890001 7f4436a9d700 20 osd.4 0 update_osd_stat osd_stat(266 MB used, 95076 MB avail, 95342 MB total, peers []/[] op hist [])
2017-07-18 21:44:13.890008 7f4436a9d700  5 osd.4 445 heartbeat: osd_stat(266 MB used, 95076 MB avail, 95342 MB total, peers []/[] op hist [])
2017-07-18 21:44:13.890042 7f4433296700 10 osd.4 445 agent_entry start
2017-07-18 21:44:13.890046 7f4433296700 20 osd.4 445 agent_entry empty queue
2017-07-18 21:44:13.890078 7f445ca71800  1 -- 0.0.0.0:6801/4712 --> 172.21.15.70:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- ?+0 0x7f446738da80 con 0x7f44673d9500
2017-07-18 21:44:13.890267 7f445bb11700  1 -- 172.21.15.72:6801/4712 learned my addr 172.21.15.72:6801/4712
2017-07-18 21:44:13.890284 7f445bb11700 10 osd.4 445  new session (outgoing) 0x7f4466bdeb60 con=0x7f4466ba5200 addr=172.21.15.72:6801/4712
2017-07-18 21:44:13.890297 7f445bb11700 10 osd.4 445 OSD::ms_get_authorizer type=mon
2017-07-18 21:44:13.890381 7f44472be700 10 osd.4 445 ms_handle_connect on mon
2017-07-18 21:44:13.890387 7f44472be700 10 osd.4 445 send_alive up_thru currently 393 want 0
2017-07-18 21:44:13.890390 7f44472be700 10 osd.4 445 requeue_pg_temp 0 + 0 -> 0
2017-07-18 21:44:13.890393 7f44472be700 10 osd.4 445 requeue_failures 0 + 0 -> 0
2017-07-18 21:44:13.890396 7f44472be700 20 osd.4 445 send_pg_stats
2017-07-18 21:44:13.890570 7f44472be700  1 -- 172.21.15.72:6801/4712 <== mon.0 172.21.15.70:6789/0 1 ==== mon_map magic: 0 v1 ==== 967+0+0 (4235217005 0 0) 0x7f44673f6400 con 0x7f44673d9500
2017-07-18 21:44:13.890626 7f44472be700  1 -- 172.21.15.72:6801/4712 <== mon.0 172.21.15.70:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (1973296307 0 0) 0x7f44673f6880 con 0x7f44673d9500
2017-07-18 21:44:13.890725 7f44472be700  1 -- 172.21.15.72:6801/4712 --> 172.21.15.70:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 0x7f44673f6400 con 0x7f44673d9500
2017-07-18 21:44:13.890925 7f44472be700  1 -- 172.21.15.72:6801/4712 <== mon.0 172.21.15.70:6789/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 206+0+0 (2606622780 0 0) 0x7f44673f6640 con 0x7f44673d9500
2017-07-18 21:44:13.890990 7f44472be700  1 -- 172.21.15.72:6801/4712 --> 172.21.15.70:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- ?+0 0x7f44673f6880 con 0x7f44673d9500
2017-07-18 21:44:13.891201 7f44472be700  1 -- 172.21.15.72:6801/4712 <== mon.0 172.21.15.70:6789/0 4 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 393+0+0 (1258108948 0 0) 0x7f44673f6400 con 0x7f44673d9500
2017-07-18 21:44:13.891260 7f44472be700  1 -- 172.21.15.72:6801/4712 --> 172.21.15.70:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x7f4467268600 con 0x7f44673d9500
2017-07-18 21:44:13.891288 7f44472be700  1 -- 172.21.15.72:6801/4712 --> 172.21.15.70:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- ?+0 0x7f44673f6640 con 0x7f44673d9500
2017-07-18 21:44:13.891392 7f44472be700  1 -- 172.21.15.72:6801/4712 <== mon.0 172.21.15.70:6789/0 5 ==== mon_map magic: 0 v1 ==== 967+0+0 (4235217005 0 0) 0x7f44673f61c0 con 0x7f44673d9500
2017-07-18 21:44:13.891432 7f44472be700  1 -- 172.21.15.72:6801/4712 <== mon.0 172.21.15.70:6789/0 6 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 194+0+0 (4253047930 0 0) 0x7f446738da80 con 0x7f44673d9500
2017-07-18 21:44:13.891518 7f445ca71800 10 osd.4 445 ensuring pgs have consumed prior maps
2017-07-18 21:44:13.891521 7f445ca71800  7 osd.4 445 consume_map version 445
2017-07-18 21:44:13.891535 7f445ca71800 10 osd.4 pg_epoch: 415 pg[0.10( empty local-les=414 n=0 ec=1 les/c/f 414/415/0 406/411/382) [5,4] r=1 lpr=415 pi=393-410/2 crt=0'0 inactive NOTIFY NIBBLEWISE] null
2017-07-18 21:44:13.891548 7f445ca71800 10 osd.4 pg_epoch: 384 pg[0.14( empty local-les=384 n=0 ec=1 les/c/f 384/384/0 378/383/383) [5,4] r=1 lpr=384 pi=296-382/10 crt=0'0 inactive NOTIFY NIBBLEWISE] null
2017-07-18 21:44:13.891561 7f445ca71800 10 osd.4 pg_epoch: 444 pg[5.7( v 444'5 (0'0,444'5] local-les=443 n=5 ec=442 les/c/f 443/443/0 442/442/442) [5,4] r=1 lpr=444 crt=444'5 lcod 0'0 inactive NOTIFY NIBBLEWISE] null
2017-07-18 21:44:13.891554 7f445c9e0700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f445c9e0700 thread_name:ms_pipe_read

 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (()+0x902ac2) [0x7f445c417ac2]
 2: (()+0x10330) [0x7f445a9ce330]
 3: (cephx_verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list::iterator&, CephXServiceTicketInfo&, ceph::buffer::list&)+0x27f) [0x7f445c4247af]
 4: (CephxAuthorizeHandler::verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list&, ceph::buffer::list&, EntityName&, unsigned long&, AuthCapsInfo&, CryptoKey&, unsigned long*)+0x1a7) [0x7f445c420057]
 5: (OSD::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x114) [0x7f445be858a4]
 6: (SimpleMessenger::verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x7f) [0x7f445c4e22bf]
 7: (Pipe::accept()+0x1d8e) [0x7f445c5fee9e]
 8: (Pipe::reader()+0x1c2f) [0x7f445c60412f]
 9: (Pipe::Reader::entry()+0xd) [0x7f445c60bf6d]
 10: (()+0x8184) [0x7f445a9c6184]
 11: (clone()+0x6d) [0x7f4458af137d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

/a/sage-2017-07-18_16:32:41-upgrade:jewel-x-master-distro-basic-smithi/1412950

#2 Updated by Sage Weil 5 months ago

  • Priority changed from High to Urgent

#3 Updated by Sage Weil 5 months ago

/a/yuriw-2017-07-18_19:38:33-rados-wip-yuri-testing3_2017_7_19-distro-basic-smithi/1413393
/a/yuriw-2017-07-18_19:38:33-rados-wip-yuri-testing3_2017_7_19-distro-basic-smithi/1413445

#4 Updated by Sage Weil 5 months ago

  • Priority changed from Urgent to Immediate

#5 Updated by Sage Weil 5 months ago

/a/sage-2017-07-19_15:27:16-rados-wip-sage-testing2-distro-basic-smithi/1419306
/a/sage-2017-07-19_15:27:16-rados-wip-sage-testing2-distro-basic-smithi/1419408

#6 Updated by Sage Weil 5 months ago

this time with a core

2017-07-20 06:05:39.577856 7fd743321b80  0 osd.5 43 load_pgs opened 26 pgs
2017-07-20 06:05:39.577880 7fd743321b80 10 osd.5 43 build_past_intervals_parallel nothing to build
2017-07-20 06:05:39.577888 7fd743321b80  2 osd.5 43 superblock: I am osd.5
2017-07-20 06:05:39.577890 7fd743321b80  0 osd.5 43 using weightedpriority op queue with priority op cut off at 64.
2017-07-20 06:05:39.577894 7fd743321b80 10 osd.5 43 create_logger
2017-07-20 06:05:39.578046 7fd743321b80  1  Processor -- start
2017-07-20 06:05:39.578219 7fd743321b80  1  Processor -- start
2017-07-20 06:05:39.578308 7fd743321b80  1  Processor -- start
2017-07-20 06:05:39.578380 7fd743321b80  1  Processor -- start
2017-07-20 06:05:39.578461 7fd743321b80  1  Processor -- start
2017-07-20 06:05:39.578479 7fd73d250700  1 -- 0.0.0.0:6822/18442 >> - conn(0x7fd74c7ee000 :6822 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=60 -
2017-07-20 06:05:39.578480 7fd73ca4f700  1 -- 0.0.0.0:6822/18442 >> - conn(0x7fd74c7f1000 :6822 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=61 -
2017-07-20 06:05:39.578560 7fd743321b80  1  Processor -- start

then nothing.

/a/sage-2017-07-20_05:37:32-rados-wip-sage-testing2-distro-basic-smithi/1424017

#0  cephx_verify_authorizer (cct=cct@entry=0x7fd74c436000, keys=keys@entry=0x0, indata=..., ticket_info=..., reply_bl=...) at /build/ceph-12.1.1-195-gb80122d/src/auth/cephx/CephxProtocol.cc:425
425     /build/ceph-12.1.1-195-gb80122d/src/auth/cephx/CephxProtocol.cc: No such file or directory.
(gdb) bt
#0  cephx_verify_authorizer (cct=cct@entry=0x7fd74c436000, keys=keys@entry=0x0, indata=..., ticket_info=..., reply_bl=...) at /build/ceph-12.1.1-195-gb80122d/src/auth/cephx/CephxProtocol.cc:425
#1  0x00007fd742b86c12 in CephxAuthorizeHandler::verify_authorizer (this=<optimized out>, cct=0x7fd74c436000, keys=0x0, authorizer_data=..., authorizer_reply=..., entity_name=..., global_id=@0x7fd73d24da00: 140562420523264, caps_info=..., session_key=..., auid=0x7fd73d24da08)
    at /build/ceph-12.1.1-195-gb80122d/src/auth/cephx/CephxAuthorizeHandler.cc:22
#2  0x00007fd7424b7c11 in OSD::ms_verify_authorizer (this=0x7fd74c660000, con=0x7fd74c7ee000, peer_type=<optimized out>, protocol=<optimized out>, authorizer_data=..., authorizer_reply=..., isvalid=@0x7fd73d24dc7f: false, session_key=...) at /build/ceph-12.1.1-195-gb80122d/src/osd/OSD.cc:6947
#3  0x00007fd742c410ea in ms_deliver_verify_authorizer (session_key=..., isvalid=@0x7fd73d24dc7f: false, authorizer_reply=..., authorizer=..., protocol=<optimized out>, peer_type=<optimized out>, con=0x7fd74c7ee000, this=0x7fd74c49ee00) at /build/ceph-12.1.1-195-gb80122d/src/msg/Messenger.h:770
#4  verify_authorizer (session_key=..., isvalid=@0x7fd73d24dc7f: false, auth_reply=..., auth=..., protocol=<optimized out>, peer_type=<optimized out>, con=0x7fd74c7ee000, this=0x7fd74c49ee00) at /build/ceph-12.1.1-195-gb80122d/src/msg/async/AsyncMessenger.h:389
#5  AsyncConnection::handle_connect_msg (this=this@entry=0x7fd74c7ee000, connect=..., authorizer_bl=..., authorizer_reply=...) at /build/ceph-12.1.1-195-gb80122d/src/msg/async/AsyncConnection.cc:1512
#6  0x00007fd742c46837 in AsyncConnection::_process_connection (this=this@entry=0x7fd74c7ee000) at /build/ceph-12.1.1-195-gb80122d/src/msg/async/AsyncConnection.cc:1325
#7  0x00007fd742c49f68 in AsyncConnection::process (this=0x7fd74c7ee000) at /build/ceph-12.1.1-195-gb80122d/src/msg/async/AsyncConnection.cc:838
#8  0x00007fd742ab6ea1 in EventCenter::process_events (this=this@entry=0x7fd74c317880, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7fd73d24e930) at /build/ceph-12.1.1-195-gb80122d/src/msg/async/Event.cc:409
#9  0x00007fd742ab99ee in NetworkStack::__lambda4::operator() (__closure=0x7fd74c2da200) at /build/ceph-12.1.1-195-gb80122d/src/msg/async/Stack.cc:51
#10 0x00007fd73fe88a60 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007fd740500184 in start_thread (arg=0x7fd73d250700) at pthread_create.c:312
#12 0x00007fd73f5efffd in __qfcvt_r (value=0, ndigit=<optimized out>, decpt=0x0, sign=<optimized out>, buf=0x0, len=0) at efgcvt_r.c:148
#13 0x0000000000000000 in ?? ()
(gdb) p keys
$1 = (KeyStore *) 0x0

that line is
    if (!keys->get_service_secret(service_id, ticket.secret_id, service_secret)) {

in CephProtocol.cc

#7 Updated by Sage Weil 5 months ago

  • Status changed from Verified to Need Review

https://github.com/ceph/ceph/pull/16455

I think we also need to fix the root cause, though, in commit bf4938567943c80345966f9c5a3bdc75a913175b

#8 Updated by Sage Weil 5 months ago

  • Status changed from Need Review to Resolved

Also available in: Atom PDF