Bug #18126

closed

Illegal instruction from Messenger::create -> std::random_device::_M_getval

Added by Sage Weil over 7 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2016-12-02 05:05:35.436894 90133c0 10 obtain_monmap
2016-12-02 05:05:35.482491 90133c0 10 obtain_monmap found mkfs monmap
2016-12-02 05:05:35.626316 90133c0 -1 *** Caught signal (Illegal instruction) **
 in thread 90133c0 thread_name:memcheck-amd64-

 ceph version 11.0.2-2165-ga68e9ab (a68e9ab353c6179a9cdbec28ac253e903eb7c3ca)
 1: (()+0x70e1ce) [0x8161ce]
 2: (()+0x113e0) [0xa90e3e0]
 3: (()+0xb7b15) [0xb370b15]
 4: (std::random_device::_M_getval()+0x92) [0xb370cb2]
 5: (Messenger::create(CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, entity_name_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, unsigned long)+0x430) [0x67a1c0]
 6: (main()+0x1e03) [0x3612f3]
 7: (__libc_start_main()+0xf0) [0xbb7a830]
 8: (_start()+0x29) [0x3d5789]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

/a/sage-2016-12-02_04:50:00-rados-wip-sage-testing---basic-smithi/594072

ceph-mon.b.log in run dir
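
As the NOTE in the trace says, the unnamed frames can only be interpreted against the binary or library they came from. A minimal sketch of the arithmetic involved, assuming the usual glibc `backtrace_symbols` convention that the `+0x...` value of an unnamed frame is an offset from the start of the mapped object (the `parse_frame` helper and its field names are hypothetical, not part of Ceph):

```python
import re

# One frame looks like: " 3: (()+0xb7b15) [0xb370b15]"
FRAME = re.compile(r"^\s*(\d+): \((.*)\+0x([0-9a-f]+)\) \[0x([0-9a-f]+)\]$")


def parse_frame(line):
    """Split one backtrace line into index, symbol, offset and address."""
    m = FRAME.match(line)
    if m is None:
        return None
    index, symbol, offset, address = m.groups()
    offset = int(offset, 16)
    address = int(address, 16)
    return {
        "index": int(index),
        "symbol": symbol,      # "()" when no symbol name was resolved
        "offset": offset,
        "address": address,
        # address - offset: start of the function for named frames,
        # assumed start of the mapped object for unnamed ones.
        "origin": address - offset,
    }


# Frames 1-4 copied from the description above.
trace = """\
 1: (()+0x70e1ce) [0x8161ce]
 2: (()+0x113e0) [0xa90e3e0]
 3: (()+0xb7b15) [0xb370b15]
 4: (std::random_device::_M_getval()+0x92) [0xb370cb2]
"""

frames = [f for f in map(parse_frame, trace.splitlines()) if f]
for f in frames:
    print(f["index"], f["symbol"], hex(f["offset"]), hex(f["origin"]))
```

Under that assumption, the `offset` of frame 3 (`0xb7b15`) is the value to look up in the `objdump -rdS` output of whichever library is mapped at the computed `origin`.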


Related issues 1 (1 open, 0 closed)

Related to RADOS - Bug #20360: rados/verify valgrind tests: osds fail to start (xenial valgrind) - New

Actions #1

Updated by Sage Weil over 7 years ago

  • Priority changed from Urgent to Immediate

/a/sage-2016-12-03_19:33:05-rados-master---basic-smithi/599046

Actions #2

Updated by Sage Weil over 7 years ago

/a/sage-2016-12-03_19:34:03-rados-wip-sage-testing---basic-smithi/599343

Actions #3

Updated by Sage Weil over 7 years ago

Commit bf7d77a84b144ffdc92efd7d19d3038b75911b54 looks like it could be the culprit: it moved a global static to a function static.

Actions #4

Updated by Sage Weil over 7 years ago

  • Status changed from New to 15
  • Priority changed from Immediate to Urgent
Actions #5

Updated by Sage Weil over 7 years ago

In the meantime, we can force valgrind runs onto centos.

https://github.com/ceph/ceph-qa-suite/pull/1301

Actions #6

Updated by Sage Weil over 7 years ago

<sage> jamespage: any chance we can poke the valgrind package maintainer to update the (xenial) package?  the bug was fixed about a year ago.
<frickler> sage: backporting changes may take some time, could you test whether installing the package from yakkety would solve your issue? https://launchpad.net/ubuntu/+source/valgrind/1:3.12.0~svn20160714-1ubuntu2/+build/10602185
Actions #7

Updated by David Galloway over 7 years ago

Sepia smithis running Xenial now have valgrind 3.12.0~svn20160714-1ubuntu2 installed.

ansible -a "wget -O /tmp/valgrind.deb https://launchpad.net/ubuntu/+source/valgrind/1:3.12.0~svn20160714-1ubuntu2/+build/10602185/+files/valgrind_3.12.0~svn20160714-1ubuntu2_amd64.deb" smithi[079:116]
ansible -a "sudo dpkg -i /tmp/valgrind.deb" smithi[079:116]

(also did smithi126 separately)

Actions #9

Updated by Sage Weil almost 7 years ago

2017-06-01T03:26:56.134 INFO:tasks.ceph.osd.1.smithi192.stderr:vex amd64->IR: unhandled instruction bytes: 0xF 0xC7 0xF0 0x89 0x6 0xF 0x42 0xC1
2017-06-01T03:26:56.138 INFO:tasks.ceph.osd.1.smithi192.stderr:vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
2017-06-01T03:26:56.140 INFO:tasks.ceph.osd.1.smithi192.stderr:vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=0F
2017-06-01T03:26:56.143 INFO:tasks.ceph.osd.1.smithi192.stderr:vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
2017-06-01T03:26:56.147 INFO:tasks.ceph.osd.1.smithi192.stderr:==00:00:00:06.020 91335== valgrind: Unrecognised instruction at address 0xc053b15.
2017-06-01T03:26:56.150 INFO:tasks.ceph.osd.1.smithi192.stderr:==00:00:00:06.020 91335== Your program just tried to execute an instruction that Valgrind
2017-06-01T03:26:56.153 INFO:tasks.ceph.osd.1.smithi192.stderr:==00:00:00:06.020 91335== did not recognise.  There are two possible reasons for this.
2017-06-01T03:26:56.156 INFO:tasks.ceph.osd.1.smithi192.stderr:==00:00:00:06.020 91335== 1. Your program has a bug and erroneously jumped to a non-code
2017-06-01T03:26:56.158 INFO:tasks.ceph.osd.1.smithi192.stderr:==00:00:00:06.020 91335==    location.  If you are running Memcheck and you just saw a
2017-06-01T03:26:56.161 INFO:tasks.ceph.osd.1.smithi192.stderr:==00:00:00:06.020 91335==    warning about a bad jump, it's probably your program's fault.
2017-06-01T03:26:56.164 INFO:tasks.ceph.osd.1.smithi192.stderr:==00:00:00:06.020 91335== 2. The instruction is legitimate but Valgrind doesn't handle it,
2017-06-01T03:26:56.167 INFO:tasks.ceph.osd.1.smithi192.stderr:==00:00:00:06.020 91335==    i.e. it's Valgrind's fault.  If you think this is the case or
2017-06-01T03:26:56.170 INFO:tasks.ceph.osd.1.smithi192.stderr:==00:00:00:06.020 91335==    you are not sure, please let us know and we'll try to fix it.
2017-06-01T03:26:56.173 INFO:tasks.ceph.osd.1.smithi192.stderr:==00:00:00:06.020 91335== Either way, Valgrind will now raise a SIGILL signal which will
2017-06-01T03:26:56.176 INFO:tasks.ceph.osd.1.smithi192.stderr:==00:00:00:06.020 91335== probably kill your program.
2017-06-01T03:26:56.180 INFO:tasks.ceph.osd.1.smithi192.stderr:*** Caught signal (Illegal instruction) **
2017-06-01T03:26:56.183 INFO:tasks.ceph.osd.1.smithi192.stderr: in thread 96db6c0 thread_name:memcheck-amd64-
2017-06-01T03:26:56.187 INFO:tasks.ceph.osd.1.smithi192.stderr: ceph version  12.0.2-1874-g9581c1e (9581c1ec2323fe8aeeb9e60dc3397298b2350970) luminous (dev)
2017-06-01T03:26:56.190 INFO:tasks.ceph.osd.1.smithi192.stderr: 1: (()+0x9e05d2) [0xae85d2]
2017-06-01T03:26:56.193 INFO:tasks.ceph.osd.1.smithi192.stderr: 2: (()+0x11390) [0xb972390]
2017-06-01T03:26:56.196 INFO:tasks.ceph.osd.1.smithi192.stderr: 3: (()+0xb7b15) [0xc053b15]
2017-06-01T03:26:56.200 INFO:tasks.ceph.osd.1.smithi192.stderr: 4: (std::random_device::_M_getval()+0x92) [0xc053cb2]
2017-06-01T03:26:56.202 INFO:tasks.ceph.osd.1.smithi192.stderr: 5: (MonClient::_add_conns(unsigned long)+0xe9) [0xb414e9]
2017-06-01T03:26:56.206 INFO:tasks.ceph.osd.1.smithi192.stderr: 6: (MonClient::_reopen_session(int)+0x45f) [0xb42e0f]
2017-06-01T03:26:56.209 INFO:tasks.ceph.osd.1.smithi192.stderr: 7: (MonClient::authenticate(double)+0x62d) [0xb4452d]
2017-06-01T03:26:56.212 INFO:tasks.ceph.osd.1.smithi192.stderr: 8: (OSD::init()+0x265a) [0x5936ca]
2017-06-01T03:26:56.216 INFO:tasks.ceph.osd.1.smithi192.stderr: 9: (main()+0x2ebc) [0x4a8b5c]
2017-06-01T03:26:56.221 INFO:tasks.ceph.osd.1.smithi192.stderr: 10: (__libc_start_main()+0xf0) [0xc85d830]
2017-06-01T03:26:56.224 INFO:tasks.ceph.osd.1.smithi192.stderr: 11: (_start()+0x29) [0x52eb59]
2017-06-01T03:26:56.229 INFO:tasks.ceph.osd.1.smithi192.stderr:2017-06-01 03:26:56.075549 96db6c0 -1 osd.1 0 log_to_monitors {default=true}
2017-06-01T03:26:56.232 INFO:tasks.ceph.osd.1.smithi192.stderr:2017-06-01 03:26:56.174695 96db6c0 -1 *** Caught signal (Illegal instruction) **

/a/sage-2017-06-01_02:27:12-rados-wip-sage-testing2---basic-smithi/1249784

not fixed yet on xenial

teuthology:1248735  04:47 PM $ ssh smithi192
Welcome to Ubuntu 16.04.1 LTS (GNU/Linux 4.12.0-rc3-ceph-gdc9938ed50b8 x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage
Last login: Thu Jun  1 16:39:53 2017 from 172.21.0.51
ubuntu@smithi192:~$ dpkg -l | grep valgrind
ii  valgrind                                  1:3.11.0-1ubuntu4.1                      amd64        instrumentation framework for building dynamic analysis tools

valgrind is already the newest version (1:3.11.0-1ubuntu4.1).
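
The `unhandled instruction bytes` line in the log above can be decoded by hand. A minimal sketch, assuming only the standard x86 encoding in which `0F C7` with a register-form ModRM byte selects `rdrand` (`/6`) or `rdseed` (`/7`); the `decode_0f_c7` helper is hypothetical and handles nothing beyond this one opcode:

```python
REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REGISTER_FORMS = {6: "rdrand", 7: "rdseed"}


def decode_0f_c7(code: bytes) -> str:
    """Name the instruction for a 0F C7 /r encoding in register form."""
    if len(code) < 3 or code[0] != 0x0F or code[1] != 0xC7:
        raise ValueError("not a 0F C7 opcode")
    modrm = code[2]
    mod = modrm >> 6          # top two bits: addressing mode
    reg = (modrm >> 3) & 7    # middle three bits: opcode extension
    rm = modrm & 7            # low three bits: register operand
    if mod != 0b11:
        return "memory-operand form (not decoded in this sketch)"
    if reg not in REGISTER_FORMS:
        return "unknown register form"
    # The log reports REX=0 and PFX.66=0, so the operand is 32-bit.
    return f"{REGISTER_FORMS[reg]} {REGS32[rm]}"


# Bytes exactly as printed by valgrind in the log above.
raw = "0xF 0xC7 0xF0 0x89 0x6 0xF 0x42 0xC1"
code = bytes(int(tok, 16) for tok in raw.split())
print(decode_0f_c7(code))
```

The first three bytes decode to `rdrand eax`, which is consistent with the faulting frame sitting next to `std::random_device::_M_getval()` in the backtrace.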

Actions #10

Updated by Greg Farnum almost 7 years ago

  • Assignee set to David Galloway

Hmm, any idea why the valgrind versions seem to have regressed, David?

Actions #11

Updated by David Galloway almost 7 years ago

Greg Farnum wrote:

Hmm, any idea why the valgrind versions seem to have regressed, David?

A few of the smithi have definitely been reimaged since I last touched this. At the time, I didn't have the foresight to put this in ceph-cm-ansible. I'll do that now.

The version that'll get installed is intended for Zesty. Is that ok? https://launchpad.net/ubuntu/zesty/amd64/valgrind/1:3.12.0-1ubuntu1

Actions #13

Updated by Sage Weil almost 7 years ago

1:3.12.0-1.1ubuntu1 on smithi107 showed the error in #20360

Actions #14

Updated by David Galloway almost 7 years ago

Okay, I've uploaded https://launchpad.net/ubuntu/+source/valgrind/1:3.12.0~svn20160714-1ubuntu2/+build/10602185/+files/valgrind_3.12.0~svn20160714-1ubuntu2_amd64.deb to chacra and removed the zesty version (3.12.0-1ubuntu1), which introduced a new bug: http://tracker.ceph.com/issues/20360

The hope is that 3.12.0~svn20160714-1ubuntu2 will fix the original valgrind bug in this issue and won't have the new bug in the 3.12.0-1ubuntu1 version.

I've removed valgrind from all the Xenial testnodes [1] so the svn version will get installed on the next ceph-cm-ansible run.

[1] for host in $(tl --brief -a -m smithi --os-type ubuntu --os-version 16.04 | grep -v 'slave\|dfuller\|tracker' | cut -d ' ' -f1); do ssh $host "sudo apt-get purge -y valgrind"; done

Actions #15

Updated by Sage Weil almost 7 years ago

  • Related to Bug #20360: rados/verify valgrind tests: osds fail to start (xenial valgrind) added
Actions #16

Updated by Sage Weil almost 7 years ago

  • Priority changed from Urgent to Normal

Confining valgrind tests to CentOS again, so this is no longer a high priority.

Actions #17

Updated by Patrick Donnelly over 4 years ago

  • Status changed from 15 to Fix Under Review
Actions #18

Updated by David Galloway over 4 years ago

  • Status changed from Fix Under Review to Resolved