Bug #20529

Illegal instruction in RocksDB

Added by Dennis Busch almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Immediate
Assignee: -
Category: build
Target version: -
Start date: 07/06/2017
Due date:
% Done: 0%
Source:
Tags:
Backport: luminous
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

I am trying to create a ceph-mon. Judging from other users' experience, this issue might only affect Opteron CPUs. During creation I get the following:

# ceph-mon --mkfs -i 0 --monmap /tmp/monmap --keyring /etc/pve/priv/ceph.mon.keyring
monmaptool: monmap file /tmp/monmap
monmaptool: generated fsid 5d603e48-432c-4b58-8b17-461ecd20018d
epoch 0
fsid 5d603e48-432c-4b58-8b17-461ecd20018d
last_changed 2017-07-05 21:40:15.972493
created 2017-07-05 21:40:15.972493
0: 192.168.110.5:6789/0 mon.0
monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
ceph-mon: set fsid to 7dde6305-0fc1-4ada-99a9-68617f02baad
*** Caught signal (Illegal instruction) **
 in thread 7f2264e3fc80 thread_name:ceph-mon
 ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
 1: (()+0x82bcc2) [0x5567dbad3cc2]
 2: (()+0x110c0) [0x7f22641f70c0]
 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x5567dbc94cc1]
 4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x26bc) [0x5567dbb7c14c]
 5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x11f) [0x5567dbb4331f]
 6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xe40) [0x5567dbb44d90]
 7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x698) [0x5567dbb465f8]
 8: (RocksDBStore::do_open(std::ostream&, bool)+0x68e) [0x5567db6dd00e]
 9: (RocksDBStore::create_and_open(std::ostream&)+0xd7) [0x5567db6de5c7]
 10: (MonitorDBStore::create_and_open(std::ostream&)+0xe3) [0x5567db5bec83]
 11: (main()+0x7ef) [0x5567db52d4af]
 12: (__libc_start_main()+0xf1) [0x7f226160b2b1]
 13: (_start()+0x2a) [0x5567db5bbdba]
2017-07-05 21:40:16.355087 7f2264e3fc80 -1 *** Caught signal (Illegal instruction) **
 in thread 7f2264e3fc80 thread_name:ceph-mon

 ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
 1: (()+0x82bcc2) [0x5567dbad3cc2]
 2: (()+0x110c0) [0x7f22641f70c0]
 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x5567dbc94cc1]
 4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x26bc) [0x5567dbb7c14c]
 5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x11f) [0x5567dbb4331f]
 6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xe40) [0x5567dbb44d90]
 7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x698) [0x5567dbb465f8]
 8: (RocksDBStore::do_open(std::ostream&, bool)+0x68e) [0x5567db6dd00e]
 9: (RocksDBStore::create_and_open(std::ostream&)+0xd7) [0x5567db6de5c7]
 10: (MonitorDBStore::create_and_open(std::ostream&)+0xe3) [0x5567db5bec83]
 11: (main()+0x7ef) [0x5567db52d4af]
 12: (__libc_start_main()+0xf1) [0x7f226160b2b1]
 13: (_start()+0x2a) [0x5567db5bbdba]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2017-07-05 21:40:16.355087 7f2264e3fc80 -1 *** Caught signal (Illegal instruction) **
 in thread 7f2264e3fc80 thread_name:ceph-mon

 ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
 1: (()+0x82bcc2) [0x5567dbad3cc2]
 2: (()+0x110c0) [0x7f22641f70c0]
 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x5567dbc94cc1]
 4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x26bc) [0x5567dbb7c14c]
 5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x11f) [0x5567dbb4331f]
 6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xe40) [0x5567dbb44d90]
 7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x698) [0x5567dbb465f8]
 8: (RocksDBStore::do_open(std::ostream&, bool)+0x68e) [0x5567db6dd00e]
 9: (RocksDBStore::create_and_open(std::ostream&)+0xd7) [0x5567db6de5c7]
 10: (MonitorDBStore::create_and_open(std::ostream&)+0xe3) [0x5567db5bec83]
 11: (main()+0x7ef) [0x5567db52d4af]
 12: (__libc_start_main()+0xf1) [0x7f226160b2b1]
 13: (_start()+0x2a) [0x5567db5bbdba]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
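
For readers who want to pull the interesting frame out of a trace like the one above, here is a minimal sketch in Python. It assumes the frame layout shown in this report (`N: (symbol+0xoffset) [0xaddress]`); the names `FRAME_RE`, `parse_frames` and `first_named_frame` are hypothetical helpers, not part of Ceph. The first two frames are anonymous (most likely the signal handler and the signal return path), so the first named frame is probably where the illegal instruction was hit.

```python
import re

# Matches frames such as:
#  " 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x5567dbc94cc1]"
FRAME_RE = re.compile(
    r"^\s*(?P<index>\d+): \((?P<symbol>.*)\+0x(?P<offset>[0-9a-f]+)\) "
    r"\[0x(?P<address>[0-9a-f]+)\]\s*$"
)


def parse_frames(trace_text):
    """Return one dict per backtrace frame found in trace_text."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append({
                "index": int(match.group("index")),
                "symbol": match.group("symbol"),
                "offset": int(match.group("offset"), 16),
                "address": int(match.group("address"), 16),
            })
    return frames


def first_named_frame(frames):
    """Return the first frame whose symbol is not the anonymous '()' placeholder."""
    for frame in frames:
        if frame["symbol"] != "()":
            return frame
    return None


# The first three frames copied from the trace in this report.
SAMPLE = """\
 1: (()+0x82bcc2) [0x5567dbad3cc2]
 2: (()+0x110c0) [0x7f22641f70c0]
 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x5567dbc94cc1]
"""

frame = first_named_frame(parse_frames(SAMPLE))
print(frame["symbol"], hex(frame["offset"]))
```

The symbol and offset can then be looked up in the `objdump -rdS` output that the NOTE line asks for.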

ceph-mon.kvmtest1.log (193 KB) alexandre derumier, 07/21/2017 02:08 AM
ceph-mon.0-pve51.log (73.7 KB) Benjamin Candler, 07/21/2017 12:44 PM
ceph-mon.0-pve52.log (73.7 KB) Benjamin Candler, 07/21/2017 12:44 PM
ceph-mon.0-pve53.log (36.9 KB) Benjamin Candler, 07/21/2017 12:45 PM
pve51.txt - pveceph createmon (5.51 KB) Benjamin Candler, 07/21/2017 12:45 PM
pve52.txt - pveceph createmon (5.47 KB) Benjamin Candler, 07/21/2017 12:45 PM
pve53.txt - pveceph createmon (5.47 KB) Benjamin Candler, 07/21/2017 12:45 PM
pve51-xeon.txt - pveceph createmon (5.46 KB) Benjamin Candler, 07/21/2017 01:33 PM


Related issues

Copied to Ceph - Backport #21396: Illegal instruction in RocksDB Resolved

History

#1 Updated by Greg Farnum almost 2 years ago

  • Subject changed from Illegal instruction in thread thread_name:ceph-mon to Illegal instruction in RocksDB
  • Priority changed from Normal to High

I wonder if we're passing compile options down incorrectly? Or is RocksDB on Opteron just busted?

#2 Updated by Dennis Busch almost 2 years ago

Additional Information:
- CPU is an AMD Opteron Processor 8439 SE
- The hosting OS is Proxmox VE 5.0
- Also reproducible with jemalloc activated in /etc/default/ceph

#3 Updated by Dennis Busch almost 2 years ago

This bug is not restricted to Opteron. We have also received reports that the very same bug occurs on slightly older Xeon CPUs.

#4 Updated by Daniel Oliveira almost 2 years ago

I am trying to reproduce this issue.

#5 Updated by Sage Weil almost 2 years ago

  • Status changed from New to Verified
  • Assignee set to Daniel Oliveira

#6 Updated by Daniel Oliveira almost 2 years ago

Dennis, would you mind sharing some more info about the environment?
I've tried the following and they all seem to work properly so far (all from VMs):

<attempt>
ceph-mon --mkfs -i 3 --monmap /tmp/monmaptest/monmapfile --keyring /var/lib/ceph/mon/ceph-sles12-node3/keyring
ceph-mon: set fsid to 0e515af3-7488-3cc9-8392-1e59f6ee0ef0
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-3 for mon.3
</attempt>

<attempt>
ceph-mon --mkfs -i 0 --monmap /tmp/monmaptest/monmapfile --keyring /var/lib/ceph/mon/ceph-sles12-node2/keyring
ceph-mon: set fsid to 0e515af3-7488-3cc9-8392-1e59f6ee0ef0
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-0 for mon.0
</attempt>

<attempt>
ceph-mon --mkfs -i $HOSTNAME --monmap /tmp/monmaptest/monmapfile --keyring /var/lib/ceph/mon/ceph-sles12-node1/keyring
ceph-mon: set fsid to 0e515af3-7488-3cc9-8392-1e59f6ee0ef0
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-sles12-node1 for mon.1
</attempt>

ceph version 12.1.0-289-g117b171715 (117b1717154e1236b2d37c405a86a9444cf7871d) luminous (dev)

I am curious to see if '--debug_mon' would show something interesting.

#7 Updated by Daniel Oliveira almost 2 years ago

Dennis,

Would you have any updates on this issue?

@All,
Has anybody else seen this or been able to reproduce it at will?

Thanks,

#8 Updated by Klas M almost 2 years ago

Daniel Oliveira wrote:

> Dennis,
>
> Would you have any updates on this issue?
>
> @All,
> Has anybody else seen this or been able to reproduce it at will?
>
> Thanks,

Hi all, idk if this helps, but here are some more stats for you.

CPU: 2 x AMD Turion(tm) II Neo N40L Dual-Core Processor (1 Socket)
Kernel Version: Linux 4.10.15-1-pve #1 SMP PVE 4.10.15-15 (Fri, 23 Jun 2017 08:57:55 +0200)

    *** Caught signal (Illegal instruction) **
    in thread 7fbb99f54c80 thread_name:ceph-mon
    ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) luminous (dev)
    1: (()+0x82bcc2) [0x560be7175cc2]
    2: (()+0x110c0) [0x7fbb9930d0c0]
    3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x560be7336cc1]
    4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x26bc) [0x560be721e14c]
    5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x11f) [0x560be71e531f]
    6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xe40) [0x560be71e6d90]
    7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x698) [0x560be71e85f8]
    8: (RocksDBStore::do_open(std::ostream&, bool)+0x68e) [0x560be6d7f00e]
    9: (MonitorDBStore::open(std::ostream&)+0xec) [0x560be6c60f3c]
    10: (main()+0x12bf) [0x560be6bcff7f]
    11: (__libc_start_main()+0xf1) [0x7fbb967212b1]
    12: (_start()+0x2a) [0x560be6c5ddba]
    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    0> 2017-07-18 23:02:17.815910 7fbb99f54c80 -1 *** Caught signal (Illegal instruction) **
    in thread 7fbb99f54c80 thread_name:ceph-mon

    ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) luminous (dev)
    1: (()+0x82bcc2) [0x560be7175cc2]
    2: (()+0x110c0) [0x7fbb9930d0c0]
    3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x560be7336cc1]
    4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x26bc) [0x560be721e14c]
    5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x11f) [0x560be71e531f]
    6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xe40) [0x560be71e6d90]
    7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x698) [0x560be71e85f8]
    8: (RocksDBStore::do_open(std::ostream&, bool)+0x68e) [0x560be6d7f00e]
    9: (MonitorDBStore::open(std::ostream&)+0xec) [0x560be6c60f3c]
    10: (main()+0x12bf) [0x560be6bcff7f]
    11: (__libc_start_main()+0xf1) [0x7fbb967212b1]
    12: (_start()+0x2a) [0x560be6c5ddba]
    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Illegal instruction

it's just a small home lab, hope it helps.

thanks for all the good work you do! <3

#9 Updated by Daniel Oliveira almost 2 years ago

Klas,

Thanks for the update. I will try it again as soon as I can get an AMD box. With VMs and Intel processors I haven't been able to reproduce the crash so far, which brings me back to @Greg's comment (http://tracker.ceph.com/issues/20529#note-1) and Dennis' comment (http://tracker.ceph.com/issues/20529#note-3) about 'older Xeon CPUs'.

Do we have a core file we could look at? Also, when I last updated the bug, I mentioned trying to set 'debug mon' (http://docs.ceph.com/docs/kraken/rados/troubleshooting/log-and-debug/). Would you be able to capture that when you get a minute?

Thanks,

#10 Updated by Dennis Busch almost 2 years ago

Would it help if I gave you direct access to the server? It's "only" a learning environment that will be re-installed soon, so there would be no security concerns.

#11 Updated by Thomas Lucke almost 2 years ago

Hello, I'm the "old Xeon CPU" guy.

works:
Intel Xeon X5672

fails:
Intel Xeon X5355

not tested:
Intel Xeon E5645 (I think it works)
Intel Xeon E5450 (I think it may fail)
Intel Xeon L5320 (I think it also fails)

Because some of the machines are in production, I could not test them with the new version. I will try to create some debug logs by Friday, but the best option is to try it yourself on the test cluster provided by Dennis. Thanks for the opportunity, Dennis.

#12 Updated by Brendan Mirotchnick almost 2 years ago

I've been following this thread for a few days now. If it helps, I was able to confirm the problem on the following processors as well:

Intel Xeon E5335
AMD Phenom 9850

Happy to provide more information if required.

Thanks!

#13 Updated by alexandre derumier almost 2 years ago

Hi,

I can reproduce it on an old Xeon 5110.

ceph-mon: set fsid to 967014e1-48ac-464e-8ccf-cfe732d052c7
    *** Caught signal (Illegal instruction) **
    in thread 7f79283e5c80 thread_name:ceph-mon
    ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) luminous (dev)
    1: (()+0x82bcc2) [0x561382899cc2]
    2: (()+0x110c0) [0x7f79277bb0c0]
    3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x561382a5acc1]
    4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x26bc) [0x56138294214c]
    5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x11f) [0x56138290931f]
    6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xe40) [0x56138290ad90]
    7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x698) [0x56138290c5f8]
    8: (RocksDBStore::do_open(std::ostream&, bool)+0x68e) [0x5613824a300e]
    9: (RocksDBStore::create_and_open(std::ostream&)+0xd7) [0x5613824a45c7]
    10: (MonitorDBStore::create_and_open(std::ostream&)+0xe3) [0x561382384c83]
    11: (main()+0x7ef) [0x5613822f34af]
    12: (__libc_start_main()+0xf1) [0x7f7924bcf2b1]
    13: (_start()+0x2a) [0x561382381dba]
    2017-07-11 00:04:02.167698 7f79283e5c80 -1 *** Caught signal (Illegal instruction) **
    in thread 7f79283e5c80 thread_name:ceph-mon
ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) luminous (dev)
1: (()+0x82bcc2) [0x561382899cc2]
2: (()+0x110c0) [0x7f79277bb0c0]
3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x561382a5acc1]
4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x26bc) [0x56138294214c]
5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x11f) [0x56138290931f]
6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xe40) [0x56138290ad90]
7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x698) [0x56138290c5f8]
8: (RocksDBStore::do_open(std::ostream&, bool)+0x68e) [0x5613824a300e]
9: (RocksDBStore::create_and_open(std::ostream&)+0xd7) [0x5613824a45c7]
10: (MonitorDBStore::create_and_open(std::ostream&)+0xe3) [0x561382384c83]
11: (main()+0x7ef) [0x5613822f34af]
12: (__libc_start_main()+0xf1) [0x7f7924bcf2b1]
13: (_start()+0x2a) [0x561382381dba]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
0> 2017-07-11 00:04:02.167698 7f79283e5c80 -1 *** Caught signal (Illegal instruction) **
in thread 7f79283e5c80 thread_name:ceph-mon
ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) luminous (dev)
1: (()+0x82bcc2) [0x561382899cc2]
2: (()+0x110c0) [0x7f79277bb0c0]
3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x561382a5acc1]
4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x26bc) [0x56138294214c]
5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x11f) [0x56138290931f]
6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xe40) [0x56138290ad90]
7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x698) [0x56138290c5f8]
8: (RocksDBStore::do_open(std::ostream&, bool)+0x68e) [0x5613824a300e]
9: (RocksDBStore::create_and_open(std::ostream&)+0xd7) [0x5613824a45c7]
10: (MonitorDBStore::create_and_open(std::ostream&)+0xe3) [0x561382384c83]
11: (main()+0x7ef) [0x5613822f34af]
12: (__libc_start_main()+0xf1) [0x7f7924bcf2b1]
13: (_start()+0x2a) [0x561382381dba]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#14 Updated by alexandre derumier almost 2 years ago

Could it be related to the RocksDB build?

"
https://github.com/facebook/rocksdb/blob/master/INSTALL.md

By default the binary we produce is optimized for the platform you're compiling on (-march=native or the equivalent). SSE4.2 will thus be enabled automatically if your CPU supports it. To print a warning if your CPU does not support SSE4.2, build with USE_SSE=1 make static_lib or, if using CMake, cmake -DFORCE_SSE42=ON. If you want to build a portable binary, add PORTABLE=1 before your make commands, like this: PORTABLE=1 make static_lib."

Are the Ceph packages built with PORTABLE=1?

#15 Updated by alexandre derumier almost 2 years ago

This old commit had PORTABLE=1
https://github.com/ceph/ceph/pull/6311/files

but I can't find it in master anymore.

In master, src/CMakeLists.txt has:

  # We really want to have the CRC32 calculation in RocksDB accelerated
  # with SSE 4.2. For details refer to rocksdb/util/crc32c.cc.
  if (HAVE_INTEL_SSE4_2)
    list(APPEND ROCKSDB_CMAKE_ARGS -DCMAKE_CXX_FLAGS=${SIMD_COMPILE_FLAGS})
  else()
    list(APPEND ROCKSDB_CMAKE_ARGS -DWITH_SSE42=OFF)
  endif()

So if the packages are built on a CPU with SSE 4.2, they will not work on older systems?
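
To check whether a given machine falls into the "older system" group, one option is to look at the CPU flags that Linux reports. The sketch below is only an illustration: it assumes a Linux host where `/proc/cpuinfo` has a `flags` line and where SSE 4.2 is listed as `sse4_2`. The function names `cpu_flags` and `supports_sse42` are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supports_sse42(cpuinfo_text):
    """True if the kernel lists the sse4_2 flag for this CPU."""
    return "sse4_2" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as cpuinfo:
            text = cpuinfo.read()
    except OSError:
        print("/proc/cpuinfo is not available on this system")
    else:
        print("sse4_2 supported:", supports_sse42(text))
```

A machine where this prints `False` would be a candidate for reproducing the crash, if the packages really do contain unguarded SSE 4.2 instructions.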

#16 Updated by Daniel Oliveira almost 2 years ago

Thanks for all the updates.
As per @Greg's comment (http://tracker.ceph.com/issues/20529#note-1), it is a possibility. We need to debug it some more.

@Dennis, yes, we can definitely try accessing your environment and debugging it there.

@Alexandre, I will check on the build process and see what/how/if it applies (PORTABLE variable) to what we are seeing here.

#17 Updated by alexandre derumier almost 2 years ago

I have done more tests,

I can't reproduce it on debian jessie in a qemu machine with limited cpu flags

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology eagerfpu pni cx16 x2apic hypervisor lahf_lm

I'll try the same in a debian stretch vm.

(I have attached the ceph-mon log of my old xeon server with stretch)

#18 Updated by alexandre derumier almost 2 years ago

More test results:

- debian jessie vm with qemu64 cpu on recent xeon v3 : ok
- debian stretch vm with qemu64 cpu on recent xeon v3 : ok

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology eagerfpu pni cx16 x2apic hypervisor lahf_lm

Same VM:
- debian jessie vm with qemu64 cpu on old xeon 5110 : error
- debian stretch vm with qemu64 cpu on old xeon 5110 : error

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx lm constant_tsc nopl xtopology pni cx16 x2apic hypervisor lahf_lm

(the only flag differences are ht and eagerfpu)
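
The comparison of the two flag lists can be double-checked mechanically. The short Python sketch below uses the two lists exactly as pasted in this comment; the variable names are illustrative only.

```python
# Flags reported inside the VM on the recent Xeon v3 host (works).
NEW_HOST_FLAGS = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology "
    "eagerfpu pni cx16 x2apic hypervisor lahf_lm"
)

# Flags reported inside the same VM on the old Xeon 5110 host (fails).
OLD_HOST_FLAGS = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush mmx fxsr sse sse2 syscall nx lm constant_tsc nopl xtopology "
    "pni cx16 x2apic hypervisor lahf_lm"
)

new_flags = set(NEW_HOST_FLAGS.split())
old_flags = set(OLD_HOST_FLAGS.split())

only_on_new = sorted(new_flags - old_flags)
only_on_old = sorted(old_flags - new_flags)

print("only on new host:", only_on_new)
print("only on old host:", only_on_old)
print("sse4_2 visible in either guest:", "sse4_2" in (new_flags | old_flags))
```

Neither guest lists `sse4_2`, yet only the one on the old host fails. That may suggest that the guest-visible flag list alone does not decide the outcome and that the physical CPU matters.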

#19 Updated by Benjamin Candler almost 2 years ago

While testing the new pve5, I ran into this illegal instruction too: three older machines, used just for testing, and the 'same' error three times.

pve51: Core 2 Quad Q6600 @ 2.4 GHz
pve52: Core 2 Duo E6550 @ 2.33 GHz
pve53: Core 2 Duo E6750 @ 2.66 GHz

I'm looking for a Xeon CPU to swap into pve51. I'm sure there is one around; maybe it will work. I will report back.

#20 Updated by alexandre derumier almost 2 years ago

Just a note :

It's not related to Proxmox; I'm able to reproduce it with stock Debian jessie/stretch (the same VM works on a machine with a new CPU, but not on one with an old CPU).

#21 Updated by Benjamin Candler almost 2 years ago

In response to Alexandre's note, some more installation information:

Minimal Debian stretch, updated, Proxmox 5.0-23/af4267bf installed afterwards, clustered, and Proxmox used to create the basic Ceph environment.

And yes, as far as I can tell, it has nothing to do with Proxmox. I've attached the pve5?.txt files to show the commands that Proxmox's "pveceph createmon" executes and that fail.

#22 Updated by Benjamin Candler almost 2 years ago

Exchanged Intel Core 2 Quad with Xeon 3050 in pve51, same problem. Attached command line log.

#23 Updated by John Jaser almost 2 years ago

I'm seeing this too:

AMD A6-5400K APU - OK
AMD Phenom(tm) II X4 955 - Fails
Pentium(R) Dual-Core CPU E5300 @ 2.60GHz - Fails

#24 Updated by John Jaser almost 2 years ago

Not surprising, but deploying a bluestore OSD blows up in a similar way on the identified systems. I can also provide remote access to the above boxes if wanted.

thanks.

#25 Updated by Daniel Oliveira almost 2 years ago

@All, Thanks for the updates. I am going through the log files.

@Dennis,
I will access your environment in a few to check on some more info.

@John,
Yes, if you can provide remote access, that would be great too, so we can check/validate more than one environment.

Thanks,
-Daniel

#26 Updated by Francois Payette almost 2 years ago

Daniel Oliveira wrote:

> @All, Thanks for the updates. I am going through the log files.
>
> @Dennis,
> I will access your environment in a few to check on some more info.
>
> @John,
> Yes, if you can provide remote access, that would be great too, so we can check/validate more than one environment.
>
> Thanks,
> -Daniel

Hello, we're experiencing the same issue on older non-SSE4_2 CPUs.

In Ceph master, at line 144 of https://github.com/ceph/rocksdb/blob/e15382c09c87a65eaeca9bda233bab503f1e5772/Makefile, there's a check for the $PORTABLE environment variable. Is there a log of the build somewhere? Can someone who has the build environment set up run a quick test?

TIA!
best,
-Francois

#27 Updated by Sage Weil almost 2 years ago

  • Priority changed from High to Immediate

#28 Updated by Daniel Oliveira almost 2 years ago

@ John / Dennis,

Thanks for making your environment available. Please, where/how were your binaries built? Was it using your environment? Or on a CPU that supports SSE4_2?
Based on http://tracker.ceph.com/issues/20529#note-15, if it was built on a CPU that doesn't support SSE 4.2, then 'WITH_SSE42=OFF' would be used. However, I wonder whether a build made on an SSE 4.2-capable CPU could be causing this problem when it runs on CPUs that do not support it.

In general, binaries built with the SSE 4.2 flag (-msse4.2) can use the SSE-optimized paths. While these paths correctly have runtime checks for SSE 4.2 support, the flag also allows the compiler to emit SSE 4.2 instructions automatically in other code compiled with it, outside those runtime checks.

So, we could try:
1. a build without the optimization in question
2. try a build with the commit (https://github.com/ceph/rocksdb/blob/e15382c09c87a65eaeca9bda233bab503f1e5772/Makefile), based on comment http://tracker.ceph.com/issues/20529#note-26

I will get a build w/o the optimization in question, so we could test it.

Thanks.
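
One way to test the hypothesis above on an existing build is to disassemble the binary (the trace output already points to `objdump`) and look for SSE 4.2 mnemonics. The Python sketch below is a rough illustration under several assumptions: it expects GNU `objdump -d` output in the default AT&T syntax with tab-separated address, bytes and instruction columns; it only knows the SSE 4.2 mnemonics listed in `SSE42_MNEMONICS`; and the disassembly in `SAMPLE` is hand-written for the example, not taken from any Ceph binary. The faulting instruction in this bug could also belong to a different extension, so an empty result would not prove much.

```python
# SSE 4.2 instruction mnemonics; objdump may append an operand-size suffix
# (for example crc32l), so matching is done by prefix.
SSE42_MNEMONICS = (
    "crc32", "pcmpestri", "pcmpestrm", "pcmpistri", "pcmpistrm", "pcmpgtq",
)


def find_sse42(disassembly_text):
    """Return (address, mnemonic) pairs for SSE 4.2 instructions in objdump -d text."""
    hits = []
    for line in disassembly_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # section headers, labels, blank or byte-continuation lines
        address = fields[0].strip().rstrip(":")
        parts = fields[2].split()
        if not parts:
            continue
        mnemonic = parts[0]
        if mnemonic.startswith(SSE42_MNEMONICS):
            hits.append((address, mnemonic))
    return hits


# Hand-written sample in objdump's layout (hypothetical function and addresses).
SAMPLE = "\n".join([
    "0000000000001000 <hypothetical_function>:",
    "    1000:\t48 89 e5             \tmov    %rsp,%rbp",
    "    1003:\tf2 0f 38 f1 c2       \tcrc32l %edx,%eax",
    "    1008:\t66 0f 38 37 c1       \tpcmpgtq %xmm1,%xmm0",
])

for address, mnemonic in find_sse42(SAMPLE):
    print(address, mnemonic)
```

In practice the text would come from something like `objdump -d <executable>` redirected to a file. Hits inside the guarded CRC32 path are expected; hits elsewhere would support the idea that the flag leaked into general code.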

#29 Updated by John Jaser almost 2 years ago

Daniel-

I installed binaries from http://download.ceph.com/debian-luminous/dists/ for stretch and jessie.

#30 Updated by Sage Weil almost 2 years ago

  • Backport set to luminous

#31 Updated by Sage Weil almost 2 years ago

I've pushed builds that include Daniel's fix... can someone with an old CPU test please?

https://shaman.ceph.com/repos/ceph/wip-rocksdb-instruction/f536eda1496333929fd1ce2648c4121dc6e45034/

#32 Updated by alexandre derumier almost 2 years ago

Sage Weil wrote:

> I've pushed builds that include Daniel's fix... can someone with an old CPU test please?
>
> https://shaman.ceph.com/repos/ceph/wip-rocksdb-instruction/f536eda1496333929fd1ce2648c4121dc6e45034/

I'm back from holiday; I'll test them tomorrow. I don't see a Debian build in this repo. (I'll try the xenial build on Debian stretch; if it doesn't work, I'll rebuild the package for Debian from this branch.)

#33 Updated by Brendan Mirotchnick almost 2 years ago

I just tried on Proxmox 5 (Debian). I set up the repo like this:

deb [trusted=yes] https://1.chacra.ceph.com/r/ceph/wip-rocksdb-instruction/f536eda1496333929fd1ce2648c4121dc6e45034/ubuntu/xenial/flavors/default/ xenial main

'apt upgrade' is successful. Unfortunately, this new version still fails. I also noticed that the debug information that was previously outputted is no longer there. So, I cannot confirm if anything else has changed.

#34 Updated by alexandre derumier almost 2 years ago

Brendan Mirotchnick wrote:

> I just tried on Proxmox 5 (Debian). I set up the repo like this:
>
> deb [trusted=yes] https://1.chacra.ceph.com/r/ceph/wip-rocksdb-instruction/f536eda1496333929fd1ce2648c4121dc6e45034/ubuntu/xenial/flavors/default/ xenial main
>
> 'apt upgrade' is successful. Unfortunately, this new version still fails. I also noticed that the debug information that was previously outputted is no longer there. So, I cannot confirm if anything else has changed.

Yes, I think the xenial build does not work on Debian 9 (I'm getting "illegal instruction" with a simple # ceph-mon, # ceph-osd, ...).
I'm starting to build the packages on my old Xeon; they should be ready tomorrow.

#35 Updated by Daniel Oliveira almost 2 years ago

I am checking on something else as well.

#36 Updated by alexandre derumier almost 2 years ago

Ok, I have tested on ubuntu 16.04 with my xeon 5110

root@ubuntuceph:/home/admin2# ceph-mon
Illegal instruction (core dumped)
root@ubuntuceph:/home/admin2# ceph-osd
Illegal instruction (core dumped)

#37 Updated by Francois Payette almost 2 years ago

same issue on xeon x3220 with version 12.1.2.

'ceph-mon --mkfs -i pvd2 --monmap /tmp/monmap --keyring /etc/pve/priv/ceph.mon.keyring' failed: got signal 4

1: (()+0x9202a4) [0x558759a102a4]
2: (()+0x110c0) [0x7f2dbba5f0c0]
3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x558759bd0511]
4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x26bc) [0x558759ab799c]
5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x11f) [0x558759a7eb6f]
6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xe40) [0x558759a805e0]
7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x698) [0x558759a81e48]
8: (RocksDBStore::do_open(std::ostream&, bool)+0x908) [0x5587595aadd8]
9: (RocksDBStore::create_and_open(std::ostream&)+0xd7) [0x5587595ac817]
10: (MonitorDBStore::create_and_open(std::ostream&)+0xe3) [0x558759467193]
11: (main()+0x7de) [0x55875938566e]
12: (__libc_start_main()+0xf1) [0x7f2db8e732b1]
13: (_start()+0x2a) [0x55875946426a]

best,
F

#38 Updated by Norman Uittenbogaart almost 2 years ago

Dennis Busch wrote:

> I am trying to create a ceph-mon. Judging from other users' experience, this issue might only affect Opteron CPUs. During creation I get the following:
>
> [...]

I have Xeon E5335s and am getting the same kind of dumps when trying to create OSDs.

#39 Updated by Norman Uittenbogaart almost 2 years ago

Norman Uittenbogaart wrote:

> Dennis Busch wrote:
>
> > I am trying to create a ceph-mon. Judging from other users' experience, this issue might only affect Opteron CPUs. During creation I get the following:
> >
> > [...]
>
> I have Xeon E5335s and am getting the same kind of dumps when trying to create OSDs.

Forgot to mention that this is on version 12.1.2.

#40 Updated by alexandre derumier almost 2 years ago

Hi,

I have tested 12.0.0 and it works fine.

12.1.0 has the same rocksdb error.

I'll try to build the latest version without this PR:
https://github.com/ceph/ceph/pull/13741/files

#41 Updated by alexandre derumier almost 2 years ago

It also works with 12.0.3.

#42 Updated by Daniel Oliveira almost 2 years ago

I've created a new PR with a slightly different fix: instead of depending on target("sse4.2"), we check for "sse4.2" at runtime:
https://github.com/ceph/rocksdb/pull/21

It would be great if we could build it with "sse3" and test it.
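
For anyone following along, here is a minimal sketch of what "checking at runtime" can look like. It is not the code from the PR. It assumes GCC or Clang (it relies on `__builtin_cpu_supports` and the `target` function attribute), and the function names `crc32c_sw`, `crc32c_hw` and `crc32c` are made up for illustration. The file is compiled without `-msse4.2`; only the one function marked `target("sse4.2")` may contain the SSE4.2 CRC instruction, and it is only called after the CPU check succeeds.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <nmmintrin.h>  // _mm_crc32_u8
#define HAVE_X86_DISPATCH 1
#endif

// Portable bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
// Slow, but safe on any CPU.
static std::uint32_t crc32c_sw(const unsigned char* p, std::size_t n) {
  std::uint32_t crc = ~0u;
  for (std::size_t i = 0; i < n; ++i) {
    crc ^= p[i];
    for (int k = 0; k < 8; ++k)
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
  return ~crc;
}

#ifdef HAVE_X86_DISPATCH
// Only this function is compiled with SSE4.2 enabled, so the rest of the
// translation unit stays free of SSE4.2 instructions.
__attribute__((target("sse4.2")))
static std::uint32_t crc32c_hw(const unsigned char* p, std::size_t n) {
  std::uint32_t crc = ~0u;
  for (std::size_t i = 0; i < n; ++i) crc = _mm_crc32_u8(crc, p[i]);
  return ~crc;
}
#endif

// Hypothetical entry point: picks the implementation once, at runtime.
std::uint32_t crc32c(const unsigned char* p, std::size_t n) {
#ifdef HAVE_X86_DISPATCH
  static const bool has_sse42 = __builtin_cpu_supports("sse4.2");
  if (has_sse42) return crc32c_hw(p, n);
#endif
  return crc32c_sw(p, n);
}
```

Both paths should agree on the standard CRC-32C check value: the nine bytes "123456789" give 0xE3069283. Note that this only helps if nothing else in the build passes `-msse4.2` globally, because then the compiler is free to emit SSE4.2 instructions anywhere.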

#43 Updated by alexandre derumier almost 2 years ago

Daniel Oliveira wrote:

I've created a new PR with a slightly different fix: instead of depending on target("sse4.2"), we check for "sse4.2" at runtime:
https://github.com/ceph/rocksdb/pull/21

It would be great if we could build it with "sse3" and test it.

Thanks Daniel. I'll build it and try to test it tomorrow.

#44 Updated by Daniel Oliveira almost 2 years ago

Alexandre,

Thank you very much.

#45 Updated by alexandre derumier almost 2 years ago

Daniel Oliveira wrote:

Alexandre,

Thank you very much.

Hi, I have tested your patch, and I still get the same error on my Xeon 5110. (Does it work for you?)
(I built it by cloning ceph from git, applying the patch to rocksdb, and running dpkg-buildpackage to build Debian packages.)

ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
1: (()+0x92ec14) [0x555a18b92c14]
2: (()+0x110c0) [0x7f142f13b0c0]
3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x555a18d53791]
4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x26bc) [0x555a18c3a30c]
5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x11f) [0x555a18c014df]
6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xe40) [0x555a18c02f50]
7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x698) [0x555a18c047b8]
8: (RocksDBStore::do_open(std::ostream&, bool)+0x908) [0x555a186d9518]
9: (RocksDBStore::create_and_open(std::ostream&)+0xd7) [0x555a186daf57]
10: (MonitorDBStore::create_and_open(std::ostream&)+0xe3) [0x555a18592f63]
11: (main()+0x7de) [0x555a184faa5e]
12: (__libc_start_main()+0xf1) [0x7f142c54f2b1]
13: (_start()+0x2a) [0x555a1859003a]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
0> 2017-08-16 17:11:09.877817 7f142fd63f80 -1 *** Caught signal (Illegal instruction) **
in thread 7f142fd63f80 thread_name:ceph-mon
ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
1: (()+0x92ec14) [0x555a18b92c14]
2: (()+0x110c0) [0x7f142f13b0c0]
3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x555a18d53791]
4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x26bc) [0x555a18c3a30c]
5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x11f) [0x555a18c014df]
6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xe40) [0x555a18c02f50]
7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x698) [0x555a18c047b8]
8: (RocksDBStore::do_open(std::ostream&, bool)+0x908) [0x555a186d9518]
9: (RocksDBStore::create_and_open(std::ostream&)+0xd7) [0x555a186daf57]
10: (MonitorDBStore::create_and_open(std::ostream&)+0xe3) [0x555a18592f63]
11: (main()+0x7de) [0x555a184faa5e]
12: (__libc_start_main()+0xf1) [0x7f142c54f2b1]
13: (_start()+0x2a) [0x555a1859003a]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#46 Updated by Daniel Oliveira almost 2 years ago

So it probably was not built with "sse3", but still, we are checking for "sse4.2" at runtime. Is there any way we could access your environment?

Thanks,
-Daniel

#47 Updated by alexandre derumier almost 2 years ago

Daniel Oliveira wrote:

So it probably was not built with "sse3", but still, we are checking for "sse4.2" at runtime. Is there any way we could access your environment?

Thanks,
-Daniel

I can provide you with VPN access. Can you email me at ?

#48 Updated by alexandre derumier almost 2 years ago

Hi,

I have tested with a manual build instead of dpkg-buildpackage:

git clone ...
apply patch on rocksdb
mkdir build
cd build/
cmake .. -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" -DCMAKE_BUILD_TYPE="Debug"
make
make install

and it's working: no more rocksdb error.

I'll double check tomorrow. Maybe something doesn't work with dpkg-buildpackage?
I'll try to build manually again without the patch to compare.

#49 Updated by Norman Uittenbogaart almost 2 years ago

Dennis Busch wrote:

Trying to create a ceph-mon. Seems from experience of other users that this issue might only be valid on Opteron CPUs. While creation I get the following:

[...]

I get the same errors as for the monitor when creating OSDs on an old Xeon.
Is this related?

Preparing data directory
*** Caught signal (Illegal instruction) **
 in thread 7f37904e6e00 thread_name:ceph-osd
 ceph version 12.1.4 (913cc16a67d4a352a20bb5ce6dd6b8259eeeb5d5) luminous (rc)
 1: (()+0xa05d64) [0x557d615b2d64]
 2: (()+0x110c0) [0x7f378dd010c0]
 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x557d61a79dd1]
 4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x26bc) [0x557d6195df6c]
 5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x11f) [0x557d6192538f]
 6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xe40) [0x557d61926e00]
 7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x698) [0x557d61928668]
 8: (RocksDBStore::_test_init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x52) [0x557d614f54d2]
 9: (FileStore::mkfs()+0x7b6) [0x557d613a9656]
 10: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x346) [0x557d60fecd06]
 11: (main()+0xe9b) [0x557d60f3cc4b]
 12: (__libc_start_main()+0xf1) [0x7f378ccb62b1]
 13: (_start()+0x2a) [0x557d60fc893a]
2017-08-18 00:29:47.767954 7f37904e6e00 -1 *** Caught signal (Illegal instruction) **
 in thread 7f37904e6e00 thread_name:ceph-osd

 ceph version 12.1.4 (913cc16a67d4a352a20bb5ce6dd6b8259eeeb5d5) luminous (rc)
 1: (()+0xa05d64) [0x557d615b2d64]
 2: (()+0x110c0) [0x7f378dd010c0]
 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x557d61a79dd1]
 4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x26bc) [0x557d6195df6c]
 5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x11f) [0x557d6192538f]
 6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xe40) [0x557d61926e00]
 7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x698) [0x557d61928668]
 8: (RocksDBStore::_test_init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x52) [0x557d614f54d2]
 9: (FileStore::mkfs()+0x7b6) [0x557d613a9656]
 10: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x346) [0x557d60fecd06]
 11: (main()+0xe9b) [0x557d60f3cc4b]
 12: (__libc_start_main()+0xf1) [0x7f378ccb62b1]
 13: (_start()+0x2a) [0x557d60fc893a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

     0> 2017-08-18 00:29:47.767954 7f37904e6e00 -1 *** Caught signal (Illegal instruction) **
 in thread 7f37904e6e00 thread_name:ceph-osd

 ceph version 12.1.4 (913cc16a67d4a352a20bb5ce6dd6b8259eeeb5d5) luminous (rc)
 1: (()+0xa05d64) [0x557d615b2d64]
 2: (()+0x110c0) [0x7f378dd010c0]
 3: (rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) [0x557d61a79dd1]
 4: (rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool)+0x26bc) [0x557d6195df6c]
 5: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x11f) [0x557d6192538f]
 6: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0xe40) [0x557d61926e00]
 7: (rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**)+0x698) [0x557d61928668]
 8: (RocksDBStore::_test_init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x52) [0x557d614f54d2]
 9: (FileStore::mkfs()+0x7b6) [0x557d613a9656]
 10: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x346) [0x557d60fecd06]
 11: (main()+0xe9b) [0x557d60f3cc4b]
 12: (__libc_start_main()+0xf1) [0x7f378ccb62b1]
 13: (_start()+0x2a) [0x557d60fc893a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#50 Updated by alexandre derumier almost 2 years ago

alexandre derumier wrote:

Hi,

I have tested with a manual build instead of dpkg-buildpackage:

git clone ...
apply patch on rocksdb
mkdir build
cd build/
cmake .. -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" -DCMAKE_BUILD_TYPE="Debug"
make
make install

and it's working: no more rocksdb error.

I'll double check tomorrow. Maybe something doesn't work with dpkg-buildpackage?
I'll try to build manually again without the patch to compare.

I have done a manual rebuild (same workflow) without the patch, and it's working too ....
I'll test with dpkg-buildpackage again, trying different options in the debian rules file to compare.

#51 Updated by Daniel Oliveira almost 2 years ago

Alexandre,

Thanks for the info. So, quick question just so we can understand some things here. From the instructions I sent to you:
<instructions>
1. When you are at your Ceph project directory, and you are about to build Ceph, we usually build it to a build directory like: '/ceph/build/' , which is where we run our cmake command. From this ./ceph/build/ directory we will run: cmake .. -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" -DCMAKE_BUILD_TYPE="Debug"

This will disable any compiler optimizations and create a debug build, with all the symbols enabled. Then, you run your usual: make -jX (where X is the number of cores on the machine running the build).

Then, we want to rerun the test, which the crash stack will have way more info.

Finally, after that...
2. Build 'rocksdb' with '-msse4.2'
Just edit your 'ceph/src/rocksdb/CMakeLists.txt' and your line #124, should have something like: set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -msse4.2")
Let's change that to '-msse3' instead, then rebuild Ceph, and re-run the tests. We just want to confirm it will only happen if/when we build it with '-msse4.2'. As you already have all your CMakeLists.txt created with debug info from step #1, just go back to './ceph/build/' and run your: make -jX again and rerun the test.

</instructions>

Are you saying it doesn't crash at all, after building it with '-msse3'? and then still crashes with '-msse4.2' as per #2? Or what exactly? If it still crashes, do we have a stack now, with some more info, so we can check on a few more things?

My understanding is that you had tried also to rebuild it before, and the crash still happened.

Please, let us know.

#52 Updated by alexandre derumier almost 2 years ago

Daniel Oliveira wrote:

Alexandre,

Thanks for the info. So, quick question just so we can understand some things here. From the instructions I sent to you:
<instructions>
1. When you are at your Ceph project directory, and you are about to build Ceph, we usually build it to a build directory like: '/ceph/build/' , which is where we run our cmake command. From this ./ceph/build/ directory we will run: cmake .. -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" -DCMAKE_BUILD_TYPE="Debug"

This will disable any compiler optimizations and create a debug build, with all the symbols enabled. Then, you run your usual: make -jX (where X is the number of cores on the machine running the build).

Then, we want to rerun the test, which the crash stack will have way more info.

Finally, after that...
2. Build 'rocksdb' with '-msse4.2'
Just edit your 'ceph/src/rocksdb/CMakeLists.txt' and your line #124, should have something like: set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -msse4.2")
Let's change that to '-msse3' instead, then rebuild Ceph, and re-run the tests. We just want to confirm it will only happen if/when we build it with '-msse4.2'. As you already have all your CMakeLists.txt created with debug info from step #1, just go back to './ceph/build/' and run your: make -jX again and rerun the test.

</instructions>

Are you saying it doesn't crash at all, after building it with '-msse3'? and then still crashes with '-msse4.2' as per #2? Or what exactly? If it still crashes, do we have a stack now, with some more info, so we can check on a few more things?

My understanding is that you had tried also to rebuild it before, and the crash still happened.

Please, let us know.

What I said is that I can't reproduce the rocksdb bug with step 1 (cmake .. -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" -DCMAKE_BUILD_TYPE="Debug"), with or without your patch.
I'm currently doing more tests, but build times are quite long on this dual-core Xeon. (I can't do more than 1 or 2 builds per day ...)

#53 Updated by Daniel Oliveira almost 2 years ago

Alexandre,

I wonder if that is due to the fact that the build is being compiled on a machine that does not support 'sse4.2' to begin with. Then the makefiles and #ifdefs in the code would not take effect, giving you a build with those compiler intrinsics ruled out. If that is the case, we need to build it on a

We would need to have it built for whatever distros we are facing the problem with, on a machine/CPU that supports 'sse4.2' (which I think and hope is your case), so we can get the build/packages distributed to a machine that does not support it, and re-test it.
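
To make the compile-time side concrete, here is a minimal sketch, assuming GCC or Clang, which predefine `__SSE4_2__` only when the compiler has been told the target supports SSE4.2 (for example via `-msse4.2`). The function name `compiled_crc_backend` is hypothetical and not part of RocksDB; it only reports which branch of the #ifdef was compiled in.

```cpp
#include <string>

// Reports which code path this translation unit was compiled for.
// The decision is made by the preprocessor at build time, so it reflects
// the compiler flags used for the build, not the CPU the binary later runs on.
std::string compiled_crc_backend() {
#ifdef __SSE4_2__
  return "sse4.2";    // SSE4.2 intrinsics would be compiled in
#else
  return "portable";  // intrinsics ruled out, software fallback only
#endif
}
```

A binary that reports "sse4.2" here can still be copied to a CPU without SSE4.2, which is the situation that would trigger the illegal instruction.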

Thanks,

#54 Updated by Daniel Oliveira almost 2 years ago

It seems some of us here are either not facing the problem in question anymore or have already upgraded to hardware where '-msse4.2' is fully supported. As we discussed before, we will probably want to build with '-msse3' by default and enable 'sse4.2' dynamically at runtime in the code where we can really take advantage of it. How many of us here are still facing the problem and would be able to give us remote access? Please ping me so we can talk.

Thanks,

#55 Updated by alexandre derumier almost 2 years ago

Daniel Oliveira wrote:

Alexandre,

I wonder if that is due to the fact that the build is being compiled on a machine that does not support 'sse4.2' to begin with. Then the makefiles and #ifdefs in the code would not take effect, giving you a build with those compiler intrinsics ruled out. If that is the case, we need to build it on a

We would need to have it built for whatever distros we are facing the problem with, on a machine/CPU that supports 'sse4.2' (which I think and hope is your case), so we can get the build/packages distributed to a machine that does not support it, and re-test it.

Thanks,

I have tried building on a Xeon v3 and then launching the binary on my old Xeon, with different build options:

-DCMAKE_C_FLAGS="-O0" : fail
-DCMAKE_C_FLAGS="-g3" : fail
-DCMAKE_C_FLAGS="-gdwarf-4" : fail
-DCMAKE_BUILD_TYPE="Debug" : ok

So, I looked for debug in
src/rocksdb/CMakeLists.txt

if(NOT CMAKE_BUILD_TYPE STREQUAL "Debug")
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2 -fno-omit-frame-pointer")
  include(CheckCXXCompilerFlag)
  CHECK_CXX_COMPILER_FLAG("-momit-leaf-frame-pointer" HAVE_OMIT_LEAF_FRAME_POINTER)
  if(HAVE_OMIT_LEAF_FRAME_POINTER)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -momit-leaf-frame-pointer")
  endif()
endif()

If I comment out this code and use
-DCMAKE_BUILD_TYPE="Debug" --> fail

I'll do more tests (with -msse3 this time) and keep you posted.

#56 Updated by alexandre derumier almost 2 years ago

Also, looking at the rocksdb checkout after a recursive git pull of the ceph repository,

we are at

commit e15382c09c87a65eaeca9bda233bab503f1e5772
Author: Siying Dong <>
Date: Wed Apr 26 17:12:28 2017 -0700

I'm seeing some newer commits in the rocksdb git, like

https://github.com/facebook/rocksdb/commit/11c5d4741a1e11a1315d5ca644ce555e07e91f61
"cross-platform compatibility improvements"

I'll do tests

#57 Updated by alexandre derumier almost 2 years ago

alexandre derumier wrote:

Also, looking at the rocksdb checkout after a recursive git pull of the ceph repository,

we are at

commit e15382c09c87a65eaeca9bda233bab503f1e5772
Author: Siying Dong <>
Date: Wed Apr 26 17:12:28 2017 -0700

I'm seeing some newer commits in the rocksdb git, like

https://github.com/facebook/rocksdb/commit/11c5d4741a1e11a1315d5ca644ce555e07e91f61
"cross-platform compatibility improvements"

I'll do tests

also this commit

"compile with correct flags to determine SSE4.2 support"
https://github.com/facebook/rocksdb/commit/c5f0c6cc660f1f4a8051db2aac3b8afc17818e70

https://github.com/facebook/rocksdb/issues/2488

#58 Updated by alexandre derumier almost 2 years ago

CockroachDB had the same problem with fastcrc and sse4.2.

They fixed it with this commit:

https://github.com/benesch/cockroach/commit/8054fd23cd2ca1a15766e85335d6037272a1e47e

#59 Updated by alexandre derumier almost 2 years ago

Ok,

I'm now able to build with default options on my Xeon E5 and have the result work on the old Xeon,

with this change in ceph git:

--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -813,11 +813,7 @@ if (NOT WITH_SYSTEM_ROCKSDB)

   # We really want to have the CRC32 calculation in RocksDB accelerated
   # with SSE 4.2. For details refer to rocksdb/util/crc32c.cc.
-  if (HAVE_INTEL_SSE4_2)
-    list(APPEND ROCKSDB_CMAKE_ARGS -DCMAKE_CXX_FLAGS=${SIMD_COMPILE_FLAGS})
-  else()
-    list(APPEND ROCKSDB_CMAKE_ARGS -DWITH_SSE42=OFF)
-  endif()
+  list(APPEND ROCKSDB_CMAKE_ARGS -DWITH_SSE42=OFF)

(I think -msse4.1 was in SIMD_COMPILE_FLAGS too)

rocksdb git
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -126,9 +126,8 @@ if(WIN32)
 endif()
 else()
 option(WITH_SSE42 "build with SSE4.2" ON)
 if(WITH_SSE42)
- set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -msse4.2")
 endif()
 set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -msse3")

Now, for the Xeon E5, we need to get sse4.2 working at runtime.
I don't know how to verify your patch on crc32c.cc.

Also, the cockroach devs take the same approach, with a hack in crc32c.cc:
https://github.com/benesch/cockroach/commit/47ccc407fbaf61d9c95d36112b9af38625e90914
https://github.com/cockroachdb/rocksdb/compare/0403a97f12d46b05bc48fe99a7b432eca4832bd6...87c3949e360f5ce178e8074cfb90795c8c229be0#diff-f54be34b9b64b37988070860edd87aaf

#60 Updated by alexandre derumier almost 2 years ago

I have built Debian packages with the cockroach rocksdb patch +
https://github.com/cockroachdb/rocksdb/commit/87c3949e360f5ce178e8074cfb90795c8c229be0.patch


-  if (HAVE_INTEL_SSE4_2)
-    list(APPEND ROCKSDB_CMAKE_ARGS -DCMAKE_CXX_FLAGS=${SIMD_COMPILE_FLAGS})
-  else()
-    list(APPEND ROCKSDB_CMAKE_ARGS -DWITH_SSE42=OFF)
-  endif()
+  list(APPEND ROCKSDB_CMAKE_ARGS -DWITH_SSE42=OFF)

http://odisoweb1.odiso.net/ceph12.1.4-runtimesse4/

It would be great to have some feedback on AMD and old Intel processors, but also on newer CPUs, to see whether any performance regression occurs.

#61 Updated by Norman Uittenbogaart almost 2 years ago

alexandre derumier wrote:

I have built Debian packages with the cockroach rocksdb patch +
https://github.com/cockroachdb/rocksdb/commit/87c3949e360f5ce178e8074cfb90795c8c229be0.patch

[...]

http://odisoweb1.odiso.net/ceph12.1.4-runtimesse4/

It would be great to have some feedback on AMD and old Intel processors, but also on newer CPUs, to see whether any performance regression occurs.

I can confirm these packages work!

#62 Updated by Daniel Oliveira almost 2 years ago

Interesting, but if I am not mistaken the CMakeLists.txt change seems to be doing what we mentioned before: building rocksdb without '-msse4.2' by default.
However, we will probably still want to use 'sse4.2' when possible, and the last PR (https://github.com/ceph/rocksdb/pull/21) would check for it dynamically at runtime as opposed to at build/compile time. That's the combined approach we will probably want: a) CMakeLists.txt changes (like one of those mentioned above) and b) Check on 'sse4.2' dynamically.

@Sage, pls would you have any other ideas/comments on this one?

#63 Updated by alexandre derumier almost 2 years ago

Daniel Oliveira wrote:

a) CMakeLists.txt changes (like one of those mentioned above)

Note that changing ceph/src/rocksdb/CMakeLists.txt as described in

"2. Build 'rocksdb' with '-msse4.2'
Just edit your 'ceph/src/rocksdb/CMakeLists.txt' and your line #124, should have something like: set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -msse4.2")
"

does not work, because of this line in
ceph/src/CMakeLists.txt:
list(APPEND ROCKSDB_CMAKE_ARGS -DCMAKE_CXX_FLAGS=${SIMD_COMPILE_FLAGS})

(which also passes the sse4.2 flag, along with the others such as sse2 and sse3, because it was detected earlier).

Just removing this fixes the error on processors that do not support sse4.2.

b) Check on 'sse4.2' dynamically.

Note that with your rocksdb patch, I get a build error when I try to build the Debian package.
(The cockroachdb patch seems to be smaller/cleaner.)
I think it's working fine (perf record shows me the fast path for crc), but I don't currently have the infrastructure to run benchmarks.

#64 Updated by Daniel Oliveira almost 2 years ago

Alexandre,

Thanks for the update. Yes, the http://tracker.ceph.com/issues/20529#note-31 we had first sent in also had a fix really close to the one you are referring to. I guess the only difference was the fact the build did not have the CMakeLists.txt changes, which will make a difference. Once we get those all set (proposed changes in the code [regardless if http://tracker.ceph.com/issues/20529#note-31 or http://tracker.ceph.com/issues/20529#note-42] and CMakeLists.txt), we should be able to rerun the tests and see how it goes.

#65 Updated by Daniel Oliveira almost 2 years ago

We have now a combination of the efforts mentioned in http://tracker.ceph.com/issues/20529#note-64, in the commit: https://github.com/oliveiradan/rocksdb/commit/ab709cbc6c3a5d956ca0dc5b9faf63f4adfb12ce where we can check for 'sse4.2' not only during compile time, but also during runtime, should we need it.

The changes needed in the src/CMakeLists.txt are in the commit: https://github.com/oliveiradan/ceph/commit/88b0c9048328e265e3f3852140e11aaed0929e9c

I was able to build and run it in my lab, but as I cannot reproduce the issue here, we would need to build and test it out there and a new PR will be created when we are ok with these changes.

The old PRs related to this issue (https://github.com/ceph/rocksdb/pull/20 and https://github.com/ceph/rocksdb/pull/21) were closed.

The new PRs related to this issue are:
https://github.com/ceph/rocksdb/pull/23
https://github.com/ceph/ceph/pull/17347

#67 Updated by Daniel Oliveira almost 2 years ago

tchaikov commented 4 days ago
please note, upstream has merged facebook#2807, and the change on ceph side is posted at ceph/ceph#17388.
https://github.com/ceph/rocksdb/pull/23#issuecomment-327976144

#68 Updated by Kefu Chai almost 2 years ago

#70 Updated by Kefu Chai almost 2 years ago

  • Category changed from Monitor to build
  • Status changed from Verified to Pending Backport
  • Assignee deleted (Daniel Oliveira)
  • Target version deleted (v12.1.0)

#71 Updated by Nathan Cutler almost 2 years ago

  • Status changed from Pending Backport to Resolved
