Bug #59196

ceph_test_lazy_omap_stats segfault while waiting for active+clean

Added by Brad Hubbard 11 months ago. Updated 4 days ago.

Status:
Fix Under Review
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2023-03-11T08:23:47.545 DEBUG:teuthology.orchestra.run.smithi005:> sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph_test_lazy_omap_stats
2023-03-11T08:23:48.487 INFO:teuthology.orchestra.run.smithi005.stdout:pool 'lazy_omap_test_pool' created
2023-03-11T08:23:48.489 INFO:teuthology.orchestra.run.smithi005.stdout:Querying pool id
2023-03-11T08:23:48.492 INFO:teuthology.orchestra.run.smithi005.stdout:Found pool ID: 2
2023-03-11T08:23:48.496 INFO:teuthology.orchestra.run.smithi005.stdout:Created payload with 2000 keys of 445 bytes each. Total size in bytes = 890000
2023-03-11T08:23:48.496 INFO:teuthology.orchestra.run.smithi005.stdout:Waiting for active+clean
2023-03-11T08:23:48.513 DEBUG:teuthology.orchestra.run:got remote process result: None
2023-03-11T08:23:48.513 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/run_tasks.py", line 103, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/run_tasks.py", line 82, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/task/exec.py", line 66, in task
    remote.run(
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/orchestra/remote.py", line 525, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/orchestra/run.py", line 179, in _raise_for_status
    raise CommandCrashedError(command=self.command)
teuthology.exceptions.CommandCrashedError: Command crashed: 'sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph_test_lazy_omap_stats'
2023-03-11T08:23:48.594 ERROR:teuthology.run_tasks: Sentry event: https://sentry.ceph.com/organizations/ceph/?query=bc132455da90423caddad14e0a097e30

Found this in the system kernel and journalctl logs:

2023-03-11T08:23:48.510950+00:00 smithi005 kernel: [  641.041577] ceph_test_lazy_[34021]: segfault at 7ffda9628fd8 ip 0000563670ee5b19 sp 00007ffda9628fe0 error 6 in ceph_test_lazy_omap_stats[563670ec7000+21000]
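
For orientation, ceph_test_lazy_omap_stats is a small librados client. Below is a hedged sketch (assumed shape, not the test's actual code; the object name and key names are made up) of the payload step the log above reports: 2000 omap keys of 445 bytes each, 890000 bytes total. The crash itself happens after this step, while the test polls cluster state waiting for active+clean.

#include <rados/librados.hpp>
#include <map>
#include <string>

int main()
{
  librados::Rados cluster;
  cluster.init("admin");             // assumes a reachable test cluster
  cluster.conf_read_file(nullptr);   // default ceph.conf search path
  if (cluster.connect() != 0)
    return 1;

  librados::IoCtx ioctx;
  cluster.ioctx_create("lazy_omap_test_pool", ioctx);

  // 2000 keys of 445 bytes each, matching the totals printed in the log.
  std::map<std::string, librados::bufferlist> payload;
  for (int i = 0; i < 2000; ++i) {
    librados::bufferlist bl;
    bl.append(std::string(445, 'x'));
    payload["key_" + std::to_string(i)] = bl;
  }

  librados::ObjectWriteOperation op;
  op.omap_set(payload);
  int rc = ioctx.operate("lazy_omap_obj", &op);  // hypothetical object name
  cluster.shutdown();
  return rc;
}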

History

#1 Updated by Brad Hubbard 11 months ago

  • Tags set to test-failure

Note that this tracker was originally #59058, until I accidentally deleted it.

Below is a summary of the comments in that previous tracker.

Issue #59058 has been updated by Brad Hubbard.

Reproduced this. I suspect it only happens on Jammy, since it has been seen
only once, and on that distro, which we have only recently started testing with.

It looks like a stack overflow due to unbounded recursion in std::regex code,
which has precedent. I may be able to get around it by massaging the regular
expression being used; we'll see after some more testing.
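
For background, libstdc++'s std::regex engine matches by backtracking recursion, so its stack depth can grow with the length of the input rather than staying bounded. A minimal sketch of how that can segfault instead of throwing (an illustrative pattern and input, not the ones the test actually uses):

#include <iostream>
#include <regex>
#include <string>

int main()
{
  // Illustration only: for patterns like this, libstdc++'s backtracking
  // executor recurses roughly once per input character, so a long enough
  // subject exhausts the stack and the process segfaults rather than
  // throwing std::regex_error.
  std::string subject(1000000, 'a');
  std::regex pattern("(a|b)*");
  std::smatch match;
  std::regex_search(subject, match, pattern);  // may overflow the stack here
  std::cout << "matched " << match.length(0) << " characters" << std::endl;
}

Lowering the stack limit (e.g. via ulimit -s) makes such a crash easier to reproduce; rewriting the expression so the executor does not recurse per character is the kind of massaging referred to above.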

Issue #59058 has been updated by Brad Hubbard.

We may be dealing with something similar to
https://tracker.ceph.com/issues/55304 here. I cannot reproduce this issue on
the latest Jammy container image, and if I upload the version of
ceph_test_lazy_omap_stats that I built and successfully ran in my local
container to the smithi machine failing the test, it runs without segfaulting,
whereas the version of ceph_test_lazy_omap_stats that was installed for the
test does segfault when run manually.

I see some differences in symbols when I compare the output of 'nm', which led
me to compare the versions of gcc the two binaries were compiled with.

root@smithi026:/home/ubuntu# strings ./ceph_test_lazy_omap_stats|grep "GCC: ("
GCC: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
root@smithi026:/home/ubuntu# strings /usr/lib/debug/.build-id/08/de203b7b3fa0b5080173750be0a7b2576335d9.debug|grep "GCC: ("
GCC: (Ubuntu 11.2.0-19ubuntu1) 11.2.0

So the binary that works was built with gcc 11.3.0 and the version that fails
with 11.2.0. The next step is to see whether I can set up a Jammy system with
11.2.0 installed and build ceph_test_lazy_omap_stats there, to check whether
that reproduces the issue.

Issue #59058 has been updated by Brad Hubbard.

Compiling with 11.2.0 failed to reproduce the issue, but in comparing the
failing binary to the succeeding one under the debugger I found what appears
to be the cause at a low level (the higher-level cause is still a mystery,
but it is most likely some sort of issue in the build environment, or a
build vs. runtime environment mismatch).

The last of our code before entering the std:: code is these two lines:

https://github.com/ceph/ceph/blob/68658d6e3ef7eae3b0a224f1a6059fd605da5f75/src/test/lazy-omap-stats/lazy_omap_stats_test.cc#L581-L582

If I place a breakpoint on the last of those lines, I get the following discrepancy.

Success case:

(gdb) whatis match
type = std::__cxx11::smatch
(gdb) p match
$1 = {<std::vector<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >> = std::vector of length 0, capacity 0, _M_begin = non-dereferenceable iterator for std::vector}

Fail case:

(gdb) whatis match
type = std::__cxx11::smatch
(gdb) p match
$1 = {<std::vector<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >> = std::vector of length 172889574, capacity -1954508256483 = {
<error reading variable: Cannot access memory at address 0x7fff00000010>

So in the failure case there appears to be an issue with the just-initialised
smatch variable. Continuing to look at this.
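
For reference, a hedged reconstruction of the shape of those two lines (assumed, not copied from the linked source; variable names are placeholders):

std::smatch match;                          // default-constructed: empty, length 0, capacity 0
std::regex_search(output, match, pattern);  // the breakpoint in the sessions above sits here

A freshly constructed smatch should always look like the success case (an empty vector), so garbage length/capacity values at the breakpoint mean either the object's stack storage is clobbered at or before construction, or the debug info in use does not match the binary's actual layout; both readings fit the build-environment theory above.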

Issue #59058 has been updated by Laura Flores.

Tags set to test-failure

/a/lflores-2023-03-27_02:17:31-rados-wip-aclamk-bs-elastic-shared-blob-save-25.03.2023-a-distro-default-smithi/7220933
/a/lflores-2023-03-27_02:17:31-rados-wip-aclamk-bs-elastic-shared-blob-save-25.03.2023-a-distro-default-smithi/7221086

Issue #59058 has been updated by Laura Flores.

/a/yuriw-2023-03-27_23:05:54-rados-wip-yuri4-testing-2023-03-25-0714-distro-default-smithi/7222036

Issue #59058 has been updated by Laura Flores.

This is happening quite frequently in the rados suite. It certainly points to a recent regression.

#2 Updated by Radoslaw Zarzynski 11 months ago

  • Status changed from New to In Progress

Looks like the problem is under investigation. Please correct me if I'm wrong.

#3 Updated by Laura Flores 11 months ago

Yes Radek, it is being investigated by Brad.

/a/yuriw-2023-03-27_23:05:54-rados-wip-yuri4-testing-2023-03-25-0714-distro-default-smithi/7222036

#4 Updated by Laura Flores 11 months ago

/a/yuriw-2023-03-30_21:53:20-rados-wip-yuri7-testing-2023-03-29-1100-distro-default-smithi/7227904

#5 Updated by Laura Flores 11 months ago

/a/yuriw-2023-04-04_15:24:40-rados-wip-yuri4-testing-2023-03-31-1237-distro-default-smithi/7231452

#6 Updated by Laura Flores 11 months ago

/a/yuriw-2023-03-30_21:29:24-rados-wip-yuri2-testing-2023-03-30-0826-distro-default-smithi/7227539

#7 Updated by Laura Flores 11 months ago

/a/lflores-2023-04-07_22:22:04-rados-wip-yuri4-testing-2023-04-07-1825-distro-default-smithi/7235344

#8 Updated by Sridhar Seshasayee 10 months ago

/a/sseshasa-2023-05-02_03:12:27-rados-wip-sseshasa3-testing-2023-05-01-2154-distro-default-smithi/7260300

journalctl-b0.gz:May 02 04:26:30 smithi175 sudo[34509]:   ubuntu : PWD=/home/ubuntu ; USER=root ; ENV=TESTDIR=/home/ubuntu/cephtest ; COMMAND=/usr/bin/bash -c ceph_test_lazy_omap_stats
journalctl-b0.gz:May 02 04:26:31 smithi175 kernel: ceph_test_lazy_[34510]: segfault at 7ffeedae1ff8 ip 00005567a18b6549 sp 00007ffeedae1f30 error 6 in ceph_test_lazy_omap_stats[5567a1898000+21000]

kern.log.gz:2023-05-02T04:26:31.788640+00:00 smithi175 kernel: [  640.526218] ceph_test_lazy_[34510]: segfault at 7ffeedae1ff8 ip 00005567a18b6549 sp 00007ffeedae1f30 error 6 in ceph_test_lazy_omap_stats[5567a1898000+21000]

#9 Updated by Radoslaw Zarzynski 10 months ago

Let's check whether this reproduces in Reef too. If so, then... there is no OMAP without RocksDB and we upgraded it recently...

#10 Updated by Brad Hubbard 10 months ago

Radoslaw Zarzynski wrote:

Let's check whether this reproduces in Reef too. If so, then... there is no OMAP without RocksDB and we upgraded it recently...

Hey Radek,

To me it's more significant that every instance above was seen on VERSION="22.04.1 LTS (Jammy Jellyfish)", and I think this has something to do with the way we are building for Jammy. I think we are somehow exposing some sort of library mismatch, or something similar. I need to try to reproduce the build environment to test this theory, which I may need some help with.

#11 Updated by Laura Flores 9 months ago

/a/yuriw-2023-05-11_15:01:38-rados-wip-yuri8-testing-2023-05-10-1402-distro-default-smithi/7271184

So far, no Reef sightings.

#12 Updated by Brad Hubbard 9 months ago

Laura Flores wrote:

/a/yuriw-2023-05-11_15:01:38-rados-wip-yuri8-testing-2023-05-10-1402-distro-default-smithi/7271184

So far, no Reef sightings.

And Jammy yet again.

#13 Updated by Laura Flores 9 months ago

  • Backport set to reef

/a/lflores-2023-05-22_16:08:13-rados-wip-yuri6-testing-2023-05-19-1351-reef-distro-default-smithi/7282703

This was already in Reef as far back as March 11 (/a/yuriw-2023-03-10_22:46:37-rados-reef-distro-default-smithi/7203287), so this test batch is not what introduced the bug to Reef.

#14 Updated by Brad Hubbard 9 months ago

Still specific to Jammy.

#15 Updated by Laura Flores 9 months ago

/a/yuriw-2023-05-24_14:33:21-rados-wip-yuri6-testing-2023-05-23-0757-reef-distro-default-smithi/7285192

#16 Updated by Radoslaw Zarzynski 9 months ago

The RocksDB upgrade PR was merged on March 1st.

https://github.com/ceph/ceph/pull/49006

#17 Updated by Radoslaw Zarzynski 9 months ago

Brad, let's sync and talk about that in the DS meeting.

#18 Updated by Laura Flores 8 months ago

/a/yuriw-2023-06-22_20:29:56-rados-wip-yuri3-testing-2023-06-22-0812-reef-distro-default-smithi/7313235

#19 Updated by Matan Breizman 6 months ago

/a/yuriw-2023-08-22_18:16:03-rados-wip-yuri10-testing-2023-08-17-1444-distro-default-smithi/7376687

#20 Updated by Radoslaw Zarzynski 6 months ago

This time it's CentOS!

rzarzynski@teuthology:/a/yuriw-2023-08-22_18:16:03-rados-wip-yuri10-testing-2023-08-17-1444-distro-default-smithi/7376687$ less teuthology.log
...
2023-08-22T21:37:19.930 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=centos%2F9%2Fx86_64&ref=wip-yuri10-testing-2023-08-17-1444
2023-08-22T21:37:20.152 INFO:teuthology.task.internal:Found packages for ceph version 18.0.0-5573.gf0ed7046

#21 Updated by Brad Hubbard 6 months ago

Taking a fresh look at this, thanks Radek.

#22 Updated by Laura Flores 6 months ago

/a/yuriw-2023-08-15_18:58:56-rados-wip-yuri3-testing-2023-08-15-0955-distro-default-smithi/7369175

#23 Updated by Matan Breizman 4 months ago

/a/yuriw-2023-10-11_14:08:36-rados-wip-yuri11-testing-2023-10-10-1226-reef-distro-default-smithi/7421542/
/a/yuriw-2023-10-11_14:08:36-rados-wip-yuri11-testing-2023-10-10-1226-reef-distro-default-smithi/7421695/

#24 Updated by Nitzan Mordechai 4 months ago

/a/yuriw-2023-10-16_14:44:27-rados-wip-yuri10-testing-2023-10-11-0812-distro-default-smithi/7429668
/a/yuriw-2023-10-16_14:44:27-rados-wip-yuri10-testing-2023-10-11-0812-distro-default-smithi/7429845
/a/yuriw-2023-10-16_14:44:27-rados-wip-yuri10-testing-2023-10-11-0812-distro-default-smithi/7429846

#25 Updated by Laura Flores 4 months ago

/a/yuriw-2023-10-24_00:11:54-rados-wip-yuri4-testing-2023-10-23-0903-distro-default-smithi/7435549

#26 Updated by Nitzan Mordechai 4 months ago

/a/yuriw-2023-10-30_15:34:36-rados-wip-yuri10-testing-2023-10-27-0804-distro-default-smithi/7441096
/a/yuriw-2023-10-30_15:34:36-rados-wip-yuri10-testing-2023-10-27-0804-distro-default-smithi/7441250

#27 Updated by Laura Flores 4 months ago

/a/yuriw-2023-10-31_14:43:48-rados-wip-yuri4-testing-2023-10-30-1117-distro-default-smithi/7442155

#28 Updated by Laura Flores 4 months ago

/a/yuriw-2023-11-02_14:20:05-rados-wip-yuri6-testing-2023-11-01-0745-reef-distro-default-smithi/7444597

#29 Updated by Radoslaw Zarzynski 4 months ago

bump up.

#30 Updated by Laura Flores 4 months ago

/a/yuriw-2023-11-05_15:32:58-rados-reef-release-distro-default-smithi/7448518

#31 Updated by Radoslaw Zarzynski 3 months ago

Brad is taking a look.

#32 Updated by Nitzan Mordechai 3 months ago

/a/yuriw-2023-12-07_16:37:24-rados-wip-yuri8-testing-2023-12-06-1425-distro-default-smithi/7482188
/a/yuriw-2023-12-07_16:37:24-rados-wip-yuri8-testing-2023-12-06-1425-distro-default-smithi/7482168

#33 Updated by Matan Breizman 2 months ago

/a/yuriw-2023-12-26_16:10:01-rados-wip-yuri3-testing-2023-12-19-1211-distro-default-smithi/7501415

#34 Updated by Aishwarya Mathuria about 2 months ago

/a/yuriw-2024-01-03_16:19:00-rados-wip-yuri6-testing-2024-01-02-0832-distro-default-smithi/7505560/
/a/yuriw-2024-01-03_16:19:00-rados-wip-yuri6-testing-2024-01-02-0832-distro-default-smithi/7505716/

#35 Updated by Nitzan Mordechai about 1 month ago

/a/yuriw-2024-01-18_15:10:37-rados-wip-yuri3-testing-2024-01-17-0753-distro-default-smithi/7520620
/a/yuriw-2024-01-18_15:10:37-rados-wip-yuri3-testing-2024-01-17-0753-distro-default-smithi/7520463

#36 Updated by Kamoltat (Junior) Sirivadhna 23 days ago

/a/yuriw-2024-01-31_19:20:14-rados-wip-yuri3-testing-2024-01-29-1434-distro-default-smithi/7540671

#37 Updated by Laura Flores 18 days ago

/a/yuriw-2024-02-05_19:32:33-rados-wip-yuri4-testing-2024-02-05-0849-distro-default-smithi/7547525

#38 Updated by Matan Breizman 14 days ago

/a/yuriw-2024-02-09_00:15:46-rados-wip-yuri2-testing-2024-02-08-0727-distro-default-smithi/7553332

/a/yuriw-2024-02-09_00:15:46-rados-wip-yuri2-testing-2024-02-08-0727-distro-default-smithi/7553494

#39 Updated by Radoslaw Zarzynski 13 days ago

  • Assignee changed from Brad Hubbard to Nitzan Mordechai

#40 Updated by Aishwarya Mathuria 11 days ago

/a/yuriw-2024-02-13_15:50:02-rados-wip-yuri2-testing-2024-02-12-0808-reef-distro-default-smithi/7558347/

#41 Updated by Laura Flores 11 days ago

/a/lflores-2024-02-13_16:18:32-rados-wip-yuri5-testing-2024-02-12-1152-distro-default-smithi/7558507

#42 Updated by Nitzan Mordechai 10 days ago

  • Pull request ID set to 55596

#43 Updated by Aishwarya Mathuria 5 days ago

/a/yuriw-2024-02-14_14:58:57-rados-wip-yuri4-testing-2024-02-13-1546-distro-default-smithi/7560007/

#44 Updated by Nitzan Mordechai 4 days ago

  • Status changed from In Progress to Fix Under Review
