Bug #59196

ceph_test_lazy_omap_stats segfault while waiting for active+clean

Added by Brad Hubbard about 1 year ago. Updated 1 day ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
squid,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2023-03-11T08:23:47.545 DEBUG:teuthology.orchestra.run.smithi005:> sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph_test_lazy_omap_stats
2023-03-11T08:23:48.487 INFO:teuthology.orchestra.run.smithi005.stdout:pool 'lazy_omap_test_pool' created
2023-03-11T08:23:48.489 INFO:teuthology.orchestra.run.smithi005.stdout:Querying pool id
2023-03-11T08:23:48.492 INFO:teuthology.orchestra.run.smithi005.stdout:Found pool ID: 2
2023-03-11T08:23:48.496 INFO:teuthology.orchestra.run.smithi005.stdout:Created payload with 2000 keys of 445 bytes each. Total size in bytes = 890000
2023-03-11T08:23:48.496 INFO:teuthology.orchestra.run.smithi005.stdout:Waiting for active+clean
2023-03-11T08:23:48.513 DEBUG:teuthology.orchestra.run:got remote process result: None
2023-03-11T08:23:48.513 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/run_tasks.py", line 103, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/run_tasks.py", line 82, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/task/exec.py", line 66, in task
    remote.run(
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/orchestra/remote.py", line 525, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/orchestra/run.py", line 179, in _raise_for_status
    raise CommandCrashedError(command=self.command)
teuthology.exceptions.CommandCrashedError: Command crashed: 'sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph_test_lazy_omap_stats'
2023-03-11T08:23:48.594 ERROR:teuthology.run_tasks: Sentry event: https://sentry.ceph.com/organizations/ceph/?query=bc132455da90423caddad14e0a097e30

Found this in system kernel and journalctl logs.

2023-03-11T08:23:48.510950+00:00 smithi005 kernel: [  641.041577] ceph_test_lazy_[34021]: segfault at 7ffda9628fd8 ip 0000563670ee5b19 sp 00007ffda9628fe0 error 6 in ceph_test_lazy_omap_stats[563670ec7000+21000]
Actions #1

Updated by Brad Hubbard about 1 year ago

  • Tags set to test-failure

Note that this tracker was originally #59058 until I accidentally deleted it.

Below is a summary of the comments in that previous tracker.

Issue #59058 has been updated by Brad Hubbard.

Reproduced this, and I suspect it only happens on Jammy, as it has only been
seen once, on that distro, which we have only recently started testing with.

It looks like a stack overflow due to unbounded recursion in std::regex code,
which has precedent. I may be able to get around it by massaging the regular
expression being used; we'll see after some more testing.
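
For illustration only (this is not code from the test or the tracker), below is a
minimal standalone sketch of that failure mode. It assumes a PG_STAT/OSD_STAT
stripping pattern of the same shape as the one the test uses (shown later in
Actions #56); whether it actually overflows the stack depends on the libstdc++
build and the stack size limit.

#include <iostream>
#include <regex>
#include <string>

int main()
{
  // Build a large haystack resembling 'ceph pg dump' output.
  std::string haystack = "version 29\nstamp 2024-03-10T00:37:25\n";
  haystack += "PG_STAT  OBJECTS  BYTES  STATE\n";
  for (int i = 0; i < 200000; ++i) {
    haystack += "2.1f  0  0  active+clean\n";
  }
  haystack += "OSD_STAT  USED  AVAIL\n";

  // libstdc++'s backtracking executor recurses roughly once per character
  // consumed by [\s\S]*, so a sufficiently large haystack can exhaust the
  // default stack and segfault instead of throwing an exception.
  std::regex reg("\n((PG_STAT[\\s\\S]*)\n)OSD_STAT");
  std::smatch match;
  if (std::regex_search(haystack, match, reg)) {
    std::cout << "stripped table is " << match[1].length() << " bytes\n";
  }
  return 0;
}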

Issue #59058 has been updated by Brad Hubbard.

We may be dealing with something similar to
https://tracker.ceph.com/issues/55304 here. I cannot reproduce this issue on
the latest Jammy container image, and if I upload the version of
ceph_test_lazy_omap_stats that I built and successfully ran in my local
container to the smithi machine failing the test, it runs without segfaulting,
whereas the version of ceph_test_lazy_omap_stats that was installed for the
test does segfault when run manually.

I see some differences in the symbols when I compare the output of 'nm', and
that led me to compare the versions of gcc the two binaries were compiled with.

root@smithi026:/home/ubuntu# strings ./ceph_test_lazy_omap_stats|grep "GCC: ("
GCC: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
root@smithi026:/home/ubuntu# strings /usr/lib/debug/.build-id/08/de203b7b3fa0b5080173750be0a7b2576335d9.debug|grep "GCC: ("
GCC: (Ubuntu 11.2.0-19ubuntu1) 11.2.0

So the binary that works was built with gcc 11.3.0 and the version that fails
with 11.2.0. The next step is to see if I can set up a Jammy system with 11.2.0
installed and build ceph_test_lazy_omap_stats there, to see whether that
reproduces the issue.

Issue #59058 has been updated by Brad Hubbard.

Compiling with 11.2.0 failed to reproduce the issue, but by comparing the
failing binary to the succeeding one under the debugger I found what appears
to be the cause of the issue at a low level (the higher-level cause is still a
mystery, but it is most likely some sort of issue in the build environment, or
a build vs. runtime environment mismatch).

The last of our code before we enter the std:: code is these two lines:

https://github.com/ceph/ceph/blob/68658d6e3ef7eae3b0a224f1a6059fd605da5f75/src/test/lazy-omap-stats/lazy_omap_stats_test.cc#L581-L582

If I place a breakpoint on the last line I get the following discrepancy.

Success case:

(gdb) whatis match
type = std::__cxx11::smatch
(gdb) p match
$1 = {<std::vector<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >> = std::vector of length 0, capacity 0, _M_begin = non-dereferenceable iterator for std::vector}

Fail case:

(gdb) whatis match
type = std::__cxx11::smatch
(gdb) p match
$1 = {<std::vector<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >> = std::vector of length 172889574, capacity -1954508256483 = {
<error reading variable: Cannot access memory at address 0x7fff00000010>

So in the failure case there appears to be an issue with the just-initialised
smatch variable. Continuing to look at this.

Issue #59058 has been updated by Laura Flores.

Tags set to test-failure

/a/lflores-2023-03-27_02:17:31-rados-wip-aclamk-bs-elastic-shared-blob-save-25.03.2023-a-distro-default-smithi/7220933
/a/lflores-2023-03-27_02:17:31-rados-wip-aclamk-bs-elastic-shared-blob-save-25.03.2023-a-distro-default-smithi/7221086

Issue #59058 has been updated by Laura Flores.

/a/yuriw-2023-03-27_23:05:54-rados-wip-yuri4-testing-2023-03-25-0714-distro-default-smithi/7222036

Issue #59058 has been updated by Laura Flores.

This is happening quite frequently in the rados suite. It certainly points to a recent regression.

Actions #2

Updated by Radoslaw Zarzynski about 1 year ago

  • Status changed from New to In Progress

Looks like the problem is under investigation. Please correct me if I'm wrong.

Actions #3

Updated by Laura Flores about 1 year ago

Yes Radek, it is being investigated by Brad.

/a/yuriw-2023-03-27_23:05:54-rados-wip-yuri4-testing-2023-03-25-0714-distro-default-smithi/7222036

Actions #4

Updated by Laura Flores about 1 year ago

/a/yuriw-2023-03-30_21:53:20-rados-wip-yuri7-testing-2023-03-29-1100-distro-default-smithi/7227904

Actions #5

Updated by Laura Flores about 1 year ago

/a/yuriw-2023-04-04_15:24:40-rados-wip-yuri4-testing-2023-03-31-1237-distro-default-smithi/7231452

Actions #6

Updated by Laura Flores about 1 year ago

/a/yuriw-2023-03-30_21:29:24-rados-wip-yuri2-testing-2023-03-30-0826-distro-default-smithi/7227539

Actions #7

Updated by Laura Flores about 1 year ago

/a/lflores-2023-04-07_22:22:04-rados-wip-yuri4-testing-2023-04-07-1825-distro-default-smithi/7235344

Actions #8

Updated by Sridhar Seshasayee 12 months ago

/a/sseshasa-2023-05-02_03:12:27-rados-wip-sseshasa3-testing-2023-05-01-2154-distro-default-smithi/7260300

journalctl-b0.gz:May 02 04:26:30 smithi175 sudo[34509]:   ubuntu : PWD=/home/ubuntu ; USER=root ; ENV=TESTDIR=/home/ubuntu/cephtest ; COMMAND=/usr/bin/bash -c ceph_test_lazy_omap_stats
journalctl-b0.gz:May 02 04:26:31 smithi175 kernel: ceph_test_lazy_[34510]: segfault at 7ffeedae1ff8 ip 00005567a18b6549 sp 00007ffeedae1f30 error 6 in ceph_test_lazy_omap_stats[5567a1898000+21000]

kern.log.gz:2023-05-02T04:26:31.788640+00:00 smithi175 kernel: [  640.526218] ceph_test_lazy_[34510]: segfault at 7ffeedae1ff8 ip 00005567a18b6549 sp 00007ffeedae1f30 error 6 in ceph_test_lazy_omap_stats[5567a1898000+21000]
Actions #9

Updated by Radoslaw Zarzynski 12 months ago

Let's check whether this reproduces in Reef too. If so, then... there is no OMAP without RocksDB and we upgraded it recently...

Actions #10

Updated by Brad Hubbard 12 months ago

Radoslaw Zarzynski wrote:

Let's check whether this reproduces in Reef too. If so, then... there is no OMAP without RocksDB and we upgraded it recently...

Hey Radek,

To me it's more significant that every instance above was seen on VERSION="22.04.1 LTS (Jammy Jellyfish)", and I think this has something to do with the way we are building for Jammy. Somehow we seem to be exposing some sort of library mismatch, or something similar. I need to try to reproduce the build environment to test this theory, which I may need some help with.

Actions #11

Updated by Laura Flores 11 months ago

/a/yuriw-2023-05-11_15:01:38-rados-wip-yuri8-testing-2023-05-10-1402-distro-default-smithi/7271184

So far, no Reef sightings.

Actions #12

Updated by Brad Hubbard 11 months ago

Laura Flores wrote:

/a/yuriw-2023-05-11_15:01:38-rados-wip-yuri8-testing-2023-05-10-1402-distro-default-smithi/7271184

So far, no Reef sightings.

And Jammy yet again.

Actions #13

Updated by Laura Flores 11 months ago

  • Backport set to reef

/a/lflores-2023-05-22_16:08:13-rados-wip-yuri6-testing-2023-05-19-1351-reef-distro-default-smithi/7282703

This was already in Reef as far back as March 11 (/a/yuriw-2023-03-10_22:46:37-rados-reef-distro-default-smithi/7203287), so this test batch is not introducing the bug to Reef.

Actions #14

Updated by Brad Hubbard 11 months ago

Still specific to Jammy.

Actions #15

Updated by Laura Flores 11 months ago

/a/yuriw-2023-05-24_14:33:21-rados-wip-yuri6-testing-2023-05-23-0757-reef-distro-default-smithi/7285192

Actions #16

Updated by Radoslaw Zarzynski 11 months ago

The RocksDB upgrade PR was merged on 1 March.

https://github.com/ceph/ceph/pull/49006

Actions #17

Updated by Radoslaw Zarzynski 11 months ago

Brad, let's sync up and talk about this in the DS meeting.

Actions #18

Updated by Laura Flores 10 months ago

/a/yuriw-2023-06-22_20:29:56-rados-wip-yuri3-testing-2023-06-22-0812-reef-distro-default-smithi/7313235

Actions #19

Updated by Matan Breizman 8 months ago

/a/yuriw-2023-08-22_18:16:03-rados-wip-yuri10-testing-2023-08-17-1444-distro-default-smithi/7376687

Actions #20

Updated by Radoslaw Zarzynski 8 months ago

This time it's CentOS!

rzarzynski@teuthology:/a/yuriw-2023-08-22_18:16:03-rados-wip-yuri10-testing-2023-08-17-1444-distro-default-smithi/7376687$ less teuthology.log
...
2023-08-22T21:37:19.930 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=centos%2F9%2Fx86_64&ref=wip-yuri10-testing-2023-08-17-1444
2023-08-22T21:37:20.152 INFO:teuthology.task.internal:Found packages for ceph version 18.0.0-5573.gf0ed7046

Actions #21

Updated by Brad Hubbard 8 months ago

Taking a fresh look at this, thanks Radek.

Actions #22

Updated by Laura Flores 8 months ago

/a/yuriw-2023-08-15_18:58:56-rados-wip-yuri3-testing-2023-08-15-0955-distro-default-smithi/7369175

Actions #23

Updated by Matan Breizman 6 months ago

/a/yuriw-2023-10-11_14:08:36-rados-wip-yuri11-testing-2023-10-10-1226-reef-distro-default-smithi/7421542/
/a/yuriw-2023-10-11_14:08:36-rados-wip-yuri11-testing-2023-10-10-1226-reef-distro-default-smithi/7421695/

Actions #24

Updated by Nitzan Mordechai 6 months ago

/a/yuriw-2023-10-16_14:44:27-rados-wip-yuri10-testing-2023-10-11-0812-distro-default-smithi/7429668
/a/yuriw-2023-10-16_14:44:27-rados-wip-yuri10-testing-2023-10-11-0812-distro-default-smithi/7429845
/a/yuriw-2023-10-16_14:44:27-rados-wip-yuri10-testing-2023-10-11-0812-distro-default-smithi/7429846

Actions #25

Updated by Laura Flores 6 months ago

/a/yuriw-2023-10-24_00:11:54-rados-wip-yuri4-testing-2023-10-23-0903-distro-default-smithi/7435549

Actions #26

Updated by Nitzan Mordechai 6 months ago

/a/yuriw-2023-10-30_15:34:36-rados-wip-yuri10-testing-2023-10-27-0804-distro-default-smithi/7441096
/a/yuriw-2023-10-30_15:34:36-rados-wip-yuri10-testing-2023-10-27-0804-distro-default-smithi/7441250

Actions #27

Updated by Laura Flores 6 months ago

/a/yuriw-2023-10-31_14:43:48-rados-wip-yuri4-testing-2023-10-30-1117-distro-default-smithi/7442155

Actions #28

Updated by Laura Flores 6 months ago

/a/yuriw-2023-11-02_14:20:05-rados-wip-yuri6-testing-2023-11-01-0745-reef-distro-default-smithi/7444597

Actions #29

Updated by Radoslaw Zarzynski 5 months ago

bump up.

Actions #30

Updated by Laura Flores 5 months ago

/a/yuriw-2023-11-05_15:32:58-rados-reef-release-distro-default-smithi/7448518

Actions #31

Updated by Radoslaw Zarzynski 5 months ago

Brad is taking a look.

Actions #32

Updated by Nitzan Mordechai 4 months ago

/a/yuriw-2023-12-07_16:37:24-rados-wip-yuri8-testing-2023-12-06-1425-distro-default-smithi/7482188
/a/yuriw-2023-12-07_16:37:24-rados-wip-yuri8-testing-2023-12-06-1425-distro-default-smithi/7482168

Actions #33

Updated by Matan Breizman 4 months ago

/a/yuriw-2023-12-26_16:10:01-rados-wip-yuri3-testing-2023-12-19-1211-distro-default-smithi/7501415

Actions #34

Updated by Aishwarya Mathuria 4 months ago

/a/yuriw-2024-01-03_16:19:00-rados-wip-yuri6-testing-2024-01-02-0832-distro-default-smithi/7505560/
/a/yuriw-2024-01-03_16:19:00-rados-wip-yuri6-testing-2024-01-02-0832-distro-default-smithi/7505716/

Actions #35

Updated by Nitzan Mordechai 3 months ago

/a/yuriw-2024-01-18_15:10:37-rados-wip-yuri3-testing-2024-01-17-0753-distro-default-smithi/7520620
/a/yuriw-2024-01-18_15:10:37-rados-wip-yuri3-testing-2024-01-17-0753-distro-default-smithi/7520463

Actions #36

Updated by Kamoltat (Junior) Sirivadhna 3 months ago

/a/yuriw-2024-01-31_19:20:14-rados-wip-yuri3-testing-2024-01-29-1434-distro-default-smithi/7540671

Actions #37

Updated by Laura Flores 2 months ago

/a/yuriw-2024-02-05_19:32:33-rados-wip-yuri4-testing-2024-02-05-0849-distro-default-smithi/7547525

Actions #38

Updated by Matan Breizman 2 months ago

/a/yuriw-2024-02-09_00:15:46-rados-wip-yuri2-testing-2024-02-08-0727-distro-default-smithi/7553332

/a/yuriw-2024-02-09_00:15:46-rados-wip-yuri2-testing-2024-02-08-0727-distro-default-smithi/7553494

Actions #39

Updated by Radoslaw Zarzynski 2 months ago

  • Assignee changed from Brad Hubbard to Nitzan Mordechai
Actions #40

Updated by Aishwarya Mathuria 2 months ago

/a/yuriw-2024-02-13_15:50:02-rados-wip-yuri2-testing-2024-02-12-0808-reef-distro-default-smithi/7558347/

Actions #41

Updated by Laura Flores 2 months ago

/a/lflores-2024-02-13_16:18:32-rados-wip-yuri5-testing-2024-02-12-1152-distro-default-smithi/7558507

Actions #42

Updated by Nitzan Mordechai 2 months ago

  • Pull request ID set to 55596
Actions #43

Updated by Aishwarya Mathuria about 2 months ago

/a/yuriw-2024-02-14_14:58:57-rados-wip-yuri4-testing-2024-02-13-1546-distro-default-smithi/7560007/

Actions #44

Updated by Nitzan Mordechai about 2 months ago

  • Status changed from In Progress to Fix Under Review
Actions #45

Updated by Laura Flores about 2 months ago

/a/yuriw-2024-02-28_22:53:11-rados-wip-yuri2-testing-2024-02-16-0829-reef-distro-default-smithi/7576306

Actions #46

Updated by Radoslaw Zarzynski about 2 months ago

note from scrub: the PR is approved. Needs-qa.

Actions #47

Updated by Sridhar Seshasayee about 1 month ago

/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587684
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587943

ceph_test_lazy_omap_stats still appears to crash with the fix applied, at a later point, when processing the "pg dump" output.

Logs:

2024-03-10T00:36:58.974 DEBUG:teuthology.orchestra.run.smithi005:> sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph_test_lazy_omap_stats
2024-03-10T00:36:59.122 INFO:teuthology.orchestra.run.smithi005.stdout:pool 'lazy_omap_test_pool' created
2024-03-10T00:36:59.126 INFO:teuthology.orchestra.run.smithi005.stdout:Querying pool id
2024-03-10T00:36:59.128 INFO:teuthology.orchestra.run.smithi005.stdout:Found pool ID: 2
2024-03-10T00:36:59.131 INFO:teuthology.orchestra.run.smithi005.stdout:Created payload with 2000 keys of 445 bytes each. Total size in bytes = 890000
2024-03-10T00:36:59.132 INFO:teuthology.orchestra.run.smithi005.stdout:Waiting for active+clean
2024-03-10T00:36:59.384 INFO:teuthology.orchestra.run.smithi005.stdout:.
2024-03-10T00:37:00.168 INFO:teuthology.orchestra.run.smithi005.stdout:Wrote 2000 omap keys of 445 bytes to the 69650377-ca6a-4d76-9ed0-b8232baf4954 object
2024-03-10T00:37:00.168 INFO:teuthology.orchestra.run.smithi005.stdout:Scrubbing
2024-03-10T00:37:00.168 INFO:teuthology.orchestra.run.smithi005.stdout:Before scrub stamps:
2024-03-10T00:37:00.170 INFO:teuthology.orchestra.run.smithi005.stdout:dumped all
2024-03-10T00:37:00.189 INFO:teuthology.orchestra.run.smithi005.stdout:pg = 1.0 stamp = 2024-03-10T00:36:54.112470+0000
2024-03-10T00:37:00.189 INFO:teuthology.orchestra.run.smithi005.stdout:pg = 1.1 stamp = 2024-03-10T00:36:54.112470+0000
2024-03-10T00:37:00.189 INFO:teuthology.orchestra.run.smithi005.stdout:pg = 1.2 stamp = 2024-03-10T00:36:54.112470+0000
2024-03-10T00:37:00.189 INFO:teuthology.orchestra.run.smithi005.stdout:pg = 1.3 stamp = 2024-03-10T00:36:54.112470+0000

...

2024-03-10T00:37:25.609 INFO:teuthology.orchestra.run.smithi005.stdout:Scrubbing complete
2024-03-10T00:37:25.610 INFO:teuthology.orchestra.run.smithi005.stdout:dumped all
2024-03-10T00:37:25.610 INFO:teuthology.orchestra.run.smithi005.stdout:version 29
2024-03-10T00:37:25.610 INFO:teuthology.orchestra.run.smithi005.stdout:stamp 2024-03-10T00:37:25.107820+0000
2024-03-10T00:37:25.610 INFO:teuthology.orchestra.run.smithi005.stdout:last_osdmap_epoch 0
2024-03-10T00:37:25.610 INFO:teuthology.orchestra.run.smithi005.stdout:last_pg_scan 0
2024-03-10T00:37:25.610 INFO:teuthology.orchestra.run.smithi005.stdout:PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  LOG_DUPS  DISK_LOG  STATE         STATE_STAMP                      VERSION  REPORTED  UP       UP_PRIMARY  ACTING   ACTING_PRIMARY  LAST_SCRUB  SCRUB_STAMP                      LAST_DEEP_SCRUB  DEEP_SCRUB_STAMP                 SNAPTRIMQ_LEN  LAST_SCRUB_DURATION  SCRUB_SCHEDULING                                            OBJECTS_SCRUBBED  OBJECTS_TRIMMED
2024-03-10T00:37:25.611 INFO:teuthology.orchestra.run.smithi005.stdout:2.1f           0                   0         0          0        0      0            0           0    0         0         0  active+clean  2024-03-10T00:37:17.212953+0000      0'0     15:22  [0,1,2]           0  [0,1,2]               0         0'0  2024-03-10T00:37:17.212856+0000              0'0  2024-03-10T00:37:17.212856+0000              0                    0  periodic scrub scheduled @ 2024-03-11T01:30:12.432348+0000                 0                0

...

2024-03-10T00:37:25.614 INFO:teuthology.orchestra.run.smithi005.stdout:2  1  0  0  0  0  0  890000  2000  2  2
2024-03-10T00:37:25.614 INFO:teuthology.orchestra.run.smithi005.stdout:1  0  0  0  0  0  0       0     0  0  0
2024-03-10T00:37:25.614 INFO:teuthology.orchestra.run.smithi005.stdout:
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:sum  1  0  0  0  0  0  890000  2000  2  2
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:OSD_STAT  USED    AVAIL    USED_RAW  TOTAL    HB_PEERS  PG_SUM  PRIMARY_PG_SUM
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:2         27 MiB  100 GiB    27 MiB  100 GiB     [0,1]      36              11
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:1         27 MiB  100 GiB    27 MiB  100 GiB     [0,2]      38              16
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:0         27 MiB  100 GiB    27 MiB  100 GiB     [1,2]      38              13
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:sum       80 MiB  300 GiB    80 MiB  300 GiB
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilization. See http://docs.ceph.com/en/latest/dev/placement-group/#omap-statistics for further details.
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:
2024-03-10T00:37:25.792 DEBUG:teuthology.orchestra.run:got remote process result: None
2024-03-10T00:37:25.793 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/run_tasks.py", line 105, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/run_tasks.py", line 83, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/task/exec.py", line 66, in task
    remote.run(
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/orchestra/remote.py", line 523, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/orchestra/run.py", line 179, in _raise_for_status
    raise CommandCrashedError(command=self.command)
teuthology.exceptions.CommandCrashedError: Command crashed: 'sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph_test_lazy_omap_stats'
2024-03-10T00:37:26.000 ERROR:teuthology.util.sentry: Sentry event: https://sentry.ceph.com/organizations/ceph/?query=b4490d53d0074f1ea4e0a94a7cf24187
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/run_tasks.py", line 105, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/run_tasks.py", line 83, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/task/exec.py", line 66, in task
    remote.run(
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/orchestra/remote.py", line 523, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/orchestra/run.py", line 179, in _raise_for_status
    raise CommandCrashedError(command=self.command)
teuthology.exceptions.CommandCrashedError: Command crashed: 'sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph_test_lazy_omap_stats'
2024-03-10T00:37:26.002 DEBUG:teuthology.run_tasks:Unwinding manager ceph
2024-03-10T00:37:26.011 INFO:tasks.ceph.ceph_manager.ceph:waiting for clean
Actions #48

Updated by Radoslaw Zarzynski about 1 month ago

The fix isn't merged yet, which could explain the recurrence above.

Actions #49

Updated by Sridhar Seshasayee about 1 month ago

Radoslaw Zarzynski wrote:

The fix isn't merged yet, which could explain the recurrence above.

The run mentioned in #note-47 above includes the associated PR for testing. The fix apparently worked, but the test then failed further down the line at another point.

Actions #50

Updated by Nitzan Mordechai about 1 month ago

According to the console logs:

[  473.104619] ceph_test_lazy_[35269]: segfault at 7fff643adff8 ip 0000558a2c9a3953 sp 00007fff643adf20 error 6 in ceph_test_lazy_omap_stats[558a2c987000+20000] likely on CPU 7 (core 3, socket 0)

We are still getting a segfault somehow; checking.

Actions #51

Updated by Nitzan Mordechai about 1 month ago

Now the segfault happens in the check_one function, where we also have a preliminary regex used to truncate the output, and that is what is causing the segfault. I fixed it as well and pushed to the existing PR.

Actions #52

Updated by Brad Hubbard about 1 month ago

Nitzan Mordechai wrote:

Now the segfault happens in the check_one function, where we also have a preliminary regex used to truncate the output, and that is what is causing the segfault. I fixed it as well and pushed to the existing PR.

Explained this in the PR and resubmitted for testing since needs_qa was removed due to the test failures. This is probably my fault as I should have picked that up during my review. Hopefully we are not going to see too many more of these.

Actions #53

Updated by Nitzan Mordechai about 1 month ago

Brad Hubbard wrote:

Nitzan Mordechai wrote:

Now the segfault happens in the check_one function, where we also have a preliminary regex used to truncate the output, and that is what is causing the segfault. I fixed it as well and pushed to the existing PR.

Explained this in the PR and resubmitted for testing since needs_qa was removed due to the test failures. This is probably my fault as I should have picked that up during my review. Hopefully we are not going to see too many more of these.

Thanks for bringing this up. I saw you already added the PR note and the needs-qa tag, thanks!

Actions #54

Updated by Radoslaw Zarzynski about 1 month ago

In QA.

Actions #55

Updated by Aishwarya Mathuria 29 days ago

/a/yuriw-2024-03-19_00:09:45-rados-wip-yuri5-testing-2024-03-18-1144-distro-default-smithi/7609959

Actions #56

Updated by Brad Hubbard 28 days ago

Looking at the above crash, which is referred to in https://github.com/ceph/ceph/pull/55596#issuecomment-2011798771

#0  0x000055555557a081 in std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_M_handle_match (
    __match_mode=std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_Match_mode::_Prefix, __i=11, 
    this=0x7fffffffdcd0) at /usr/include/c++/11/bits/regex_executor.tcc:326
...
#72779 std::regex_search<std::char_traits<char>, std::allocator<char>, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, char, std::__cxx11::regex_traits<char> > (__s="version 843\nstamp 2024-03-22T00:37:58.305948+0000\nlast_osdmap_epoch 0\nlast_pg_scan 0\nPG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  LOG_DUPS  "..., __s="version 843\nstamp 2024-03-22T00:37:58.305948+0000\nlast_osdmap_epoch 0\nlast_pg_scan 0\nPG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  LOG_DUPS  "..., __f=0, __e=..., __m=...) at /usr/include/c++/11/bits/regex.h:2445
#72780 LazyOmapStatsTest::check_one (this=0x7fffffffe460) at ./src/test/lazy-omap-stats/lazy_omap_stats_test.cc:305
#72781 0x000055555556d240 in LazyOmapStatsTest::run (this=0x7fffffffe460, argc=<optimized out>, argv=<optimized out>) at ./src/test/lazy-omap-stats/lazy_omap_stats_test.cc:602
#72782 0x0000555555560233 in main (argc=1, argv=0x7fffffffe638) at ./src/test/lazy-omap-stats/main.cc:20
(gdb) f 72780
#72780 LazyOmapStatsTest::check_one (this=0x7fffffffe460) at ./src/test/lazy-omap-stats/lazy_omap_stats_test.cc:305
305       regex_search(full_output, match, reg);
(gdb) l
300       string full_output = get_output();
301       cout << full_output << endl;
302       regex reg(
303           "\n((PG_STAT[\\s\\S]*)\n)OSD_STAT"); // Strip OSD_STAT table so we don't find matches there
304       smatch match;
305       regex_search(full_output, match, reg);
306       auto truncated_output = match[1].str();
307       cout << truncated_output << endl;
308       reg = regex(
309           "\n" 

So this is the same issue with the new code.

This is most likely https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86164, and it looks like we need an approach similar to https://github.com/scylladb/scylladb/pull/13452.

I'm going to look at the feasibility of moving to boost::regex, at least until this gets sorted out in libstdc++.
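
As a rough sketch only (this is not necessarily what the PR does), a boost::regex
version of the check_one() snippet from the gdb listing above might look like the
following. Recent Boost.Regex keeps its matcher state on the heap rather than the
call stack, so the same pattern over a large "pg dump" haystack should not be able
to overflow the stack; the helper name here is hypothetical.

#include <boost/regex.hpp>
#include <iostream>
#include <string>

// Hypothetical helper mirroring the truncation step in check_one(); the real
// function goes on to parse the truncated table afterwards.
std::string strip_osd_stat_table(const std::string& full_output)
{
  // Strip the OSD_STAT table so we don't find matches there, as in the
  // std::regex version shown in the gdb listing above.
  boost::regex reg("\n((PG_STAT[\\s\\S]*)\n)OSD_STAT");
  boost::smatch match;
  if (!boost::regex_search(full_output, match, reg)) {
    std::cerr << "Failed to match the PG_STAT table" << std::endl;
    return "";
  }
  return match[1].str();
}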

Stand by.

Actions #57

Updated by Laura Flores 25 days ago

/a/yuriw-2024-03-22_13:09:48-rados-wip-yuri11-testing-2024-03-21-0851-reef-distro-default-smithi/7616706

Actions #58

Updated by Laura Flores 24 days ago

/a/yuriw-2024-03-20_18:33:32-rados-wip-yuri6-testing-2024-03-18-1406-squid-distro-default-smithi/7613235

Actions #59

Updated by Laura Flores 24 days ago

  • Backport changed from reef to squid,reef
Actions #61

Updated by Brad Hubbard 21 days ago

  • Pull request ID changed from 55596 to 56574
Actions #62

Updated by Laura Flores 18 days ago

/a/yuriw-2024-03-24_22:19:24-rados-wip-yuri10-testing-2024-03-24-1159-distro-default-smithi/7620621

Actions #63

Updated by Radoslaw Zarzynski 11 days ago

The fix is in QA.

Actions #64

Updated by Brad Hubbard 11 days ago

  • Assignee changed from Nitzan Mordechai to Brad Hubbard

Taking this back.

Actions #65

Updated by Radoslaw Zarzynski 4 days ago

Bump up. In QA.

Actions #66

Updated by Aishwarya Mathuria 2 days ago

/a/yuriw-2024-04-09_14:35:50-rados-wip-yuri5-testing-2024-03-21-0833-distro-default-smithi/7648693/

Actions #67

Updated by Matan Breizman 1 day ago

/a/yuriw-2024-04-16_23:25:35-rados-wip-yuriw-testing-20240416.150233-distro-default-smithi/7659395
/a/yuriw-2024-04-16_23:25:35-rados-wip-yuriw-testing-20240416.150233-distro-default-smithi/7659539
