Bug #59196
ceph_test_lazy_omap_stats segfault while waiting for active+clean
Description
2023-03-11T08:23:47.545 DEBUG:teuthology.orchestra.run.smithi005:> sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph_test_lazy_omap_stats
2023-03-11T08:23:48.487 INFO:teuthology.orchestra.run.smithi005.stdout:pool 'lazy_omap_test_pool' created
2023-03-11T08:23:48.489 INFO:teuthology.orchestra.run.smithi005.stdout:Querying pool id
2023-03-11T08:23:48.492 INFO:teuthology.orchestra.run.smithi005.stdout:Found pool ID: 2
2023-03-11T08:23:48.496 INFO:teuthology.orchestra.run.smithi005.stdout:Created payload with 2000 keys of 445 bytes each. Total size in bytes = 890000
2023-03-11T08:23:48.496 INFO:teuthology.orchestra.run.smithi005.stdout:Waiting for active+clean
2023-03-11T08:23:48.513 DEBUG:teuthology.orchestra.run:got remote process result: None
2023-03-11T08:23:48.513 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/run_tasks.py", line 103, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/run_tasks.py", line 82, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/task/exec.py", line 66, in task
    remote.run(
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/orchestra/remote.py", line 525, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_a6a4a7f010ae6b3f7fc2aef91377d4a6bee6de40/teuthology/orchestra/run.py", line 179, in _raise_for_status
    raise CommandCrashedError(command=self.command)
teuthology.exceptions.CommandCrashedError: Command crashed: 'sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph_test_lazy_omap_stats'
2023-03-11T08:23:48.594 ERROR:teuthology.run_tasks: Sentry event: https://sentry.ceph.com/organizations/ceph/?query=bc132455da90423caddad14e0a097e30
Found this in the system kernel and journalctl logs.
2023-03-11T08:23:48.510950+00:00 smithi005 kernel: [ 641.041577] ceph_test_lazy_[34021]: segfault at 7ffda9628fd8 ip 0000563670ee5b19 sp 00007ffda9628fe0 error 6 in ceph_test_lazy_omap_stats[563670ec7000+21000]
Updated by Brad Hubbard about 1 year ago
- Tags set to test-failure
Note that this tracker was originally #59058 until it was accidentally deleted by me.
Below is a summary of the comments in that previous tracker.
Issue #59058 has been updated by Brad Hubbard.
Reproduced this. I suspect it only happens on Jammy, as it has only been seen
once, and on that distro, which we have only recently started testing with.
It looks like a stack overflow due to unbounded recursion in std::regex code,
which has precedents. I may be able to work around it by massaging the regular
expression being used; we'll see after some more testing.
Issue #59058 has been updated by Brad Hubbard.
We may be dealing with something similar to
https://tracker.ceph.com/issues/55304 here. I cannot reproduce this issue on
the latest Jammy container image. If I upload the version of
ceph_test_lazy_omap_stats that I built and ran successfully in my local
container to the smithi machine failing the test, it runs without segfaulting,
whereas the version of ceph_test_lazy_omap_stats that was installed for the
test does segfault when run manually.
I see some differences in symbols when I compare the output of 'nm', and that
led me to compare the versions of gcc the binaries were compiled with.
root@smithi026:/home/ubuntu# strings ./ceph_test_lazy_omap_stats|grep "GCC: ("
GCC: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
root@smithi026:/home/ubuntu# strings /usr/lib/debug/.build-id/08/de203b7b3fa0b5080173750be0a7b2576335d9.debug|grep "GCC: ("
GCC: (Ubuntu 11.2.0-19ubuntu1) 11.2.0
So the binary that works was built with gcc 11.3.0, and the version that fails
with 11.2.0. The next step is to see whether I can set up a Jammy system with
11.2.0 installed and build ceph_test_lazy_omap_stats there to see if that
reproduces the issue.
Issue #59058 has been updated by Brad Hubbard.
Compiling with 11.2.0 failed to reproduce the issue, but by comparing the
failing binary to the one that succeeds under the debugger, I found what
appears to be the cause at a low level (the higher-level cause is still a
mystery, but it is most likely some sort of issue in the build environment, or
in the build vs. runtime environment).
The last code before we enter the std code is these two lines.
If I place a breakpoint on the last line I get the following discrepancy.
Success case:
(gdb) whatis match
type = std::__cxx11::smatch
(gdb) p match
$1 = {<std::vector<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >> = std::vector of length 0, capacity 0, _M_begin = non-dereferenceable iterator for std::vector}
Fail case:
(gdb) whatis match
type = std::__cxx11::smatch
(gdb) p match
$1 = {<std::vector<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >> = std::vector of length 172889574, capacity -1954508256483 = {
<error reading variable: Cannot access memory at address 0x7fff00000010>
So in the failure case there appears to be an issue with the just-initialised
smatch variable. I'm continuing to look at this.
Issue #59058 has been updated by Laura Flores.
Tags set to test-failure
/a/lflores-2023-03-27_02:17:31-rados-wip-aclamk-bs-elastic-shared-blob-save-25.03.2023-a-distro-default-smithi/7220933
/a/lflores-2023-03-27_02:17:31-rados-wip-aclamk-bs-elastic-shared-blob-save-25.03.2023-a-distro-default-smithi/7221086
Issue #59058 has been updated by Laura Flores.
/a/yuriw-2023-03-27_23:05:54-rados-wip-yuri4-testing-2023-03-25-0714-distro-default-smithi/7222036
Issue #59058 has been updated by Laura Flores.
This is happening quite frequently in the rados suite. It certainly points to a recent regression.
Updated by Radoslaw Zarzynski about 1 year ago
- Status changed from New to In Progress
Looks like the problem is under investigation. Please correct me if I'm wrong.
Updated by Laura Flores about 1 year ago
Yes Radek, it is being investigated by Brad.
/a/yuriw-2023-03-27_23:05:54-rados-wip-yuri4-testing-2023-03-25-0714-distro-default-smithi/7222036
Updated by Laura Flores about 1 year ago
/a/yuriw-2023-03-30_21:53:20-rados-wip-yuri7-testing-2023-03-29-1100-distro-default-smithi/7227904
Updated by Laura Flores about 1 year ago
/a/yuriw-2023-04-04_15:24:40-rados-wip-yuri4-testing-2023-03-31-1237-distro-default-smithi/7231452
Updated by Laura Flores about 1 year ago
/a/yuriw-2023-03-30_21:29:24-rados-wip-yuri2-testing-2023-03-30-0826-distro-default-smithi/7227539
Updated by Laura Flores about 1 year ago
/a/lflores-2023-04-07_22:22:04-rados-wip-yuri4-testing-2023-04-07-1825-distro-default-smithi/7235344
Updated by Sridhar Seshasayee 12 months ago
/a/sseshasa-2023-05-02_03:12:27-rados-wip-sseshasa3-testing-2023-05-01-2154-distro-default-smithi/7260300
journalctl-b0.gz:May 02 04:26:30 smithi175 sudo[34509]: ubuntu : PWD=/home/ubuntu ; USER=root ; ENV=TESTDIR=/home/ubuntu/cephtest ; COMMAND=/usr/bin/bash -c ceph_test_lazy_omap_stats
journalctl-b0.gz:May 02 04:26:31 smithi175 kernel: ceph_test_lazy_[34510]: segfault at 7ffeedae1ff8 ip 00005567a18b6549 sp 00007ffeedae1f30 error 6 in ceph_test_lazy_omap_stats[5567a1898000+21000]
kern.log.gz:2023-05-02T04:26:31.788640+00:00 smithi175 kernel: [ 640.526218] ceph_test_lazy_[34510]: segfault at 7ffeedae1ff8 ip 00005567a18b6549 sp 00007ffeedae1f30 error 6 in ceph_test_lazy_omap_stats[5567a1898000+21000]
Updated by Radoslaw Zarzynski 12 months ago
Let's check whether this reproduces in Reef too. If so, then... there is no OMAP without RocksDB and we upgraded it recently...
Updated by Brad Hubbard 12 months ago
Radoslaw Zarzynski wrote:
Let's check whether this reproduces in Reef too. If so, then... there is no OMAP without RocksDB and we upgraded it recently...
Hey Radek,
To me it's more significant that every instance above was seen on VERSION="22.04.1 LTS (Jammy Jellyfish)", and I think this has something to do with the way we are building for Jammy. Somehow we are exposing some sort of library mismatch, or something similar. I need to try to reproduce the build environment to test this theory, which I may need some help with.
Updated by Laura Flores 11 months ago
/a/yuriw-2023-05-11_15:01:38-rados-wip-yuri8-testing-2023-05-10-1402-distro-default-smithi/7271184
So far, no Reef sightings.
Updated by Brad Hubbard 11 months ago
Laura Flores wrote:
/a/yuriw-2023-05-11_15:01:38-rados-wip-yuri8-testing-2023-05-10-1402-distro-default-smithi/7271184
So far, no Reef sightings.
And Jammy yet again.
Updated by Laura Flores 11 months ago
- Backport set to reef
/a/lflores-2023-05-22_16:08:13-rados-wip-yuri6-testing-2023-05-19-1351-reef-distro-default-smithi/7282703
Was already in Reef as far back as March 11 (/a/yuriw-2023-03-10_22:46:37-rados-reef-distro-default-smithi/7203287), so this test batch is not introducing the bug to Reef.
Updated by Laura Flores 11 months ago
/a/yuriw-2023-05-24_14:33:21-rados-wip-yuri6-testing-2023-05-23-0757-reef-distro-default-smithi/7285192
Updated by Radoslaw Zarzynski 11 months ago
The RocksDB upgrade PR was merged on March 1st.
Updated by Radoslaw Zarzynski 11 months ago
Brad, let's talk about this in the DS meeting.
Updated by Laura Flores 10 months ago
/a/yuriw-2023-06-22_20:29:56-rados-wip-yuri3-testing-2023-06-22-0812-reef-distro-default-smithi/7313235
Updated by Matan Breizman 8 months ago
/a/yuriw-2023-08-22_18:16:03-rados-wip-yuri10-testing-2023-08-17-1444-distro-default-smithi/7376687
Updated by Radoslaw Zarzynski 8 months ago
This time it's CentOS!
rzarzynski@teuthology:/a/yuriw-2023-08-22_18:16:03-rados-wip-yuri10-testing-2023-08-17-1444-distro-default-smithi/7376687$ less teuthology.log
...
2023-08-22T21:37:19.930 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=centos%2F9%2Fx86_64&ref=wip-yuri10-testing-2023-08-17-1444
2023-08-22T21:37:20.152 INFO:teuthology.task.internal:Found packages for ceph version 18.0.0-5573.gf0ed7046
Updated by Brad Hubbard 8 months ago
Taking a fresh look at this, thanks Radek.
Updated by Laura Flores 8 months ago
/a/yuriw-2023-08-15_18:58:56-rados-wip-yuri3-testing-2023-08-15-0955-distro-default-smithi/7369175
Updated by Matan Breizman 6 months ago
/a/yuriw-2023-10-11_14:08:36-rados-wip-yuri11-testing-2023-10-10-1226-reef-distro-default-smithi/7421542/
/a/yuriw-2023-10-11_14:08:36-rados-wip-yuri11-testing-2023-10-10-1226-reef-distro-default-smithi/7421695/
Updated by Nitzan Mordechai 6 months ago
/a/yuriw-2023-10-16_14:44:27-rados-wip-yuri10-testing-2023-10-11-0812-distro-default-smithi/7429668
/a/yuriw-2023-10-16_14:44:27-rados-wip-yuri10-testing-2023-10-11-0812-distro-default-smithi/7429845
/a/yuriw-2023-10-16_14:44:27-rados-wip-yuri10-testing-2023-10-11-0812-distro-default-smithi/7429846
Updated by Laura Flores 6 months ago
/a/yuriw-2023-10-24_00:11:54-rados-wip-yuri4-testing-2023-10-23-0903-distro-default-smithi/7435549
Updated by Nitzan Mordechai 6 months ago
/a/yuriw-2023-10-30_15:34:36-rados-wip-yuri10-testing-2023-10-27-0804-distro-default-smithi/7441096
/a/yuriw-2023-10-30_15:34:36-rados-wip-yuri10-testing-2023-10-27-0804-distro-default-smithi/7441250
Updated by Laura Flores 6 months ago
/a/yuriw-2023-10-31_14:43:48-rados-wip-yuri4-testing-2023-10-30-1117-distro-default-smithi/7442155
Updated by Laura Flores 6 months ago
/a/yuriw-2023-11-02_14:20:05-rados-wip-yuri6-testing-2023-11-01-0745-reef-distro-default-smithi/7444597
Updated by Laura Flores 5 months ago
/a/yuriw-2023-11-05_15:32:58-rados-reef-release-distro-default-smithi/7448518
Updated by Nitzan Mordechai 4 months ago
/a/yuriw-2023-12-07_16:37:24-rados-wip-yuri8-testing-2023-12-06-1425-distro-default-smithi/7482188
/a/yuriw-2023-12-07_16:37:24-rados-wip-yuri8-testing-2023-12-06-1425-distro-default-smithi/7482168
Updated by Matan Breizman 4 months ago
/a/yuriw-2023-12-26_16:10:01-rados-wip-yuri3-testing-2023-12-19-1211-distro-default-smithi/7501415
Updated by Aishwarya Mathuria 4 months ago
/a/yuriw-2024-01-03_16:19:00-rados-wip-yuri6-testing-2024-01-02-0832-distro-default-smithi/7505560/
/a/yuriw-2024-01-03_16:19:00-rados-wip-yuri6-testing-2024-01-02-0832-distro-default-smithi/7505716/
Updated by Nitzan Mordechai 3 months ago
/a/yuriw-2024-01-18_15:10:37-rados-wip-yuri3-testing-2024-01-17-0753-distro-default-smithi/7520620
/a/yuriw-2024-01-18_15:10:37-rados-wip-yuri3-testing-2024-01-17-0753-distro-default-smithi/7520463
Updated by Kamoltat (Junior) Sirivadhna 3 months ago
/a/yuriw-2024-01-31_19:20:14-rados-wip-yuri3-testing-2024-01-29-1434-distro-default-smithi/7540671
Updated by Laura Flores 2 months ago
/a/yuriw-2024-02-05_19:32:33-rados-wip-yuri4-testing-2024-02-05-0849-distro-default-smithi/7547525
Updated by Matan Breizman 2 months ago
/a/yuriw-2024-02-09_00:15:46-rados-wip-yuri2-testing-2024-02-08-0727-distro-default-smithi/7553332
/a/yuriw-2024-02-09_00:15:46-rados-wip-yuri2-testing-2024-02-08-0727-distro-default-smithi/7553494
Updated by Radoslaw Zarzynski 2 months ago
- Assignee changed from Brad Hubbard to Nitzan Mordechai
Updated by Aishwarya Mathuria 2 months ago
/a/yuriw-2024-02-13_15:50:02-rados-wip-yuri2-testing-2024-02-12-0808-reef-distro-default-smithi/7558347/
Updated by Laura Flores 2 months ago
/a/lflores-2024-02-13_16:18:32-rados-wip-yuri5-testing-2024-02-12-1152-distro-default-smithi/7558507
Updated by Aishwarya Mathuria about 2 months ago
/a/yuriw-2024-02-14_14:58:57-rados-wip-yuri4-testing-2024-02-13-1546-distro-default-smithi/7560007/
Updated by Nitzan Mordechai about 2 months ago
- Status changed from In Progress to Fix Under Review
Updated by Laura Flores about 2 months ago
/a/yuriw-2024-02-28_22:53:11-rados-wip-yuri2-testing-2024-02-16-0829-reef-distro-default-smithi/7576306
Updated by Radoslaw Zarzynski about 2 months ago
Note from scrub: the PR is approved. Needs QA.
Updated by Sridhar Seshasayee about 1 month ago
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587684
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587943
ceph_test_lazy_omap_stats still appears to crash with the fix, at a later point, when processing the "pg dump" output.
Logs:
2024-03-10T00:36:58.974 DEBUG:teuthology.orchestra.run.smithi005:> sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph_test_lazy_omap_stats
2024-03-10T00:36:59.122 INFO:teuthology.orchestra.run.smithi005.stdout:pool 'lazy_omap_test_pool' created
2024-03-10T00:36:59.126 INFO:teuthology.orchestra.run.smithi005.stdout:Querying pool id
2024-03-10T00:36:59.128 INFO:teuthology.orchestra.run.smithi005.stdout:Found pool ID: 2
2024-03-10T00:36:59.131 INFO:teuthology.orchestra.run.smithi005.stdout:Created payload with 2000 keys of 445 bytes each. Total size in bytes = 890000
2024-03-10T00:36:59.132 INFO:teuthology.orchestra.run.smithi005.stdout:Waiting for active+clean
2024-03-10T00:36:59.384 INFO:teuthology.orchestra.run.smithi005.stdout:.
2024-03-10T00:37:00.168 INFO:teuthology.orchestra.run.smithi005.stdout:Wrote 2000 omap keys of 445 bytes to the 69650377-ca6a-4d76-9ed0-b8232baf4954 object
2024-03-10T00:37:00.168 INFO:teuthology.orchestra.run.smithi005.stdout:Scrubbing
2024-03-10T00:37:00.168 INFO:teuthology.orchestra.run.smithi005.stdout:Before scrub stamps:
2024-03-10T00:37:00.170 INFO:teuthology.orchestra.run.smithi005.stdout:dumped all
2024-03-10T00:37:00.189 INFO:teuthology.orchestra.run.smithi005.stdout:pg = 1.0 stamp = 2024-03-10T00:36:54.112470+0000
2024-03-10T00:37:00.189 INFO:teuthology.orchestra.run.smithi005.stdout:pg = 1.1 stamp = 2024-03-10T00:36:54.112470+0000
2024-03-10T00:37:00.189 INFO:teuthology.orchestra.run.smithi005.stdout:pg = 1.2 stamp = 2024-03-10T00:36:54.112470+0000
2024-03-10T00:37:00.189 INFO:teuthology.orchestra.run.smithi005.stdout:pg = 1.3 stamp = 2024-03-10T00:36:54.112470+0000
...
2024-03-10T00:37:25.609 INFO:teuthology.orchestra.run.smithi005.stdout:Scrubbing complete
2024-03-10T00:37:25.610 INFO:teuthology.orchestra.run.smithi005.stdout:dumped all
2024-03-10T00:37:25.610 INFO:teuthology.orchestra.run.smithi005.stdout:version 29
2024-03-10T00:37:25.610 INFO:teuthology.orchestra.run.smithi005.stdout:stamp 2024-03-10T00:37:25.107820+0000
2024-03-10T00:37:25.610 INFO:teuthology.orchestra.run.smithi005.stdout:last_osdmap_epoch 0
2024-03-10T00:37:25.610 INFO:teuthology.orchestra.run.smithi005.stdout:last_pg_scan 0
2024-03-10T00:37:25.610 INFO:teuthology.orchestra.run.smithi005.stdout:PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN LAST_SCRUB_DURATION SCRUB_SCHEDULING OBJECTS_SCRUBBED OBJECTS_TRIMMED
2024-03-10T00:37:25.611 INFO:teuthology.orchestra.run.smithi005.stdout:2.1f 0 0 0 0 0 0 0 0 0 0 0 active+clean 2024-03-10T00:37:17.212953+0000 0'0 15:22 [0,1,2] 0 [0,1,2] 0 0'0 2024-03-10T00:37:17.212856+0000 0'0 2024-03-10T00:37:17.212856+0000 0 0 periodic scrub scheduled @ 2024-03-11T01:30:12.432348+0000 0 0
...
2024-03-10T00:37:25.614 INFO:teuthology.orchestra.run.smithi005.stdout:2 1 0 0 0 0 0 890000 2000 2 2
2024-03-10T00:37:25.614 INFO:teuthology.orchestra.run.smithi005.stdout:1 0 0 0 0 0 0 0 0 0 0
2024-03-10T00:37:25.614 INFO:teuthology.orchestra.run.smithi005.stdout:
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:sum 1 0 0 0 0 0 890000 2000 2 2
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:OSD_STAT USED AVAIL USED_RAW TOTAL HB_PEERS PG_SUM PRIMARY_PG_SUM
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:2 27 MiB 100 GiB 27 MiB 100 GiB [0,1] 36 11
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:1 27 MiB 100 GiB 27 MiB 100 GiB [0,2] 38 16
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:0 27 MiB 100 GiB 27 MiB 100 GiB [1,2] 38 13
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:sum 80 MiB 300 GiB 80 MiB 300 GiB
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilization. See http://docs.ceph.com/en/latest/dev/placement-group/#omap-statistics for further details.
2024-03-10T00:37:25.615 INFO:teuthology.orchestra.run.smithi005.stdout:
2024-03-10T00:37:25.792 DEBUG:teuthology.orchestra.run:got remote process result: None
2024-03-10T00:37:25.793 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/run_tasks.py", line 105, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/run_tasks.py", line 83, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/task/exec.py", line 66, in task
    remote.run(
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/orchestra/remote.py", line 523, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/orchestra/run.py", line 179, in _raise_for_status
    raise CommandCrashedError(command=self.command)
teuthology.exceptions.CommandCrashedError: Command crashed: 'sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph_test_lazy_omap_stats'
2024-03-10T00:37:26.000 ERROR:teuthology.util.sentry: Sentry event: https://sentry.ceph.com/organizations/ceph/?query=b4490d53d0074f1ea4e0a94a7cf24187
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/run_tasks.py", line 105, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/run_tasks.py", line 83, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/task/exec.py", line 66, in task
    remote.run(
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/orchestra/remote.py", line 523, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_e691533f9cbb33d85b2187bba20d7102f098636d/teuthology/orchestra/run.py", line 179, in _raise_for_status
    raise CommandCrashedError(command=self.command)
teuthology.exceptions.CommandCrashedError: Command crashed: 'sudo TESTDIR=/home/ubuntu/cephtest bash -c ceph_test_lazy_omap_stats'
2024-03-10T00:37:26.002 DEBUG:teuthology.run_tasks:Unwinding manager ceph
2024-03-10T00:37:26.011 INFO:tasks.ceph.ceph_manager.ceph:waiting for clean
Updated by Radoslaw Zarzynski about 1 month ago
The fix isn't merged yet, which could explain the recurrence above.
Updated by Sridhar Seshasayee about 1 month ago
Radoslaw Zarzynski wrote:
The fix isn't merged yet which could explain the reoccurrence above
The run mentioned in #note-47 above includes the associated PR for testing. The fix apparently worked, but the test failed further down the line at some other point.
Updated by Nitzan Mordechai about 1 month ago
According to the console logs:
[ 473.104619] ceph_test_lazy_[35269]: segfault at 7fff643adff8 ip 0000558a2c9a3953 sp 00007fff643adf20 error 6 in ceph_test_lazy_omap_stats[558a2c987000+20000] likely on CPU 7 (core 3, socket 0)
We are still getting a segfault somehow; checking.
Updated by Nitzan Mordechai about 1 month ago
Now the segfault happens in the check_one function, where we also have a preliminary regex to truncate the output, and that regex is causing the segfault. I fixed it as well and pushed to the existing PR.
Updated by Brad Hubbard about 1 month ago
Nitzan Mordechai wrote:
Now the segfault happens in the check_one function, where we also have a preliminary regex to truncate the output, and that regex is causing the segfault. I fixed it as well and pushed to the existing PR.
Explained this in the PR and resubmitted for testing, since needs-qa was removed due to the test failures. This is probably my fault, as I should have picked it up during my review. Hopefully we are not going to see too many more of these.
Updated by Nitzan Mordechai about 1 month ago
Brad Hubbard wrote:
Nitzan Mordechai wrote:
Now the segfault happens in the check_one function, where we also have a preliminary regex to truncate the output, and that regex is causing the segfault. I fixed it as well and pushed to the existing PR.
Explained this in the PR and resubmitted for testing, since needs-qa was removed due to the test failures. This is probably my fault, as I should have picked it up during my review. Hopefully we are not going to see too many more of these.
Thanks for bringing this up. I saw you already added the PR note and the needs-qa tag, thanks!
Updated by Aishwarya Mathuria 29 days ago
/a/yuriw-2024-03-19_00:09:45-rados-wip-yuri5-testing-2024-03-18-1144-distro-default-smithi/7609959
Updated by Brad Hubbard 28 days ago
Looking at the above crash which is referred to in https://github.com/ceph/ceph/pull/55596#issuecomment-2011798771
#0  0x000055555557a081 in std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_M_handle_match (__match_mode=std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_Match_mode::_Prefix, __i=11, this=0x7fffffffdcd0) at /usr/include/c++/11/bits/regex_executor.tcc:326
...
#72779 std::regex_search<std::char_traits<char>, std::allocator<char>, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, char, std::__cxx11::regex_traits<char> > (__s="version 843\nstamp 2024-03-22T00:37:58.305948+0000\nlast_osdmap_epoch 0\nlast_pg_scan 0\nPG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS "..., __s="version 843\nstamp 2024-03-22T00:37:58.305948+0000\nlast_osdmap_epoch 0\nlast_pg_scan 0\nPG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS "..., __f=0, __e=..., __m=...) at /usr/include/c++/11/bits/regex.h:2445
#72780 LazyOmapStatsTest::check_one (this=0x7fffffffe460) at ./src/test/lazy-omap-stats/lazy_omap_stats_test.cc:305
#72781 0x000055555556d240 in LazyOmapStatsTest::run (this=0x7fffffffe460, argc=<optimized out>, argv=<optimized out>) at ./src/test/lazy-omap-stats/lazy_omap_stats_test.cc:602
#72782 0x0000555555560233 in main (argc=1, argv=0x7fffffffe638) at ./src/test/lazy-omap-stats/main.cc:20
(gdb) f 72780
#72780 LazyOmapStatsTest::check_one (this=0x7fffffffe460) at ./src/test/lazy-omap-stats/lazy_omap_stats_test.cc:305
305         regex_search(full_output, match, reg);
(gdb) l
300         string full_output = get_output();
301         cout << full_output << endl;
302         regex reg(
303             "\n((PG_STAT[\\s\\S]*)\n)OSD_STAT");  // Strip OSD_STAT table so we don't find matches there
304         smatch match;
305         regex_search(full_output, match, reg);
306         auto truncated_output = match[1].str();
307         cout << truncated_output << endl;
308         reg = regex(
309             "\n"
So this is the same issue with the new code.
This is most likely https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86164, and it looks like we need an approach similar to https://github.com/scylladb/scylladb/pull/13452.
I'm going to look at the feasibility of moving to boost::regex, at least until this gets sorted out in libstdc++.
Stand by.
Updated by Laura Flores 25 days ago
/a/yuriw-2024-03-22_13:09:48-rados-wip-yuri11-testing-2024-03-21-0851-reef-distro-default-smithi/7616706
Updated by Laura Flores 24 days ago
/a/yuriw-2024-03-20_18:33:32-rados-wip-yuri6-testing-2024-03-18-1406-squid-distro-default-smithi/7613235
Updated by Brad Hubbard 21 days ago
Closing https://github.com/ceph/ceph/pull/55596 in favour of https://github.com/ceph/ceph/pull/56574
Updated by Brad Hubbard 21 days ago
- Pull request ID changed from 55596 to 56574
Updated by Laura Flores 18 days ago
/a/yuriw-2024-03-24_22:19:24-rados-wip-yuri10-testing-2024-03-24-1159-distro-default-smithi/7620621
Updated by Brad Hubbard 11 days ago
- Assignee changed from Nitzan Mordechai to Brad Hubbard
Taking this back.
Updated by Aishwarya Mathuria 2 days ago
/a/yuriw-2024-04-09_14:35:50-rados-wip-yuri5-testing-2024-03-21-0833-distro-default-smithi/7648693/
Updated by Matan Breizman 1 day ago
/a/yuriw-2024-04-16_23:25:35-rados-wip-yuriw-testing-20240416.150233-distro-default-smithi/7659395
/a/yuriw-2024-04-16_23:25:35-rados-wip-yuriw-testing-20240416.150233-distro-default-smithi/7659539