Bug #62777: rados/valgrind-leaks: expected valgrind issues and found none - RADOS - Ceph

Bug #62777

open

rados/valgrind-leaks: expected valgrind issues and found none

Added by Laura Flores 8 months ago. Updated 2 months ago.

Status:

New

Priority:

Normal

Severity:

3 - minor

Description

rados/valgrind-leaks/{1-start 2-inject-leak/mon centos_latest}

/a/yuriw-2023-08-11_02:49:40-rados-wip-yuri4-testing-2023-08-10-1739-distro-default-smithi/7366916

2023-08-11T09:05:29.545 ERROR:teuthology.run_tasks:Manager failed: ceph
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_2d91d6813480a3969a4f052fc486a43386694206/qa/tasks/ceph.py", line 328, in valgrind_post
    yield
  File "/home/teuthworker/src/git.ceph.com_teuthology_7fda95956ac10132c9b74016ba832db907df09fa/teuthology/contextutil.py", line 46, in nested
    if exit(*exc):
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_2d91d6813480a3969a4f052fc486a43386694206/qa/tasks/ceph.py", line 1471, in run_daemon
    teuthology.stop_daemons_of_type(ctx, type_, cluster_name)
  File "/home/teuthworker/src/git.ceph.com_teuthology_7fda95956ac10132c9b74016ba832db907df09fa/teuthology/misc.py", line 1171, in stop_daemons_of_type
    daemon.stop()
  File "/home/teuthworker/src/git.ceph.com_teuthology_7fda95956ac10132c9b74016ba832db907df09fa/teuthology/orchestra/daemon/state.py", line 139, in stop
    run.wait([self.proc], timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_teuthology_7fda95956ac10132c9b74016ba832db907df09fa/teuthology/orchestra/run.py", line 473, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_teuthology_7fda95956ac10132c9b74016ba832db907df09fa/teuthology/contextutil.py", line 134, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (51) after waiting for 300 seconds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_7fda95956ac10132c9b74016ba832db907df09fa/teuthology/run_tasks.py", line 154, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_2d91d6813480a3969a4f052fc486a43386694206/qa/tasks/ceph.py", line 1957, in task
    mon0_remote.run(
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_teuthology_7fda95956ac10132c9b74016ba832db907df09fa/teuthology/contextutil.py", line 54, in nested
    raise exc[1]
  File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_2d91d6813480a3969a4f052fc486a43386694206/qa/tasks/ceph.py", line 251, in ceph_log
    yield
  File "/home/teuthworker/src/git.ceph.com_teuthology_7fda95956ac10132c9b74016ba832db907df09fa/teuthology/contextutil.py", line 46, in nested
    if exit(*exc):
  File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_2d91d6813480a3969a4f052fc486a43386694206/qa/tasks/ceph.py", line 365, in valgrind_post
    raise Exception('expected valgrind issues and found none')
Exception: expected valgrind issues and found none

Updated by Radoslaw Zarzynski 8 months ago

Yeah, we have a test intentionally causing a leak just to ensure valgrind truly works.
I wonder what might happen if this test fails before the point where the leak is injected (due to e.g. network issues).

In the snippet:

teuthology.exceptions.MaxWhileTries: reached maximum tries (51) after waiting for 300 seconds

Let's keep an eye on it, but if the hypothesis is correct, these errors will be very infrequent.
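To make the hypothesis above concrete, here is a minimal sketch of how a check could separate "valgrind missed the injected leak" from "the run broke before valgrind could report anything". The names `Verdict` and `classify` and the `shutdown_ok` flag are hypothetical and are not taken from qa/tasks/ceph.py; this only illustrates the reasoning, not how the task is actually written.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    UNEXPECTED_ISSUES = "unexpected valgrind issues"
    MISSING_ISSUES = "expected valgrind issues and found none"
    INCONCLUSIVE = "daemon shutdown failed; valgrind result is inconclusive"


def classify(expect_issues: bool, found_issues: bool, shutdown_ok: bool) -> Verdict:
    """Hypothetical decision logic for a leak-injection test."""
    if expect_issues and not found_issues:
        # Assumption: if the daemons did not stop cleanly (for example the
        # 300 second wait timed out), the absence of a report proves nothing.
        return Verdict.MISSING_ISSUES if shutdown_ok else Verdict.INCONCLUSIVE
    if found_issues and not expect_issues:
        return Verdict.UNEXPECTED_ISSUES
    return Verdict.OK
```

With inputs like the run in the description (issues expected, none found, shutdown timed out), `classify(True, False, False)` gives `Verdict.INCONCLUSIVE` rather than a hard valgrind failure.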

Updated by Nitzan Mordechai 8 months ago

Also, the monitors didn't stop, so we are checking the valgrind logs of a process that is still running (the memory leak error shows up only after the process has exited and the leak check has run).

2023-08-11T09:00:28.441 INFO:teuthology.misc:Shutting down mon daemons...
2023-08-11T09:00:28.442 DEBUG:tasks.ceph.mon.a:waiting for process to exit
2023-08-11T09:00:28.442 INFO:teuthology.orchestra.run:waiting for 300
2023-08-11T09:00:28.506 INFO:tasks.ceph.mon.a.smithi130.stderr:2023-08-11T09:00:28.492+0000 9f3c640 -1 received  signal: Terminated from /usr/bin/python3 /bin/daemon-helper term env OPENSSL_ia32cap=~0x1000000000000000 valgrind --trace-children=no --child-silent-after-fork=yes --soname-synonyms=somalloc=*tcmalloc* --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/mon.a.log --time-stamp=yes --vgdb=yes --exit-on-first-error=yes --error-exitcode=42 --tool=memcheck --leak-check=full --show-reachable=yes ceph-mon -f --cluster ceph -i a  (PID: 84154) UID: 0
2023-08-11T09:00:28.507 INFO:tasks.ceph.mon.a.smithi130.stderr:2023-08-11T09:00:28.494+0000 9f3c640 -1 mon.a@0(leader) e1 *** Got Signal Terminated ***
2023-08-11T09:00:31.745 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 has been restored
2023-08-11T09:05:27.511 INFO:tasks.ceph:Checking cluster log for badness...
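The point about reading logs of a running process can be sketched as follows. With `--xml=yes`, memcheck writes a `<valgrindoutput>` document and records each error in an `<error>` element with a `<kind>` child; the closing `</valgrindoutput>` tag is only written when valgrind finishes. The helper name `summarize_valgrind_log` is hypothetical, and treating a missing closing tag as "process still running" is an assumption of this sketch, not something the test currently does.

```python
import re


def summarize_valgrind_log(text):
    """Return (complete, kinds) for the text of a valgrind --xml=yes log.

    complete: True if the closing root tag is present, i.e. valgrind
              finished writing the log (assumption: the process exited).
    kinds:    error kinds in the order they appear in the log.
    """
    complete = "</valgrindoutput>" in text
    kinds = re.findall(r"<kind>\s*([^<\s]+)\s*</kind>", text)
    return complete, kinds


# Made-up log fragments for illustration only.
still_running = "<valgrindoutput>\n<status><state>RUNNING</state></status>\n"
finished = (
    still_running
    + "<error><kind>Leak_DefinitelyLost</kind></error>\n"
    + "</valgrindoutput>\n"
)
```

An empty `kinds` list from an incomplete log (as with `still_running`) means "not known yet", not "no leak", so a check built on this would want to wait for `complete` before raising.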

Updated by Aishwarya Mathuria 7 months ago

/a/yuriw-2023-10-05_21:43:37-rados-wip-yuri6-testing-2023-10-04-0901-distro-default-smithi/7412032

Updated by Radoslaw Zarzynski 7 months ago

Hi Nitzan!

IIUC the test doesn't properly wait for exit of the process. Am I correct?

(this sounds like a nasty test issue).

Updated by Nitzan Mordechai 5 months ago

Radoslaw Zarzynski wrote:

Hi Nitzan!

IIUC the test doesn't properly wait for exit of the process. Am I correct?

(this sounds like a nasty test issue).

Now that I check it again, we actually exit on the first error, which is different from the leak we expected.
We already have a fix that hasn't been merged yet: https://tracker.ceph.com/issues/61774
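As a rough illustration of the exit-on-first-error point: when valgrind stops at the first error, the only error in the log may be something other than the injected leak. The sketch below takes the error kinds found in a log (in order) and reports whether the first one is a non-leak error. The helper name `first_error_masks_leak` is hypothetical; the kind names are the memcheck leak kinds used in valgrind's XML output. This is not the fix tracked in the linked issue.

```python
# Leak kinds that memcheck reports in its XML output.
LEAK_KINDS = frozenset({
    "Leak_DefinitelyLost",
    "Leak_IndirectlyLost",
    "Leak_PossiblyLost",
    "Leak_StillReachable",
})


def first_error_masks_leak(kinds):
    """True if the first recorded error is not a leak.

    Assumption: with --exit-on-first-error=yes the process stops at that
    first error, so a later leak report would never be written.
    """
    return bool(kinds) and kinds[0] not in LEAK_KINDS
```

For example, `first_error_masks_leak(["InvalidRead"])` is true, while a log whose first kind is `Leak_DefinitelyLost` is not considered masked.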
