Bug #63949

open

leak in mds.c detected by valgrind during CephFS QA run

Added by Rishabh Dave 4 months ago. Updated 2 months ago.

Status:
Triaged
Priority:
Normal
Assignee:
Leonid Usov
Category:
Performance/Resource Usage
Target version:
v19.0.0
% Done:

0%

Source:
Tags:
Backport:
quincy,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Link to the failure - https://pulpito.ceph.com/rishabh-2023-12-17_11:19:43-fs-rishabh-2023dec11-testing-default-smithi/7495166

I have talked with Laura and Venky about this. Unlike other failures in this QA run, this one looks like it is unrelated to RADOS/core. According to Venky, it isn't caused by any of the CephFS PRs that were being tested in this QA run. Therefore, it is likely caused by some CephFS PR that was merged previously.

It appears that valgrind detected a leak in mds.c. From valgrind.yaml -

- file_path: /var/log/ceph/valgrind/mds.c.log
  kind: Leak_PossiblyLost

Related issues 1 (1 open, 0 closed)

Related to CephFS - Bug #64149: valgrind+mds/client: gracefully shutdown the mds during valgrind tests (New, assigned to Kotresh Hiremath Ravishankar)

Actions #1

Updated by Venky Shankar 4 months ago

  • Category set to Performance/Resource Usage
  • Status changed from New to Triaged
  • Assignee set to Leonid Usov
  • Target version set to v19.0.0
  • Backport set to quincy,reef
Actions #2

Updated by Leonid Usov 4 months ago

I looked at the valgrind logs and noticed that the valgrind.yaml report is merely scratching the surface. It appears that the ValgrindScanner only takes the first error it finds in the XML report file, but for our run there are TONS of leaks. Consider this line from the problematic report:

  <text>32 bytes in 1 blocks are possibly lost in loss record 1,295 of 3,793</text>
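
As an illustration of the scale involved, here is a minimal sketch (not teuthology's actual ValgrindScanner code) that tallies every record in the report instead of stopping at the first; it only assumes the standard layout of `valgrind --xml=yes` output, where each record is an <error> element with a <kind> and a summary <text>:

  #!/usr/bin/env python3
  # Hedged sketch, not the teuthology scanner: count every <error> record in a
  # valgrind XML report rather than only the first one. Assumes the standard
  # `valgrind --xml=yes` structure: each record has a <kind> (e.g.
  # Leak_PossiblyLost) and a summary under <xwhat>/<text> or <what>.
  import sys
  from collections import Counter
  import xml.etree.ElementTree as ET

  def summarize(xml_path):
      kinds = Counter()
      for error in ET.parse(xml_path).getroot().iter("error"):
          kind = error.findtext("kind", default="Unknown")
          text = error.findtext("xwhat/text") or error.findtext("what") or ""
          kinds[kind] += 1
          print(f"{kind}: {text}")
      return kinds

  if __name__ == "__main__":
      for kind, count in summarize(sys.argv[1]).most_common():
          print(f"{count:5d}  {kind}")

Run against the archived XML report that the scanner parses, this would print the full per-kind breakdown that the single valgrind.yaml entry hides.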

As for the leaks themselves, I don't think that the __trans_list_add one is cephfs's fault. There are many more severe leaks, most from the stdlib code like thread stack or the ceph-common library.

There are tens of

      <ip>0x4EB1DE9</ip>
      <obj>/usr/lib/ceph/libceph-common.so.2</obj>
      <fn>ceph::buffer::v15_2_0::create_aligned_in_mempool(unsigned int, unsigned int, int)</fn>
      <dir>./obj-x86_64-linux-gnu/src/common/./src/common</dir>
      <file>buffer.cc</file>
      <line>110</line>

kind of leaks of 104 or 120 bytes, reached via CephXAuthorizer::add_challenge;

there are 6 global_init leaks of 450 bytes, also from the ceph-common library;

there are tens of pthread_create leaks via std::thread objects initialized all over the code.


I suspect that we are looking here at a valgrind report of a crash. I wouldn't expect a crashed executable to clean up memory, so the scary valgrind report is merely a secondary side effect of something else.

The test being run is a stress test. Do we expect the MDS to crash in this case? I'll inspect the corresponding mds log

Actions #3

Updated by Leonid Usov 4 months ago

Looking at the mds.c.log I don't find any signs of a crash, but I do see the `respawn` happening at the very end:

     0> 2023-12-17T13:35:27.972+0000 d1af640  1 mds.c respawn!

Also, from the teuthology log:

  [v2:172.21.15.43:3300/0,v1:172.21.15.43:6789/0] -- mon_command({"prefix": "fs fail", "fs_name": "cephfs"} v 0) v1 -- 0x7f2778074480 con 0x7f2778153a80
2023-12-17T13:35:27.864 INFO:teuthology.orchestra.run.smithi043.stderr:2023-12-17T13:35:27.876+0000 7f2774ff9640  1 -- 172.21.15.43:0/4180275394 <== mon.0 v2:172.21.15.43:3300/0 7 ==== mon_command_ack([{"prefix": "fs fail", "fs_name": "cephfs"}]=0 cephfs marked not joinable; MDS cannot join the cluster. All MDS ranks marked failed. v108) v1 ==== 161+0+0 (secure 0 0 0) 0x7f2770016030 con 0x7f2778153a80
Actions #4

Updated by Venky Shankar 4 months ago

Leonid Usov wrote:

I looked at the valgrind logs and noticed that the valgrind.yaml report is merely scratching the surface. It appears that the ValgrindScanner only takes the first error it finds in the XML report file, but for our run there are TONS of leaks. Consider this line from the problematic report:

[...]

As for the leaks themselves, I don't think that the __trans_list_add one is cephfs's fault. There are many more severe leaks, most from the stdlib code like thread stack or the ceph-common library.

There are tens of

[...]

kind of leaks of 104 or 120 bytes, reached via CephXAuthorizer::add_challenge;

there are 6 global_init leaks of 450 bytes, also from the ceph-common library;

there are tens of pthread_create leaks via std::thread objects initialized all over the code.


I suspect that we are looking here at a valgrind report of a crash. I wouldn't expect a crashed executable to clean up memory, so the scary valgrind report is merely a secondary side effect of something else.

That makes sense. Is there a coredump of the daemon crash you mention?

The test being run is a stress test. Do we expect the MDS to crash in this case? I'll inspect the corresponding mds log

This job yaml is:

fs/verify/{begin/{0-install 1-ceph 2-logrotate} clusters/1a5s-mds-1c-client conf/{client mds mon osd} distro/{ubuntu/{overrides ubuntu_latest}} mount/kclient/{k-testing mount ms-die-on-skipped} objectstore-ec/bluestore-comp-ec-root overrides/{ignorelist_health ignorelist_wrongly_marked_down mon-debug session_timeout} ranks/3 tasks/fsstress validater/valgrind}

With 3 active MDS ranks. This test isn't a thrash test where the thrash task would kill off a (random) MDS, so the MDS respawning needs to be root-caused. The monitors would have removed this MDS from the map and you'd get the required details in the monitor logs as to why this MDS was removed.

Actions #5

Updated by Leonid Usov 4 months ago

OK. Looking at the monitor logs, I find the same evidence that an `fs fail cephfs` command was sent:

2023-12-17T13:35:27.552+0000 10352640 20 mon.a@0(leader) e1  entity_name client.admin global_id 4980 (new_ok) caps allow *
2023-12-17T13:35:27.552+0000 10352640  0 mon.a@0(leader) e1 handle_command mon_command({"prefix": "fs fail", "fs_name": "cephfs"} v 0) v1
...
2023-12-17T13:35:27.552+0000 10352640  7 mon.a@0(leader).mds e107 prepare_update mon_command({"prefix": "fs fail", "fs_name": "cephfs"} v 0) v1
2023-12-17T13:35:27.556+0000 10352640  1 mon.a@0(leader).mds e107 fail_mds_gid 4245 mds.c role 2
2023-12-17T13:35:27.560+0000 10352640 10 mon.a@0(leader).osd e29 blocklist [v2:172.21.15.43:6833/1978871051,v1:172.21.15.43:6835/1978871051] until 2023-12-18T13:35:27.560705+0000
2023-12-17T13:35:27.568+0000 10352640  1 mon.a@0(leader).mds e107 fail_mds_gid 4260 mds.f role 1
2023-12-17T13:35:27.568+0000 10352640 10 mon.a@0(leader).osd e29 blocklist [v2:172.21.15.111:6836/1297041329,v1:172.21.15.111:6837/1297041329] until 2023-12-18T13:35:27.570746+0000
2023-12-17T13:35:27.568+0000 10352640  1 mon.a@0(leader).mds e107 fail_mds_gid 4263 mds.b role 0
2023-12-17T13:35:27.568+0000 10352640 10 mon.a@0(leader).osd e29 blocklist [v2:172.21.15.111:6838/3025782855,v1:172.21.15.111:6839/3025782855] until 2023-12-18T13:35:27.571711+0000

The command origin is logged in the teuthology log:

2023-12-17T13:35:26.800 DEBUG:teuthology.orchestra.run.smithi043:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph fs fail cephfs

It seems as if the fsstress command finished successfully:

2023-12-17T12:40:14.620 INFO:tasks.workunit.client.0.smithi043.stdout:5/999: creat d5/d3f/f159 x:0 0 0
2023-12-17T12:40:14.650 INFO:tasks.workunit.client.0.smithi043.stderr:+ rm -rf -- ./tmp.XipIPTBbjY
...
2023-12-17T13:33:39.032 INFO:tasks.workunit:Stopping ['suites/fsstress.sh'] on client.0...
2023-12-17T13:33:39.032 DEBUG:teuthology.orchestra.run.smithi043:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2023-12-17T13:33:39.921 DEBUG:teuthology.parallel:result is None
2023-12-17T13:33:39.921 DEBUG:teuthology.orchestra.run.smithi043:> sudo rm -rf -- /home/ubuntu/cephtest/mnt.0/client.0
2023-12-17T13:33:40.813 INFO:tasks.workunit:Deleted dir /home/ubuntu/cephtest/mnt.0/client.0
2023-12-17T13:33:40.813 DEBUG:teuthology.run_tasks:Unwinding manager kclient
2023-12-17T13:33:40.824 INFO:tasks.kclient:Unmounting kernel clients...
2023-12-17T13:33:40.824 DEBUG:teuthology.orchestra.run.smithi043:> set -ex
2023-12-17T13:33:40.824 DEBUG:teuthology.orchestra.run.smithi043:> dd if=/proc/self/mounts of=/dev/stdout
2023-12-17T13:33:40.833 DEBUG:teuthology.orchestra.run.smithi043:> set -ex
2023-12-17T13:33:40.833 DEBUG:teuthology.orchestra.run.smithi043:> dd if=/proc/self/mounts of=/dev/stdout
2023-12-17T13:33:40.888 DEBUG:teuthology.orchestra.run.smithi043:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd dump --format=json

My current theory is that there is some kind of race between the valgrind validator check and the teardown of the cluster after the test is complete. I'm not sure where to look for the synchronization between the two.

Actions #6

Updated by Leonid Usov 4 months ago

I've run the same test 50 times on the latest main and the results aren't green :/

https://pulpito.ceph.com/leonidus-2024-01-17_20:20:37-fs-main-distro-default-smithi/

21 instances failed with the valgrind error
25 instances failed with Command failed with status 3: 'sudo logrotate etc/logrotate.d/ceph-test.conf'
4 instances failed to power on the machine

So it seems like a rather stable reproduction. We need to look deeper at how the test is set up.

Actions #7

Updated by Venky Shankar 4 months ago

Leonid Usov wrote:

Looking at the mds.c.log I don't find any signs of a crash, but I do see the `respawn` happening at the very end:

This is the teardown phase of the tests (as you mention), so teuthology is unwinding tasks, and one of the tasks that set up the file system is now tearing it down (fs fail, remove pools, etc.). So this is fine, since it isn't something in between running tests where `fs fail` is getting run, as you mention in note-5.

Now, I'm looking at ./remote/smithi043/log/valgrind/mds.c.log.gz (the valgrind log for mds.c, where the leak is suspected). The respawn is an execv() call when the config `mds_valgrind_exit` isn't set, and execv() does not work well with valgrind. I checked the configs in the job and `mds_valgrind_exit` isn't enabled (the config is disabled by default).

Maybe we should be enabling that for valgrind tests, in which case the mds would `exit(0)` (but then, who starts the mds process if required - maybe it's fine during teardown for this test).
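
For reference, a hedged sketch of what enabling it in a job would roughly look like; the exact override placement below is an assumption based on typical teuthology ceph-task overrides, and note that ceph config keys appear with spaces rather than underscores in job yamls:

  overrides:
    ceph:
      conf:
        mds:
          mds valgrind exit: true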

[...]

Also, from the teuthology log:

[...]

Actions #8

Updated by Leonid Usov 3 months ago

@Venky, I see the `mds_valgrind_exit` set, but it's set with spaces:

 mds valgrind exit: true
 'mds valgrind exit': True,
etc.

Also, I see this log message

  dout(1) << "respawn!" << dendl;

and I don't see this log message

  dout(1) << " exe_path " << exe_path << dendl;
...
  dout(0) << "respawn execv " << orig_argv[0]
      << " failed with " << cpp_strerror(errno) << dendl;

in the mds logs, which suggests that the mds did exit early.

Could it be that this trick no longer works with valgrind?

  /* valgrind can't handle execve; just exit and let QA infra restart */
  if (g_conf().get_val<bool>("mds_valgrind_exit")) {
    _exit(0);
  }
Actions #9

Updated by Leonid Usov 3 months ago

Maybe we just need to update qa/valgrind.supp?
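
If suppression is the right call (which is exactly what is being questioned here), an entry would look roughly like the sketch below. This is illustrative only: the name is arbitrary, and the frame patterns are guesses based on the ceph-common frames quoted in note-2 (valgrind matches C++ frames against mangled names, hence the wildcards), so they would have to be validated against the actual XML report.

  # hypothetical qa/valgrind.supp entry; name and frame patterns are illustrative
  {
     possibly_lost_buffer_create_aligned_in_mempool
     Memcheck:Leak
     match-leak-kinds: possible
     ...
     fun:*create_aligned_in_mempool*
  }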

Actions #10

Updated by Venky Shankar 3 months ago

Leonid Usov wrote:

@Venky, I see the `mds_valgrind_exit` set, but it's set with spaces:

Ah! I see it now.

[...]

Also, I see this log message

[...]

and I don't see this log message

[...]

in the mds logs, which suggests that it did exit early from the mds.

Sounds fine till here.

Could it be that this trick no longer works with valgrind?

I doubt that. I think we can go through the QA runs here

https://tracker.ceph.com/projects/cephfs/wiki/Main

and see where it did first appear. We can figure out which PRs were included in that testing branch. I won't be surprised if the leak sneaked its way in, since there is another valgrind report that might have just looked the same - this one:

https://tracker.ceph.com/issues/63233
Actions #11

Updated by Venky Shankar 3 months ago

Leonid, I think I'm close to isolating the problematic change. Will update...

Actions #12

Updated by Venky Shankar 3 months ago

Venky Shankar wrote:

Leonid, I think I'm close to isolating the problematic change. Will update...

I suspected a PR in run https://tracker.ceph.com/projects/cephfs/wiki/Main#06-Dec-2023 to be causing this, since it was in this run that this type of valgrind warning was first seen. So, I reverted the only MDS PR that was in the branch

- https://github.com/ceph/ceph/pull/48895

and ran the valgrind tests with the custom branch and with main. Results:

- https://pulpito.ceph.com/vshankar-2024-01-24_05:26:54-fs-wip-valgrind-testing-testing-default-smithi/
- https://pulpito.ceph.com/vshankar-2024-01-24_05:27:39-fs-main-testing-default-smithi/

The custom branch has far more MDS leak reports for (say) the dbench job (1 MDS rank).

- https://pulpito.ceph.com/vshankar-2024-01-24_05:26:54-fs-wip-valgrind-testing-testing-default-smithi/7530328/
- https://pulpito.ceph.com/vshankar-2024-01-24_05:27:39-fs-main-testing-default-smithi/7530335/

The difference between the jobs being the distro used and the cephfs client (kclient vs user-space).

So, I don't think the leaks are originating from the MDS; they possibly come from stdlib or ceph-common (as Leonid suspected earlier). Can someone fault my logic here?

Actions #13

Updated by Venky Shankar 3 months ago

  • Related to Bug #64149: valgrind+mds/client: gracefully shutdown the mds during valgrind tests added
Actions #14

Updated by Venky Shankar 3 months ago

Forking off a task to inspect MDS graceful shutdown when exiting during valgrind tests.
