
Bug #58098

qa/workunits/rados/test_crash.sh: crashes are never posted

Added by Laura Flores 2 months ago. Updated 4 days ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
Yes
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2022-11-23_15:09:06-rados-wip-yuri10-testing-2022-11-22-1711-distro-default-smithi/7087281

2022-11-23T18:09:31.261 INFO:tasks.workunit.client.0.smithi149.stderr:+ sudo systemctl restart ceph-crash
2022-11-23T18:09:31.261 INFO:tasks.workunit.client.0.smithi149.stderr:+ sleep 30
2022-11-23T18:09:31.994 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~5s
2022-11-23T18:09:31.995 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~5s
2022-11-23T18:09:31.995 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~5s
2022-11-23T18:09:37.197 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~10s
2022-11-23T18:09:37.198 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~10s
2022-11-23T18:09:37.198 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~10s
2022-11-23T18:09:42.400 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~16s
2022-11-23T18:09:42.400 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~16s
2022-11-23T18:09:42.401 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~16s
2022-11-23T18:09:47.603 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~21s
2022-11-23T18:09:47.603 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~21s
2022-11-23T18:09:47.603 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~21s
2022-11-23T18:09:52.805 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~26s
2022-11-23T18:09:52.805 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~26s
2022-11-23T18:09:52.805 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~26s
2022-11-23T18:09:58.008 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~31s
2022-11-23T18:09:58.008 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~31s
2022-11-23T18:09:58.008 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~31s
2022-11-23T18:10:01.262 INFO:tasks.workunit.client.0.smithi149.stderr:++ ceph crash ls
2022-11-23T18:10:01.263 INFO:tasks.workunit.client.0.smithi149.stderr:++ wc -l
2022-11-23T18:10:01.699 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-11-23T18:10:01.700 INFO:tasks.workunit.client.0.smithi149.stderr:+ '[' 0 = 4 ']'
2022-11-23T18:10:01.700 INFO:tasks.workunit.client.0.smithi149.stderr:+ exit 1

The issue here seems to be that we are checking for crashes too early. After inducing crashes, we restart ceph-crash and sleep 30 seconds, but it sometimes takes longer than that for the OSDs to restart. In this case, osds 0, 1, and 2 took ~31 seconds to restart.

A possible fix is to sleep for longer, or to poll for crashes in a retry loop.
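The retry-loop idea could look something like the following (a minimal sketch in Python rather than the workunit's shell; `wait_for` and all of its parameters are invented for illustration, not the actual fix):

```python
import time

# Hypothetical polling helper: instead of a single fixed `sleep 30`, retry the
# check until it succeeds or a deadline passes. In the workunit, predicate()
# would wrap something like `ceph crash ls | wc -l` reaching the expected count.
def wait_for(predicate, timeout=120, interval=5,
             sleep=time.sleep, clock=time.monotonic):
    deadline = clock() + timeout
    while True:
        if predicate():
            return True          # condition met before the deadline
        if clock() >= deadline:
            return False         # gave up waiting
        sleep(interval)
```

The `sleep` and `clock` parameters are injected only so the helper can be exercised without real delays.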

journalctl-b0.gz (97.5 KB) Laura Flores, 12/06/2022 03:42 PM

History

#1 Updated by Laura Flores 2 months ago

  • Tags changed from test-failure to test-failure, low-hanging-fruit

#2 Updated by Laura Flores 2 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 49104

#3 Updated by Laura Flores 2 months ago

Neha suggested we see how reproducible this is, so as not to mask any underlying problems by sleeping longer. I scheduled 20 jobs of the original failed test to check how reproducible the bug is: http://pulpito.front.sepia.ceph.com/lflores-2022-11-30_22:53:49-rados-main-distro-default-smithi/

#4 Updated by Laura Flores 2 months ago

Didn't reproduce in the 20x run above, but it did reproduce a second time here:

/a/yuriw-2022-11-28_21:09:37-rados-wip-yuri4-testing-2022-11-10-1051-distro-default-smithi/7094865

2022-12-01T15:30:51.097 INFO:tasks.workunit.client.0.smithi196.stderr:+ sudo systemctl restart ceph-crash
2022-12-01T15:30:51.098 INFO:tasks.workunit.client.0.smithi196.stderr:+ sleep 30
2022-12-01T15:30:51.199 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~5s
2022-12-01T15:30:51.199 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~5s
2022-12-01T15:30:51.200 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~5s
2022-12-01T15:30:56.402 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~10s
2022-12-01T15:30:56.402 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~10s
2022-12-01T15:30:56.403 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~10s
2022-12-01T15:31:01.605 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~16s
2022-12-01T15:31:01.606 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~16s
2022-12-01T15:31:01.606 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~16s
2022-12-01T15:31:06.808 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~21s
2022-12-01T15:31:06.809 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~21s
2022-12-01T15:31:06.809 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~21s
2022-12-01T15:31:12.011 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~26s
2022-12-01T15:31:12.012 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~26s
2022-12-01T15:31:12.012 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~26s
2022-12-01T15:31:17.215 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~31s
2022-12-01T15:31:17.216 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~31s
2022-12-01T15:31:17.216 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~31s
2022-12-01T15:31:21.046 INFO:tasks.workunit.client.0.smithi196.stderr:++ ceph crash ls
2022-12-01T15:31:21.047 INFO:tasks.workunit.client.0.smithi196.stderr:++ wc -l
2022-12-01T15:31:21.480 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-12-01T15:31:21.481 INFO:tasks.workunit.client.0.smithi196.stderr:+ '[' 0 = 4 ']'
2022-12-01T15:31:21.481 INFO:tasks.workunit.client.0.smithi196.stderr:+ exit 1

#5 Updated by Laura Flores 2 months ago

In one of the jobs that passed, the OSDs were also failed for 31 seconds, but this time, the crashes were detected. So, my initial theory that we aren't waiting long enough to detect crashes might not be right.

/a/lflores-2022-11-30_22:53:49-rados-main-distro-default-smithi/7099188

2022-12-02T10:53:43.894 INFO:tasks.workunit.client.0.smithi202.stderr:+ sudo systemctl restart ceph-crash
2022-12-02T10:53:43.894 INFO:tasks.workunit.client.0.smithi202.stderr:+ sleep 30
2022-12-02T10:53:47.148 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~5s
2022-12-02T10:53:47.148 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~5s
2022-12-02T10:53:47.148 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~5s
2022-12-02T10:53:52.350 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~10s
2022-12-02T10:53:52.351 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~10s
2022-12-02T10:53:52.351 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~10s
2022-12-02T10:53:57.554 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~16s
2022-12-02T10:53:57.554 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~16s
2022-12-02T10:53:57.554 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~16s
2022-12-02T10:54:02.757 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~21s
2022-12-02T10:54:02.757 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~21s
2022-12-02T10:54:02.757 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~21s
2022-12-02T10:54:07.960 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~26s
2022-12-02T10:54:07.960 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~26s
2022-12-02T10:54:07.961 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~26s
2022-12-02T10:54:13.163 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.0 is failed for ~31s
2022-12-02T10:54:13.163 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.1 is failed for ~31s
2022-12-02T10:54:13.164 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.2 is failed for ~31s
2022-12-02T10:54:13.845 INFO:tasks.workunit.client.0.smithi202.stderr:++ ceph crash ls
2022-12-02T10:54:13.846 INFO:tasks.workunit.client.0.smithi202.stderr:++ wc -l
2022-12-02T10:54:14.278 INFO:tasks.workunit.client.0.smithi202.stderr:+ '[' 4 = 4 ']'

#6 Updated by Laura Flores 2 months ago

  • Status changed from Fix Under Review to In Progress

In the job that passed, the mgr.server reports a recent crash:

/a/lflores-2022-11-30_22:53:49-rados-main-distro-default-smithi/7099188/remote/smithi202/log/ceph-mgr.x.log.gz

2022-12-02T10:54:13.708+0000 7fa2b0dee700 20 mgr.server operator() health checks:
{
    "RECENT_CRASH": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "3 daemons have recently crashed",
            "count": 3
        },
        "detail": [
            {
                "message": "osd.2 crashed on host smithi202 at 2022-12-02T10:53:36.007195Z" 
            },
            {
                "message": "osd.0 crashed on host smithi202 at 2022-12-02T10:53:36.007204Z" 
            },
            {
                "message": "osd.1 crashed on host smithi202 at 2022-12-02T10:53:36.007382Z" 
            }
        ]
    }
}

But this is nowhere to be seen in the failed job:
/a/yuriw-2022-11-28_21:09:37-rados-wip-yuri4-testing-2022-11-10-1051-distro-default-smithi/7094865

#7 Updated by Laura Flores 2 months ago

In a passed job, the crashes are posted:

2022-12-02T10:53:43.257+0000 7fa2b0dee700 15 mgr get_health_checks getting health checks for crash
2022-12-02T10:53:44.715+0000 7fa29fdec700  1 -- [v2:172.21.15.202:6824/95811,v1:172.21.15.202:6825/95811] <== client.4202 172.21.15.202:0/3863207119 1 ==== mgr_command(tid 0: {"prefix": "crash post", "target": ["mon-mgr", ""]}) v1 ==== 75+0+921 (secure 0 0 0) 0x55d5e39111e0 con 0x55d5e3c9e000
2022-12-02T10:53:44.716+0000 7fa29fdec700 10 mgr.server _handle_command decoded-size=2 prefix=crash post

This step was never reached in the failed job.

#8 Updated by Matan Breizman 2 months ago

/a/yuriw-2022-11-28_21:13:47-rados-wip-yuri11-testing-2022-11-18-1506-distro-default-smithi/7095031/

#9 Updated by Laura Flores 2 months ago

  • Assignee set to Laura Flores

#10 Updated by Laura Flores 2 months ago

Three recent instances of this bug in the main branch point to a regression. My next steps here will be to schedule more teuthology tests to see how many runs it will take to reproduce the bug, and then run the reproducer on Quincy.

#12 Updated by Laura Flores 2 months ago

/a/yuriw-2022-11-28_21:26:12-rados-wip-yuri7-testing-2022-11-18-1548-distro-default-smithi/7095988
/a/lflores-2022-12-02_15:43:15-rados-wip-yuri7-testing-2022-11-18-1548-distro-default-smithi/7101676

#13 Updated by Laura Flores 2 months ago

Wondering if there could have been a regression caused by https://github.com/ceph/ceph/pull/48713.

#14 Updated by Laura Flores 2 months ago

  • Subject changed from qa/workunits/rados/test_crash.sh: workunit checks for crashes too early to qa/workunits/rados/test_crash.sh: crashes are never posted

#15 Updated by Laura Flores 2 months ago

I have a feeling that the tests I scheduled earlier on the main branch all passed since the SHA it picked up is older (697b0d99d9a52df92b20c8e053dcd46c75f364a9). It doesn't have the commits from the ceph-crash PR I linked above, while the test branches do.

#16 Updated by Laura Flores 2 months ago

Laura Flores wrote:

Scheduled 50x tests to run here: http://pulpito.front.sepia.ceph.com/lflores-2022-12-05_17:05:59-rados-wip-yuri10-testing-2022-11-22-1711-distro-default-smithi/

This was scheduled on a main test branch that contains the ceph-crash commits, so I think this will be more telling than the run I did on the older main SHA.

#17 Updated by Laura Flores 2 months ago

  • Pull request ID deleted (49104)

#18 Updated by Laura Flores 2 months ago

  • Tags changed from test-failure, low-hanging-fruit to test-failure

#19 Updated by Laura Flores 2 months ago

The bug reproduced on Yuri's test branch. The difference between the test branch and the main SHA is that the test branch includes https://github.com/ceph/ceph/pull/48713 (plus several more commits), while the main branch does not since it's older.

To verify whether the commits from https://github.com/ceph/ceph/pull/48713 are indeed responsible, I will rebuild Yuri's test branch with those two commits reverted, and then run the reproducer.

#21 Updated by Tim Serong about 2 months ago

Is there a way to view the journalctl-b0.gz archive from the failed runs? If ceph-crash can't post crashes, or is failing for some other reason, that's where any error messages ought to land.

One thought I have is that given https://github.com/ceph/ceph/pull/48713 makes ceph-crash run as the unprivileged "ceph" user, maybe that user doesn't have read access to the ceph admin keyring in the test environment?

#22 Updated by Laura Flores about 2 months ago

Yes, there's one available at /a/yuriw-2022-11-23_15:09:06-rados-wip-yuri10-testing-2022-11-22-1711-distro-default-smithi/7087281/remote/smithi149/syslog/journalctl-b0.gz. I can also attach it to the Tracker.

#23 Updated by Laura Flores about 2 months ago

#24 Updated by Laura Flores about 2 months ago

I scheduled some tests here with the reverts committed to see if they pass: http://pulpito.front.sepia.ceph.com/lflores-2022-12-06_16:57:43-rados-wip-revert-pr-48713-distro-default-smithi/

#25 Updated by Laura Flores about 2 months ago

  • Status changed from In Progress to New
  • Priority changed from Normal to High
  • Regression changed from No to Yes

Laura Flores wrote:

I scheduled some tests here with the reverts committed to see if they pass: http://pulpito.front.sepia.ceph.com/lflores-2022-12-06_16:57:43-rados-wip-revert-pr-48713-distro-default-smithi/

All tests passed on the branch with the commits from https://github.com/ceph/ceph/pull/48713 reverted. This indicates that a regression was caused by this PR.

@Tim I checked the journal log and didn't see anything obvious, but I will report back here if I find any relevant error messages. It's also attached to the tracker if you can take a look.

#26 Updated by Tim Serong about 2 months ago

  • Assignee deleted (Laura Flores)
  • Priority changed from High to Low
  • Regression deleted (Yes)

Thanks Laura, I'll try to figure out what's going on. So far, looking at the journal log, the keyring must be OK, or this wouldn't have worked:

Nov 23 18:09:31 smithi149 systemd[1]: Started Ceph crash dump collector.
Nov 23 18:09:31 smithi149 ceph-crash[100092]: INFO:ceph-crash:pinging cluster to exercise our key
Nov 23 18:09:31 smithi149 ceph-crash[100096]:   cluster:
Nov 23 18:09:31 smithi149 ceph-crash[100096]:     id:     45b5db75-1025-4ab1-b9b5-e56a9bc377f8
Nov 23 18:09:31 smithi149 ceph-crash[100096]:     health: HEALTH_OK
Nov 23 18:09:31 smithi149 ceph-crash[100096]:  
Nov 23 18:09:31 smithi149 ceph-crash[100096]:   services:
Nov 23 18:09:31 smithi149 ceph-crash[100096]:     mon: 1 daemons, quorum a (age 76s)
Nov 23 18:09:31 smithi149 ceph-crash[100096]:     mgr: x(active, since 71s)
Nov 23 18:09:31 smithi149 ceph-crash[100096]:     osd: 3 osds: 3 up (since 68s), 3 in (since 75s)
Nov 23 18:09:31 smithi149 ceph-crash[100096]:  
Nov 23 18:09:31 smithi149 ceph-crash[100096]:   data:
Nov 23 18:09:31 smithi149 ceph-crash[100096]:     pools:   1 pools, 8 pgs
Nov 23 18:09:31 smithi149 ceph-crash[100096]:     objects: 0 objects, 0 B
Nov 23 18:09:31 smithi149 ceph-crash[100096]:     usage:   15 MiB used, 270 GiB / 270 GiB avail
Nov 23 18:09:31 smithi149 ceph-crash[100096]:     pgs:     8 active+clean
Nov 23 18:09:31 smithi149 ceph-crash[100096]:  
Nov 23 18:09:31 smithi149 ceph-crash[100092]: INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s

#27 Updated by Tim Serong about 2 months ago

  • Assignee set to Laura Flores
  • Priority changed from Low to High
  • Regression set to Yes

(Sorry, I didn't mean to update any of those fields with my previous comment)

#28 Updated by Tim Serong about 2 months ago

I've just rerun "rados/singleton/{all/test-crash mon_election/connectivity msgr-failures/few msgr/async objectstore/bluestore-low-osd-mem-target rados supported-random-distro$/{rhel_8}}" with the current main branch from ceph-ci (which doesn't include https://github.com/ceph/ceph/pull/48713). This of course worked fine: http://pulpito.front.sepia.ceph.com/tserong-2022-12-07_04:41:02-rados-main-distro-default-smithi/

Then I went poking around on the test node. Check this out:

# ls -alR /var/lib/ceph/crash
/var/lib/ceph/crash:
total 0
drwxr-x---.  3 ceph ceph  20 Dec  7 05:00 .
drwxr-x---. 15 ceph ceph 222 Dec  7 04:57 ..
drwxr-x---.  5 ceph ceph 222 Dec  7 05:00 posted

/var/lib/ceph/crash/posted:
total 0
drwxr-x---. 5 ceph ceph 222 Dec  7 05:00 .
drwxr-x---. 3 ceph ceph  20 Dec  7 05:00 ..
drwx------. 2 root root  41 Dec  7 05:00 2022-12-07T05:00:27.921857Z_b812bcae-67a8-43d1-aa8a-830f83c2df18
drwx------. 2 root root  41 Dec  7 05:00 2022-12-07T05:00:27.921864Z_c92949f5-9dee-4b01-b448-2452cd418a36
drwx------. 2 root root  41 Dec  7 05:00 2022-12-07T05:00:27.922094Z_3817eb3a-7f99-4802-97d8-5bf3e9c097df

'/var/lib/ceph/crash/posted/2022-12-07T05:00:27.921857Z_b812bcae-67a8-43d1-aa8a-830f83c2df18':
total 1768
drwx------. 2 root root      41 Dec  7 05:00 .
drwxr-x---. 5 ceph ceph     222 Dec  7 05:00 ..
-r--r--r--. 1 root root       0 Dec  7 05:00 done
-rw-r--r--. 1 root root 1803121 Dec  7 05:00 log
-rw-------. 1 root root     921 Dec  7 05:00 meta

'/var/lib/ceph/crash/posted/2022-12-07T05:00:27.921864Z_c92949f5-9dee-4b01-b448-2452cd418a36':
total 1736
drwx------. 2 root root      41 Dec  7 05:00 .
drwxr-x---. 5 ceph ceph     222 Dec  7 05:00 ..
-r--r--r--. 1 root root       0 Dec  7 05:00 done
-rw-r--r--. 1 root root 1773084 Dec  7 05:00 log
-rw-------. 1 root root     921 Dec  7 05:00 meta

'/var/lib/ceph/crash/posted/2022-12-07T05:00:27.922094Z_3817eb3a-7f99-4802-97d8-5bf3e9c097df':
total 1812
drwx------. 2 root root      41 Dec  7 05:00 .
drwxr-x---. 5 ceph ceph     222 Dec  7 05:00 ..
-r--r--r--. 1 root root       0 Dec  7 05:00 done
-rw-r--r--. 1 root root 1849824 Dec  7 05:00 log
-rw-------. 1 root root     921 Dec  7 05:00 meta

Note that the crash files are all owned by root. I reckon what's happening here is that because the ceph daemons themselves run as root in the test environment, the crash files are written as root when the daemons are killed. This presumably means that ceph-crash, running as the unprivileged ceph user, can't read those files, so it never posts the crashes. The os.path.isfile(metapath) check at https://github.com/ceph/ceph/blob/dfe10da454794ed07b9e8d2edadb57c0c16e4e4c/src/ceph-crash.in#L71 is probably returning false, as the meta files are only readable by root.
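The os.path.isfile() detail is worth spelling out: it returns False on *any* OSError from the underlying stat(), not only when the path is missing, so a meta file that exists but can't be stat'd (EACCES under a root-only directory) looks exactly like a file that isn't there. Reproducing the EACCES case needs root-owned files, so this sketch forces an ENOTDIR error instead, as a stand-in:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "meta")
    open(path, "w").close()

    # A readable, existing file: isfile() is True.
    print(os.path.isfile(path))          # True

    # Using the file as a directory component makes stat() raise
    # NotADirectoryError (an OSError); isfile() swallows it and reports False,
    # just as it would for EACCES on an unreadable crash directory.
    print(os.path.isfile(path + "/x"))   # False
```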

I'll do another test with https://github.com/ceph/ceph/pull/48713 applied to confirm for certain. Just gotta wait for a build...

#29 Updated by Tim Serong about 2 months ago

Still waiting for that build (debuginfo seems to take an unbelievably long time to publish...)

Meanwhile, I did a couple more experiments. I was actually slightly surprised to learn that the ceph daemons run as root in the test environment, given that when deployed in production they're run with --setuser ceph --setgroup ceph. That unprivileged ceph user was introduced back in the Jewel release in 2016. Shouldn't we also be running as that user when doing automated tests?

So, I tried modifying qa/tasks/ceph.py to add those parameters when starting all the daemons (https://github.com/ceph/ceph-ci/commit/d17799515a5b61794a67712e18ee7170cb6d1ea4), then did another test run (http://pulpito.front.sepia.ceph.com/tserong-2022-12-07_06:56:22-rados:singleton:all:test-crash.yaml-main-distro-default-smithi/). That failed because /var/lib/ceph/mgr/ceph-x/keyring is owned by root.

Thinking to limit the perms change to only the OSDs, and only when running the ceph-crash test, I also tried overriding ceph.conf for OSDs only, to set the user and group to ceph (https://github.com/SUSE/ceph/commit/16699b4c4676774887fab669bcc95377c9287b9f). That of course also failed in my subsequent test run (http://pulpito.front.sepia.ceph.com/tserong-2022-12-07_07:48:38-rados:singleton:all:test-crash.yaml-main-distro-default-smithi/) apparently because the OSD directories (e.g. /var/lib/ceph/osd/ceph-0) are created by root, and the subsequent ceph-osd --mkfs fails due to not being able to write to that directory.

Based on the above I have to assume that making the test environment run the ceph daemons as the unprivileged ceph user is a non-trivial change.

So, what next? https://github.com/ceph/ceph/pull/48713 makes ceph-crash run as the ceph user, not as root, in order to fix CVE-2022-3650, and this works fine in production, where the ceph daemons are all also running as the ceph user. I thought of adding --setuser and --setgroup options to ceph-crash, so that it could run as the ceph user in production but as root under test, but that seems ugly to me. While it would kinda fix the problem, we'd be testing something subtly different from what users actually run, and we'd also have a ceph-crash script that could theoretically still be run as root in production if those options were somehow turned off.
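For context, the privilege drop under discussion is roughly the following dance (an illustrative sketch only; `drop_privileges` is a made-up helper, not ceph-crash's actual code):

```python
import grp
import os
import pwd

def drop_privileges(user="ceph", group="ceph"):
    # Hypothetical helper showing how a daemon sheds root. The group must be
    # dropped first: once setuid() has run, the process is no longer
    # privileged and can't call setgid() any more.
    if os.getuid() != 0:
        return  # already unprivileged; nothing to drop
    os.setgid(grp.getgrnam(group).gr_gid)
    os.setuid(pwd.getpwnam(user).pw_uid)
```

A --no-drop-privs style flag would simply skip this call.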

#30 Updated by Laura Flores about 2 months ago

Can we make running as the ceph user the default behavior, and then provide --setgroup and --setuser options in case we need to switch back to root? That way, we could just modify qa/workunits/rados/test_crash.sh to set the group and user to root, but the default behavior would still be secure for production.

#31 Updated by Tim Serong about 2 months ago

Laura Flores wrote:

Can we make running as the ceph user the default behavior, and then provide --setgroup and --setuser options in case we need to switch back to root? That way, we could just modify qa/workunits/rados/test_crash.sh to set the group and user to root, but the default behavior would still be secure for production.

Yep, we could do that (although I'd probably do it as --no-drop-privs rather than explicitly setting the user to root). But your suggestion of modifying qa/workunits/rados/test_crash.sh gave me another idea. What do you think of this?

https://github.com/ceph/ceph/pull/49314

Tested OK at http://pulpito.front.sepia.ceph.com/tserong-2022-12-08_00:16:03-rados-wip-tserong-testing-distro-default-smithi/

#32 Updated by Laura Flores about 2 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 49314

It seems reasonable to me!

#33 Updated by Laura Flores about 2 months ago

  • Assignee changed from Laura Flores to Tim Serong

#34 Updated by Laura Flores 17 days ago

Bumping this up, since it's still occurring in main:

/a/yuriw-2023-01-12_20:11:41-rados-main-distro-default-smithi/7138670

The fix is in testing, but the holdup is an issue with the RHEL satellite that is causing RHEL jobs to fail.

#35 Updated by Neha Ojha 4 days ago

  • Status changed from Fix Under Review to Resolved
