Bug #50526

OSD massive creation: OSDs not created

Added by Juan Miguel Olmo Martínez about 2 months ago. Updated 25 days ago.

Status:
Resolved
Priority:
Urgent
Category:
-
Target version:
-
% Done:

100%

Source:
Tags:
Backport:
pacific
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

OSDs are not created when the drive group used to launch the OSD creation affects a large number of OSDs (75 in my case).

Symptoms:
- OSD service does not show the right number of OSDs

# ceph orch ls
osd.defaultDG      0/3  -          -    f22-h21-000-6048r.rdu2.scalelab.redhat.com;f22-h25-000-6048r.rdu2.scalelab.redhat.com;f22-h29-000-6048r.rdu2.scalelab.redhat.com

- A large number of OSDs are created, but 0 are up

#ceph -s
    osd: 73 osds: 0 up, 54 in (since 62m)

- On all the hosts where OSDs must be created:

There is a permanent file lock held by the "cephadm ceph-volume lvm list" process:

root@f22-h21-000-6048r:/var/log
# lslocks
COMMAND            PID  TYPE SIZE MODE  M START END PATH
...
python3         363847 FLOCK   0B WRITE 0     0   0 /run/cephadm/2c67b4d8-a439-11eb-b919-bc97e17cee60.lock
...
# ps -ef | grep 363847
root      363847  361214  0 15:13 ?        00:00:00 /usr/bin/python3 /var/lib/ceph/2c67b4d8-a439-11eb-b919-bc97e17cee60/cephadm.6268970e1745c66ce4f3d1de4aa246ccd1c5684345596e8d04a3ed72ad870349 --image registry.redhat.io/rhceph-beta/rhceph-5-rhel8@sha256:24c617082680ef85c43c6e2c4fe462c69805d2f38df83e51f968cec6b1c097a2 ceph-volume --fsid 2c67b4d8-a439-11eb-b919-bc97e17cee60 -- lvm list --format json
root      364262  353802  0 15:17 pts/2    00:00:00 grep --color=auto 363847

- On all the hosts where OSDs must be created:

The OSD systemd services are not created

# systemctl list-units ceph*@osd*
0 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

- On all the hosts where OSDs must be created:

The infrastructure needed for the OSDs seems to have been created

# lsblk
NAME                                                                                                  MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
...
sdc                                                                                                     8:32   0   1.8T  0 disk  
`-ceph--229b758f--ccf1--4014--84e3--a526b5d5cefc-osd--block--b80cbfa5--610b--47c8--b805--83471b2a0c64 253:94   0   1.8T  0 lvm   
sdd                                                                                                     8:48   0   1.8T  0 disk  
`-ceph--557721b5--61d2--4dcd--8452--30e68e442a0a-osd--block--0d703fbf--81fd--44cc--a789--42a6cf8606a8 253:96   0   1.8T  0 lvm   
sde                                                                                                     8:64   0   1.8T  0 disk  
`-ceph--b28c6069--5ca8--45d8--b854--b09106e2fd75-osd--block--53b82fbe--761e--4e6a--9b7d--258a0a4196e0 253:98   0   1.8T  0 lvm   
sdf                                                                                                     8:80   0   1.8T  0 disk  
`-ceph--a16adf39--5f21--4531--9837--1394cad47a80-osd--block--1f31b9fd--5165--43a6--8aee--c519da4b63b3 253:100  0   1.8T  0 lvm   
sdg                                                                                                     8:96   0   1.8T  0 disk  
...  

nvme0n1                                                                                               259:0    0 745.2G  0 disk  
|-ceph--4f9cbc13--1649--4f1a--94d7--772de7da6646-osd--db--1e85d960--2fdd--42d3--9f56--c62b7a8b085a    253:95   0  62.1G  0 lvm   
....

Related issues

Related to Orchestrator - Feature #48292: cephadm: allow more than 60 OSDs per host New
Related to Orchestrator - Bug #47873: /usr/lib/sysctl.d/90-ceph-osd.conf getting installed in container, rendering it ineffective Resolved

History

#1 Updated by Juan Miguel Olmo Martínez about 2 months ago

  • % Done changed from 0 to 50
  • Pull request ID set to 41045

#2 Updated by Ken Dreyer about 2 months ago

  • Backport set to pacific

#3 Updated by Sebastian Wagner about 2 months ago

  • Related to Feature #48292: cephadm: allow more than 60 OSDs per host added

#4 Updated by Sebastian Wagner about 2 months ago

  • Description updated (diff)

#5 Updated by Andreas Håkansson about 2 months ago

We have the same or a very similar problem.
In our test case, adding more than 8 disks with the DB on a separate NVMe device fails.

On the osd-servers there is a python3 process running a cephadm copy with the following parameters:

 ceph-volume --fsid [redacted] lvm list --format json

lsof shows that the process has cephadm.log and a ceph lock file open.
In the cephadm.log there is a partial result from ceph-volume lvm list --format json.

When running

ceph-volume lvm list --format json

manually, we get a complete result.

We have tried the following on the hosts:
- downgrading podman to 2.0.x
- upgrading podman to 3.0.x using Kubic
- downgrading to conmon 2.0.22.x

We have used Pacific from both the current Red Hat beta and upstream Ceph, with the same issue.

The problem does not exist in the octopus version.

It looks like the hang is on the other side. Is it ceph orch or an mgr that is the receiving side for the json output?

#6 Updated by Juan Miguel Olmo Martínez about 2 months ago

  • Related to Bug #47873: /usr/lib/sysctl.d/90-ceph-osd.conf getting installed in container, rendering it ineffective added

#7 Updated by Juan Miguel Olmo Martínez about 2 months ago

Andreas Håkansson wrote:

We have the same or a very similar problem,
In out test case adding more than 8 disk with db on a separate nvme device fails.

I think that the fix will also work for your issue, it would be nice if you can confirm.

Apart from the fix, take into account that if you want to have a large number of OSDs you will need to tweak some system parameters (see the related bugs for more information). Summarizing:

# Kernel parameters
fs.aio-max-nr = 1048576
kernel.pid_max = 4194304

# Ceph parameter
ceph config set osd.* ms_bind_port_max 7568

It looks like the hang is on the other side. Is it ceph orch or an mgr that is the receiving side for the json output?

Yes, the output is processed by the cephadm binary in each host which is called by the cephadm orchestrator mgr module running in the active mgr.

#8 Updated by David Orman about 2 months ago

Juan Miguel Olmo Martínez wrote:

I think that the fix will also work for your issue, it would be nice if you can confirm.

Apart of the fix take into account that if you want to have a big number of OSDs you will need to tweak some system parameters ( see the related bugs for more information). Summarizing:

[...]

It looks like the hang is on the other side. Is it ceph orch or an mgr that is the receiving side for the json output?

Yes, the output is processed by the cephadm binary in each host which is called by the cephadm orchestrator mgr module running in the active mgr.

We already have those settings applied, but we see the same issue as the OP with only 24 OSDs, 12 per NVMe for DB/WAL. This works fine on Octopus 15.2.10 but breaks on 16.2.3, which I think is an important note. It has actually broken our upgrades from 15.2.10 to 16.2.3, since the upgrade will not progress when it blocks on these remote ssh calls while attempting to apply the OSD specification. All of our servers are as below:

root@ceph01:/etc/sysctl.d# sysctl -a |egrep 'fs.aio-max-nr|kernel.pid_max'
fs.aio-max-nr = 1048576
kernel.pid_max = 4194304

We have the same symptoms, with the 'hung' processes doing the lvm list, spawned via SSH. I think the NVME + OSD (DB/WAL on NVME) combination must increase output volume to an amount that leads to issues. If you strace one of these processes, you can see it stuck writing to what looks like a pipe:

root@ceph01:/etc/sysctl.d# strace -p 186161 -s2000
strace: Process 186161 attached
write(2, "/usr/bin/podman: \"ceph.vdo\": \"0\"\n", 49

Looking at the cephadm logs, you can see that part has not yet been logged for the device being output (it's a truncated version, with that being the next expected line). It seems like some limitation gets hit, and then it just hangs. I would say this is definitely an urgent issue as it's impacting every cluster we test on upgrade of a certain size. This particular cluster is 21 nodes, 24x OSDs and 2 NVME per node, 12x OSDs per NVME.

#9 Updated by Juan Miguel Olmo Martínez about 1 month ago

David Orman wrote:

Juan Miguel Olmo Martínez wrote:

I think that the fix will also work for your issue, it would be nice if you can confirm.

Apart of the fix take into account that if you want to have a big number of OSDs you will need to tweak some system parameters ( see the related bugs for more information). Summarizing:

[...]

It looks like the hang is on the other side. Is it ceph orch or an mgr that is the receiving side for the json output?

Yes, the output is processed by the cephadm binary in each host which is called by the cephadm orchestrator mgr module running in the active mgr.

We have those settings applied already, but we have the same issue as the OP, however with only 24 OSDs w/ 12 per NVME for DB/WAL. This works fine on Octopus 15.2.10, but breaks on 16.2.3. I think this is an important note. This has actually broken our upgrades from 15.2.10 to 16.2.3, since the upgrade will not progress when it blocks on these remote ssh calls while attempting to apply the OSD specification. All of our servers are as below:

root@ceph01:/etc/sysctl.d# sysctl -a |egrep 'fs.aio-max-nr|kernel.pid_max'
fs.aio-max-nr = 1048576
kernel.pid_max = 4194304

We have the same symptoms, with the 'hung' processes doing the lvm list, spawned via SSH. I think the NVME + OSD (DB/WAL on NVME) combination must increase output volume to an amount that leads to issues. If you strace one of these processes, you can see it stuck writing to what looks like a pipe:

root@ceph01:/etc/sysctl.d# strace -p 186161 -s2000
strace: Process 186161 attached
write(2, "/usr/bin/podman: \"ceph.vdo\": \"0\"\n", 49

Looking at the cephadm logs, you can see that part has not yet been logged for the device being output (it's a truncated version, with that being the next expected line). It seems like some limitation gets hit, and then it just hangs. I would say this is definitely an urgent issue as it's impacting every cluster we test on upgrade of a certain size. This particular cluster is 21 nodes, 24x OSDs and 2 NVME per node, 12x OSDs per NVME.

Are you sure that the fix has been applied properly?

Can you check whether the modification is present in the "cephadm binary" on one of the hosts with the problem?

Log in to one of the hosts where the OSDs cannot be created and go to the folder /var/lib/ceph/<your_ceph_cluster_fsid>.

In that folder you will find a "cephadm.xxxxxxxxxxx" file. Make sure the modification to remove the verbose output is in place; your code should look like:
https://github.com/ceph/ceph/blob/2f4dc3147712f1991242ef0d059690b5fa3d8463/src/cephadm/cephadm#L4576

I have verified in our labs an easy way to reproduce the problem:

0. Please stop the cephadm orchestrator:

In your bootstrap node:

# cephadm shell
# ceph mgr module disable cephadm

1. On one of the hosts where you want to create OSDs and which has a large number of devices:

Check whether there is a "cephadm" file lock,
for example:

# lslocks | grep cephadm
python3         1098782  FLOCK   0B WRITE 0     0   0 /run/cephadm/9fa2b396-adb5-11eb-a2d3-bc97e17cf960.lock

If that is the case, just kill the process to start from a "clean" situation.

2. Go to the folder /var/lib/ceph/<your_ceph_cluster_fsid>.

You will find there a file called "cephadm.xxxxxxxxxxxxxx".

execute:

# python3 cephadm.xxxxxxxxxxxxxx ceph-volume inventory

3. If the problem is present in your cephadm file, the command will block and you will see the cephadm file lock again.

4. If the modification was not present, change your cephadm.xxxxxxxxxx file to include the modification I made (it just removes the verbosity parameter in the call_throws call):

https://github.com/ceph/ceph/blob/2f4dc3147712f1991242ef0d059690b5fa3d8463/src/cephadm/cephadm#L4576

Go back to step 1 to clean the file lock and try again; with the modification in place it should work.

#10 Updated by David Orman about 1 month ago

To be clear, we have not applied this patch. I was merely adding information to point out that the impact is not restricted to large numbers of OSDs but also affects hosts with smaller counts and split DB/WAL, that this is a regression, and that it impacts upgrades. It was an attempt to help gauge the scope of the impact, as well as add a few more data points. We will look further into this.

#11 Updated by David Orman about 1 month ago

We've created a PR to fix the root cause of this issue: https://github.com/alfredodeza/remoto/pull/63

#12 Updated by Juan Miguel Olmo Martínez about 1 month ago

David Orman wrote:

We've created a PR to fix the root cause of this issue: https://github.com/alfredodeza/remoto/pull/63

@David Orman: As you explain in your PR:

"this bug was discovered while attempting to upgrade a Ceph cluster from 15.2.10 to 16.2.3. "

That is a completely different situation from the one described in the bug report (massive OSD creation in a cephadm cluster), and also completely different from the situation described in comment 5.

Besides that, as I explained in comment 9, the bug can be reproduced by using the "cephadm binary" directly with the ceph-volume command, and the "cephadm binary" does not use "remoto" at all. So I think we can completely discard your PR as a solution for the issue described in this bug.

Note: this does not mean that your PR cannot be useful to cover/solve another bug.

#13 Updated by Cory Snyder about 1 month ago

@Juan, allow me to provide more detail on the scenario that we encountered. As far as I can tell, the root cause of our problem (https://github.com/alfredodeza/remoto/issues/62) is very likely to also be the root cause of your problem - but I'll be happy to discuss it further if you still disagree after reading more details about our case.

Our problem manifested itself while upgrading from Octopus (v15.2.10) to Pacific (v16.2.3). We experienced it during the upgrade because of a change to the args that the ceph mgr passes to remoto when attempting to run:

ceph-volume lvm list --format json

on the OSD hosts. Ultimately, that change causes Pacific to use this conditional branch when making the ceph-volume call through remoto, instead of the alternative.

As you know, based upon the PR that you submitted, this ceph-volume command sends a lot of logging output to stderr. The deadlock bug that is caused by using the stream read method vs. communicate only manifests itself if the size of output to stderr is large enough to completely fill the OS pipe buffer. The amount of output grows as more OSDs are added. It is my understanding that we simply already had enough OSDs to generate enough output to stderr such that the bug manifested itself during our upgrade. In your case, it seems that the addition of new OSDs may have put you over the threshold.

What we saw on the OSD hosts is exactly what you described. We saw the same process executing:

cephadm ... ceph-volume --fsid <redacted> -- lvm list --format json

The process was created by remoto in the _remote_check method. As David noted in a previous comment, this process was stuck trying to write to stderr. It was stuck because stderr for this process is the pipe that is created by remoto, and remoto wasn't attempting to read from the pipe because it was stuck attempting to read from stdout. The lock file that you mentioned was held by this stuck process - and wasn't being unlocked due to the fact that the process was in deadlock. Attempts to run the same cephadm command on the OSD host manually got stuck trying to acquire the lock since it was perpetually held by the stuck process. The same cephadm command succeeded when run manually from the OSD host after the stuck process was killed and the lock was manually cleared.

I was originally suspicious of the same asyncio.StreamReader usage within cephadm that you called out in the description on your PR. But ultimately, we found that the cephadm process was only stuck there because this logger.info call was attempting to write to stderr and the stderr pipe was full (and not being drained).

#14 Updated by Juan Miguel Olmo Martínez about 1 month ago

@Cory Snyder wrote:

@Juan, allow me to provide more detail on the scenario that we encountered. As far as I can tell, the root cause of our problem (https://github.com/alfredodeza/remoto/issues/62) is very likely to also be the root cause of your problem - but I'll be happy to discuss it further if you still disagree after reading more details about our case.

Just so I understand: how do you explain that a fix in the remoto library is going to fix a problem in the cephadm "binary" (as we call it), which does not use that library?
Can you find a reference to "remoto" in https://github.com/ceph/ceph/blob/master/src/cephadm/cephadm ?

Because, as I explained in several previous comments, you can reproduce the bug using "cephadm ceph-volume inventory" or "cephadm ceph-volume lvm list", without using the "cephadm" orchestrator at all. (The orchestrator is the mgr module where we use remoto.)

#15 Updated by Cory Snyder about 1 month ago

Juan Miguel Olmo Martínez wrote:

@Cory Snyder wrote:

@Juan, allow me to provide more detail on the scenario that we encountered. As far as I can tell, the root cause of our problem (https://github.com/alfredodeza/remoto/issues/62) is very likely to also be the root cause of your problem - but I'll be happy to discuss it further if you still disagree after reading more details about our case.

Just to know how do you explain that a fix in the remoto library is going to fix a problem in cephadm "binary" (as we call it) that does not use that library.
Can you find a reference to "remoto" in https://github.com/ceph/ceph/blob/master/src/cephadm/cephadm ?

Because as I explained in several previous comments you can reproduce the bug using "cephadm ceph-volume inventory" or "cephadm ceph-volume lvm list", and not using at all "cephadm" orchestrator. (the orchestrator is the mgr module where we use remoto)

When we originally attempted to reproduce the issue by directly executing "cephadm ceph-volume lvm list", it appeared that we were successful because we did see the process hanging. We later found that this process was stuck trying to acquire the file lock and that its stack trace did not match the trace of the hung process that was created by the orchestrator through remoto. When the lock was cleared and we executed "cephadm ceph-volume lvm list" again, it completed successfully. The comment from @Andreas indicates that they also failed to reproduce this issue manually.

If you're asserting that you were able to reproduce the problem by manually executing these commands and can confirm that the manual invocations have identical stack traces to those that are initiated by the orchestrator, I concede that your manual invocations must be subject to a different problem and that the remoto fix may be irrelevant to you. In that case, I apologize for surmising that your manual invocations were actually stuck trying to acquire the lock (as ours originally were).

In any case, I can state with absolute certainty that the remoto patch resolves a bug with exactly the same symptoms that you've described here when these cephadm commands are invoked through the orchestrator.

#16 Updated by Sebastian Wagner 29 days ago

  • Status changed from New to Resolved

Closing this for now. I'd create a new issue if this pops up again.

#17 Updated by Juan Miguel Olmo Martínez 25 days ago

  • % Done changed from 50 to 100
